LinguaLibre

Technical board

Welcome to the Lingua Libre Technical board!
Where to start?
  • Local developments are easy. You can customize your CSS and JS, including creating a local WikiJS script, even with limited edit rights.
  • LinguaLibre Bot (Python, GitHub) is a high-impact project. Help is needed to authorize it on more wikis.
  • Join us on Phabricator and GitHub.
Skills we look for…
  • Developers: we especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers, and issue coordinators, but everyone is welcome.
  • Project coordinators: we also look for organizers of recording/hacking meet-ups who are able to build a network with language learning, language conservation and NLP actors.
Happy Coding!
  • Please announce your hacking project here to raise awareness and gather feedback.
  • Most of our actions remain small in scope and volunteer-based. In case your project is large enough, you could learn about some of the funding options.
Development & Technical reports
Flash Technical News
  • January 25th, 2023: the latest GitHub revision has been pushed to the production server. Kurdish Wiktionary is now supported. Oriya Wiktionary will be supported very soon. Support for more Wiktionary versions should follow.

Please visit LinguaLibre:About to learn more about the project.

Migration of technical contents

Hello all, please help migrate technical content from the main LinguaLibre:Chat room to here. Yug (talk) 18:49, 12 February 2021 (UTC)

2021 GitHub refresh: call for volunteers and discussion

See also Github.com/lingua-libre

Hello all,
Since November 2020 there has been an ongoing effort to clean up, document, and fix the 11 GitHub repositories upon which LinguaLibre.org stands. A summary is available on the main forum and will be migrated here shortly. This section focuses on gathering users with development skills and discussing possible fields of action (repositories). We especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers, and issue coordinators. Yug (talk) 15:57, 12 February 2021 (UTC)

Early 2021 coding: WikiValley & volunteers communication board

WikiValley has been selected to make a notable technical push on the LinguaLibre Suite where volunteer developers are not enough. They will coordinate with volunteer developers in order to smooth everyone's work and avoid duplicate efforts and git conflicts. The Start, End, and Repositories columns below are especially important: please keep them up to date, respect them, or change them whenever required. If you need to work on a repository that is already being worked on, contact the developer listed there and organize as needed. Our objective here is to keep things clear and to progress smoothly. Please avoid emails and prefer communicating here within subsections so we can all be aware of how things are going. Yug (talk) 15:56, 12 February 2021 (UTC)

Note: Volunteers started working around December. WikiValley started around Feb. 11th. Yug (talk) 15:56, 12 February 2021 (UTC)
Start | End | Contacts/dev | Team | Repository | Advancement & result so far

Past developments
2021/02/01 | 2021/02/10 | Yug | Volunteers | SignIt | Get back control (access right) ; fix video query ; test locally ; publish new version on Mozilla store → Fixed Firefox extension
2021/02/01 | 2021/02/16? | Yug, Michael | Volunteers, WM-France | /operations, /CommonDownloadTool | Explore possible breakpoints ; identify likely cause ; fix ; deploy ; run → Fixed https://lingualibre.org/datasets/
2021/02/? | 2021/02/11 | Vigneron, WikiValley | Wikivalley | QueryViz, Other? | Explore possible breakpoints ; identify cause ; fix ; deploy ; inquire on numbers differences → Fixed LinguaLibre:Stats

Current developments
2021/02/01 | 2022/01/01 | Poslovitch | Volunteers | Lingua-Libre-Bot | Maintain, update and operate the bot. 2021 Q1 [WIP]: Refactor the bot to ease implementations of additional Wiktionaries.

Planned developments
2021/02/01 | 2021/03/?? | Poslovitch | Volunteers | /operations, /CommonDownloadTool | Project: Explore datasets scripts and queries. May require SPARQL assistance.
2021/02/? | 2021/02/? | WikiLucas00, Yug | Volunteers | CustomSubtitle, BlueLL | Explore Subtitle's ribbon's bug ; identify cause.

@VIGNERON please keep us informed a bit on what your team is touching. Just edit above and ping us to notify us of an update. Yug (talk) 16:25, 15 February 2021 (UTC)

User box ?

Babel user information
mar
mr-N या सदस्याला मराठी चे स्थानिक स्तराचे ज्ञान आहे.
cmn-1 This user has basic knowledge of Mandarin Chinese.
Users by language

It might be nice to create a "dev" userbox {{Userbox-dev}}, modeled on {{Userbox-records}}, with Python, JavaScript, PHP, VueJS, and Wikimedia Bot as specific sub-categories? Yug (talk) 15:59, 12 February 2021 (UTC)

I kinda disagree with that. Lingua Libre is not meant to become a hub for techies. Sure, we need all the help we can get, yet the only use case I foresee for these userboxes would be when something goes wrong and we need someone with the right skills to take care of it. But, since it has to be added by oneself on one's user page, the same can be said of the page where we list who does what (I don't recall what it's called). Which one of these two "systems" should be kept? --Poslovitch (talk) 21:53, 13 February 2021 (UTC)
In terms of community, I see us as somewhere in between the Wikipedia and Wikidata communities. We mainly deal with singletons: audio files, which are data units. People come, make a more or less substantial recording contribution, then sharply reduce their involvement and leave thousands of file units here.
And like Wikidata, we need people giving life to these data units. This is done via reuse, bots, webapps, text-to-speech. Developers' creations.
So yes, developers have to become an important part of our community. And we would benefit from building an active dynamic around languages and projects (repositories). Yug (talk) 22:30, 13 February 2021 (UTC)

Datasets have become super slow?

I am trying to understand how the /datasets are generated.

  • In April 2020, the French dataset of about 100,000 audios was processed in 51 minutes.
  • In February 2021, the Bengali dataset of about 50,000 audios was processed in 18 hours.

What am I missing? Yug (talk) 00:09, 13 February 2021 (UTC)
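To put the two runs side by side, here is a minimal sketch that turns the figures quoted above into per-minute rates. The file counts and durations are the approximate ones from this thread; the rates and the slowdown factor are derived from them, not measured on the server.

```python
# Figures quoted in the thread: (approximate audio files, processing minutes).
runs = {
    "French (April 2020)": (100_000, 51),
    "Bengali (February 2021)": (50_000, 18 * 60),
}


def files_per_minute(files: int, minutes: float) -> float:
    """Average number of audio files processed per minute."""
    return files / minutes


rates = {name: files_per_minute(*run) for name, run in runs.items()}
slowdown = rates["French (April 2020)"] / rates["Bengali (February 2021)"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} files/min")
print(f"Slowdown factor: {slowdown:.0f}x")
```

On these numbers the Bengali run is roughly 42 times slower per file, which is a large enough gap that it is unlikely to come from ordinary server load alone.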


Zip file Date Size (bytes)
lingualibre_full.zip 2019-May-17:01:18 1989664440
Q101-srr-Serer.zip 2019-Nov-05:03:09 14967
Q113-cmn-Mandarin_Chinese.zip 2019-Nov-05:03:09 112613
Q115107-bcl-Central_Bikol.zip 2019-Nov-05:03:09 166323
Q127-tam-Tamil.zip 2019-Nov-05:03:09 154352
Q130-zho-Chinese.zip 2019-Nov-05:03:10 2724328
Q131-hye-Armenian.zip 2019-Nov-05:03:10 824117
Q141-cym-Welsh.zip 2019-Nov-05:03:10 12905993
Q154-amh-Amharic.zip 2019-Nov-05:03:11 2653977
Q165-hat-Haitian_Creole.zip 2019-Nov-05:03:11 233588
Q169-tgl-Tagalog.zip 2019-Nov-05:03:11 77198
Q170137-mos-Mossi.zip 2019-Nov-05:03:11 1158142
Q205-gre-Greek.zip 2019-Nov-05:03:11 239390
Q231-myv-Erzya.zip 2019-Nov-05:03:21 205878
Q242-fon-Fon.zip 2019-Nov-05:03:21 1538614
Q258-nso-Northern_Sotho.zip 2019-Nov-05:03:24 774299
Q311-oci-Occitan.zip 2019-Nov-05:03:33 511332485
Q318-bam-Bambara.zip 2019-Nov-05:03:33 277786
Q321-gaa-Ga.zip 2019-Nov-05:03:33 3247380
Q336-ori-Odia.zip 2019-Nov-05:03:34 38697693
Q339-sat-Santali.zip 2019-Nov-05:03:34 128941
Q34-mar-Marathi.zip 2019-Nov-05:03:34 2274397
Q35-nld-Dutch.zip 2019-Nov-05:03:34 36279372
Q385-ita-Italian.zip 2019-Nov-05:03:34 3440247
Q388-que-Quechua.zip 2019-Nov-05:03:35 397476
Q39-tel-Telugu.zip 2019-Nov-05:03:35 85571
Q397-heb-Hebrew.zip 2019-Nov-05:03:35 1657223
Q405-bas-Basaa_language.zip 2019-Nov-05:03:35 1515700
Q437-mal-Malayalam.zip 2019-Nov-05:03:35 138601
Q446-pan-Punjabi.zip 2019-Nov-05:03:35 11004
Q4465-mis-Teochew_dialect.zip 2019-Nov-05:03:35 69734
Q45-nor-Norwegian.zip 2019-Nov-05:03:35 431566
Q46-ltz-Luxembourgish.zip 2019-Nov-05:03:35 1679618
Q51299-hav-Havu.zip 2019-Nov-05:03:37 56823
Q51302-tay-Atayal.zip 2019-Nov-05:03:37 65533
Q52067-bbj-Ghomala'_language.zip 2019-Nov-05:03:37 1765823
Q52068-bum-Bulu_language.zip 2019-Nov-05:03:37 1382789
Q52071-dua-Duala.zip 2019-Nov-05:03:37 1206427
Q52073-bdu-Oroko.zip 2019-Nov-05:03:37 1723960
Q52074-bzm-Londo.zip 2019-Nov-05:03:37 1750380
Q52295-atj-Atikamekw.zip 2019-Nov-05:03:37 7315215
Q74905-mis-Sursilvan.zip 2019-Nov-05:03:37 14618
Q83641-gcf-Guadeloupean_Creole_French.zip 2019-Nov-05:03:38 7412512
Q930-mis-Gascon_dialect.zip 2019-Nov-05:03:39 179656450
Q931-mis-Languedocien_dialect.zip 2019-Nov-05:03:40 191575650
Q123-hin-Hindi.zip 2020-Apr-25:03:30 1704401
Q126-por-Portuguese.zip 2020-Apr-25:03:31 43732966
Q129-rus-Russian.zip 2020-Apr-25:03:32 60844464
Q150-afr-Afrikaans.zip 2020-Apr-25:04:18 42363003
Q159-dyu-Dioula_language.zip 2020-Apr-25:04:18 784432
Q19858-bci-Baoulé.zip 2020-Apr-25:04:18 1268304
Q203-cat-Catalan.zip 2020-Apr-25:04:18 9738365
Q204940-ken-Nyang_language.zip 2020-Apr-25:04:18 483396
Q208-vie-Vietnamese.zip 2020-Apr-25:04:18 8822067
Q219-ara-Arabic.zip 2020-Apr-25:04:19 85373129
Q21-fra-French.zip 2020-Apr-25:05:10 2112950650
Q221062-mis-Cantonese.zip 2020-Apr-25:05:10 3895600
Q22-eng-English.zip 2020-Apr-25:05:12 131688602
Q25-epo-Esperanto.zip 2020-Apr-25:05:19 445662713
Q264201-ary-Moroccan_Arabic.zip 2020-Apr-25:05:19 1371064
Q273-kab-Kabyle.zip 2020-Apr-25:05:19 370876
Q298-pol-Polish.zip 2020-Apr-25:05:21 145009958
Q299-eus-Basque.zip 2020-Apr-25:05:21 46035866
Q33-fin-Finnish.zip 2020-Apr-25:05:46 19473062
Q386-spa-Spanish.zip 2020-Apr-25:05:46 28434220
Q389-jpn-Japanese.zip 2020-Apr-25:05:46 145688
Q392-ces-Czech.zip 2020-Apr-25:05:46 96844
Q44-swe-Swedish.zip 2020-Apr-25:05:46 166237
Q4901-shy-Shawiya_language.zip 2020-Apr-25:05:47 15804835
Q6714-arq-Algerian_Arabic.zip 2020-Apr-25:05:47 3420182
Q80-kan-Kannada.zip 2020-Apr-25:05:47 3662223
Q24-deu-German.zip 2021-Feb-11:15:32 258363332
Q307-ben-Bengali.zip 2021-Feb-12:07:28 1079637723
IMO, this can only be investigated through the logs. Maybe the requests to Commons are taking a longer time than they used to? Maybe the datasets server is under higher load (thus slowing it)? We need you, Michaël! --Poslovitch (talk) 21:41, 13 February 2021 (UTC)
@Poslovitch could it be that the script only uploads the "never uploaded yet" files? If so, the April 2020 French dataset was just the 3000 recent French audios, whereas the Feb 2021 Bengali dataset was like "Yo, here are the 50,000 Bengali audios, deal with it B)" Yug (talk) 22:24, 13 February 2021 (UTC)
@Yug that might be it. I still don't fully understand what the script does and, well, we can say the documentation is clearly lacking there. I'm working on that too - but yeah, that might be why it's taking more time. We should let it run for a first time, and then force another dataset update a few days later so we can compare both. --Poslovitch (talk) 22:38, 13 February 2021 (UTC)
@Michael Barbereau WMFr Seems the script has finished running, right ? Any idea why it's so slow in 2021, is there some known overload or hardware issue ? Yug (talk) 12:07, 15 February 2021 (UTC)
Some of the root causes have been unearthed in this comment on GitHub: https://github.com/lingua-libre/CommonsDownloadTool/issues/2#issuecomment-780177124. --Poslovitch (talk) 23:16, 16 February 2021 (UTC)

Using Magic Word {{#language:}} and Extension:CLDR (?)

For general awareness. No real question asked.

I found out LL uses the MediaWiki Extension CLDR. Its data comes from the w:Common Locale Data Repository Project (CLDR), part of the Unicode Consortium. This extension automates the translation of ISO 639 codes into language names in a target language, e.g. {{#language:it|en}} → Italian. Coverage is ~500 names in ~166 languages. translatewiki.net has a tutorial on how to contribute to this CLDR website.

  • mw:Help:Magic_words#Miscellaneous > {{#language:language code|target language code}}
    → {{#language:ar|en}} → Arabic
    → {{#language:ar|hi}} → अरबी
    → {{#language:ja|hi}} → जापानी
    → {{#language:fr|he}} → צרפתית
    → {{#language:fra|he}} → fra (not available)
    → {{#language:fr-ca|he}} → Canadian French (falls back on English)
    → {{#language:mar|hi}} → mar (n.a)
    → {{#language:en|mar}} → English (falls back on English)
    → {{#language:mar|en}} → mar (n.a)
    → {{#language:mar|mar}} → mar (n.a)
    • mw:Extension:CLDR MediaWiki Extension : "Provides functions to localize the names of languages, countries, currencies, and time units based on their language code."
      • Github mirror > key folder : /CldrNames & Wikimedia corrections here.
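For anyone who wants to try the magic word outside a wiki page, here is a minimal sketch that builds a request URL for the standard MediaWiki `expandtemplates` API module. The endpoint address `https://lingualibre.org/api.php` is an assumption, and the helper names are hypothetical; the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumption: the wiki exposes the standard MediaWiki API at this address.
API_URL = "https://lingualibre.org/api.php"


def language_name_wikitext(code: str, target: str) -> str:
    """Return the parser-function call, e.g. {{#language:ar|en}}."""
    return "{{#language:%s|%s}}" % (code, target)


def expand_url(code: str, target: str) -> str:
    """Build an expandtemplates URL asking the wiki to expand the magic word."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "text": language_name_wikitext(code, target),
    }
    return f"{API_URL}?{urlencode(params)}"


print(expand_url("ar", "hi"))
```

Opening the printed URL in a browser should return the expanded text in a JSON response, which makes it easy to check which code pairs fall back to English or return the raw code, as in the list above.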

While we would benefit from staying focused on our own recording mission, it is still interesting to be aware of this project. cc @Poslovitch , for the Magic Word. Yug (talk) 12:07, 15 February 2021 (UTC)

Good to know, but I don't think we're going to use that in the foreseeable future. --Poslovitch (talk) 22:41, 16 February 2021 (UTC)

BlueLL theme might break when updating to MW 1.35

Hi @VIGNERON . According to this pending PR on GitHub (https://github.com/lingua-libre/BlueLL/pull/3), the BlueLL theme might not be compatible with MW 1.35. I have limited knowledge in MW themes, but I can merge the PR if needed. What's your opinion about it? --Poslovitch (talk) 10:49, 16 February 2021 (UTC)

Hi there, I also don't have the technical understanding of MediaWiki themes to say much, but I encourage you to talk with jdlrobson to see what he thinks about his fix & 1.35. Yug (talk) 21:30, 16 February 2021 (UTC)
@VIGNERON I resumed the conversation with jdlrobson on his pull request. But I admit I don't have the technical capability to properly review his code submission. Also, I believe upgrading to MW 1.35 and checking skin compatibility is within WikiValley's mission. Please clarify, and feel free to slow down this PR discussion if required. Yug (talk) 12:01, 17 February 2021 (UTC)
I did a preliminary test of BlueLL on 1.35 and made similar changes, although only as a first try; I hadn't seen this PR. So I will test the PR more extensively and will accept it after review. Seb35 (talk) 10:41, 19 February 2021 (UTC)

Generate a summary.csv alongside the datasets ?

This idea just crossed my mind. Would it be interesting to generate a summary.csv file containing the list of available datasets, their generation date, and their size in bytes, with additional information such as number of recordings, number of speakers, total length of audio files... Any opinions? --Poslovitch (talk) 11:15, 16 February 2021 (UTC)

(Then why not a minimalist HTML5 webpage with a single table? It would be more elegant. Yug (talk) 21:35, 16 February 2021 (UTC))
Also, only 13 zips have been updated. More languages have been active in the past month alone. The /datasets/ page also doesn't display the 100+ languages it should. So I suspect create_datasets.sh is still not doing the full job.
Poslovitch, do you see it too? Yug (talk) 21:35, 16 February 2021 (UTC)
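As a starting point for the summary.csv idea, here is a minimal sketch that scans a folder of zip files and writes one row per dataset. It is not how create_datasets.sh works; the function name, column names and folder layout are hypothetical. It also assumes the file modification time stands in for the generation date, and that the number of entries in a zip approximates the number of recordings.

```python
import csv
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def write_summary(datasets_dir: str, out_name: str = "summary.csv") -> Path:
    """Write a CSV listing every *.zip in datasets_dir and return its path."""
    root = Path(datasets_dir)
    out_path = root / out_name
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["zip_file", "generated_utc", "size_bytes", "entries"])
        for zip_path in sorted(root.glob("*.zip")):
            stat = zip_path.stat()
            # Assumption: modification time is a good proxy for generation date.
            generated = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
            with zipfile.ZipFile(zip_path) as archive:
                # Assumption: one non-directory entry per recording.
                entries = sum(1 for info in archive.infolist() if not info.is_dir())
            writer.writerow(
                [zip_path.name, generated.strftime("%Y-%m-%d %H:%M"), stat.st_size, entries]
            )
    return out_path


# Example call (hypothetical path): write_summary("/var/www/datasets")
```

Counting speakers or total audio length would need metadata from the recordings themselves, so those columns are left out here. The same rows could just as well feed the single-table HTML page suggested above.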

Where is LinguaImporter's code (admins only)

Hello, we have T233917, which requests an edit to the language importer tool. I checked Mediawiki:Common.js and github/lingua-libre with the UI's strings search:LinguaImporter and search:Import a language, but found nothing. Any idea where this LanguageImporter tool is coded? Yug (talk) 21:48, 16 February 2021 (UTC)

Found it. It's a Gadget.
Yug (talk) 22:13, 16 February 2021 (UTC)
@WikiLucas00 & Pamputt I just want to know who knows how to create a Property on LL's language Q-items? It's still under discussion on Phabricator: a possible language Property whose value would be Category:Lingua_Libre_pronunciation- + ISO639-3. Ex: Category:Lingua_Libre_pronunciation-yue or https://commons.wikimedia.org/wiki/Category:Lingua_Libre_pronunciation-yue. You are invited to phabricator:T233917 to give your input on this issue. Yug (talk)
To have a property "Commons category" (language) AND/OR "Commons category"(speaker) giving a link to the Commons category on each file is interesting.
Here is the link to create a new property, but unfortunately I never used it, maybe @VIGNERON or Pamputt will know better. — WikiLucas (🖋️) 23:34, 16 February 2021 (UTC)
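The naming rule discussed in this thread (category prefix + ISO 639-3 code) can be sketched in a few lines. The function names are hypothetical, and this only builds the strings; it does not create any property.

```python
from urllib.parse import quote

CATEGORY_PREFIX = "Category:Lingua_Libre_pronunciation-"
COMMONS_WIKI = "https://commons.wikimedia.org/wiki/"


def commons_category(iso639_3: str) -> str:
    """Category name for a language, e.g. 'yue' gives the Cantonese category."""
    return CATEGORY_PREFIX + iso639_3


def commons_category_url(iso639_3: str) -> str:
    """Full Commons URL for that category; ':' is kept unescaped."""
    return COMMONS_WIKI + quote(commons_category(iso639_3), safe=":")


print(commons_category("yue"))
print(commons_category_url("yue"))
```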

Phabricator task priority

Hi Poslovitch and Yug. I saw you set priority levels on some Phabricator tasks (T264117, T251866, etc.). As already discussed briefly with Yug, I think we should not decide by ourselves what the priority of a bug report is, unless you have claimed the report. The priority of a task should be discussed during a Lingua Libre meeting gathering several people, because what you think is important may be less so to me (and vice versa). This is highly subjective and I do not see the benefit of setting a priority if it has not been discussed collectively beforehand. In addition, IMHO, it cannot be used to drive WikiValley & VIGNERON's job because they are bound by their contract with WMFr, so if they want to work on additional bug reports (not listed in the scope statement), they will choose by themselves. So please, stop changing the priority, or at least let us discuss it collectively first. Pamputt (talk) 09:14, 17 February 2021 (UTC)

Hi Pamputt, there is no intention to dictate what WikiValley & VIGNERON have to do. We are two parallel teams with different relations to the project. It would help if WikiValley or Wikimedia FR clarified on Phabricator, via a tag, which tasks WikiValley will take on, so that we volunteers may focus on OTHER tasks, be it code or community organization (Wiki Meet India coming).
Still, we volunteers are active now and we need to sort these tasks better for ourselves, so we can see more clearly and act where we can.
Volunteers should not have to open and decipher dozens of tasks to understand what they are about, how feasible they are and how important. Without assessment we lose clarity and time, and reduce our impact. We are volunteers. Volunteering, yes. But we must make the tasks easier to jump into and CODE/ACT. No assessment is perfect, and any assessment can be bypassed. So since Poslovitch, myself, and others are diving into these tasks, let's share our assessments if we have any view on them, reword the 1/3 of task titles and descriptions that are not so transparent, group them as needed, and improve as we can. Common wiki clean-up as we go. And each adult takes the task(s) they want according to their needs, naturally.
As for myself, I made a push to organize tasks by column, then group them by scope and repository, and put the most feasible scopes at the top and the less feasible at the bottom. I will stop my cleaning today to focus on other things shortly. Yug (talk) 10:28, 17 February 2021 (UTC)
Hi. I understand that you can be worried by the fact this triage seems sudden. And it actually is. Most tasks were left untriaged for months, while common project management practices (at least the ones I have followed for years now) call for a preliminary triage based on the "Urgent/Not urgent" & "Important/Not important" criteria (I think it's called the "Decision Matrix"). A task is deemed "Urgent" if leaving it uncompleted severely impedes, or would severely impede or even block, the fulfilment of the project's main goal; it is deemed "Important" if its completion helps improve or complete the project's capacity to fulfil its mission(s). These are simple "Yes/No" questions that help provide a quick assessment of the priority of the task. And in most cases (in my own experience), this method gives the right priority. The slight "difference" between "main goal" and "mission(s)" is well defined for Lingua Libre, which is why I could do that assessment.
However, I think you're wrong that the person who assigned himself to the task should evaluate its priority. Unless he's applying the decision matrix, he would obviously be biased about it, don't you think?
Yet, I personally kept myself from triaging issues on which I was not knowledgeable enough yet. But these issues should be triaged ASAP.
Finally: no, triaging is not a way to "force" Vigneron and WikiValley to work on stuff they aren't contractually meant to. That is, unless a specific task actually blocks them in their work, or is somehow linked to what they're doing or whatever. As Yug put it, triaging is one of the first steps to help us and other volunteers finally grasp the work that's to be done, without getting lost at first glance.
It's not meant to be perfect, and it must be refined. But that would only be the case for a handful of tasks. Not all of them. --Poslovitch (talk) 12:07, 17 February 2021 (UTC)

Ratelimit still an issue for our users

Following a review of Phabricator tasks, Pamputt and I discussed the ratelimit. It came back to me that we do have to warn our new users about this 380 uploads/72 mins issue. I wanted to see if our new users bumped into this wall without us knowing, so I closely examined the stream of Olaf, who made 1000+ recordings today.

23:10-23:11: 23
22:54-23:08: 280
22:19-22:25: 61
[13mins pause here]
20:52-21:06: 222 Note: 222+158=380
[17mins pause here]
20:26-20:35: 158
[16mins pause here]
20:09-20:10: 18 Note: 361+18=379
19:14-19:58: 361

I find those numbers suspiciously close to the ratelimit. So I am pretty confident Olaf runs into these upload failures and just found a workaround: he takes pauses. The point is, we must notify our emerging active users better about this issue. I now inform users who made at least one session with 100+ recordings with:

Code: {{subst:user ratelimit|user=Olaf}}

Result: Hello Olaf, your current userrights on Commons limits you to 380 recordings per 72mins. We can upgrade you rapidly via a request on Commons:Requests for rights. Interested ?

Comment: Template is a work in progress and still unstable. Be sure to always use subst: and to check the end result.

I just requested the autopatrol user right for Olaf. Let's monitor this issue closely and request autopatrol user rights on Commons sooner for emerging users. Yug (talk) 00:06, 18 February 2021 (UTC)
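The manual check above (adding up sessions to see whether they approach 380) could be automated. Here is a minimal sketch, assuming the limit behaves like a sliding 72-minute window, which this thread does not confirm. The function name is hypothetical and the timestamps in the demo are synthetic, not Olaf's actual uploads.

```python
from datetime import datetime, timedelta

LIMIT = 380                      # uploads, figure quoted in this thread
WINDOW = timedelta(minutes=72)   # window length quoted in this thread


def max_uploads_in_window(timestamps, window=WINDOW):
    """Largest number of uploads falling inside any window of the given length."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Move the left edge forward until the span is shorter than the window.
        while current - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


# Synthetic demo: two bursts separated by a pause, one upload per second.
base = datetime(2021, 2, 17, 20, 0, 0)
first_burst = [base + timedelta(seconds=i) for i in range(200)]
second_burst = [base + timedelta(hours=1, seconds=i) for i in range(180)]
peak = max_uploads_in_window(first_burst + second_burst)
print(f"Peak uploads in any 72-minute window: {peak} (limit {LIMIT})")
```

Run against a user's real upload timestamps, a peak at or just under 380 would suggest that user has hit the wall and is a good candidate for an autopatrol request.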

After analysis: VictorDtmtc (14:25-15:14, 16 February 2021 = 380 audios, then a 30 min pause), Olaf and Poemat were stopped by the 380 ratelimit these past few days. KlaudiuMihaila did 340 recordings in 40 mins, but was not stopped. I misinterpreted API data for 2 users.
The 4 users above didn't have autopatrol rights, were limited to 380 audios per 72 minutes, and developed on their own a "pause then click upload again" strategy, if lucky. Audios not uploaded may also get lost if the tab is closed in between.
We can assume the ratelimit wall currently limits a good part of our active contributors. I did not investigate the Marathi community.
I asked for autopatrol rights for Olaf, Poemat and VictorDtmtc. I think we must ask for the same user right for KlaudiuMihaila before s/he hits the wall.
Reminder: In June 2020, Luilui6666 lost a few hundred audio recordings, thinking the web browser tab had crashed and required a reboot. That's how we first investigated this issue. Yug (talk) 02:26, 18 February 2021 (UTC)