LinguaLibre

Technical board

Welcome to the Lingua Libre Technical board!
Where to start?
  • Local developments are easy. You can customize your CSS and JS, including creating a local WikiJS script, even with limited edit rights.
  • LinguaLibre Bot (Python, GitHub) is a high-impact project. Help is needed to authorize it on more wikis.
  • Join us on Phabricator and GitHub.
Skills we look for…
  • Developers: we especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers and issue coordinators, but everyone is welcome.
  • Project coordinators: we also look for organizers of recording/hacking meet-ups who are able to build a network with language learning, language conservation and NLP actors.
Happy Coding!
  • Please announce your hacking project here to raise awareness and gather feedback.
  • Most of our actions remain small in scope and volunteer-based. In case your project is large enough, you could learn about some of the funding options.
Development & Technical reports
Flash Technical News
  • January 25th, 2023: the latest GitHub revision has been pushed to the production server. Kurdish Wiktionary is now supported. Oriya Wiktionary will follow very soon. Support for more Wiktionary versions should follow.

Please visit LinguaLibre:About to learn more about the project.

Migration of technical contents

Hello all, Please help migrate technical contents from the main LinguaLibre:Chat room to here. Yug (talk) 18:49, 12 February 2021 (UTC)

2021 GitHub refresh: call for volunteers and discussion

See also Github.com/lingua-libre

Hello all,
Since November 2020 there has been an ongoing effort to clean up, document and fix the 11 GitHub repositories upon which LinguaLibre.org stands. A summary is available on the main forum and will be migrated here shortly. This section focuses on gathering users with development skills and discussing possible fields of action (repositories). We especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers and issue coordinators. Yug (talk) 15:57, 12 February 2021 (UTC)

Early 2021 coding: WikiValley & volunteers communication board

WikiValley has been selected to make a notable technical push on the LinguaLibre Suite where volunteer developers are not enough. They will coordinate with volunteer developers in order to smooth everyone's work and avoid duplicate efforts and git conflicts. The Start, End, and Repository columns below are especially important: please keep them up to date, respect them, or change them whenever required. If you need to work on a repository that is already being worked on, contact the developer listed there and organize as needed. Our objective here is to keep clarity and to progress smoothly. Please avoid emails and prefer communicating here within subsections so we can all be aware of how things are going. Yug (talk) 15:56, 12 February 2021 (UTC)

Note: Volunteers started working around December, WikiValley around Feb. 11th. Yug (talk) 15:56, 12 February 2021 (UTC)
Past developments
Start | End | Contacts/dev | Team | Repository | Advancement & result so far
2021/02/01 | 2021/02/10 | Yug | Volunteers | SignIt | Get back control (access right) ; fix video query ; test locally ; publish new version on Mozilla store → Fixed Firefox extension
2021/02/01 | 2021/02/16? | Yug, Michael | Volunteers, WM-France | /operations, /CommonDownloadTool | Explore possible breakpoints ; identify likely cause ; fix ; deploy ; run → Fixed https://lingualibre.org/datasets/
2021/02/11 | 2021/02/11 | VIGNERON, Wiki Valley | Wiki Valley | Blazegraph (updater) | Explore possible breakpoints ; identify cause ; fix ; deploy ; inquire on numbers differences → Fixed LinguaLibre:Stats
Current developments
2021/02/01 | 2022/01/01 | Poslovitch | Volunteers | Lingua-Libre-Bot | Maintain, update and operate the bot. 2021 Q1 [WIP]: Refactor the bot to ease implementations of additional Wiktionaries.
Planned developments
2021/02/01 | 2021/03/?? | Poslovitch | Volunteers | /operations, /CommonDownloadTool | Project: Explore datasets scripts and queries. May require SPARQL assistance.
2021/02/? | 2021/02/? | WikiLucas00, Yug | Volunteers | CustomSubtitle, BlueLL | Explore Subtitle's ribbon's bug ; identify cause.
2021/02/19 | 2021/02/? | VIGNERON, Wiki Valley | Wiki Valley, VIGNERON | custom extensions + operations | Update to MediaWiki 1.35

@VIGNERON please keep us informed a bit on what your team is touching. Just edit above and ping us to notify us of an update. Yug (talk) 16:25, 15 February 2021 (UTC)

User box ?

Babel user information
mar
mr-N या सदस्याला मराठी चे स्थानिक स्तराचे ज्ञान आहे.
cmn-1 This user has basic knowledge of Mandarin Chinese.
Users by language

It may be cool to create a userbox "dev" {{Userbox-dev}}, on the model of {{Userbox-records}}, with Python, Javascript, PHP, VueJS, Wikimedia Bot as specific sub-categorization ? Yug (talk) 15:59, 12 February 2021 (UTC)

I kinda disagree with that. Lingua Libre is not meant to become a hub for techies. Sure, we need all the help that comes, yet the only use case I foresee for these userboxes would be in the event something goes bad and we need someone with the right skills to take care of it. But, since it has to be added by oneself on one's user page, the same can be said of the page where we list who does what (I don't recall what it's called). Which one of these two "systems" should be kept? --Poslovitch (talk) 21:53, 13 February 2021 (UTC)
In terms of community I see ourselves as somewhere in between the Wikipedia and Wikidata communities. We mainly deal with singletons: audio files, which are data units. People come, make a more or less powerful recording contribution, then sharply reduce their involvement and leave thousands of file units here.
And like Wikidata, we need people giving life to these data units. This is done via reuse, bots, webapps, text-to-speech. Developers' creations.
So yes, developers have to become an important piece of our community. And we would gain from creating an active dynamic gathered around languages and projects (repositories). Yug (talk) 22:30, 13 February 2021 (UTC)

Datasets have become super slow?

I am trying to understand how the /datasets are generated.

  • In April 2020, the French dataset of about 100,000 audios was processed in 51 minutes.
  • In February 2021, the Bengali dataset of about 50,000 audios was processed in 18 hours.

What am I missing? Yug (talk) 00:09, 13 February 2021 (UTC)
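To put the two runs quoted above on the same scale, here is a minimal Python sketch that converts them into audios per minute. It only uses the approximate figures from the post (about 100,000 audios in 51 minutes, about 50,000 audios in 18 hours); the dictionary layout and function name are illustrative, not part of any existing script.

```python
# Figures quoted in the post above; treat them as rough estimates.
runs = {
    "French, April 2020": {"audios": 100_000, "minutes": 51},
    "Bengali, February 2021": {"audios": 50_000, "minutes": 18 * 60},
}


def audios_per_minute(run):
    """Average number of audio files processed per minute."""
    return run["audios"] / run["minutes"]


rates = {name: audios_per_minute(run) for name, run in runs.items()}
slowdown = rates["French, April 2020"] / rates["Bengali, February 2021"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} audios/minute")
print(f"Slowdown factor: {slowdown:.1f}x")
```

With these figures the French run averages roughly 1,960 audios per minute against roughly 46 for the Bengali run, a slowdown of about 42 times.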


Zip file Date Size (bytes)
lingualibre_full.zip 2019-May-17:01:18 1989664440
Q101-srr-Serer.zip 2019-Nov-05:03:09 14967
Q113-cmn-Mandarin_Chinese.zip 2019-Nov-05:03:09 112613
Q115107-bcl-Central_Bikol.zip 2019-Nov-05:03:09 166323
Q127-tam-Tamil.zip 2019-Nov-05:03:09 154352
Q130-zho-Chinese.zip 2019-Nov-05:03:10 2724328
Q131-hye-Armenian.zip 2019-Nov-05:03:10 824117
Q141-cym-Welsh.zip 2019-Nov-05:03:10 12905993
Q154-amh-Amharic.zip 2019-Nov-05:03:11 2653977
Q165-hat-Haitian_Creole.zip 2019-Nov-05:03:11 233588
Q169-tgl-Tagalog.zip 2019-Nov-05:03:11 77198
Q170137-mos-Mossi.zip 2019-Nov-05:03:11 1158142
Q205-gre-Greek.zip 2019-Nov-05:03:11 239390
Q231-myv-Erzya.zip 2019-Nov-05:03:21 205878
Q242-fon-Fon.zip 2019-Nov-05:03:21 1538614
Q258-nso-Northern_Sotho.zip 2019-Nov-05:03:24 774299
Q311-oci-Occitan.zip 2019-Nov-05:03:33 511332485
Q318-bam-Bambara.zip 2019-Nov-05:03:33 277786
Q321-gaa-Ga.zip 2019-Nov-05:03:33 3247380
Q336-ori-Odia.zip 2019-Nov-05:03:34 38697693
Q339-sat-Santali.zip 2019-Nov-05:03:34 128941
Q34-mar-Marathi.zip 2019-Nov-05:03:34 2274397
Q35-nld-Dutch.zip 2019-Nov-05:03:34 36279372
Q385-ita-Italian.zip 2019-Nov-05:03:34 3440247
Q388-que-Quechua.zip 2019-Nov-05:03:35 397476
Q39-tel-Telugu.zip 2019-Nov-05:03:35 85571
Q397-heb-Hebrew.zip 2019-Nov-05:03:35 1657223
Q405-bas-Basaa_language.zip 2019-Nov-05:03:35 1515700
Q437-mal-Malayalam.zip 2019-Nov-05:03:35 138601
Q446-pan-Punjabi.zip 2019-Nov-05:03:35 11004
Q4465-mis-Teochew_dialect.zip 2019-Nov-05:03:35 69734
Q45-nor-Norwegian.zip 2019-Nov-05:03:35 431566
Q46-ltz-Luxembourgish.zip 2019-Nov-05:03:35 1679618
Q51299-hav-Havu.zip 2019-Nov-05:03:37 56823
Q51302-tay-Atayal.zip 2019-Nov-05:03:37 65533
Q52067-bbj-Ghomala'_language.zip 2019-Nov-05:03:37 1765823
Q52068-bum-Bulu_language.zip 2019-Nov-05:03:37 1382789
Q52071-dua-Duala.zip 2019-Nov-05:03:37 1206427
Q52073-bdu-Oroko.zip 2019-Nov-05:03:37 1723960
Q52074-bzm-Londo.zip 2019-Nov-05:03:37 1750380
Q52295-atj-Atikamekw.zip 2019-Nov-05:03:37 7315215
Q74905-mis-Sursilvan.zip 2019-Nov-05:03:37 14618
Q83641-gcf-Guadeloupean_Creole_French.zip 2019-Nov-05:03:38 7412512
Q930-mis-Gascon_dialect.zip 2019-Nov-05:03:39 179656450
Q931-mis-Languedocien_dialect.zip 2019-Nov-05:03:40 191575650
Q123-hin-Hindi.zip 2020-Apr-25:03:30 1704401
Q126-por-Portuguese.zip 2020-Apr-25:03:31 43732966
Q129-rus-Russian.zip 2020-Apr-25:03:32 60844464
Q150-afr-Afrikaans.zip 2020-Apr-25:04:18 42363003
Q159-dyu-Dioula_language.zip 2020-Apr-25:04:18 784432
Q19858-bci-Baoulé.zip 2020-Apr-25:04:18 1268304
Q203-cat-Catalan.zip 2020-Apr-25:04:18 9738365
Q204940-ken-Nyang_language.zip 2020-Apr-25:04:18 483396
Q208-vie-Vietnamese.zip 2020-Apr-25:04:18 8822067
Q219-ara-Arabic.zip 2020-Apr-25:04:19 85373129
Q21-fra-French.zip 2020-Apr-25:05:10 2112950650
Q221062-mis-Cantonese.zip 2020-Apr-25:05:10 3895600
Q22-eng-English.zip 2020-Apr-25:05:12 131688602
Q25-epo-Esperanto.zip 2020-Apr-25:05:19 445662713
Q264201-ary-Moroccan_Arabic.zip 2020-Apr-25:05:19 1371064
Q273-kab-Kabyle.zip 2020-Apr-25:05:19 370876
Q298-pol-Polish.zip 2020-Apr-25:05:21 145009958
Q299-eus-Basque.zip 2020-Apr-25:05:21 46035866
Q33-fin-Finnish.zip 2020-Apr-25:05:46 19473062
Q386-spa-Spanish.zip 2020-Apr-25:05:46 28434220
Q389-jpn-Japanese.zip 2020-Apr-25:05:46 145688
Q392-ces-Czech.zip 2020-Apr-25:05:46 96844
Q44-swe-Swedish.zip 2020-Apr-25:05:46 166237
Q4901-shy-Shawiya_language.zip 2020-Apr-25:05:47 15804835
Q6714-arq-Algerian_Arabic.zip 2020-Apr-25:05:47 3420182
Q80-kan-Kannada.zip 2020-Apr-25:05:47 3662223
Q24-deu-German.zip 2021-Feb-11:15:32 258363332
Q307-ben-Bengali.zip 2021-Feb-12:07:28 1079637723
IMO, this can only be investigated through the logs. Maybe the requests to Commons are taking a longer time than they used to? Maybe the datasets server is under higher load (thus slowing it)? We need you, Michaël! --Poslovitch (talk) 21:41, 13 February 2021 (UTC)
@Poslovitch could it be that the script only uploads the "never uploaded yet" audios? If so, the April 2020 French dataset was just the 3000 recent French audios whereas the Feb 2021 Bengali dataset was like "Yo, here are the 50,000 Bengali audios, deal with it B)" Yug (talk) 22:24, 13 February 2021 (UTC)
@Yug that might be it. I still don't fully understand what the script does and, well, we can say the documentation is clearly lacking there. I'm working on that too - but yeah, that might be why it's taking more time. We should let it run for a first time, and then force another dataset update a few days later so we can compare both. --Poslovitch (talk) 22:38, 13 February 2021 (UTC)
@Michael Barbereau WMFr Seems the script has finished running, right ? Any idea why it's so slow in 2021, is there some known overload or hardware issue ? Yug (talk) 12:07, 15 February 2021 (UTC)
Some of the root causes have been unearthed in this comment on GitHub: https://github.com/lingua-libre/CommonsDownloadTool/issues/2#issuecomment-780177124. --Poslovitch (talk) 23:16, 16 February 2021 (UTC)

Using Magic Word {{#language:}} and Extension:CLDR (?)

For general awareness. No real question asked.

I found out LL uses the MediaWiki Extension CLDR. Its data comes from the w:Common Locale Data Repository Project (CLDR), part of the Unicode Consortium. This extension automates the translation of ISO 639 codes into language names in a target language, e.g. {{#language:it|en}} → Italian. Coverage is ~500 names in ~166 languages. Translatewiki has a tutorial on how to contribute to the CLDR website.

  • mw:Help:Magic_words#Miscellaneous > {{#language:language code|target language code}}
    → {{#language:ar|en}} → Arabic
    → {{#language:ar|hi}} → अरबी
    → {{#language:ja|hi}} → जापानी
    → {{#language:fr|he}} → צרפתית
    → {{#language:fra|he}} → fra (not available)
    → {{#language:fr-ca|he}} → Canadian French (falls back on English)
    → {{#language:mar|hi}} → mar (n.a)
    → {{#language:en|mar}} → English (falls back on English)
    → {{#language:mar|en}} → mar (n.a)
    → {{#language:mr|en}} → Marathi
    → {{#language:mr|mr}} → मराठी
    • mw:Extension:CLDR MediaWiki Extension : "Provides functions to localize the names of languages, countries, currencies, and time units based on their language code."
      • Github mirror > key folder : /CldrNames & Wikimedia corrections here.
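
For anyone who wants to try these lookups outside a wiki page, here is a minimal Python sketch that builds a request URL for MediaWiki's `action=expandtemplates` API module, which expands wikitext such as the magic word above. The API address `https://lingualibre.org/api.php` and the function names are assumptions made for illustration; the sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumption: the wiki's API lives at this URL; adjust for your wiki.
API_URL = "https://lingualibre.org/api.php"


def language_name_wikitext(code, target):
    """Build the {{#language:...}} call for a language code and a target language."""
    return "{{#language:%s|%s}}" % (code, target)


def expandtemplates_url(code, target, api_url=API_URL):
    """Return a URL asking MediaWiki to expand the magic word into plain text."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "text": language_name_wikitext(code, target),
    }
    return api_url + "?" + urlencode(params)


print(expandtemplates_url("ar", "hi"))
```

Opening the printed URL in a browser should return a small JSON document with the expanded text, provided the API address is correct for the wiki being queried.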

While we would gain from staying focused on our own recording mission, it remains interesting to be aware of this project. cc @Poslovitch , for the Magic Word. Yug (talk) 12:07, 15 February 2021 (UTC)

Good to know, but I don't think we're going to use that in the foreseeable future. --Poslovitch (talk) 22:41, 16 February 2021 (UTC)

BlueLL theme might break when updating to MW 1.35

Done, closed. A fix has been merged on GitHub. Yug (talk) 22:58, 21 February 2021 (UTC)

Hi @VIGNERON . According to this pending PR on GitHub (https://github.com/lingua-libre/BlueLL/pull/3), the BlueLL theme might not be compatible with MW 1.35. I have limited knowledge in MW themes, but I can merge the PR if needed. What's your opinion about it? --Poslovitch (talk) 10:49, 16 February 2021 (UTC)

Hi there, I also don't have the technical understanding of MediaWiki themes to say much, but I encourage you to talk with jdlrobson to see what he thinks about his fix & 1.35. Yug (talk) 21:30, 16 February 2021 (UTC)
@VIGNERON I resumed the conversation with jdlrobson on his pull request. But I admit I don't have the technical capability to properly review his code submission. Also, I believe upgrading to MW 1.35 and checking on skin compatibility is within WikiValley's mission. Please clarify, and feel free to slow down this PR discussion if required. Yug (talk) 12:01, 17 February 2021 (UTC)
I did a preliminary test of BlueLL on 1.35 and made similar changes, although less complete since it was a first try; I hadn't seen this PR. So I will test the PR more extensively and will accept it after review. Seb35 (talk) 10:41, 19 February 2021 (UTC)

Generate a summary.csv alongside the datasets ?

To migrate to phabricator. Yug (talk) 22:58, 21 February 2021 (UTC)

This idea just crossed my mind. Would it be interesting to generate a summary.csv file containing the list of available datasets, their generation date and their size in bytes, with additional information such as the number of recordings, the number of speakers, the total length of audio files... Any opinions? --Poslovitch (talk) 11:15, 16 February 2021 (UTC)
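
As a starting point for discussion, here is a minimal Python sketch of such a summary, assuming the datasets are plain .zip files sitting in one directory. The function name, the column names and the use of the file modification time as "generation date" are all assumptions; counting recordings, speakers or audio length would require opening each archive and is left out.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_summary(datasets_dir, summary_name="summary.csv"):
    """List every .zip in datasets_dir with its modification date and size."""
    datasets_dir = Path(datasets_dir)
    rows = []
    for zip_path in sorted(datasets_dir.glob("*.zip")):
        stat = zip_path.stat()
        # Assumption: the file's modification time stands in for its generation date.
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "file": zip_path.name,
            "generated_utc": modified.strftime("%Y-%m-%d %H:%M"),
            "size_bytes": stat.st_size,
        })
    summary_path = datasets_dir / summary_name
    with summary_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["file", "generated_utc", "size_bytes"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return summary_path
```

Calling `write_summary("/path/to/datasets")` (a hypothetical path) after each dataset run would keep the file in sync with the directory listing.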

(Then why not a minimalist HTML5 webpage with a single table ? Would be more elegant. Yug (talk) 21:35, 16 February 2021 (UTC))
Also, only 13 zips have been updated, although more languages have been active in the past month alone. The /datasets/ page also doesn't display the 100+ languages it should. So I suspect create_datasets.sh is still not doing the full thing.
Poslovitch, you see it too ? Yug (talk) 21:35, 16 February 2021 (UTC)

Where is LinguaImporter's code (admins only)

Done, can be closed. Found the place. Yug (talk) 22:59, 21 February 2021 (UTC)

Hello, we have T233917, which requests an edit to the language importer tool. I checked Mediawiki:Common.js and github/lingua-libre with the UI's strings search:LinguaImporter and search:Import a language, but found nothing. Any idea where this LanguageImporter tool is coded? Yug (talk) 21:48, 16 February 2021 (UTC)

Found it. It's a Gadget.
Yug (talk) 22:13, 16 February 2021 (UTC)
@WikiLucas00 & Pamputt I just want to know WHO knows how to create a Property on LL's language Q-items? It's still under discussion on Phabricator: a possible language Property whose value would be Category:Lingua_Libre_pronunciation- + ISO 639-3 code. Ex: Category:Lingua_Libre_pronunciation-yue or https://commons.wikimedia.org/wiki/Category:Lingua_Libre_pronunciation-yue. You are invited on phabricator:T233917 to give your input on this issue. Yug (talk)
To have a property "Commons category" (language) AND/OR "Commons category"(speaker) giving a link to the Commons category on each file is interesting.
Here is the link to create a new property, but unfortunately I never used it, maybe @VIGNERON or Pamputt will know better. — WikiLucas (🖋️) 23:34, 16 February 2021 (UTC)

Phabricator task priority

Hi Poslovitch and Yug. I saw you set priority levels on some Phabricator tasks (T264117, T251866, etc.). As already discussed briefly with Yug, I think we should not decide by ourselves what the priority of a bug report is, unless you have claimed the report. The priority of a task should be discussed during a Lingua Libre meeting gathering several people, because what you think is important may be less so to me (and vice versa). This is highly subjective and I do not see the benefit of setting a priority if it has not been discussed collectively beforehand. In addition, IMHO, it cannot be used to drive WikiValley & VIGNERON's job because they are bound by their contract with WMFr, so if they want to work on additional bug reports (not listed in the scope statement), they will choose by themselves. So please stop changing the priorities, or at least let us discuss them collectively first. Pamputt (talk) 09:14, 17 February 2021 (UTC)

Hi Pamputt, there is no will to dictate what WikiValley & VIGNERON have to do; we are two parallel teams, with different relations to the project. It would help if WikiValley or Wikimedia FR clarified on Phabricator, via a tag, which tasks WikiValley will take on so that we, volunteers, may focus on OTHER tasks, be it code or community organization (Wiki Meet India coming).
Still, we volunteers are active now and we need to sort these tasks better for ourselves, so we see better, and act where we can.
Volunteers should not have to decipher and open dozens of tasks to understand what they are about, and their feasibility and importance. With a lack of assessment we lose clarity and time, and reduce our impact. We are volunteers. Volunteering, yes. But we must make the tasks easier to jump into and CODE/ACT. No assessment is perfect, and any assessment can be bypassed. So since Poslovitch, myself, and others are diving into these tasks, let's share our assessments if we have any view on them, reword the 1/3 of task titles and descriptions that are not so transparent, group them as needed, and improve as we can. Common wiki clean up as we go. And each adult takes the task(s) they want according to their needs, naturally.
As for myself, I made a push to organize tasks by column, then group them by scope and repository, and put the most feasible scopes up, less feasible down. I will stop my cleaning today to focus on other things shortly. Yug (talk) 10:28, 17 February 2021 (UTC)
Hi. I understand that you may be worried by the fact that this triage seems sudden. And it actually is. Most tasks were left untriaged for months, while common project management practices (at least the ones I have followed for years now) call for a preliminary triage based on the "Urgent/Not urgent" & "Important/Not important" criteria (I think it's called the "Decision Matrix"). A task is deemed "Urgent" if leaving it uncompleted severely impedes, or would severely impede or even block, the fulfilment of the project's main goal; it is deemed "Important" if its completion helps improve or complete the project's capabilities to fulfil its mission(s). These are simple "Yes/No" questions that help provide a quick assessment of the priority of the task. And in most cases (in my own experience), this method gives the right priority. The slight "difference" between "main goal" and "mission(s)" is well defined for Lingua Libre, which is why I could do that assessment.
However, I think you're wrong on the fact that the guy who assigned himself to the task should evaluate its priority. Unless he's applying the decision matrix, he would obviously be biased about it, don't you think?
Yet, I personally kept myself from triaging issues on which I was not knowledgeable enough yet. But these issues should be triaged ASAP.
Finally: no, triaging is not a way to "force" Vigneron and WikiValley to work on stuff they aren't contractually meant to. That is, unless a specific task actually blocks them in their work, or is somehow linked to what they're doing or whatever. As Yug put it, triaging is one of the first steps to help us and other volunteers finally grasp the work that's to be done - without getting lost at first glance.
It's not meant to be perfect, and it must be refined. But that would only be the case of a handful of tasks. Not all of them. --Poslovitch (talk) 12:07, 17 February 2021 (UTC)

Ratelimit still an issue for our users

Following a review of Phabricator tasks, Pamputt and I discussed the ratelimit. It came back to me that we do have to warn our new users about this 380 uploads per 72 minutes issue. I wanted to see if our new users bumped into this wall without us knowing, so I closely examined the upload stream of Olaf, who made 1000+ recordings today.

23:10-23:11: 23
22:54-23:08: 280
22:19-22:25: 61
[13mins pause here]
20:52-21:06: 222 Note: 222+158=380
[17mins pause here]
20:26-20:35: 158
[16mins pause here]
20:09-20:10: 18 Note: 361+18=379
19:14-19:58: 361

I find those numbers suspiciously close to the ratelimit. So I am pretty confident Olaf runs into these upload failures and just found a workaround: he takes pauses. The point is, we must notify our emerging active users better about this issue. I now inform users who made at least one session with 100+ recordings with:

Code: {{subst:user ratelimit|user=Olaf}}
Result: Hello Olaf, your current userrights on Commons limit you to 380 recordings per 72mins. We can upgrade you rapidly via a request on Commons:Requests for rights. Interested ?
Comment: Template is a work in progress and still unstable. Be sure to always use subst: and to check the end result.

I just asked autopatrol userright for Olaf. Let's monitor this issue closely and ask faster for autopatrol userrights on Commons for emerging users. Yug (talk) 00:06, 18 February 2021 (UTC)
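
The manual check described above (adding up bursts of uploads and comparing them with 380 per 72 minutes) can be sketched in Python as a sliding-window count over upload timestamps. This is only an approximation under stated assumptions: the limit is modelled as a simple sliding window, which may not match how Commons actually counts, and the function names and sample timestamps are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=72)  # window length quoted in this thread
LIMIT = 380                     # uploads allowed per window, as quoted above


def peak_uploads(timestamps, window=WINDOW):
    """Return the largest number of uploads inside any sliding window."""
    recent = deque()
    peak = 0
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop uploads that fall outside the window ending at the newest one.
        while ts - recent[0] >= window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak


def likely_hit_ratelimit(timestamps, limit=LIMIT, window=WINDOW):
    """True when some window holds at least `limit` uploads."""
    return peak_uploads(timestamps, window) >= limit


# Made-up example: one upload every 10 seconds, 380 uploads in total.
start = datetime(2021, 2, 17, 19, 14)
session = [start + timedelta(seconds=10 * i) for i in range(380)]
print(peak_uploads(session), likely_hit_ratelimit(session))
```

Feeding it the real upload timestamps of a user (for example from the contributions API) would flag sessions that reach the wall without reading the stream by hand.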

User | Relevant timespan | Recordings done | Forced pause? | Hit ratelimit? | Commons userrights
Olaf | n.a. | 380 recordings | yes | yes | autoconfirmed → request autopatrol : granted
Poemat | n.a. | 380 recordings | yes | yes | autoconfirmed → request autopatrol : granted
VictorDtmtc | 14:25-15:14 16 February 2021 | 380 recordings | 30mins | yes | autoconfirmed → request autopatrol : granted
Webfil | 19:54-20:51 19 February 2021 | 380 recordings | 17mins | yes | autoconfirmed → Recommend message to user, request autopatrol soon
KlaudiuMihaila | | 340 recordings in 40mins | no | Got near ratelimit wall | autoconfirmed → request autopatrol : granted
Eihel | 11:04-11:24 1 March 2021 | 380 recordings (20mins) | n.a. | yes | autoconfirmed → request autopatrol : under review
Gaurav Jhammat | 25 February 2021 | 957 recordings (4hours) | no | Passed it | autopatroller already.
SangeetaRH | 7 March 2021 | 380 recordings (?) | n.a. | yes | autoconfirmed → Recommend message to user, request autopatrol asap.
Unjoanqualsevol | 7 March 2021 | 380 recordings (?) | n.a. | yes | autoconfirmed → Recommend message to user (Done), request autopatrol when 2000 audios uploaded.
I misinterpreted API contributions data for 2 other users, then removed my message to them.
The 4 users above didn't have autopatrol rights, were limited to 380 audios per 72 minutes, and developed on their own a "pause then click upload again" strategy, if lucky. Audios not yet uploaded may also get lost if the tab is closed in between.
We can assume the ratelimit wall currently limits a good part of our active contributors. I did not investigate the Marathi community.
I asked for autopatrol rights for Olaf, Poemat and VictorDtmtc. I think we must ask for the same userright for KlaudiuMihaila before s.he hits the wall.
Reminder: In June 2020, Luilui6666 lost a few hundred audio recordings after thinking the web browser tab had crashed and required a reboot; that's how we first investigated this issue. Yug (talk) 02:26, 18 February 2021 (UTC)
Table above got updated on 19:17, 20 February 2021 (UTC). Yug (talk) 19:17, 20 February 2021 (UTC)

Poslovitch is now Lingua Libre Bot's operator

Thanks to Michaël, I now have access to the bot's account and will be able to test any changes made to its code. However, I'm wondering whether I have to "redo" the various bot agreements on the wikis or not. They were approved when 0x010C was the bot's operator, and now it's me - at least for a year, but it's definitely going to change again at some point -. So... What should be done ? --Poslovitch (talk) 16:42, 19 February 2021 (UTC)

@Poslovitch Would love to have the bot implemented on ku.wiktionary. I have already discussed it with other users. How can I help? :) --Balyozxane
Welcome Balyozxane. We are building a page (LinguaLibre:Bot) to handle such requests. We will need you to provide us with "best practice" examples of audio integration on ku.wiktionary. The request should include 1) links to some correctly formatted ku.wiktionary pages and 2) some explanation of the template used, i.e. which field is what (because we don't read Kurdish). Yug (talk) 19:23, 20 February 2021 (UTC)
@Balyozxane hello, I moved your content to the dedicated page, LinguaLibre:Bot. We process such requests there. It's the first request, so bear with us. We improve our process as we go. Yug (talk) 18:47, 21 February 2021 (UTC)

CSS : Mediawiki:Common.css and BlueLL stylesheets

The more I dig, the more I find tiny CSS bugs all around. Today I noticed the Table of Contents (TOC) of each page has its bullets too far to the left. I think the original designer's intent was to have no bullets. I also found out that the content of image descriptions is broken. Tables there also had no border, because a CSS rule affecting both classes .toc and .commons-file-information-table had been neutralized to remove the border on TOCs. I fixed it a bit, restoring the border on .commons-file-information-table. We likely have more CSS bugs on various pages and sections. Please watch out and report on Phabricator. You can log in using your Wikimedia account.

PS: I am starting a wikislow for some time in order to push IRL, non-Wikimedia issues. These past months I joined others to push on the GitHub general cleanup, the SignIt fix, Wiki Meet India, Lingualibre:Events, the Technical board creation, LinguaLibre:Bot, the ratelimit investigation above and a medium Phabricator cleanup. Most are stable by now, but the LinguaLibre:Bot page and the ratelimit issue still need close monitoring. Help is therefore welcome on these two. I wish to focus on lighter GitHub coordination and event coordination with Eavq, Adelaide and Taiwan Universities. Yug (talk) 20:50, 21 February 2021 (UTC)

Testing the bot in "live" conditions

Yesterday, from 8 PM to midnight (UTC+1), I ran a test. The bot has this feature called "live mode": it adds the recordings in near-real time to Wikidata and Wiktionaries. And I must say it was both impressive and efficient. From what I can see now, it only missed a handful (~15) of recordings out of 899 recordings that were/could be added to either Wikidata or Wiktionaries. And, in all, it handled 2407 recordings. It crashed 4 times, each one of them being caused by Windows and not by an issue in the bot's code. What a satisfying sight seeing all of that happen!

While I would like to use the bot in this "live mode" from now on, we must weigh the pros and cons of this feature. The fact it was not used so far indicates that either 0x010C or the LL community was not ready for it at that time. I'm especially interested in knowing if this has caused an increased load on the BlazeGraph (it shouldn't have). With the bot's (planned) expanding capabilities, it might become more practical to us to have it run all the time in the background.

Finally, since I'm working on phab:T274511, I believe we could run the bot in this "live mode" and, each month, set up the bot to go through all the recordings, starting from the very beginning of LL. Alongside catching any "left-overs", this would have plenty of benefits: 1. put the recordings (even the oldest ones) on the Wiktionaries that will become served by the bot in the following months (e.g. the Kurdish Wiktionary) 2. put the oldest recordings on pages that maybe didn't exist at that time on the Wiktionaries/Wikidata 3. force contributors to report erroneous recordings, so that we can remove them. If they fail to do that, the recordings will be added again at the bot's next pass.

In the meantime, I'll keep doing tests with the bot.

This is probably something that I should've put on the Chat room - tell me if I should. --Poslovitch (talk) 11:05, 24 February 2021 (UTC)

Hahahah. That's impressive. This level of activity is a change from December 2020! Please note that Olaf has created user:Olafbot, to do some tests as well I guess. Yug (talk) 18:32, 24 February 2021 (UTC)

Lists : via Bot or RecordWizard implementation ?

This conversation follows the creation of User:Olafbot, the first user-created bot account. It raises both technical and bot policy questions.

Yes, it was me :-) I'm trying to use the OAuth authentication in my code to be able to generate lists I wrote about, and refresh them automatically in Lingua Libre. I hope using a bot for this is not against the rules here. The lists, generated for 50 languages, will consist of words without recorded pronunciation (including pronunciation from other sources in Commons, not only LiLi), and will be sorted by the number of wiktionaries with a corresponding language section describing this word (not just a number of interwikis). From my experience, this approach produces long stable lists of lemmas, more useful in Wiktionary context than the classic frequency lists, which tend to cover 20k lemmas at most, usually include inflected forms, and are of poor quality or non-existing for most of the languages.

BTW, there are as many as 185000 recordings in French, but still many basic words have no French recording at all, for example, "centreuropéen" or "Grégoire", because everybody records "eau" and "chien". It looks like a waste of time of many people. In fact, only 2/3 of French lemmas in Polish Wiktionary have the pronunciation recorded, even if we have just 26000 French lemmas. It's much worse in the case of less covered languages. I believe the lack of regularly refreshed lists of needed audio is a major block. Let me fix this first.

In general, I'm not good at Python. I use Java, JavaScript, and TypeScript every day at work, but I tried Python only once or twice in AI competitions on Kaggle.com. I would need probably a lot of learning to be able to contribute to your bot. Maybe in the future... Olaf (talk) 01:14, 25 February 2021 (UTC)
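
The ranking described in this post (words without any recording, sorted by the number of Wiktionaries that have a section for them) can be sketched in a few lines of Python. This is a toy illustration, not Olaf's actual code: the input shapes, the function name and the sample data are all made up.

```python
def missing_audio_list(sections_by_word, recorded_words):
    """Rank unrecorded words by how many Wiktionaries describe them.

    sections_by_word maps a word to the set of Wiktionary codes that have
    a section for it in the language of interest (hypothetical input shape).
    """
    candidates = [
        (word, len(wikts))
        for word, wikts in sections_by_word.items()
        if word not in recorded_words
    ]
    # Most widely described words first; alphabetical order breaks ties.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates]


# Toy data, made up for illustration only.
sections = {
    "eau": {"fr", "pl", "en", "de"},
    "chien": {"fr", "pl", "en"},
    "Grégoire": {"fr", "pl"},
    "centreuropéen": {"fr"},
}
recorded = {"eau", "chien"}
print(missing_audio_list(sections, recorded))  # ['Grégoire', 'centreuropéen']
```

Rebuilding the list every night, as proposed, would amount to refreshing both inputs and running the function again.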

Hello @Olaf ,
Missing recordings: Thank you for tackling this "missing gaps" issue. It's indeed a topic of concern for us all, Wikt and LL. While we have 109 languages, only 22 have 2000 audios or more. Most languages approach recording sessions in an easy-going way, by topic of interest, which is more pleasant for the speaker. This causes very irregular coverage and leaves numerous gaps that are hard to fill. Leveraging frequency lists such as UNILEX is a first response: it adds a relevant priority for general usage, assessed by « online corpora's frequency ». Having lists of missing audios would be another spot-on response. Your priority assessment « sorted by the number of wiktionaries with a corresponding language section describing this word » is likely of similar general relevance and of higher relevance to Wiktionaries.
Bots policy ?: We currently have no policy on bots; we are leading by practice (Poslovitch, you) and building it along the way. As long as you test respectfully it's ok I guess. Tip: administrators have a batch revert tool if the need arises. :) There is also an emerging WikiAPI JS bot, which could be interesting for you to explore if you like JS more. The maintainer is highly active and added a suggested new function within 24 hours, which was impressive.
Lists types: If you want to propose recording lists to users, there are two different approaches:
  • Canonical lists : Japanese JLPT, Chinese HSK, SWADESH, UNILEX frequency lists. → create a List:{iso}/{list name}. Referent: Yug
  • Dynamic lists : Places near you, Category:Fruit on English Wikipedia, etc. → create a Sparql query. Referents: VIGNERON (Sparql) then Poslovitch (implementation on github).
RecordWizard queries button ?: I'm not sure, but your project may be more in line with dynamic lists fetched via Sparql queries. It would take a few months, but we can head toward adding a Record words without pronunciation button to the RecordWizard (github), shown when we load a list.
Lists policies ?: Canonical lists and dynamic lists have long been a de facto complementary duo within RecordWizard's list system. This distinction now emerges as a needed policy, to be honest. As I'm keeping an eye on Marathi lists (62 and growing, mainly mini-lists and likely one-shots), it becomes clear that we will have to put in place better practices, guidelines, and mentoring of new users for list creation. Yug (talk) 10:45, 25 February 2021 (UTC)
Does the dynamic list require a category in some wiki? Then I'm afraid they are not suitable for my purpose, because there is probably no single wiki that would contain all the words from the list. I planned to just set the bot to rebuild a static list every night, just like I do in the case of lists of missing lemmas in Polish Wiktionary, I must only succeed with the untypical authentication used by the LiLi wiki. Yes, the new button would be a good thing. Currently, there is a button "Remove words already recorded" in the Record Wizard, but it removes only words recorded by the current speaker. Olaf (talk) 12:04, 25 February 2021 (UTC)
If the button were in place, I would just produce the lists once (all lemmas sorted, with non-LiLi recordings removed) and would rely on the new mechanism to automatically remove audios recorded in LiLi. The lists could be refreshed once a month or never. From my perspective, it would be a better solution, because the data would be more up-to-date. So, I believe the question in the header of this section is moot: the two solutions are not opposed, they complement each other. Moreover, the "button" solution alone won't provide good sorting and a good corpus of lemmas. Olaf (talk) 12:35, 25 February 2021 (UTC)
Note that there is a feature request to ask for a button not to record words already recorded by any speaker (not only you).
About the lists you want to generate, they are indeed very interesting, even if they are "dynamic". Remember that Lingua Libre has several goals. One of them is to provide missing recordings for Wiktionaries or Wikidata lexemes. Another one is to provide several recordings of the same word in order to show the diversity of pronunciation depending on the speaker's location. So whatever lists you produce, they will be useful to serve one of these goals. So go ahead :) Pamputt (talk) 16:48, 25 February 2021 (UTC)
Yes, the button for removing already recorded words would be a big step ahead for the project. What about the existing recordings in Commons, that were created with other tools than LiLi? Perhaps you could update your database with those recordings? I don't need it, because my bot adds them anyway in Polish Wiktionary, but perhaps your bot, and anybody who downloads the LiLi datasets, would have a larger base of recordings? Olaf (talk) 20:46, 25 February 2021 (UTC)

So I created a few lists just to gather the feedback:

Enjoy! Any opinions are welcome. However, the lists are not refreshed daily yet, I must still work out the authentication. In the end, there should be at least 30 lists. Olaf (talk) 03:20, 26 February 2021 (UTC)

Thanks Olaf for these lists. I looked at the French list and I did not see any typo, so the quality and the value is high. Pamputt (talk) 07:01, 26 February 2021 (UTC)
Saw your list in Special:RecentChanges ! This is good. Nice ! We can wait a week or so to think about the side effects, but seems good. Yug (talk) 09:14, 26 February 2021 (UTC)

Finally, I implemented the authentication. Apparently, bots don't need to use OAuth and log via Commons here (in contrast to normal users).

Lists for 72 languages have been generated: [1] (afr, ang, ara, ast, aze, bel, ben, bul, cat, ces, cmn, cym, dan, deu, ekk, ell, eng, epo, eus, fao, fas, fin, fra, gla, gle, glg, grc, heb, hin, hrv, hun, hye, ina, ind, isl, ita, jav, jpn, kan, kat, kaz, kor, lat, lit, ltz, lvs, mar, mkd, mlg, nld, nor, oci, pan, pol, por, ron, rus, san, slk, slv, spa, sqi, swa, swe, tam, tel, tha, tur, ukr, vie, yid, yue). The lists will be updated every night. Olaf (talk) 00:39, 27 February 2021 (UTC)

Hi there! There's a planned feature that I will be working on in May with User:WikiLucas00. The goal is kinda similar to what you're achieving, except that it's going to be implemented in the RecordWizard (as a dedicated button). I expected to use Petscan queries, but the idea of having a bot updating lists would be kinda interesting. Petscan is fine to do queries on Wikimedia projects, but having a bot that would scrape something else and generate lists out of that would be great too. And both "methods" could be fed into the new "generate word lists" mode. Keep me in touch, Olaf! --Poslovitch (talk) 10:29, 27 February 2021 (UTC)
Perfect. It would be nice to have the list feature integrated with RecordWizard. I'm not sure what exactly you are going to implement; however, if you plan to do something similar, it takes a considerable amount of time to scan 150 wiktionaries for lemmas, so I believe the lists should be at least partially prepared before the user clicks the button. For example, the existing recordings might be removed at this point from the prepared list. Olaf (talk) 17:54, 27 February 2021 (UTC)
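As a rough illustration of the list-building logic discussed in this thread (this is not Olafbot's actual code, which is not shown here), the sketch below ranks lemmas by the number of wiktionaries describing them and drops those that already have a recording. The function name and the shape of the inputs are assumptions made for the example.

```python
from collections import Counter


def build_missing_audio_list(lemmas_by_wiktionary, recorded, limit=None):
    """Rank lemmas by how many wiktionaries describe them, skipping recorded ones.

    lemmas_by_wiktionary: dict mapping a wiktionary code to a set of lemmas
        (hypothetical input shape, e.g. produced by a nightly scan).
    recorded: set of lemmas that already have an audio file.
    limit: optional maximum number of words to keep.
    """
    counts = Counter()
    for lemmas in lemmas_by_wiktionary.values():
        counts.update(set(lemmas))
    missing = [word for word in counts if word not in recorded]
    # Most widely described first; alphabetical order breaks ties.
    missing.sort(key=lambda word: (-counts[word], word))
    return missing if limit is None else missing[:limit]


sample = {
    "pl": {"eau", "chien", "Grégoire"},
    "en": {"eau", "Grégoire"},
    "de": {"eau", "centreuropéen", "Grégoire"},
}
print(build_missing_audio_list(sample, recorded={"eau", "chien"}))
```

Preparing the ranking ahead of time and only subtracting the recorded words at load time would keep the slow scan of all wiktionaries out of the user's click path.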

LLBot and Wiktionary

  • If I record the words from List:Ben/Lemmas-without-audio-sorted-by-number-of-wiktionaries, will the LinguaLibre bot add those audios automatically? --টিটো দত্ত (Titodutta) (কথা) 16:11, 27 February 2021 (UTC)
    • I'm not the operator, but I believe eventually the LiLi bot will add them. It has added some of my recordings in Polish to French Wiktionary and Occitan Wiktionary, so it works at least in those wikis. And I promise your recordings will be added by Olafbot to the corresponding Bengali articles on Polish Wiktionary this night, just like any pronunciation recording appearing in Commons. And of course, the words should disappear from the list. Olaf (talk) 17:10, 27 February 2021 (UTC)
The LLBot is not able to add recordings on the Bengali Wiktionary (1° it's not coded to do so, 2° it does not have the bot status on it). If the entries exist on the French or the Occitan Wiktionaries, then the recordings will be there, of course. --Poslovitch (talk) 20:35, 27 February 2021 (UTC)
@Titodutta to complete what Poslovitch said, LLBot is not yet able to add recordings on the Bengali Wiktionary. That said, it is possible to do it in the future. I think we should create LinguaLibre:LinguaLibreBot as a place to request support for a new Wiktionary or to report an issue on a given Wiktionary. What do you think? Pamputt (talk) 09:03, 28 February 2021 (UTC)
@Pamputt It already exists: LinguaLibre:Bot ;) ! --Poslovitch (talk) 10:46, 28 February 2021 (UTC)
@Titodutta you can create a request on LinguaLibre:Bot based on the form example. The page guides you to provide the needed information so we can work toward authorizing User:LinguaLibre bot on your target wiktionary. Poslovitch (and myself) will be your guides. Olaf may join us too, since he started to run a bot and may be interested in this field. Yug (talk) 13:29, 28 February 2021 (UTC)
@Yug IMHO, only LLBot should be adding recordings to wiktionaries. It acts like a display - you record on LL, then LLBot comes to your Wikt and adds the recordings. Moreover, this would help with code maintenance. If we have a dozen of bots dedicated to each Wikt, and each one of them maintained by someone else, we will fall into the same issues as we did when we took over Lingua Libre's code. It's better to keep everything that's related to "add recordings on Wiktionaries" on the LLBot. Same repository. Same maintainer(s). No duplication issues. And easily recoverable. --Poslovitch (talk) 12:13, 1 March 2021 (UTC)
All bot-related conversations could be gathered on LinguaLibre:Bot(s). Various bots can co-exist, preferably with distinct tasks indeed. Yug (talk) 17:04, 1 March 2021 (UTC)

Stabilization, Communication & outreach

Draft

This section is a work in progress.

Remove duplicates bug

Hi, I already mentioned that the Remove duplicates feature has a bug: some records are not removed. For example, I recorded aprilo four times. And I just found out why: just before recording this word for the fourth time, I checked whether it existed. There was no Q-element with this name, and the Blazegraph request didn't find anything. I recorded aprilo. And when checking again, the item can now be found. But it was not created today, only updated. So I checked why: check the diff. The record wizard is case sensitive, but somehow, if an element already existed with a different case, this element is updated instead of a new one being created. Lepticed7 (talk) 11:27, 28 February 2021 (UTC)
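To make the mismatch described above easier to picture, here is a toy model (not the RecordWizard's code, and with invented function names): the duplicate check compares labels exactly, while the write path matches labels ignoring case, so a word can pass the check and still land on an existing item.

```python
def is_already_recorded(store, word):
    """Duplicate check that compares labels exactly (case sensitive)."""
    return word in store


def save_recording(store, word):
    """Write path that matches an existing label ignoring case.

    Returns the label of the item that was updated or created.
    """
    for label in store:
        if label.lower() == word.lower():
            store[label] += 1  # existing item is updated, no new item
            return label
    store[word] = 1
    return word


# Hypothetical state: an item already exists with a capitalised label.
store = {"Aprilo": 3}
print(is_already_recorded(store, "aprilo"))  # the check misses it
print(save_recording(store, "aprilo"))       # but the save hits the old item
print(store)
```

In this model, making both paths use the same comparison (either both exact or both case-insensitive) removes the inconsistency; which of the two is right for Lingua Libre is a separate decision.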

FYI, the ticket tracking this issue is T267876. Pamputt (talk) 13:11, 28 February 2021 (UTC)
Pamputt, in the Phabricator ticket, it's better to link to the page history of the source, like so. The discussion pages change too fast, while Phabricator tickets stay for a long time. I plan to update those links on Phabricator as I bump into them ;) Yug (talk) 11:29, 1 March 2021 (UTC)

List size ?

How do we handle very large lists ? Some sources we are starting to use have 50,000 or 100,000 items. How large a load can the Record Wizard handle ? In terms of UX and human well-being, shouldn't we limit recording sessions to max=1000 or max=2000 ? I worry that loading and displaying 2000+ or 10,000+ items creates an annoying "infinite scroll", "you are drowning under water" effect, whereas a smaller list of max=1000 items carries a positive "you can do it boss" dimension, which plays fully on gamification and "current level+1" motivational concepts. In the same way that the Record Wizard avoids already recorded words, it could be wise to limit how many words we load. Yug (talk) 13:43, 28 February 2021 (UTC)

  • It handles them badly. I have got a few lists with 50,000 or more words now. It is not only the infinite scroll: the Record Wizard also hangs if a list has more than 5,000 or so words (of course this also depends on one's computer RAM, etc.), and I won't even try to load a 50,000 or 60,000-word list in the RecordWizard. "max=1000" is an option for sure. What I'd be really happy to see is "&from" (example) and similar filters. Think of a large list of 60,000 words, alphabetically sorted: if we could limit loading to only those words starting or ending with specific letters, that would make things easier (however, this loading may also be slow). Another option is to employ Olafbot and keep updating such lists, which is what Olafbot is doing at this moment. Regards. --টিটো দত্ত (Titodutta) (কথা) 18:09, 28 February 2021 (UTC)
Others and myself frequently split lists into smaller chunks and pages, but it's not ideal on the maintenance side. I wonder if adding sections to the list couldn't be a solution: we pick a list, then, if it has sections, we see them all and pick one. It would bring interesting benefits (editability). I don't have an easy implementation in mind, but we clearly have an aspect to improve. As of now it seems wise to keep lists at max=5000 ?
In any case, the question of list maintenance, curation and merging is growing. Yug (talk) 18:46, 28 February 2021 (UTC)
Ok, I'm gonna trim my generated lists to 1000 entries, if there is a problem (technical or psychological) with larger chunks. Olaf (talk) 22:08, 28 February 2021 (UTC)
  • Yes, every night (Warsaw time). 1000 may be too small, sometimes people make more in 24 hours. 1500? BTW, I'm thinking about adding new regularly updated lists. For example, the frequency lists with the recorded audio removed probably also could be useful. In Polish, I personally use lists based on the number of internal links in Polish Wiktionary, but it works only for this language - in Polish Wiktionary we link every word in a definition or an example to its lemma, so there's quite a large corpus of words. In Polish, and probably in a few other languages with simple phonology, I can create lists without /r/ for people with rhotacism (like at least three Polish speakers of pronunciation in Commons, including myself). If you have any further ideas, how to enhance Olafbot's activity, please let me know. Olaf (talk) 00:18, 1 March 2021 (UTC)
  • @Olaf hi. On my side I approach it this way : 1000 = 1h+ for moderately experienced users. I would therefore put 2000 words / day as a maximum, a solidly ambitious yet still healthy size (happily doable when motivated). Given your list is updated daily, that's the max I recommend. But 1500 or 1000 daily is more gentle and likely a healthier choice if we want to encourage long-run contributions. Our experienced contributors may come to like this habit of 1000 words / 45~60 minutes of daily recording. You see the idea.
    For lists that are not updated daily, we can go to max 5000, with the understanding and assumption that it suits a 3-day recording sprint followed by a resting period.
    My personal observation is that >1000 words/day becomes cognitively exhausting. Generally speaking, we can also see from Special:RecentChanges and on-the-ground practices that the 1,000+ threshold is largely avoided by our existing users. Most sessions are between 30 and 400 recordings (between 2 and 30 minutes of active recording). There is crowd wisdom in that ! Yug (talk) 11:19, 1 March 2021 (UTC)
  • Ok, I limited the generated lists to 1000 - they will be trimmed this night. I think Yug is right, and rarely anybody will need more, and the lists shouldn't overwhelm people. Olaf (talk) 22:58, 1 March 2021 (UTC)
@Olaf & Yug That's a good decision. We should never load too many elements in the Record Wizard. @Yug, I saw that some of your lists are still 5000 elements-long, should we also trim them?
I tried recording a part of List:Fra/Lemmas-without-audio-sorted-by-number-of-wiktionaries back when it was still 5000 words-long, and almost all of my 300 audios were corrupted (I had to dump a lot of them before uploading to Commons), probably due to my browser's memory overloaded by the long list (I have a pretty strong internet connection and used a PC with 16GB of RAM).
In relation to this topic, I opened a Feature request on Phabricator (T276014), to be able to load only a part of a list in the record wizard. — WikiLucas (🖋️) 02:07, 9 March 2021 (UTC)
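The "max=1000" and "&from" ideas from this thread can be sketched as a small filter applied before a list is handed to the recorder. This is only an illustration in Python, not RecordWizard code; the parameter names `start_from`, `prefix` and `max_items` are hypothetical, and `start_from` is assumed to mean "skip every word that sorts before this string".

```python
def select_chunk(words, start_from=None, prefix=None, max_items=1000):
    """Return at most max_items words, optionally filtered.

    words: the full list, in its stored order.
    start_from: keep only words that sort at or after this string.
    prefix: keep only words beginning with this string.
    """
    selected = words
    if prefix is not None:
        selected = [w for w in selected if w.startswith(prefix)]
    if start_from is not None:
        selected = [w for w in selected if w >= start_from]
    return selected[:max_items]


full_list = ["abelo", "akvo", "aprilo", "bona", "domo"]
print(select_chunk(full_list, start_from="ak", max_items=2))
print(select_chunk(full_list, prefix="a"))
```

Filtering on the server or in a bot, before the page is loaded, would avoid sending tens of thousands of items to the browser at all, which is the part that seemed to cause the hangs reported above.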

Exclusion lists?

There could be another problem with the generated lists - although the lists are usually self-correcting (any typo would have to appear on many wiktionaries at the same time to make it to the top), at least for less popular languages the lists contain errors. The errors are going to accumulate at the beginning of each list because all the good words around them from the top are eventually recorded, and the errors persist. I had this problem in lists generated for Polish Wiktionary, and the solution was to allow users to remove the errors from the list. Each generated list on pl-wikt (example) has its own "exclusion list" (example). Words added by users to the exclusion list are automatically removed by the bot from the main list every time it is updated. The system works fine on pl-wikt because it is described in the header of each list, and people got used to it, but I have no idea how to do it in Lingua Libre where no explanatory comments can be included in a list. Or maybe the bot could monitor any deletion of words from the lists done by users and maintain the exclusion lists on his own? But still, people would have to know there is such a possibility and care about it. Or maybe it's too early to think about it? What do you think? Olaf (talk) 12:04, 1 March 2021 (UTC)
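The two ideas in this post can be sketched as follows. This is a minimal illustration, not the code running on pl-wikt: the first function removes excluded words when a list is regenerated, the second guesses which words users deleted by comparing the bot's last published version with the current page. Function names and input shapes are assumptions.

```python
def apply_exclusions(generated, exclusions):
    """Drop every word found in the exclusion list, keeping the order."""
    excluded = {w.strip() for w in exclusions if w.strip()}
    return [w for w in generated if w.strip() not in excluded]


def detect_user_removals(last_published, current_page):
    """Words the bot published that are no longer on the page.

    Such words were presumably deleted by a user (assumption: nobody
    else edits the list between two bot runs), so they are candidates
    for the exclusion list. Words that disappeared because they were
    recorded would have to be filtered out separately.
    """
    still_there = set(current_page)
    return [w for w in last_published if w not in still_there]


exclusions = ["teh"]
fresh = ["teh", "kato", "hundo"]
print(apply_exclusions(fresh, exclusions))
print(detect_user_removals(["teh", "kato", "hundo"], ["kato", "hundo"]))
```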

Is there any way to put a comment on the list, visible to the user? Except for its title of course... Olaf (talk) 12:18, 1 March 2021 (UTC)
IIRC, the "noinclude" tag is not ignored by the RecordWizard (ie. it shows as a "word" in the RW). So I guess there's no way to add a comment. That'd be an interesting feature request to be honest. To have something like a "description" of the list in the RecordWizard, something like what we can see on Wikidata. Let's discuss that, and maybe open a Phab ticket too. --Poslovitch (talk) 12:21, 1 March 2021 (UTC)
I don't know where, but I think we have a feature request on RecordWizard for more advanced list loading, including ignoring (License|warning|info) templates and/or <noinclude></noinclude> tags. (@Pamputt ) I also mentioned in the section above the possibility of wiki sections within list pages. We lack an active VueJS developer to maintain the RecordWizard and safely integrate feature requests into it. Setting up a team of 1~2 VueJS devs should be part of our external outreach objectives. Yug (talk) 17:00, 1 March 2021 (UTC)
I can be one of these VueJS devs. But the thing I'm lacking is a test environment. And, there's also a thing that should be noted: the RecordWizard is developed in VueJS 2.x. It's not compatible with any of the Vue's development tools: Vue Devtools does not even "see" that the RecordWizard uses Vue! --Poslovitch (talk) 18:59, 1 March 2021 (UTC)

Keyboard control

Can we control the Record Wizard with a keyboard? My mouse makes rather loud sound when clicked, and the first word is always replaced with this sound. I have to add a dummy word at the beginning of the list. I would like to use a keyboard instead. It would be even more interesting if I could repeat recording of the last word with one key pressed. Olaf (talk) 23:11, 2 March 2021 (UTC)

Yes, during the recording, you can use the keyboard. You can click on the keyboard icon (at the top right) to see all the shortcuts. That said, we should improve Help:RecordWizard_manual#Keyboard_shortcuts because it is not complete. Pamputt (talk) 06:13, 3 March 2021 (UTC)
Thank you very much. It's very helpful. I only wonder why this help dialog in the Record Wizard is always displayed in French, and why I can't find it in TranslateWiki? Olaf (talk) 01:47, 4 March 2021 (UTC)
I did not know it was only displayed in French. I added these strings to the list of strings that need to be translatable. Pamputt (talk) 06:30, 4 March 2021 (UTC)
@Pamputt Support. For example, we would need keyboard shortcuts to unselect or re-select elements while listening to the recordings. Also, I think using arrows to go from one element to another doesn't work for this step of the Record Wizard (it only works in the recording step). — WikiLucas (🖋️) 01:56, 9 March 2021 (UTC)

Properties for languages : import on lili Qitem or fetch from wikidata ?

Hello all, hello VIGNERON. As I progress toward importing the UNILEX lists, I'm examining the languages we have and how many audios we have for each. For each language I would like to see the size of its native speaker population. By coupling our current reach (the list of our languages and the number of audios) with population sizes, we will be able to see how biased we are toward major vs minor languages. There are some rare languages in the list. So maybe, maybe, we are actually good on this side. Or maybe not. But we need to have some visibility on this.

As for the implementation, let's take Cantonese (Q9186)'s "number of speakers" (Property:P1098) = 72,893,210. Is there a way in Query:Viz to make a Sparql query fetching most of the data from LinguaLibre (as now) but also one value from Wikidata ? Yug (talk) 14:55, 4 March 2021 (UTC)
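One possible shape for such a query, sketched here without having been tested against the LinguaLibre endpoint, is a SPARQL 1.1 federated query: the outer part runs locally and a SERVICE block asks Wikidata for P1098. The Python helper below only assembles the query text. The local property IRI passed in is a placeholder (the real LinguaLibre property that stores a language's Wikidata ID is not given in this thread), and whether the LinguaLibre query service allows outgoing federation is an assumption.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
NUMBER_OF_SPEAKERS = "http://www.wikidata.org/prop/direct/P1098"


def build_speakers_query(local_wikidata_id_property):
    """Assemble a federated query text.

    local_wikidata_id_property: full IRI of the local property holding
        the Wikidata Q-id of a language item (placeholder, to be
        replaced with the real LinguaLibre property).
    """
    lines = [
        "SELECT ?language ?wikidataId ?speakers WHERE {",
        "  ?language <%s> ?wikidataId ." % local_wikidata_id_property,
        '  BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?wikidataId)) AS ?wd)',
        "  SERVICE <%s> {" % WIKIDATA_ENDPOINT,
        "    ?wd <%s> ?speakers ." % NUMBER_OF_SPEAKERS,
        "  }",
        "}",
    ]
    return "\n".join(lines)


# Placeholder IRI, for illustration only.
query = build_speakers_query("https://example.org/prop/direct/WIKIDATA_ID")
print(query)
```

If federation turns out not to be available, the alternative raised in the section title (importing the value onto the local language items) would avoid the cross-endpoint call entirely.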

Natural Language ToolKit (nltk)

Greetings all,
Do we have anyone with knowledge of NLTK and its ecosystem ? I'm in communication with Google/corpuscrawler and Unicode-org/UNILEX, examining their data, and I still find notable holes. Esperanto epo, Catalan cat, Afrikaans afr, Korean kor and others have no corpora, and therefore no wordlist. (I suspect UNILEX was a complementary project exploring the en:long tail, so some major languages were not cared for. According to the code, they also crawled only a few websites via human-defined targets. They identified a few language-specific words, e.g. "currently" and "first", googled them (maybe +wordpress), and thereby identified targets matching their crawlers.) I'm making a new review of online corpora, mainly using OPUS.nlpl.eu (en:Corpus Linguistic research center) as my entry point. They specialize in parallel corpora but also provide their monolingual corpora in raw text and tokenized formats. I noticed their data for Wikipedias (/wikipedia.php: 20 languages), Wikipedia Content Translations (/wikimedia.php: 288), Tatoeba (/tatoeba.php: 359), TED (/TED2020.php: 108), Bible (/bible-uedin.php: 102). The more I dive into the Wikipedia corpus, the more uncomfortable I am with it: it's really noisy, and the clean-up is hard. But what are the other options ? Do the NLTK community and professionals have built-in, maintained corpora per language ? Do we know what they are doing on their side for corpora and wordlists ? Do you have any NLTK contacts to share these questions with ? Yug (talk) 11:32, 5 March 2021 (UTC)

Ok, this may be a bit too ahead of current challenge. Yug (talk) 09:15, 7 March 2021 (UTC)
@Yug As far as I know, it seems that NLTK is mostly used for learning purposes, and that there is no strong professional community around it. — WikiLucas (🖋️) 01:53, 9 March 2021 (UTC)
Thanks WikiLucas00, too bad. But that's interesting to know. Yug (talk) 10:58, 9 March 2021 (UTC)

Bug in Record Wizard

Closed. See T276724.

I found a bug in the Record Wizard:

  1. Click "Local list"
  2. Remove all text from the "Title" field
  3. Click "Done"

Now, there is no chance to exit this dialog window forever and ever. You can only reload the page or close the browser window. Olaf (talk) 20:58, 7 March 2021 (UTC)

This is definitely a bug. I remember I already experienced it and I thought I had opened a Phabricator ticket, but I cannot find it, so I probably did not create it... Done, see T276724. Pamputt (talk) 21:27, 7 March 2021 (UTC)
I found it back, T266921. I will close T276724 as duplicate. Pamputt (talk) 21:35, 7 March 2021 (UTC)
Will be easy to fix. There is already a built-in error message when you type an erroneous pagename. Something to the effect of if inputString==null then return error. Yug (talk) 21:46, 7 March 2021 (UTC)
But we would have no way to test the fix. So until the fix can be tested, it must not be merged in the code. --Poslovitch (talk) 22:36, 7 March 2021 (UTC)
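The guard suggested above amounts to rejecting an empty title before the dialog acts on it. A minimal sketch of that check follows, written in Python only for illustration (the RecordWizard itself is JavaScript, and the function name and message are invented); it also treats a whitespace-only title as empty, which goes slightly beyond the `inputString==null` wording.

```python
def validate_list_title(title):
    """Return an error message for an unusable title, or None if it is fine."""
    if title is None or not title.strip():
        return "Please enter a list title."
    return None


for candidate in (None, "", "   ", "My local list"):
    print(repr(candidate), "->", validate_list_title(candidate))
```

A check of this kind is also easy to unit-test in isolation, which may help with the concern that the fix cannot currently be tested on a full environment.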

Ratelimit update

See: phab:T260649 « Ratelimit : improve handling of upload ratelimits via local JS », phab:T276992 « Ratelimit : Lingualibre-Commons require better integration via whitelisting or else. »

Greetings,
A short summary of the past weeks spent monitoring the ratelimit bug and its impact. Since I noticed our users were indeed bumping into this 380/h wall, I have been monitoring the recent changes daily and investigating upload patterns. Most users right now (Marathi!) actually keep their sessions under 380 and don't reach their ratelimit. I would say 80% of our actively recording users don't reach their ratelimit despite 5+ medium sessions per month. Their activity ranges between 20 and 200+ audios per session. Then there are a few more ambitious users, who display a "growth" pattern: their sessions increase, get to ~350, then bump into the wall. I tried my best to catch those and get them more user rights on Commons as they neared the limit. Mostly succeeded. A few jumped up and bumped into the ratelimit before I could mentor them for user rights. Last but not least, I did some light Catalan outreach this week. One of the users who answered the call and came over jumped straight into recording Olaf's cat 1,000-word list, as I recommended in my call. Impressive and exactly what we want. I noticed his 380 uploads (« 380 = Warning! User bumped into ratelimit! ») and therefore made contact. This user has since confirmed to me that they lost 620 of their recordings. Very, very embarrassing for us and for me, who called this user over.

We handled this loss as elegantly and supportively as possible, and thanks to this user's goodwill, it went ok. But it gives a sense of the impact this ratelimit can have, and of why hand monitoring of uploads followed by proactive Commons user rights requests is direly necessary.

On the solution side, we have a medium-strategy, short-term solution: create a warning ribbon showing the user's ratelimit (if = 380), their current edit count on Commons, and a message; see T260649. We could also explore a high-strategy, long-term solution: asking Commons to be more tolerant with our users' uploads. Yug (talk) 12:56, 9 March 2021 (UTC)
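The decision behind such a warning ribbon can be sketched in a few lines. This is an illustration in Python rather than the local JS proposed in T260649, and it assumes the caller already knows how many uploads the user has made in the current hour; the function name and message wording are invented.

```python
UPLOADS_PER_HOUR = 380  # limit quoted in this thread for affected users


def ratelimit_warning(planned, uploaded_this_hour=0, limit=UPLOADS_PER_HOUR):
    """Return a warning text if the planned uploads exceed what is left, else None."""
    remaining = max(limit - uploaded_this_hour, 0)
    if planned <= remaining:
        return None
    return (
        "Warning: you plan to upload %d recordings but only %d fit under the "
        "%d/hour limit; %d may be lost."
        % (planned, remaining, limit, planned - remaining)
    )


print(ratelimit_warning(200))
print(ratelimit_warning(1000))
```

With the figures from the incident above (a 1,000-word list and a 380 limit), the second call reports 620 recordings at risk, which matches the loss the user described.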

I am starting to think the lists should be limited to 380. It looks like a trap when we invite people to contribute, give them a list of words, and then throw away most of their work. The warning ribbon is the minimum. I could write it as a gadget myself, but I can't test it (see the section below). :-/ Olaf (talk) 13:56, 9 March 2021 (UTC)
Gadget is pure js (ex: MediaWiki:Gadget-RecentNonAudio.js), you test it in your browser. Could work ! Yug (talk) 13:57, 9 March 2021 (UTC)
I know, I used to write gadgets in other wikis. But here I can't edit the MediaWiki namespace, and I can't even use my personal Common.js. So where am I supposed to put the code of the gadget and how can I test it? Ok, perhaps I can develop it somewhere else and present it here when it's ready, but the other wikis have no Record Wizard and they look different. I believe, there are too many restrictions here. I can't understand why anybody would like to block using personal common.js. Olaf (talk) 14:20, 9 March 2021 (UTC)
Oooh...... You know what ? I used site-wide common.js and my admin right to dev MediaWiki:Gadget-RecentNonAudio.js if I remember well. Also, this week, we granted temporary admin rights to Eihel due to specific technical need. If you want to attack this you can ask temporary admin rights on the Admin board. Part of the code could be later (late Spring/Summer) fully integrated to the Record wizard. Yug (talk) 14:29, 9 March 2021 (UTC)
Ok, I applied for the rights. Olaf (talk) 15:30, 9 March 2021 (UTC)
  • @Yug, Olaf, & VIGNERON Do you think that it would be conceivable to implement a new user right on Commons, dedicated to (every) Lingua Libre users? This way, users could use their ratelimit exemption on Commons only when using Lingua Libre, and not from elsewhere. Which other long-term solution could we bring for when we decide to start a discussion with Commons community? — WikiLucas (🖋️) 18:38, 9 March 2021 (UTC)
If the right were to be added automatically, it would require MediaWiki extension, I'm afraid. And politically, it might be hard to swallow for Commons admins. When I got the autoconfirmed right to break the 380 recordings barrier, I was reminded "not to go too fast", and remember I'm not a bot. Whatever that means. :-) Olaf (talk) 18:50, 9 March 2021 (UTC)
@WikiLucas00 I'm skeptical on a long term fix.
  • LinguaLibre:User_rights#Commons_ratelimits_in_code shows that 380 is a hard-coded limit, together with other ratelimits.
  • Users are uploading audio files via their web browser and Commons.wikimedia.org accounts. I'm not sure we have any tag in the request.
  • Someone talked about an "app whitelist", but again, this may be different than our current situation. See the point above.
So I'm very skeptical about, or at least unaware of, an already built-in solution at this point. We may, though, find the right place to raise this issue on Phabricator, then ask for a solution. Yug (talk) 21:25, 9 March 2021 (UTC)
  • @Olaf I think that limiting the lists to 380 or 400 (assuming that we lose/the user discards 20 elements), and updating them more frequently (twice a day for example) is a good idea. — WikiLucas (🖋️) 18:38, 9 March 2021 (UTC)
Ok, I will change it today or tomorrow. Olaf (talk) 18:49, 9 March 2021 (UTC)
Trimmed to 380. Now I'm testing updating the lists every three hours, concurrently with other bot activities. I hope there will be enough memory available for two heavy bot processes. Olaf (talk) 22:35, 9 March 2021 (UTC)

I requested user rights for SangeetaRH today (she bumped into the ratelimit in the past days) and will request Unjuan's new user rights as soon as 2000 contributions are reached. Yug (talk) 21:36, 9 March 2021 (UTC)

NB: while looking for something else, I stumbled upon the task phab:T110249, which is old and may not be solved soon but, if solved, may be relevant for this situation. Cheers, VIGNERON (talk) 12:45, 23 April 2021 (UTC)

Generators

I wanted to see what the generator is. I copied the demo generator as described on Help:Create a new generator, reloaded the page (clearing the cache), logged out, logged in, and still can't see any change in the Record Wizard. How can I start the generator? Olaf (talk) 11:17, 9 March 2021 (UTC)

Apparently, it's not allowed: [2]. This help page doesn't make any sense. Olaf (talk) 13:23, 9 March 2021 (UTC)
Just to understand why it does not work and why at the same time this help page exists: the help page was written in October 2018. In June 2020, Lingua Libre was largely revamped, and I think common.js was disabled at that time. So it may have worked in 2018 but not anymore in 2020. Let us hope it comes back soon. Or maybe it never worked here :( Pamputt (talk) 18:56, 9 March 2021 (UTC)

Documentation update?

We are progressing on so many sides these days. But at some point we may gain from turning back and doing some "old pages" clean-up. Help:Lists (3~4 pages) needs care and merging. Help:Create a new generator too. How many others ? Some recent pages and forums (ex: Lingualibre:Grants, LinguaLibre:Workshops, LinguaLibre:Jargon, Technical board, Bot) and even the main forum need some love. Meanwhile I will have to reduce my activity in the coming months due to research deadlines and other grant requests to write, including for Lingualibre. On the bright side, that will reduce the flow of new sections ^^. One small task we can do as of now is to tag the dead forum sections as "closed". We need fresh users and fresh eyes, to be honest. Yug (talk) 19:19, 9 March 2021 (UTC)

Help:Create a new generator is ok. We just need to add that currently only sysops can do it. All the rest remains valid. Pamputt (talk) 19:27, 9 March 2021 (UTC)
Now I have the admin rights, I have the demo generator copied to my common.js and still it doesn't work for me (or I don't know how to start it). Olaf (talk) 22:43, 9 March 2021 (UTC)
A quick technical note: when updating the generator "External Tools" for MediaWiki 1.35, it worked partially as a gadget: the button was shown but the window did not open, except when using the debug mode of the ResourceLoader (by adding the parameter "?debug=true" to the URL). It worked fully when I copied the code onto the server in the RecordWizard extension next to the other generators (currently only on the server; I will soon open Pull Requests on GitHub). This is probably due to side effects in the loading order. Seb35 (talk) 10:57, 23 April 2021 (UTC)
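For anyone who wants to retry a gadget in ResourceLoader debug mode, the "debug=true" parameter has to be appended with "?" or "&" depending on whether the URL already has a query string. A small Python helper shows the idea; the page URLs in the example are assumptions used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug_mode(url):
    """Return the same URL with debug=true set exactly once in its query string."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "debug"
    ]
    query.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example URLs are hypothetical.
print(with_debug_mode("https://lingualibre.org/wiki/Special:RecordWizard"))
print(with_debug_mode("https://lingualibre.org/index.php?title=Main&debug=false"))
```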

Bot login

Something changed, and page [3] doesn't exist anymore. This page exists on any other wiki (example) and Olafbot used to log in in this way before the fire - AFAIK only in this way bots can log with bot passwords obtained on Special:BotPasswords. So currently Olafbot's list generation is disabled. Could you switch this feature on again? Olaf (talk) 08:32, 23 April 2021 (UTC)

Addendum: The whole API seems to be gone. Olaf (talk) 08:33, 23 April 2021 (UTC)
Hi Olaf, I have no idea how to solve this issue, but I have created a ticket on Phabricator to track it. Pamputt (talk) 09:22, 23 April 2021 (UTC)
Ok, I found the problem. The script path changed: before the fire, and on all other MediaWiki projects, the script path is /w (for example lingualibre.org/w/api.php). Now it's lingualibre.org/api.php. Compare the entry points in Special:Version and fr:Special:Version. The bot is now working, problem solved, but I believe the change may affect some other scripts. Olaf (talk) 10:52, 23 April 2021 (UTC)

@Olaf oh, thanks for letting us know. This is strange indeed: some pages were using the /w/ path and most were using the / path, I'm not sure why or how. It created some issues and broke some tools, so all pages should now be under / to avoid inconsistencies. Ping @Seb35 who could probably tell us more about it. Cheers, VIGNERON (talk) 11:24, 23 April 2021 (UTC)

Yes, I just checked in the archived Special:Version from before the fire that the entry points didn't change. There must have been a redirect from lingualibre.org/w to lingualibre.org, which is not in place now. Olaf (talk) 11:28, 23 April 2021 (UTC)
Before the fire, I was using / successfully. Yug (talk) 21:03, 25 April 2021 (UTC)
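Since the script path may be either /w or / depending on the server configuration, scripts could try both entry points instead of hard-coding one. Below is a minimal sketch of that idea in Python. The function names and the `is_reachable` callback are hypothetical; a real bot would pass a probe that performs an actual HTTP request, while the demo below uses a fake probe so that it runs offline.

```python
from typing import Callable, Iterable, List, Optional


def candidate_api_urls(host: str, script_paths: Iterable[str] = ("/w", "")) -> List[str]:
    """Build api.php URLs for each candidate script path, in order of preference."""
    return [f"https://{host}{path}/api.php" for path in script_paths]


def pick_api_url(
    host: str,
    is_reachable: Callable[[str], bool],
    script_paths: Iterable[str] = ("/w", ""),
) -> Optional[str]:
    """Return the first candidate URL accepted by `is_reachable`, or None."""
    for url in candidate_api_urls(host, script_paths):
        if is_reachable(url):
            return url
    return None


# Fake probe (assumption for the demo): only the root-level entry point answers,
# which mimics the situation described in this thread.
def fake_probe(url: str) -> bool:
    return url == "https://lingualibre.org/api.php"


print(pick_api_url("lingualibre.org", fake_probe))
```

With this approach a bot keeps working whether or not a redirect from /w is restored, at the cost of one extra request when the first candidate fails.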

Translating

The Translate extension is not configured properly. When I click the "Translate this page" link on the main page, it's not working as expected. Olaf (talk) 11:35, 23 April 2021 (UTC)

Yes, I experience the same issue. I have opened a ticket on Phabricator. Pamputt (talk) 11:41, 23 April 2021 (UTC)

CSS & Indentation

It seems that using ":" to indent messages does not work anymore. It is probably a CSS problem somewhere. If someone knows how to fix it, please do not hesitate. Pamputt (talk) 11:45, 23 April 2021 (UTC)

It appears images suffer from the same issue. See for example the image at the top of Help:Configure_your_microphone, which has the « right » parameter but is placed on the left. Pamputt (talk) 16:22, 24 April 2021 (UTC)
@VIGNERON: 0x010C had neutralized and removed some native MediaWiki CSS, and added some custom CSS.
His CSS selectors were designed for the old MediaWiki HTML structure and classes. The selectors may no longer work now that new structures and classes are in place.
When inspecting the code, the dl elements are still correctly nested.
VIGNERON, is this within WikiValley's mission scope? Yug (talk) 21:07, 25 April 2021 (UTC)

Click on $1 below, then read the word aloud

In the Record Wizard, at the Studio step, this sentence is displayed ("Click on $1 below, then read the word aloud"). I guess "$1" should be replaced by something, but I do not remember what. Anyway, something is broken here. I have opened a Phabricator ticket. Pamputt (talk) 16:32, 24 April 2021 (UTC)
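For context, MediaWiki interface messages use positional placeholders ($1, $2, …) that are filled in when the message is rendered. A raw "$1" on screen usually means the message was displayed without its parameter. The following Python sketch only illustrates that mechanism in a simplified form; it is not MediaWiki's implementation, and the replacement value "the record button" is a made-up example, since the thread does not say what $1 stands for.

```python
import re


def format_message(template: str, *params: str) -> str:
    """Replace $1, $2, ... with the matching positional parameter.

    Placeholders without a matching parameter are left untouched,
    which reproduces the symptom reported above.
    """
    def replace(match: "re.Match[str]") -> str:
        index = int(match.group(1)) - 1  # $1 maps to params[0]
        if 0 <= index < len(params):
            return params[index]
        return match.group(0)

    return re.sub(r"\$(\d+)", replace, template)


message = "Click on $1 below, then read the word aloud"

# Parameter missing: the placeholder stays visible.
print(format_message(message))
# Parameter supplied (hypothetical value).
print(format_message(message, "the record button"))
```

The first call prints the message with "$1" intact, the second prints it with the placeholder replaced.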