LinguaLibre
Technical board
Revision as of 22:42, 16 February 2021 by Poslovitch (talk | contribs) (→Using Magic Word {#language:} and Extension:CLDR (?))
- Local developments are easy. You can customize your css and your js, including creating a local WikiJS script, even with limited edit rights.
- LinguaLibre Bot (Python, github) is a high-impact project. Help is needed to authorize it on more wikis.
- Join us on Phabricator and GitHub.
- Developers: we especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers, issue coordinators, but everyone is welcome.
- Project coordinators: we also look for organizers of recording/hacking meet-ups, who are able to build a network with language learning, language conservation and NLP actors.
- Please announce your hacking project here to raise awareness and gather feedback.
- Most of our actions remain small in scope and volunteer-based. In case your project is large enough, you could learn about some of the funding options.
- March 3rd, 2021: Wikidata Lexemes & Lingua Libre coordination assessment
- February 19th, 2021: First progress report with WikiValley and VIGNERON
- January 25th, 2023: the latest GitHub revision has been pushed to the production server. Kurdish Wiktionary is now supported. Oriya Wiktionary will follow very soon. Support for more Wiktionary versions should follow.
Please visit LinguaLibre:About to learn more about the project.
Migration of technical contents
Hello all, Please help migrate technical contents from the main LinguaLibre:Chat room to here. Yug (talk) 18:49, 12 February 2021 (UTC)
2021 Github refreshing : call for volunteers and discussion
- See also Github.com/lingua-libre
Hello all,
Since November 2020 there has been an ongoing effort to clean up, document and fix the 11 GitHub repositories upon which LinguaLibre.org stands. A summary is available on the main forum and will be migrated here shortly. This section will focus on gathering users with development skills and discussing possible fields of action (repositories). We especially look for Bot Masters (Python, NodeJS), SPARQL experts, VueJS developers, issue coordinators. Yug (talk) 15:57, 12 February 2021 (UTC)
Early 2021 codings : Wikivalley & volunteers communication board !
WikiValley has been selected to make a notable technical push on the LinguaLibre Suite where volunteer developers are not enough. They will coordinate with volunteer developers in order to smooth everyone's work and avoid duplicate efforts and git conflicts. The Start, End, and Repository columns below are especially important: please keep them up to date, respect them, or change them whenever required. If you need to work on a repository that someone is already working on, contact the developer listed there and organize as needed. Our objective here is to keep things clear and to progress smoothly. Please avoid emails and prefer communicating here within subsections so we can all be aware of how things are going. Yug (talk) 15:56, 12 February 2021 (UTC)
- Note: Volunteers started working around December, WikiValley around Feb. 11th. Yug (talk) 15:56, 12 February 2021 (UTC)
Start | End | Contacts/dev | Team | Repository | Advancement & result so far |
---|---|---|---|---|---|
Past developments | | | | | |
2021/02/01 | 2021/02/10 | Yug | Volunteers | SignIt | Get back control (access right) ; fix video query ; test locally ; publish new version on Mozilla store → Fixed Firefox extension |
2021/02/01 | 2021/02/16? | Yug, Michael | Volunteers, WM-France | /operations, /CommonDownloadTool | Explore possible breakpoints ; identify likely cause ; fix ; deploy ; run → Fixed https://lingualibre.org/datasets/ |
2021/02/? | 2021/02/11 | Vigneron, WikiValley | Wikivalley | QueryViz, Other? | Explore possible breakpoints ; identify cause ; fix ; deploy ; inquire on numbers differences → Fixed LinguaLibre:Stats |
Current developments | | | | | |
2021/02/01 | 2022/01/01 | Poslovitch | Volunteers | Lingua-Libre-Bot | Maintain, update and operate the bot. 2021 Q1 [WIP]: Refactor the bot to ease implementations of additional Wiktionaries. |
Planned developments | | | | | |
2021/02/01 | 2021/03/?? | Poslovitch | Volunteers | /operations, /CommonDownloadTool | Project: Explore datasets scripts and queries. May require SPARQL assistance. |
2021/02/? | 2021/02/? | WikiLucas00, Yug | Volunteers | CustomSubtitle, BlueLL | Explore Subtitle's ribbon's bug ; identify cause. |
@VIGNERON please keep us informed a bit on what your team is touching. Just edit above and ping us to notify us of an update. Yug (talk) 16:25, 15 February 2021 (UTC)
User box ?
It may be cool to create a "dev" userbox {{Userbox-dev}}, on the model of {{Userbox-records}}, with Python, Javascript, PHP, VueJS, Wikimedia Bot as specific sub-categories ? Yug (talk) 15:59, 12 February 2021 (UTC)
- I kinda disagree with that. Lingua Libre is not meant to become a hub for techies. Sure, we need all the help we can get, yet the only use case I foresee for these userboxes would be in the event something goes bad and we need someone with the right skills to take care of it. But, since a userbox has to be added by oneself on one's user page, the same can be said of the page where we list who does what (I don't recall what it's called). Which one of these two "systems" should be kept? --Poslovitch (talk) 21:53, 13 February 2021 (UTC)
- In terms of community, I see us as somewhere in between the Wikipedia and Wikidata communities. We mainly deal with singletons: audio files, which are data units. People come, make more or less substantial recording contributions, then sharply reduce their involvement and leave thousands of file units here.
- And like Wikidata, we need people giving life to these data units. This is done via reuse, bots, webapps, text-to-speech. Developers' creations.
- So yes, developers have to become an important piece of our community. And we would gain from creating an active dynamic around languages and projects (repositories). Yug (talk) 22:30, 13 February 2021 (UTC)
Datasets has become super slow ?
I am trying to understand how the /datasets are generated.
- In April 2020, the French dataset of about 100,000 audios was processed in 51 minutes.
- In February 2021, the Bengali dataset of about 50,000 audios was processed in 18 hours.
What do I miss ? Yug (talk) 00:09, 13 February 2021 (UTC)
Zip file | Date | Size (bytes) |
---|---|---|
lingualibre_full.zip | 2019-May-17:01:18 | 1989664440 |
Q101-srr-Serer.zip | 2019-Nov-05:03:09 | 14967 |
Q113-cmn-Mandarin_Chinese.zip | 2019-Nov-05:03:09 | 112613 |
Q115107-bcl-Central_Bikol.zip | 2019-Nov-05:03:09 | 166323 |
Q127-tam-Tamil.zip | 2019-Nov-05:03:09 | 154352 |
Q130-zho-Chinese.zip | 2019-Nov-05:03:10 | 2724328 |
Q131-hye-Armenian.zip | 2019-Nov-05:03:10 | 824117 |
Q141-cym-Welsh.zip | 2019-Nov-05:03:10 | 12905993 |
Q154-amh-Amharic.zip | 2019-Nov-05:03:11 | 2653977 |
Q165-hat-Haitian_Creole.zip | 2019-Nov-05:03:11 | 233588 |
Q169-tgl-Tagalog.zip | 2019-Nov-05:03:11 | 77198 |
Q170137-mos-Mossi.zip | 2019-Nov-05:03:11 | 1158142 |
Q205-gre-Greek.zip | 2019-Nov-05:03:11 | 239390 |
Q231-myv-Erzya.zip | 2019-Nov-05:03:21 | 205878 |
Q242-fon-Fon.zip | 2019-Nov-05:03:21 | 1538614 |
Q258-nso-Northern_Sotho.zip | 2019-Nov-05:03:24 | 774299 |
Q311-oci-Occitan.zip | 2019-Nov-05:03:33 | 511332485 |
Q318-bam-Bambara.zip | 2019-Nov-05:03:33 | 277786 |
Q321-gaa-Ga.zip | 2019-Nov-05:03:33 | 3247380 |
Q336-ori-Odia.zip | 2019-Nov-05:03:34 | 38697693 |
Q339-sat-Santali.zip | 2019-Nov-05:03:34 | 128941 |
Q34-mar-Marathi.zip | 2019-Nov-05:03:34 | 2274397 |
Q35-nld-Dutch.zip | 2019-Nov-05:03:34 | 36279372 |
Q385-ita-Italian.zip | 2019-Nov-05:03:34 | 3440247 |
Q388-que-Quechua.zip | 2019-Nov-05:03:35 | 397476 |
Q39-tel-Telugu.zip | 2019-Nov-05:03:35 | 85571 |
Q397-heb-Hebrew.zip | 2019-Nov-05:03:35 | 1657223 |
Q405-bas-Basaa_language.zip | 2019-Nov-05:03:35 | 1515700 |
Q437-mal-Malayalam.zip | 2019-Nov-05:03:35 | 138601 |
Q446-pan-Punjabi.zip | 2019-Nov-05:03:35 | 11004 |
Q4465-mis-Teochew_dialect.zip | 2019-Nov-05:03:35 | 69734 |
Q45-nor-Norwegian.zip | 2019-Nov-05:03:35 | 431566 |
Q46-ltz-Luxembourgish.zip | 2019-Nov-05:03:35 | 1679618 |
Q51299-hav-Havu.zip | 2019-Nov-05:03:37 | 56823 |
Q51302-tay-Atayal.zip | 2019-Nov-05:03:37 | 65533 |
Q52067-bbj-Ghomala'_language.zip | 2019-Nov-05:03:37 | 1765823 |
Q52068-bum-Bulu_language.zip | 2019-Nov-05:03:37 | 1382789 |
Q52071-dua-Duala.zip | 2019-Nov-05:03:37 | 1206427 |
Q52073-bdu-Oroko.zip | 2019-Nov-05:03:37 | 1723960 |
Q52074-bzm-Londo.zip | 2019-Nov-05:03:37 | 1750380 |
Q52295-atj-Atikamekw.zip | 2019-Nov-05:03:37 | 7315215 |
Q74905-mis-Sursilvan.zip | 2019-Nov-05:03:37 | 14618 |
Q83641-gcf-Guadeloupean_Creole_French.zip | 2019-Nov-05:03:38 | 7412512 |
Q930-mis-Gascon_dialect.zip | 2019-Nov-05:03:39 | 179656450 |
Q931-mis-Languedocien_dialect.zip | 2019-Nov-05:03:40 | 191575650 |
Q123-hin-Hindi.zip | 2020-Apr-25:03:30 | 1704401 |
Q126-por-Portuguese.zip | 2020-Apr-25:03:31 | 43732966 |
Q129-rus-Russian.zip | 2020-Apr-25:03:32 | 60844464 |
Q150-afr-Afrikaans.zip | 2020-Apr-25:04:18 | 42363003 |
Q159-dyu-Dioula_language.zip | 2020-Apr-25:04:18 | 784432 |
Q19858-bci-Baoulé.zip | 2020-Apr-25:04:18 | 1268304 |
Q203-cat-Catalan.zip | 2020-Apr-25:04:18 | 9738365 |
Q204940-ken-Nyang_language.zip | 2020-Apr-25:04:18 | 483396 |
Q208-vie-Vietnamese.zip | 2020-Apr-25:04:18 | 8822067 |
Q219-ara-Arabic.zip | 2020-Apr-25:04:19 | 85373129 |
Q21-fra-French.zip | 2020-Apr-25:05:10 | 2112950650 |
Q221062-mis-Cantonese.zip | 2020-Apr-25:05:10 | 3895600 |
Q22-eng-English.zip | 2020-Apr-25:05:12 | 131688602 |
Q25-epo-Esperanto.zip | 2020-Apr-25:05:19 | 445662713 |
Q264201-ary-Moroccan_Arabic.zip | 2020-Apr-25:05:19 | 1371064 |
Q273-kab-Kabyle.zip | 2020-Apr-25:05:19 | 370876 |
Q298-pol-Polish.zip | 2020-Apr-25:05:21 | 145009958 |
Q299-eus-Basque.zip | 2020-Apr-25:05:21 | 46035866 |
Q33-fin-Finnish.zip | 2020-Apr-25:05:46 | 19473062 |
Q386-spa-Spanish.zip | 2020-Apr-25:05:46 | 28434220 |
Q389-jpn-Japanese.zip | 2020-Apr-25:05:46 | 145688 |
Q392-ces-Czech.zip | 2020-Apr-25:05:46 | 96844 |
Q44-swe-Swedish.zip | 2020-Apr-25:05:46 | 166237 |
Q4901-shy-Shawiya_language.zip | 2020-Apr-25:05:47 | 15804835 |
Q6714-arq-Algerian_Arabic.zip | 2020-Apr-25:05:47 | 3420182 |
Q80-kan-Kannada.zip | 2020-Apr-25:05:47 | 3662223 |
Q24-deu-German.zip | 2021-Feb-11:15:32 | 258363332 |
Q307-ben-Bengali.zip | 2021-Feb-12:07:28 | 1079637723 |
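To put the two runs quoted above side by side, here is a minimal sketch that converts them into audios processed per minute. It only uses the approximate figures from this thread (about 100,000 audios in 51 minutes, about 50,000 audios in 18 hours) and assumes each duration covers the whole processing of that dataset, which has not been confirmed from the logs.

```python
def audios_per_minute(audio_count: int, minutes: float) -> float:
    """Return the average number of audios processed per minute."""
    return audio_count / minutes


# Approximate figures quoted in this thread, not measured values.
french_rate = audios_per_minute(100_000, 51)        # April 2020 run
bengali_rate = audios_per_minute(50_000, 18 * 60)   # February 2021 run

# How many times slower the 2021 run is, per audio.
slowdown = french_rate / bengali_rate

print(f"French (April 2020):  {french_rate:.0f} audios/min")
print(f"Bengali (Feb. 2021):  {bengali_rate:.0f} audios/min")
print(f"Slowdown factor:      {slowdown:.1f}x")
```

With these figures the 2021 run handles about 42 times fewer audios per minute, so the gap is far larger than the difference in dataset size alone would explain.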
- IMO, this can only be investigated through the logs. Maybe the requests to Commons are taking a longer time than they used to? Maybe the datasets server is under higher load (thus slowing it)? We need you, Michaël! --Poslovitch (talk) 21:41, 13 February 2021 (UTC)
- @Poslovitch could it be that the script only uploads the "never uploaded yet" files ? If so, the April 2020 French dataset was just the 3000 recent French audios whereas the Feb 2021 Bengali dataset was like "Yo, there are the 50,000 Bengali audios, deal with it B)" Yug (talk) 22:24, 13 February 2021 (UTC)
- @Yug that might be it. I still don't fully understand what the script does and, well, we can say the documentation is clearly lacking there. I'm working on that too - but yeah, that might be why it's taking more time. We should let it run for a first time, and then force another dataset update a few days later so we can compare both. --Poslovitch (talk) 22:38, 13 February 2021 (UTC)
- @Michael Barbereau WMFr Seems the script has finished running, right ? Any idea why it's so slow in 2021, is there some known overload or hardware issue ? Yug (talk) 12:07, 15 February 2021 (UTC)
Using Magic Word {#language:} and Extension:CLDR (?)
- For general awareness. No real question asked.
I found out LL uses the MediaWiki Extension CLDR. Its data comes from the w:Common Locale Data Repository Project (CLDR), part of the Unicode Consortium. This extension automates the translation of ISO 639 codes into language names in a target language, e.g. {{#language:it|en}} → Italian. Coverage is ~500 names in ~166 languages. translatewiki.net has a tutorial on how to contribute to this CLDR website.
- mw:Help:Magic_words#Miscellaneous > {{#language:language code|target language code}}
- → {{#language:ar|en}} → Arabic
- → {{#language:ar|hi}} → अरबी
- → {{#language:ja|hi}} → जापानी
- → {{#language:fr|he}} → צרפתית
- → {{#language:fra|he}} → fra (not available)
- → {{#language:fr-ca|he}} → Canadian French (falls back on English)
- → {{#language:mar|hi}} → mar (not available)
- → {{#language:en|mar}} → English (falls back on English)
- → {{#language:mar|en}} → mar (not available)
- → {{#language:mar|mar}} → mar (not available)
- mw:Extension:CLDR MediaWiki Extension : "Provides functions to localize the names of languages, countries, currencies, and time units based on their language code."
- Github mirror > key folder : /CldrNames & Wikimedia corrections here.
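The fallback behaviour seen in the examples above can be sketched in a few lines of Python. This is only an illustration of the observed results, not the extension's actual implementation: the `NAMES` table and the `language_name` function are hypothetical and contain only the handful of names listed in this thread.

```python
# Hypothetical lookup table: (language code, target language) -> name.
# Only the names shown in the examples above are included.
NAMES = {
    ("ar", "en"): "Arabic",
    ("ar", "hi"): "अरबी",
    ("ja", "hi"): "जापानी",
    ("fr", "he"): "צרפתית",
    ("fr-ca", "en"): "Canadian French",
    ("en", "en"): "English",
}


def language_name(code: str, target: str) -> str:
    """Mimic the observed behaviour of {{#language:code|target}}."""
    # 1. Use the name in the target language when it is known.
    if (code, target) in NAMES:
        return NAMES[(code, target)]
    # 2. Otherwise fall back on the English name.
    if (code, "en") in NAMES:
        return NAMES[(code, "en")]
    # 3. Otherwise return the code unchanged (name not available).
    return code


print(language_name("ar", "hi"))     # name known in the target language
print(language_name("fr-ca", "he"))  # falls back on English
print(language_name("fra", "he"))    # unknown code, returned as-is
```

The three steps match the three kinds of result listed above: a translated name, an English fallback, or the raw code when nothing is available.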
While we would gain from staying focused on our own recording mission, it is still worth being aware of this project. cc @Poslovitch , for the Magic Word. Yug (talk) 12:07, 15 February 2021 (UTC)
- Good to know, but I don't think we're going to use that in the foreseeable future. --Poslovitch (talk) 22:41, 16 February 2021 (UTC)
BlueLL theme might break when updating to MW 1.35
Hi @VIGNERON . According to this pending PR on GitHub (https://github.com/lingua-libre/BlueLL/pull/3), the BlueLL theme might not be compatible with MW 1.35. I have limited knowledge in MW themes, but I can merge the PR if needed. What's your opinion about it? --Poslovitch (talk) 10:49, 16 February 2021 (UTC)
- Hi there, I also don't have the technical understanding of MediaWiki themes to say much, but I encourage you to talk with jdlrobson to see what he thinks about his fix & 1.35. Yug (talk) 21:30, 16 February 2021 (UTC)
Generate a summary.csv alongside the datasets ?
This idea just crossed my mind. Would it be interesting to generate a summary.csv file containing the list of available datasets, their generation date and their size in bytes, with additional information such as the number of recordings, the number of speakers, the total length of the audio files... Any opinions? --Poslovitch (talk) 11:15, 16 February 2021 (UTC)
- (Then why not a minimalist HTML5 webpage with a single table ? Would be more elegant. Yug (talk) 21:35, 16 February 2021 (UTC))
- Also, only 13 zips have been updated. More languages have been active in the past month alone. The /datasets/ page also doesn't display the 100+ languages it should. So I suspect create_datasets.sh is still not doing the full thing.
- Poslovitch, you see it too ? Yug (talk) 21:35, 16 February 2021 (UTC)
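As a starting point for the summary.csv idea, here is a minimal sketch in Python. It is not part of create_datasets.sh: the `write_summary` function, the column names and the directory layout (one `.zip` per language in a single folder) are assumptions for illustration. It uses the file modification time as the generation date, and the number of entries in each archive as a rough stand-in for the number of recordings.

```python
import csv
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def write_summary(datasets_dir: Path, out_path: Path) -> int:
    """Write one CSV row per .zip found in datasets_dir; return the row count."""
    rows = []
    for zip_path in sorted(datasets_dir.glob("*.zip")):
        stat = zip_path.stat()
        with zipfile.ZipFile(zip_path) as archive:
            # Assumption: each non-directory entry is one recording.
            entries = sum(1 for info in archive.infolist() if not info.is_dir())
        rows.append({
            "file": zip_path.name,
            # Modification time (UTC) used as the generation date.
            "generated": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(timespec="seconds"),
            "size_bytes": stat.st_size,
            "entries": entries,
        })

    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["file", "generated", "size_bytes", "entries"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

It could be called as `write_summary(Path("datasets"), Path("datasets/summary.csv"))`, where the `datasets` path is a placeholder. Speaker counts and total audio length would need data from inside the archives or from a query, so they are left out here. The row count it returns would also make it easy to check how many languages were actually exported.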
Where is LinguaImporter's code (admins only)
Hello, we have T233917, which requests an edit to the language importer tool. I checked Mediawiki:Common.js and github/lingua-libre with the UI's string search:LinguaImporter and search:Import a language, but found nothing. Any idea where this LanguageImporter tool is coded ? Yug (talk) 21:48, 16 February 2021 (UTC)