User talk
Difference between revisions of "Yug"
:{{GT|Yes, I would have liked to talk with you. We often work in parallel or at different times, but synchronous collaboration is very encouraging too. Ping me if you come through Paris and we'll grab a drink.<br>On the bot side, yes, the bot is a major point to unblock. LLBot was approved this summer for wikt@oc ([1]), its 5th wiki.<br>But when you look at the bot's global activity ([[:meta:Special:CentralAuth/Lingua_Libre_Bot]]), the bot contributes mainly to fr:wikt. A little to oc:wikt and shy:wikt, two very rarely visited Wiktionaries. So essentially, in terms of usage, ''*all Lingualibre contributions of the past 6 years serve only the French Wiktionary*'', that of a rich Western country with a strong appetite for exoticism. All the efforts of the Polish, Punjabi, Bengali, Marathi, Haitian, Brazilian (Surui), Rwandan, Cantonese, etc. contributors are only consulted by users of the French-language Wiktionary.<br>We need '''more developers, a better-balanced load among our developers, and better sprint planning''' (to spare our helpers). In that case, we could take on creating the 6 bots pending at [[Lingualibre:Bot]].<br>If you want to assess the feasibility of moving forward on this front, you should consult [[User:Poslovitch]] and [[User:Lepticed]]; they are the last ones to have touched this bot. A refactoring of the code was planned but there has been no news for ~8 months.}}
:[1]: [https://oc.wiktionary.org/w/index.php?title=Wikiccionari:Demandas_als_administrators&oldid=367722#Lingua_Libre_Bot Vote on wikt@oc] [[User:Yug|Yug]] ([[User talk:Yug|talk]]) 21:37, 12 December 2022 (UTC)
::According to Lepticed, there are ~6 files and it takes a good day to get into the code. [[User:Yug|Yug]] ([[User talk:Yug|talk]]) 20:22, 13 December 2022 (UTC)
Revision as of 20:22, 13 December 2022
dev
Hi Yug,
A quick reminder that for testing there is the development instance https://v2.lingualibre.fr, where you can run all the experiments you want, and only apply them here once they are fully ready. That will keep people from stumbling on weird things when they load the site at the wrong moment ;).
Cheers -- 0x010C ~talk~ 15:45, 26 December 2018 (UTC)
PS: and don't leave useless pages lying around (MediaWiki:Recordwizard); in this case it notably blocks possible future changes and translations pushed by translatewiki.
- Phew! You really caught it!!! *0* !!! I went as fast as I could!!!! Thanks for your vigilance; everything looked properly restored to me, unless I was blinded by the cache!... Ah, v2, that's right! Can you make me an admin??? I would like to test image support in https://v2.lingualibre.fr/wiki/MediaWiki:Sidebar ! Yug (talk) 20:09, 26 December 2018 (UTC)
- PS: I don't quite understand these translation questions / pages / tags... Yug (talk) 20:10, 26 December 2018 (UTC)
About
Your intro and the history are fine on LinguaLibre:About, but I removed what has no business being on that page (so that it is at least minimally professional / clean / effective). You can find the content in the page history. -- 0x010C ~talk~ 21:59, 26 December 2018 (UTC)
- Ok, cool! I pulled that from GitHub, where I'm doing some cleanup. I haven't decided yet where to put that stuff on LinguaLibre... I'll keep you posted.
- On GitHub, the repository listes can be deleted! Yug (talk) 18:08, 28 December 2018 (UTC)
Ongoing work
Get your data :
$ git clone git@github.com:hermitdave/FrequencyWords.git
Then save the following as dave-to-LL.mk in the root of your directory:
# RUN:
# make -f dave-to-LL.mk iso2=pl iso3=pol processing   # to do the work
# make -f dave-to-LL.mk iso2=pl iso3=pol all          # to do the work AND print few messages
iso2="pl"
iso3="pol"
all: processing messages
processing:
	sed -E 's/ [0-9]+$$//g' $(iso2)_50k.txt | sed -E 's/^/# /g' > $(iso2)-words-LL.txt
	split -d -l 2000 --additional-suffix=".txt" $(iso2)-words-LL.txt "$(iso3)-words-by-frequency-"
messages:
	head -n 5 $(iso2)_50k.txt
	head -n 5 $(iso2)-words-LL.txt
	head -n 5 "$(iso3)-words-by-frequency-00.txt"
	head -n 5 "$(iso3)-words-by-frequency-01.txt"
	wc -l $(iso2)-words-LL.txt
	wc -l "$(iso3)-words-by-frequency-01.txt"
Then find your {iso2}_50k.txt file. Put both files in the same folder, and run the command below with the iso2 and iso3 values you need:
make -f dave-to-LL.mk iso2=pl iso3=pol processing
Yug (talk) 18:17, 28 December 2018 (UTC)
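For readers without GNU make, sed and split at hand, the processing target above can be sketched in Python. This is a rough equivalent under stated assumptions, not the script actually used: the function names (`to_ll_lines`, `chunk`, `write_chunks`) are made up for illustration, and the input is assumed to be a `word count` file like the `{iso2}_50k.txt` files from FrequencyWords.

```python
from pathlib import Path


def to_ll_lines(lines):
    """Drop the trailing ' <count>' from each line and prefix it with '# '."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        word, sep, count = line.rpartition(" ")
        # Only strip when the last field is a plain number, like sed 's/ [0-9]+$//'.
        if sep and count.isascii() and count.isdigit():
            line = word
        out.append("# " + line)
    return out


def chunk(items, size=2000):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_chunks(src, iso3, size=2000):
    """Write <iso3>-words-by-frequency-NN.txt files next to `src` (hypothetical helper)."""
    src = Path(src)
    lines = to_ll_lines(src.read_text(encoding="utf-8").splitlines())
    paths = []
    for n, part in enumerate(chunk(lines, size)):
        path = src.with_name(f"{iso3}-words-by-frequency-{n:02d}.txt")
        path.write_text("\n".join(part) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_chunks("pl_50k.txt", "pol")` would then play the role of the make invocation; the 2000-line chunk size mirrors the `split -l 2000` above.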
Subtlex
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
sed -E 's/(,[0-9]+.?[0-9]*)+//g' $iso2-words.txt | tail -n+4 | head -n 20000 | sed -E 's/^/# /g' > $iso2-words-LL.txt
split -d -l 2000 --additional-suffix=".txt" $iso2-words-LL.txt "$iso3-words-by-frequency-"
Feature idea: table tracking existing languages on LinguaLibre.fr
I have difficulty keeping track of all the languages I helped add to LinguaLibre. Taiwan alone has 16 languages and 42 local variants. Maybe this already exists... If not, it would be helpful to have a sortable table such as the one below:
Wikidata qid | LinguaLibre qid | English name | Language group | Active ? | Number of recordings |
---|---|---|---|---|---|
Q715766 | Atayal (Q51302) | Atayal | Taiwanese | Low | 4 |
Q718269 | Sakizaya (Q51871) | Sakizaya | Taiwanese | Low | 6 |
... | ... | .... | ... | ... | ... |
Yug (talk) 12:16, 31 December 2018 (UTC)
- I'm finding out how LinguaLibre:Stats is coded, maybe I will be able to produce something :D Yug (talk) 12:59, 31 December 2018 (UTC)
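To make the feature idea concrete, here is a minimal sketch of how such a table could be produced from per-language records. It is only an illustration: the `LanguageRow` class and `render_table` function are hypothetical, nothing here reflects how LinguaLibre:Stats is actually coded, and the rows must come from wherever the real counts live.

```python
from dataclasses import dataclass


@dataclass
class LanguageRow:
    wikidata_qid: str
    lingualibre_qid: str
    english_name: str
    group: str
    activity: str
    recordings: int


def render_table(rows):
    """Render rows as a pipe table, most-recorded languages first."""
    header = ("Wikidata qid | LinguaLibre qid | English name | "
              "Language group | Active ? | Number of recordings |")
    separator = "---|---|---|---|---|---|"
    body = [
        f"{r.wikidata_qid} | {r.english_name} ({r.lingualibre_qid}) | "
        f"{r.english_name} | {r.group} | {r.activity} | {r.recordings} |"
        for r in sorted(rows, key=lambda r: r.recordings, reverse=True)
    ]
    return "\n".join([header, separator, *body])
```

Sorting on the server side is just one option; a sortable wiki table would let readers reorder by any column themselves.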
WP query and frequency
Helper : https://codepen.io/hugolpz/pen/ByoKOK
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
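The counting half of that pipeline (`tr | sort | uniq -c | sort -n -r`) can be sketched in Python as below. The function name `word_frequencies` is an assumption for illustration, and the text is taken as an argument so that no network call is needed; note that `str.split()` breaks on any whitespace, whereas the `tr` call above only breaks on spaces.

```python
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(text.split())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `word_frequencies("a b a c b a")` gives `[("a", 3), ("b", 2), ("c", 1)]`.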
Tamil
- https://mobile.twitter.com/IyalMozhi/status/1265550815051350019 Yug (talk) 10:31, 27 May 2020 (UTC)
Clean up corpus
find -iname 'fra-opus-100k.txt' -exec cat {} \; | grep -o '\w*' | sed -e "s/^c$/ce/gi" -e "s/^d$/du/gi" -e "s/^j$/je/gi" -e "s/^m$/moi/gi" -e "s/^n$/ne/gi" -e "s/^qu$/que/gi" -e "s/^s$/se/gi" -e "s/^t$/toi/gi" -e "/^[0-9]*$/d" | awk '/^[[:upper:]][^[:upper:]]*$/{$1=tolower(substr($1,1,1)) substr($1,2)}1' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > fra-subtlex.txt
cat userlist.txt | sed -e 's/^# //g' | tr "\n" "\|" | tr "\^" "\(" | sed -e "s/^/\\\(/" -e "s/|$/\\\)/g"
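The French corpus cleanup (the `find ... | grep | sed | awk | sort` pipeline above) is dense, so here is a hedged Python reading of it. The names `ELISIONS`, `normalise` and `count_words` are invented for this sketch; the mapping itself is copied from the sed expressions, and edge cases may differ slightly from the shell tools.

```python
import re
from collections import Counter

# Elided forms left behind once apostrophes are split off, as in the sed rules.
ELISIONS = {
    "c": "ce", "d": "du", "j": "je", "m": "moi",
    "n": "ne", "qu": "que", "s": "se", "t": "toi",
}


def normalise(token):
    """Return the cleaned token, or None when the token should be dropped."""
    if token.isdigit():
        return None  # pure numbers are deleted, like sed '/^[0-9]*$/d'
    expanded = ELISIONS.get(token.lower())  # case-insensitive, like the sed 'i' flag
    if expanded is not None:
        return expanded
    # Like the awk step: a single leading capital is lowercased, acronyms are kept.
    if token[0].isupper() and not any(ch.isupper() for ch in token[1:]):
        return token[0].lower() + token[1:]
    return token


def count_words(text):
    """Count cleaned tokens, most frequent first."""
    tokens = re.findall(r"\w+", text)
    counts = Counter(t for t in map(normalise, tokens) if t is not None)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

One consequence worth knowing, shared with the shell version: capitalised proper nouns such as "Paris" are lowercased too, since the rule cannot tell them from sentence-initial words.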
Lingua Libre Story for September 2020
- This is not an official story or newsletter. This is an attempt by a user to share some updates about the program. There might be more stories which I have missed.
September 2020 was an eventful month and we have seen a lot of activities of uploading new content and also around project-related discussion. Here are some of the best stories from September 2020.
- 300,000 files: On 10 September 2020 we completed 300,000 pronunciation uploads. After the launch in August 2018, the first 100,000 files were uploaded in April 2019, and the milestone of 200,000 files was reached in January 2020. As of 30 September 2020 there are 366 speakers at this project working in 92 languages.
- Maximum number of pronunciations in a month: In September 2020, 23,209 files were uploaded. This is the maximum number of files uploaded ever in a particular calendar month (earlier it was 22,963 files in June 2020, and 22,293 files in May 2019).
- Indian language in top 3 list: This month Bengali language came into the top three languages by the number of files uploaded using Lingua Libre. This is possibly the first time a non-European/Indian language came into the top three most-uploaded languages on the project.
- Project chat: Several discussions started in the Chat room, such as Bug testing (you may help), Technical preparations, etc.
That's it. Have a good time. --টিটো দত্ত (Titodutta) (কথা) 13:35, 1 October 2020 (UTC)
This post is under the CC0 license; feel free to share it with anyone, anywhere, without any restriction.
Re: Speeding bug
Bonjour/নমস্কার, I have been following the thread for some time. However, I did not understand it. It describes the bug as "nasty"; do you mean that after recording, the audio speed becomes much faster? Yes, that happens. That's speeding up. If you mean that while previewing, the audio suddenly stops in between and tries to load, that also happens; that's speeding down (and thankfully it does not change the end result, so you can upload). Most probably you are talking about speeding up. I can add my opinion/view (and possibly ask you another question also). Regards. --টিটো দত্ত (Titodutta) (কথা) 02:03, 16 October 2020 (UTC)
- @Titodutta
- For some users, when they record a 2-second word, the audio produced is 0.3 seconds long: a sped-up version of what we want. Open and listen to this one: 飞行员已经修复了他们的设备 (Q338365).
- Why nasty? Because some people may record 5000 audios and 5% of the set is corrupt. This makes the whole dataset untrustworthy and unusable.
- In terms of time management, it would be faster to delete all 2000 recordings and ask the speaker to record again. But since we work with volunteers, coming to them to say "hey, our web app failed, can you re-contribute the exact same thing for free again?" is quite embarrassing for everyone to discuss.
- The half-lucky part is... this bug affects whole sessions. For example, among Luilui6666's uploads that day, only the 126 uploads around 04:5* am are corrupted. All those uploaded at 04:29 are fine. (Don't know why)
- Yug (talk) 09:14, 17 October 2020 (UTC)
- Oh, that's kind of common. I do not record so many words at a time, and I double-check (speaker and earphone) before uploading (which is why it takes a lot more time, and my upload rate is pretty slow). Anyway, this is not an uncommon thing. The question I wanted to ask you: do you face this error for any given list? I face it sometimes. Let me explain: suppose I uploaded 12, 20, 15, 18 words, and while I am doing the fifth list/batch, the speeding issue starts, generally with 1–2 words out of 15–20 words. Next time it will be more, and almost 50% of the words will be corrupt. The easy solution I use is to reload the page and start the Record Wizard from the beginning. That helps. Anyway, if anyone wants to test (the results on the Project chat page) and finds there is no error, that does not necessarily mean there is no error. In my observation, the error happens after a certain time/number of uploads. That's my thought on this so far.
- The example you have given, yes that's the problem. However it is suggested to check every audio before uploading, perhaps. --টিটো দত্ত (Titodutta) (কথা) 09:32, 17 October 2020 (UTC)
- Plus I do not try with so many words, my highest has been 140 or so words, that too I try very rarely. But yes, you are right, if it is a big list of a few hundred words, the error may start anytime, and that's problematic. --টিটো দত্ত (Titodutta) (কথা) 09:35, 17 October 2020 (UTC)
- So you are affected as well. Please copy or move your report above into LinguaLibre:Chat_room#Speeding-up_bug_:_call_for_testers. We collect observed behaviors and brainstorm there. It's important to know how many people and which systems are affected. Yug (talk) 09:47, 17 October 2020 (UTC)
- Absolutely, I face this issue as well. The problem is I can not replicate it and do not know when this error is going to come. Mostly it is uploading after 50/100/any number of words. That's the reason (not able to replicate) I did not post there so far. Let me know if you think I should post this there. I am using Firefox 81.0.2 (64 bits), Windows 10, I am using an external microphone. Regards. --টিটো দত্ত (Titodutta) (কথা) 10:00, 17 October 2020 (UTC)
- @Titodutta Add your system info to the table. In the comment column, briefly note that it's occasional. Then, in the discussion, paste what you wrote to me above, the long explanation of what you see. Maybe someone will think of something. Following Luilui's corrupted set (126, long sentences), I suspect that longer sentences create a saturation effect. But we need more testimonies. Yug (talk) 10:28, 17 October 2020 (UTC)
Project:Otherworldly languages
- Whistling:
- Canaries
- India > Khongtong > Melody-names https://www.bbc.com/news/av/world-asia-india-49953861
Referents
This table lists the referent users to contact for each kind of need.
Username & email | LinguaLibre.org (Wiki) | Github (Dev) | Phabricator (Tasks) | Social web (Communications) | Community contact of... |
---|---|---|---|---|---|
0x010C | Administrator, Bureaucrat | Owner (admin) | ? | tw:@LingLibre_WMFr | – |
Jixitris | User | Owner (admin)/Dev | – | – | |
Yug ✉ | Administrator | Owner (admin)/Organisation | Member | – | – |
Xenophôn | Administrator, Bureaucrat | Owner (admin) | – | Wikimedia France | |
MickeyBarber | User | Owner (admin) | – | Wikimedia France | |
Lyokoï | Administrator | Member | tw:@LingLibre_WMFr | – | |
Pamputt | Administrator, Bureaucrat | Member | Member | – | – |
Poslovitch | User | Owner (admin) | – | Eastern France's academics | |
Adélaïde Calais WMFr | User | Owner (admin) | – | Wikimedia France | |
Tshrinivasan | User | Member | – | India | |
Eavqwiki | Administrator | Member | – | Paris academics | |
WikiLucas00 | Administrator | – | Member | – | – |
[[User:|User:]] | User | Member | – | – | |
Bold: most active user(s), referent one. |
User essays
I have two user essays, did you see these? These will tell about the workflow I and we use for Bangla (Bengali)
Let me know if you find these interesting/worth-sharing. --টিটো দত্ত (Titodutta) (কথা) 00:06, 3 February 2021 (UTC)
- Hello user:Titodutta, I'm quite busy at the moment with GitHub cleanup, coordination, dev and fixes, in addition to work and family life. I plan to come back to LinguaLibre's community here next week and then look at your resources. But as a rule of thumb on wikis, it is better to go too fast than to slow yourself down because of others ;) Yug (talk) 13:33, 4 February 2021 (UTC)
Userrights
That would be perfect. Thank you for submitting this request for me! Poemat (talk) 00:52, 18 February 2021 (UTC)
Autopatrol
Thank you! This limitation is stupid and counterproductive. I understand it's a Commons policy, and not a Lingua Libre invention, but as long as it is in force, you could at least add a counter or a warning to the app, to tell people they are approaching their limit. It can be very disappointing for new users. Olaf (talk) 01:16, 18 February 2021 (UTC)
- Yes, it has been a sneaky thing to discover. The earliest users are also heavy Commons contributors and generally already have autopatrol rights, so we spent 2 years without noticing, precisely because it only affects new users... We are on it and will do our best. Yug (talk) 01:40, 18 February 2021 (UTC)
- Regarding the bot - my bot is in Java, and it's a big complicated machine, which I have been developing for 12 years. Adding all the new pronunciation recordings the night after they appeared in Commons is just one of its functions. I'm afraid the different programming language and the focus on Polish Wiktionary makes my bot incompatible with your production. However, I can see another area where I could probably help. Each time I start the recording, I run my code to produce a list of words to record. I can see LL has a list feature but at least for the Polish language the lists are useless (sorry) - they contain inflected forms and words that have been already recorded by others. My list consists only of lemmas that have no recordings in Commons (both from Lingua Libre and in the old format), and it is sorted in descending frequency order. My bot could maintain similar lists for the most popular languages, updating them every night. It wouldn't be a problem, I've been maintaining similar frequency lists for years to help Polish Wiktionary editors create the Wiktionary articles about the most frequently used words in various languages, so most of the code is already done. The only things I would need are a way to upload the lists to the LL system, and a bit of configuration to make them possible to select in your record dialog. What do you think about this idea? Olaf (talk) 17:47, 19 February 2021 (UTC)
- p.s. If you want to take a look at my production, here is a list of French lemmas that have the biggest number of French sections in various language editions of Wiktionary, but have no French section in Polish Wiktionary: https://pl.wiktionary.org/wiki/Wikis%C5%82ownik:Ranking_brakuj%C4%85cych_s%C5%82%C3%B3w_wed%C5%82ug_wyst%C4%85pie%C5%84_w_innych_wikis%C5%82ownikach/francuski
- I maintain similar lists for about 60 other languages, updated nightly: https://pl.wiktionary.org/wiki/Kategoria:Rankingi_brakuj%C4%85cych_s%C5%82%C3%B3w_wed%C5%82ug_wyst%C4%85pie%C5%84_w_innych_wikis%C5%82ownikach
- For my own needs I generate a similar list of Polish words with no recordings in Commons. This is what I could offer for many languages. I additionally split the Polish list to separate words with the /r/ consonants and the rest, because I can't speak /r/ properly, so the words with /r/ are for my wife (Poemat), but I can do the split in Polish only. Olaf (talk) 18:08, 19 February 2021 (UTC)
- @Olaf Hello, and thank you for this bot summary. It's interesting to discover other bots helping in unexpected ways. FYI, Poslovitch is now managing the LinguaLibre Bot. He got access rights recently, has Python skills and is studying the bot.
- Poslovitch and I are currently focused on Wikimedia India's Wiki Meet online conference. I contributed a bit too much recently and need a light wikibreak in the coming week(s). We should keep your bot and its approach in mind. Maybe migrate this conversation to LinguaLibre:Technical board? Yug (talk) 15:11, 20 February 2021 (UTC)
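The list Olaf describes (lemmas with no recording yet, sorted by descending frequency) boils down to a filter plus a sort. The sketch below shows only that core idea in Python; Olaf's bot is written in Java and its real code is not shown on this page, so the function name `lemmas_to_record` and the input shapes (a frequency dict and a set of already-recorded lemmas) are assumptions.

```python
def lemmas_to_record(frequencies, recorded):
    """List lemmas without audio, most frequent first.

    frequencies: dict mapping lemma -> frequency count (assumed input shape)
    recorded:    set of lemmas that already have a recording
    """
    missing = [(lemma, n) for lemma, n in frequencies.items() if lemma not in recorded]
    missing.sort(key=lambda kv: (-kv[1], kv[0]))  # ties broken alphabetically
    return [lemma for lemma, _ in missing]
```

With made-up counts, `lemmas_to_record({"dom": 50, "kot": 30, "pies": 40}, {"pies"})` returns `["dom", "kot"]`. Rebuilding the list every night, as proposed, would simply mean re-running this with a fresh `recorded` set.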
LiLi video is up
Your session was great. The video is here m:Wikimedia_Wikimeet_India_2021/Program. --টিটো দত্ত (Titodutta) (কথা) 20:28, 23 February 2021 (UTC)
Olafbot
- Moved to LinguaLibre:Technical board --Yug (talk) 12:09, 25 February 2021 (UTC)
Would you be so kind and delete the following redirects:
- List:Ben/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Deu/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Eng/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Epo/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Fra/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Hin/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Spa/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Ukr/lemmas-without-audio-sorted-by-number-of-wiktionaries
- List:Eng/words without audio, sorted by num of wiktionaries, refreshed daily
I have changed the capitalization a little bit in the code, so the lists maintained by the bot have names starting with "Lemmas" instead of "lemmas". The old versions are not needed. Sorry for adding more work. BTW, I found no template for Speedy Delete, I guess, there was no need for it on this wiki? Olaf (talk) 00:40, 27 February 2021 (UTC)
- I have deleted the redirect pages under U1/Author's request criteria. --টিটো দত্ত (Titodutta) (কথা) 16:30, 27 February 2021 (UTC)
The bot is written in Java. It's a big ugly blob of many functions accumulated in the last 12 years. And the list generation for LinguaLibre is not independent of the rest, because the same scanning of Wiktionaries is used in list generation for Polish Wiktionary, and Commons category scanning is used to add audio files to Pl-wikt. So it would be rather hard to move to lingua-libre repo unless I split it in half. Olaf (talk) 12:27, 1 March 2021 (UTC)
- There are 72 "Lemmas-" lists, but last night I added three new (example). However they are generated only for the Polish language from statistics of links between words in Polish Wiktionary - every word in a Wiktionary definition or an example is linked to its lemma there (like here), so it was possible and produced very good results. I use just these lists when recording for LiLi. But I'm afraid it's not possible to generate a good list in this way for any other language.
- I'm also thinking about generating lists of geographical names using Wikipedia and Wikidata, but it's for the future.
- I believe your idea of importing Unilex lists is very good. :-) Olaf (talk) 01:09, 4 March 2021 (UTC)
- Hey, Olaf, thanks for the support. I'm brain-exhausted but the bot is ready and I see we are attacking our bottlenecks from several sides. It's pretty nice to witness. I made a test run with List:Ita/M…. That gives an idea. Anyway, thank you :) Yug (talk) 22:57, 4 March 2021 (UTC)
- Ok, I understand the reasons for not using the Lemmas list. However I don't quite understand, what exactly am I supposed to do. Do you want me to generate the Marathi first-letter lists with this script and upload it with the bot? Should I remove the recorded words from them? Once? Every night? I'm not sure what you expect. In fact, if they divide the work among them, there will be no words recorded by other people, so perhaps the current option in the Record Wizard for removing one's own recordings is enough here? Olaf (talk) 01:47, 6 March 2021 (UTC)
Re: Lists
Many thanks! I should have thought about it (clarification needed) before advertising in the Chat Room, but it may be useful in the future. Do you think an automatically updated version of the Unilex lists with recordings removed would also be useful? Olaf (talk) 19:09, 5 March 2021 (UTC)
- @Olaf We are focused on code these days, and I do it too: forgetting the outreach. But this list / coding sprint is landing. I still need a more serious pause on my side. Now that we have lists, better and wider outreach is clearly the next big thing on lingualibre.
- I think your lists should focus on wikt needs: the missing audios. On my side, I want to provide a stable track for users such as Titodutta, WikiLucas, Poslovitch, who came and slowly recorded 10,000+ items. This will provide clean data for outside (e-)dictionaries.
- I think I may delay the UNILEX import a bit, and in any case I won't import the composite IETL languages yet. I have been slightly overworking these past months so I need a break away from the wikis. Yug (talk) 19:40, 5 March 2021 (UTC)
@Olaf Hello again,
I'm very cautious about my imports given their size. Right now I'm back on naming conventions and would like to exchange views with you. I identified the following elements:
- List:{Iso} : defines the language.
- {purpose} : Most common, Missing wikt, letters, Swadesh
- {source} : unilex, subtlex, etc.
- {id}: defining the list within its series.
  - {range} : such as 1-1000, 00001-01000 and 00001_to_01000.
  - {number} : from 1 then goes up.
- various separators : _ , - , "," and _to_ .
I tried a bunch of combinations, with different orders, separators and capitalization (UNILEX, Unilex, unilex). The following also needs to be kept in mind:
- Title must be in simple English.
- Title must ease identification and loading. (So not all lists can start with "Words".)
- Title must ease search and maintenance, possibly by script.
For {id} I realize the range was a developer-centered need. Better to use numbers.
{source} may be required to discriminate between lists with similar purposes. Or should we have a policy of only one list series per type of approach? Only one frequency list per language, EITHER Unilex OR Subtlex OR HermitDave?
I have now reduced the formats to the following (ignore separators):
Schema | Example | Comment |
---|---|---|
List:${Iso}/{purpose},_{source}_{id} | List:${Iso}/Words_most_used,_UNILEX_1 | Minimalist, purpose first. |
List:${Iso}/{purpose}-${id} | List:${Iso}/Common_words_1<br>List:${Iso}/Letter_A-${id} | Minimalist, no source. May be troublesome for later mass maintenance. |
List:${Iso}/{source}-{purpose}_{id} | List:${Iso}/UNILEX-Common_words_1<br>List:${Iso}/UNILEX-Frequent_words-1<br>List:${Iso}/UNILEX-Letter_A-1 | Minimalist, simple English 1.<br>Minimalist, simple English 2.<br>Minimalist letter. |
List:${Iso}/{source}-{purpose}_{id} | List:${Iso}/Unilex-Common_words_1<br>List:${Iso}/Unilex-Frequent_words_1<br>List:${Iso}/Unilex-Letter_A_1 | Source not capitalized. |
List:${Iso}/{source}-{purpose}_{id} | List:${Iso}/UNILEX-Words_sorted_by_frequency_1 | Explanatory title. |
List:${Iso}/{purpose} | List:${Iso}/Lemmas-without-audio-sorted-by-number-of-wiktionaries | Only one list, so no {id}. {purpose} also exposes the source, so no independent {source}, otherwise it would be duplicated. Explanatory title. Directed at the Wiktionary public, so "Lemmas" is fine. |
I want to think twice about it because the format we use in the coming days will likely become a de facto gentle recommendation. Any ideas or comments on these? Yug (talk) 08:15, 7 March 2021 (UTC)
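Since one stated goal is that titles "ease search and maintenance, possibly by script", here is a small sketch of what scripting against one candidate schema, `List:{Iso}/{source}-{purpose}_{id}`, could look like. It is illustrative only: no schema has been agreed on, the helper names `list_title` and `parse_title` are hypothetical, and a three-letter language code is assumed.

```python
import re


def list_title(iso, source, purpose, list_id):
    """Build a title following List:{Iso}/{source}-{purpose}_{id}."""
    purpose_part = purpose.strip().replace(" ", "_")
    return f"List:{iso.capitalize()}/{source}-{purpose_part}_{list_id}"


# Assumes a three-letter language code and a numeric {id} at the end.
TITLE_RE = re.compile(
    r"^List:(?P<iso>[A-Z][a-z]{2})/(?P<source>[^-/]+)-(?P<purpose>.+)_(?P<id>\d+)$"
)


def parse_title(title):
    """Split a title into its parts, or return None if it does not fit the schema."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    parts = match.groupdict()
    parts["id"] = int(parts["id"])
    return parts
```

A title such as `List:Mar/UNILEX-Common_words_1` round-trips through both helpers, while the single `Lemmas-without-audio-...` list, which has no `{id}`, is rejected by `parse_title`. That shows how a mix of schemas would complicate mass maintenance.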
- Random thoughts about it:
- Lists are displayed in the Record Wizard in alphabetical order. If you put 10 frequency or Unilex lists at the same time, nothing else from this point down the alphabet is likely to be visible for a newcomer. (For example, in Marathi, there is no chance to see any new list, because the directory is bloated with lists having names starting with B. But Marathi is a special case, I understand). So I think uploading initially just one list of each type for each language is a good idea. There is always time to upload more when somebody has taken the bait.
- I wonder if simple English is always the best solution. Perhaps it would be better to name the lists in the local language if possible - not every local speaker must be good at English. I know, it's hard or impossible with so many languages.
- How do you solve the license problem for the Unilex lists? I believe the license should be visible somewhere, but currently, the Record Wizard has no such capability?
- Maybe we should put a special sign at the beginning of the name of each bot-created list, to make the lists always visible at the top of the directory for the newbies? Without it, the lists can be easily lost among old obsolete lists created by users.
- Perhaps the old user-created lists should be at some point removed when no activity takes place for a long time?
- Maybe there should be a marker, that the list is automatically updated? But the name visible in the Record Wizard is probably too short.
- Maybe there should be a separate directory in the Record Wizard for the user-created lists, and the "official lists" (perhaps they could have a better name). I mean, in some languages there are a lot of old lists, and a newcomer may have no idea what to start with.
- Perhaps the Record Wizard could show the list directory as a tree, or in a form similar to a file system? Then we could have as many lists of each kind as we want, and all the remarks above would be negligible.
- Olaf (talk) 20:10, 7 March 2021 (UTC)
- All this discussion makes me think that the list system should be greatly improved. I will open a new ticket on Phabricator to keep track of all your proposals. As a feature request, the new list system may be developed in the future. Pamputt (talk) 20:55, 7 March 2021 (UTC)
- Yes. It was not needed when we barely had lists. Things have changed. The minimum would be to 1) agree on a set of rules for list maintenance, so we can merge the small ones (see List:Mar/); 2) be able to have sections within a List page and, when loading, to load a series (e.g. List:Mar/UNILEX), then a section (e.g. List:Mar/UNILEX#2).
- Also, we need developers. We need to get out of Lili and do outreach.
- Olaf, thanks for your input. I agree with your concerns (will come back to it in few days due to an IRL deadline). As for license I add {{UNILEX license}} on the talkpage. Yug (talk) 21:06, 7 March 2021 (UTC)
- Also : Mar list creations, we need better communications. Yug (talk) 09:24, 8 March 2021 (UTC)
Botting without API?
Apparently, the API is not installed or is disabled in the Phoenix edition, yet Dragons Bot was able to log in and edit. How? Is your bot emulating a human with OAuth login and standard form-based edits? Or was the API switched off just recently and your bot is grounded too? Olaf (talk) 10:20, 23 April 2021 (UTC)
- Never mind, problem solved. The API endpoint has been changed. Olaf (talk) 10:58, 23 April 2021 (UTC)
Deletion of List:Fra/Dico_des_Ados/alea
Hi, why did you delete List:Fra/Dico_des_Ados/alea? I was using it for the words of the Dico that did not yet have a pron... --DSwissK (talk) 17:48, 24 April 2021 (UTC)
- @DSwissK while waiting for Yug to reply, I have restored the page so that you can keep using it. Happy recording. Pamputt (talk) 18:35, 24 April 2021 (UTC)
- Hi @Pamputt, thanks, but I have already recreated it under a slightly more logical name. ;) DSwissK (talk) 18:54, 24 April 2021 (UTC)
Kurdish List
Hi, I noticed that the list List:Kur/words-by-frequency-00001-to-01000 has been deleted. Is there any way I can get it back? Balyozxane (talk) 22:24, 6 May 2021 (UTC)
- @Balyozxane true! I cleaned up the place before doing some mass uploads. I'm on it. Uploading the Kur list shortly. I'm not sure which Kur is in it though, but those text words can be loaded and used for any Kurdish dialect. Yug (talk) 14:05, 7 May 2021 (UTC)
- Done, restored! Special:RecordWizard > Step 3: Details > Local List: enter "kur/" and you will find the lists. A few notes:
- 5000 words at the moment, but I could add more.
- Don't worry if a word is lowercase, uppercase, or capitalized and you prefer another spelling → its twin is very likely further down the list as well, and Lili allows rapid recording. The easiest is to go with the flow (faster than trying to edit).
- We suspect Chrome has some recording issues (noise, parasitic clicks, speeding up) while Firefox does better.
- At step 4, while recording, you can hover over words and play them to check whether it works. At step 5 as well. Long lists (>500 words) may have more of those bugs.
- We are investigating those issues, but only some of us seem affected, which makes debugging slower. Yug (talk) 14:36, 7 May 2021 (UTC)
- @Yug Thank you so much. Unfortunately, my pronunciation sucks so instead I try to encourage native people to upload their own recordings. I asked for the list for that. Have a nice time! --Balyozxane (talk) 15:20, 8 May 2021 (UTC)
- Balyozxane, my pronunciation is only correct in French as well. Still, the best approach is to use your experience of Lili to find motivated native speakers around you. Elderly folks could be a good asset, as could kids. ;) Yug (talk) 19:11, 29 July 2021 (UTC)
Wikimania
Hi User:Ngangaesther, a light ping to you :D Yug (talk) 17:30, 17 August 2021 (UTC)
AAP
Public portal listing calls for research projects. No open data at first glance: https://appelsprojetsrecherche.fr Yug (talk) 13:10, 15 October 2021 (UTC)
"Workshops: Personal Challenge" Section has to be modified
Hi Hugo! I am afraid that the Personal Challenge section will have to be modified. I will not spend hours recording more than 100,000 Spanish words on Lingua Libre if there is no intention to add a search functionality. I am sorry... A hypothetical example could be used instead to invite others to follow suit. Best regards and good luck with the grant proposals! Marreromarco (talk) 00:19, 29 November 2021 (UTC)
- @Marreromarco no problem, I agree with you. We have the Sound Library (hackable) to search your contributions, and the Wiktionary bot is quietly being revamped at the moment (Spanish will be included), but it falls short of what contributors are entitled to expect. I am 100% with you. Your push + departure will hopefully shake things up a bit. Yug (talk) 11:31, 29 November 2021 (UTC)
Sound Library - Proposed changes
Yug,
I would like you to review these changes I made for the Sound Library. Here is the list:
- Filter out genders that are not related to any items in the sound library - This change reduces the genders displayed in the Sound Library, so that only genders related to recordings are shown. It is slightly more taxing on performance (takes 1-2 more seconds).
- SPARQL query optimizations by moving clauses and using BIND().
- ORDER BY RAND() is ineffective for random ordering. Using UUID() and FILTER() instead - The sound library results were not randomly ordered. Instead, I assigned a random UUID() for each result. Then I filtered out all UUIDs not starting with "1" (arbitrary choice). Finally, I ordered by the remaining UUIDs. The result is a much faster query (e.g. query for "gender = male" yields results in 20-30 seconds instead of 2-3 minutes).
Unfortunately, I also ran into a strange problem; I believe this might be a bug in Blazegraph (phabricator:T316983).
Test out my changes here. Let me know your decision. Elfix (talk) 14:41, 3 September 2022 (UTC)
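For readers who want to see the idea behind the UUID() + FILTER() ordering outside SPARQL, here is a minimal Python sketch. It is an illustration only, not the actual Sound Library query: the function name `sample_in_random_order` and the `prefix` parameter are hypothetical, and Python's `uuid.uuid4()` stands in for SPARQL's UUID(). Each row gets a random key, rows whose key does not start with "1" are dropped (roughly 15 out of 16), and the remainder is sorted by key, which yields a random subset in random order.

```python
import uuid


def sample_in_random_order(rows, prefix="1"):
    """Return a random subset of rows, in random order.

    Hypothetical helper mirroring the approach described above:
    1. attach a fresh random UUID to every row,
    2. keep only rows whose UUID starts with `prefix`,
    3. sort the survivors by their UUID.
    """
    # Step 1: one random hex key per row (stand-in for SPARQL UUID()).
    tagged = [(uuid.uuid4().hex, row) for row in rows]
    # Step 2: the first hex digit is uniform over 16 values, so a
    # one-character prefix keeps about 1/16 of the rows.
    kept = [(key, row) for key, row in tagged if key.startswith(prefix)]
    # Step 3: sorting by a random key gives a random order, and only
    # the reduced set has to be sorted.
    kept.sort(key=lambda pair: pair[0])
    return [row for _, row in kept]


if __name__ == "__main__":
    recordings = [f"recording-{i}" for i in range(1000)]
    sample = sample_in_random_order(recordings)
    print(len(sample), "rows kept; first few:", sample[:5])
```

The trade-off is that the result is a sample rather than the full set, which is acceptable when only a page of random results is displayed.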
Wikisigns
Hi Yug, and thank you for your work on the Signit extension. I carefully read your report on the work you did, and I have a remark about Wikisigns. You write that the license of the videos is unclear. Yet it seems to me that they are also published under CC by-sa (like the site). One example is this word, whose video is here. The Creative Commons license is clearly stated in the video description. So in my view, all their videos are free. We could make sure by writing to them. Pamputt (talk) 06:16, 9 October 2022 (UTC)
- Hi @Pamputt,
- It is always good to receive thanks for the work done. Thanks as well for this point about Wikisigns. Having moved forward largely alone, I have naturally built up some wiki-fatigue. So I am leaving these expansion projects dormant for now. We will see whether Adélaide and Melody (WMFR) are able to take charge of this step. Yug (talk) 14:15, 10 October 2022 (UTC)
Malayalam Language
Kindly help me to add my mother tongue "Malayalam" to lingualibre.org. Akbarali 06:45, 11 December 2022 (UTC)
- Hello @Akbarali, Malayalam is already available in Lingua Libre (Malayalam (Q437)). Do you have trouble selecting it in the Record Wizard? Welcome again, all the best -- WikiLucas (🖋️) 14:28, 11 December 2022 (UTC)
- Thank you WikiLucas (🖋️). It is working. --Akbarali 06:24, 13 December 2022 (UTC)
Lingua Libre Bot
Hi Yug, I did not have time to talk with enough people at the Wikiconvention about Lingua Libre, so I am coming to ask for news. I saw that you mentioned that Lingua Libre Bot is currently at a standstill for lack of a Python developer. That is indeed correct, and I wonder about the reason. I had started working on support for new Wiktionary projects, but since things were not very active and I was busy with other things, I had set it aside somewhat. Now I am willing to try to get back to it, but I need a clearer picture before diving in.
I see that the bot is still running, since it adds audio pronunciations to at least the French, Shawiya and Occitan Wiktionaries, but it seems it has never added anything to the Kurdish Wiktionary, even though the pull request was merged several months ago. I recently left a message asking for news. In addition, support for Oriya is still pending.
So I do not know who holds the keys to run the bot, and I can understand that they may not have much time to look after it right now, but I would just like to know where we are heading before proposing support for new Wiktionaries.
So if you have any info, I am all ears :) Pamputt (talk) 20:47, 12 December 2022 (UTC)
- Hi Pamputt, cc: user:Olaf (you can click the purple link to have Google translate the discussion.)
- Yes, I would have liked to talk with you. We often work in parallel or at different times, but synchronous collaboration is very encouraging too. Ping me if you come through Paris and we will grab a drink.
On the bot side, yes, the bot is a major point to unblock. LLBot was approved this summer for wikt@oc ([1]), its 5th wiki.
But when we look at the bot's global activity (meta:Special:CentralAuth/Lingua_Libre_Bot), the bot contributes mainly to fr:wikt, and a little to oc:wikt and shy:wikt, two Wiktionaries with extremely few visitors. So essentially, in terms of usage, *all Lingualibre contributions of the past 6 years serve only the French Wiktionary*, that of a rich Western country with a great appetite for exoticism. All the efforts of the Polish, Punjabi, Bengali, Marathi, Haitian, Brazilian (Surui), Rwandan, Cantonese, etc. contributors are consulted only by users of the French-language Wiktionary.
We need more developers, a better-balanced workload among our developers, and better sprint planning (to spare our helpers). In that case, we could then take on the creation of the 6 bots pending on Lingualibre:Bot.
If you want to assess the feasibility of moving forward on this front, you should consult User:Poslovitch and User:Lepticed; they are the last ones to have touched this bot. A refactoring of the code was planned, but there has been no news for ~8 months. (🗨️ translate) - [1]: Vote on wikt@oc Yug (talk) 21:37, 12 December 2022 (UTC)