User talk
Difference between revisions of "Yug"
Revision as of 10:52, 25 February 2021
dev
Hi Yug,
Quick reminder that for testing there is the development instance https://v2.lingualibre.fr, where you can run all the experiments you want, and only apply them here once they are fully ready. That will keep people from stumbling on weird stuff when they load the site at the wrong moment ;).
Cheers — 0x010C ~talk~ 15:45, 26 December 2018 (UTC)
PS: and don't leave useless pages lying around (MediaWiki:Recordwizard); in this case it notably blocks possible future changes and translations pushed by translatewiki.
- Phew! You really landed right on it!!! *0* !!! I went as fast as I could!!!! Thanks for your vigilance; everything looked properly restored to me, unless I was blinded by the cache!... Ah, v2, that's right! Can you make me admin??? I would like to test image support in https://v2.lingualibre.fr/wiki/MediaWiki:Sidebar ! Yug (talk) 20:09, 26 December 2018 (UTC)
- PS: I don't really understand these translation questions / pages / tags... Yug (talk) 20:10, 26 December 2018 (UTC)
About
Your intro and the history are fine on LinguaLibre:About, but I removed what has no business being on that page (so that it stays at least minimally professional / clean / efficient). You can find the content in the page history. — 0x010C ~talk~ 21:59, 26 December 2018 (UTC)
- Ok, cool! I retrieved that from GitHub, where I'm doing some cleanup. I haven't decided yet where to put this stuff on LinguaLibre... I'll keep you posted.
- On GitHub, the repository listes can be deleted! Yug (talk) 18:08, 28 December 2018 (UTC)
Ongoing work
Get your data:
$ git clone git@github.com:hermitdave/FrequencyWords.git
Then save the following as dave-to-LL.mk in the root of your directory:
# RUN:
# make -f dave-to-LL.mk iso2=pl iso3=pol processing   # to do the work
# make -f dave-to-LL.mk iso2=pl iso3=pol all          # to do the work AND print a few messages
iso2="pl"
iso3="pol"
all: processing messages
processing:
	sed -E 's/ [0-9]+$$//g' $(iso2)_50k.txt | sed -E 's/^/# /g' > $(iso2)-words-LL.txt
	split -d -l 2000 --additional-suffix=".txt" $(iso2)-words-LL.txt "$(iso3)-words-by-frequency-"
messages:
	head -n 5 $(iso2)_50k.txt
	head -n 5 $(iso2)-words-LL.txt
	head -n 5 "$(iso3)-words-by-frequency-00.txt"
	head -n 5 "$(iso3)-words-by-frequency-01.txt"
	wc -l $(iso2)-words-LL.txt
	wc -l "$(iso3)-words-by-frequency-01.txt"
Then find your {iso2}_50k.txt file, put it in the same folder as dave-to-LL.mk, and run the command below with the iso2 and iso3 values you need:
make -f dave-to-LL.mk iso2=pl iso3=pol processing
Yug (talk) 18:17, 28 December 2018 (UTC)
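For readers without make or GNU split at hand, the processing target above can be approximated in Python. This is only a sketch under assumptions: the function names (strip_counts, chunk, write_chunks) are hypothetical, the input is assumed to be "word count" lines as in the {iso2}_50k.txt files, and, unlike the sed version, empty lines are skipped.

```python
from pathlib import Path


def strip_counts(lines):
    """Drop the trailing ' <count>' from each line and prefix it with '# '."""
    items = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: empty lines are skipped (sed would keep them)
        word, sep, count = line.rpartition(" ")
        # Only strip the last field when it really is a numeric count.
        if sep and count.isdigit():
            line = word
        items.append("# " + line)
    return items


def chunk(items, size=2000):
    """Cut a list into consecutive parts of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_chunks(items, iso3, out_dir=".", size=2000):
    """Write each part to {iso3}-words-by-frequency-NN.txt, like `split -d`."""
    paths = []
    for index, part in enumerate(chunk(items, size)):
        path = Path(out_dir) / f"{iso3}-words-by-frequency-{index:02d}.txt"
        path.write_text("\n".join(part) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Usage would be along the lines of write_chunks(strip_counts(open("pl_50k.txt", encoding="utf-8")), "pol"), which should give the same file names as the make target.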
Subtlex
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
sed -E 's/(,[0-9]+.?[0-9]*)+//g' $iso2-words.txt | tail -n+4 | head -n 20000 | sed -E 's/^/# /g' > $iso2-words-LL.txt
split -d -l 2000 --additional-suffix=".txt" $iso2-words-LL.txt "$iso3-words-by-frequency-"
Feature idea: table tracking existing languages on LinguaLibre.fr
I have difficulty keeping track of all the languages I helped add to LinguaLibre. Taiwan alone has 16 languages and 42 local variations. Maybe this already exists... If not, it would be good to have a sortable table such as the one below:
Wikidata qid | LinguaLibre qid | English name | Language group | Active ? | Number of recordings |
---|---|---|---|---|---|
Q715766 | Atayal (Q51302) | Atayal | Taiwanese | Low | 4 |
Q718269 | Sakizaya (Q51871) | Sakizaya | Taiwanese | Low | 6 |
... | ... | .... | ... | ... | ... |
Yug (talk) 12:16, 31 December 2018 (UTC)
- I'm finding out how LinguaLibre:Stats is coded; maybe I will be able to produce something :D Yug (talk) 12:59, 31 December 2018 (UTC)
WP query and frequency
Helper : https://codepen.io/hugolpz/pen/ByoKOK
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
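The counting half of the pipeline above (the part after curl) can be sketched in Python as follows. The function name word_frequencies is hypothetical, and the text is assumed to have been fetched already. One difference from the tr step: this splits on any whitespace, not only on the space character.

```python
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated tokens, most frequent first."""
    counts = Counter(text.split())
    # Descending count, then alphabetical order so that ties are stable.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

Like the shell version, this does no cleaning: punctuation stays attached to words, and "Dragon" and "dragon" are counted separately.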
Tamil
- https://mobile.twitter.com/IyalMozhi/status/1265550815051350019 Yug (talk) 10:31, 27 May 2020 (UTC)
Clean up corpus
find -iname 'fra-opus-100k.txt' -exec cat {} \; | grep -o '\w*' | sed -e "s/^c$/ce/gi" -e "s/^d$/du/gi" -e "s/^j$/je/gi" -e "s/^m$/moi/gi" -e "s/^n$/ne/gi" -e "s/^qu$/que/gi" -e "s/^s$/se/gi" -e "s/^t$/toi/gi" -e "/^[0-9]*$/d" | awk '/^[[:upper:]][^[:upper:]]*$/{$1=tolower(substr($1,1,1)) substr($1,2)}1' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > fra-subtlex.txt
cat userlist.txt | sed -e 's/^# //g' | tr "\n" "\|" | tr "\^" "\(" | sed -e "s/^/\\\(/" -e "s/|$/\\\)/g"
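The first pipeline above (the one writing fra-subtlex.txt) is dense, so here is a rough Python reading of it: tokenise, expand elided French forms, drop pure numbers, lowercase capitalised words, then count. This is a sketch rather than a drop-in replacement; the names ELISIONS, normalize and count_words are hypothetical, and ties are ordered alphabetically, which the shell sort does not guarantee.

```python
import re
from collections import Counter

# Elided French forms mapped to a full word, mirroring the sed rules.
ELISIONS = {"c": "ce", "d": "du", "j": "je", "m": "moi",
            "n": "ne", "qu": "que", "s": "se", "t": "toi"}


def normalize(token):
    """Return the cleaned token, or None when it should be dropped."""
    if token.isdigit():
        return None  # numbers are removed, as with sed '/^[0-9]*$/d'
    expanded = ELISIONS.get(token.lower())
    if expanded is not None:
        return expanded
    # Lowercase words whose only capital is the first letter (this also
    # hits proper nouns); tokens with other capitals, e.g. "ONU", are kept.
    if token[0].isupper() and not any(ch.isupper() for ch in token[1:]):
        return token[0].lower() + token[1:]
    return token


def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for token in re.findall(r"\w+", text):
        word = normalize(token)
        if word is not None:
            counts[word] += 1
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

Note that the elision table is a simplification inherited from the sed rules: "d" always becomes "du" and "l" is left untouched.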
Lingua Libre Story for September 2020
- This is not an official story or newsletter. This is an attempt by a user to share some updates about the program. There might be more stories which I have missed.
September 2020 was an eventful month: we saw a lot of activity, both in uploads of new content and in project-related discussions. Here are some of the best stories from September 2020.
- 300,000 files: On 10 September 2020 we completed 300,000 pronunciation uploads. After the launch in August 2018, the first 100,000 files were uploaded in April 2019, and the milestone of 200,000 files was reached in January 2020. As of 30 September 2020 there are 366 speakers on this project working in 92 languages.
- Maximum number of pronunciations in a month: In September 2020, 23,209 files were uploaded. This is the maximum number of files uploaded ever in a particular calendar month (earlier it was 22,963 files in June 2020, and 22,293 files in May 2019).
- Indian language in top 3 list: This month the Bengali language entered the top three languages by number of files uploaded using Lingua Libre. This is possibly the first time a non-European (Indian) language has entered the top three most-uploaded languages on the project.
- Project chat: Several discussions started in the Chat room, such as Bug testing (you may help), Technical preparations, etc.
That's it. Have a good time. --টিটো দত্ত (Titodutta) (কথা) 13:35, 1 October 2020 (UTC)
This post is under the CC0 license; feel free to share it with anyone, anywhere, without any restriction.
Re: Speeding bug
Bonjour/নমস্কার, I have been following the thread for some time. However, I did not understand it. It describes the bug as "nasty". Do you mean that after recording, the audio speed becomes much faster? Yes, that happens. That's speeding up. If you mean that while previewing, the audio suddenly stops in between and tries to load, that also happens; that's slowing down (and thankfully it does not change the end result, so you can upload). Most probably you are talking about speeding up. I can add my opinion/view (and possibly ask you another question as well). Regards. --টিটো দত্ত (Titodutta) (কথা) 02:03, 16 October 2020 (UTC)
- @Titodutta
- For some users, when they record a 2-second word, the audio produced is 0.3 seconds long: a sped-up version of what we want. Open and listen to this one: 飞行员已经修复了他们的设备 (Q338365).
- Why nasty? Because someone may record 5000 audios and find that 5% of the set is corrupt. This makes the whole dataset untrustworthy and unusable.
- In terms of time, it would be faster to delete the whole 2000 recordings and ask the speaker to record again. But since we work with volunteers, going back to them to say "hey, our web app failed, can you re-contribute the exact same thing for free again?" is quite embarrassing for everyone.
- The half-lucky part is that this bug affects a whole session. For example, among Luilui6666's uploads that day, only the 126 uploads around 04:5* am are corrupted. All those uploaded at 04:29 are fine. (I don't know why.)
- Yug (talk) 09:14, 17 October 2020 (UTC)
- Oh, that's kind of common. I do not record so many words at a time, and I double-check (speaker & earphone) before uploading (that's why it takes a lot more time, and my upload rate is pretty slow). Anyway, this is not an uncommon thing. The question I wanted to ask you: do you face this error for any given list? I face it sometimes. Let me explain: suppose I uploaded lists of 12, 20, 15, and 18 words, and while I am doing the fifth list/batch, the speeding issue starts, generally with 1–2 words out of 15–20. Next time it will be more, and almost 50% of the words will be corrupt. The easy solution I use is to reload the page and start the Record wizard from the beginning. That helps. Anyway, if anyone tests (the results are on the Project chat page) and finds no error, that does not necessarily mean there is no error. In my observation, the error happens after a certain time/number of uploads. That's my thought on this so far.
- The example you have given, yes that's the problem. However it is suggested to check every audio before uploading, perhaps. --টিটো দত্ত (Titodutta) (কথা) 09:32, 17 October 2020 (UTC)
- Plus I do not try with so many words, my highest has been 140 or so words, that too I try very rarely. But yes, you are right, if it is a big list of a few hundred words, the error may start anytime, and that's problematic. --টিটো দত্ত (Titodutta) (কথা) 09:35, 17 October 2020 (UTC)
- So you are affected as well. Please copy or move your report above into LinguaLibre:Chat_room#Speeding-up_bug_:_call_for_testers. We collect observed behaviors and brainstorm there. It's important to know how many people and which systems are affected. Yug (talk) 09:47, 17 October 2020 (UTC)
- Absolutely, I face this issue as well. The problem is I can not replicate it and do not know when this error is going to come. Mostly it is uploading after 50/100/any number of words. That's the reason (not able to replicate) I did not post there so far. Let me know if you think I should post this there. I am using Firefox 81.0.2 (64 bits), Windows 10, I am using an external microphone. Regards. --টিটো দত্ত (Titodutta) (কথা) 10:00, 17 October 2020 (UTC)
- @Titodutta Add your system info to the table. In the comment column, note briefly that it's occasional. Then, in the discussion, paste what you wrote to me above, the long explanation of what you see. Maybe someone will think of something. Following Luilui's corrupted set (126 long sentences), I suspect that longer sentences create a saturation effect. But we need more testimonies. Yug (talk) 10:28, 17 October 2020 (UTC)
Project:Otherworldly languages
- Whistling:
- Canaries
- India > Khongtong > Melody-names https://www.bbc.com/news/av/world-asia-india-49953861
Referents
This table lists the referent users to contact for various needs.
Username & email | LinguaLibre.org (Wiki) | Github (Dev) | Phabricator (Tasks) | Social web (Communications) | Community contact of... |
---|---|---|---|---|---|
0x010C | Administrator, Bureaucrat | Owner (admin) | ? | tw:@LingLibre_WMFr | – |
Jixitris | User | Owner (admin)/Dev | – | – | |
Yug ✉ | Administrator | Owner (admin)/Organisation | Member | – | – |
Xenophôn | Administrator, Bureaucrat | Owner (admin) | – | Wikimedia France | |
MickeyBarber | User | Owner (admin) | – | Wikimedia France | |
Lyokoï | Administrator | Member | tw:@LingLibre_WMFr | – | |
Pamputt | Administrator, Bureaucrat | Member | Member | – | – |
Poslovitch | User | Owner (admin) | – | Eastern France's academics | |
Adélaïde Calais WMFr | User | Owner (admin) | – | Wikimedia France | |
Tshrinivasan | User | Member | – | India | |
Eavqwiki | Administrator | Member | – | Paris academics | |
WikiLucas00 | Administrator | – | Member | – | – |
[[User:|User:]] | User | Member | – | – | |
Bold: most active user(s), referent one. |
User essays
I have two user essays, did you see these? These will tell about the workflow I and we use for Bangla (Bengali)
Let me know if you find these interesting/worth-sharing. --টিটো দত্ত (Titodutta) (কথা) 00:06, 3 February 2021 (UTC)
- Hello user:Titodutta, I'm quite busy at the moment with GitHub cleanup, coordination, dev & fixes, in addition to work and family life. I plan to come back to LinguaLibre's community here next week and then check your resources. But as a rule of thumb on wikis, it is better to go too fast than to slow yourself down because of others ;) Yug (talk) 13:33, 4 February 2021 (UTC)
Userrights
That would be perfect. Thank you for submitting this request for me! Poemat (talk) 00:52, 18 February 2021 (UTC)
Autopatrol
Thank you! This limitation is stupid and counterproductive. I understand it's a Commons policy and not a Lingua Libre invention, but as long as it is in force, you could at least add a counter or a warning to the app to tell people they are approaching their limit. It can be very disappointing for new users. Olaf (talk) 01:16, 18 February 2021 (UTC)
- Yes, it has been a sneaky thing to discover. The earliest users are also heavy Commons contributors and generally already have autopatrol rights, so we spent 2 years without noticing, precisely because it only affects new users... We are on it and will do our best. Yug (talk) 01:40, 18 February 2021 (UTC)
- Regarding the bot - my bot is in Java, and it's a big complicated machine, which I have been developing for 12 years. Adding all the new pronunciation recordings the night after they appeared in Commons is just one of its functions. I'm afraid the different programming language and the focus on Polish Wiktionary makes my bot incompatible with your production. However, I can see another area where I could probably help. Each time I start the recording, I run my code to produce a list of words to record. I can see LL has a list feature but at least for the Polish language the lists are useless (sorry) - they contain inflected forms and words that have been already recorded by others. My list consists only of lemmas that have no recordings in Commons (both from Lingua Libre and in the old format), and it is sorted in descending frequency order. My bot could maintain similar lists for the most popular languages, updating them every night. It wouldn't be a problem, I've been maintaining similar frequency lists for years to help Polish Wiktionary editors create the Wiktionary articles about the most frequently used words in various languages, so most of the code is already done. The only things I would need are a way to upload the lists to the LL system, and a bit of configuration to make them possible to select in your record dialog. What do you think about this idea? Olaf (talk) 17:47, 19 February 2021 (UTC)
- p.s. If you want to take a look at my production, here is a list of French lemmas that have the biggest number of French sections in various language editions of Wiktionary, but have no French section in Polish Wiktionary: https://pl.wiktionary.org/wiki/Wikis%C5%82ownik:Ranking_brakuj%C4%85cych_s%C5%82%C3%B3w_wed%C5%82ug_wyst%C4%85pie%C5%84_w_innych_wikis%C5%82ownikach/francuski
- I maintain similar lists for about 60 other languages, updated nightly: https://pl.wiktionary.org/wiki/Kategoria:Rankingi_brakuj%C4%85cych_s%C5%82%C3%B3w_wed%C5%82ug_wyst%C4%85pie%C5%84_w_innych_wikis%C5%82ownikach
- For my own needs I generate a similar list of Polish words with no recordings in Commons. This is what I could offer for many languages. I additionally split the Polish list to separate words with the /r/ consonants and the rest, because I can't speak /r/ properly, so the words with /r/ are for my wife (Poemat), but I can do the split in Polish only. Olaf (talk) 18:08, 19 February 2021 (UTC)
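To make the list-building idea in the messages above concrete, here is a small Python sketch of "lemmas with no recording, sorted by descending frequency". It is not Olaf's actual code (his bot is written in Java). The function name recording_candidates and both inputs are hypothetical: a frequency table and a set of already-recorded lemmas are assumed to exist.

```python
def recording_candidates(frequencies, recorded, limit=None):
    """Return lemmas that have no recording yet, most frequent first.

    frequencies: dict mapping lemma -> frequency (hypothetical input).
    recorded: set of lemmas that already have a recording (hypothetical input).
    limit: optional maximum number of lemmas to return.
    """
    missing = [(lemma, freq) for lemma, freq in frequencies.items()
               if lemma not in recorded]
    # Descending frequency, then alphabetical order to keep ties stable.
    missing.sort(key=lambda item: (-item[1], item[0]))
    words = [lemma for lemma, _ in missing]
    return words if limit is None else words[:limit]
```

Regenerating such a list every night would only require refreshing the two inputs and running the function again.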
- @Olaf Hello, and thank you for this bot summary. It's interesting to discover other bots helping in unexpected ways. FYI, Poslovitch is now managing the LinguaLibre Bot. He got access rights recently, has Python skills, and is studying the bot.
- Poslovitch and I are currently focused on Wikimedia India's Wiki Meet online conference. I have contributed a bit too much recently and need a light wikibreak in the coming week(s). We should keep your bot and its approach in mind. Maybe migrate this conversation to LinguaLibre:Technical board? Yug (talk) 15:11, 20 February 2021 (UTC)
LiLi video is up
Your session was great. The video is here m:Wikimedia_Wikimeet_India_2021/Program. --টিটো দত্ত (Titodutta) (কথা) 20:28, 23 February 2021 (UTC)
Olafbot
Yes, it was me :-) I'm trying to use the OAuth authentication in my code to be able to generate lists I wrote about, and refresh them automatically in Lingua Libre. I hope using a bot for this is not against the rules here. The lists, generated for 50 languages, will consist of words without recorded pronunciation (including pronunciation from other sources in Commons, not only LiLi), and will be sorted by the number of wiktionaries with a corresponding language section describing this word (not just a number of interwikis). From my experience, this approach produces long stable lists of lemmas, more useful in Wiktionary context than the classic frequency lists, which tend to cover 20k lemmas at most, usually include inflected forms, and are of poor quality or non-existing for most of the languages.
BTW, there are as many as 185000 recordings in French, but still many basic words have no French recording at all, for example, "centreuropéen" or "Grégoire", because everybody records "eau" and "chien". It looks like a waste of time of many people. In fact, only 2/3 of French lemmas in Polish Wiktionary have the pronunciation recorded, even if we have just 26000 French lemmas. It's much worse in the case of less covered languages. I believe the lack of regularly refreshed lists of needed audio is a major block. Let me fix this first.
In general, I'm not good at Python. I use Java, JavaScript, and TypeScript every day at work, but I tried Python only once or twice in AI competitions on Kaggle.com. I would need probably a lot of learning to be able to contribute to your bot. Maybe in the future... Olaf (talk) 01:14, 25 February 2021 (UTC)
- This discussion should be moved to the LinguaLibre:Technical board. Would you agree to this?
- Hello @Olaf,
- Missing recordings: Thank you for tackling this "missing gaps" issue. It's indeed a topic of concern for us all, Wikt and LL. While we have 109 languages, only 22 have 2000 audios or more. Most languages approach recording sessions with an easygoing, topic-of-interest approach, which is more pleasant for the speaker. This causes very irregular coverage and leaves numerous gaps that are hard to fill.
- Bots policy ?: We currently have no policy on bots; we are leading by practice (Poslovitch, you) and building as we go. As long as you test respectfully, it's OK I guess. Tip: administrators have a batch revert tool if the need arises. :) There is also a WikiAPI JS bot on the rise, which could be interesting for you to explore if you like JS more. The maintainer is highly active and added a suggested new function within 24 hours, which was impressive.
- List types: If you want to propose recording lists to users, there are two different approaches:
- Canonical lists: Japanese JLPT, Chinese HSK, SWADESH, UNILEX frequency lists. → create a List:{iso}/{list name}. Referent: Yug
- Dynamic lists: places near you, Category:Fruit on English Wikipedia, etc. → create a Sparql query. Referents: VIGNERON (Sparql), then Poslovitch (implementation on github).
- RecordWizard queries button ?: I'm not sure, but your project may be more in line with dynamic lists fetched via Sparql queries. It would take a few months, but we can head toward adding a "Record words without pronunciation" button to the RecordWizard (github), shown when we load a list.
- Others: The split between canonical lists and dynamic lists has long been a de facto situation within RecordWizard. To be honest, it's now emerging as a needed policy. As I'm keeping an eye on the Marathi lists, it becomes clear that we will have to put in place better practices, guidelines, and mentoring of new users for list creation. Yug (talk) 10:45, 25 February 2021 (UTC)