User talk
Difference between revisions of "Yug"
Revision as of 13:35, 1 October 2020
dev
Hi Yug,
A quick reminder that for testing there is the development instance https://v2.lingualibre.fr, where you can run all the experiments you want and apply them here only once they are fully ready. That will keep people from stumbling on odd things when they load the site at the wrong moment ;).
Cheers -- 0x010C ~talk~ 15:45, 26 December 2018 (UTC)
PS: and don't leave unneeded pages lying around (MediaWiki:Recordwizard); in this case it notably blocks any future changes and translations pushed by translatewiki.
- Phew! You really caught it !!! *0* !!! I went as fast as I could !!!! Thanks for your vigilance; everything looked properly restored to me, unless I was blinded by the cache!... Ah, v2, that's right! Can you make me an admin there? I'd like to test image support in https://v2.lingualibre.fr/wiki/MediaWiki:Sidebar ! Yug (talk) 20:09, 26 December 2018 (UTC)
- PS: I don't really understand these translation questions / pages / tags... Yug (talk) 20:10, 26 December 2018 (UTC)
About
Your intro and the history are fine on LinguaLibre:About, but I removed what has no place on that page (so that it stays reasonably professional / clean / efficient). You can find the content in the page history. -- 0x010C ~talk~ 21:59, 26 December 2018 (UTC)
- OK, cool! I pulled that from GitHub, where I'm doing some cleanup. I haven't decided yet where to put this stuff on LinguaLibre... I'll keep you posted.
- On GitHub, the repository listes can be deleted! Yug (talk) 18:08, 28 December 2018 (UTC)
Ongoing work
Get your data:
$ git clone git@github.com:hermitdave/FrequencyWords.git
Then save the following as dave-to-LL.mk in the root of your working directory:
# RUN:
# make -f dave-to-LL.mk iso2=pl iso3=pol processing   # to do the work
# make -f dave-to-LL.mk iso2=pl iso3=pol all          # to do the work AND print few messages
iso2="pl"
iso3="pol"

all: processing messages

processing:
	sed -E 's/ [0-9]+$$//g' $(iso2)_50k.txt | sed -E 's/^/# /g' > $(iso2)-words-LL.txt
	split -d -l 2000 --additional-suffix=".txt" $(iso2)-words-LL.txt "$(iso3)-words-by-frequency-"

messages:
	head -n 5 $(iso2)_50k.txt
	head -n 5 $(iso2)-words-LL.txt
	head -n 5 "$(iso3)-words-by-frequency-00.txt"
	head -n 5 "$(iso3)-words-by-frequency-01.txt"
	wc -l $(iso2)-words-LL.txt
	wc -l "$(iso3)-words-by-frequency-01.txt"
Then find your {iso2}_50k.txt file, put it in the same folder as dave-to-LL.mk, and run the command below with the iso2 and iso3 values you need:
make -f dave-to-LL.mk iso2=pl iso3=pol processing
Yug (talk) 18:17, 28 December 2018 (UTC)
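To see what the `processing` target produces, here is a minimal sketch of its two steps as a shell function, run on a tiny made-up sample. The function name `freq_to_ll`, the sample words, the single output prefix, and the chunk size of 2 are assumptions for illustration only; `split --additional-suffix` needs GNU coreutils.

```shell
# Hypothetical helper mirroring the Makefile's `processing` target.
# $1 = frequency file ("word count" per line)
# $2 = output prefix, $3 = number of lines per chunk
freq_to_ll() {
  # Drop the trailing " <count>" and prefix each word with "# ".
  sed -E 's/ [0-9]+$//' "$1" | sed -E 's/^/# /' > "$2-words-LL.txt"
  # Cut the list into numbered chunks: <prefix>-words-by-frequency-00.txt, -01.txt, ...
  split -d -l "$3" --additional-suffix=".txt" "$2-words-LL.txt" "$2-words-by-frequency-"
}

# Demo on a made-up three-word sample (not real FrequencyWords data).
tmp=$(mktemp -d)
printf 'nie 100\nto 90\nsie 80\n' > "$tmp/sample_50k.txt"
freq_to_ll "$tmp/sample_50k.txt" "$tmp/pol" 2
cat "$tmp/pol-words-by-frequency-00.txt"
```

With a chunk size of 2, the first two words land in the `-00.txt` file and the third in `-01.txt`; the Makefile does the same with 2000 lines per file.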
Subtlex
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
sed -E 's/(,[0-9]+.?[0-9]*)+//g' $iso2-words.txt | tail -n+4 | head -n 20000 | sed -E 's/^/# /g' > $iso2-words-LL.txt
split -d -l 2000 --additional-suffix=".txt" $iso2-words-LL.txt "$iso3-words-by-frequency-"
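The `sed` expression above removes the numeric columns and keeps only the word. A minimal sketch of that single step follows, on made-up rows; the function name `strip_numeric_columns` and the sample data are assumptions, and the decimal point is escaped here (the original's bare `.` matches any character), which is a small deviation from the command as posted.

```shell
# Hypothetical helper: keep only the word column of "word,num,num,..." lines.
strip_numeric_columns() {
  # Remove every run of ",<digits>[.<digits>]" columns.
  sed -E 's/(,[0-9]+\.?[0-9]*)+//g'
}

# Made-up rows, not real SUBTLEX-CH data.
printf 'alpha,120,3.57,2.08\nbeta,7,0.21,0.85\n' | strip_numeric_columns
```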
Feature idea: table tracking existing languages on LinguaLibre.fr
I find it hard to keep track of all the languages I helped add to LinguaLibre. Taiwan alone has 16 languages and 42 local variations. Maybe this already exists... If not, it would be good to have a sortable table such as the one below:
Wikidata qid | LinguaLibre qid | English name | Language group | Active? | Number of recordings |
---|---|---|---|---|---|
Q715766 | Atayal (Q51302) | Atayal | Taiwanese | Low | 4 |
Q718269 | Sakizaya (Q51871) | Sakizaya | Taiwanese | Low | 6 |
... | ... | .... | ... | ... | ... |
Yug (talk) 12:16, 31 December 2018 (UTC)
- I'm finding out how LinguaLibre:Stats is coded; maybe I will be able to produce something :D Yug (talk) 12:59, 31 December 2018 (UTC)
WP query and frequency
Helper : https://codepen.io/hugolpz/pen/ByoKOK
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
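The counting half of that pipeline can be tried offline. Below is a minimal sketch that reads text from stdin instead of calling the Wikipedia API; the function name `word_freq` and the sample sentence are assumptions, and the `grep` step that drops empty lines is an addition not present in the command above.

```shell
# Hypothetical helper: count space-separated tokens read from stdin.
word_freq() {
  tr ' ' '\n' |        # one token per line (same as tr '\040' '\012')
    grep -v '^$' |     # drop empty lines left by repeated spaces (an addition)
    sort | uniq -c |   # group identical tokens and count them
    sort -k 1,1 -n -r  # most frequent first
}

# Made-up input standing in for an article extract.
printf 'dragon fire dragon wing dragon fire\n' | word_freq
```

Each output line is a count followed by a token, padded with leading spaces by `uniq -c`.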
Tamil
- https://mobile.twitter.com/IyalMozhi/status/1265550815051350019 Yug (talk) 10:31, 27 May 2020 (UTC)
Clean up corpus
find -iname 'fra-opus-100k.txt' -exec cat {} \; | grep -o '\w*' | sed -e "s/^c$/ce/gi" -e "s/^d$/du/gi" -e "s/^j$/je/gi" -e "s/^m$/moi/gi" -e "s/^n$/ne/gi" -e "s/^qu$/que/gi" -e "s/^s$/se/gi" -e "s/^t$/toi/gi" -e "/^[0-9]*$/d" | awk '/^[[:upper:]][^[:upper:]]*$/{$1=tolower(substr($1,1,1)) substr($1,2)}1' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > fra-subtlex.txt
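The clean-up pipeline above is easier to follow stage by stage. This is a reduced sketch assuming one token per line on stdin (what `grep -o '\w*'` emits); the function name `count_tokens` and the sample tokens are assumptions, only two of the clitic rules are shown, and bracket classes replace the GNU-only `i` flag.

```shell
# Hypothetical helper: normalise tokens (one per line on stdin) and count them.
count_tokens() {
  # 1. Expand elided clitics (subset of the rules above) and drop pure numbers.
  sed -e 's/^[jJ]$/je/' -e 's/^[nN]$/ne/' -e '/^[0-9]*$/d' |
    # 2. Lowercase words whose only capital is the first letter.
    awk '/^[[:upper:]][^[:upper:]]*$/ { $1 = tolower(substr($1, 1, 1)) substr($1, 2) } 1' |
    # 3. Count occurrences of each token.
    awk '{ count[$1]++ } END { for (w in count) print count[w], w }' |
    # 4. Most frequent first.
    sort -n -r -t ' ' -k 1,1
}

# Made-up token stream, not taken from fra-opus-100k.txt.
printf 'Je\nj\nne\n42\nje\nn\n' | count_tokens
```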
cat userlist.txt | sed -e 's/^# //g' | tr "\n" "\|" | tr "\^" "\(" | sed -e "s/^/\\\(/" -e "s/|$/\\\)/g"
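The command above appears to turn a `# `-prefixed user list into one alternation group. A minimal sketch of the same idea using `paste` follows, assuming the intended result is `\(name1|name2\)`; the function name `userlist_to_pattern` and the sample names are hypothetical, and the `^` to `(` translation of the original is left out.

```shell
# Hypothetical helper: join "# name" lines from stdin into \(name1|name2\).
userlist_to_pattern() {
  sed -e 's/^# //' |                   # strip the "# " list marker
    paste -s -d '|' - |                # join all lines with "|"
    sed -e 's/^/\\(/' -e 's/$/\\)/'    # wrap the result in \( ... \)
}

# Made-up user names.
printf '# Alice\n# Bob\n# Carol\n' | userlist_to_pattern
```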
Lingua Libre Story for September 2020
- This is not an official story or newsletter. This is an attempt by a user to share some updates about the program. There might be more stories which I have missed.
September 2020 was an eventful month, with a lot of activity both in uploading new content and in project-related discussion. Here are some of the best stories from September 2020.
- 300,000 files: On 10 September 2020 we completed 300,000 pronunciation uploads. After the launch in August 2018, the first 100,000 files were uploaded by April 2019, and the milestone of 200,000 files was reached in January 2020. As of 30 September 2020 the project has 366 speakers working in 92 languages.
- Maximum number of pronunciations in a month: In September 2020, 23,209 files were uploaded. This is the highest number of files ever uploaded in a single calendar month (the previous highs were 22,963 files in June 2020 and 22,293 files in May 2019).
- Indian language in top 3 list: This month Bengali entered the top three languages by number of files uploaded using Lingua Libre. This is possibly the first time a non-European language, and an Indian one, has reached the top three most-uploaded languages on the project.
- Project chat: Several discussions started in the Chat room, such as Bug testing (you may help), Technical preparations, etc.
That's it. Have a good time. --টিটো দত্ত (Titodutta) (কথা) 13:35, 1 October 2020 (UTC)
This post is under the CC0 license; feel free to share it with anyone, anywhere, without any restriction.