LinguaLibre

Difference between revisions of "Interested communities"

Interested communities gathers and shares pointers to linguistic communities or actors who have expressed interest in recording with LinguaLibre. These may be individuals or organisations from Wikimedia, academia, civil-society associations, or cultural activism. The pointers are shared below so that these valuable contacts are not lost through a low 'bus factor' on our side. Emails must not be displayed, but organisation names, individual names, and links to webpages and discussions are welcome.

 
(40 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{#SUBTITLE:'''Interested communities''' gather and share pointers toward linguistic communities who expressed interest toward Lingualibre recording. Those communities could be individuals or organisations, from Wikimedia, academia, civilian associations and cultural activists. These pointers are shared below in order to avoid loss of those valuable contacts via a low [[:en:Bus factor]] on our side. Emails are not to be displayed, but organisation, individual names and links to webpages and discussion are welcome. }}
+
{{#SUBTITLE:'''Interested communities''' gathers and shares pointers to linguistic communities or actors who have expressed interest in recording with LinguaLibre. These may be individuals or organisations from Wikimedia, academia, civil-society associations, or cultural activism. The pointers are shared below so that these valuable contacts are not lost through a low 'bus factor' on our side. Emails must not be displayed, but organisation names, individual names, and links to webpages and discussions are welcome.}}
 +
 
 +
 
  
 
{| class="wikitable sortable"
 
{| class="wikitable sortable"
Line 8: Line 10:
 
| South America  || (1) Native American > Surui : srn || Aquaverde || Lili: Yug ; Association AquaVerde: Almir Surui, Thomas Pizer. || Ongoing, see [[:meta:LinguaLibre/Atelier_de_formation_à_LinguaLibre_pour_le_Surui/en]]
 
| South America  || (1) Native American > Surui : srn || Aquaverde || Lili: Yug ; Association AquaVerde: Almir Surui, Thomas Pizer. || Ongoing, see [[:meta:LinguaLibre/Atelier_de_formation_à_LinguaLibre_pour_le_Surui/en]]
 
|-  
 
|-  
| Europe || (1) French Sign Language || ? || Lili: Yug ; Toulouse: ||  
+
| Europe || (1) Catalan || ? || Lili: Yug ; Perpignan: Susanna Peidro I Sutil || Catalan-language activist in France, highly motivated.
* Met at LREC<ref name="LREC">[https://etherpad.wikimedia.org/p/LREC Pad LREC 2022]</ref>
+
|-
 +
| World || (0) Wiki Journalist || Wikimedia || [[:meta:User_talk:Uprising_Man|User:Uprising Man]] || African Wikimedian with journalism skills, willing to co-author an article on sign language with Yug.
 +
|-
 +
| Europe || (1) French Sign Language || ? || Lili: Yug ; Toulouse: [[:meta:User:Seejayer|User:Seejayer]] ||
 +
* Created Sign2Sign peertube ([https://peertube.s2s.video/about/instance about], [http://wiki.s2s.video/index.php/Charte Charte])
 +
* Met at Forom<ref name="LREC">[https://etherpad.wikimedia.org/p/LREC Pad LREC 2022]</ref>
 +
|-
 +
| ? || (?) Sign Languages || DeepMind, Berkeley AI || Kayo Yin (en/fr/ja/zh) || * https://kayoyin.github.io
 +
|-
 +
| World || (8) Sign Languages || Wikisigns || ? || http://wikisigns.org / tw: @wikisigns
 +
|-
 +
| West || (3) English/American Sign Languages || [[:en:WP:WikiProject Deaf]] || ? || 2022.09: Minimal contact [[:meta:Talk:Deaf Wikimedians]]
 +
|-
 +
| Central Asia || (1) Kazakh Sign Language || [https://nu.edu.kz Nazarbayev University] || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Africa || (1) Ghana Sign Language || Special Education Department at the University of Education, Winneba<br>Wikimedia Ghana || || [https://diff.wikimedia.org/2022/09/06/sign-language-and-wikipedia-ghanaian-hearing-impaired-students-undergo-training-on-wiki-projects/ Sign Language and Wikipedia: Ghanaian hearing-impaired students undergo training on Wiki projects]
 +
|-
 +
| Europe || (1) Ladino || CollectivaT, Barcelona || Lili:Yug, CollectivaT: Alp on CollectivaT-dev.cat ||
 +
* Met at LREC<ref name="LREC" />
 +
* Small team who created a Ladino translation tool and text-to-speech
 +
* Vivid example of what we could do: machine learning-based text-to-speech.
 +
* Possible partner for {{tl|Grants table}} "Alliance fund".
 +
* Summary: Highly advanced project on an endangered Jewish diaspora language with 2000 speakers. Funded by Europe and technically as good as Gascons or better. They also use Tacotron2, a neural text-to-speech model from Google, in their translation and text-to-speech system.
 +
* Website: https://data.sefarad.com.tr : CC data !
 +
** Translate and t2s: https://translate.sefarad.com.tr
 +
** Uses Tacotron2 !
 +
* Github: https://github.com/CollectivaT-dev/
 +
** /judeo-espanyol-resources/blob/main/resources/dictionaries/diksionaryo_ladino_espanyol.txt
 +
* Team (size): few people.
 +
|-
 +
| Cameroon || 200 || Wikimedia Cameroun || ArnoBOUJIKA || microfi, lingualibre,
 +
|-
 +
| Rwanda || Languages (3) || Wikimedia Rwanda || Cnyirahabihirwe123 - WMRD cofounder || Any help.
 +
|-
 +
| Europe || (1) Breton || Bretagne numerique  || David Lesvenan, Laurence Le Goff || Need: workshop support.
 +
|-
 +
| Europe || (1) Breton || Research center IRISA.fr || Lili: [[User:VIGNERON|VIGNERON]]; IRISA: Annie Foret ||
 +
# Annie Foret at IRISA: works on a syntactic-tree annotator for Breton.<br>Could organise audio documentation of Breton in her lab in Rennes.<br>Phase 1: write an email and ask for 1000+30 words for Lingualibre.
 +
# Dastum: collection of Breton songs in audio (copyrighted).
 +
# Mélanie Jouitteau, CNRS linguist: state of the art of Breton resources
 +
#* arbres.iker.cnrs.fr
 +
#* Wikigrammaire du Breton
 +
|-
 +
| Europe || (1) Tatar || ? || ? ||
 +
* Corpus: https://www.corpus.tatar/stat_en.htm
 
|-
 
|-
 
| Sub-Sahara Africa || (1) Kenya > [[:en:Kikuyu language]]/[[Q329]] || Wikimedia || Lili: Yug ↔ [[User:Ngangaesther|Ngangaesther]] [[:en:User_talk:Ngangaesther#Follow_up_!]] ||  
 
| Sub-Sahara Africa || (1) Kenya > [[:en:Kikuyu language]]/[[Q329]] || Wikimedia || Lili: Yug ↔ [[User:Ngangaesther|Ngangaesther]] [[:en:User_talk:Ngangaesther#Follow_up_!]] ||  
 
* Wordlist: no wordlist → translation suggested.
 
* Wordlist: no wordlist → translation suggested.
 
|-
 
|-
| N. Africa and Middle East || Palestinian Arabic research (7) || ? || ? ||  
+
| N. Africa and Middle East || (7) Arabic languages || Palestinian Arabic research || Mustafa Jarrar ||  
 +
* Libyan - Sudanese - Yemeni - Palestinian - Levantine - Iraqi - Egyptian (?)
 +
* Hopefully make a presentation to them tomorrow.
 
* Met at LREC<ref name="LREC" />
 
* Met at LREC<ref name="LREC" />
 
|-
 
|-
| Southern Asia || (17) Indian languages || ? || ? ||  
+
| Southern Asia || (17) Indian languages || Universal Knowledge Core (UKC)<br>University of Trento<br>India<br>Europe || UKC: Nandu Chandra Nair PhD || 
 +
* Site: http://Ukc.disi.unitn.it
 +
* Send an email to start recording and highlight diversity.
 +
* Video: [https://www.youtube.com/watch?v=toWaSF2UezU IndoUKC: a Concept-Centered Indian Multilingual Lexical Resource]
 
* Met at LREC<ref name="LREC" />
 
* Met at LREC<ref name="LREC" />
 
|-
 
|-
| Central Asia || (1) Kazakh Sign Language || ? || ? ||  
+
| Southern Asia || (1+) Tibetans || Cambridge, SOAS, Dublin. || ||  
* Met at LREC<ref name="LREC" />
+
* Repository: https://github.com/lothelanor/actib
 
|-
 
|-
 
| Eastern Asia || (17) Taiwan Aboriginal Languages || Center for Aboriginal Studies, NCCU || vickylin771015 (2016) or Ûi-iū Kán <iyumu> (2018-present) ||  
 
| Eastern Asia || (17) Taiwan Aboriginal Languages || Center for Aboriginal Studies, NCCU || vickylin771015 (2016) or Ûi-iū Kán <iyumu> (2018-present) ||  
Line 30: Line 82:
 
File:Progress of the Taiwanese Aboriginal Languages Wikipedias-2019.pdf|2019
 
File:Progress of the Taiwanese Aboriginal Languages Wikipedias-2019.pdf|2019
 
</gallery>
 
</gallery>
 +
|-
 +
| Eastern Asia || (1) Ancient Korean || ? || Park Chanjun ||
 +
* Neural Machine Translation
 +
* Repository: https://parkchanjun.github.io
 +
* Site: kunmt.org
 
|-
 
|-
 
| Global actors || (1000s) multiple, native Americans || SIL || Aaron_Hemphill || See also [[Lingualibre:Apps]]
 
| Global actors || (1000s) multiple, native Americans || SIL || Aaron_Hemphill || See also [[Lingualibre:Apps]]
 +
|-
 +
| Global actors || (1000s) multiple || [https://panlex.org Panlex.org] || ? || Seems to be web scraping or low-quality data.
 +
 +
|-
 +
| Global actors || (1000s) multiple || Google || WMFR: [[User:Adélaïde Calais WMFr]] ; Google: Daan van Esch ||
 +
|-
 +
| Global actors || (100s?) multiple || Facebook || WMFR/FB: [[User:Exilexi]] || Wikimedian and Facebook employee in Paris. Knows the i18n team.
 +
|-
 +
| Global actors || (10s) multiple || Endangered and Lesser-resourced Languages in Eurasia (EURALI) || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Global actors || (10s?) multiple || International Standard Language Resource Number (ISLRN, www.islrn.org) || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
 +
|-
 +
| Global actors || (10s?) multiple || Global Alliance for Lexicography (Globalex) || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Global actors || (10s?) multiple || [https://elex.is ELEXIS]: European Lexicographic Infrastructure || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Global actors || (10s?) multiple ||  NexusLinguarum – European network for Web-centred linguistic data science || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Global actors || (10s?) multiple || [https://euralex2022.ids-mannheim.de/ EURALEX Conference]: || ? ||
 +
* Met at LREC<ref name="LREC" />
 +
|-
 +
| Global actors || (375) multiple || [https://cls.corpora.uni-leipzig.de Corpora by University of Leipzig] || ? ||
 +
* Contains 375 languages and far more corpora, extracted from online resources, including wikipedias.
 +
* Re-run periodically (!)
 +
* Partly copyrighted, partly CC-BY.
 +
* [https://wortschatz.uni-leipzig.de/en/download/ Download page] has CC-BY sentence corpora, from which frequency lists can be created.
 +
|-
 +
| Global actors || (1001) multiple || [https://github.com/unicode-org/unilex UNILEX], Google/UNICODE's freelance || Lili: Yug ; Unilex: ? ||
 +
* Contains 1001 languages and their frequency lists
 +
* MIT-like license.
 +
* One-shot, barely maintained.
 +
|-
 +
| Global actors || (130+) multiple || [https://universaldependencies.org universaldependencies.org] || No contact ||
 +
* Collects researchers' treebanks.
 +
* Has 130 languages
 +
* Could include a few rare languages
  
 
|-
 
|-
| Global actors || (1000s) multiple || Google || WMFR: [[User:User:Adélaïde Calais WMFr]] ; Google: Daan van Esch ||
+
| Global actors || (?) multiple || [[:meta:Oral Culture Transcription Toolkit]] || [[:meta:Amrit Sufi|Amrit Sufi]] ||  
 +
Has documentation for consensual recording with local speakers.
 
|-
 
|-
| Global actors || (100s?) multiple || Facebook || WMFR/FB: [[User:Exilexi]] ||  
+
| France || (1) French, other? || [https://didac-ressources.eu/2021/03/24/massalia-vox-tiers-lieu-inclusif-notre-nouvelle-adresse/ Massalia VoX] || [[:meta:Special:EmailUser/FiloSophie|@FiloSophie]] || Comment: French association focused on diversity and languages; can provide rentable rooms for recording sessions. See [http://didac-ressources.eu/wp-content/uploads/2021/03/espaces-et-grille-tarifaire-massaliavoX.pdf Room rental].<br>'''Address:''' Massalia VoX, 15 boulevard de la liberté, Marseille https://goo.gl/maps/1PhRX4b6EJK3xoWb8
 +
|-
 +
| France || (?) multiple || Sorosoro [https://www.sorosoro.org/le-programme-sorosoro/le-conseil-scientifique/ Team] || [[:fr:Rozenn Milin|Rozenn Milin]] || [https://www.sorosoro.org/mentions-legales/ CC-BY-NC-ND]
 +
|-
 +
| Global actors || (1000+) multiple || Facebook<br>* [https://ai.meta.com/blog/multilingual-model-speech-recognition/ Introducing speech-to-text, text-to-speech, and more for 1,100+ languages]<br>* [https://huggingface.co/spaces/mms-meta/MMS MMS: Scaling Speech Technology to 1000+ languages demo] || || ?
 
|}
 
|}
 +
 +
== Researchers ==
 +
:''Below are researchers who have not expressed interest but could be interested. See also [[LinguaLibre talk:Citations]].''
 +
* [https://fr.linkedin.com/in/karenfort Karën Fort] : Associate professor (maîtresse de conférences HDR) in computer science, specialising in natural language processing, resource creation, and NLP ethics.
 +
 +
Word lists by Google / Unilex researchers
 +
* https://research.google/pubs/pub47206/ for mining wordlists (Unilex-style) from 2,000+ languages
 +
** Prasad, Manasa; Breiner, Theresa; Esch, Daan van (2018). "Mining Training Data for Language Modeling across the World's Languages" (PDF). Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2018).
 +
* https://research.google/pubs/pub46952/ cleaning them up;
 +
** Chua, Mason; Esch, Daan van; Coccaro, Noah; Cho, Eunjoon; Bhandari, Sujeet; Jia, Libin (2018). "Text Normalization Infrastructure that Scales to Hundreds of Language Varieties". Proceedings of the 11th edition of the Language Resources and Evaluation Conference.
 +
* https://arxiv.org/abs/2103.15845 open-sourced;
 +
** Zupon, Andrew; Crew, Evan; Ritchie, Sandy (2021-03-29). "Text Normalization for Low-Resource Languages of Africa". arXiv:2103.15845 [cs].
 +
* https://research.google/pubs/pub49814/ using these wordlists to find sentences using our web crawler
 +
** Caswell, Isaac; Breiner, Theresa; Esch, Daan van; Bapna, Ankur (2020). "Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus".
 +
* https://research.google/pubs/pub50211/ cleaning up web-crawled text
 +
** Kreutzer, Julia; Caswell, Isaac; Wang, Lisa; Wahab, Ahsan; Esch, Daan van; Ulzii-Orshikh, Nasanbayar; Tapo, Allahsera Auguste; Subramani, Nishant; Sokolov, Artem; Sikasote, Claytone; Setyawan, Monang (2022). "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets". TACL.
 +
* https://arxiv.org/abs/2205.03983 building machine translation systems from them
 +
** Bapna, Ankur; Caswell, Isaac; Kreutzer, Julia; Firat, Orhan; van Esch, Daan; Siddhant, Aditya; Niu, Mengmeng; Baljekar, Pallavi; Garcia, Xavier; Macherey, Wolfgang; Breiner, Theresa (2022-05-16). "Building Machine Translation Systems for the Next Thousand Languages". arXiv:2205.03983 [cs].
 +
* https://ai.googleblog.com/2022/05/24-new-languages-google-translate.html blog post
 +
** "Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate". Google AI Blog. Retrieved 2022-06-30.
 +
 +
== To process ==
 +
* [https://en.wikiversity.org/wiki/OpenSpeaks OpenSpeaks] on Wikiversity.
  
 
== See also ==
 
== See also ==
Line 44: Line 171:
 
* [[LinguaLibre:Hackathon‎]]
 
* [[LinguaLibre:Hackathon‎]]
 
* [[Help:List translation]]
 
* [[Help:List translation]]
 +
* https://wikitongues.org/cohort/
 +
{{Helps}}
  
 
== References ==
 
== References ==
 
<references />
 
<references />

Latest revision as of 08:35, 8 January 2024



Region || Language(s) || Organisation || Contact || Project and Comments
Northern America || (17) Native Americans || SIL || _ || _
South America || (1) Native American > Surui : srn || Aquaverde || Lili: Yug ; Association AquaVerde: Almir Surui, Thomas Pizer. || Ongoing, see meta:LinguaLibre/Atelier_de_formation_à_LinguaLibre_pour_le_Surui/en
Europe || (1) Catalan || ? || Lili: Yug ; Perpignan: Susanna Peidro I Sutil || Catalan-language activist in France, highly motivated.
World || (0) Wiki Journalist || Wikimedia || User:Uprising Man || African Wikimedian with journalism skills, willing to co-author an article on sign language with Yug.
Europe || (1) French Sign Language || ? || Lili: Yug ; Toulouse: User:Seejayer ||
? || (?) Sign Languages || DeepMind, Berkeley AI || Kayo Yin (en/fr/ja/zh) || https://kayoyin.github.io
World || (8) Sign Languages || Wikisigns || ? || http://wikisigns.org / tw: @wikisigns
West || (3) English/American Sign Languages || en:WP:WikiProject Deaf || ? || 2022.09: Minimal contact meta:Talk:Deaf Wikimedians
Central Asia || (1) Kazakh Sign Language || Nazarbayev University || ? ||
Africa || (1) Ghana Sign Language || Special Education Department at the University of Education, Winneba; Wikimedia Ghana || || Sign Language and Wikipedia: Ghanaian hearing-impaired students undergo training on Wiki projects
Europe || (1) Ladino || CollectivaT, Barcelona || Lili:Yug, CollectivaT: Alp on CollectivaT-dev.cat ||
  • Met at LREC[1]
  • Small team who created a Ladino translation tool and text-to-speech
  • Vivid example of what we could do: machine learning-based text-to-speech.
  • Possible partner for {{Grants table}} "Alliance fund".
  • Summary: Highly advanced project on an endangered Jewish diaspora language with 2000 speakers. Funded by Europe and technically as good as Gascons or better. They also use Tacotron2, a neural text-to-speech model from Google, in their translation and text-to-speech system.
  • Website: https://data.sefarad.com.tr : CC data !
  • Github: https://github.com/CollectivaT-dev/
    • /judeo-espanyol-resources/blob/main/resources/dictionaries/diksionaryo_ladino_espanyol.txt
  • Team (size): few people.
Cameroon || 200 || Wikimedia Cameroun || ArnoBOUJIKA || microfi, lingualibre,
Rwanda || Languages (3) || Wikimedia Rwanda || Cnyirahabihirwe123 - WMRD cofounder || Any help.
Europe || (1) Breton || Bretagne numerique || David Lesvenan, Laurence Le Goff || Need: workshop support.
Europe || (1) Breton || Research center IRISA.fr || Lili: VIGNERON; IRISA: Annie Foret ||
  1. Annie Foret at IRISA: works on a syntactic-tree annotator for Breton.
    Could organise audio documentation of Breton in her lab in Rennes.
    Phase 1: write an email and ask for 1000+30 words for Lingualibre.
  2. Dastum: collection of Breton songs in audio (copyrighted).
  3. Mélanie Jouitteau, CNRS linguist: state of the art of Breton resources
    • arbres.iker.cnrs.fr
    • Wikigrammaire du Breton
Europe || (1) Tatar || ? || ? ||
Sub-Sahara Africa || (1) Kenya > en:Kikuyu language/Gikuyu (Q329) || Wikimedia || Lili: Yug ↔ Ngangaesther en:User_talk:Ngangaesther#Follow_up_! ||
  • Wordlist: no wordlist → translation suggested.
N. Africa and Middle East || (7) Arabic languages || Palestinian Arabic research || Mustafa Jarrar ||
  • Libyan - Sudanese - Yemeni - Palestinian - Levantine - Iraqi - Egyptian (?)
  • Hopefully make a presentation to them tomorrow.
  • Met at LREC[1]
Southern Asia || (17) Indian languages || Universal Knowledge Core (UKC); University of Trento; India; Europe || UKC: Nandu Chandra Nair PhD ||
Southern Asia || (1+) Tibetans || Cambridge, SOAS, Dublin. || ||
Eastern Asia || (17) Taiwan Aboriginal Languages || Center for Aboriginal Studies, NCCU || vickylin771015 (2016) or Ûi-iū Kán <iyumu> (2018-present) ||
  • Site: https://web.alcd.center
  • Center for Aboriginal Studies, National Chengchi University (NCCU), is an academic team dedicated to Taiwanese aboriginal studies, maintaining and animating 16 Wikipedias in native languages. Founded in 1999, initially “Center for Aboriginal Languages, Cultures and eDucation” (ALCD).
Eastern Asia || (1) Ancient Korean || ? || Park Chanjun ||
Global actors || (1000s) multiple, native Americans || SIL || Aaron_Hemphill || See also Lingualibre:Apps
Global actors || (1000s) multiple || Panlex.org || ? || Seems to be web scraping or low-quality data.
Global actors || (1000s) multiple || Google || WMFR: User:Adélaïde Calais WMFr ; Google: Daan van Esch ||
Global actors || (100s?) multiple || Facebook || WMFR/FB: User:Exilexi || Wikimedian and Facebook employee in Paris. Knows the i18n team.
Global actors || (10s) multiple || Endangered and Lesser-resourced Languages in Eurasia (EURALI) || ? ||
Global actors || (10s?) multiple || International Standard Language Resource Number (ISLRN, www.islrn.org) || ? ||
Global actors || (10s?) multiple || Global Alliance for Lexicography (Globalex) || ? ||
Global actors || (10s?) multiple || ELEXIS: European Lexicographic Infrastructure || ? ||
Global actors || (10s?) multiple || NexusLinguarum – European network for Web-centred linguistic data science || ? ||
Global actors || (10s?) multiple || EURALEX Conference || ? ||
Global actors || (375) multiple || Corpora by University of Leipzig || ? ||
  • Contains 375 languages and far more corpora, extracted from online resources, including wikipedias.
  • Re-run periodically (!)
  • Partly copyrighted, partly CC-BY.
  • Download page has CC-BY sentence corpora, from which frequency lists can be created.
Global actors || (1001) multiple || UNILEX, Google/UNICODE's freelance || Lili: Yug ; Unilex: ? ||
  • Contains 1001 languages and their frequency lists
  • MIT-like license.
  • One-shot, barely maintained.
Global actors || (130+) multiple || universaldependencies.org || No contact ||
  • Collects researchers' treebanks.
  • Has 130 languages
  • Could include a few rare languages
Global actors || (?) multiple || meta:Oral Culture Transcription Toolkit || Amrit Sufi || Has documentation for consensual recording with local speakers.
France || (1) French, other? || Massalia VoX || @FiloSophie || Comment: French association focused on diversity and languages; can provide rentable rooms for recording sessions. See Location de salle (room rental). Address: Massalia VoX, 15 boulevard de la liberté, Marseille https://goo.gl/maps/1PhRX4b6EJK3xoWb8
France || (?) multiple || Sorosoro Team || Rozenn Milin || CC-BY-NC-ND
Global actors || (1000+) multiple || Facebook: "Introducing speech-to-text, text-to-speech, and more for 1,100+ languages"; "MMS: Scaling Speech Technology to 1000+ languages" demo || || ?

Researchers

Below are researchers who have not expressed interest but could be interested. See also LinguaLibre talk:Citations.
  • Karën Fort : Associate professor (maîtresse de conférences HDR) in computer science, specialising in natural language processing, resource creation, and NLP ethics.

Word lists by Google / Unilex researchers

  • https://research.google/pubs/pub47206/ for mining wordlists (Unilex-style) from 2,000+ languages
    • Prasad, Manasa; Breiner, Theresa; Esch, Daan van (2018). "Mining Training Data for Language Modeling across the World's Languages" (PDF). Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2018).
  • https://research.google/pubs/pub46952/ cleaning them up;
    • Chua, Mason; Esch, Daan van; Coccaro, Noah; Cho, Eunjoon; Bhandari, Sujeet; Jia, Libin (2018). "Text Normalization Infrastructure that Scales to Hundreds of Language Varieties". Proceedings of the 11th edition of the Language Resources and Evaluation Conference.
  • https://arxiv.org/abs/2103.15845 open-sourced;
    • Zupon, Andrew; Crew, Evan; Ritchie, Sandy (2021-03-29). "Text Normalization for Low-Resource Languages of Africa". arXiv:2103.15845 [cs].
  • https://research.google/pubs/pub49814/ using these wordlists to find sentences using our web crawler
    • Caswell, Isaac; Breiner, Theresa; Esch, Daan van; Bapna, Ankur (2020). "Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus".
  • https://research.google/pubs/pub50211/ cleaning up web-crawled text
    • Kreutzer, Julia; Caswell, Isaac; Wang, Lisa; Wahab, Ahsan; Esch, Daan van; Ulzii-Orshikh, Nasanbayar; Tapo, Allahsera Auguste; Subramani, Nishant; Sokolov, Artem; Sikasote, Claytone; Setyawan, Monang (2022). "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets". TACL.
  • https://arxiv.org/abs/2205.03983 building machine translation systems from them
    • Bapna, Ankur; Caswell, Isaac; Kreutzer, Julia; Firat, Orhan; van Esch, Daan; Siddhant, Aditya; Niu, Mengmeng; Baljekar, Pallavi; Garcia, Xavier; Macherey, Wolfgang; Breiner, Theresa (2022-05-16). "Building Machine Translation Systems for the Next Thousand Languages". arXiv:2205.03983 [cs].
  • https://ai.googleblog.com/2022/05/24-new-languages-google-translate.html blog post
    • "Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate". Google AI Blog. Retrieved 2022-06-30.

To process

See also



References