LinguaLibre talk

Citations

Revision as of 15:45, 11 July 2024 by Yug (talk | contribs) (→‎Alphabet)

Massively hyper lingual projects

Network

Glot500

Alphabet

https://research.google/pubs/pub47206/ for mining wordlists (Unilex-style) from 2,000+ languages
https://research.google/pubs/pub46952/ cleaning them up; open-sourced in https://arxiv.org/abs/2103.15845
https://research.google/pubs/pub49814/ using these wordlists to find sentences using our web crawler
https://research.google/pubs/pub50211/ cleaning up web-crawled text
https://arxiv.org/abs/2205.03983 building machine translation systems from them; blog post https://ai.googleblog.com/2022/05/24-new-languages-google-translate.html
https://arxiv.org/abs/2305.13516 Scaling Speech Technology to 1,000+ Languages; demo at https://huggingface.co/spaces/mms-meta/MMS
https://aclanthology.org/2024.lrec-main.331/ review of available language resources
https://aclanthology.org/2022.lrec-1.538/ Writing system and speaker demographics for 2,800+ languages

Facebook

https://ai.meta.com/blog/multilingual-model-speech-recognition/ Introducing speech-to-text, text-to-speech, and more for 1,100+ languages
https://arxiv.org/abs/2305.13516 Scaling Speech Technology to 1,000+ Languages
https://arxiv.org/abs/2305.12182 Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Others

https://www.semanticscholar.org/paper/e4aa101556fc5b238a88d99c07c1055fe3bc4764 Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages
