Events/2021 UNILEX-Lingualibre

https://etherpad.wikimedia.org/p/UNILEX-Lingualibre – Etherpad to clean up.

03/07

Context map

The Wikimedia LinguaLibre online recording app provides:
- an open-source, rapid recording system: 1,000 words or 500 short sentences per hour.
- a multilingual community
- local event organization
- word lists (Unilex)
Unilex 1,000 lists:
- I assume Unilex is stable: it has already done its job very well.
Google Tacotron (/täkōˌträn/) end-to-end speech synthesis system (demo, recent research)
- since 2017, turns 8,000+ recorded audio sentences with their associated texts into a human-like TTS system.
- various open-source implementations on GitHub
Mozilla TTS's deep learning for Text to Speech (github).
Google Translate TTS is frequently poor, e.g. for:
- epo - Esperanto
- cat - Catalan
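
Putting the two figures above side by side (500 short sentences per hour on LinguaLibre, 8,000+ sentences for a Tacotron-style voice) gives a rough idea of the recording effort. This is back-of-the-envelope arithmetic only: the function name is mine, and it assumes a constant recording rate with no pauses, retakes or quality filtering.

```python
def recording_hours(sentences_needed: int, sentences_per_hour: int) -> float:
    """Hours of recording needed at a constant rate (idealized assumption)."""
    if sentences_per_hour <= 0:
        raise ValueError("sentences_per_hour must be positive")
    return sentences_needed / sentences_per_hour


# Figures taken from the notes above: 8,000 sentences at 500 sentences per hour.
hours = recording_hours(8000, 500)
print(f"{hours:.0f} hours of recording")  # prints "16 hours of recording"
```

So a single motivated speaker could in principle record a corpus of that size in a few days of sessions, which is what makes the tool interesting for smaller languages.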

Collaboration avenues

Summer of Code 2021
Deadline	Element
February 20	Mentoring organization application deadline
March 10	Mentoring organizations announced
March 29	Student application period begins
April 13	Student application deadline
May 17 (previously May 4)	Student projects announced
May 17 to June 7 (previously May 4 to May 31)	Community bonding period
June 7 to August 16 (previously June 1 to August 24)	Coding period

This is just exploratory.

Summer of Code
- I assume the hosting institution would be Wikimedia (France), in Paris.
- I have a very limited understanding of the Google Summer of Code process at the moment (will dive in later).
- Project idea 1 (?): a vocabulary-learning app for the general public, fed by LinguaLibre audio and Wikidata (VueJS).
- Project idea 2 (?): coding a TTS (Python). See recent open-source Tacotron2 implementations.
- Project idea 3 (?): a HanDeDict-like user interface for Wikidata lexemes.
- ...
Google-Wikimedia partnership?^[1]
- LinguaLibre provides an open-source, rapid recording tool to feed Tacotron.
- Google pays an intern for organization? Pays minority speakers?

Corpora discussion

Following our discussion, I made a quick new review of online corpora, mainly using OPUS.nlpl.eu as my entry point. You may already know them. They are a corpus-linguistics research center specialized in parallel corpora, but they also provide monolingual corpora in raw-text and tokenized formats. I noticed their data for Wikipedias (/wikipedia.php: 20 languages), Wikipedia Content Translations (/wikimedia.php: 288), Tatoeba (/tatoeba.php: 359), TED (/TED2020.php: 108) and the Bible (/bible-uedin.php: 102).

Wikipedia corpora (288 languages) are "noisy", and this noise can vary with each community's policies. After verification, Wikipedias do cover a few dozen languages that UNILEX does not have. But the more I dig into the raw text corpus, the more uncomfortable I am with it: it is really noisy, the cleanup is hard, and both foreign-language text (English) and MediaWiki items sneak in. If I remember correctly, smaller wikis tend to import some English code (templates) and translate only the UI-facing sentences.

https://opus.nlpl.eu/wikimedia.php > Statistics and TMX/Moses Downloads: pick a small file, e.g. ca.txt.gz
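
To give an idea of what "cleanup" means here, below is a minimal sketch of a line filter that drops lines carrying obvious MediaWiki residue (templates, links, HTML tags, table or heading markers). The patterns, the 0.7 threshold and the function names are my own assumptions for illustration, not something taken from corpuscrawler or OPUS. It does not detect English sentences leaking into another language's corpus; that would need a separate language-identification step.

```python
import re

# Heuristic patterns for MediaWiki residue; illustrative, not exhaustive.
MARKUP = re.compile(r"\{\{|\}\}|\[\[|\]\]|<[^>]+>|^\s*[|!*#=]")


def looks_clean(line: str, min_letter_ratio: float = 0.7) -> bool:
    """Return True if the line has no obvious markup and is mostly letters."""
    text = line.strip()
    if not text or MARKUP.search(text):
        return False
    # Share of characters that are letters or whitespace.
    letters = sum(ch.isalpha() or ch.isspace() for ch in text)
    return letters / len(text) >= min_letter_ratio


def clean_lines(lines):
    """Keep only the lines that pass the heuristic filter, stripped."""
    return [line.strip() for line in lines if looks_clean(line)]


sample = [
    "Barcelona és una ciutat.",
    "{{Infobox settlement",
    "[[Category:Cities]]",
    "| population = 1600000",
    "",
]
print(clean_lines(sample))
```

Even a filter like this needs per-wiki tuning, which is the time-consuming part.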

Tatoeba corpora (359 languages) seem to have very clean data. The content is mainly conversational, made of short sentences, which is interesting for language-teaching purposes.

https://opus.nlpl.eu/Tatoeba.php > Statistics and TMX/Moses Downloads: pick a small file, e.g. ca.txt.gz (raw) and ca.tok.gz (tokenized).

Language IDs in the first row: monolingual plain text files (tokenized)

Language IDs in the first column: monolingual plain text files (untokenized)
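
Once such a file is downloaded, turning it into a Unilex-style frequency list is short. The sketch below assumes a local, UTF-8, gzip-compressed plain text file with one sentence per line (for example the ca.txt.gz mentioned above). The function name and the crude `\w+` tokenization are my own choices; it will split forms such as "l'home" in two, so a real list would need language-aware tokenization.

```python
import gzip
import re
from collections import Counter

# Crude tokenizer: runs of Unicode word characters (letters, digits, underscore).
WORD = re.compile(r"\w+")


def word_frequencies(path: str, top: int = 1000):
    """Return the `top` most frequent lowercased tokens of a gzipped text file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            counts.update(word.lower() for word in WORD.findall(line))
    return counts.most_common(top)
```

Usage would be something like `word_frequencies("ca.txt.gz", top=20)`, which returns a list of `(word, count)` pairs sorted by decreasing count.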

I am skeptical about the value of crawling Wikipedia dumps because of this time-consuming cleaning issue, especially when the corpora above (Tatoeba, TED, Bible) are easier to process, more relevant, and in Tatoeba's case more diverse (359 languages against 288, about 1.25 times as many).

Overall feedback

I haven't yet drawn out all the possibilities resulting from our discussion. But I now assume Unilex has done its work, and that volunteer resources should be spent where they bring the most, both in ROI and in enjoyment. I would suggest the following points:

- Document corpuscrawler better. Objective: an outside volunteer with beginner Python skills should be able to create a crawler.
- Assess well-disposed contacts at Google and Unicode, and the type of help they can provide: knowledge, data, funding, symbolic support, "evangelism" (spreading the news about Lili).
- Assess institutional interest at Google and Unicode: is such an open-license, rapid audio recording tool actually interesting for Google? Do they have alternatives to feed their TTS? Maybe they don't care? Who at Google might care? Etc.

The goal would be to get them to sponsor LinguaLibre events and minority speakers in exchange for clean open data that Google and Unicode's members would be the first to know of. Win-win.

Not now, but I hope to be able to contact the Google Tacotron and Mozilla TTS teams later on.

References

[1] Disclaimer: I am not WM France staff. This is exploratory only.
