Events/2021 UNILEX-Lingualibre

2021, March 4th

Objectives

Medium strategy:

collaborate on github

High strategy

assess if I can connect your (Google/Unicode) community with mine (Wikimedia/LinguaLibre)
assess funding avenues from Google/Unicode/Wikimedia.
assess the possibility of finding an open source maintainer with strong dev abilities: Python, JS, Java, NLP, Git.
see how I can help.

Participants

@Brawer

Trained in NLP in the 90s; Google; past Unicode involvement. Creator of google/corpuscrawler and the UNILEX data. Python and other languages.

Knows benevolent people at Google.
"Frequency read" analysis of Concept

Experienced open source coordinator and volunteer, currently coordinating github.com/Lingua-Libre. I have experience as a forward volunteer contributor, consolidating projects, writing project grant applications, and delegating small-scale development (2 hours to 1 month) to relevant Wikimedian developers, either volunteers or paid depending on the workload. It's the ramp-up of open source. I'm not staff, but I'm as close as you can be to Wikimedia France, which supports LinguaLibre.

Wikimedia

Wikimedia's community is currently investing energy to expand linguistic diversity and resources online. Our emerging goal is to audio-document 500~1,000+ languages over the next 3 years or so, thanks in part to UNILEX's data, which gives us a roadmap to walk for each of 1,000 languages. We are still in the early stage of community building, but we seem to be on a solid growth track, aside from dozens of bottlenecks to kick out.

Context's map

Wikimedia LinguaLibre online recording app provides :
- an Open Source, rapid recording system : 1000 words or 500 short sentences per hour.
- a multilingual community
- local event organization
- word lists (UNILEX)
UNILEX 1,000 lists:
- I assume UNILEX is stable: it has already done its job very well.
Google Tacotron (/täkōˌträn/) end-to-end speech synthesis system (demo, recent research)
- since 2017, converts 8,000+ recorded audio sentences with their associated texts into a human-like TTS system.
- various open source implementations on GitHub
Mozilla TTS's deep learning for Text to Speech (github).
Google Translate TTS is frequently lame...
- epo - Esperanto
- cat - Catalan

Collaborations avenues

Summer of Code 2021
Date	Milestone
February 20	Mentoring organization application deadline
March 10	Mentoring organizations announced
March 29	Student application period begins
April 13	Student application deadline
May 17	Student projects announced
May 17 to June 7	Community bonding period
June 7 to August 16	Coding period

This is just exploratory.

Summer of code (see mw:Google Summer of Code/2021)
- I assume the hosting institution to be Wikimedia (France), in Paris.
- I have a really low level of understanding of the Google Summer of Code process at the moment (will dive in later)
Google-Wikimedia partnership ?^[1]
- LinguaLibre provides an Open Source, rapid recording system tool to feed the Tacotron.
- Wikimedia France provides organizational, mentorship resources.
- Google pays an intern for organization? Pays minority speakers?

Project ideas:
- Project idea 1 (?): General-public word-learning app fed by LinguaLibre audio and Wikidata (VueJS)
- Project idea 2 (?): Coding a TTS (Python). See recent open source Tacotron2 implementations.
- Project idea 3 (?) : HanDeDict-like user interface for Wikidata lexeme.
- Project idea 4 (?) : Refresh & maintain google/corpuscrawler ; add cleaners ; add "submit correction / excluded words" feature per language.

Wikimedia projects

Wikidata Lexeme - still slow and chaotic https://ordia.toolforge.org/language/
LinguaLibre
- LinguaLibre:Stats – still mainly 10~20 major languages, with a long tail of 90~100 more languages.
- LinguaLibre:Events – organizes real-world and online events
- LinguaLibre:NewsRoom (planned) – review, reporting, communication

Issues

Actionable: a WikiDump-to-Extractor pipeline running monthly to extract raw text → requires contacting the dump team.
- By project (`en.wikipedia`) > See phabricator:T276723
  - and « Main topic classifications » (`en.wikipedia`)
Question: how to attack the long tail

Non-Wikimedia projects

American Evangelical Network. https://sil.org
Generate IPA pronunciations https://github.com/brawer/ipa-speaker
Endangered Language Alliance https://elalliance.org NYC-based group collecting audio samples from migrants.

Issues

Info: The Unicode GitHub is not very active.
Info: There is a proposal to add Wikidata Qid to UNICODE language table out there.
Info: Google could "pay people to use the software" (?)
Info: The ICU Unicode library provides word segmentation.
Question: How to attack the long tail.
- Look for new languages? For language x, use rare words of this language to search for websites.
- Wikimedians can be eyes for the crawler project. → how?
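
The "rare word" idea above can be sketched as follows. This is only an illustration under assumptions: it reads "rare word" as a word that is common in language x but absent from the word lists of larger neighbouring languages, and the function name `distinctive_words`, its parameters and the toy frequencies are all hypothetical (they come from neither UNILEX nor corpuscrawler).

```python
from collections import Counter


def distinctive_words(target_freq, other_freqs, top_n=5, min_count=2):
    """Return the most frequent target words that occur in no other list.

    target_freq: word -> count for language x.
    other_freqs: list of word -> count mappings for other languages.
    """
    seen_elsewhere = set()
    for freq in other_freqs:
        seen_elsewhere.update(freq)

    candidates = [
        (word, count)
        for word, count in target_freq.items()
        if count >= min_count and word not in seen_elsewhere
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:top_n]]


# Toy frequencies, invented for illustration only.
catalan = Counter({"de": 50, "la": 40, "amb": 12, "però": 9, "això": 7, "xyz": 1})
spanish = Counter({"de": 60, "la": 45, "con": 14, "pero": 10})
french = Counter({"de": 55, "la": 42, "avec": 11, "mais": 9})

queries = distinctive_words(catalan, [spanish, french], top_n=3)
print(queries)
```

The resulting words could then be typed into a search engine by volunteers to spot candidate websites for the crawler; how to automate that step is left open here.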

Corpora discussion

Following our discussion, I made a quick new review of online corpora, mainly using OPUS.nlpl.eu as my entry point. You may already know them. They are a corpus linguistics research center specialized in parallel corpora, but they also provide their monolingual corpora in raw text and tokenized formats. I noticed their data for Wikipedias (/wikipedia.php: 20 languages), Wikipedia Content Translations (/wikimedia.php: 288), Tatoeba (/tatoeba.php: 359), TED (/TED2020.php: 108), Bible (/bible-uedin.php: 102).

Wikipedia corpora (288 languages) are "noisy", and this noise can vary based on each community's policies. After verification, Wikipedias do indeed cover a few dozen languages UNILEX doesn't have. But the more I dive into the raw text corpus, the more uncomfortable I am with it: it's really noisy, the clean-up is hard, and multilingualism (English) and MediaWiki items are sneaking in. If I remember well, smaller wikis tend to import some English code (templates) and translate UI-facing sentences only.

https://opus.nlpl.eu/wikimedia.php > Statistics and TMX/Moses Downloads : pick a small file, ex: see ca.txt.gz
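
To give an idea of what the clean-up involves, here is a minimal sketch of a line filter. It is an assumption-laden heuristic, not an OPUS or corpuscrawler tool: the names `MARKUP`, `looks_clean` and `clean_lines` are hypothetical, the sample lines are invented, and the filter only catches obvious MediaWiki residue. It does nothing about English sentences mixed into another language.

```python
import re

# Obvious MediaWiki residue: template/link brackets, HTML-like tags,
# and lines starting with table, list or heading markers.
MARKUP = re.compile(r"\{\{|\}\}|\[\[|\]\]|<[^>]+>|^\s*[|!*#=]")


def looks_clean(line):
    """Heuristic: True when the line is non-empty and shows no markup residue."""
    return bool(line.strip()) and MARKUP.search(line) is None


def clean_lines(lines):
    """Keep only the lines that pass the heuristic, stripped of whitespace."""
    return [line.strip() for line in lines if looks_clean(line)]


# Invented sample, for illustration only.
sample = [
    "Barcelona és una ciutat.",
    "{{Infobox settlement",
    "| name = Barcelona",
    "[[Category:Cities]]",
    "",
    "El riu és llarg.",
]
cleaned = clean_lines(sample)
print(cleaned)
```

A filter like this is cheap to run, but each wiki's noise differs, which is part of why the clean-up is time-consuming.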

Tatoeba corpora (359 languages) seem to have very clean data. Content is mainly conversational: short sentences, which is interesting for language teaching purposes.

https://opus.nlpl.eu/Tatoeba.php > Statistics and TMX/Moses Downloads : pick a small file, ex: see ca.txt.gz (raw) and ca.tok.gz (tokenized).

Language IDs, first row: monolingual plain text files (tokenized)

Language IDs, first column: monolingual plain text files (untokenized)
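
As a rough illustration of how easy such a monolingual file is to process, the sketch below counts word frequencies in a gzipped plain text file such as a locally downloaded ca.txt.gz. The function name `word_frequencies` and the letter-based tokenization are assumptions made for this example; they are not how UNILEX or OPUS tokenize.

```python
import gzip
import re
from collections import Counter

# A "word" here is any run of Unicode letters (no digits, no underscores).
WORD = re.compile(r"[^\W\d_]+")


def word_frequencies(path):
    """Count lowercase word occurrences in a gzipped UTF-8 text file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            counts.update(WORD.findall(line.lower()))
    return counts


# Example, assuming ca.txt.gz was downloaded to the current directory:
# print(word_frequencies("ca.txt.gz").most_common(20))
```

This simple letter-run tokenization will not suit languages written without spaces; for those, a proper word segmenter would be needed.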

I'm skeptical of the value of crawling Wikipedia dumps because of this time-consuming cleaning issue, especially when the corpora above are easier to process (Tatoeba, TED, Bible), more relevant, and 1.5 times more diverse (Tatoeba).

Overall feedback

I haven't yet drawn out all the possibilities resulting from our discussion. But I now assume UNILEX has done its work, and volunteer resources should be spent where they bring the most, in both ROI and enjoyment... I would suggest the following points:

Document corpuscrawler better. Objective: an outside volunteer with beginner Python skills could create a crawler.
Assess benevolent contacts at Google and Unicode, and the type of help they can provide: knowledge, data, funding, symbolic support, "evangelism" (spreading the news of Lili).
Assess institutional interest at Google & Unicode: is such an open-license, rapid audio recording tool actually interesting for Google? Do they have alternatives to feed their TTS? Maybe they don't care? Who at Google may care? etc.

The goal would be to get them to sponsor LinguaLibre events and minority speakers in exchange for clean open data that Google and Unicode's members will be the first to know of. Win-win.

Not now. But I will hopefully be able to contact the Google Tacotron and Mozilla TTS teams later on.

References

[1] For clarity, I'm not WM France's staff. This is exploratory only.