2021, March 4th
- collaborate on github
- assess if I can connect your (Google/Unicode) community with mine (Wikimedia/LinguaLibre)
- assess avenues of funding from Google/Unicode/Wikimedia.
- assess the possibility of an open source maintainer with strong dev abilities: Python, JS, Java, NLP, Git.
- see how I can help.
Trained in NLP in the 90s; past work at Google and Unicode. Creator of Google/corpuscrawler and the UNILEX data. Python and other languages.
- Knows benevolent people at Google.
- "Frequency read" analysis of Concept
Experienced open source coordinator and volunteer, currently coordinating github.com/Lingua-Libre. I have experience onboarding volunteer contributors, consolidating projects, writing project grant applications, and delegating small-scale developments (2 hours to 1 month) to relevant Wikimedian developers, either volunteers or paid depending on the workload. It's the usual open source ramp-up. I'm not staff, but I'm as close as you can be to Wikimedia France, which supports LinguaLibre.
Wikimedia's community is currently investing energy to expand linguistic diversity and resources online. Our emerging goal is to audio-document 500~1,000+ languages over the next 3 years or so, thanks in part to UNILEX's data, which gives us a roadmap to follow for each of 1,000 languages. We are still in the early stage of community building, but we seem to be on a solid growth track, aside from dozens of bottlenecks to clear.
- Wikimedia LinguaLibre online recording app provides:
- an Open Source, rapid recording system: 1,000 words or 500 short sentences per hour.
- a multilingual community
- local event organization
- word lists (UNILEX)
- UNILEX 1,000 lists:
- I assume UNILEX is stable: it has already done its job very well.
- Google Tacotron (/täkōˌträn/) end-to-end speech synthesis system (demo, recent research)
- converts 8,000+ recorded audio sentences with associated texts into a human-like TTS system, since 2017.
- various open source implementations on github
- Mozilla TTS's deep learning for Text to Speech (github).
- Google Translate TTS is frequently lame...
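Combining two figures from these notes gives a rough sense of the recording effort per voice: at LinguaLibre's rate of ~500 short sentences per hour, the 8,000+ sentences a Tacotron-style system wants amount to roughly 16 hours of sessions per speaker. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope: recording effort needed per voice.
# Both figures come from the notes above; real numbers will vary
# per speaker and per session.
SENTENCES_PER_HOUR = 500   # LinguaLibre throughput (short sentences)
SENTENCES_NEEDED = 8000    # minimum Tacotron training corpus size

hours_needed = SENTENCES_NEEDED / SENTENCES_PER_HOUR
print(f"~{hours_needed:.0f} hours of recording sessions per speaker")
# → ~16 hours
```

Spread over local recording events, that is an achievable but non-trivial commitment for a single minority-language speaker.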
Google Summer of Code 2021 timeline:
- February 20: Mentoring organization application deadline
- March 10: Mentoring organizations announced
- March 29: Student application period begins
- April 13: Student application deadline
- May 17: Student projects announced
- May 17 to June 7: Community bonding period
- June 7 to August 16: Coding period
This is just exploratory.
- Summer of code (see mw:Google Summer of Code/2021)
- I assume the hosting institution to be Wikimedia (France), in Paris.
- I have a really low level of understanding of the Google Summer of Code process at the moment (will dive in later)
- Google-Wikimedia partnership ?
- LinguaLibre provides an Open Source, rapid recording system tool to feed the Tacotron.
- Wikimedia France provides organizational, mentorship resources.
- Google pays some intern for organization? Pays minority speakers?
- Wikidata Lexeme - still slow and chaotic https://ordia.toolforge.org/language/
- SIL International, an American Evangelical network. https://sil.org
- Generate IPA pronunciations https://github.com/brawer/ipa-speaker
- Endangered Language Alliance https://elalliance.org NYC-based group collecting audio samples from migrants.
Following our discussion I made a quick new review of online corpora, mainly using OPUS.nlpl.eu as my entry point. You may already know them. They are a corpus linguistics research center specialized in parallel corpora, but they also provide their monolingual corpora in raw text and tokenized formats. I noticed their data for Wikipedias (/wikipedia.php: 20 languages), Wikipedia Content Translations (/wikimedia.php: 288), Tatoeba (/tatoeba.php: 359), TED (/TED2020.php: 108), Bible (/bible-uedin.php: 102).
Wikipedia corpora (288 languages) are "noisy", and this noise can vary based on each community's policies. After verification, Wikipedias indeed have a few dozen languages UNILEX doesn't cover. But the more I dive into its raw text corpus, the more uncomfortable I am with it: it's really noisy, cleanup is hard, and multilingual (English) content and mediawiki markup sneak in. If I remember well, smaller wikis tend to import some English code (templates) and translate only the UI-facing sentences.
- https://opus.nlpl.eu/wikimedia.php > Statistics and TMX/Moses Downloads : pick a small file, ex: see ca.txt.gz
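To make the cleanup problem concrete, here is a minimal sketch of the kind of line filter such a corpus would need. The noise patterns (template braces, stray HTML tags, bare URLs) and the minimum-length threshold are my own illustrative guesses, not a tested recipe:

```python
import re

# Heuristic cleanup of a noisy Wikipedia-derived corpus, line by line.
# Patterns below are illustrative guesses at common noise sources.
NOISE_PATTERNS = [
    re.compile(r"\{\{.*?\}\}"),   # MediaWiki template residue
    re.compile(r"<[^>]+>"),       # stray HTML tags
    re.compile(r"https?://\S+"),  # bare URLs
]

def clean_line(line):
    """Return a cleaned line, or None if too little text survives."""
    for pattern in NOISE_PATTERNS:
        line = pattern.sub(" ", line)
    line = " ".join(line.split())  # normalize whitespace
    # Drop lines now too short to be a usable sentence.
    return line if len(line.split()) >= 3 else None

sample = [
    "El riu passa per la ciutat i {{cita|desemboca}} al mar.",
    "<ref>stub</ref>",
    "Vegeu https://example.org per a més detalls del projecte.",
]
for s in sample:
    print(clean_line(s))
```

Even this simple pass shows the issue: every wiki community has its own markup habits, so the pattern list would have to grow per language, which is exactly the time sink described above.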
Tatoeba corpora (359 languages) seem to have very clean data. Content is mainly conversational: short sentences, which is interesting for language-teaching purposes.
- https://opus.nlpl.eu/Tatoeba.php > Statistics and TMX/Moses Downloads : pick a small file, ex: see ca.txt.gz (raw) and ca.tok.gz (tokenized).
- Language IDs, first row: monolingual plain text files (tokenized)
- Language IDs, first column: monolingual plain text files (untokenized)
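Since UNILEX is built around frequency lists, a natural first use of these tokenized monolingual files is a word-frequency pass. A minimal sketch, under the assumption that a `.tok.gz` file holds one space-tokenized sentence per line (the file name and sample content here are stand-ins, not real OPUS data):

```python
import gzip
from collections import Counter

def word_frequencies(path):
    """Count token frequencies in a gzipped, space-tokenized corpus."""
    freqs = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            freqs.update(line.split())
    return freqs

# Example: write a tiny stand-in corpus, then count it.
with gzip.open("sample.tok.gz", "wt", encoding="utf-8") as f:
    f.write("el riu passa pel poble\n")
    f.write("el poble és petit\n")

print(word_frequencies("sample.tok.gz").most_common(2))
# → [('el', 2), ('poble', 2)]
```

The same loop, pointed at a real `ca.tok.gz` download, would give a quick sanity check of how a corpus's frequency profile compares with the corresponding UNILEX list.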
I'm skeptical about the value of crawling Wikipedia dumps, due to this time-consuming cleaning issue. Especially when the corpora above (Tatoeba, TED, Bible) are easier to process, more relevant, and 1.5 times more diverse (Tatoeba).
I haven't yet explored all the possibilities resulting from our discussion. But I now assume UNILEX has done its work, and volunteer resources should be spent where they would have the best return, both in ROI and in enjoyment... I would suggest the following points:
The goal would be to get them to sponsor LinguaLibre events and minority speakers in exchange for clean open data that Google and Unicode's members will be the first to know about. Win-win.
Not now. But I will hopefully be able to contact the Google Tacotron team and the Mozilla TTS team later on.
- For clarity, I'm not WM France's staff. This is exploratory only.