Help

Difference between revisions of "Lists"

Revision as of 16:28, 8 January 2019

Wordlists are at the core of language protection. LinguaLibre has a dedicated namespace to store and document lists. To learn how to process lists, see Help:Main.

Guidelines

  1. Find or create an open-source wordlist.
  2. Format it so that each line holds a single item, either a word or a phrase. The prefixes # and * are valid.
  3. Create a wiki page named List:{iso-639-3}/{topic's_title} (e.g. List:Fra/Fruits).
    • {iso-639-3} : see the Wikipedia article on your language; the ISO 639-3 code is in the infobox.
      For general background, see [1].
    • {topic's_title} : the topic's title, in the target language or in English.
Template

Example : [[List:eng/legendary_creatures]]

# dragon
# unicorn
# dahu
# phoenix
# centaur
# pegasus

To start recording, copy the wordlist above 'Source' into the recording studio.
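
The formatting and naming rules from the guidelines can be sketched in Python. This is a minimal illustration, not a LinguaLibre tool: the function names `list_page_title` and `format_wordlist` are hypothetical, and replacing spaces with underscores in the topic is an assumption drawn from the List:eng/legendary_creatures example.

```python
import re


def list_page_title(iso_code: str, topic: str) -> str:
    """Build a page title of the form List:{iso-639-3}/{topic's_title}."""
    code = iso_code.strip()
    # ISO 639-3 codes are three letters long; reject anything else early.
    if not re.fullmatch(r"[A-Za-z]{3}", code):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {iso_code!r}")
    # Assumption: spaces in the topic become underscores, as in the example title.
    slug = "_".join(topic.split())
    if not slug:
        raise ValueError("topic must not be empty")
    return f"List:{code}/{slug}"


def format_wordlist(items, prefix="#"):
    """Return text with one item per line, each prefixed by '#' or '*'."""
    if prefix not in ("#", "*"):
        raise ValueError("prefix must be '#' or '*'")
    lines = []
    for raw in items:
        # Collapse inner whitespace so each item stays on a single line.
        item = " ".join(raw.split())
        if item:  # skip blank entries
            lines.append(f"{prefix} {item}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(list_page_title("eng", "legendary creatures"))
    print(format_wordlist(["dragon", "unicorn", "dahu"]))
```

The resulting text can then be pasted into the new List: page by hand.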

Collaboration policies

  • Lists can be expanded, edited, replaced, or moved.
  • If two lists cover the same language and the same topic : add the author's name.

Licenses

Public domain, MIT, GPL, and open Creative Commons licenses (CC BY, CC BY-SA) are acceptable.

Access wordlists

Finding wordlists

As of 2018, a large number of open-source wordlists are available online for most major languages.

  • UNILEX : https://github.com/unicode-org/unilex
    License : Semi-free, provided that "(a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated Documentation."

Sparql-based lists

LinguaLibre provides SPARQL queries to generate wordlists based on the nearly 300 Wikipedias.

Hand editing wordlists

Copyright on wordlists is rather loose. Copying a copyrighted dictionary index or wordlist verbatim is forbidden. However, a new wordlist compiled from several sources by picking out individual words, which are not individually copyrighted, is your own work.

Datamining

Given a set of text files in the target language (a "written corpus") as input, it is possible, for many languages, to create wordlists with terminal and bash commands. Expect half a day of work. Depending on the language, this presents different challenges : many East Asian languages do not use spaces to mark word boundaries, and other languages are highly agglutinative, so meaningful roots get diluted among their many inflected forms. For these, more NLP or hand editing is required.
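
The corpus-to-wordlist idea can also be sketched in Python instead of bash. This is a rough sketch under stated assumptions, not an established LinguaLibre workflow: the function names, the `min_count` and `limit` parameters, and the choice to rank words by frequency are all illustrative. It splits on non-letter characters, so it only suits languages that separate words with spaces or punctuation.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of Unicode letters, optionally joined by an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(texts):
    """Count lowercased word occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(m.group(0).lower() for m in WORD_RE.finditer(text))
    return counts


def top_words(counts, min_count=2, limit=1000):
    """Return words seen at least min_count times, most frequent first."""
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in frequent[:limit]]


def wordlist_from_corpus(paths, min_count=2, limit=1000):
    """Read UTF-8 text files and return a frequency-ranked wordlist."""
    texts = (Path(p).read_text(encoding="utf-8") for p in paths)
    return top_words(count_words(texts), min_count, limit)


if __name__ == "__main__":
    sample = ["The cat and the dog.", "The cat sat."]
    print(top_words(count_words(sample), min_count=2))
```

The output still needs a manual pass before publishing: frequency ranking brings up function words and inflected forms, and proper nouns are lowercased along with everything else.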

List of lists

  • Help:Swadesh -- renowned series of 207 basic words (English), available in about one hundred languages.
  • List:IPA/letters -- IPA sounds; a volunteer speaker who masters the whole range is wanted!

Many other lists are being moved here. Once moved into the List:... namespace, a list is available and suggested when recording.