Help

Difference between revisions of "Lists"

 
(22 intermediate revisions by 3 users not shown)
Line 1: Line 1:
+ :''This page overlaps with [[Help:Create your own lists]]. Merging or differentiation welcome.''
Wordlists are at the core of language protection. LinguaLibre has a dedicated namespace to store and document lists. For how to process lists, see [[Help:Main]].
  
 
== Guidelines ==
+ :See [[Help:Create your own lists]]
# Find or create an open source wordlist
# Format it so that each line holds a single item, be it a word or a phrase. The prefixes <code># </code> and <code>* </code> are valid.
Line 24: Line 26:
 
* Lists can be expanded, edited, replaced, or moved.
* If a list already exists for the same language and topic, add the author's name.
+ * It is recommended to create lists with a reasonable number of words for two reasons:
+ ** It makes it easier for speakers to exclude words from a list, take a break, and then start the second list without having to exclude the same words again.
  
 
== Licenses ==
- Public domain, MIT, GPL, open Creative commons (cc-by, cc-by-sa) are acceptable.
+ Public domain, MIT, GPL, GNU, open Creative Commons (cc-by, cc-by-sa) and variations are acceptable.
  
 
== Access wordlists ==
- === Finding wordlists ===
+ === Open frequency lists online ===
As of 2018, a large number of open source wordlists are available online for most major languages.
- * UNILEX : https://github.com/unicode-org/unilex.<br>'''License :''' Semi-free given ''"(a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated Documentation."''
+ * UNILEX : https://github.com/unicode-org/unilex - about 1000 languages.<br>'''License :''' Semi-free given ''"(a) this copyright and permission notice appear with all copies of the Data Files or Software, or (b) this copyright and permission notice appear in associated Documentation."''
+ * [[Template:Hermite Dave|Hermite Dave]] (2016) : [https://invokeit.wordpress.com/frequency-word-lists/ page], [https://github.com/hermitdave/FrequencyWords/tree/master/content/2016/pl data]. 62 languages, some pollution in the data (English words).<br>'''License :''' CC-by-sa-4.0
+ * [[:en:Word lists by frequency#SUBTLEX movement|Subtlex series]] - about 8 high-quality academic wordlists based on oral speech from the OpenSubtitles website.<br>'''License :''' variable, see each article.
+ ** Subtlex-pl : [http://crr.ugent.be/papers/subtlex-pl.pdf article], [http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-pl data], available but "for research usage".
+ ** [[Template:Subtlex-ch|Subtlex-ch]] : CC-by.
+ * Worldlex : [https://link.springer.com/article/10.3758/s13428-015-0621-0 article], [http://worldlex.lexique.org data], available but with no stated license.
  
- === Sparql-based lists ===
+ === SPARQL-based lists ===
- LinguaLibre provides Sparkl queries to generate wordlists based on the nearly [https://en.wikipedia.org/wiki/List_of_Wikipedias 300 wikipedias].
+ :For a tutorial, see [[Help:Create your own lists#Generate_lists from queries]]
+ LinguaLibre provides SPARQL queries to generate wordlists based on the nearly [https://en.wikipedia.org/wiki/List_of_Wikipedias 300 Wikipedias].
  
 
=== Hand editing wordlists ===
+ :For a tutorial, see [[Help:Create your own lists]]
Wordlist copyrights are rather loose. Copying the index or wordlist of a copyrighted dictionary verbatim is forbidden. But a new wordlist compiled from several sources by picking individual words, which are not individually copyrighted, is your own work.
  
 
=== Datamining by yourself ===
- :For tutorial, see [[Help:How to create a frequency list ?]]
+ :For a tutorial, see [[Help:How to create a frequency list?]]
Given a set of text files in the target language (a "written corpus"), it is possible, for many languages, to create wordlists with terminal and bash commands. Expect half a day of work. The challenges vary by language: East Asian languages do not use spaces to mark word boundaries, and other languages are highly agglutinative, so meaningful roots get diluted across many surface forms. For these, more NLP or hand editing is required.
  
Line 47: Line 58:
 
* [[List:IPA/letters]] -- IPA sounds; a volunteer speaker with mastery over the whole range is wanted!
Many other lists are moved here. Once moved into the <code>List:...</code> namespace, a list is available and suggested when [[Special:RecordWizard|recording]].
+
+ == See also ==
+ {{Helps}}

Latest revision as of 18:15, 14 September 2022

This page overlaps with Help:Create your own lists. Merging or differentiation welcome.

Wordlists are at the core of language protection. LinguaLibre has a dedicated namespace to store and document lists. For how to process lists, see Help:Main.

Guidelines

See Help:Create your own lists
  1. Find or create an open source wordlist
  2. Format it so that each line holds a single item, be it a word or a phrase. The prefixes # and * are valid.
  3. Create a wiki page named List:{iso-639-3}/{topic's_title} (e.g. List:Fra/Fruits).
    • {iso-639-3}: see your language's article on Wikipedia; the ISO 639-3 code is in the infobox.
      For background on the matter, see [1].
    • {topic's_title}: the topic's title, in the target language or in English.
Template

Example : [[List:eng/legendary_creatures]]

# dragon
# unicorn
# dahu
# phoenix
# centaur
# pegasus

To start recording, copy the wordlist above 'Source' into the recording studio.
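
The naming and formatting rules above can be sketched in Python. This is only an illustration, not a Lingua Libre tool: build_list_page is a hypothetical helper, and replacing spaces with underscores in the topic is an assumption drawn from the example title above.

```python
def build_list_page(iso_code, topic, raw_lines):
    """Return (page_title, page_body) for a List: page.

    Hypothetical helper. The title pattern List:{iso-639-3}/{topic} comes
    from the guidelines; turning spaces into underscores is an assumption
    based on the example title.
    """
    items = []
    seen = set()
    for line in raw_lines:
        item = line.strip()
        # Drop an existing "# " or "* " prefix so it is not doubled.
        if item.startswith(("# ", "* ")):
            item = item[2:].strip()
        # Skip empty lines and repeated items, keeping the original order.
        if item and item not in seen:
            seen.add(item)
            items.append(item)
    title = f"List:{iso_code}/{topic.strip().replace(' ', '_')}"
    body = "\n".join(f"# {item}" for item in items)
    return title, body


title, body = build_list_page(
    "eng", "legendary creatures",
    ["dragon", "* unicorn", "", "# dahu", "dragon"],
)
print(title)
print(body)
```

The body is one item per line with a # prefix, so it can be pasted as the content of the new list page.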

Collaboration policies

  • Lists can be expanded, edited, replaced, or moved.
  • If a list already exists for the same language and topic, add the author's name.
  • It is recommended to create lists with a reasonable number of words for two reasons:
    • It makes it easier for speakers to exclude words from a list, take a break, and then start the second list without having to exclude the same words again.
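
One way to keep lists at a manageable size is to split a long wordlist into consecutive parts. The sketch below is an assumption-laden illustration: split_wordlist is a hypothetical helper, and the limit of 200 words is an arbitrary value chosen for the example, since this page does not state a number.

```python
def split_wordlist(words, max_size=200):
    """Split words into consecutive sublists of at most max_size items.

    max_size=200 is an arbitrary illustrative default, not a Lingua Libre rule.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [words[i:i + max_size] for i in range(0, len(words), max_size)]


# Example: 450 placeholder words become three lists.
parts = split_wordlist([f"word{n}" for n in range(450)], max_size=200)
print([len(part) for part in parts])
```

Each part can then be saved as its own list page, for instance with a numeric suffix in the title.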

Licenses

Public domain, MIT, GPL, GNU, open Creative Commons (cc-by, cc-by-sa) and variations are acceptable.

Access wordlists

Open frequency lists online

As of 2018, a large number of open source wordlists are available online for most major languages.

SPARQL-based lists

For a tutorial, see Help:Create your own lists#Generate_lists from queries

LinguaLibre provides SPARQL queries to generate wordlists based on the nearly 300 Wikipedias.

Hand editing wordlists

For a tutorial, see Help:Create your own lists

Wordlist copyrights are rather loose. Copying the index or wordlist of a copyrighted dictionary verbatim is forbidden. But a new wordlist compiled from several sources by picking individual words, which are not individually copyrighted, is your own work.

Datamining by yourself

For a tutorial, see Help:How to create a frequency list?

Given a set of text files in the target language (a "written corpus"), it is possible, for many languages, to create wordlists with terminal and bash commands. Expect half a day of work. The challenges vary by language: East Asian languages do not use spaces to mark word boundaries, and other languages are highly agglutinative, so meaningful roots get diluted across many surface forms. For these, more NLP or hand editing is required.
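
As a rough illustration of the idea, the same counting can be sketched in Python instead of bash. This is not the method of the tutorial linked above: frequency_list and WORD_RE are hypothetical names, and the tokenizer is a simplifying assumption that only works for languages written with spaces between words.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters, optionally joined by
# apostrophes or hyphens. This does not segment scripts written without spaces.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def frequency_list(texts, top_n=1000):
    """Return (word, count) pairs, most frequent first.

    texts is an iterable of strings, for example the contents of each
    corpus file. top_n=None returns every word.
    """
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts.most_common(top_n)


corpus = ["The cat saw the dog.", "The dog didn't see the cat!"]
for word, count in frequency_list(corpus, top_n=3):
    print(word, count)
```

For a real corpus, the strings would come from reading each text file; the resulting words still need hand checking before they become a list.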

List of lists

  • Help:Swadesh -- renowned series of 207 basic words (English), available in about one hundred languages.
  • List:IPA/letters -- IPA sounds; a volunteer speaker with mastery over the whole range is wanted!

Many other lists are moved here. Once moved into the List:... namespace, a list is available and suggested when recording.

See also

  • Help:Create your own lists
  • Help:How to create a frequency list?
  • Help:Why wordlists matter?
  • Help:Swadesh lists
  • Help:Create a new generator