Help:How to create a frequency list?

{{#Subtitle:Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to '''find or create your own frequency list''', then clean it and split it into easy-to-handle files.
}}
{| class="wikitable"
 +
! Reminder : to start a recording session you need
 +
|-
 +
|
 +
# One LinguaLibre user,
 +
# One willing speaker, and
 +
# '''One list of items to record''' with one item by line.<br> One item can be any easy to read sign, word, sentence or paragraph. The most common use-case is to record a comprehensive words list for your target language.
 +
|}
 +
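For illustration, a short list of items in the format used across this page (one item per line, prefixed with <code>#</code> as produced by the commands further down ; the words themselves are just an example) :
<pre>
# aš
# ir
# tai
</pre>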
== Reusing open license frequency lists ==

=== Hermite Dave's lists ===
Hermite Dave created 61 frequency lists from OpenSubtitles data, covering most major languages, under a CC license. This data requires minor clean up ; example with Korean (<code>ko</code>) :
  
<pre>
mkdir -p ./clean                                                              # create a folder
google-chrome github.com/hermitdave/FrequencyWords/tree/master/content/2018   # open in a web browser to browse available languages
iso=ko                                                                        # define your target language
curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, sort by the 2nd column (numeric, descending), cut on spaces and keep the 1st field (the word), prefix each line with "# " to make a list, print all to file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
</pre>
  
On LinguaLibre.org, [[Special:MyLanguage/Help:Create_your_own_lists#Create_a_new_list|create your lists]] as <code>List:{Iso3}/words-by-frequency-00001-to-05000</code>, etc. Ex. <code>[[List:Pol/words-by-frequency-00001-to-02000]]</code>. <br>
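The chunks produced by <code>split</code> above are numbered <code>-00</code>, <code>-01</code>, etc. A minimal sketch to rename them to the <code>-{start}-to-{end}</code> scheme, assuming 5000-item chunks (the loop and resulting file names are illustrative, not part of the original workflow) :
<pre>
i=0
for f in ./clean/${iso}-words-by-frequency-*.txt; do
  start=$(( i * 5000 + 1 ))          # first item number of this chunk
  end=$(( (i + 1) * 5000 ))          # last item number (the final chunk may hold fewer items)
  mv "$f" "./clean/$(printf '%s-words-by-frequency-%05d-to-%05d.txt' "${iso}" "${start}" "${end}")"
  i=$(( i + 1 ))
done
</pre>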
  
After creating the list on LinguaLibre, add the following to its talk page:
<pre>
==== Source ====
{{Hermite Dave}}
</pre>
=== UNILEX's lists ===
UNILEX is a Unicode Consortium project which curates data for 1001 languages. Many frequency lists are available under a GNU-like license. This data requires minor clean up ; example with Igbo (<code>ig</code>) :
<pre>
mkdir -p ./clean                                                               # create a folder
google-chrome https://github.com/unicode-org/unilex/tree/main/data/frequency   # open in a web browser to browse available languages
iso=ig                                                                         # define your target language
curl https://raw.githubusercontent.com/unicode-org/unilex/main/data/frequency/${iso}.txt | tail -n +5 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, skip the header (start at line 5), sort by the 2nd column (numeric, descending), cut on tabs and keep the 1st field (the word), prefix each line with "# " to make a list, print all to file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
</pre>
  
On LinguaLibre.org, [[Special:MyLanguage/Help:Create_your_own_lists#Create_a_new_list|create your lists]] as <code>List:{Iso3}/words-by-frequency-00001-to-05000</code>, etc. Ex. <code>[[List:Pol/words-by-frequency-00001-to-02000]]</code>. <br>

After creating the list on LinguaLibre, add the following to its talk page:
<pre>
==== Source ====
{{UNILEX License}}
</pre>
  
=== Subtlex's lists ===
The [[:en:Word_lists_by_frequency#SUBTLEX_movement|Subtlex movement]], a group of academic frequency-list studies based on open subtitles, also provides about 10 of the highest-quality frequency lists, with better cleaned-up items. These resources are published under various licenses ; their usage on LinguaLibre must be assessed on a case-by-case basis.

== Corpus ==
Requirements for a relevant corpus :
* Size: 2M+ words (see the quick check below).
* Type: raw text.
* Language: monolingual, or close to it.
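A quick way to check those numbers on a corpus file (the file name is illustrative) :
<pre>
wc -w mycorpus.txt       # -w : words, should be 2,000,000+
wc -l mycorpus.txt       # -l : lines
</pre>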

==== Download a corpus ====
You can download an available corpus in your language or collect your own corpus via some data mining. Corpora are easily available for about 60 languages. Corpora for rare languages are likely missing, so you will likely have to do some data mining yourself.

Some research centers curate the web to provide large corpora to linguists and netizens alike.

{| class="wikitable sortable"
! Project introduction || Type || Languages (2024) || Portal all || Language specific || Download link || Comments
{| class="wikitable"
 
!width="16%"| Lexicon(*)
 
!width="5%"| Levels
 
!width="78%"| CEFR’s descriptors
 
 
|-
 
|-
| 600 || A1 || “Basic user. Breakthrough or beginner”. Survival communication, expressing basic needs.
+
| [https://github.com/googlei18n/corpuscrawler Google Corpus Crawler] is an open-source crawler to building corpora || crawler (Python)<br>sentences (if run)<br>frequency list || 1000+ languages || [https://github.com/googlei18n/corpuscrawler home] || n.a. ||[https://www.gstatic.com/i18n/corpora/wordcounts/aai.txt aai] (freq) || Python 3 project. Easy to add a crawler, send pull requests.
|-
+
 
| 1,200 || A2 || “Basic user. Waystage or elementary”
 
|-
 
| 2,500 || B1 || “Independant user. Threshold or intermediate”.
 
 
|-
 
|-
| 5,000 || B2 || “Independant user. Vantage or upper intermediate”
+
| OpenSubtitles 2016/2018<br> || Subtitles<br>Parallel sentences<br>Monolingual sentences || 75 || [https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles Portal] || [https://opus.nlpl.eu/OpenSubtitles/br&en/v2018/OpenSubtitles `br&en`] || [https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/bre.txt.gz bre] (mono) || '''Source:''' * P. Lison and J. Tiedemann (2016), ''"OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles"'', http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf .
|-
+
'''Licence:''' unclear, "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016).  
| 20,000 || C2 || “Mastery or proficiency”. Native after graduation from highschool.
 
|}
 
(*) : Assuming the most frequent word-families learnt first.
 
  
See also [https://rm.coe.int/1680459f97 CEFR 5.2.1.1] ([https://i.stack.imgur.com/1fLE2.png image]), with the most relevant section cited below :
 
{| class="wikitable"
 
! || VOCABULARY RANGE
 
 
|-
 
|-
| C2 || Has a good command of a very broad lexical repertoire including idiomatic expressions and
+
| Wortschatz by Leipzig || Sentences<br>Monolingual || 290+ || || [https://wortschatz.uni-leipzig.de/en/download/bre bre] || [https://downloads.wortschatz-leipzig.de/corpora/bre_wikipedia_2021_100K.tar.gz bre]: 100k sentences, WP 2021 || List of sentences corpora : [https://api.wortschatz-leipzig.de/ws/swagger-ui/index.html#/Corpora/getAvailableCorpora API reference] > https://api.wortschatz-leipzig.de/ws/corpora
colloquialisms; shows awareness of connotative levels of meaning.
 
|-
 
| C1 || Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with
 
circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of
 
idiomatic expressions and colloquialisms.
 
|-
 
| B2 || Has a good range of vocabulary for matters connected to his/her field and most general topics. Can
 
vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and
 
circumlocution.
 
|-
 
| B1 || Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to
 
his/her everyday life such as family, hobbies and interests, work, travel, and current events.
 
Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and
 
topics.
 
|-
 
| A2 || Has a sufficient vocabulary for the expression of basic communicative needs.<br>
 
Has a sufficient vocabulary for coping with simple survival needs.
 
|-
 
| A1 || Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete
 
situations.
 
|}
 
  
{| class="wikitable"
 
!  || VOCABULARY CONTROL
 
|-
 
| C2 || Consistently correct and appropriate use of vocabulary.
 
|-
 
| C1 || Occasional minor slips, but no significant vocabulary errors.
 
|-
 
| B2 || Lexical accuracy is generally high, though some confusion and incorrect word choice does occur without
 
hindering communication.
 
|-
 
| B1 || Shows good control of elementary vocabulary but major errors still occur when expressing more complex
 
thoughts or handling unfamiliar topics and situations.
 
 
|-
 
|-
| A2 || Can control a narrow repertoire dealing with concrete everyday needs.
+
| CC-100 || Sentences<br>Monolingual || 115 || [https://data.statmt.org/cc-100/ Portal] || n.a. || [https://data.statmt.org/cc-100/br.txt.xz br] (mono) || « No claims of intellectual property are made on the work of preparation of the corpus. »
|-
 
| A1 || No descriptor available
 
|}
 
{| class="wikitable"
 
!
 
|-
 
| Users of the Framework may wish to consider and where appropriate state:
 
• which lexical elements (fixed expressions and single word forms) the learner will need/be
 
equipped/be required to recognise and/or use;
 
• how they are selected and ordered
 
 
|}
 
|}
  
==== Wiki(p)edia dumps ====
One possibility is to harvest Wikipedia's contents (a rough sketch follows the links). See:
* [https://github.com/hugolpz/introduction-wikipedia-corpus Introduction-wikipedia-corpus]
* [https://github.com/google/corpuscrawler/issues/78 Discussion on harvesting Wikipedia]
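A rough sketch using basic command line tools only, assuming a Breton (<code>br</code>) dump ; dumps follow the <code>{lang}wiki-latest-pages-articles.xml.bz2</code> naming pattern on https://dumps.wikimedia.org, and the clean up here is deliberately crude (XML tags stripped, wiki markup left in ; see the links above for proper extractors) :
<pre>
wiki=br                                                                       # target Wikipedia language code
curl -O https://dumps.wikimedia.org/${wiki}wiki/latest/${wiki}wiki-latest-pages-articles.xml.bz2
bzcat ${wiki}wiki-latest-pages-articles.xml.bz2 | sed -e 's/<[^>]*>//g' > ${wiki}-corpus.txt
# decompress to stdout, strip XML tags, print the remaining raw text to file.
</pre>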
  
== From corpus to frequency data `{occurrences} {item}` ==
The main tools will be <code>[https://ss64.com/bash/grep.html grep]</code> to grab the text strings, <code>[https://ss64.com/bash/awk.html awk]</code> to count them, and <code>[https://ss64.com/bash/sort.html sort]</code> to sort and rank them.
  
==== Characters frequency (+sorted) ====
<pre class="console">
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 &gt; sorted-letters.txt
# grep -o '\S' : one non-space character per line ; awk : count occurrences of each ; sort : most frequent first.
</pre>
  
==== Space-separated Words frequency (+sorted) ====
<pre class="console">
# Space- or punctuation-separated words
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  &gt; sorted-words.txt
# Space- or punctuation-separated words, except when the punctuation is ' or -
$ cat longtext.txt | sed 's/^\(.\)/\L\1/' | sed -E "s/((['-]*\w+)*)/\1\n/gi" | sed -E "s/ //g" | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  &gt; sorted-words.txt
# Space- or line-break-separated words
$ cat longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" | sort -n -r -t' ' -k1,1 &gt; sorted-words.txt
</pre>
  
== From corpus to frequency data {item}{occurences} ==
+
==== Loop on all .txt, recursively within folders ====
==== Characters frequency (+sorted!) ====
+
<pre class="console">
<pre>$ grep -o '\S' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort &gt; myoutput.txt</pre>
+
find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 &gt; sorted-words.txt
 +
</pre>
  
==== Output ====
<pre class="console">
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
</pre>
  
== Cleaning up frequency lists ==
Most sources provide wordlists as <code>{number_of_apparitions}{separator}{item}</code> or its mirror <code>{item}{separator}{number_of_apparitions}</code>, already sorted from most frequent to least frequent. We want to keep the field <code>{item}</code> and drop the <code>{separator}</code> and the <code>{number_of_apparitions}</code>.

{| class="wikitable" style="width:100%"
! Input data we have || Output data we want
|- valign="top"
|
<pre class="console">
$ cat frequency-list.txt
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
</pre>
|
<pre class="console">
$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...
</pre>
|-
!colspan=2| Command
|-
|colspan=2|
<pre class="console">
cut -d' ' -f2 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# cut each line on the space and keep field 2 (the word), prefix every line with "# ", print to file.
</pre>
|}
This final result is what you want for LinguaLibre [[Help:Create your own lists]].
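For the mirrored <code>{item}{separator}{number_of_apparitions}</code> order, keep field 1 instead ; a minimal variant assuming a tab separator (adapt <code>-d</code> to your file) :
<pre class="console">
cut -d$'\t' -f1 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# keep field 1 (the item), prefix every line with "# ", print to file.
</pre>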
  
== Additional helpers ==
=== Sort command ===
See <code>[https://ss64.com/bash/sort.html man sort]</code> for details ; a tiny illustration follows.
:<code>-n</code> : numeric sort
:<code>-r</code> : reverse (descending)
:<code>-t' '</code> : changes the field separator to the ' ' character
:<code>-k1,1</code> : the sort key starts on field 1 and ends on field 1
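Those flags combined, on made-up sample data :
<pre class="console">
$ printf '3 c\n10 a\n2 b\n' | sort -n -r -t' ' -k1,1
10 a
3 c
2 b
</pre>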
  
=== Counting lines of a file ===
<pre class="console">wc -l filename.txt       # -l : lines</pre>
  
=== See sample of a file ===
<pre class="console">head -n 50 filename.txt      # -n : number of lines</pre>
  
=== Splitting a very long file ===
<pre class="console">
split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt  YUE-words-by-frequency-
</pre>
Word-list files are generally over 10k lines long, thus not convenient for recording sessions. Given 1000 recordings per hour via LinguaLibre, and 3-hour sessions being quite good and intense, we recommend sub-files of :
* 1000 lines, so you use 1, 2 or 3 files per session ;
* 3000 lines, so you use 1 file per session and kill it off like a warrior... if your speaker and you survive.

See [https://stackoverflow.com/a/2016918 How to split a large text file into smaller files with equal number of lines in terminal?]
 
=== Convert encoding ===
<pre class="console">
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
</pre>

=== Create frequency list from [[:en:Dragon]] ===
:''See also [https://codepen.io/hugolpz/pen/ByoKOK 101 Wikidata/Wikipedia API via JS]''
<pre class="console">
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
# fetch the plain-text extract of the article, put one word per line (spaces to newlines), count identical lines, sort by count descending.
</pre>
=== How to compare lists ? ===
:[Section status: Draft, to continue.] ([https://github.com/hugolpz/audio-cmn/blob/master/hsk-missing-audios.bash example]).
<pre>
comm - compare two sorted files line by line
</pre>
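A minimal sketch using <code>comm</code>, assuming two plain one-item-per-line lists (file names are illustrative) :
<pre class="console">
sort list-a.txt > a-sorted.txt
sort list-b.txt > b-sorted.txt
comm -23 a-sorted.txt b-sorted.txt     # items only in list-a.txt
comm -13 a-sorted.txt b-sorted.txt     # items only in list-b.txt
comm -12 a-sorted.txt b-sorted.txt     # items present in both lists
</pre>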
== See also ==
* https://dumps.wikimedia.org

{{Helps}}
