Help:How to create a frequency list?


Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to find or create your own frequency list, then clean it and split it into easy-to-handle files.

Reminder: to start a recording session you need
  1. One LinguaLibre user,
  2. One willing speaker, and
  3. One list of items to record, with one item per line.
    An item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language.

Reusing open license lists

Create a list from Hermite Dave's lists

Hermite Dave created 61 frequency lists from OpenSubtitles data, covering most major languages under a CC license. This data requires minor clean-up; example with Korean (ko):

mkdir -p ./clean                                                              # create a folder
google-chrome github.com/hermitdave/FrequencyWords/tree/master/content/2018   # open in a web browser to browse available languages
iso=ko                                                                        # define your target language
curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, sort by the 2nd column's numeric value in descending order, cut on spaces and keep the first field (the word), prefix each line with '# ' to make a list, write to file.
split -d -l 5000  --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items

On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-02000, etc. Ex. List:Pol/words-by-frequency-00001-to-02000.

After creating the list on LinguaLibre, add the following to its talk page:

==== Source ====
{{Hermite Dave}} 

Create a list from UNILEX's lists

UNILEX is a Unicode Consortium project which curates data for 999 languages. Many of its frequency lists are available under a GNU-like license. This data requires minor clean-up; example with Igbo (ig):

mkdir -p ./clean                                                              # create a folder
google-chrome github.com/lingua-libre/unilex/tree/master/data/frequency       # open in a web browser to browse available languages
iso=ig                                                                        # define your target language
curl https://raw.githubusercontent.com/lingua-libre/unilex/master/data/frequency/${iso}.txt | tail -n +5 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, start at line 5 (skip the header), sort by the 2nd column's numeric value in descending order, cut on tabs and keep the first field (the word), prefix each line with '# ' to make a list, write to file.
split -d -l 5000  --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items

After creating the list on LinguaLibre, add the following to its talk page:

==== Source ====
{{UNILEX License}} 

Corpus

Download a corpus

You can download available corpuses in your language or collect your own corpus via some datamining. Corpuses are easily available for about 60 languages. Corpuses for rarer languages are likely missing, so you will probably have to do some data mining yourself.

Some research centers are curating the web to provide large corpuses to linguists and netizens alike.

* Corpus Crawler is an open-source crawler for building corpora in 1000+ languages (it is easy to add more; send pull requests): https://github.com/googlei18n/corpuscrawler
* P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
** "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016).
** http://opus.nlpl.eu/OpenSubtitles2018.php

Datamine your corpus

When you have a solid corpus with 2+ million words, you can process it to get a word frequency list. For datamining, Python and other languages are your friends to gather data and/or process those directories of files.
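Before processing, a quick sanity check on corpus size can help (a minimal sketch, assuming your corpus is stored as plain-text .txt files under the current folder):

find . -iname '*.txt' -exec cat {} \; | wc -w       # -w : count words; aim for 2+ million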

From corpus to frequency data `{occurrences} {item}`

The main tools will be grep to grab the text strings, awk to count them, and sort to sort and rank them.

For sort :

-n: numeric sort
-r: reverse (descending)
-t: sets the field separator to the ' ' (space) character
-k: as -k1,1, the sort key starts on field 1 and ends on field 1

Character frequency (sorted)

$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt

Space-separated word frequency (sorted)

$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  > sorted-words.txt
# or 
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort -n -r -t' ' -k1,1 > sorted-words.txt

Loop over all .txt files, recursively within folders

find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt

Output

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

From frequency data to clean list of {item}s

Most sources provide wordlists as {number_of_occurrences}{separator}{item} or its mirror {item}{separator}{number_of_occurrences} (the mirrored layout is covered in the note after the example below).

Input : frequency-list.txt

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

Command

To clean up, combine cut with sed (using sed's -r or -E option for extended regular expressions):

cut -d' ' -f2 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# read the file line by line, cut on spaces and keep field 2 (the word), prefix every line with '# ', write to file.

Output : words-list.txt

$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...

This final result is what you want for LinguaLibre Help:Create your own lists.
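If your source instead uses the mirrored {item}{separator}{number_of_occurrences} layout (as the UNILEX files do, tab-separated), keep the first field rather than the second; a sketch assuming a tab-separated frequency-list.txt:

cut -d$'\t' -f1 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# keep field 1 (the word), prefix every line with '# ', write to file.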

Additional helpers

Counting lines of a file

wc -l filename.txt       # -l : lines

See a sample of a file

head -n 50 filename.txt       # -n : number of lines

Splitting a very long file

split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt  YUE-words-by-frequency-

Word-list files are generally over 10,000 lines long, which is not convenient for running recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being quite good and intense, we recommend sub-files of:
 - 1000 lines, so you use 1, 2 or 3 files per session;
 - 3000 lines, so you use 1 file per session and kill it off like a warrior... if your speaker and yourself survive.
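For instance, to follow the 3000-lines-per-session recommendation, reuse the same command and only change the line count:

split -d -l 3000 --additional-suffix=".txt" YUE-words-by-frequency.txt  YUE-words-by-frequency-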

See How to split a large text file into smaller files with equal number of lines in terminal?

Convert encoding

iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt       # -f : source encoding, -t : target encoding, -o : output file

Create a frequency list from en:Dragon

See also 101 Wikidata API via JS
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
# fetch the article's plain text, turn spaces into newlines (one word per line), count identical lines, sort by count in descending order.

How to compare lists?

[Section status: Draft, to continue.] (example).
 comm - compare two sorted files line by line
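For example, a minimal sketch with comm, assuming list-a.txt and list-b.txt are your two word lists (comm needs sorted input):

comm -12 <(sort list-a.txt) <(sort list-b.txt) > common-items.txt     # -12 : keep only lines present in both lists
comm -23 <(sort list-a.txt) <(sort list-b.txt) > only-in-a.txt        # -23 : keep lines present only in the first list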