How to create a frequency list?
Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to find or create your own frequency list, then clean it and split it into easy-to-handle files.
Start from a corpus
Download corpuses
You can download available corpuses in your language or collect your own corpus via some datamining. Corpuses are easily available for about 60 languages. Corpuses for rare languages are likely missing, so you will have to do some data mining yourself.
Some research centers crawl the web to provide large corpuses to linguists and netizens alike.
- Corpus Crawler is an open-source crawler for building corpora in 1000+ languages (easy to add more, send pull requests); https://github.com/googlei18n/corpuscrawler
- P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
- "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016).
- http://opus.nlpl.eu/OpenSubtitles2018.php
Datamining
When you have a solid corpus of 2 million words or more, you can process it to get a word frequency list (see the commands below).
For datamining, Python and other languages are your friends to gather data and/or process directories of files.
Create a list from Hermite Dave's lists
Hermite Dave created 61 frequency lists from OpenSubtitles data, covering most major languages. This data requires minor cleanup:
mkdir -p ./clean   # create a folder
google-chrome github.com/hermitdave/FrequencyWords/tree/master/content/2018   # open in a web browser to browse available languages
iso=ko   # define your target language
curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, sort by the 2nd column's numeric value in descending order, cut on spaces and keep the first field, add "# " to make a list, print all to a file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-02000, etc. Ex. List:Pol/words-by-frequency-00001-to-02000.
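The split command above produces numeric suffixes (-00.txt, -01.txt, etc.). Here is a minimal bash sketch to rename those chunks to this naming scheme; it assumes the 5000-line chunks and the ${iso} variable from the commands above, and the last chunk may hold fewer lines than its label suggests:
i=0
for f in ./clean/${iso}-words-by-frequency-[0-9][0-9].txt; do   # glob matches split's two-digit suffixes
  start=$(( i * 5000 + 1 ))
  end=$(( (i + 1) * 5000 ))
  mv "$f" "./clean/${iso}-words-by-frequency-$(printf '%05d' "$start")-to-$(printf '%05d' "$end").txt"
  i=$(( i + 1 ))
done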
After creating the list on LinguaLibre, add the following to its talk page:
==== Source ====
{{Hermite Dave}}
Create a list from UNILEX's lists
UNILEX is a Unicode Consortium project which curates data for 999 languages, with as many frequency lists available under a GNU-like license. This data requires minor cleanup:
mkdir -p ./clean   # create a folder
google-chrome github.com/lingua-libre/unilex/tree/master/data/frequency   # open in a web browser to browse available languages
iso=ig   # define your target language
curl https://raw.githubusercontent.com/lingua-libre/unilex/master/data/frequency/${iso}.txt | tail -n +5 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, print from line 5 onward (skip the 4-line header), sort by the 2nd column's numeric value in descending order, cut on tabs and keep the first field, add "# " to make a list, print all to a file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
After creating the list on LinguaLibre, add the following to its talk page:
==== Source ====
{{UNILEX License}}
From corpus to frequency data `{occurrences} {item}`
Main tools will be grep to grab the text strings, awk to count them, and sort to sort and rank them.
For sort:
- -n : numeric sort
- -r : reverse (descending)
- -t : changes the field separator, here to the ' ' character
- -k : as -k1,1, the sort key starts on field 1 and ends on field 1
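A minimal demonstration on three lines of made-up data:
$ printf '12 tai\n3 kad\n120 ir\n' | sort -n -r -t' ' -k1,1
120 ir
12 tai
3 kad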
Character frequency (+sorted!)
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt
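# grab every non-blank character, count occurrences with awk, then sort the counts numerically in descending order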
Space-separated word frequency (+sorted!):
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
# or
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort -n -r -t' ' -k1,1 > sorted-words.txt
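# extract each run of word characters, count with awk, sort by count descending (the awk-only variant splits records on spaces or newlines instead)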
On all .txt files of a folder and its subfolders
find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
Output
39626 aš
35938 ir
33361 tai
28520 tu
26213 kad
...
From frequency data to clean list of {item}s
Most sources provide wordlists in the form number_of_occurrences item, such as:
Input : frequency-list.txt
39626 aš
35938 ir
33361 tai
28520 tu
26213 kad
...
Command
To clean up, we recommend sed's -r or -E:
sed -E 's/^[0-9]+ /# /g' frequency-list.txt > words-list.txt
Output : words-list.txt
$ cat words-list.txt
# aš
# ir
# tai
# tu
# kad
# ...
This final result is what you want for LinguaLibre Help:Create your own lists.
Additional helpers
Counting lines of a file
wc -l filename.txt # -l : lines
See a sample of a file
head -n 50 filename.txt # -n : number of lines
Splitting a very long file
split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt YUE-words-by-frequency-
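# produces YUE-words-by-frequency-00.txt, YUE-words-by-frequency-01.txt, etc. (GNU split's -d numeric suffixes)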
Words-list files are generally over 10k lines long, thus not convenient for recording sessions. Given 1000 recordings per hour via LinguaLibre, and 3-hour sessions being quite good and intense, we recommend sub-files of:
- 1000 lines, so you use 1, 2 or 3 files per session;
- 3000 lines, so you use 1 file per session and kill it off like a warrior... if your speaker and you survive.
See How to split a large text file into smaller files with equal number of lines in terminal?
Convert encoding
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
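If you are unsure of the source encoding, file can give a heuristic guess (the filename reuses the example above):
file -i SUBTLEX-CH-WF.csv   # prints the MIME type and charset, e.g. text/plain; charset=iso-8859-1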
Create a frequency list from en:Dragon
- See also 101 Wikidata API via JS
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
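# fetch the page extract, translate spaces (\040) into newlines (\012), then count each token (uniq -c) and rank by frequency descending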
How to compare lists?
- [Section status: Draft, to continue.] (example).
comm - compare two sorted files line by line
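A minimal sketch, assuming two word-list files list-a.txt and list-b.txt (hypothetical names; comm needs its inputs sorted):
sort list-a.txt > a.sorted
sort list-b.txt > b.sorted
comm -12 a.sorted b.sorted   # words present in both lists
comm -23 a.sorted b.sorted   # words only in list-a.txt
comm -13 a.sorted b.sorted   # words only in list-b.txt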