How to create a frequency list?
In a nutshell: to start a recording session you need
- one LinguaLibre user,
- one willing speaker, and
- a list of items to record, with one item per line. An item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language.
After reading this page you will be able to create your own frequency list, cleaned and split into easy-to-handle files.
Getting my corpus
Download corpuses
You can download available corpuses in your language or collect your own corpus via some data mining. Corpuses are easily available for about 60 languages. Corpuses for rare languages are likely missing, in which case you will have to do some data mining.
Some research centers are curating the web to provide large corpuses to linguists and netizens alike.
- Jörg Tiedemann, 2009, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia
Datamining
When you have a solid corpus of about 2 million words, you can process it to get a word frequency list.
For data mining, Python and other languages are your friends to gather data and/or process directories of files.
Hermite Dave data to LinguaLibre lists' format
Hermite Dave created 61 frequency lists from OpenSubtitles data. These cover most major languages. The data still needs the light cleanup described below to match LinguaLibre's list format.
Get your data from github.com/hermitdave/FrequencyWords :
$ git clone git@github.com:hermitdave/FrequencyWords.git
Find your {iso2}_50k.txt file. For example, for Polish, open a terminal in the folder containing pl_50k.txt, then run:
iso2=pl
iso3=pol
sed -E 's/ [0-9]+$//g' "$iso2"_50k.txt | sed -E 's/^/# /g' > "$iso2"-words-LL.txt
split -d -l 2000 --additional-suffix=".txt" "$iso2"-words-LL.txt "$iso3"-words-by-frequency-
You obtain 25 files of 2000 lines.
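As a quick check of what the two sed steps do, here is a minimal sketch on a three-line sample. The file name pl_sample.txt and the counts are made up for illustration; only the `word count` line layout is inferred from the commands above.

```shell
# Work in a throwaway directory so no real file is touched.
demo_dir=$(mktemp -d)

# Hypothetical sample in "word count" form (counts are invented).
printf '%s\n' 'nie 500' 'to 400' 'się 300' > "$demo_dir/pl_sample.txt"

# Step 1: drop the trailing count. Step 2: prefix each line with "# ".
cleaned=$(sed -E 's/ [0-9]+$//' "$demo_dir/pl_sample.txt" | sed -E 's/^/# /')

printf '%s\n' "$cleaned"
```

Each output line should now hold a single item prefixed with `# `, which is the shape the list pages expect.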
On LinguaLibre.fr, create your lists as List:{Iso3}/words-by-frequency-0001-to-2000, etc. For example: List:Pol/words-by-frequency-0001-to-2000.
Reminder: Hermite Dave's data is under CC-by-sa-3.0. When you create the list page, add at the end of the list's wikipage:
==== Source ====
{{Hermite Dave}}
From corpus to frequency data `{occurrences} {item}`
The main tools will be grep to grab the text strings, awk to count them, and sort to sort and rank them.
For sort:
- -n: numeric sort
- -r: reverse (descending)
- -t: sets the field separator, here the ' ' (space) character
- -k: sort key; with -k1,1 the key starts on field 1 and ends on field 1
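To illustrate why -n matters for the sort options listed here, the following sketch sorts the same three lines with and without it. The sample lines and counts are made up for illustration.

```shell
# Work in a throwaway directory so no real file is touched.
demo_dir=$(mktemp -d)

# Hypothetical frequency data in "{occurrences} {item}" form.
printf '%s\n' '9 tai' '120 ir' '35 tu' > "$demo_dir/sample-freq.txt"

# Without -n the first field is compared as text, character by character.
text_order=$(sort -r -t' ' -k1,1 "$demo_dir/sample-freq.txt")

# With -n the first field is compared as a number.
numeric_order=$(sort -n -r -t' ' -k1,1 "$demo_dir/sample-freq.txt")

printf 'text order:\n%s\n\nnumeric order:\n%s\n' "$text_order" "$numeric_order"
```

In text order, `9 tai` comes first because the character 9 sorts after 3 and 1. Only the numeric order ranks `120 ir` at the top.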
Characters frequency (+sorted!)
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt
Space-separated Words frequency (+sorted!):
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
# or
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort -n -r -t' ' -k1,1 > sorted-words.txt
On all .txt of a folder and its subfolders
find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
Output
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
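The grep, awk and sort chain can be tried on a tiny inline text before running it on a full corpus. This is a sketch under assumptions: the sample sentence is made up, and `grep -oE '\w+'` is used here as a variant of `grep -o '\w*'` that never matches an empty string. Note that what `\w` matches depends on your locale, so in a non-UTF-8 locale accented letters may split words.

```shell
# Two made-up lines of text standing in for a corpus.
ranked=$(
  printf '%s\n' 'tai ir tai' 'kad ir tai' |
    grep -oE '\w+' |                                    # one word per line
    awk '{a[$1]++} END {for (k in a) print a[k], k}' |  # count each word
    sort -n -r -t' ' -k1,1                              # highest count first
)

printf '%s\n' "$ranked"
```

The result should list `tai` with 3 occurrences, then `ir` with 2, then `kad` with 1.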
From frequency data to clean list of {item}s
Most sources provide word lists with lines of the form number_of_occurrences item, such as:
Input : frequency-list.txt
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
Command
To clean up, we recommend sed with extended regular expressions (-r or -E):
sed -E 's/^[0-9]+ /# /g' frequency-list.txt > words-list.txt
Output : words-list.txt
$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...
This final result is what you want for LinguaLibre Help:Create your own lists.
How to compare lists ?
- [Section status: Draft, to continue.]
You will often need to compare two word lists A and B, for example to find which words are in list B but not in list A. Use the shell tool `comm`, which requires both files to be sorted.
comm - compare two sorted files line by line
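A minimal sketch of such a comparison, assuming two small made-up lists (the file names list-A.txt and list-B.txt are hypothetical). Both files are sorted with the same locale before being passed to comm, since comm relies on a consistent ordering.

```shell
# Work in a throwaway directory so no real file is touched.
demo_dir=$(mktemp -d)

# Two hypothetical word lists, deliberately unsorted.
printf '%s\n' 'ir' 'aš' 'tai' > "$demo_dir/list-A.txt"
printf '%s\n' 'tai' 'kad' 'ir' 'tu' > "$demo_dir/list-B.txt"

# Sort and deduplicate both lists with the same byte-wise ordering.
LC_ALL=C sort -u "$demo_dir/list-A.txt" > "$demo_dir/A.sorted"
LC_ALL=C sort -u "$demo_dir/list-B.txt" > "$demo_dir/B.sorted"

# -1 hides lines found only in A, -3 hides lines found in both,
# so what remains is the set of words found only in B.
only_in_B=$(LC_ALL=C comm -13 "$demo_dir/A.sorted" "$demo_dir/B.sorted")

printf '%s\n' "$only_in_B"
```

Here the words `kad` and `tu` are reported, since they appear in list B only. Using `-23` instead would give the words found only in list A.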
Splitting a very long file
Word-list files are generally over 10k lines long, which is not convenient for recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being quite good and intense, we recommend sub-files of:
- 1000 lines, so you use 1, 2 or 3 files per session;
- 3000 lines, so you use 1 file per session and kill it off like a warrior ... if your speaker and yourself survive.
See How to split a large text file into smaller files with equal number of lines in terminal?
split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt YUE-words-by-frequency-
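To preview how split names and sizes its output, here is a sketch on a generated 5000-line file. It assumes GNU split (the --additional-suffix option is a GNU extension), and the demo- file names are hypothetical.

```shell
# Work in a throwaway directory so no real file is touched.
demo_dir=$(mktemp -d)

# Generate a hypothetical 5000-line list in "# item" form.
seq 1 5000 | sed 's/^/# item-/' > "$demo_dir/demo-words-by-frequency.txt"

# Same options as above: numeric suffixes (-d), 2000 lines per file (-l),
# and a .txt extension on every piece.
split -d -l 2000 --additional-suffix=".txt" \
  "$demo_dir/demo-words-by-frequency.txt" \
  "$demo_dir/demo-words-by-frequency-"

# Count the lines of each piece.
wc -l "$demo_dir"/demo-words-by-frequency-*.txt
```

This should produce three pieces suffixed -00.txt, -01.txt and -02.txt, holding 2000, 2000 and 1000 lines; the last file simply takes whatever is left.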
Other utilities
Counting lines of a file
wc -l filename.txt # -l : lines
See a sample of a file
head -n 50 filename.txt # -n : number of lines