Difference between revisions of "How to create a frequency list?"
Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to find or create your own frequency list, clean it, and split it into easy-to-handle files.
(95 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | ''' | + | {{#Subtitle:Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to '''find or create your own frequency list''', clean it, and split it into easy-to-handle files. |
+ | }} | ||
+ | {| class="wikitable" | ||
+ | ! Reminder : to start a recording session you need | ||
+ | |- | ||
+ | | | ||
+ | # One LinguaLibre user, | ||
+ | # One willing speaker, and | ||
+ | # '''One list of items to record''' with one item per line.<br> One item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language. | ||
+ | |} | ||
+ | == Reusing open license frequency lists == | ||
− | + | === Hermit Dave's lists === | |
− | + | Hermit Dave created 61 frequency lists from OpenSubtitles data, covering most major languages, under a CC license. The data requires minor cleanup; example with Korean (<code>ko</code>): | |
− | |||
− | + | <pre> | |
+ | mkdir -p ./clean # create a folder | ||
+ | google-chrome github.com/hermitdave/FrequencyWords/tree/master/content/2018 # open in a web browser to browse available languages | ||
+ | iso=ko # define your target language | ||
+ | curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt | ||
+ | # download, sort by 2nd column numerical value descending, cut by space then keep first field, add # to make a list, print all to file. | ||
+ | split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency- | ||
+ | # split into files of 5000 items | ||
+ | </pre> | ||
− | + | On LinguaLibre.org, [[Special:MyLanguage/Help:Create_your_own_lists#Create_a_new_list|create your lists]] as <code>List:{Iso3}/words-by-frequency-00001-to-05000</code>, etc. Example: <code>[[List:Pol/words-by-frequency-00001-to-02000]]</code>. <br> | |
− | |||
− | |||
− | + | After creating the list on LinguaLibre, add the following to its talkpage: | |
+ | <pre> | ||
+ | ==== Source ==== | ||
+ | {{Hermite Dave}} | ||
+ | </pre> | ||
− | + | === UNILEX's lists === | |
− | + | UNILEX is a Unicode Consortium project which curates 1001 languages. As many frequency lists are available under a GNU-like license. The data requires minor cleanup; example with Igbo (<code>ig</code>): | |
+ | <pre> | ||
+ | mkdir -p ./clean # create a folder | ||
+ | google-chrome https://github.com/unicode-org/unilex/tree/main/data/frequency # open in a web browser to browse available languages | ||
+ | iso=ig # define your target language | ||
+ | curl https://raw.githubusercontent.com/unicode-org/unilex/main/data/frequency/${iso}.txt | tail -n +5 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt | ||
+ | # download, skip the header (tail -n +5 starts the output at line 5), sort by 2nd column numerical value descending, cut and keep first field, add # to make a list, print all to file. | ||
+ | split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency- | ||
+ | # split into files of 5000 items | ||
+ | </pre> | ||
− | + | On LinguaLibre.org, [[Special:MyLanguage/Help:Create_your_own_lists#Create_a_new_list|create your lists]] as <code>List:{Iso3}/words-by-frequency-00001-to-05000</code>, etc. Example: <code>[[List:Pol/words-by-frequency-00001-to-02000]]</code>. <br> | |
− | |||
− | |||
− | == | + | After creating the list on LinguaLibre, add the following to its talkpage: |
− | + | <pre> | |
+ | ==== Source ==== | ||
+ | {{UNILEX License}} | ||
+ | </pre> | ||
+ | |||
+ | === Subtlex's lists === | ||
+ | The [[:en:Word_lists_by_frequency#SUBTLEX_movement|Subtlex movement]], a group of academic frequency-list studies based on open subtitles, also provides about 10 of the highest-quality frequency lists, with better cleaned-up items. These resources are published under various licenses, so their use on LinguaLibre must be assessed case by case. | ||
+ | |||
+ | == Corpus == | ||
+ | Requirements for a relevant corpus: | ||
+ | *Size: 2M+ words. | ||
+ | *Type: raw text. | ||
+ | *Language: monolingual, or close to it. | ||
+ | |||
+ | ==== Download a corpus ==== | ||
+ | You can download an available corpus in your language or collect your own via some data mining. Corpora are easily available for about 60 languages. For rarer languages they are likely missing, and you will probably have to do the data mining yourself. | ||
+ | | ||
+ | Some research centers curate the web to provide large corpora to linguists and netizens alike. | ||
+ | |||
+ | {| class="wikitable sortable" | ||
+ | ! Project introduction || Type || Languages (2024) || Portal all || Language specific || Download link || Comments | ||
+ | |||
+ | |- | ||
+ | | [https://github.com/googlei18n/corpuscrawler Google Corpus Crawler] is an open-source crawler for building corpora || crawler (Python)<br>sentences (if run)<br>frequency list || 1000+ languages || [https://github.com/googlei18n/corpuscrawler home] || n.a. ||[https://www.gstatic.com/i18n/corpora/wordcounts/aai.txt aai] (freq) || Python 3 project. Easy to add a crawler, send pull requests. | ||
+ | |||
+ | |- | ||
+ | | OpenSubtitles 2016/2018<br> || Subtitles<br>Parallel sentences<br>Monolingual sentences || 75 || [https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles Portal] || [https://opus.nlpl.eu/OpenSubtitles/br&en/v2018/OpenSubtitles `br&en`] || [https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/bre.txt.gz bre] (mono) || '''Source:''' * P. Lison and J. Tiedemann (2016), ''"OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles"'', http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . | ||
+ | '''Licence:''' unclear, "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016). | ||
+ | |||
+ | |- | ||
+ | | Wortschatz by Leipzig || Sentences<br>Monolingual || 290+ || || [https://wortschatz.uni-leipzig.de/en/download/bre bre] || [https://downloads.wortschatz-leipzig.de/corpora/bre_wikipedia_2021_100K.tar.gz bre]: 100k sentences, WP 2021 || List of sentence corpora: [https://api.wortschatz-leipzig.de/ws/swagger-ui/index.html#/Corpora/getAvailableCorpora API reference] > https://api.wortschatz-leipzig.de/ws/corpora | ||
+ | |||
+ | |- | ||
+ | | CC-100 || Sentences<br>Monolingual || 115 || [https://data.statmt.org/cc-100/ Portal] || n.a. || [https://data.statmt.org/cc-100/br.txt.xz br] (mono) || « No claims of intellectual property are made on the work of preparation of the corpus. » | ||
+ | |} | ||
− | + | ==== Wiki(p)edia dumps ==== | |
− | + | One possibility is to harvest Wikipedia's contents. See: | |
− | + | * [https://github.com/hugolpz/introduction-wikipedia-corpus Introduction-wikipedia-corpus] | |
− | + | * [https://github.com/google/corpuscrawler/issues/78 Discussion on harvesting Wikipedia] | |
− | :- | ||
+ | == From corpus to frequency data `{occurrences} {item}` == | ||
+ | The main tools are <code>[https://ss64.com/bash/grep.html grep]</code> to grab the text strings, <code>[https://ss64.com/bash/awk.html awk]</code> to count them, and <code>[https://ss64.com/bash/sort.html sort]</code> to sort and rank them. | ||
− | ==== Characters frequency (+sorted | + | ==== Characters frequency (+sorted) ==== |
− | <pre> | + | <pre class="console"> |
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt | $ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt | ||
</pre> | </pre> | ||
− | ==== Space-separated Words frequency (+sorted | + | ==== Space-separated Words frequency (+sorted) ==== |
− | <pre> | + | <pre class="console"> |
+ | # Spaces or punctuation separated words | ||
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt | $ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt | ||
− | # or | + | # Space or punctuation separated words, except if punctuation is : '- |
− | $ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS= | + | cat longtext.txt | sed 's/^\(.\)/\L\1/' | sed -E "s/((['-]*\w+)*)/\1\n/gi" | sed -E "s/ //g" | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt |
+ | # Space or line jump separated words | ||
+ | $ cat longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" | sort -n -r -t' ' -k1,1 > sorted-words.txt | ||
</pre> | </pre> | ||
− | ==== | + | ==== Loop on all .txt, recursively within folders ==== |
− | <pre> | + | <pre class="console"> |
find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt | find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt | ||
</pre> | </pre> | ||
==== Output ==== | ==== Output ==== | ||
− | <pre> | + | <pre class="console"> |
39626 aš | 39626 aš | ||
35938 ir | 35938 ir | ||
Line 57: | Line 121: | ||
</pre> | </pre> | ||
− | == | + | == Cleaning up frequency lists == |
− | Most sources provide wordlists with <code>number_of_apparitions item</code> | + | Most sources provide wordlists as <code>{number_of_occurrences}{separator}{item}</code> or its mirror <code>{item}{separator}{number_of_occurrences}</code>, already sorted from most to least frequent. We want to keep the field <code>{item}</code>, remove both the <code>{separator}</code> and <code>{number_of_occurrences}</code>, and add the prefix <code># </code>. | |
− | ==== | + | {| class="wikitable" style="width:100%" |
− | + | ! Input data we have || Output data we want | |
+ | |- valign="top" | ||
+ | | | ||
+ | <pre class="console"> | ||
+ | $ cat frequency-list.txt | ||
39626 aš | 39626 aš | ||
35938 ir | 35938 ir | ||
Line 69: | Line 137: | ||
... | ... | ||
</pre> | </pre> | ||
− | + | | | |
− | + | <pre class="console"> | |
− | + | $ cat words-list.txt | |
− | |||
− | <pre> | ||
− | |||
− | |||
− | |||
# aš | # aš | ||
# ir | # ir | ||
Line 84: | Line 147: | ||
# ... | # ... | ||
</pre> | </pre> | ||
+ | |- | ||
+ | !colspan=2| Command | ||
+ | |- | ||
+ | |colspan=2| | ||
+ | <pre class="console"> | ||
+ | cut frequency-list.txt -d$' ' -f2 | sed -E 's/^/# /g' > words-list.txt | ||
+ | # load file line by line, cut by space then keep field 2, replace start of line by # on all lines, print in file. | ||
+ | </pre> | ||
+ | |} | ||
This final result is what you want for LinguaLibre [[Help:Create your own lists]]. | This final result is what you want for LinguaLibre [[Help:Create your own lists]]. | ||
− | == | + | == Additional helpers == |
− | + | === Sort command === | |
− | + | See <code>[https://ss64.com/bash/sort.html man sort]</code> for details. | |
+ | :<code>-n</code>: numeric sort | ||
+ | :<code>-r</code>: reverse (descending) | ||
+ | :<code>-t ' '</code>: sets the field separator to the space character | ||
+ | :<code>-k1,1</code>: the sort key starts on field 1 and ends on field 1 | ||
− | + | === Counting lines of a file === | |
+ | <pre class="console">wc -l filename.txt # -l : lines</pre> | ||
+ | |||
+ | === See sample of a file === | ||
+ | <pre class="console">head -n 50 filename.txt # -n : number of lines</pre> | ||
− | == Splitting a very long file == | + | === Splitting a very long file === |
− | Words-lists files generally are be over 10k lines long, thus not convenient to run recording sessions. Given 1000 recordings per hour via LinguaLibre and 3 hours sessions being quite good | + | <pre class="console"> |
+ | split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt YUE-words-by-frequency- | ||
+ | </pre> | ||
+ | Word-list files are generally over 10k lines long, which is not convenient for recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being productive but intense, we recommend sub-files of: | ||
+ | * 1000 lines, so you use 1, 2 or 3 files per session | ||
+ | * 3000 lines, so you use 1 file per session and kill it off like a warrior ... if you and your speaker survive. | ||
See [https://stackoverflow.com/a/2016918 How to split a large text file into smaller files with equal number of lines in terminal?] | See [https://stackoverflow.com/a/2016918 How to split a large text file into smaller files with equal number of lines in terminal?] | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
<!-- | <!-- | ||
[How to select range of line from file via terminal ?](https://stackoverflow.com/questions/83329/how-can-i-extract-a-predetermined-range-of-lines-from-a-text-file-on-unix) | [How to select range of line from file via terminal ?](https://stackoverflow.com/questions/83329/how-can-i-extract-a-predetermined-range-of-lines-from-a-text-file-on-unix) | ||
Line 124: | Line 187: | ||
``` | ``` | ||
--> | --> | ||
− | |||
− | |||
− | |||
− | === | + | === Convert encoding === |
− | <pre> | + | <pre class="console"> |
+ | iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt | ||
+ | </pre> | ||
+ | |||
+ | === Create frequency list from [[:en:Dragon]] === | ||
+ | :''See also [https://codepen.io/hugolpz/pen/ByoKOK 101 Wikidata/Wikipedia API via JS]'' | ||
+ | <pre class="console"> | ||
+ | curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt | ||
+ | </pre> | ||
+ | |||
+ | === How to compare lists ? === | ||
+ | :[Section status: Draft, to continue.] ([https://github.com/hugolpz/audio-cmn/blob/master/hsk-missing-audios.bash example]). | ||
+ | |||
+ | <pre> | ||
+ | comm - compare two sorted files line by line | ||
+ | </pre> | ||
+ | |||
+ | == See also == | ||
+ | * https://dumps.wikimedia.org | ||
+ | |||
+ | {{Helps}} |
Latest revision as of 13:27, 22 November 2024
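Reminder: to start a recording session you need:
1. One LinguaLibre user,
2. One willing speaker, and
3. One list of items to record, with one item per line. One item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language.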
Reusing open license frequency lists
Hermit Dave's lists
Hermit Dave created 61 frequency lists from OpenSubtitles data, covering most major languages, under a CC license. The data requires minor cleanup; example with Korean (ko):
<pre>
mkdir -p ./clean # create a folder
google-chrome github.com/hermitdave/FrequencyWords/tree/master/content/2018 # open in a web browser to browse available languages
iso=ko # define your target language
curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, sort by 2nd column numerical value descending, cut by space then keep first field, add # to make a list, print all to file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
</pre>
On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-05000, etc. Example: List:Pol/words-by-frequency-00001-to-02000.
After creating the list on LinguaLibre, add the following to its talkpage:
<pre>
==== Source ====
{{Hermite Dave}}
</pre>
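The split files are numbered 00, 01, 02, and so on, while the list names use item ranges. A minimal sketch of a helper that maps one to the other is given below. The function name `range_label` is hypothetical, and the 5-digit zero padding is an assumption taken from the example list name above.

```shell
#!/bin/sh
# range_label INDEX SIZE   (hypothetical helper, not part of any tool)
# Prints the item range covered by chunk number INDEX (0-based, as numbered
# by `split -d`) when every chunk holds SIZE lines, zero-padded to 5 digits.
range_label() {
  # Strip leading zeros so that suffixes such as "08" are not read as octal.
  idx=$(echo "$1" | sed 's/^0*//')
  idx=${idx:-0}
  first=$(( idx * $2 + 1 ))
  last=$(( (idx + 1) * $2 ))
  printf '%05d-to-%05d\n' "$first" "$last"
}

# Example with chunks of 5000 items.
range_label 00 5000   # 00001-to-05000
range_label 01 5000   # 05001-to-10000
```

The last chunk is usually shorter than the others, so its upper bound should be adjusted by hand to the real number of items.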
UNILEX's lists
UNILEX is a Unicode Consortium project which curates 1001 languages. As many frequency lists are available under a GNU-like license. The data requires minor cleanup; example with Igbo (ig):
<pre>
mkdir -p ./clean # create a folder
google-chrome https://github.com/unicode-org/unilex/tree/main/data/frequency # open in a web browser to browse available languages
iso=ig # define your target language
curl https://raw.githubusercontent.com/unicode-org/unilex/main/data/frequency/${iso}.txt | tail -n +5 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, skip the header (tail -n +5 starts the output at line 5), sort by 2nd column numerical value descending, cut and keep first field, add # to make a list, print all to file.
split -d -l 5000 --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
</pre>
On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-05000, etc. Example: List:Pol/words-by-frequency-00001-to-02000.
After creating the list on LinguaLibre, add the following to its talkpage:
<pre>
==== Source ====
{{UNILEX License}}
</pre>
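The command above skips the header by line position. If the header length differs between files, a variant that keeps only real data lines may be safer. The sketch below assumes a tab-separated layout of word then count, as implied by the `cut -d$'\t' -f1` step. The function name `unilex_to_wordlist` is hypothetical.

```shell
#!/bin/sh
# unilex_to_wordlist   (hypothetical helper)
# Reads a tab-separated "word<TAB>count" file on stdin. Lines whose 2nd field
# is not a plain integer (comments, column titles, blank lines) are ignored.
# Output: one "# word" line per item, most frequent first.
unilex_to_wordlist() {
  awk -F '\t' '$2 ~ /^[0-9]+$/ { print $2 "\t" $1 }' |
    sort -t "$(printf '\t')" -k1,1nr |
    cut -f2- |
    sed 's/^/# /'
}

# Possible usage, with the variables defined in the example above:
#   curl "https://raw.githubusercontent.com/unicode-org/unilex/main/data/frequency/${iso}.txt" | unilex_to_wordlist > "./clean/${iso}-all.txt"
```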
Subtlex's lists
The Subtlex movement, a group of academic frequency-list studies based on open subtitles, also provides about 10 of the highest-quality frequency lists, with better cleaned-up items. These resources are published under various licenses, so their use on LinguaLibre must be assessed case by case.
Corpus
Requirements for a relevant corpus:
- Size: 2M+ words.
- Type: raw text.
- Language: monolingual, or close to it.
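The size requirement can be checked before going further. A minimal sketch follows, assuming the corpus is one or more plain-text files whose words are separated by whitespace; it will undercount for scripts written without spaces. Both function names are hypothetical.

```shell
#!/bin/sh
# corpus_word_count FILE...   (hypothetical helper)
# Prints the total number of whitespace-separated words in the given files.
corpus_word_count() {
  cat -- "$@" | wc -w | tr -d ' '
}

# corpus_is_large_enough MIN FILE...   (hypothetical helper)
# Succeeds (exit status 0) when the files hold at least MIN words.
corpus_is_large_enough() {
  min=$1
  shift
  [ "$(corpus_word_count "$@")" -ge "$min" ]
}

# Possible usage:
#   corpus_is_large_enough 2000000 longtext.txt && echo "corpus is large enough"
```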
Download a corpus
You can download an available corpus in your language or collect your own via some data mining. Corpora are easily available for about 60 languages. For rarer languages they are likely missing, and you will probably have to do the data mining yourself.
Some research centers curate the web to provide large corpora to linguists and netizens alike.
Project introduction | Type | Languages (2024) | Main portal | Language specific | Download link | Comments |
---|---|---|---|---|---|---|
Google Corpus Crawler, an open-source crawler for building corpora | crawler (Python); sentences (if run); frequency list | 1000+ | home | n.a. | aai (freq) | Python 3 project. Easy to add a crawler, send pull requests. |
OpenSubtitles 2016/2018 | Subtitles; parallel sentences; monolingual sentences | 75 | Portal | `br&en` | bre (mono) | Source: P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . Licence: unclear, "The corpora is made freely available to the research community on the OPUS website" (Lison and Tiedemann, 2016). |
Wortschatz by Leipzig | Sentences; monolingual | 290+ | | bre | bre: 100k sentences, WP 2021 | List of sentence corpora: API reference > https://api.wortschatz-leipzig.de/ws/corpora |
CC-100 | Sentences; monolingual | 115 | Portal | n.a. | br (mono) | « No claims of intellectual property are made on the work of preparation of the corpus. » |
Wiki(p)edia dumps
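One possibility is to harvest Wikipedia's contents. See:
- Introduction-wikipedia-corpus (https://github.com/hugolpz/introduction-wikipedia-corpus)
- Discussion on harvesting Wikipedia (https://github.com/google/corpuscrawler/issues/78)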
From corpus to frequency data `{occurrences} {item}`
The main tools are grep to grab the text strings, awk to count them, and sort to sort and rank them.
Characters frequency (+sorted)
$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt
Space-separated Words frequency (+sorted)
<pre>
# Words separated by spaces or punctuation
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
# Words separated by spaces or punctuation, except when the punctuation is ' or -
cat longtext.txt | sed 's/^\(.\)/\L\1/' | sed -E "s/((['-]*\w+)*)/\1\n/gi" | sed -E "s/ //g" | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
# Words separated by spaces or line breaks
$ cat longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" | sort -n -r -t' ' -k1,1 > sorted-words.txt
</pre>
Loop on all .txt, recursively within folders
find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt
Output
<pre>
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
</pre>
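The commands above count "The" and "the" as different items. The sketch below is one possible variant that lowercases the text before counting. It relies on GNU sed (`\L`) and GNU grep (`\w`), and needs a UTF-8 locale to treat non-ASCII letters correctly. The function name `word_frequency` is hypothetical.

```shell
#!/bin/sh
# word_frequency   (hypothetical helper)
# Reads text on stdin and prints "{occurrences} {item}" lines, most frequent
# first. Words are lowercased before counting; ' and - are kept inside words.
word_frequency() {
  # 1. lowercase each line (GNU sed extension \L)
  # 2. print one word per line (GNU grep extension \w)
  # 3. count identical words
  # 4. sort by count, highest first, then by word
  # 5. drop the padding added by uniq -c
  sed 's/.*/\L&/' |
    grep -oE "\w+(['-]\w+)*" |
    sort | uniq -c |
    sort -k1,1nr -k2 |
    awk '{ print $1, $2 }'
}

# Possible usage:
#   word_frequency < longtext.txt > sorted-words.txt
```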
Cleaning up frequency lists
Most sources provide wordlists as {number_of_occurrences}{separator}{item} or its mirror {item}{separator}{number_of_occurrences}, already sorted from most to least frequent. We want to keep the field {item}, remove both the {separator} and {number_of_occurrences}, and add the prefix "# ".
Input data we have:
<pre>
$ cat frequency-list.txt
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
</pre>
Output data we want:
<pre>
$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...
</pre>
Command:
<pre>
cut -d' ' -f2 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# read the file line by line, cut on spaces and keep field 2, insert "# " at the start of every line, write the result to a file.
</pre>
This final result is what you want for LinguaLibre Help:Create your own lists.
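The same clean-up can be applied to the mirror layout by choosing the field to keep. A minimal sketch is given below, with a hypothetical function name `to_wordlist` and a separator that defaults to a single space.

```shell
#!/bin/sh
# to_wordlist FIELD [SEPARATOR]   (hypothetical helper)
# Reads a frequency list on stdin, keeps column FIELD (1 or 2) and prefixes
# every line with "# ". SEPARATOR defaults to a single space.
to_wordlist() {
  cut -d "${2:- }" -f "$1" | sed 's/^/# /'
}

# Possible usage:
#   to_wordlist 2 < frequency-list.txt > words-list.txt                       # "count item" input
#   to_wordlist 1 "$(printf '\t')" < frequency-list.tsv > words-list.txt      # "item<TAB>count" input
```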
Additional helpers
Sort command
See man sort for details.
- -n: numeric sort
- -r: reverse (descending)
- -t ' ': sets the field separator to the space character
- -k1,1: the sort key starts on field 1 and ends on field 1
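A short illustration of why `-n` matters for frequency data, using three made-up lines:

```shell
#!/bin/sh
# Without -n the counts are compared as text, so "9" ranks above "100".
printf '9 ir\n10 tai\n100 as\n' | sort -r -t' ' -k1,1

# With -n the counts are compared as numbers: 100, then 10, then 9.
printf '9 ir\n10 tai\n100 as\n' | sort -n -r -t' ' -k1,1
```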
Counting lines of a file
wc -l filename.txt # -l : lines
See sample of a file
head -n 50 filename.txt # -n : number of lines
Splitting a very long file
split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt YUE-words-by-frequency-
Word-list files are generally over 10k lines long, which is not convenient for recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being productive but intense, we recommend sub-files of:
- 1000 lines, so you use 1, 2 or 3 files per session
- 3000 lines, so you use 1 file per session and kill it off like a warrior ... if you and your speaker survive.
See How to split a large text file into smaller files with equal number of lines in terminal?
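As a sketch, the split and a quick check of the result can be wrapped together. This assumes GNU `split` (for `-d` and `--additional-suffix`). The function name `split_wordlist` is hypothetical.

```shell
#!/bin/sh
# split_wordlist FILE PREFIX [LINES]   (hypothetical helper)
# Splits FILE into PREFIX00.txt, PREFIX01.txt, ... of LINES lines each
# (default 1000), then prints the line count of every chunk.
split_wordlist() {
  split -d -l "${3:-1000}" --additional-suffix=".txt" "$1" "$2"
  wc -l "$2"*.txt
}

# Possible usage:
#   split_wordlist YUE-words-by-frequency.txt YUE-words-by-frequency- 1000
```

Only the last chunk should be shorter than the requested size.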
Convert encoding
iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt
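Before converting, it can help to check whether a file is already valid UTF-8, since converting it a second time would corrupt it. The sketch below uses only `iconv`. Both function names are hypothetical, and the source encoding still has to be known or guessed by the user.

```shell
#!/bin/sh
# is_utf8 FILE   (hypothetical helper)
# Succeeds when FILE decodes cleanly as UTF-8.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# to_utf8 FROM_ENCODING FILE   (hypothetical helper)
# Prints FILE as UTF-8: unchanged when it is already valid UTF-8,
# converted from FROM_ENCODING otherwise.
to_utf8() {
  if is_utf8 "$2"; then
    cat "$2"
  else
    iconv -f "$1" -t UTF-8 "$2"
  fi
}

# Possible usage:
#   to_utf8 GB18030 SUBTLEX-CH-WF.csv > words-utf8.txt
```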
Create frequency list from en:Dragon
- See also 101 Wikidata/Wikipedia API via JS
curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt
How to compare lists ?
- [Section status: Draft, to continue.] (example).
comm - compare two sorted files line by line
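As an illustration of `comm`, the sketch below lists the items of one list that are absent from another, for instance words still to be recorded. The function name `missing_items` and the file names are hypothetical. Both inputs are sorted first because `comm` requires sorted input.

```shell
#!/bin/sh
# missing_items WANTED RECORDED   (hypothetical helper)
# Prints the lines of WANTED that do not appear in RECORDED.
missing_items() {
  tmpdir=$(mktemp -d)
  # comm needs both inputs sorted with the same collation; use the C locale.
  LC_ALL=C sort -u "$1" > "$tmpdir/wanted"
  LC_ALL=C sort -u "$2" > "$tmpdir/recorded"
  # -2 hides lines only in RECORDED, -3 hides lines present in both.
  LC_ALL=C comm -23 "$tmpdir/wanted" "$tmpdir/recorded"
  rm -rf "$tmpdir"
}

# Possible usage:
#   missing_items words-list.txt already-recorded.txt > still-to-record.txt
```

The output is in byte order, not in frequency order, so the original ranking is lost.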