How to create a frequency list?

Reminder: to start a recording session you need one LinguaLibre user, one willing speaker, and one list of items to record, with one item per line. An item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language.

Reusing open license frequency lists

Hermite Dave's lists

Hermite Dave created 61 frequency lists from OpenSubtitles data, covering most major languages, under a CC license. The data requires minor cleanup. Example with Korean (ko):

Source

Product (sorted by frequency)

# 그
# 난
# 내
# 1안
# 수
# 1이
# 1네
# 0거야
# 1좀

mkdir -p ./clean                                                              # create a folder
google-chrome https://github.com/hermitdave/FrequencyWords/tree/master/content/2018   # open in a web browser to see the available languages
iso=ko                                                                        # define your target language
curl https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/${iso}/${iso}_50k.txt | sort -k 2,2 -n -r | cut -d' ' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, sort by the numeric value of the 2nd column in descending order, split on spaces and keep the first field, prefix each line with "# " to make a list, write everything to a file.
split -d -l 5000  --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items

On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-05000, etc. Example: List:Pol/words-by-frequency-00001-to-02000.
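split numbers its output files 00, 01, 02, and so on, whereas the list titles use item ranges. The sketch below prints one title per chunk. It assumes a five-digit zero-padded range as in the example title; the language code, chunk size and total item count are hypothetical values to replace with your own.

```shell
mkdir -p ./demo
code=Pol         # hypothetical language code, capitalised as in the list title
size=5000        # items per chunk, the value given to split -l
total=12000      # hypothetical number of items in the full list
start=1
while [ "$start" -le "$total" ]; do
  end=$(( start + size - 1 ))
  if [ "$end" -gt "$total" ]; then end=$total; fi    # last chunk may be shorter
  printf 'List:%s/words-by-frequency-%05d-to-%05d\n' "$code" "$start" "$end"
  start=$(( end + 1 ))
done > ./demo/list-titles.txt
cat ./demo/list-titles.txt
```

With these values the file holds three titles, the last one covering items 10001 to 12000.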

After creating the list on LinguaLibre, add the following to its talk page:

==== Source ====
{{Hermite Dave}}

UNILEX's lists

UNILEX is a Unicode Consortium project providing lists for 1001 languages, published under a permissive open license (SPDX-License-Identifier: Unicode-DFS-2016). The data requires numeric sorting and minor cleanup. Example with Basque (Euskara, eu):

Source

Product (sorted by frequency)

Form	Frequency

# SPDX-License-Identifier: Unicode-DFS-2016
# Corpus-Size: 122937

A	24402
Aaronen	24402
Abba	24402
Abel	24402
Abelek	16268
aberasbide	16268
aberastasun	32536
aberastasuna	65073
aberastasunak	56939
aberastu	24402
aberats	203356
aberatsa	81342
aberatsak	65073
aberatsok	16268
abere	40671
…

# eta
# ez
# ere
# egin
# zuen
# da
# zen
# izan
# esan
# baina
# bere
# bat
# hau
# du
# Jainkoaren
# Jainkoak
# egiten
# zion
# Jesusek
# zuten
…

mkdir -p ./clean                                                              # create a folder
google-chrome https://github.com/unicode-org/unilex/tree/main/data/frequency  # open in a web browser to see the available languages
iso=eu                                                                        # define your target language
curl https://raw.githubusercontent.com/unicode-org/unilex/main/data/frequency/${iso}.txt | tail -n +6 | sort -k 2,2 -n -r | cut -d$'\t' -f1 | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# download, remove the first 5 lines (column header and license comments), sort by the numeric value of the 2nd column in descending order, split on tabs and keep the first field, prefix each line with "# " to make a list, write everything to a file.
split -d -l 5000  --additional-suffix=".txt" ./clean/${iso}-all.txt ./clean/${iso}-words-by-frequency-
# split into files of 5000 items
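The command above relies on the header having a fixed number of lines. As a sketch of an alternative that does not depend on that count, one can keep only the lines whose second tab-separated field is a number. The sample file below is a small stand-in written locally, not the real download, and its header line is an assumption about the file layout.

```shell
mkdir -p ./demo
# Stand-in file: column header, blank lines, "#" metadata, then "form<TAB>count" rows.
printf 'Form\tFrequency\n\n# SPDX-License-Identifier: Unicode-DFS-2016\n# Corpus-Size: 122937\n\nabere\t40671\naberats\t203356\naberatsa\t81342\n' > ./demo/eu-sample.txt

# Keep rows with a numeric 2nd field, sort by it in descending order, keep the form, add "# ".
awk -F'\t' '$2 ~ /^[0-9]+$/' ./demo/eu-sample.txt \
  | sort -t "$(printf '\t')" -k2,2nr \
  | cut -f1 \
  | sed 's/^/# /' > ./demo/eu-all.txt
cat ./demo/eu-all.txt
```

Header, comment and blank lines have no numeric second field, so they are dropped however many there are.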

On LinguaLibre.org, create your lists as List:{Iso3}/words-by-frequency-00001-to-05000, etc. Example: List:Pol/words-by-frequency-00001-to-02000.

After creating the list on LinguaLibre, add the following to its talk page:

==== Source ====
{{UNILEX License}}

Panlex's lists

Panlex provides Swadesh 110 lists in 2,111 languages and Swadesh 207 lists in 776 languages, published under a CC0 licence. The data requires minor cleanup. Example with Spanish (spa):

Source

Product

cada	cada uno	todo
ceniza	cenizas
corteza
barriga	panza	vientre
grande
ave	pájaro
morder
negro
sangre
hueso
pecho	seno

garra
nube
frío
…

# cada
# cada uno
# todo
# ceniza
# cenizas
# corteza
# barriga
# panza
# vientre
# grande
# ave
# pájaro
# morder
# negro
# sangre
# hueso
# pecho
# seno
# 
# garra
# nube
# frío

…

mkdir -p ./clean                                                              # create a folder
google-chrome https://db.panlex.org/                                          # open in a web browser to see the available languages
iso=spa                                                                       # define your target language (3-letter code, as used in the file names)
curl -O https://db.panlex.org/panlex_swadesh.zip && unzip panlex_swadesh.zip  # download, unzip
cat panlex_swadesh/swadesh110/${iso}-000.txt | sed 's/\t/\n/g' | sed -E 's/^/# /g' > ./clean/${iso}-all.txt
# split on the tab separator, prefix each line with "# " to make a list, write everything to a file.
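Source lines that are empty (a concept with no known word) end up as bare "# " entries in the product. If you prefer to drop them, a possible variant filters blank lines before adding the prefix. The sample file is a local stand-in for a Swadesh file, so nothing is downloaded.

```shell
mkdir -p ./demo
# Stand-in file: one concept per line, synonyms separated by tabs, one empty line.
printf 'ave\tpájaro\nmorder\n\npecho\tseno\n' > ./demo/spa-sample.txt

# One word per line, drop blank lines, prefix with "# ".
tr '\t' '\n' < ./demo/spa-sample.txt \
  | grep -v '^[[:space:]]*$' \
  | sed 's/^/# /' > ./demo/spa-all.txt
cat ./demo/spa-all.txt
```

Note that removing blank lines also removes the alignment between line numbers and Swadesh concept numbers, which may or may not matter for your use.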

On LinguaLibre.org, create your lists as List:{Iso3}/Swadesh110 or List:{Iso3}/Swadesh207. Example: List:Spa/Swadesh110.
After creating the list on LinguaLibre, add the following to its talk page:

==== Source ====
{{Panlex License}}

Subtlex's lists

The Subtlex movement, a group of academic frequency-list studies based on open subtitles, also provides about 10 frequency lists of the highest quality, with better cleaned-up items. These resources are published under various licenses, so their use on LinguaLibre must be assessed case by case.

Corpus

Requirements for a relevant corpus:

Size: 2M+ words.
Type: raw text.
Language: monolingual, or nearly so.
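The size requirement can be checked quickly with wc. The sketch below uses a tiny local stand-in corpus; replace the file name with your own. It counts whitespace-separated words, so it will undercount for languages written without spaces between words.

```shell
mkdir -p ./demo
printf 'labas rytas\nlabas vakaras visiems\n' > ./demo/corpus-sample.txt   # tiny stand-in corpus

min_words=2000000                                        # the 2M+ threshold listed above
words=$(wc -w < ./demo/corpus-sample.txt | tr -d ' ')    # whitespace-separated word count
if [ "$words" -ge "$min_words" ]; then
  echo "OK: $words words"
else
  echo "Too small: $words words (need $min_words)"
fi
```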

Download a corpus

You can download an existing corpus in your language or collect your own through data mining. Corpora are easily available for about 60 languages. For rarer languages a ready-made corpus is likely missing, so you will probably have to do some data mining yourself.

Some research centers curate web content to provide large corpora to linguists and netizens alike.

Project introduction	Type	Languages (2024)	Portal all	Language specific	Download link	Comments
Google Corpus Crawler is an open-source crawler for building corpora	crawler (Python) sentences (if run) frequency list	1000+ languages	home	n.a.	aai (freq)	Python 3 project. Easy to add a crawler, send pull requests.
OpenSubtitles 2016/2018	Subtitles Parallel sentences Monolingual sentences	75	Portal	`br&en`	bre (mono)	Source: P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . Licence: unclear, "The corpora is made freely available to the research community on the OPUS website" − Lison and Tiedemann (2016).
Wortschatz by Leipzig	Sentences Monolingual	290+		bre	bre: 100k sentences, WP 2021	List of sentences corpora : API reference > https://api.wortschatz-leipzig.de/ws/corpora
CC-100	Sentences Monolingual	115	Portal	n.a.	br (mono)	« No claims of intellectual property are made on the work of preparation of the corpus. »

Wiki(p)edia dumps

One possibility is to harvest Wikipedia's contents. See:

From corpus to frequency data `{occurrences} {item}`

The main tools are grep to extract the text strings, awk to count them, and sort to rank them.

Characters frequency (+sorted)

$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt

Space-separated Words frequency (+sorted)

# Words separated by spaces or punctuation
$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  > sorted-words.txt
# Words separated by spaces or punctuation, except when the punctuation is ' or -
$ cat longtext.txt | sed 's/^\(.\)/\L\1/' | sed -E "s/((['-]*\w+)*)/\1\n/gi" | sed -E "s/ //g" | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  > sorted-words.txt
# Words separated by spaces or line breaks
$ cat longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" | sort -n -r -t' ' -k1,1 > sorted-words.txt

Loop on all .txt, recursively within folders

find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt

Output

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...
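To see the counting step end to end, here is a small self-contained run on a two-line sample text written locally. It is a sketch rather than a replacement for the commands above: it splits on whitespace with tr instead of grep's \w, because what \w matches for non-ASCII letters depends on the locale, and it adds a second sort key so that ties come out in a stable order.

```shell
mkdir -p ./demo
printf 'tai ir tai\nir tai aš\n' > ./demo/longtext-sample.txt

# One token per line, count with awk, rank by count (ties broken by byte order of the item).
tr -s ' \t' '\n' < ./demo/longtext-sample.txt \
  | awk 'NF {a[$1]++} END {for (k in a) print a[k], k}' \
  | LC_ALL=C sort -t' ' -k1,1nr -k2,2 > ./demo/sorted-words-sample.txt
cat ./demo/sorted-words-sample.txt
```

The result lists tai (3 occurrences), then ir (2), then aš (1), in the same `{occurrences} {item}` layout as the output above.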

Cleaning up frequency lists

Most sources provide word lists as {number_of_occurrences}{separator}{item} or its mirror {item}{separator}{number_of_occurrences}, already sorted from most to least frequent. We want to keep the {item} field, remove both the {separator} and the {number_of_occurrences}, and add the prefix "# ".

Input data we have

Output data we want

$ cat frequency-list.txt
39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...

Command

cut -d' ' -f2 frequency-list.txt | sed -E 's/^/# /g' > words-list.txt
# read the file line by line, split on spaces and keep field 2, prefix every line with "# ", write to a file.

This final result is the format you need for LinguaLibre; see Help:Create your own lists.
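The cut command handles the count-first layout only. For sources that use the mirror layout, a possible sketch lets awk pick whichever of the two fields is not a pure number. The function name to_list and the sample files are hypothetical, and the approach assumes single-word items that are not themselves numbers.

```shell
mkdir -p ./demo
printf '39626 aš\n35938 ir\n' > ./demo/count-first.txt     # {count} {item}
printf 'aš 39626\nir 35938\n' > ./demo/item-first.txt      # {item} {count}

# Print the non-numeric field of each line, prefixed with "# ".
to_list() {
  awk '{ item = ($1 ~ /^[0-9]+$/) ? $2 : $1; print "# " item }' "$1"
}
to_list ./demo/count-first.txt > ./demo/words-a.txt
to_list ./demo/item-first.txt  > ./demo/words-b.txt
cat ./demo/words-a.txt
```

Both input layouts produce the same two-line list.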

Additional helpers

Sort command

See man sort for details.

-n: numeric sort

-r: reverse (descending)

-t ' ': sets the field separator to the space character

-k 1,1: the sort key starts on field 1 and ends on field 1
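A short illustration of why -n matters, using a three-line sample file created locally: without it, counts are compared as text, so 9 ranks above 100.

```shell
mkdir -p ./demo
printf '9 ir\n100 tai\n25 kad\n' > ./demo/counts.txt

sort -r -t' ' -k1,1 ./demo/counts.txt                                 # text comparison: 9, 25, 100
sort -n -r -t' ' -k1,1 ./demo/counts.txt > ./demo/counts-sorted.txt   # numeric comparison: 100, 25, 9
cat ./demo/counts-sorted.txt
```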

Counting lines of a file

wc -l filename.txt       # -l : lines

See sample of a file

head -n 50 filename.txt       # -n : number of lines to show

Splitting a very long file

split -d -l 2000 --additional-suffix=".txt" YUE-words-by-frequency.txt  YUE-words-by-frequency-

Word-list files are generally over 10k lines long, which is not convenient for recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being productive but intense, we recommend sub-files of:

1000 lines, so you use 1, 2 or 3 files per session
3000 lines, so you use 1 file per session and kill it off like a warrior ... if you and your speaker survive.
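After splitting, it is worth checking that no line was lost. The sketch below builds a stand-in list of 2500 numbered items, splits it into 1000-line files with the same GNU split options used above, and compares the concatenated parts with the original. File and folder names are hypothetical.

```shell
mkdir -p ./demo/parts
seq 1 2500 | sed 's/^/# item-/' > ./demo/all-items.txt       # stand-in list of 2500 items
split -d -l 1000 --additional-suffix=".txt" ./demo/all-items.txt ./demo/parts/items-
wc -l ./demo/parts/items-*.txt                                # line count of each part
cat ./demo/parts/items-*.txt | cmp - ./demo/all-items.txt && echo "no line lost"
```

The last file is shorter than the others whenever the total is not a multiple of the chunk size.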

See How to split a large text file into smaller files with equal number of lines in terminal?

Convert encoding

iconv -f "GB18030" -t "UTF-8" SUBTLEX-CH-WF.csv -o $iso2-words.txt

Create frequency list from en:Dragon

See also 101 Wikidata/Wikipedia API via JS

curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xml' | tr '\040' '\012' | sort | uniq -c | sort -k 1,1 -n -r > output.txt

How to compare lists ?

[Section status: Draft, to continue.] (example).

 comm - compare two sorted files line by line
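As a starting point for this draft section, here is a minimal sketch of comm on two small lists written locally. comm expects both inputs to be sorted in the same order, so the lists are sorted under one fixed locale first, which loses the frequency order. The file names are hypothetical.

```shell
mkdir -p ./demo
printf '# ir\n# tai\n# kad\n' > ./demo/list-a.txt
printf '# tai\n# bet\n# ir\n' > ./demo/list-b.txt

# Sort both lists the same way before comparing.
LC_ALL=C sort ./demo/list-a.txt > ./demo/a.sorted
LC_ALL=C sort ./demo/list-b.txt > ./demo/b.sorted

LC_ALL=C comm -12 ./demo/a.sorted ./demo/b.sorted > ./demo/common.txt   # items present in both lists
LC_ALL=C comm -23 ./demo/a.sorted ./demo/b.sorted > ./demo/only-a.txt   # items present only in list A
cat ./demo/common.txt
```

Here -12 suppresses the columns of lines unique to each file, leaving the shared items, and -23 leaves the items found only in the first file.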

Word lists sorted by frequency are a very good way to cover a language methodically. After reading this page you will be able to find or create your own frequency list, then clean it and split it into easy-to-handle files.
