How to create a frequency list?

Revision as of 16:25, 28 December 2018

In a nutshell: to start a recording session you need

one LinguaLibre user,
one willing speaker, and
a list of items to record, with one item per line. An item can be any easy-to-read sign, word, sentence or paragraph. The most common use case is to record a comprehensive word list for your target language.

After reading this page you will be able to create your own frequency list, clean it, and split it into easy-to-handle files.

Getting my corpus

Download corpuses

You can download an available corpus in your language or collect your own corpus via some datamining. Corpuses are easily available for about 60 languages. For rarer languages they are likely missing, so you will probably have to do some datamining yourself.

Some research centers are curating the web to provide large corpuses to linguists and netizens alike.

Jörg Tiedemann, 2009, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia
- http://opus.nlpl.eu/OpenSubtitles2016.php
- http://opus.nlpl.eu/OpenSubtitles2018.php

Datamining

Once you have a solid corpus of about 2 million words, you can process it to get a word frequency list. For datamining, Python and other languages are your friends, both to gather data and to process those directories of files.
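A quick way to check the size of a corpus is `wc -w`, which counts whitespace-separated words. The lines below are a minimal sketch, assuming the corpus is a folder of .txt files; the folder and file names are placeholders created only for this demonstration.

```shell
# Build a tiny throwaway corpus so the commands can run as-is (made-up content).
corpus=$(mktemp -d)
printf 'tai ir kad\n' > "$corpus/a.txt"
printf 'tai ir tu\ntai\n' > "$corpus/b.txt"

# Concatenate every .txt file and count whitespace-separated words.
total=$(cat "$corpus"/*.txt | wc -w)
echo "$total"
```

On a real corpus, replace "$corpus" with the path of your own folder and compare the result with your target size.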

From corpus to frequency data `{occurrences} {item}`

The main tools are grep to grab the text strings, awk to count them, and sort to rank them.

For sort:

-n: numeric sort

-r: reverse (descending)

-t: sets the field separator, here to the ' ' (space) character

-k: as in -k1,1, the sort key starts on field 1 and ends on field 1
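To see what these options change, here is a minimal sketch on a three-line sample (the sample data is invented for illustration). It runs the sort once with -n and once without, to show why the numeric option matters for a frequency list.

```shell
# Three made-up "{occurrences} {item}" lines, deliberately out of order.
sample=$(printf '9 ir\n10 tai\n2 kad\n')

# Numeric, descending sort on field 1: the ranking we want.
numeric=$(printf '%s\n' "$sample" | LC_ALL=C sort -n -r -t' ' -k1,1)
printf '%s\n' "$numeric"
# 10 tai
# 9 ir
# 2 kad

# Same command without -n: counts are compared as text, so "9" ranks above "10".
textual=$(printf '%s\n' "$sample" | LC_ALL=C sort -r -t' ' -k1,1)
printf '%s\n' "$textual"
# 9 ir
# 2 kad
# 10 tai
```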

Character frequency (+sorted!)

$ grep -o '\S' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-letters.txt

Space-separated word frequency (+sorted!):

$ grep -o '\w*' longtext.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1  > sorted-words.txt
# or 
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort -n -r -t' ' -k1,1 > sorted-words.txt
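A common alternative to the awk counter is the `sort | uniq -c` pair, which can be easier to remember. The sketch below is only an illustration on a made-up sample file; it uses `\w+` with `grep -E` so that empty matches never arise, and an extra awk step to strip the padding that `uniq -c` puts before each count.

```shell
# Small made-up sample text, written to a temporary file.
text=$(mktemp)
printf 'tai ir tai\nir tai kad\n' > "$text"

# One word per line, sort so identical words are adjacent (uniq needs that),
# count each group, normalise to "{occurrences} {item}", then rank descending.
ranked=$(grep -oE '\w+' "$text" | LC_ALL=C sort | uniq -c \
  | awk '{print $1, $2}' | LC_ALL=C sort -n -r -t' ' -k1,1)
printf '%s\n' "$ranked"
# 3 tai
# 2 ir
# 1 kad
```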

On all .txt files of a folder and its subfolders

$ find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort -n -r -t' ' -k1,1 > sorted-words.txt

Output

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

From frequency data to clean list of {item}s

Most sources provide word lists as number_of_occurrences item, such as:

Input : frequency-list.txt

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

Command

To clean up, we recommend sed with extended regular expressions (option -E, or -r on GNU sed):

sed  -E 's/^[0-9]+ /# /g' frequency-list.txt > words-list.txt

Output : words-list.txt

$ cat words-list.txt
# aš
# ir
# tai
# tu'21th
# kad'toto
# ...

This final result is the format you want for LinguaLibre; see Help:Create your own lists.

How to compare lists?

[Section status: Draft, to continue.]

You will frequently need to compare two word lists A and B, for example to find which words are in list B but not in list A. Use the shell tool `comm`.

comm - compare two sorted files line by line
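As a minimal sketch of such a comparison, the lines below build two small made-up lists and print the items that are in list B but not in list A. The file names are placeholders; the only firm requirement is that both inputs are sorted the same way before `comm` reads them.

```shell
# Two made-up word lists, one item per line.
work=$(mktemp -d)
printf 'tai\nir\nkad\n' > "$work/list-A.txt"
printf 'kad\ntu\nir\nbet\n' > "$work/list-B.txt"

# comm requires sorted input, so sort both lists first (same locale for both).
LC_ALL=C sort -u "$work/list-A.txt" > "$work/A.sorted"
LC_ALL=C sort -u "$work/list-B.txt" > "$work/B.sorted"

# -1 hides lines found only in A, -3 hides lines found in both:
# what remains is "in B but not in A".
only_in_b=$(LC_ALL=C comm -13 "$work/A.sorted" "$work/B.sorted")
printf '%s\n' "$only_in_b"
# bet
# tu
```

Swapping the options to `-23` gives the opposite view (in A but not in B), and `-12` gives the items common to both lists.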

Splitting a very long file

Word-list files are generally over 10k lines long, which is not convenient for running recording sessions. Given about 1000 recordings per hour via LinguaLibre, and 3-hour sessions being already quite good and intense, we recommend sub-files of:
- 1000 lines, so you use 1, 2 or 3 files per session;
- 3000 lines, so you use 1 file per session and kill it off like a warrior... if your speaker and yourself survive.

See How to split a large text file into smaller files with equal number of lines in terminal?

    split -d -l 2000 FRA-mysource-words-list.txt  FRA-mysource-words-list-

Manual:

    $ split --help
    Usage: split [OPTION] [INPUT [PREFIX]]
    Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
    size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
    is -, read standard input.
    
    Mandatory arguments to long options are mandatory for short options too.
      -a, --suffix-length=N   use suffixes of length N (default 2)
      -b, --bytes=SIZE        put SIZE bytes per output file
      -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
      -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
      -l, --lines=NUMBER      put NUMBER lines per output file
          --verbose           print a diagnostic to standard error just
                                before each output file is opened
          --help     display this help and exit
          --version  output version information and exit
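To check what split produces, here is a small sketch on a made-up 2500-line list. The folder, the file names and the 1000-line chunk size are assumptions chosen for the demonstration; `-d` is used as in the command above.

```shell
# Throwaway folder with a made-up 2500-line list (names are placeholders).
work=$(mktemp -d)
seq 1 2500 | sed 's/^/# item/' > "$work/words-list.txt"

# Cut into chunks of 1000 lines with numeric suffixes:
# words-list-00, words-list-01, words-list-02.
split -d -l 1000 "$work/words-list.txt" "$work/words-list-"

# Check the result: number of chunks, and number of lines in the last one.
chunks=$(ls "$work"/words-list-[0-9]* | wc -l)
last=$(wc -l < "$work/words-list-02")
echo "$chunks chunks, last one has $last lines"
```

The last chunk simply holds whatever lines remain, so it is usually shorter than the others.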

Other utilities

Counting lines of a file

wc -l filename.txt       # -l : lines

See a sample of a file

head -n 50 filename.txt       # -n : number of lines
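head only shows the top of a file. To look at the end, or at a slice in the middle of a long list, `tail` and `sed -n` can help. This is a sketch on a made-up 3000-line file, not part of the original recipe.

```shell
# Made-up 3000-line list in a temporary file.
list=$(mktemp)
seq 1 3000 | sed 's/^/item/' > "$list"

# Last 3 lines of the file.
tail -n 3 "$list"

# Lines 1001 to 1003 only: -n suppresses the default output, p prints the range.
middle=$(sed -n '1001,1003p' "$list")
printf '%s\n' "$middle"
# item1001
# item1002
# item1003
```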
