Difference between revisions of "How to create a frequency list?"

Revision as of 18:16, 19 May 2018

Nutshell : To start a recording session you needs

one LinguaLibre user,
one willing speaker, and
a list of items to record with one item by line. One item can be any easy to read sign, word, sentence or paragraph. The most common use-case is to record a comprehensive words list for your target language.

After reading this page you will be able to create your own frequency list, clean and split into easy-to-handle files.

Some context

Priorities : With limited recording abilities, it is better to use frequency list to record the most frequent words first. With unlimited recording abilities, the order doesn’t matter much since we we assume that all the target words will eventually be recorded.

Corpus’purpose : As for language’s learning, written transcripts of spoken language such as films’ subtitles are known to be better materials (see SUBTLEX studies, 2007). Other corpuses will also allows you to do a good work to provide audio recording. For lexicographic purposes as Wiktionary, rare words are as interesting as frequent words, and the aim is to provide all items with their audio.

Consistency : It is best to provide consistent audio data, with same neutral or enhousiastic tone and same speaker.

Lexicon range for learners : For language learners and assuming learning via the most frequent words, a minimum vocabulary of 2000-2500 base-words is required to move the learner to autonomous level. Language teaching academics name this level the “threshold level”. The CEFR (Common European Framework of Reference for Languages: Learning, Teaching, Assessment) (doc), Chinese’s HSK and some academic statements lead to the following relation between lexicon size, CEFR level and competence :

Lexicon(*)	Levels	CEFR’s descriptors
600	A1	“Basic user. Breakthrough or beginner”. Survival communication, expressing basic needs.
1,200	A2	“Basic user. Waystage or elementary”
2,500	B1	“Independant user. Threshold or intermediate”.
5,000	B2	“Independant user. Vantage or upper intermediate”
20,000	C2	“Mastery or proficiency”. Native after graduation from highschool.

(*) : Assuming the most frequent word-families learnt first.

See also CEFR 5.2.1.1 (image) :

	VOCABULARY RANGE
C2	Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning.
C1	Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms.
B2	Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and circumlocution.
B1	Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics.
A2	Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs.
A1	Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations.

	VOCABULARY CONTROL
C2	Consistently correct and appropriate use of vocabulary.
C1	Occasional minor slips, but no significant vocabulary errors.
B2	Lexical accuracy is generally high, though some confusion and incorrect word choice does occur without hindering communication.
B1	Shows good control of elementary vocabulary but major errors still occur when expressing more complex thoughts or handling unfamiliar topics and situations.
A2	Can control a narrow repertoire dealing with concrete everyday needs.
A1	No descriptor available


Users of the Framework may wish to consider and where appropriate state: • which lexical elements (fixed expressions and single word forms) the learner will need/be equipped/be required to recognise and/or use; • how they are selected and ordered

Getting my corpus

Download corpuses

You can download available corpuses in your language or collect your own corpus via some datamining. Corpuses are easily available for about 60 languages. Corpuses for rare language are likely missing, you will likely have to do some data mining.

Some research centers are curating the web to provide large corpuses to linguists and netizens alike.

Jörg Tiedemann, 2009, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In N. Nicolov and K. Bontcheva and G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia
- http://opus.lingfil.uu.se/OpenSubtitles2016.php

Datamining

When you have a solid corpus with 2 millions words, you can process it so you get a words frequency list. For datamining, Python and other languages are your friends to gather data and/or process various those directories of files.

From corpus to frequency data {item}{occurences}

Characters frequency (+sorted!)

$ grep -o '\S' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt

Space-separated Words frequency (+sorted!):

$ grep -o '\w*' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt
# or 
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

On all .txt of a folder and its subfolders

find -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myfileout.txt

Output : myfileout.txt

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

From frequency data to clean list of {item}s

Most sources provide wordlists with <word> <number_of_apparitions> such as :

Input : frequency-list.txt

39626 aš
35938 ir
33361 tai
28520 tu'21th
26213 kad'toto
...

Command

To clean up, we recommend sed’s -r or -E:

sed  -E 's/^[0-9]+ //g' frequency-list.txt > words-list.txt

Output : words-list.txt

aš
ir
tai
tu'21th
kad'toto
...

This final result is what you want for LinguaLibre.

Splitting the file

Words-lists files generally are be over 10k lines long, thus not convenient to run recording sessions. Given 1000 recordings per hour via LinguaLibre and 3 hours sessions being quite good andintense, we recommend sub-files of : - 1000 lines, so you use 1, 2 or 3 files per session ; - 3000 lines, so you use 1 file per session and kill it off like a warrior

See How to split a large text file into smaller files with equal number of lines in terminal?

split -d -l 2000 FRA-mysource-words-list.txt  FRA-mysource-words-list-

    $ split --help
    Usage: split [OPTION] [INPUT [PREFIX]]
    Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
    size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
    is -, read standard input.
    
    Mandatory arguments to long options are mandatory for short options too.
      -a, --suffix-length=N   use suffixes of length N (default 2)
      -b, --bytes=SIZE        put SIZE bytes per output file
      -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
      -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
      -l, --lines=NUMBER      put NUMBER lines per output file
          --verbose           print a diagnostic to standard error just
                                before each output file is opened
          --help     display this help and exit
          --version  output version information and exit

Counting lines of a file

wc -l filename.txt       # -l : lines

See sample of a file

head -n 50 filename.txt       # -n : number of line

@@ Line 34: / Line 34: @@
 (*) : Assuming the most frequent word-families learnt first.
-See also CEFR 5.2.1.1 :
+See also [https://rm.coe.int/1680459f97 CEFR 5.2.1.1] ([https://i.stack.imgur.com/1fLE2.png image]) :
+{| class="wikitable"
+! || VOCABULARY RANGE
+|-
+| C2 || Has a good command of a very broad lexical repertoire including idiomatic expressions and
+colloquialisms; shows awareness of connotative levels of meaning.
+|-
+| C1 || Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with
+circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of
+idiomatic expressions and colloquialisms.
+|-
+| B2 || Has a good range of vocabulary for matters connected to his/her field and most general topics. Can
+vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and
+circumlocution.
+|-
+| B1 || Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to
+his/her everyday life such as family, hobbies and interests, work, travel, and current events.
+Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and
+topics.
+|-
+| A2 || Has a sufficient vocabulary for the expression of basic communicative needs.<br>
+Has a sufficient vocabulary for coping with simple survival needs.
+|-
+| A1 || Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete
+situations.
+|}
-[[File:https://i.stack.imgur.com/1fLE2.png]]
+{| class="wikitable"
+!  || VOCABULARY CONTROL
+|-
+| C2 || Consistently correct and appropriate use of vocabulary.
+|-
+| C1 || Occasional minor slips, but no significant vocabulary errors.
+|-
+| B2 || Lexical accuracy is generally high, though some confusion and incorrect word choice does occur without
+hindering communication.
+|-
+| B1 || Shows good control of elementary vocabulary but major errors still occur when expressing more complex
+thoughts or handling unfamiliar topics and situations.
+|-
+| A2 || Can control a narrow repertoire dealing with concrete everyday needs.
+|-
+| A1 || No descriptor available
+|}
+{| class="wikitable"
+!
+|-
+| Users of the Framework may wish to consider and where appropriate state:
+• which lexical elements (fixed expressions and single word forms) the learner will need/be
+equipped/be required to recognise and/or use;
+• how they are selected and ordered
+|}
 == Getting my corpus ==

Help