Why wordlists matter?

Some context

Priorities: With limited recording capacity, it is better to follow a frequency list and record the most frequent words first. With unlimited recording capacity, the order matters little, since we assume that all the target words will eventually be recorded. Frequency lists correlate strongly across languages[1].
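
The prioritisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `recording_queue`, its parameters, and the toy token list are hypothetical and are not part of Lingua Libre. The frequency list is derived here from a tokenised corpus by simple counting.

```python
from collections import Counter


def recording_queue(corpus_tokens, already_recorded, budget):
    """Return up to `budget` words to record next, most frequent first.

    Hypothetical helper: counts how often each token occurs, ranks the
    words by descending frequency, and skips words that already have audio.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    ranked = [word for word, _ in counts.most_common()
              if word not in already_recorded]
    return ranked[:budget]


# Toy corpus, assumed for the example only.
tokens = "the cat and the dog and the bird".split()
queue = recording_queue(tokens, already_recorded={"and"}, budget=2)
print(queue)
```

With a small budget, only the top of the ranking is recorded. With a budget as large as the list itself, every word ends up in the queue and the ordering no longer affects the final result.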

Corpus purpose: For language learning, written transcripts of spoken language, such as film subtitles, are known to be better material (see SUBTLEX studies, 2007). Other corpora will also let you do good work when providing audio recordings. For lexicographic purposes, as on Wiktionary, rare words are as interesting as frequent ones, and the aim is to provide audio for every item.

Consistency: It is best to provide consistent audio data, with the same neutral or enthusiastic tone and the same speaker.

Lexicon range for learners: For language learners, and assuming the most frequent words are learnt first, a minimum vocabulary of 2,000-2,500 base words is required to bring the learner to an autonomous level. Language-teaching academics call this the “threshold level”. The CEFR (Common European Framework of Reference for Languages: Learning, Teaching, Assessment)[2], the Chinese HSK levels and their pairing with CEFR levels, and some academic research[3] lead to the following relation between lexicon size, CEFR level and competence:

Lexicon size

Lexicon(*)   Level   CEFR descriptor
600          A1      “Basic user. Breakthrough or beginner”. Survival communication, expressing basic needs.
1,200        A2      “Basic user. Waystage or elementary”.
2,500        B1      “Independent user. Threshold or intermediate”.
5,000        B2      “Independent user. Vantage or upper intermediate”.
20,000+      C2      “Mastery or proficiency”. Native speaker after graduating from high school.

(*) Assuming the most frequent word families are learnt first.
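
As a rough illustration, the table above can be read as a lookup from lexicon size to the highest level whose threshold has been reached. The sketch below assumes exactly the thresholds listed in the table; the names `THRESHOLDS` and `cefr_level` are hypothetical. The table gives no figure for C1, so sizes between 5,000 and 20,000 are reported as B2 here.

```python
from bisect import bisect_right

# Thresholds copied from the table above (word families, most frequent first).
# C1 is absent from the table, so it is absent here as well.
THRESHOLDS = [(600, "A1"), (1200, "A2"), (2500, "B1"), (5000, "B2"), (20000, "C2")]


def cefr_level(lexicon_size):
    """Return the highest CEFR level whose threshold is met, or None below A1."""
    sizes = [size for size, _ in THRESHOLDS]
    index = bisect_right(sizes, lexicon_size)
    if index == 0:
        return None  # fewer than 600 word families: below the A1 threshold
    return THRESHOLDS[index - 1][1]


for size in (300, 600, 2500, 7000, 25000):
    print(size, cefr_level(size))
```

The figures are approximate guides rather than hard cut-offs, so the result should be treated as an estimate of the learner's level, not a certification.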

See also CEFR (image), with the most relevant section cited below:

Vocabulary range

C2 Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning.

C1 Has a good command of a broad lexical repertoire allowing gaps to be readily overcome with circumlocutions; little obvious searching for expressions or avoidance strategies. Good command of idiomatic expressions and colloquialisms.

B2 Has a good range of vocabulary for matters connected to his/her field and most general topics. Can vary formulation to avoid frequent repetition, but lexical gaps can still cause hesitation and circumlocution.

B1 Has a sufficient vocabulary to express him/herself with some circumlocutions on most topics pertinent to his/her everyday life such as family, hobbies and interests, work, travel, and current events. Has sufficient vocabulary to conduct routine, everyday transactions involving familiar situations and topics.

A2 Has a sufficient vocabulary for the expression of basic communicative needs. Has a sufficient vocabulary for coping with simple survival needs.

A1 Has a basic vocabulary repertoire of isolated words and phrases related to particular concrete situations.


Vocabulary control

C2 Consistently correct and appropriate use of vocabulary.
C1 Occasional minor slips, but no significant vocabulary errors.
B2 Lexical accuracy is generally high, though some confusion and incorrect word choice does occur without hindering communication.
B1 Shows good control of elementary vocabulary but major errors still occur when expressing more complex thoughts or handling unfamiliar topics and situations.

A2 Can control a narrow repertoire dealing with concrete everyday needs.
A1 No descriptor available

Users of the Framework may wish to consider and where appropriate state:

• which lexical elements (fixed expressions and single word forms) the learner will need/be equipped/be required to recognise and/or use;
• how they are selected and ordered


  1. Paul Nation and David Crabbe (1991), "A Survival Language Learning Syllabus for Foreign Travel", Victoria University of Wellington, New Zealand. Published in System, Vol. 19, No. 3, pp. 191-201.
  2. "Common European Framework of Reference for Languages: Learning, Teaching, Assessment" (2001).
  3. Marc Brysbaert, Michaël Stevens, Paweł Mandera and Emmanuel Keuleers (2016), "How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age".
