User

Olafbot

Revision as of 09:29, 13 September 2021 by Olaf (talk | contribs) (exclusion lists)


The bot, created by Olaf, continuously updates various lists of missing audio recordings. It is much more active on Polish Wiktionary.

Lists named "Lemmas-without-audio-sorted-by-number-of-wiktionaries" are created in the following way:

  • For a given language, the bot traverses categories on all Wiktionaries and a few open dictionaries and collects statistics: for each lemma, it counts the dictionaries that describe the word in that language. The bot has been doing this for 11 years, generating various lists for Polish Wiktionary.
  • Titles written in wrong alphabets are removed.
  • Titles containing uppercase letters are removed, except for German, because a bug in Lingua Libre makes recording uppercase lemmas problematic.
  • Lemmas that already have an audio recording on Commons are also removed from the set. This covers not only files created with Lingua Libre (LiLi), but also other recordings found in the "pronunciation" category for the given language or in its subcategories.
  • For a few languages, minor corrections are made in order to extract the set of dictionary lemmas, where possible without inflected forms.
  • Items from a corresponding exclusion list (see below) are removed.
  • The resulting list is sorted in descending order by the number of dictionaries and limited to 380 entries.
  • Words that have been recorded are removed from the lists every three hours.
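
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: the function name `build_missing_audio_list`, its parameters, and the tie-breaking by alphabetical order are all hypothetical, and the alphabet check and per-language corrections are reduced to an optional predicate.

```python
from collections import Counter


def build_missing_audio_list(dictionaries, recorded, excluded,
                             keep_uppercase=False,
                             is_valid_title=lambda title: True,
                             limit=380):
    """Hypothetical sketch of the list-building steps.

    dictionaries   -- iterable of collections of lemma titles, one per dictionary
    recorded       -- set of lemmas that already have audio on Commons
    excluded       -- set of lemmas from the exclusion list
    keep_uppercase -- True for languages such as German
    is_valid_title -- placeholder for the alphabet check (assumption)
    """
    # Count, for each lemma, how many dictionaries describe it.
    counts = Counter()
    for lemmas in dictionaries:
        counts.update(set(lemmas))  # a dictionary counts at most once per lemma

    candidates = []
    for lemma, n in counts.items():
        if not is_valid_title(lemma):
            continue  # wrong alphabet
        if not keep_uppercase and any(ch.isupper() for ch in lemma):
            continue  # uppercase titles are skipped, except for German
        if lemma in recorded or lemma in excluded:
            continue  # already recorded, or listed as an exception
        candidates.append((lemma, n))

    # Sort by dictionary count, descending; ties alphabetically (assumption).
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:limit]


sample = [{"dog", "cat", "Berlin"}, {"dog", "cat"}, {"dog", "bird"}]
result = build_missing_audio_list(sample, recorded={"cat"}, excluded={"bird"})
```

With the sample data, only "dog" survives: "Berlin" contains an uppercase letter, "cat" is already recorded, and "bird" is on the exclusion list.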

Lists maintained (88): afr, ang, ara, ast, aze, bel, ben, bul, cat, ceb, ces, cmn, csb, cym, dan, deu, ekk, eng, epo, est, eus, fao, fas, fin, fra, gla, gle, glg, grc, gre, guj, hau, heb, hin, hrv, hun, hye, ido, ina, ind, isl, ita, jav, jpn, kan, kat, kaz, khm, kor, kur, lat, lit, ltz, lvs, mal, mar, mkd, mlg, mlt, mon, msa, nld, nor, oci, pan, pnb, pol, por, ron, rus, san, slk, slv, spa, sqi, swa, swe, tam, tel, tgl, tha, tur, ukr, urd, vie, wuu, yid, yue.

Sometimes a list may contain an erroneous entry. There is no point in removing it manually, because the bot will add it again in the next pass. Instead, put the erroneous word on the exclusion list, which is maintained separately for each language. Items on an exclusion list are removed automatically from the corresponding "Lemmas" list.
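
A small simulation may make the difference clearer. It assumes that each pass rebuilds the published list from the collected data, which is what the behaviour described above suggests; the `regenerate` function and the sample words are hypothetical.

```python
def regenerate(candidates, excluded):
    """One simulated bot pass: rebuild the list, skipping excluded items."""
    return [word for word in candidates if word not in excluded]


candidates = ["dog", "xyzzy", "cat"]  # "xyzzy" stands for an erroneous entry
excluded = set()

# Manual removal: the entry is gone only until the next pass.
published = regenerate(candidates, excluded)
published.remove("xyzzy")
published = regenerate(candidates, excluded)
manual_edit_survived = "xyzzy" not in published

# Exclusion list: the entry stays out on every later pass.
excluded.add("xyzzy")
published = regenerate(candidates, excluded)
exclusion_survived = "xyzzy" not in published
```

In this sketch the manual edit is undone by the next pass, while the word added to the exclusion set stays out of the list.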

The exclusion lists: afr, ang, ara, ast, aze, bel, ben, bul, cat, ceb, ces, cmn, csb, cym, dan, deu, ekk, eng, epo, est, eus, fao, fas, fin, fra, gla, gle, glg, grc, gre, guj, hau, heb, hin, hrv, hun, hye, ido, ina, ind, isl, ita, jav, jpn, kan, kat, kaz, khm, kor, kur, lat, lit, ltz, lvs, mal, mar, mkd, mlg, mlt, mon, msa, nld, nor, oci, pan, pnb, pol, por, ron, rus, san, slk, slv, spa, sqi, swa, swe, tam, tel, tgl, tha, tur, ukr, urd, vie, wuu, yid, yue