{{#Subtitle:{{Help:Download_datasets/Header}}}}
<languages/>
{| class="wikitable right" style="float:right;"
! colspan=2| Data size (2022/02)
|-
| Audio files || 800,000+
|-
| Average size || 100 kB
|-
| Total size (est.) || 80 GB
<!--
|-
| Safety factor || 5~10x
|-
! Required disk space || 400~800GB
-->
|}
== Download datasets via click ==
'''Download by language:'''
# Open https://lingualibre.org/datasets/
# Find your language. The naming convention is: <code>{qId}-{iso639-3}-{language_English_name}.zip</code>
# '''Click to download'''.
# On your device, unzip the archive.

'''Post-processing'''
<br>Refer to the relevant tutorials in [[#See also]] to mass rename, mass convert or mass denoise your downloaded audio files.
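'''Scripted unzip (sketch)'''
<br>The naming convention and the unzip step above can also be handled by a short script. The following is a minimal sketch, not an official Lingua Libre tool: the function names are illustrative, the pattern assumes that the qId has the form <code>Q</code> followed by digits and that the ISO 639-3 code is three lowercase letters, and <code>Q21-fra-French.zip</code> is a hypothetical archive name used only as an example.

<syntaxhighlight lang="python">
import re
import zipfile

# Assumed shape of the naming convention quoted above:
# {qId}-{iso639-3}-{language_English_name}.zip
DATASET_NAME = re.compile(r"^(?P<qid>Q\d+)-(?P<iso>[a-z]{3})-(?P<name>.+)\.zip$")


def parse_dataset_name(filename):
    """Split an archive name into its qId, ISO 639-3 code and English name."""
    match = DATASET_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unexpected dataset name: {filename}")
    return match.groupdict()


def extract_dataset(zip_path, out_dir):
    """Unzip one dataset archive and return the number of files it contains."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sum(1 for info in archive.infolist() if not info.is_dir())


if __name__ == "__main__":
    # Hypothetical archive name, for illustration only.
    print(parse_dataset_name("Q21-fra-French.zip"))
</syntaxhighlight>

The file count returned by <code>extract_dataset</code> can be compared with the size of the matching Commons category as a quick completeness check.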
== Programmatic tools ==
The tools below first fetch the list of audio files from one or several Wikimedia Commons categories. Some of them can filter that list further to focus on a single speaker, either by editing their code or by post-processing the resulting .csv list of audio files. The listed targets are then downloaded at a rate of 500 to 15,000 files per hour. Items already present locally and matching the latest Commons version are generally not re-downloaded.

=== Find your target ===
Categories on Wikimedia Commons are organized as follows:
* [[:Commons:Category:Lingua Libre pronunciation by user]]
* [[:Commons:Category:Lingua Libre pronunciation]] (by language)

=== Python (current) ===
Dependencies: Python 3.6+

'''Petscan''' and '''Wikiget''' allow downloading about 15,000 audio files per hour.
# '''Select your category:''' see [[:commons:Category:Lingua_Libre_pronunciation|Category:Lingua Libre pronunciation]] and [[:commons:Category:Lingua Libre pronunciation by user|Category:Lingua Libre pronunciation by user]], then find your target category.
# '''List target files with [https://petscan.wmflabs.org Petscan]:''' given a target category on Commons, Petscan provides the list of target files. [https://petscan.wmflabs.org/?&cb_labels_yes_l=1&cb_labels_no_l=1&edits%5Banons%5D=both&interface_language=en&edits%5Bflagged%5D=both&categories=Lingua%20Libre%20pronunciation-cmn&cb_labels_any_l=1&ns%5B0%5D=1&project=wikimedia&since_rev0=&search_max_results=500&edits%5Bbots%5D=both&ns%5B6%5D=1&language=commons&search_query= Example].
# '''Download target files with [https://pypi.org/project/wikiget/ Wikiget]:''' Wikiget downloads the target files.

Comments:
* Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
* Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. Occurrences have been reported at around 1 in 30,000 files. A fix is underway; support the request [https://github.com/clpo13/wikiget/issues/2 on GitHub].
* Wikiget therefore requires a volunteer to supervise the script while it runs.
* As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download process, it is therefore recommended to run the Python script in 20 to 30 terminal windows simultaneously. Each terminal running Wikiget consumes an average of 20 Kb/s.
* Wikiget requires a stable internet connection. Any disruption of 1 second stops the download process, and the Python script then has to be restarted manually.
* [[m:Special:MyLanguage/PetScan|Manual for PetScan]]
* Questions about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj

=== NodeJS (soon) ===
Dependencies: git, nodejs, npm.

A '''WikiapiJS''' script can download a target category's files, or a root category, its subcategories and the files they contain. It downloads about 1,400 audio files per hour.
# WikiapiJS is the NodeJS / NPM package allowing scripted API calls to Wikimedia Commons and LinguaLibre.
# Specific scripts used for a given task:
#* Given a category, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
#* Given a root category, list subcategories and download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js

Comments, as of December 2021:
* Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
* Successfully processes a single category's files.
* Successfully processes a root category and its subcategories' files, generating ./isocode/ folders.
* Scalability tests for resilience under high numbers of requests (from more than 500 up to 100,000 items) are still required.
* Performance improvements are under consideration [https://github.com/kanasimi/wikiapi/issues/51#issuecomment-1002267855 on GitHub].
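
'''Estimating download time (sketch)'''
<br>The runs reported in the comments above (730,000 files in 20 hours with Petscan and Wikiget, 400 files in 16 minutes with WikiapiJS) can be turned into rough time estimates for a planned download. The sketch below is only arithmetic on those two reported runs; the dictionary keys and function names are illustrative, and actual speed will depend on your connection and on how many terminals run in parallel.

<syntaxhighlight lang="python">
# Runs reported on this page, as (files downloaded, duration in seconds).
REPORTED_RUNS = {
    "petscan+wikiget": (730_000, 20 * 3600),  # 730,000 files in 20 hours
    "wikiapijs": (400, 16 * 60),              # 400 files in 16 minutes
}


def files_per_second(tool):
    """Average download rate of a reported run."""
    files, seconds = REPORTED_RUNS[tool]
    return files / seconds


def estimated_hours(file_count, tool):
    """Rough number of hours needed to fetch file_count files at that rate."""
    return file_count / files_per_second(tool) / 3600


if __name__ == "__main__":
    target = 800_000  # approximate size of the whole library (see the data size table)
    for tool in REPORTED_RUNS:
        rate = files_per_second(tool)
        hours = estimated_hours(target, tool)
        print(f"{tool}: {rate:.2f} files/s, about {hours:.0f} h for {target:,} files")
</syntaxhighlight>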
=== Python (slow) ===
Dependencies: python.

'''CommonsDownloadTool.py''' is a Python script which formerly created datasets for LinguaLibre. It can be hacked and tinkered with to suit your needs.

To download all datasets as zips:
* Download the scripts:
** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh] - creates CommonsDownloadTool's commands.
** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py] - core script.
* Read through them, then move them wherever they fit best on your computer so that they require minimal editing.
* Edit them as needed so that the paths are correct and the scripts run.
* Run <code>create_datasets.sh</code>.
* Check whether the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]].

Comments:
* Last run in February 2021; stopped due to slow speed.
* This script is slow and has been phased out as Lingua Libre has grown too much.
* The page may gain from some HTML and styling.
* Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].

=== Python with UI (Sulochanaviji) ===
:''Description to complete, see its [https://github.com/sulochanaviji/Wiki-bulk-downloader github repository].''

[[:meta:User:Sulochanaviji|User:Sulochanaviji]] coded a Django/Python tool with an HTML/CSS user interface. See its [https://github.com/sulochanaviji/Wiki-bulk-downloader github repository].

=== Python Script to Download a User's Pronunciations ===
This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its [https://github.com/rkosov/Lingua-Libre-User-Audio-Downloader github repository].
[[User:Languageseeker|Languageseeker]] ([[User talk:Languageseeker|talk]]) 01:57, 24 May 2022 (UTC)

=== Anki Extension for Lingua Libre ===
The [https://ankiweb.net/shared/info/124265771 Lingua Libre and Forvo Addon] has a number of advanced options to improve search results and can run either as a batch operation or on an individual note. By default, it first checks Lingua Libre and, if there are no results on Lingua Libre, it then checks Forvo. To run it as a pure Lingua Libre extension, set <code>"disable_Forvo"</code> to <code>True</code> in your configuration section. Please report bugs, issues and ideas on [https://github.com/rkosov/Lingua-Libre-and-Forvo-Audio-Downloader GitHub].

=== Java (not tested) ===
Dependencies:
<syntaxhighlight lang="bash">
sudo apt-get install default-jre    # install Java environment
</syntaxhighlight>

Usage:
* Open the [https://github.com/MarcoFalke/wiki-java-tools/releases GitHub Wiki-java-tools project page].
* Find the latest <code>Imker</code> release.
* Download the Imker_vxx.xx.xx'''.zip''' archive.
* Extract the .zip file.
* Run as follows:
** On Windows: start the .exe file.
** On Ubuntu, open a shell, then:
<syntaxhighlight lang="bash">
java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'    # Downloads all media within the Wikimedia Commons category "CategoryName"
</syntaxhighlight>

Comments:
* Not used yet by any LinguaLibre member. If you use it, please share your experience of this tool.

==== Manual ====
<syntaxhighlight lang="bash">
Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
</syntaxhighlight>

== See also ==
* [[Special:MyLanguage/Help:Renaming|Help:Renaming]]
* [[Special:MyLanguage/Help:Converting audios|Help:Converting audios]]
* [[:phab:T261519|Help:Embed audio in HTML]]
* [[:phab:T261519]]

{{Helps}}
{{Technicals}}
[[Category:Lingua Libre:Help]]