Difference between revisions of "Download datasets"

Latest revision as of 07:50, 24 December 2023


Data size — 2022/02
Audio files	1,000,000+
Average size	100kB
Total size (est.)	100GB
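
The estimated total follows from the file count and the average file size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 kB); the function name is illustrative only:

```python
def estimated_total_gb(file_count: int, average_kb: float) -> float:
    """Estimate the total download size in GB from a file count and an average size in kB."""
    kb_per_gb = 1_000_000  # assumption: decimal units, 1 GB = 1,000,000 kB
    return file_count * average_kb / kb_per_gb


if __name__ == "__main__":
    # Figures from the table above: 1,000,000 files of about 100 kB each.
    print(f"{estimated_total_gb(1_000_000, 100):.0f} GB")
```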

Download datasets via click

Download by language:

In the lingualibre.org top bar, click "Datasets".
Search for your language by its native or English name, then click "Download".
Unzip the archive on your device.

Post-processing
Refer to the relevant tutorials in #See also to mass rename, mass convert or mass denoise your downloaded audio files.
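
As an illustration of mass renaming, the sketch below derives a shorter `{language}-{word}` name from each downloaded file name. It assumes file names follow the pattern `LL-Q<id> (<iso>)-<speaker>-<word>.<ext>`; that pattern and the function names are assumptions made for illustration, and the tutorials remain the reference.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed pattern, e.g. "LL-Q150 (fra)-SpeakerName-bonjour.wav".
# The speaker part is matched up to the first hyphen, so a speaker name that
# itself contains a hyphen would be split in the wrong place.
LL_NAME = re.compile(
    r"^LL-Q\d+ \((?P<iso>[a-z-]+)\)-(?P<speaker>[^-]+)-(?P<word>.+)\.(?P<ext>\w+)$"
)


def short_name(filename: str) -> Optional[str]:
    """Return "<iso>-<word>.<ext>", or None if the name does not match the assumed pattern."""
    match = LL_NAME.match(filename)
    if match is None:
        return None
    return f"{match['iso']}-{match['word']}.{match['ext']}"


def rename_folder(folder: Path) -> int:
    """Rename matching files in place; skip a file if its short name is already taken."""
    renamed = 0
    for path in sorted(folder.iterdir()):
        new_name = short_name(path.name)
        if new_name and not (folder / new_name).exists():
            path.rename(folder / new_name)
            renamed += 1
    return renamed
```

When several speakers recorded the same word, only the first file gets the short name and the others keep their original names, so nothing is overwritten.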

Programmatic tools

The tools below first fetch the list of audio files in one or several Wikimedia Commons categories. Some of them can narrow that list down to a single speaker, either by editing their code or by post-processing the resulting .csv list of audio files. The listed files are then downloaded at a rate of 500 to 15,000 files per hour, depending on the tool. Items already present locally that match the latest Commons version are generally not re-downloaded.
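
The listing step can be sketched with the standard MediaWiki Action API (`list=categorymembers`). This is not necessarily how the tools below do it internally; the function names and the User-Agent string are illustrative assumptions.

```python
import json
import sys
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params: Dict[str, str]) -> dict:
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, headers={"User-Agent": "dataset-listing-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(
    category: str, fetch: Callable[[Dict[str, str]], dict] = fetch_json
) -> Iterator[str]:
    """Yield the "File:..." titles in a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}


if __name__ == "__main__" and len(sys.argv) > 1:
    for title in list_category_files(sys.argv[1]):
        print(title)
```

Passing `fetch` as a parameter keeps the paging logic testable without network access.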

Find your target

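Categories on Wikimedia Commons are organized as follows:

Commons:Category:Lingua Libre pronunciation by user
Commons:Category:Lingua Libre pronunciation (by language)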

Python (current)

Dependencies: Python 3.6+

Used together, Petscan and Wikiget can download about 15,000 audio files per hour.

Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
List the target files with Petscan: given a target category on Commons, Petscan provides the list of target files. Example.
Download the target files with Wikiget: it downloads the listed files.
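
The steps above can be sketched as a small script that reads a list exported from Petscan and calls Wikiget once per file. The CSV column name `title` is an assumption about the export format; the script relies only on Wikiget's basic `wikiget File:Name.ext` usage.

```python
import csv
import io
import subprocess
import sys
from typing import List


def titles_from_csv(text: str, column: str = "title") -> List[str]:
    """Extract file titles from CSV text, normalised to the "File:Name.ext" form."""
    titles = []
    for row in csv.DictReader(io.StringIO(text)):
        # Underscores and spaces are equivalent in MediaWiki titles.
        title = (row.get(column) or "").strip().replace("_", " ")
        if not title:
            continue
        if not title.startswith("File:"):
            title = "File:" + title
        titles.append(title)
    return titles


def wikiget_command(title: str) -> List[str]:
    """Build the argument list that downloads one file with the wikiget CLI."""
    return ["wikiget", title]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python download_list.py petscan_export.csv
    with open(sys.argv[1], encoding="utf-8", newline="") as handle:
        for title in titles_from_csv(handle.read()):
            subprocess.run(wikiget_command(title), check=False)
```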

Comments:

Successful in November 2021: 730,000 audio files downloaded in 20 hours, a sustained average of about 10 downloads/sec.
Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. This has been reported to occur for around 1 in 30,000 files. A fix is underway; support the request on GitHub.
Wikiget therefore requires a volunteer to supervise the script while it runs.
As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download, it is therefore recommended to run the Python script in 20 to 30 terminal windows simultaneously. Each terminal running Wikiget consumes an average of 20 Kb/s.
Wikiget requires a stable internet connection. An interruption of even 1 second stops the download, and the Python script must then be restarted manually.
Manual for PetScan
Questions about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj
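
The comments above suggest splitting the work across several terminals and restarting after failures. A minimal sketch of both ideas, with hypothetical helper names; `download_one` is any function that downloads a single file and returns True on success.

```python
import time
from typing import Callable, List, Sequence


def split_round_robin(items: Sequence[str], workers: int) -> List[List[str]]:
    """Split a file list into `workers` sub-lists, one per terminal or process."""
    return [list(items[i::workers]) for i in range(workers)]


def download_all(
    items: Sequence[str],
    download_one: Callable[[str], bool],
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
) -> List[str]:
    """Try each item up to `max_attempts` times; return the items that never succeeded."""
    failed = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            if download_one(item):
                break
            if attempt < max_attempts:
                time.sleep(wait_seconds)  # wait before retrying, e.g. after a network drop
        else:
            # The loop ended without a successful attempt.
            failed.append(item)
    return failed
```

One possible `download_one` wraps `subprocess.run(["wikiget", title]).returncode == 0`, assuming Wikiget exits with a non-zero code on failure. The returned list of failures can be reviewed afterwards, so one deleted file does not halt the whole run.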

NodeJS

Dependencies: git, nodejs, npm.

A WikiapiJS script can download the files of a target category, or of a root category together with its subcategories and the files they contain. It downloads about 1,400 audio files per hour.

WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
Manual for .download()

Comments, as of December 2021:

Successful in December 2021: 400 audio files downloaded in 16 minutes, a sustained average of about 0.4 downloads/sec.
Successfully processes a single category's files.
Successfully processes a root category's and its subcategories' files, generating ./isocode/ folders.
Scalability tests are still required to check resilience with large numbers of requests, from more than 500 up to 100,000 items.
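
The ./isocode/ folder layout can be illustrated in Python (this is not the WikiapiJS script itself). The sketch assumes language categories are named like `Lingua Libre pronunciation-cmn`, as in the Petscan example; the function name is hypothetical.

```python
from pathlib import Path
from typing import Optional

PREFIX = "Lingua Libre pronunciation-"  # assumed naming of the per-language categories


def iso_folder(category: str, root: Path = Path(".")) -> Optional[Path]:
    """Map a language category name to its ./isocode/ folder, or None if it does not match."""
    name = category.replace("_", " ")
    if name.startswith("Category:"):
        name = name[len("Category:"):]
    if not name.startswith(PREFIX):
        return None
    code = name[len(PREFIX):].strip()
    return root / code if code else None
```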

Python (slow)

Dependencies: python.

CommonsDownloadTool.py is a Python script that formerly created the datasets for Lingua Libre. It can be adapted to your needs. To download all datasets as zip archives:

Download scripts :
- create_datasets.sh - creates CommonsDownloadTool's commands.
- CommonsDownloadTool/commons_download_tool.py - core script.
Read through them, then move them to wherever on your computer they will need the least editing.
Edit the paths as needed until the scripts run.
Run create_datasets.sh
Check if the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation
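
The final check can be sketched as follows: count the files in the downloaded zips and compare the total with the number shown on the Commons category page. The function names are illustrative, and the expected total is entered by hand.

```python
import zipfile
from pathlib import Path
from typing import Dict


def count_files_in_zip(zip_path: Path) -> int:
    """Count the entries in one archive, ignoring directory entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def count_files_in_zips(folder: Path) -> Dict[str, int]:
    """Return {archive name: number of files} for every .zip in a folder."""
    return {path.name: count_files_in_zip(path) for path in sorted(folder.glob("*.zip"))}


def matches_expected(counts: Dict[str, int], expected_total: int) -> bool:
    """Compare the summed file count with the total read from the Commons category."""
    return sum(counts.values()) == expected_total
```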

Comments:

Last run in February 2021; stopped because it was too slow.
This script is slow and has been phased out because Lingua Libre has grown too large for it.
This page could benefit from some HTML and styling.
Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the LinguaLibre:Chat room.

Python with UI (Sulochanaviji)

Description to be completed; see its GitHub repository.

User:Sulochanaviji coded a Django/Python tool with an HTML/CSS user interface. See its GitHub repository.

Python Script to Download a User's Pronunciations

This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its github repository. Languageseeker (talk) 01:57, 24 May 2022 (UTC)

Anki Extension for Lingua Libre

The Lingua Libre and Forvo Addon has a number of advanced options to improve search results and can run either as a batch operation or on an individual note.

By default, it first checks Lingua Libre and, if there are no results on Lingua Libre, it then checks Forvo. To run as a pure Lingua Libre extension, you will need to set "disable_Forvo" to True in your configuration section.

Please report bugs, issues and ideas on GitHub.

Java (not tested)

Dependencies:

sudo apt-get install default-jre    # install Java environment

Usage:

Open GitHub Wiki-java-tools project page.
Find the latest Imker release.
Download the Imker_vxx.xx.xx.zip archive.
Extract the .zip file.
Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then:

java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media in the Wikimedia Commons category "CategoryName"

Comments :

Not yet used by any Lingua Libre member. If you try it, please share your experience with this tool.

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
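
To use the local-file source from a script, a file list can be written with one filename per line and passed to Imker with the -f and -o options listed above. A minimal sketch with hypothetical function names; whether Imker expects names with or without a "File:" prefix is not stated in the manual, and the tool is untested here.

```python
from pathlib import Path
from typing import Iterable, List


def write_file_list(names: Iterable[str], list_path: Path) -> int:
    """Write one filename per line, skipping blank entries; return the number written."""
    lines = [name.strip() for name in names if name.strip()]
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def imker_command(list_path: Path, out_folder: Path, jar: str = "imker-cli.jar") -> List[str]:
    """Build the java command line that uses a local file as the download source."""
    return ["java", "-jar", jar, "-o", str(out_folder), "-f", str(list_path)]
```

The returned list can be passed to `subprocess.run` once the file list has been checked.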


@@ Line 1: / Line 1: @@
-{{#SUBTITLE:This page deals with downloading Lingualibre.org's medias, both by hand and programatically, as packaged zip archives with rich filenames. We then have tutorials on how to clean up the resulting folders and how to rename these files into more practical ''{language}−{word}.ogg''. Be aware of Lingualibre's data's size could be in 100s GB if you download it all.}}
+{{#Subtitle:{{Help:Download_datasets/Header}}}}
+<languages/>
 {| class="wikitable right" style="float:right;"
-! colspan=2| Data size — 2021/02
+! colspan=2| <translate><!--T:1--> Data size — 2022/02</translate>
 |-
-| Audios files || 800,000+
+| <translate><!--T:2--> Audios files</translate> || 1,000,000+
 |-
-| Average size || 100kB
+| <translate><!--T:3--> Average size</translate> || 100kB
 |-
-| Total size (est.) || 80GB <！--
+| <translate><!--T:4--> Total size (est.)</translate> || 100GB <!--
 |-
-| Safety factor || 5~10x
+| <translate><!--T:5--> Safety factor</translate> || 5~10x
 |-
-! Required disk space || 400~800GB -->
+! <translate><!--T:6--> Required disk space</translate> || 500~1,000GB -->
 |}
-== Context ==
-=== Data clean up ===
+<translate>
-:See also [[Help:Convert_audios%3F|Convert files formats]] • [[Help:SoX|Denoise files]] • [[Help:Renaming|Rename and mass rename]]
+== Download datasets via click == <!--T:7-->
-By default, we provide both per-language and all-lingualibre zip archives, which therefor double the data size of your download it all.
-=== Find your target category ===
+<!--T:8-->
+'''Download by language:'''</translate>
+<br>
+<translate>
+<div style="border: #FFA500 1px solid; background: #FFDD0055;">
+<!--T:9-->
+# On lingualibre.org top bar, click "[https://lingualibre.org/LanguagesGallery/ Datasets]"
+# Search your language by Native or English name > Click : « Download »
+# On your device, unzip.
+</div>
+<!--T:10-->
+'''Post-processing'''</translate>
+<br><translate><!--T:11-->
+Refer to the relevant tutorials in [[#See also]] to mass rename, mass convert or mass denoise your downloaded audios.
+== Programmatic tools == <!--T:12-->
+<!--T:13-->
+The tools below first fetch from one or several Wikimedia Commons categories the list of audio files within them.
+Some of them allow to filter that list further to focus a single speaker, either by editing their code or by post-processing of the resulting .csv list of audio files. The listed targets are then downloaded at a speed of 500 to 15,000 per hours. Items already present locally and matching the latest Commons version are generally not re-downloaded.
+=== Find your target === <!--T:14-->
+<!--T:15-->
+Categories on Wikimedia Commons are organized as follow:
 * [[:Commons:Category:Lingua Libre pronunciation by user]]
-* [[:Commons:Category:Lingua Libre pronunciation]] by language
+* [[:Commons:Category:Lingua Libre pronunciation]] (by language)
-== Hand downloading ==
+=== Python (current)=== <!--T:16-->
-# Open https://lingualibre.org/datasets/
-# Download your target language's zip
+<!--T:17-->
-# On your device, unzip.
+Dependencies: Python 3.6+
-Go to the relevant tutorials to clean up or rename your data.
+<!--T:18-->
+'''Petscan''' and '''Wikiget''' allows to download about 15,000 audio files per hour.
+# '''Select your category :''' see [[:commons:Category:Lingua_Libre_pronunciation|Category:Lingua Libre pronunciation]] and [[:commons:Category:Lingua Libre pronunciation by user|Category:Lingua Libre pronunciation by user]], then find your target category,
+# '''List target files with [https://petscan.wmflabs.org Petscan] :''' Given a target category on Commons, provides list of target files. [https://petscan.wmflabs.org/?&cb_labels_yes_l=1&cb_labels_no_l=1&edits%5Banons%5D=both&interface_language=en&edits%5Bflagged%5D=both&categories=Lingua%20Libre%20pronunciation-cmn&cb_labels_any_l=1&ns%5B0%5D=1&project=wikimedia&since_rev0=&search_max_results=500&edits%5Bbots%5D=both&ns%5B6%5D=1&language=commons&search_query= Example].
+# '''Download target files with [https://pypi.org/project/wikiget/ Wikiget] :''' downloads targets files.
+<!--T:19-->
+Comments:
+* Successful on November 2021, with 730,000 audio downloaded in 20 hours. Sustained average speed : 10 downloads/sec.
+* Some delete files on Commons may cause Wikiget to return an error and pause. The script has to be resumed manually. Occurrence have been reported to be around 1/30,000 files. Fix is underway, support the request [https://github.com/clpo13/wikiget/issues/2 on github].
+* WikiGet therefore requires a volunteer to supervise the script while running.
+* As of December 2021, WikiGet does not support multi-thread downloads. Therefore, to increase the efficiency of the download process it is recommended to run the Python Script on 20-30 terminal windows simultaneously. Each terminal running WikiGet would consume an average of 20 Kb/s.
+* WikiGet requires an stable internet connection. Any disruption of 1 second would stop the download process and it requires manual restart of the Python Script.
+* [[m:Special:MyLanguage/PetScan|Manual for PetScan]]
+* Any question about downloading datasets can be made on the Discord Server of Lingua Libre : https://discord.gg/2WECKUHj
+=== NodeJS === <!--T:20-->
+<!--T:21-->
+Dependencies: git, nodejs, npm.
+<!--T:22-->
+A '''WikiapiJS''' script allows to download target category's files, or a root category, its subcategories and contained files. Downloads about 1,400 audio files per hour.
+# WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
+# [https://kanasimi.github.io/wikiapi/Wikiapi.html#download Manual for .download()]
+<!--
+# Specific script used to do a given task:
+#* Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
+#* Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js -->
+<!--T:23-->
+Comments, as of December 2021:
+* Successful on December 2021, with 400 audios downloaded in 16 minutes. Sustained average speed : 0.4 downloads/sec.
+* Successfully process single category's files.
+* Successfully process root category and subcategories' files, generating ./isocode/ folders.
+* Scalability tests for resilience with high amounts requests >500 to 100,000 items is required.
+=== Python (slow) === <!--T:24-->
+<!--T:25-->
+Dependencies: python.
+<!--T:26-->
+'''CommonsDownloadTool.py''' is a python script which formerly created datasets for LinguaLibre. It can be hacked and tinkered to your needs. To download all datasets as zips :
+* Download scripts :
+** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh] - creates CommonsDownloadTool's commands.
+** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py] - core script.
+* Read them a bit, move them where they fit the best on you computer so they require the minimum of editing
+* Edit as needed so the paths are correct, make it work.
+* Run <code>create_datasets.sh</code>
+* Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
+<!--T:27-->
+Comments:
+* Last ran on February 2021, stopped due to slow speed.
+* This script is slow and has been phased out as Lingualibre grown too much.
+* The page may gain from some html and styling.
+* Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].
+=== Python with UI (Sulochanaviji) === <!--T:35-->
+:''Description to complete, see its [https://github.com/sulochanaviji/Wiki-bulk-downloader github repository].''
+[[:meta:User:Sulochanaviji|User:Sulochanaviji]] coded a Django/Python tool with a HTML/CSS user interface. See its [https://github.com/sulochanaviji/Wiki-bulk-downloader github repository].
+=== Python Script to Download a User's Pronunciations === <!--T:36-->
+This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its [https://github.com/rkosov/Lingua-Libre-User-Audio-Downloader github repository]. [[User:Languageseeker|Languageseeker]] ([[User talk:Languageseeker|talk]]) 01:57, 24 May 2022 (UTC)
+=== Anki Extension for Lingua Libre === <!--T:37-->
+The [https://ankiweb.net/shared/info/124265771 Lingua Libre and Forvo Addon]. It has a number of advanced options to improve search results and can run either as a batch operation or on an individual note.
+<!--T:38-->
+By default, it first checks Lingua Libre and, if there are no results on Lingua Libre, it then checks Forvo.  To run as a pure Lingua Libre extension, you will need to set ''"disable_Forvo" to <code>True</code> in your configuration section.
+<!--T:39-->
+Please reports bugs, issues, ideas on [https://github.com/rkosov/Lingua-Libre-and-Forvo-Audio-Downloader github].
-== Using Imker ==
+=== Java (not tested) === <!--T:28-->
-=== Requirements ===
-On Ubuntu, run:
-<pre>sudo apt-get install default-jre    # install Java environment</pre>
-Be aware of your target data size (see section above).
+<!--T:29-->
+Dependencies:
+<syntaxhighlight lang="bash">
+sudo apt-get install default-jre    # install Java environment
+</syntaxhighlight>
-=== Install ===
+<!--T:30-->
+Usage:
 * Open [https://github.com/MarcoFalke/wiki-java-tools/releases GitHub Wiki-java-tools project page].
 * Find the last <code>Imker</code> release.
@@ Line 44: / Line 148: @@
 ** On Windows : start the .exe file.
 ** On Ubuntu, open shell then :
-<pre>$java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all medias within Wikimedia Commons's category "CategoryName"</pre>
+<syntaxhighlight lang="bash">
+$java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all medias within Wikimedia Commons's category "CategoryName"
+</syntaxhighlight>
-=== Manual ===
+<!--T:31-->
-<pre>Imker -- Wikimedia Commons batch downloading tool.
+Comments :
+* Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.
+==== Manual ==== <!--T:32-->
+</translate>
+<syntaxhighlight lang="bash">
+Imker -- Wikimedia Commons batch downloading tool.
 Usage: java -jar imker-cli.jar [options]
@@ Line 66: / Line 178: @@
   ↳ A Wiki category (Example: --category=&quot;Denver, Colorado&quot;)
   ↳ A Wiki page (Example: --page=&quot;Sandboarding&quot;)
-  ↳ A local file (Example: --file=&quot;Documents/files.txt&quot;; One filename per line!)</pre>
+  ↳ A local file (Example: --file=&quot;Documents/files.txt&quot;; One filename per line!)
+</syntaxhighlight>
-== Using CommonsDownloadTool ==
-To download all datasets as zips :
-* Download on your large device the scripts :
-** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh]
-** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py]
-* Read them a bit, move them where they fit the best on you computer so they require the minimum of editing
-* Edit as needed so the paths are correct, make it work.
-* Run <code>create_datasets.sh</code> successfully
-* Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
-== Javascript and/or API queries ==
+<translate>
-There are also ways to use a category name as input, then to do API queries in order to get the list of files, download them.
+== See also == <!--T:33-->
-For a start point on API queries, see [https://codepen.io/hugolpz/pen/ByoKOK this pen], which gives some example of API queries.
-== Use html audios elements in webpages ==
+<!--T:34-->
-See [https://codepen.io/hugolpz/pen/QWGyVwM Audio 101].
+* [[<tvar|1>Special:MyLanguage/Help:Renaming</>|Help:Renaming]]
+* [[<tvar|2>Special:MyLanguage/Help:Converting audios</>|Help:Converting audios]]
+* [[<tvar|3>Special:MyLanguage/Help:Embed audio in HTML</>|Help:Embed audio in HTML]]
+* [[<tvar|3>:phab:T261519</>]]
+</translate>
+<translate>
+== See also == <!--T:40-->
+</translate>
+{{Helps}}
+{{Technicals}}
-== See also ==
-{{Lingua_Libre_scripts}}
 [[Category:Lingua Libre:Help]]