Help
Difference between revisions of "Download datasets"
Revision as of 10:47, 5 February 2021
This page explains how to download Lingualibre.org's media, both by hand and programmatically, as packaged zip archives with rich filenames. Once the download is done, separate tutorials explain how to clean up the resulting folders and how to rename the media files into web-friendly names such as {language}-word.ogg. Be aware that Lingualibre's data could amount to hundreds of GB if you download it all.
Context
Data size
As of early 2021, Lingua Libre has:
- Number of audio files: 300,000+.
- Average audio file size: 100 kB.
- Total data size: 30 GB (estimate).
- Safety margin factor: 5~10x.
Required disk space: 150~300 GB.
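The arithmetic behind these figures can be reproduced with a short shell sketch. It only restates the numbers listed above (decimal units, integer arithmetic); the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Rough disk-space estimate from the early-2021 figures listed above.
files=300000      # number of audio files
avg_kb=100        # average audio file size, in kB

# kB -> MB -> GB, using decimal units (1 GB = 1000 MB = 1,000,000 kB)
total_gb=$(( files * avg_kb / 1000 / 1000 ))

low_gb=$(( total_gb * 5 ))     # lower safety margin factor (5x)
high_gb=$(( total_gb * 10 ))   # upper safety margin factor (10x)

echo "Estimated data size: ${total_gb} GB"
echo "Recommended free space: ${low_gb}~${high_gb} GB"
```

Compare the result with the free space reported by df -h on the target disk before starting a full download.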
Data clean up
By default, we provide both per-language zip archives and an all-Lingualibre zip archive, which therefore doubles the download size if you download everything.
Find your target category
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation by language
Hand downloading
- Open https://lingualibre.org/datasets/
- Download your target language's zip
- Unzip it on your device.
Go to the relevant tutorials to clean up or rename your data.
Using Imker
Requirements
Imker requires Java. On Ubuntu, install it with:
sudo apt-get install default-jre # install Java environment
Be aware of your target data size (see section above).
Install
- Open the GitHub Wiki-java-tools project page.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
  - On Windows: start the .exe file.
  - On Ubuntu, open a shell, then run:
java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName' # Downloads all media in the Wikimedia Commons category "CategoryName"
Manual
Imker -- Wikimedia Commons batch downloading tool.
Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

  The download source must be ONE of the following:
   ↳ A Wiki category (Example: --category="Denver, Colorado")
   ↳ A Wiki page (Example: --page="Sandboarding")
   ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
Using CommonsDownloadTool
To download all datasets as zips:
- Download the scripts onto a device with enough disk space:
- Skim them, then move them to wherever on your computer they need the least editing.
- Edit the paths as needed until the scripts work.
- Run create_datasets.sh until it completes successfully.
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
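The final check can be partly automated. The sketch below is not part of CommonsDownloadTool: the function names are hypothetical, and it assumes the unzip utility is installed and that you read the expected file count from the Commons category page yourself.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of CommonsDownloadTool.

# Print the number of files (folders excluded) stored in one zip archive.
count_zip_entries() {
  # `unzip -Z1` lists one archive member per line; folder entries end with "/"
  unzip -Z1 "$1" | grep -vc '/$' || true
}

# Sum the file counts of every zip found in the given folder.
count_all_zips() {
  local total=0 archive
  for archive in "$1"/*.zip; do
    [ -e "$archive" ] || continue      # folder contains no zip at all
    total=$(( total + $(count_zip_entries "$archive") ))
  done
  echo "$total"
}

# Compare the local count with the count read from the Commons category.
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "match"
  else
    echo "mismatch: local=$1 commons=$2"
  fi
}
```

For example, compare_counts "$(count_all_zips ./datasets)" 300000 prints "match" only when both numbers agree; ./datasets and 300000 are placeholders. If the folder holds both the per-language zips and the all-Lingualibre zip, count only one of the two sets, otherwise every file is counted twice.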
JavaScript and/or API queries
You can also take a category name as input, run API queries to get the list of files in that category, then download them. For a starting point, see this pen (https://codepen.io/hugolpz/pen/ByoKOK), which gives some examples of API queries.
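As a minimal sketch of the first step (category name in, list of files out), the shell functions below call the MediaWiki Action API on Wikimedia Commons with list=categorymembers. The function names and the User-Agent string are illustrative assumptions, and curl and jq are assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the User-Agent string are illustrative.

# Build the API URL listing up to 500 files of a Commons category.
# Spaces become underscores; other special characters are NOT URL-encoded here.
category_files_url() {
  local category
  category=$(printf '%s' "$1" | tr ' ' '_')
  printf '%s' "https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:${category}&cmtype=file&cmlimit=500&format=json"
}

# Print the file titles from the first page of results (needs curl and jq).
list_category_files() {
  curl -s -A "lingualibre-help-example/0.1" "$(category_files_url "$1")" \
    | jq -r '.query.categorymembers[].title'
}
```

Running list_category_files 'CategoryName' prints one File: title per line. This only covers the first 500 results: for larger categories, the response carries a cmcontinue value that has to be passed back in a follow-up request, which this sketch does not do.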