Help
Difference between revisions of "Download datasets"
Revision as of 10:21, 5 February 2021
Requirements
Java Runtime Environment.
Ubuntu: sudo apt-get install default-jre
Install
- Open the Wiki-java-tools project page on GitHub.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then run:
$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'
Find your target category
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation by language
Manual
Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
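The options in the manual can also be assembled from a script. The following is a minimal sketch, not part of Imker: the helper name `imker_command` is hypothetical, and it assumes `imker-cli.jar` sits in the current directory. It uses the short flags in the same space-separated form as the Ubuntu command above and enforces the rule that exactly one download source is given.

```python
def imker_command(outfolder, category=None, page=None, file=None, domain=None):
    """Build the argument list for imker-cli.jar (hypothetical helper).

    Exactly one of category, page or file must be given, as the manual requires.
    """
    sources = {"-c": category, "-p": page, "-f": file}
    chosen = [(flag, value) for flag, value in sources.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("exactly one of category, page or file is required")
    flag, value = chosen[0]
    # Assumption: imker-cli.jar is in the current working directory.
    command = ["java", "-jar", "imker-cli.jar", "-o", outfolder, flag, value]
    if domain is not None:
        command += ["-d", domain]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`; because it is a list rather than a shell string, category names containing spaces or commas need no extra quoting.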
Note
It is also possible to take a category name as input, query the API to get the list of files, and then download them. For a starting point on API queries, see this pen (https://codepen.io/hugolpz/pen/ByoKOK).
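As an illustration of the API approach, here is a minimal Python sketch that lists the files in a Commons category with the MediaWiki `list=categorymembers` query and follows the API's continuation tokens. The function names and the User-Agent string are placeholders chosen for this example. It only returns files placed directly in the category, so the "by user" and "by language" categories, which contain subcategories, would need one call per subcategory.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and return the parsed JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value below is a placeholder; replace it with your own.
    request = urllib.request.Request(
        url, headers={"User-Agent": "dataset-download-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Return the titles of the files directly in a category."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",
        "cmlimit": "500",
    }
    titles = []
    while True:
        data = fetch(params)
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}


def file_url(title):
    """Build a download URL via Special:FilePath (expects a 'File:...' title)."""
    name = title.split(":", 1)[1].replace(" ", "_")
    return (
        "https://commons.wikimedia.org/wiki/Special:FilePath/"
        + urllib.parse.quote(name)
    )
```

Each returned title can be turned into a URL with `file_url` and fetched with `urllib.request.urlretrieve`. The `fetch` parameter exists only so the listing logic can be tried without network access.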
Using CommonsDownloadTool
Be aware of the size of the Lingua Libre dataset:
- Average audio file size on Lingua Libre: 100 kB
- Number of audio files on Lingua Libre: 300,000+
- Total data size: about 30 GB
- Safety margin: 5-10x
- Required disk space: 150-300 GB
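The disk space figure follows from simple arithmetic, which the sketch below reproduces from the numbers given above (decimal units, 1 GB = 1,000,000 kB). The helper `has_enough_space` is a hypothetical convenience for checking a target folder before starting; it is not part of any Lingua Libre tool.

```python
import shutil

AVERAGE_AUDIO_KB = 100      # average audio size, from the figures above
AUDIO_COUNT = 300_000       # number of audio files, from the figures above
SAFETY_FACTORS = (5, 10)    # safety margin, from the figures above


def estimate_gb(count, average_kb):
    """Estimate the total size in GB (decimal: 1 GB = 1,000,000 kB)."""
    return count * average_kb / 1_000_000


def has_enough_space(path, needed_gb):
    """Return True if the filesystem holding `path` has `needed_gb` free."""
    return shutil.disk_usage(path).free >= needed_gb * 1_000_000_000


total_gb = estimate_gb(AUDIO_COUNT, AVERAGE_AUDIO_KB)
required_gb = tuple(total_gb * factor for factor in SAFETY_FACTORS)

print(f"Raw audio: {total_gb:.0f} GB")
print(f"Recommended free space: {required_gb[0]:.0f} to {required_gb[1]:.0f} GB")
```

With the figures above this prints 30 GB of raw audio and a recommended 150 to 300 GB of free space. Since the audio count is given as "300,000+", the result is a lower bound that grows with the collection.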
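To download all datasets as zips:
- Download the following scripts to a machine with enough disk space:
- create_datasets.sh (https://github.com/lingua-libre/operations/blob/master/create_datasets.sh)
- CommonsDownloadTool/commons_download_tool.py (https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py)
- Read through them, then move them to wherever suits your computer best.
- Edit them as needed so that the paths are correct.
- Run them until they complete successfully.
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.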
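The final check can be scripted. The sketch below counts the files inside every .zip in one folder, assuming the datasets all land in a single directory (the actual layout produced by create_datasets.sh may differ). The expected number is read by hand from the Commons category page and passed in; the function name and the folder path in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def count_files_in_zips(folder):
    """Count the files (not directories) inside every .zip found in `folder`."""
    total = 0
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Directory entries are skipped so only real files are counted.
            total += sum(1 for info in zf.infolist() if not info.is_dir())
    return total
```

For example, `count_files_in_zips("./datasets") == expected_count` compares the archives against the number shown on the category page. A file that appears in more than one zip is counted once per zip, so a total above the expected number is not necessarily an error.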