Help
Difference between revisions of "Download datasets"
Revision as of 10:25, 5 February 2021
Requirements
Java Runtime Environment.
Ubuntu: sudo apt-get install default-jre
Install
- Open the Wiki-java-tools project page on GitHub.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then run:
$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'
Find your target category
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation by language
Manual
Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
Note
It is also possible to take a category name as input, query the API for the list of files in that category, and then download them. For a starting point on API queries, see this pen.
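As a rough illustration of the API route, the sketch below builds a MediaWiki `list=categorymembers` query URL for a Commons category. The function names `urlencode` and `category_files_url` and the variable `API_ENDPOINT` are made up for this example and are not part of Imker. The encoder assumes ASCII-only category names.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of any existing tool.
API_ENDPOINT="https://commons.wikimedia.org/w/api.php"

# Percent-encode a string (assumption: ASCII input only).
# The body runs in a subshell so its variables stay local.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s' "$out"
)

# Print the query URL that lists up to 500 files of the given category.
category_files_url() {
  printf '%s?action=query&list=categorymembers&cmtitle=%s&cmtype=file&cmlimit=500&format=json\n' \
    "$API_ENDPOINT" "$(urlencode "Category:$1")"
}

# Possible usage (requires network access and curl):
#   curl -s "$(category_files_url 'Lingua Libre pronunciation by user')"
```

A single request returns at most one batch of results. For larger categories, the response includes a `cmcontinue` value that has to be passed back in a follow-up request to fetch the next batch.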
Using CommonsDownloadTool
Be aware that Lingua Libre has:
- Average audio file size: 100kB.
- Number of audio files: 300,000+.
- Total data size: 30GB (estimate).
- Safe error margin factor: 5~10x.
Required disk space: 150~300GB.
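The estimate above can be reproduced, and compared with the free space on the target device, with a small sketch like the following. The function names are hypothetical, decimal units are assumed (1 GB = 1,000,000 kB), and the figures are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# Total size in GB from average file size (kB) and number of files.
estimate_gb() {
  echo $(( $1 * $2 / 1000000 ))
}

# Required space in GB for a given total (GB) and safety factor.
required_gb() {
  echo $(( $1 * $2 ))
}

# Free space, in GB, on the filesystem holding the current directory.
available_gb() {
  df -Pk . | awk 'NR==2 { printf "%d\n", $4 / 1000000 }'
}

total=$(estimate_gb 100 300000)
echo "Estimated data size: ${total} GB"
echo "Recommended free space: $(required_gb "$total" 5) to $(required_gb "$total" 10) GB"
echo "Currently available here: $(available_gb) GB"
```

Run it from the directory where the datasets will be stored, so that `df` reports on the right device.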
To download all datasets as zips:
- Download the scripts to your large storage device:
- Read through them, then move them to wherever on your computer requires the least editing.
- Edit the paths as needed until the scripts work.
- Run create_datasets.sh successfully.
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
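The final check could be sketched as below. The function names and the `./datasets/*.zip` path are assumptions, since the output location depends on how the scripts were edited. The expected count is read manually from the category page. `unzip` must be installed.

```shell
#!/bin/sh
# Sketch only: function names and paths are illustrative.

# Count regular files (directory entries excluded) across the given zips.
count_zip_files() (
  total=0
  for z in "$@"; do
    n=$(unzip -Z1 "$z" | grep -vc '/$')
    total=$(( total + n ))
  done
  echo "$total"
)

# Compare the local count ($1) with the count shown on Commons ($2).
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: $1 files"
  else
    echo "MISMATCH: $1 local vs $2 on Commons"
    return 1
  fi
}

# Possible usage (replace 300000 with the number shown on the category page):
#   compare_counts "$(count_zip_files ./datasets/*.zip)" 300000
```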