Help
Difference between revisions of "Download datasets"
Revision as of 10:25, 5 February 2021
Requirements
Java Runtime Environment.
Ubuntu: sudo apt-get install default-jre
Install
- Open the GitHub Wiki-java-tools project page.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
  - On Windows: start the .exe file.
  - On Ubuntu, open a shell, then run:
    $ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'
Find your target category
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation by language
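Once the target categories are chosen, the Imker command shown above can be repeated for each of them. The following is a minimal sketch, not part of Imker itself: the function names and the default jar path are hypothetical, and it assumes that imker-cli.jar sits in the current directory and runs to completion without interactive input, which is not confirmed here. Only the -o and -c options documented in the manual are used.

```python
import shutil
import subprocess


def build_imker_command(category, out_folder, jar_path="imker-cli.jar"):
    """Build the argument list for one Imker run (-o and -c as in the manual)."""
    return ["java", "-jar", jar_path, "-o", out_folder, "-c", category]


def download_categories(categories, out_folder, jar_path="imker-cli.jar"):
    """Run Imker once per category; stop at the first failing run."""
    if shutil.which("java") is None:
        raise RuntimeError("Java Runtime Environment not found on PATH")
    for category in categories:
        # check=True raises CalledProcessError if Imker exits with an error code.
        subprocess.run(build_imker_command(category, out_folder, jar_path), check=True)
```

Calling download_categories(["CategoryName"], "./myFolder/") would run the same command as the Ubuntu example above.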
Manual
Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

  The download source must be ONE of the following:
   ↳ A Wiki category (Example: --category="Denver, Colorado")
   ↳ A Wiki page (Example: --page="Sandboarding")
   ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
Note
It is also possible to take a category name as input, run API queries to get the list of files in that category, and then download them. For a starting point on API queries, see this pen.
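As an illustration of the API approach, here is a minimal sketch that lists the files of a category through the MediaWiki API (list=categorymembers) on commons.wikimedia.org. It is an assumption about how such a query could be written, not the method used by the pen or by any Lingua Libre tool. The function names and the User-Agent string are hypothetical, and only direct members of the category are listed, not files in subcategories.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = urllib.request.Request(url, headers={"User-Agent": "category-lister-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Return the titles of all files directly in a category."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",
        "cmlimit": "500",
    }
    titles = []
    while True:
        data = fetch(params)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}
```

The returned titles (for example "File:Something.wav") are only names. Getting a download URL for each one would need a further query, such as prop=imageinfo with iiprop=url.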
Using CommonsDownloadTool
Be aware that Lingua Libre has:
- Average audio file size: 100 kB.
- Number of audio files: 300,000+.
- Total data size: 30 GB (estimate).
- Safety margin factor: 5~10x.
Required disk space: 150~300 GB.
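The estimate above can be reproduced with a few lines of arithmetic. This sketch uses the figures from the list (decimal units, 1 GB = 1,000,000 kB). The helper has_enough_space is a hypothetical addition that compares the requirement against the free space reported for a given path.

```python
import shutil

AVERAGE_FILE_KB = 100      # average audio file size, from the list above
FILE_COUNT = 300_000       # number of audio files, from the list above
SAFETY_FACTORS = (5, 10)   # safety margin factor, from the list above


def estimate_gb(file_count, average_kb):
    """Raw dataset size in GB, using decimal units (1 GB = 1,000,000 kB)."""
    return file_count * average_kb / 1_000_000


def has_enough_space(path, required_gb):
    """Hypothetical helper: True if the disk holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_gb * 1_000_000_000


raw_gb = estimate_gb(FILE_COUNT, AVERAGE_FILE_KB)
low_gb, high_gb = (raw_gb * factor for factor in SAFETY_FACTORS)
print(f"Raw data: {raw_gb:.0f} GB; recommended free space: {low_gb:.0f}-{high_gb:.0f} GB")
# prints "Raw data: 30 GB; recommended free space: 150-300 GB"
```

Before starting a download, has_enough_space(".", high_gb) gives a quick check on the current drive.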
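To download all datasets as zips:
- Download the scripts to a device with enough storage:
  - create_datasets.sh (https://github.com/lingua-libre/operations/blob/master/create_datasets.sh)
  - CommonsDownloadTool/commons_download_tool.py (https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py)
- Read through them, then place them where they fit best on your computer so that they require minimal editing.
- Edit them as needed so that the paths are correct and the scripts run.
- Run create_datasets.sh and confirm that it completes successfully.
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.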
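The last check can be partly automated. The following is a minimal sketch that counts the files inside every zip in a folder. It assumes the zips produced by create_datasets.sh all sit in one folder, which may not match your setup, and the function names are hypothetical. The total still has to be compared by hand with the file count shown on the Commons category page.

```python
import zipfile
from pathlib import Path


def count_files_in_zip(zip_path):
    """Count the entries of one zip archive, ignoring directory entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def count_files_in_zips(folder):
    """Map each zip file name in `folder` to the number of files it contains."""
    return {
        path.name: count_files_in_zip(path)
        for path in sorted(Path(folder).glob("*.zip"))
    }
```

Summing the values, for example sum(count_files_in_zips("./datasets").values()) with a folder name of your choice, gives the figure to compare with the category.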