Difference between revisions of "Download datasets"

Revision as of 10:25, 5 February 2021

Requirements

Java Runtime Environment.

Ubuntu: sudo apt-get install default-jre

Install

Open the Wiki-java-tools project page on GitHub.
Find the latest Imker release.
Download the Imker_vxx.xx.xx.zip archive.
Extract the .zip file.
Run as follows:
- On Windows: start the .exe file.
- On Ubuntu: open a shell, then run: $ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'

Find your target category

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
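The rule that exactly one download source must be given can be sketched as a small Python helper that builds the Imker command line. This is only an illustration: the function names `write_file_list` and `imker_command` are hypothetical, the short flags are the ones listed in the manual above, and passing each value as a separate argument follows the Ubuntu example in the Install section.

```python
from pathlib import Path


def write_file_list(filenames, list_path):
    """Write one filename per line, the format the manual asks for with --file."""
    path = Path(list_path)
    path.write_text("\n".join(filenames) + "\n", encoding="utf-8")
    return path


def imker_command(out_folder, *, category=None, page=None, file=None,
                  jar="imker-cli.jar"):
    """Build the argument list for one Imker run (hypothetical helper).

    Exactly one of category, page, or file must be given, mirroring the
    "ONE of the following" rule in the manual.
    """
    sources = {"-c": category, "-p": page, "-f": file}
    chosen = [(flag, value) for flag, value in sources.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("give exactly one of category, page, or file")
    flag, value = chosen[0]
    # -o is the only option the manual marks as required (the "*" entry).
    return ["java", "-jar", jar, "-o", str(out_folder), flag, str(value)]
```

The resulting list can be handed to `subprocess.run(..., check=True)` from the folder that contains imker-cli.jar, for example `imker_command("./myFolder/", category="Denver, Colorado")`.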

Note

It is also possible to take a category name as input, query the API to get the list of files in that category, and then download them. For a starting point on API queries, see this pen.
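As a rough sketch of the API approach, the following Python lists the files of a category through the MediaWiki `list=categorymembers` query and follows the `continue` values until the list is exhausted. The function names and the User-Agent string are assumptions made for this example. It returns only direct members of the category, not files in subcategories, and it does not download anything.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def category_query(category, cont=None):
    """Build the query parameters for one page of category members."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",   # only files, no pages or subcategories
        "cmlimit": "500",
    }
    if cont:
        # The API returns a "continue" object to merge into the next request.
        params.update(cont)
    return params


def fetch_json(params):
    """Send one GET request to the API and decode the JSON answer."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace it with one that identifies your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "dataset-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Yield the title of every file that is a direct member of the category."""
    cont = None
    while True:
        data = fetch(category_query(category, cont))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
```

The `fetch` parameter exists so the paging logic can be tried with a stub before sending real requests.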

Using CommonsDownloadTool

Be aware that Lingua Libre has:

Average audio file size: 100 kB.
Number of audio files: 300,000+.
Total data size: 30 GB (estimate).
Safety margin factor: 5~10x.

Required disk space: 150~300 GB.
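The arithmetic behind these figures can be written out in a few lines of Python. This sketch assumes decimal units (1 GB = 1,000,000 kB), and the helper `has_enough_space` is a hypothetical addition for checking the target drive before starting.

```python
import shutil

AVERAGE_FILE_KB = 100      # average audio file size, from the list above
FILE_COUNT = 300_000       # lower bound on the number of audio files
MARGIN_FACTORS = (5, 10)   # safety margin factor range


def estimate_gb(file_count, average_kb):
    """Estimated total size in GB, using decimal units."""
    return file_count * average_kb / 1_000_000


def has_enough_space(path, required_gb):
    """Hypothetical check: is there at least required_gb free at path?"""
    free_gb = shutil.disk_usage(path).free / 1_000_000_000
    return free_gb >= required_gb


base_gb = estimate_gb(FILE_COUNT, AVERAGE_FILE_KB)
required_gb = tuple(base_gb * factor for factor in MARGIN_FACTORS)
print(f"Raw data: {base_gb:.0f} GB, required: "
      f"{required_gb[0]:.0f}~{required_gb[1]:.0f} GB")
# prints "Raw data: 30 GB, required: 150~300 GB"
```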

To download all datasets as zips:

Download the following scripts to your large storage device:
- create_datasets.sh
- CommonsDownloadTool/commons_download_tool.py
Skim through them, then move them to wherever on your computer they need the least editing.
Edit them as needed so that the paths are correct and the scripts work.
Run create_datasets.sh until it completes successfully.
Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
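The final check can be sketched in Python as follows. It assumes the zips contain only the audio files, which may not hold if create_datasets.sh adds other files to them, and it assumes the Commons figure is read by hand from the category page. All function names are hypothetical.

```python
import zipfile


def count_files_in_zip(zip_path):
    """Count the regular files in one zip archive, skipping directory entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def count_files_in_zips(zip_paths):
    """Total number of files across several zip archives."""
    return sum(count_files_in_zip(path) for path in zip_paths)


def missing_files(local_count, commons_count):
    """Files listed on Commons but absent locally (negative if local has more)."""
    return commons_count - local_count
```

For example, `missing_files(count_files_in_zips(paths), commons_count)` returns 0 when the two counts match, where `paths` is the list of downloaded zips and `commons_count` is the number shown on the category page.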

@@ Line 44: / Line 44: @@
 === Using CommonsDownloadTool ===
 Be aware, lingualibre has :
-- Audio average size on Lili: 100kB
+* Audio file average size: 100kB.
-- Audios on Lili: 300,000+ audios
+* Number of audios files: 300,000+.
-- Total data's size = 30GB.
+* Total data's size = 30GB (estimate).
-- Safe error margin : 5-10x
+* Safe error margin factor : 5~10x.
-Required disk space : 150~300GB.
+'''Required disk space : 150~300GB.'''
 To download all datasets as zips :
-- Download on your large device the script :
+* Download on your large device the scripts :
-  -  [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh]
+** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh]
-  - [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py]
+** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py]
-- Read them a bit, move them where they fit the best on you computer
+* Read them a bit, move them where they fit the best on you computer so they require the minimum of editing
-- Edit as needed so the paths are correct, make it work.
+* Edit as needed so the paths are correct, make it work.
-- Run successfully
+* Run <code>create_datasets.sh</code> successfully
-- Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
+* Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
 [[Category:Lingua Libre:Help]]