Difference between revisions of "Download datasets"

Revision as of 17:59, 30 December 2021

Data size — 2021/02
Audio files	800,000+
Average size	100kB
Total size (est.)	80GB

Context

Data clean up

By default, we provide both per-language zip archives and one archive covering all of Lingua Libre, which doubles the size of your download if you download everything.
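As a rough check before downloading, the figures from the data size table can be multiplied out. This is only a back-of-the-envelope sketch: the file count and average size are the 2021/02 estimates quoted above, and the `copies` parameter is a hypothetical name for "how many times each file ends up on disk".

```python
# Figures from the "Data size" table (2021/02); both are rough estimates.
AUDIO_FILES = 800_000
AVERAGE_SIZE_KB = 100


def total_size_gb(files: int, avg_kb: float, copies: int = 1) -> float:
    """Estimated download size in GB (decimal units: 1 GB = 1,000,000 kB)."""
    return files * avg_kb * copies / 1_000_000


if __name__ == "__main__":
    single = total_size_gb(AUDIO_FILES, AVERAGE_SIZE_KB)
    # Per-language archives plus the full archive store every file twice.
    both = total_size_gb(AUDIO_FILES, AVERAGE_SIZE_KB, copies=2)
    print(f"one copy: {single:.0f} GB, per-language + full archive: {both:.0f} GB")
```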

Find your target category

Tools

Python (current)

Petscan and Wikiget together allow downloading about 15,000 audio files per hour.

Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
List target files with Petscan: given a target category on Commons, Petscan provides the list of target files. Example.
Download target files with Wikiget: Wikiget downloads the target files.
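The hand-off between the last two steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the exact workflow used for the runs reported below: it assumes the Petscan result was exported as plain text with one title per line, and that Wikiget's batch mode is selected with `-a` (confirm with `wikiget --help` for your installed version). The helper names are hypothetical.

```python
from pathlib import Path


def normalise_titles(lines):
    """Turn raw list lines into 'File:...' titles, skipping blanks and duplicates."""
    seen, titles = set(), []
    for line in lines:
        # MediaWiki treats underscores and spaces in titles as equivalent.
        title = line.strip().replace("_", " ")
        if not title:
            continue
        if not title.startswith("File:"):
            title = "File:" + title
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def write_batch_file(titles, path):
    """Write one title per line, the layout assumed for Wikiget's batch input."""
    path = Path(path)
    path.write_text("\n".join(titles) + "\n", encoding="utf-8")
    return path


def wikiget_command(batch_path):
    """Build the command line; `-a` (batch mode) is an assumption to verify."""
    return ["wikiget", "-a", str(batch_path)]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the flag has been checked against the installed Wikiget version.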

Comments:

Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. Occurrences have been reported at around 1 per 30,000 files. A fix is underway; support the request on GitHub.
Wikiget therefore requires a volunteer to supervise the script while it runs.
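When a run has paused and must be resumed by hand, one way to avoid starting over is to recompute which targets are still missing. The sketch below assumes that downloaded files keep their Commons title as file name (minus the `File:` prefix, with underscores and spaces treated as equivalent); the function names are hypothetical, and a partially written file would wrongly count as done.

```python
from pathlib import Path


def bare_name(title: str) -> str:
    """Drop a leading 'File:' and treat underscores as spaces, as MediaWiki does."""
    name = title.strip()
    if name.startswith("File:"):
        name = name[len("File:"):]
    return name.replace("_", " ")


def remaining_targets(targets, download_dir):
    """Return the targets that have no matching file in download_dir yet."""
    done = {bare_name(p.name) for p in Path(download_dir).iterdir() if p.is_file()}
    return [t for t in targets if bare_name(t) not in done]
```

The result can be written to a fresh batch list so that the resumed run only requests the missing files.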

NodeJS (soon)

A WikiapiJS script can download the files of a target category, or of a root category together with its subcategories and the files they contain. It downloads about 1,400 audio files per hour.

WikiapiJS is the NodeJS / NPM package that allows scripted API calls on Wikimedia Commons and LinguaLibre.
Specific scripts for each task:
- Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
- Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js

Dependencies: git, nodejs, npm.

Comments, as of December 2021:

Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
Successfully processes a single category's files.
Successfully processes a root category and its subcategories' files, generating ./isocode/ folders.
Scalability tests for resilience under large numbers of requests (from more than 500 up to 100,000 items) are still required.
Performance improvements are under consideration on GitHub.
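The sustained speeds quoted for the two tools follow directly from the reported run figures, and they can be extrapolated to the whole collection. The sketch below assumes the rate stays constant over a longer run, which the scalability note above says has not been tested for the NodeJS script; the 800,000 file count is the estimate from the data size table.

```python
def rate_per_second(files: int, seconds: float) -> float:
    """Average downloads per second over a run."""
    return files / seconds


def hours_needed(files: int, rate: float) -> float:
    """Hours to fetch `files` items at a constant `rate` (items per second)."""
    return files / rate / 3600


# Reported runs: 730,000 files in 20 hours (Python), 400 files in 16 minutes (NodeJS).
python_rate = rate_per_second(730_000, 20 * 3600)
node_rate = rate_per_second(400, 16 * 60)

if __name__ == "__main__":
    for label, rate in (("Petscan + Wikiget", python_rate), ("WikiapiJS", node_rate)):
        print(f"{label}: {rate:.2f} files/sec, "
              f"{hours_needed(800_000, rate):.0f} h for 800,000 files")
```

Under that constant-rate assumption, the full collection takes roughly 22 hours with the Python route and roughly 530 hours with the current NodeJS script.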

Datasets (outdated)

Bash / Python

Refreshed: auto-run every 2 days.
The scripts: on LinguaLibre, we want to collect audio files by language. One master script (/lingua-libre/operations/create_datasets.sh) creates the commands; lingua-libre/CommonsDownloadTool, a server-side Python script, runs them. Python and LinguaLibre knowledge is required.
Evolutions: the page may benefit from some HTML and styling. Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or in the LinguaLibre:Chat room.

Access (outdated)

Open https://lingualibre.org/datasets/
Download the zip whose name matches one of the following:
- Target language : {qId}-{iso639-3}-{language_English_name}.zip
- All languages : https://lingualibre.fr/datasets/lingualibre_full.zip
On your device, unzip.
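The naming pattern and the unzip step above can be sketched as follows. Several things here are assumptions: that per-language archives sit directly under the datasets URL from the first step, and that the example values `Q21`, `fra`, `French` correspond to a real archive. Since this section is marked outdated, the URL may no longer resolve.

```python
import zipfile
from pathlib import Path
from urllib.parse import quote

# ASSUMPTION: archives are served directly under the datasets page URL.
BASE_URL = "https://lingualibre.org/datasets/"


def archive_name(qid: str, iso: str, english_name: str) -> str:
    """Build '{qId}-{iso639-3}-{language_English_name}.zip'."""
    return f"{qid}-{iso}-{english_name}.zip"


def archive_url(qid: str, iso: str, english_name: str) -> str:
    """Percent-encode the name so language names with spaces stay valid in a URL."""
    return BASE_URL + quote(archive_name(qid, iso, english_name))


def unzip(zip_path, dest):
    """Extract an archive into dest and return the list of member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

For example, `archive_url("Q21", "fra", "French")` builds the address for a hypothetical French archive, which can then be fetched with any download tool and passed to `unzip`.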

Go to the relevant tutorials to clean up or rename your data.

Using Imker

Requirements

On Ubuntu, run:

sudo apt-get install default-jre    # install Java environment

Be aware of your target data size (see section above).

Install

Open the GitHub Wiki-java-tools project page.
Find the latest Imker release.
Download the Imker_vxx.xx.xx.zip archive.
Extract the .zip file.
Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then run:

$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media within the Wikimedia Commons category "CategoryName"

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
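When calling Imker from a script, the "exactly one download source" rule from the manual can be enforced before Java is even started. The sketch below only builds the argument list, using the short options shown in the manual; the function name, its parameters and the default jar path are hypothetical.

```python
def imker_command(outfolder, category=None, page=None, file=None,
                  domain=None, jar="imker-cli.jar"):
    """Build an Imker command line with exactly one download source."""
    sources = {"-c": category, "-p": page, "-f": file}
    chosen = [(flag, value) for flag, value in sources.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("give exactly one of: category, page, file")
    flag, value = chosen[0]
    cmd = ["java", "-jar", jar, "-o", outfolder, flag, value]
    if domain is not None:
        # Only needed when fetching from a wiki other than the default.
        cmd += ["-d", domain]
    return cmd
```

The list can be handed to `subprocess.run(cmd, check=True)`; passing arguments as a list avoids shell quoting problems with category names that contain spaces or commas.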

Using CommonsDownloadTool

To download all datasets as zips:

Download the scripts to a device with enough storage:
- create_datasets.sh
- CommonsDownloadTool/commons_download_tool.py
Skim through them, then move them to wherever on your computer they need the least editing.
Edit the paths as needed until the scripts work.
Run create_datasets.sh until it completes successfully.
Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
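The final check can be sketched in Python. Counting zip members is straightforward; for the Commons side, this sketch uses the MediaWiki API's `prop=categoryinfo` query. Two caveats apply: `categoryinfo` reports only files placed directly in the named category, not those in its subcategories, and counting the per-language zips together with the full archive would count every file twice. The function names and the User-Agent string are hypothetical.

```python
import json
import zipfile
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://commons.wikimedia.org/w/api.php"


def count_files_in_zips(zip_paths):
    """Total number of non-directory members across the given zip archives."""
    total = 0
    for path in zip_paths:
        with zipfile.ZipFile(path) as zf:
            total += sum(1 for info in zf.infolist() if not info.is_dir())
    return total


def commons_category_file_count(category):
    """Number of files directly in a Commons category (network call)."""
    params = urlencode({
        "action": "query", "prop": "categoryinfo", "titles": category,
        "format": "json", "formatversion": "2",
    })
    request = Request(f"{API}?{params}",
                      headers={"User-Agent": "dataset-check-sketch/0.1 (example)"})
    with urlopen(request) as response:
        data = json.load(response)
    return data["query"]["pages"][0]["categoryinfo"]["files"]
```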

Javascript and/or API queries

Another approach takes a category name as input, then uses API queries to get the list of files and download them. As a starting point, see this pen, which gives some examples of API queries.
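The same idea can be sketched in Python against the MediaWiki API of Wikimedia Commons, using `list=categorymembers` restricted to files and following the API's continuation tokens. This is an illustrative sketch, not the code from the pen mentioned above; the function names and the User-Agent string are hypothetical, and the `fetch` parameter exists only so the paging logic can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    request = Request(f"{API}?{urlencode(params)}",
                      headers={"User-Agent": "category-list-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Return the titles of all files directly in `category`, following paging."""
    params = {
        "action": "query", "list": "categorymembers", "cmtitle": category,
        "cmtype": "file", "cmlimit": "500", "format": "json",
    }
    titles = []
    while True:
        data = fetch(params)
        titles += [member["title"] for member in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}
```

Calling `list_category_files("Category:Lingua Libre pronunciation-cmn")` would list one language's files; each title can then be handed to a download tool.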

Use HTML audio elements in web pages

See Audio 101.

@@ Line 23: / Line 23: @@
 * [[:Commons:Category:Lingua Libre pronunciation]] by language
-== Hand downloading ==
+== Tools ==
+=== Python (current)===
+Petscan and Wikiget allows to download about 15,000 audio files per hour.
+# '''Select your category :''' see [[:commons:Category:Lingua_Libre_pronunciation|Category:Lingua Libre pronunciation]] and [[:commons:Category:Lingua Libre pronunciation by user|Category:Lingua Libre pronunciation by user]], then find your target category,
+# '''List target files with [https://petscan.wmflabs.org Petscan] :''' Given a target category on Commons, provides list of target files. [https://petscan.wmflabs.org/?&cb_labels_yes_l=1&cb_labels_no_l=1&edits%5Banons%5D=both&interface_language=en&edits%5Bflagged%5D=both&categories=Lingua%20Libre%20pronunciation-cmn&cb_labels_any_l=1&ns%5B0%5D=1&project=wikimedia&since_rev0=&search_max_results=500&edits%5Bbots%5D=both&ns%5B6%5D=1&language=commons&search_query= Example].
+# '''Download target files with [https://pypi.org/project/wikiget/ Wikiget] :''' downloads targets files.
+Comments:
+* Successful on November 2021, with 730,000 audio downloaded in 20 hours. Sustained average speed : 10 downloads/sec.
+* Some delete files on Commons may cause Wikiget to return an error and pause. The script has to be resumed manually. Occurrence have been reported to be around 1/30,000 files. Fix is underway, support the request [https://github.com/clpo13/wikiget/issues/2 on github].
+* WikiGet therefor requires a volunteer to supervise the script while running.
+=== NodeJS (soon) ===
+A WikiapiJS script allows to download target category's files, or a root category, its subcategories and contained files. Downloads about 1,400 audio files per hour.
+# WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
+# Specific script used to do a given task:
+#* Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
+#* Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js
+Dependencies: git, nodejs, npm.
+Comments, as of December 2021:
+* Successful on December 2021, with 400 audios downloaded in 16 minutes. Sustained average speed : 0.4 downloads/sec.
+* Successfully process single category's files.
+* Successfully process root category and subcategories' files, generating ./isocode/ folders.
+* Scalability tests for resilience with high amounts requests >500 to 100,000 items is required.
+* Performance improvements are under consideration [https://github.com/kanasimi/wikiapi/issues/51#issuecomment-1002267855 on github].
+== Datasets (outdated) ==
+=== Bash / Python ===
+Refreshed : auto-run every 2 days.<br>
+The scripts : One master script ([https://github.com/lingua-libre/operations/blob/master/create_datasets.sh /lingua-libre/operations/create_datasets.sh]) create the commands. On LinguaLibre, we want to collect audios by languages. [https://github.com/lingua-libre/CommonsDownloadTool lingua-libre/CommonsDownloadTool], a server-side python script, runs them. Python and LinguaLibre knowledge is required.<br>
+Evolutions : the page may gain from some html and styling. Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].
+==== Access (outdated)====
 # Open https://lingualibre.org/datasets/
-# Download your target language's zip
+# Download  zip name such
+#* Target language : <code>{qId}-{iso639-3}-{language_English_name}.zip</code>
+#* All languages : https://lingualibre.fr/datasets/lingualibre_full.zip
 # On your device, unzip.
 Go to the relevant tutorials to clean up or rename your data.
