Difference between revisions of "Download datasets"

Revision as of 01:47, 31 December 2021

Data size — 2022/02
Audio files	800,000+
Average size	100 kB
Total size (est.)	80 GB
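
The estimated total is the file count multiplied by the average file size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 kB); the function name is hypothetical and only serves the illustration:

```python
def estimate_total_gb(file_count, avg_size_kb):
    """Return the estimated total size in GB, using decimal units."""
    return file_count * avg_size_kb / 1_000_000


# Figures taken from the table above.
print(estimate_total_gb(800_000, 100))  # 80.0
```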

Context

Data clean up

By default, we provide both per-language archives and an all-Lingua Libre zip archive, so the download size doubles if you download them all.

Find your target category

Tools

Python (current)

Dependencies: Python 3.6+

Petscan and Wikiget allow you to download about 15,000 audio files per hour.

Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
List target files with Petscan: given a target category on Commons, Petscan provides the list of target files. Example.
Download target files with Wikiget: Wikiget downloads the listed target files.
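
The first two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the PetScan endpoint and query parameters (`language`, `project`, `categories`, `ns[6]`, `format`, `doit`) reflect how PetScan query URLs usually look and should be checked against the PetScan form, the category name is a hypothetical example, and the helper names are invented here. The sketch builds the query URL and formats a list of titles, one per line, which is the kind of text file a batch download can read; check `wikiget --help` for the exact batch option.

```python
from urllib.parse import urlencode

# Assumed PetScan endpoint; verify before use.
PETSCAN_URL = "https://petscan.wmflabs.org/"


def build_petscan_query(category, depth=0):
    """Build a PetScan URL listing the files of a Commons category.

    All parameter names are assumptions based on typical PetScan URLs.
    """
    params = {
        "language": "commons",
        "project": "wikimedia",
        "categories": category,
        "depth": depth,      # how many levels of subcategories to follow
        "ns[6]": 1,          # namespace 6 is the File: namespace
        "format": "json",
        "doit": 1,           # run the query immediately
    }
    return PETSCAN_URL + "?" + urlencode(params)


def to_file_list(titles):
    """Format titles as one 'File:...' entry per line, skipping blanks."""
    lines = []
    for title in titles:
        title = title.strip()
        if not title:
            continue
        if not title.startswith("File:"):
            title = "File:" + title
        lines.append(title)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical category name, for illustration only.
    print(build_petscan_query("Lingua Libre pronunciation-fra"))
    print(to_file_list(["LL-example-1.wav", "File:LL-example-2.wav"]))
```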

Comments:

Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. This has been reported to occur for around 1 in 30,000 files. A fix is underway; support the request on GitHub.
Wikiget therefore requires a volunteer to supervise the script while it runs.
As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download, it is therefore recommended to run the Python script in 20 to 30 terminal windows simultaneously. Each terminal running Wikiget consumes about 20 Kb/s on average.
Wikiget requires a stable internet connection. Any disruption, even of 1 second, stops the download process, and the Python script must then be restarted manually.
Questions about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj
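
The two practical points above (running several terminals in parallel, and resuming by hand after an interruption) can be prepared with a small helper. A minimal sketch, assuming the target titles are already available as a Python list and that each downloaded file is saved under its title without the `File:` prefix; both function names are hypothetical and are not part of Wikiget.

```python
from pathlib import Path


def split_round_robin(titles, n_workers):
    """Split titles into n_workers lists, one per terminal window."""
    return [titles[i::n_workers] for i in range(n_workers)]


def remaining_titles(titles, download_dir):
    """Return the titles whose file is not yet present in download_dir.

    Assumes files are saved under the title without the 'File:' prefix.
    """
    folder = Path(download_dir)
    done = set()
    if folder.is_dir():
        done = {p.name for p in folder.iterdir() if p.is_file()}
    todo = []
    for title in titles:
        name = title[5:] if title.startswith("File:") else title
        if name not in done:
            todo.append(title)
    return todo


if __name__ == "__main__":
    titles = ["File:a.wav", "File:b.wav", "File:c.wav"]
    # After an interruption, keep only what is still missing,
    # then spread it over two terminals.
    todo = remaining_titles(titles, "./downloads")
    for chunk in split_round_robin(todo, 2):
        print(chunk)
```

Each resulting list can be written to its own text file and handed to one terminal, so a restart only repeats the missing files.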

NodeJS (soon)

Dependencies: git, nodejs, npm.

A WikiapiJS script can download the files of a target category, or of a root category together with its subcategories and the files they contain. It downloads about 1,400 audio files per hour.

WikiapiJS is the NodeJS / NPM package that allows scripted API calls to Wikimedia Commons and Lingua Libre.
Scripts for specific tasks:
- Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
- Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js

Comments, as of December 2021:

Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
Successfully processes a single category's files.
Successfully processes a root category's and its subcategories' files, generating ./isocode/ folders.
Scalability and resilience tests with large numbers of requests (from over 500 up to 100,000 items) are still required.
Performance improvements are under consideration on GitHub.

Python (slow)

Dependencies: python.

CommonsDownloadTool.py is a Python script that was formerly used to create the Lingua Libre datasets. It can be adapted to your needs. To download all datasets as zips:

Download scripts :
- create_datasets.sh - creates CommonsDownloadTool's commands.
- CommonsDownloadTool/commons_download_tool.py - core script.
Skim through both scripts, then move them to wherever on your computer they will require the least editing.
Edit the paths as needed until the scripts run.
Run create_datasets.sh.
Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
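
The final check can be sketched in Python with the standard `zipfile` module. This is a minimal illustration, not part of CommonsDownloadTool: the list of audio extensions is an assumption, the function names are hypothetical, and the expected count must be read from the Commons category page and passed in by hand.

```python
import zipfile

# Assumed set of audio extensions; adjust to the actual dataset content.
AUDIO_EXTENSIONS = (".wav", ".ogg", ".mp3", ".flac")


def count_audio_files(zip_paths):
    """Count the audio files contained in the given zip archives."""
    total = 0
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            total += sum(
                1
                for name in archive.namelist()
                if name.lower().endswith(AUDIO_EXTENSIONS)
            )
    return total


def matches_expected(zip_paths, expected_count):
    """Return (found, ok), where ok is True if found equals expected_count."""
    found = count_audio_files(zip_paths)
    return found, found == expected_count
```

For example, `matches_expected(["fra.zip"], 1200)` (hypothetical file name and count) returns the number of audio files found and whether it equals the count shown on Commons.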

Comments:

This script is slow and has been phased out because Lingua Libre has grown too large for it.
The page may benefit from some HTML and styling.
Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the LinguaLibre:Chat room.

Java (not tested)

Dependencies:

sudo apt-get install default-jre    # install Java environment

Usage:

Open the GitHub Wiki-java-tools project page.
Find the latest Imker release.
Download the Imker_vxx.xx.xx.zip archive.
Extract the .zip file.
Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then:

$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media files within the Wikimedia Commons category "CategoryName"

Comments :

Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
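
The manual's rule that the download source must be exactly one of category, page, or local file can be enforced before launching Imker from a script. A minimal sketch that only assembles the argument list from the options documented above; the function name is hypothetical, the jar is assumed to sit in the current directory, and, like the tool itself on this page, the sketch has not been tested against a real Imker release.

```python
def imker_command(outfolder, category=None, page=None, file=None, domain=None):
    """Build the argument list for imker-cli.jar from the documented options.

    Exactly one of category, page or file must be given.
    """
    sources = [("--category", category), ("--page", page), ("--file", file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) != 1:
        raise ValueError("exactly one of category, page or file is required")
    flag, value = chosen[0]
    command = ["java", "-jar", "imker-cli.jar", "--outfolder", outfolder, flag, value]
    if domain is not None:
        # Optional; the manual gives commons.wikimedia.org as the default.
        command += ["--domain", domain]
    return command


if __name__ == "__main__":
    print(imker_command("./myFolder/", category="CategoryName"))
```

The returned list can be passed to `subprocess.run`, which avoids shell quoting issues with category names that contain spaces or commas.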

LinguaLibre dataset page (outdated)

Former access (outdated)

This access formerly relied on the Python (slow) script described above.

Open https://lingualibre.org/datasets/
Download the zip archive you need. Archives are named as follows:
- Target language : {qId}-{iso639-3}-{language_English_name}.zip
- All languages : https://lingualibre.fr/datasets/lingualibre_full.zip
On your device, unzip.
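
The per-language naming pattern `{qId}-{iso639-3}-{language_English_name}.zip` can be split back into its parts when sorting downloaded archives. A minimal sketch, assuming the first two fields never contain a hyphen while the language name may; the function name and the example file name are hypothetical.

```python
def parse_dataset_name(filename):
    """Split '{qId}-{iso639-3}-{language_English_name}.zip' into its parts."""
    if not filename.endswith(".zip"):
        raise ValueError("not a zip archive name: " + filename)
    stem = filename[: -len(".zip")]
    # Split on the first two hyphens only: the language name may contain hyphens.
    parts = stem.split("-", 2)
    if len(parts) != 3:
        raise ValueError("unexpected archive name: " + filename)
    qid, iso, language = parts
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError("unexpected qId: " + qid)
    return {"qid": qid, "iso639_3": iso, "language": language}


if __name__ == "__main__":
    # Hypothetical archive name, for illustration only.
    print(parse_dataset_name("Q21-fra-French.zip"))
```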

Go to the relevant tutorials to clean up or rename your data.

API queries

See Help:APIs

Use html audios elements in webpages

See Help:Embed audio in HTML

@@ Line 27: / Line 27: @@
 == Tools ==
 === Python (current)===
-Dependencies: python.
+Dependencies: Python 3.6+
 '''Petscan''' and '''Wikiget''' allows to download about 15,000 audio files per hour.
@@ Line 37: / Line 37: @@
 * Successful on November 2021, with 730,000 audio downloaded in 20 hours. Sustained average speed : 10 downloads/sec.
 * Some delete files on Commons may cause Wikiget to return an error and pause. The script has to be resumed manually. Occurrence have been reported to be around 1/30,000 files. Fix is underway, support the request [https://github.com/clpo13/wikiget/issues/2 on github].
-* WikiGet therefor requires a volunteer to supervise the script while running.
+* WikiGet therefore requires a volunteer to supervise the script while running.
+* As of December 2021, WikiGet does not support multi-thread downloads. Therefore, to increase the efficiency of the download process it is recommended to run the Python Script on 20-30 terminal windows simultaneously. Each terminal running WikiGet would consume an average of 20 Kb/s.
+* WikiGet requires an stable internet connection. Any disruption of 1 second would stop the download process and it requires manual restart of the Python Script.
+* Any question about downloading datasets can be made on the Discord Server of Lingua Libre : https://discord.gg/2WECKUHj
 === NodeJS (soon) ===
