Difference between revisions of "Download datasets"

Revision as of 18:36, 30 December 2021

Data size (2021/02)
Audio files	800,000+
Average size	100 kB
Total size (est.)	80 GB
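
The total above follows from simple arithmetic, sketched here in Python. The function name and the use of decimal units (1 GB = 1,000,000 kB) are illustrative assumptions, not part of any LinguaLibre tool.

```python
def estimate_total_gb(file_count: int, average_kb: float) -> float:
    """Return the estimated total size in GB, using decimal units."""
    return file_count * average_kb / 1_000_000


# Figures from the table above: 800,000 files averaging 100 kB each.
total_gb = estimate_total_gb(800_000, 100)
print(f"Estimated total: {total_gb:.0f} GB")
```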

Context

Data clean up

By default, we provide both per-language and all-lingualibre zip archives, which therefore doubles the data size if you download everything.

Find your target category

Tools

Python (current)

Dependencies: Python.

Petscan and Wikiget together allow downloading about 15,000 audio files per hour.

Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
List target files with Petscan: given a target category on Commons, Petscan provides the list of target files. Example.
Download target files with Wikiget: Wikiget downloads the target files.
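
The hand-off between the listing step and the download step can be sketched as follows. This is a minimal sketch that assumes the Petscan result was exported as plain text with one title per line; the function name and the sample titles are hypothetical, and the exact input format expected by Wikiget should be checked against its own documentation.

```python
def normalize_titles(raw_export: str) -> list[str]:
    """Turn a one-title-per-line export into a clean, de-duplicated list.

    Assumption: titles may or may not carry the "File:" prefix and may use
    underscores instead of spaces, so both forms are normalized.
    """
    seen = set()
    titles = []
    for line in raw_export.splitlines():
        title = line.strip().replace("_", " ")
        if not title:
            continue  # skip blank lines
        if not title.startswith("File:"):
            title = "File:" + title
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


# Hypothetical sample export, for illustration only.
sample = "File:LL-Q150 (fra)-Example-bonjour.wav\nLL-Q150_(fra)-Example-merci.wav\n\n"
for title in normalize_titles(sample):
    print(title)
```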

Comments:

Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. Occurrences have been reported at around 1 in 30,000 files. A fix is underway; support the request on GitHub.
Wikiget therefore requires a volunteer to supervise the script while it runs.
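
One way to reduce the need for supervision is to wrap the download step in a loop that records failures and moves on. The sketch below is not the fix under discussion on GitHub; `download_all` and `fetch_one` are hypothetical names, and `fetch_one` stands for whatever call actually downloads a single file (for example a wrapper around Wikiget).

```python
from pathlib import Path
from typing import Callable, Iterable


def download_all(titles: Iterable[str], out_dir: Path,
                 fetch_one: Callable[[str, Path], None]) -> list[str]:
    """Download each title, skipping existing files and collecting failures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for title in titles:
        target = out_dir / title.removeprefix("File:")
        if target.exists():
            continue  # already downloaded in a previous run
        try:
            fetch_one(title, target)
        except Exception as error:  # e.g. the file was deleted on Commons
            print(f"Skipping {title}: {error}")
            failed.append(title)
    return failed
```

Because existing files are skipped, running the function again resumes where the previous run stopped, and the returned list of failed titles can be reviewed afterwards.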

NodeJS (soon)

Dependencies: git, nodejs, npm.

A WikiapiJS script can download a target category's files, or a root category, its subcategories and the files they contain. It downloads about 1,400 audio files per hour.

WikiapiJS is the NodeJS / NPM package allowing scripted API calls to Wikimedia Commons and LinguaLibre.
Specific scripts for given tasks:
- Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
- Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js

Comments, as of December 2021:

Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
Successfully processes a single category's files.
Successfully processes the files of a root category and its subcategories, generating ./isocode/ folders.
Scalability tests for resilience under high request volumes (from 500 to 100,000 items) are required.
Performance improvements are under consideration on GitHub.
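
The reported speeds can be rechecked, and a run time projected, with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the projection assumes the measured rate stays constant at larger scale, which is exactly what has not been tested yet.

```python
def downloads_per_second(file_count: int, elapsed_seconds: float) -> float:
    """Average download rate over a whole run."""
    return file_count / elapsed_seconds


def hours_needed(file_count: int, rate_per_second: float) -> float:
    """Projected run time in hours at a constant rate."""
    return file_count / rate_per_second / 3600


# Runs reported on this page.
wikiget_rate = downloads_per_second(730_000, 20 * 3600)  # November 2021
wikiapi_rate = downloads_per_second(400, 16 * 60)        # December 2021

print(f"Wikiget:   {wikiget_rate:.1f} downloads/sec")
print(f"WikiapiJS: {wikiapi_rate:.1f} downloads/sec")
print(f"100,000 files at the WikiapiJS rate: "
      f"{hours_needed(100_000, wikiapi_rate):.0f} hours")
```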

Java (not tested)

Dependencies:

sudo apt-get install default-jre    # install Java environment

Usage:

Open the GitHub Wiki-java-tools project page.
Find the latest Imker release.
Download the Imker_vxx.xx.xx.zip archive.
Extract the .zip file.
Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then:

$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media files in the Wikimedia Commons category "CategoryName"

Comments:

Not yet used by any LinguaLibre member. If you use it, please share your experience of this tool.

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
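
The "exactly one download source" rule can be enforced before Imker is launched. The helper below is a hypothetical wrapper, not part of Imker: it only assembles an argument list from the short flags listed in the manual above, and it assumes imker-cli.jar sits in the current directory and that every short flag takes its value as a separate argument, as -o and -c do in the usage line.

```python
def build_imker_command(outfolder, category=None, page=None, file=None,
                        domain=None):
    """Build the argument list for imker-cli.jar from the documented options."""
    sources = {"-c": category, "-p": page, "-f": file}
    chosen = [(flag, value) for flag, value in sources.items()
              if value is not None]
    if len(chosen) != 1:
        raise ValueError("Provide exactly one of category, page or file")
    flag, value = chosen[0]
    command = ["java", "-jar", "imker-cli.jar", "-o", outfolder]
    if domain is not None:
        command += ["-d", domain]  # otherwise Imker uses its default domain
    command += [flag, value]
    return command


print(build_imker_command("./myFolder/", category="CategoryName"))
```

The returned list can be handed to `subprocess.run`, which avoids shell quoting problems with category names that contain spaces or commas.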

Datasets (outdated)

Former access (outdated)

Open https://lingualibre.org/datasets/
Download the zip archive named as follows:
- Target language : {qId}-{iso639-3}-{language_English_name}.zip
- All languages : https://lingualibre.fr/datasets/lingualibre_full.zip
On your device, unzip the archive.

Go to the relevant tutorials to clean up or rename your data.

Python

Dependencies: Python.

CommonsDownloadTool.py is a Python script which formerly created datasets for LinguaLibre.

To download all datasets as zips:

Download the scripts to a device with ample storage:
- create_datasets.sh - creates CommonsDownloadTool's commands.
- CommonsDownloadTool/commons_download_tool.py - core script.
Read through them, then move them to wherever on your computer they require the least editing.
Edit the paths as needed until the scripts work.
Run create_datasets.sh.
Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
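
The final check can be partly automated. The sketch below counts the audio entries in a zip archive so the result can be compared with the count shown on the Commons category page, which still has to be read by hand. The function names and the list of audio extensions are assumptions made for illustration.

```python
import zipfile


def count_audio_files(zip_path, extensions=(".wav", ".ogg", ".mp3", ".flac")):
    """Count entries in a zip archive whose names end with an audio extension."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(
            1 for name in archive.namelist()
            if name.lower().endswith(extensions)
        )


def matches_expected(zip_path, expected_count):
    """Return True when the archive holds exactly the expected number of files."""
    return count_audio_files(zip_path) == expected_count
```

For instance, `matches_expected("lingualibre_full.zip", n)` would return True only if the archive contains exactly `n` audio files, where `n` is the number copied from the category page.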

Comments:

This script is slow and was phased out as Lingualibre grew too large.
The page may gain from some HTML and styling.
Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the LinguaLibre:Chat room.

Javascript and/or API queries

See Help:APIs

There are also ways to use a category name as input, then run API queries to get the list of files and download them. For a starting point on API queries, see this pen, which gives some examples of API queries.
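
As a rough illustration of the category-to-file-list step, the sketch below uses the standard MediaWiki `list=categorymembers` query against Wikimedia Commons. The function names and the User-Agent string are placeholders chosen for this example, and only files directly inside the category are listed, not those in its subcategories.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "dataset-download-example/0.1"}  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Yield file titles in a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}
```

Iterating over `list_category_files("Category:...")` with a real category title yields one `File:` title per result; the `fetch` parameter exists so the paging logic can be exercised without network access.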

Use HTML audio elements in webpages

See Audio 101.

@@ Line 27: / Line 27: @@
 == Tools ==
 === Python (current)===
+Dependencies: python.
 Petscan and Wikiget allows to download about 15,000 audio files per hour.
 # '''Select your category :''' see [[:commons:Category:Lingua_Libre_pronunciation|Category:Lingua Libre pronunciation]] and [[:commons:Category:Lingua Libre pronunciation by user|Category:Lingua Libre pronunciation by user]], then find your target category,
@@ Line 38: / Line 40: @@
 === NodeJS (soon) ===
+Dependencies: git, nodejs, npm.
 A WikiapiJS script allows to download target category's files, or a root category, its subcategories and contained files. Downloads about 1,400 audio files per hour.
 # WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
@@ Line 43: / Line 47: @@
 #* Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
 #* Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js
-Dependencies: git, nodejs, npm.
 Comments, as of December 2021:
@@ Line 70: / Line 72: @@
 $java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all medias within Wikimedia Commons's category "CategoryName"
 </syntaxhighlight>
+Comments :
+* Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.
 ==== Manual ====
@@ Line 104: / Line 109: @@
 Go to the relevant tutorials to clean up or rename your data.
-=== Bash / Python ===
+=== Python ===
-Refreshed : auto-run every 2 days.<br>
+Dependencies: python.
-The scripts : One master script ([https://github.com/lingua-libre/operations/blob/master/create_datasets.sh /lingua-libre/operations/create_datasets.sh]) create the commands. On LinguaLibre, we want to collect audios by languages. [https://github.com/lingua-libre/CommonsDownloadTool lingua-libre/CommonsDownloadTool], a server-side python script, runs them. Python and LinguaLibre knowledge is required.<br>
-Evolutions : the page may gain from some html and styling. Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].
-=== Using CommonsDownloadTool ===
+CommonsDownloadTool.py is a python script which formerly created datasets for LinguaLibre.
 To download all datasets as zips :
 * Download on your large device the scripts :
-** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh]
+** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh] - creates CommonsDownloadTool's commands.
-** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py]
+** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py] - core script.
 * Read them a bit, move them where they fit the best on you computer so they require the minimum of editing
 * Edit as needed so the paths are correct, make it work.
-* Run <code>create_datasets.sh</code> successfully
+* Run <code>create_datasets.sh</code>
 * Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
+Comments:
+* This script is slow and has been phased out as Lingualibre grown too much.
+* The page may gain from some html and styling.
+* Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].
 == Javascript and/or API queries ==
+:''See [[Help:APIs]]
 There are also ways to use a category name as input, then to do API queries in order to get the list of files, download them.
 For a start point on API queries, see [https://codepen.io/hugolpz/pen/ByoKOK this pen], which gives some example of API queries.
