Help
Difference between revisions of "Download datasets"
Downloading Lingua Libre's audio datasets allows external reuse of these recordings in native or web applications. LinguaLibre's periodic dump generation service is currently stalled; volunteer developers are working on it (January 2022). Current, past and future alternatives are documented below. Other tutorials cover how to clean up the resulting folders and how to rename the files to the more practical {language}-{word}.ogg pattern. Be aware of the overall data size, estimated at 40GB for the wav format.
Revision as of 18:36, 30 December 2021
| Data size (2021/02) | |
|---|---|
| Audio files | 800,000+ |
| Average size | 100 kB |
| Total size (est.) | 80 GB |
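The total in the table follows from the file count and the average file size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 kB); the function name `estimate_total_gb` is only illustrative:

```python
# Figures taken from the data size table above (2021/02).
AUDIO_FILES = 800_000
AVERAGE_SIZE_KB = 100


def estimate_total_gb(file_count: int, average_kb: float) -> float:
    """Return the estimated total size in GB (decimal units assumed)."""
    return file_count * average_kb / 1_000_000


total_gb = estimate_total_gb(AUDIO_FILES, AVERAGE_SIZE_KB)
print(f"Estimated total size: {total_gb:.0f} GB")
```

Replace the two constants with the figures of your target category to estimate the disk space you need before starting a download.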
Context
Data clean up
- See also Convert files formats • Denoise files • Rename and mass rename
By default, we provide both per-language and all-LinguaLibre zip archives, which therefore doubles the data size if you download everything.
Find your target category
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation by language
Tools
Python (current)
Dependencies: python.
Petscan and Wikiget together allow downloading about 15,000 audio files per hour.
- Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
- List target files with Petscan: given a target category on Commons, Petscan provides the list of target files. Example.
- Download target files with Wikiget: downloads the target files.
Comments:
- Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
- Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. Occurrences have been reported at around 1 in 30,000 files. A fix is underway; support the request on GitHub.
- Wikiget therefore requires a volunteer to supervise the script while it runs.
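Until the fix lands, one possible workaround is to call the downloader once per file, so that a single failure does not stop the whole run. The sketch below is not the workflow used by the LinguaLibre volunteers. It assumes a plain text list with one file name per line (for example exported from Petscan), assumes that `wikiget <file name>` downloads a single file, and assumes that a failed download exits with a non-zero status or can be cut off by a timeout. The names `download_all` and `failed.txt` are hypothetical.

```python
import subprocess
from pathlib import Path


def download_all(list_file, command=("wikiget",), failed_log="failed.txt", timeout=120):
    """Run one download command per listed file and keep going on errors.

    Returns the names whose command failed or timed out, and writes them
    to `failed_log` so they can be retried later.
    """
    lines = Path(list_file).read_text(encoding="utf-8").splitlines()
    failed = []
    for name in (line.strip() for line in lines):
        if not name:
            continue  # skip blank lines
        try:
            # check=False: a non-zero exit status is recorded, not raised.
            result = subprocess.run([*command, name], check=False, timeout=timeout)
            if result.returncode != 0:
                failed.append(name)
        except subprocess.TimeoutExpired:
            # Assumption: a stuck download is treated as a failure.
            failed.append(name)
    Path(failed_log).write_text("\n".join(failed), encoding="utf-8")
    return failed
```

Calling `download_all("files.txt")` would then leave the failed names in `failed.txt`, which can be passed back to the same function for a retry.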
NodeJS (soon)
Dependencies: git, nodejs, npm.
A WikiapiJS script allows downloading a target category's files, or a root category with its subcategories and their files. It downloads about 1,400 audio files per hour.
- WikiapiJS is the NodeJS / NPM package allowing scripted API calls to Wikimedia Commons and LinguaLibre.
- Specific scripts used for a given task:
- Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
- Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js
Comments, as of December 2021:
- Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
- Successfully processes a single category's files.
- Successfully processes a root category's and its subcategories' files, generating ./isocode/ folders.
- Scalability tests for resilience under high request volumes (from 500 to 100,000 items) are required.
- Performance improvements are under consideration on GitHub.
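The sustained speeds reported on this page make it possible to estimate how long a full download would take with each tool. A rough sketch, assuming the speeds stay constant over the whole run and taking 800,000 files as the collection size; the function name `hours_needed` is only illustrative:

```python
def hours_needed(file_count: int, downloads_per_second: float) -> float:
    """Return the download time in hours at a constant sustained speed."""
    return file_count / downloads_per_second / 3600


TOTAL_FILES = 800_000  # approximate collection size, 2021/02

# Sustained speeds as reported in the comments on this page.
SPEEDS = {"Petscan + Wikiget": 10.0, "WikiapiJS": 0.4}

for tool, speed in SPEEDS.items():
    print(f"{tool}: about {hours_needed(TOTAL_FILES, speed):.0f} hours")
```

At these rates the WikiapiJS script is better suited to single categories than to the full collection, at least until the performance improvements land.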
Java (not tested)
Dependencies:
sudo apt-get install default-jre # install Java environment
Usage:
- Open the GitHub Wiki-java-tools project page.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then run:
$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName' # Downloads all media files in the Wikimedia Commons category "CategoryName"
Comments :
- Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.
Manual
Imker -- Wikimedia Commons batch downloading tool.
Usage: java -jar imker-cli.jar [options]
Options:
--category, -c
Use the specified Wiki category as download source.
--domain, -d
Wiki domain to fetch from
Default: commons.wikimedia.org
--file, -f
Use the specified local file as download source.
* --outfolder, -o
The output folder.
--page, -p
Use the specified Wiki page as download source.
The download source must be ONE of the following:
↳ A Wiki category (Example: --category="Denver, Colorado")
↳ A Wiki page (Example: --page="Sandboarding")
↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
Datasets (outdated)
Former access (outdated)
- Open https://lingualibre.org/datasets/
- Download the zip archive, named as follows:
- Target language: {qId}-{iso639-3}-{language_English_name}.zip
- All languages: https://lingualibre.fr/datasets/lingualibre_full.zip
- On your device, unzip the archive.
Go to the relevant tutorials to clean up or rename your data.
Python
Dependencies: python.
CommonsDownloadTool.py is a Python script that formerly created the datasets for LinguaLibre.
To download all datasets as zips:
- Download the following scripts to a device with ample storage:
- create_datasets.sh (https://github.com/lingua-libre/operations/blob/master/create_datasets.sh): creates CommonsDownloadTool's commands.
- CommonsDownloadTool/commons_download_tool.py (https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py): core script.
- Skim through them, then move them to the place on your computer where they require the least editing.
- Edit the paths as needed until the scripts work.
- Run create_datasets.sh.
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation.
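The final check can be partly automated by counting the audio files inside each downloaded archive. A minimal sketch, assuming the zips contain .wav or .ogg files; the function names and the folder layout are illustrative and not part of create_datasets.sh:

```python
import zipfile
from pathlib import Path


def count_audio_files(zip_path, extensions=(".wav", ".ogg")):
    """Count the entries of one zip archive that look like audio files."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for name in archive.namelist() if name.lower().endswith(extensions))


def count_in_folder(folder, extensions=(".wav", ".ogg")):
    """Map each zip archive found in `folder` to its audio file count."""
    return {
        path.name: count_audio_files(path, extensions)
        for path in sorted(Path(folder).glob("*.zip"))
    }
```

The resulting counts can then be compared by hand with the file count displayed on the Commons category page.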
Comments:
- This script is slow and was phased out as LinguaLibre grew too large.
- The page could benefit from some HTML and styling.
- Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the LinguaLibre:Chat room.
Javascript and/or API queries
- See Help:APIs
There are also ways to take a category name as input, then run API queries to get the list of files and download them. For a starting point on API queries, see this pen (https://codepen.io/hugolpz/pen/ByoKOK), which gives some example queries.
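As an illustration of the approach, the sketch below lists the files of a Commons category through the MediaWiki Action API (`action=query` with `list=categorymembers`). It is not taken from the pen. The function names and the User-Agent string are placeholders, and `list_files` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query(category, continue_token=None):
    """Build the API URL listing up to 500 files of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    if continue_token:
        params["cmcontinue"] = continue_token
    return f"{API_URL}?{urlencode(params)}"


def parse_titles(response):
    """Return the file titles of one response and the continuation token, if any."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = [member["title"] for member in members]
    return titles, response.get("continue", {}).get("cmcontinue")


def list_files(category):
    """Follow the continuation tokens until the whole category is listed."""
    titles, token = [], None
    while True:
        # Placeholder User-Agent: replace it with one identifying your script.
        request = Request(build_query(category, token),
                          headers={"User-Agent": "lingualibre-download-sketch/0.1"})
        with urlopen(request) as reply:
            page, token = parse_titles(json.load(reply))
        titles.extend(page)
        if token is None:
            return titles
```

For example, `list_files("Category:Lingua Libre pronunciation by user")` would return the `File:` titles directly in that category; subcategories are not traversed in this sketch.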
Use HTML audio elements in web pages
See Audio 101.