{{#Subtitle:{{Help:Download_datasets/Header}}}}
<languages/>
{| class="wikitable right" style="float:right;"
! colspan=2| Data size — 2022/02
|-
| Audio files || 800,000+
|-
| Average size || 100 kB
|-
| Total size (est.) || 80 GB
|}

== Download datasets via click ==

'''Download by language:'''
<br>
# Open https://lingualibre.org/datasets/
# Find your language. The naming convention is: <code>{qId}-{iso639-3}-{language_English_name}.zip</code>
# '''Click to download'''
# On your device, unzip.
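
The naming convention and the unzip step above can be sketched in Python as follows. This is a minimal sketch, not an official tool: the function names and the choice to unpack each archive into a folder named after its ISO 639-3 code are assumptions made for illustration.

```python
import re
import zipfile
from pathlib import Path
from typing import NamedTuple


class DatasetName(NamedTuple):
    qid: str
    iso639_3: str
    language: str


# Mirrors the documented convention {qId}-{iso639-3}-{language_English_name}.zip
_DATASET_PATTERN = re.compile(r"^(Q\d+)-([a-z]{3})-(.+)\.zip$")


def parse_dataset_name(filename):
    """Split a dataset archive name into its three documented parts."""
    match = _DATASET_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected dataset file name: {filename!r}")
    return DatasetName(*match.groups())


def unzip_dataset(zip_path, target_root="."):
    """Extract an archive into <target_root>/<iso639-3>/ (hypothetical layout)."""
    info = parse_dataset_name(Path(zip_path).name)
    target = Path(target_root) / info.iso639_3
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For an illustrative archive named <code>Q21-fra-French.zip</code>, <code>parse_dataset_name</code> returns the parts <code>Q21</code>, <code>fra</code> and <code>French</code>.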

'''Post-processing'''
<br>Refer to the relevant tutorials in [[#See also]] to mass rename, mass convert or mass denoise your downloaded audio files.

== Programmatic tools ==

The tools below first fetch the list of audio files contained in one or several Wikimedia Commons categories.
Some of them can filter that list further to focus on a single speaker, either by editing their code or by post-processing the resulting .csv list of audio files. The listed targets are then downloaded at a rate of 500 to 15,000 files per hour. Items already present locally and matching the latest Commons version are generally not re-downloaded.
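
Post-processing the .csv list to keep a single speaker could look like the sketch below. It rests on two assumptions that should be checked against your own list: that file names follow the pattern <code>LL-Q… (iso)-Speaker-word.ext</code>, and that the .csv has a column named <code>title</code> holding the file name. Both function names are hypothetical.

```python
import csv


def is_by_speaker(file_name, speaker):
    """True if the file name looks like a Lingua Libre recording by `speaker`.

    Assumes names shaped like "LL-Q150 (fra)-Speaker-word.wav"; a leading
    "File:" prefix and underscores used instead of spaces are tolerated.
    """
    name = file_name.replace("_", " ")
    if name.startswith("File:"):
        name = name[len("File:"):]
    return name.startswith("LL-") and f")-{speaker}-" in name


def filter_by_speaker(csv_file, speaker, column="title"):
    """Return the file names from an open .csv file recorded by `speaker`.

    `column` is an assumed header name; adjust it to your export.
    """
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if is_by_speaker(row[column], speaker)]
```

The function takes an open file object, so it can be used as <code>filter_by_speaker(open("list.csv", newline="", encoding="utf-8"), "SpeakerName")</code>.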

=== Find your target ===

Categories on Wikimedia Commons are organized as follows:
* [[:Commons:Category:Lingua Libre pronunciation by user]] 
* [[:Commons:Category:Lingua Libre pronunciation]] (by language)

=== Python (current) ===

Dependencies: Python 3.6+

'''Petscan''' and '''Wikiget''' together allow downloading about 15,000 audio files per hour.
# '''Select your category:''' see [[:commons:Category:Lingua_Libre_pronunciation|Category:Lingua Libre pronunciation]] and [[:commons:Category:Lingua Libre pronunciation by user|Category:Lingua Libre pronunciation by user]], then find your target category.
# '''List target files with [https://petscan.wmflabs.org Petscan]:''' given a target category on Commons, Petscan provides the list of target files. [https://petscan.wmflabs.org/?&cb_labels_yes_l=1&cb_labels_no_l=1&edits%5Banons%5D=both&interface_language=en&edits%5Bflagged%5D=both&categories=Lingua%20Libre%20pronunciation-cmn&cb_labels_any_l=1&ns%5B0%5D=1&project=wikimedia&since_rev0=&search_max_results=500&edits%5Bbots%5D=both&ns%5B6%5D=1&language=commons&search_query= Example].
# '''Download target files with [https://pypi.org/project/wikiget/ Wikiget]:''' Wikiget downloads the target files.
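
As an alternative to Petscan for the listing step, the file list of a category can also be fetched directly from the MediaWiki API of Wikimedia Commons (<code>list=categorymembers</code>). The sketch below is not what Petscan does internally; the function names and the User-Agent string are placeholders, and the <code>fetch</code> parameter exists only so that the paging logic can be tried without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "dataset-listing-sketch/0.1 (placeholder)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_category_files(category, fetch=fetch_json):
    """Return the titles of all files directly inside a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(params)
        titles += [member["title"] for member in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        # The API returns the parameters needed to request the next page.
        params = {**params, **data["continue"]}
```

For example, <code>list_category_files("Lingua Libre pronunciation-cmn")</code> targets the same category as the Petscan example above. Subcategories are not followed.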

Comments:
* Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
* Some deleted files on Commons may cause Wikiget to return an error and pause; the script then has to be resumed manually. This has been reported to occur for about 1 in 30,000 files. A fix is underway; support the request [https://github.com/clpo13/wikiget/issues/2 on GitHub].
* Wikiget therefore requires a volunteer to supervise the script while it runs.
* As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download, it is therefore recommended to run the Python script in 20 to 30 terminal windows simultaneously. Each terminal running Wikiget consumes an average of 20 Kb/s.
* Wikiget requires a stable internet connection. Any disruption, even of 1 second, stops the download, and the Python script must then be restarted manually.
* [[m:Special:MyLanguage/PetScan|Manual for PetScan]]
* Questions about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj
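
To run several downloads side by side as suggested above, the list of target files first has to be divided between the terminals. The sketch below writes one text file per terminal, with one file name per line. The function names and the <code>batch_NN.txt</code> naming are assumptions; how each batch file is then handed to Wikiget depends on the installed version, so check <code>wikiget --help</code>.

```python
from pathlib import Path


def split_into_batches(file_names, parts):
    """Distribute file names round-robin into `parts` lists of similar size."""
    return [file_names[i::parts] for i in range(parts)]


def write_batches(file_names, parts, out_dir):
    """Write one batch_NN.txt per terminal and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    batches = split_into_batches(file_names, parts)
    for index, batch in enumerate(batches, start=1):
        path = out_dir / f"batch_{index:02d}.txt"
        path.write_text("\n".join(batch) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

With <code>parts=20</code>, this produces <code>batch_01.txt</code> to <code>batch_20.txt</code>, one for each terminal window.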

=== NodeJS (soon) ===

Dependencies: git, nodejs, npm.

A '''WikiapiJS''' script can download a target category's files, or a root category together with its subcategories and the files they contain. It downloads about 1,400 audio files per hour.
# WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
# Specific scripts for given tasks:
#* Given a category, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
#* Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js

Comments, as of December 2021:
* Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
* Successfully processes a single category's files.
* Successfully processes a root category and its subcategories' files, generating ./isocode/ folders.
* Scalability and resilience tests with large numbers of requests (from more than 500 up to 100,000 items) are still required.
* Performance improvements are under consideration [https://github.com/kanasimi/wikiapi/issues/51#issuecomment-1002267855 on github].
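
To put the rates quoted on this page into perspective, the short calculation below converts a sustained download rate into a total duration. It only uses figures stated on this page (800,000+ files, 10 downloads/sec for Wikiget across parallel terminals, 400 files in 16 minutes for WikiapiJS); the function name is hypothetical and real runs will vary.

```python
def hours_needed(total_files, files_per_second):
    """Hours required to download `total_files` at a sustained rate."""
    return total_files / (files_per_second * 3600)


# Wikiget run reported above: 730,000 files at 10 downloads/sec.
wikiget_hours = hours_needed(730_000, 10)  # about 20 hours

# WikiapiJS run reported above: 400 files in 16 minutes.
wikiapi_rate = 400 / (16 * 60)  # about 0.42 downloads/sec
wikiapi_hours = hours_needed(800_000, wikiapi_rate)  # about 533 hours
```

At the measured WikiapiJS rate, a full download of roughly 800,000 files would therefore take about three weeks of continuous running.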

=== Python (slow) ===

Dependencies: python.

'''CommonsDownloadTool.py''' is a Python script which formerly created datasets for LinguaLibre. It can be adapted to your needs. To download all datasets as zips:
* Download the scripts:
** [https://github.com/lingua-libre/operations/blob/master/create_datasets.sh create_datasets.sh] - creates CommonsDownloadTool's commands.
** [https://github.com/lingua-libre/CommonsDownloadTool/blob/master/commons_download_tool.py CommonsDownloadTool/commons_download_tool.py] - core script.
* Skim through them, then move them to wherever on your computer they require the least editing.
* Edit the paths as needed until the scripts run.
* Run <code>create_datasets.sh</code>
* Check if the number of files in the downloaded zips matches the number of files in [[:Commons:Category:Lingua Libre pronunciation]]
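
The final check above needs the number of audio files contained in the downloaded zips. The sketch below counts them locally; the list of audio extensions and the function names are assumptions, and the figure to compare against still has to be read from the Commons category.

```python
import zipfile
from pathlib import Path

# Assumed set of audio extensions; extend it if your datasets use others.
AUDIO_EXTENSIONS = {".wav", ".ogg", ".mp3", ".flac"}


def count_audio_files(zip_path):
    """Count the entries of one zip archive that look like audio files."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(
            1
            for name in archive.namelist()
            if Path(name).suffix.lower() in AUDIO_EXTENSIONS
        )


def count_audio_files_in_folder(folder):
    """Sum the audio file counts of every .zip directly inside `folder`."""
    return sum(count_audio_files(path) for path in sorted(Path(folder).glob("*.zip")))
```

Calling <code>count_audio_files_in_folder("./datasets")</code> on the folder holding the generated zips (the folder name here is only an example) gives the local total.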

Comments:
* Last run in February 2021; stopped because it was too slow.
* This script is slow and has been phased out, as Lingua Libre has grown too large for it.
* The page may benefit from some HTML and styling.
* Proposals go on https://phabricator.wikimedia.org/tag/lingua_libre/ or on the [[LinguaLibre:Chat room]].

=== Python with UI (Sulochanaviji) ===
:''Description to complete, see its [https://github.com/sulochanaviji/Wiki-bulk-downloader github repository].''
[[:meta:User:Sulochanaviji|User:Sulochanaviji]] coded a Django/Python tool with an HTML/CSS user interface. See its [https://github.com/sulochanaviji/Wiki-bulk-downloader GitHub repository].

=== Python Script to Download a User's Pronunciations ===
This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its [https://github.com/rkosov/Lingua-Libre-User-Audio-Downloader github repository]. [[User:Languageseeker|Languageseeker]] ([[User talk:Languageseeker|talk]]) 01:57, 24 May 2022 (UTC)

=== Anki Extension for Lingua Libre ===
The [https://ankiweb.net/shared/info/124265771 Lingua Libre and Forvo Addon] has a number of advanced options to improve search results and can run either as a batch operation or on an individual note.

By default, it first checks Lingua Libre and, if there are no results on Lingua Libre, it then checks Forvo. To run it as a pure Lingua Libre extension, set <code>"disable_Forvo"</code> to <code>True</code> in your configuration section.

Please report bugs, issues and ideas on [https://github.com/rkosov/Lingua-Libre-and-Forvo-Audio-Downloader GitHub].

=== Java (not tested) ===

Dependencies:
<syntaxhighlight lang="bash">
sudo apt-get install default-jre    # install Java environment
</syntaxhighlight>

Usage:
* Open the [https://github.com/MarcoFalke/wiki-java-tools/releases GitHub Wiki-java-tools project page].
* Find the latest <code>Imker</code> release.
* Download the Imker_vxx.xx.xx'''.zip''' archive.
* Extract the .zip file.
* Run as follows:
** On Windows: start the .exe file.
** On Ubuntu, open a shell, then run:
<syntaxhighlight lang="bash">
java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media files within the Wikimedia Commons category "CategoryName"
</syntaxhighlight>

Comments :
* Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.

==== Manual ====
<syntaxhighlight lang="bash">
Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
</syntaxhighlight>

== See also ==

* [[Special:MyLanguage/Help:Renaming|Help:Renaming]]
* [[Special:MyLanguage/Help:Converting audios|Help:Converting audios]]
* [[:phab:T261519|Help:Embed audio in HTML]]
* [[:phab:T261519]]
{{Helps}}
{{Technicals}}

[[Category:Lingua Libre:Help]]