Help

Difference between revisions of "Download datasets/oc"

Downloading Lingua Libre's audio datasets allows external reuse of those recordings in native or web applications. Lingua Libre's periodic dump-generation service is currently stalled; volunteer developers are working on it (Jan. 2022). Current, past and future alternatives are documented below. Other tutorials cover how to clean up the resulting folders and how to rename the files to the more practical {language}-{word}.ogg. Be aware that the overall data size is estimated at 40GB for the wav format.


Latest revision as of 13:36, 8 October 2023

Data volume - 2022/02
Audio files: 800,000+
Average size: 100kB
Total size (est.): 80GB
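
As a rough cross-check of these figures, the total size is the file count multiplied by the average file size, and a download time follows from any of the per-hour rates quoted on this page. A minimal sketch, assuming decimal units (1 GB = 1,000,000 kB); the function names are illustrative only:

```python
def estimate_total_gb(file_count, avg_size_kb):
    """Total size in GB, assuming decimal units (1 GB = 1,000,000 kB)."""
    return file_count * avg_size_kb / 1_000_000


def estimate_hours(file_count, files_per_hour):
    """Hours needed to fetch file_count files at a steady rate."""
    return file_count / files_per_hour


if __name__ == "__main__":
    # 800,000 files of about 100 kB each
    print(estimate_total_gb(800_000, 100))
    # The same files at about 15,000 downloads per hour
    print(round(estimate_hours(800_000, 15_000), 1))
```

Real totals will differ, since both the file count and the average size are estimates.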

Download datasets with one click

Download by language

  1. Open https://lingualibre.org/datasets/
  2. Find your language; the naming convention is {qId}-{iso639-3}-{name in English}.zip
  3. Click to download.
  4. In your terminal, unzip the archive.
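
The naming convention and the final unzip step can also be scripted. The sketch below only composes a file name following the stated convention and extracts an archive that has already been downloaded. The function names and the example values (`Q21`, `fra`, `French`) are hypothetical, not taken from the datasets page:

```python
import zipfile
from pathlib import Path


def dataset_zip_name(qid, iso639_3, english_name):
    """Compose a file name following {qId}-{iso639-3}-{name in English}.zip."""
    return "{}-{}-{}.zip".format(qid, iso639_3, english_name)


def extract_dataset(zip_path, out_dir):
    """Unzip a downloaded archive into out_dir and return the member names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(out))
        return archive.namelist()


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(dataset_zip_name("Q21", "fra", "French"))
```

Calling `extract_dataset("some-archive.zip", "./audio")` would then unpack the archive into `./audio`, which is the scripted equivalent of running `unzip` in a terminal.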

Next steps
Refer to the relevant tutorial in #See also for mass renaming, mass conversion, or mass denoising of the downloaded audio files.

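Programmatic tools

The tools below first fetch, from one or several Wikimedia Commons categories, the list of audio files they contain. Some of them can filter that list further to focus on a single speaker, either by editing their code or by post-processing the resulting .csv list of audio files. The listed targets are then downloaded at a rate of 500 to 15,000 files per hour. Items already present locally and matching the latest Commons version are generally not re-downloaded.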

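Find your target

Categories on Wikimedia Commons are organized as follows:

  • Commons:Category:Lingua Libre pronunciation by user
  • Commons:Category:Lingua Libre pronunciation (by language)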

Python (current)

Dependencies: Python 3.6+

Petscan and Wikiget make it possible to download about 15,000 audio files per hour.

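  1. Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
  2. List the target files with Petscan (https://petscan.wmflabs.org): given a target category on Commons, it provides the list of target files. Example: https://petscan.wmflabs.org/?&cb_labels_yes_l=1&cb_labels_no_l=1&edits%5Banons%5D=both&interface_language=en&edits%5Bflagged%5D=both&categories=Lingua%20Libre%20pronunciation-cmn&cb_labels_any_l=1&ns%5B0%5D=1&project=wikimedia&since_rev0=&search_max_results=500&edits%5Bbots%5D=both&ns%5B6%5D=1&language=commons&search_query=
  3. Download the target files with Wikiget (https://pypi.org/project/wikiget/): it downloads the listed files.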
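
Step 3 can be driven from a small wrapper once the list of file names has been exported from Petscan. The sketch below is an assumption-laden illustration, not the project's own script. It relies on the `wikiget` command being installed and accepting a `File:...` name as its argument, and the helper names `wikiget_command` and `download_all` are made up for this example. Failed titles are collected instead of stopping the loop:

```python
import subprocess


def wikiget_command(file_title):
    """Build the command line for one file; adds the File: prefix if missing."""
    title = file_title.strip()
    if not title.lower().startswith("file:"):
        title = "File:" + title
    return ["wikiget", title]


def download_all(titles, runner=subprocess.run):
    """Run wikiget once per title and return the titles whose command failed."""
    failed = []
    for title in titles:
        if not title.strip():
            continue  # skip blank lines from the exported list
        result = runner(wikiget_command(title))
        if result.returncode != 0:
            failed.append(title)
    return failed
```

The `runner` parameter exists only so the loop can be exercised without network access. In normal use the default `subprocess.run` is kept, and the returned list can be fed back in for a second attempt.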

Comments:

  • Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
  • Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. Occurrences have been reported at around 1 in 30,000 files. A fix is underway; support the request on GitHub.
  • Wikiget therefore requires a volunteer to supervise the script while it runs.
  • As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download, it is recommended to run the Python script in 20-30 terminal windows simultaneously. Each terminal running Wikiget consumes an average of 20 Kb/s.
  • Wikiget requires a stable internet connection. Any disruption of 1 second stops the download process, and the Python script must then be restarted manually.
  • Manual for PetScan
  • Any question about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj
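
To run the download in several terminals at once, the target list first has to be divided into parts. A minimal sketch of one way to do that, assuming the targets are available as a list of file names; the function names and the `targets_part` file prefix are hypothetical:

```python
def split_round_robin(titles, n_parts):
    """Deal titles into n_parts lists of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for index, title in enumerate(titles):
        parts[index % n_parts].append(title)
    return parts


def write_parts(parts, prefix="targets_part"):
    """Write each part to its own text file, one name per line; return the paths."""
    paths = []
    for number, part in enumerate(parts, start=1):
        path = "{}_{:02d}.txt".format(prefix, number)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("".join(title + "\n" for title in part))
        paths.append(path)
    return paths
```

Each resulting text file can then be handed to its own terminal session, so that an interruption in one session only affects that part of the list.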

NodeJS (soon)

Dependencies: git, nodejs, npm.

A WikiapiJS script can download a target category's files, or a root category, its subcategories and the files they contain. It downloads about 1,400 audio files per hour.

  1. WikiapiJS is the NodeJS / NPM package allowing scripted API calls upon Wikimedia Commons and LinguaLibre.
  2. Specific script used to do a given task:

Comments, as of December 2021:

  • Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
  • Successfully processes a single category's files.
  • Successfully processes a root category and its subcategories' files, generating ./isocode/ folders.
  • Scalability tests for resilience under large numbers of requests (from more than 500 up to 100,000 items) are still required.
  • Performance improvements are under consideration on GitHub.

Python (slow)

Dependencies: python.

CommonsDownloadTool.py is a Python script that formerly created datasets for Lingua Libre. It can be hacked and adapted to your needs. To download all datasets as zips:

Comments:


Python with UI (Sulochanaviji)

User:Sulochanaviji coded a Django/Python tool with an HTML/CSS user interface. The description is still to be completed; see its GitHub repository.

Python Script to Download a User's Pronunciations

This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its github repository. Languageseeker (talk) 01:57, 24 May 2022 (UTC)
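
The linked script's code is not reproduced here. As an illustration of the second step only (fetching a file from Commons once its name is known), the sketch below builds a download URL through the `Special:FilePath` redirect on Wikimedia Commons. The helper name and the sample file name are hypothetical, and this is not necessarily how the script above does it:

```python
from urllib.parse import quote

COMMONS_FILEPATH = "https://commons.wikimedia.org/wiki/Special:FilePath/"


def commons_file_url(file_name):
    """Return a Special:FilePath URL for a Commons file name."""
    name = file_name.strip()
    if name.lower().startswith("file:"):
        name = name[len("file:"):]
    # Commons treats spaces and underscores alike; the rest is percent-encoded.
    return COMMONS_FILEPATH + quote(name.replace(" ", "_"))


if __name__ == "__main__":
    # Made-up file name, for illustration only.
    print(commons_file_url("File:My example recording.wav"))
```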


Anki Extension for Lingua Libre

The Lingua Libre and Forvo Addon has a number of advanced options to improve search results and can run either as a batch operation or on an individual note.

By default, it first checks Lingua Libre and, if there are no results on Lingua Libre, it then checks Forvo. To run as a pure Lingua Libre extension, you will need to set "disable_Forvo" to True in your configuration section.

Please report bugs, issues, and ideas on GitHub.

Java (not tested)

Dependencies:

sudo apt-get install default-jre    # install Java environment

Usage:

  • Open GitHub Wiki-java-tools project page.
  • Find the last Imker release.
  • Download Imker_vxx.xx.xx.zip archive
  • Extract the .zip file
  • Run as follows:
    • On Windows: start the .exe file.
    • On Ubuntu, open a shell, then:
$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media within the Wikimedia Commons category "CategoryName"

Comments :

  • Not used yet by any LinguaLibre member. If you do, please share your experience of this tool.

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
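
The rule that exactly one download source must be given can be enforced before launching Imker. The following sketch only assembles the command line from the short options listed in the manual above (`-o`, `-c`, `-p`, `-f`). The function name and its parameters are hypothetical, and the tool itself has not been tested by Lingua Libre members:

```python
def imker_command(outfolder, category=None, page=None, file=None,
                  jar="imker-cli.jar"):
    """Build an Imker command line with exactly one download source."""
    sources = [("-c", category), ("-p", page), ("-f", file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) != 1:
        raise ValueError("give exactly one of category, page or file")
    flag, value = chosen[0]
    return ["java", "-jar", jar, "-o", outfolder, flag, value]


if __name__ == "__main__":
    print(" ".join(imker_command("./myFolder/", category="CategoryName")))
```

The returned list can be passed to `subprocess.run`, which avoids shell quoting issues for category names containing spaces or commas.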

See also


Audio files: How to create a frequency list? • Convert files formats • Denoise files with SoX • Rename and mass rename