Help:Download datasets/mk
Downloading Lingua Libre's audio datasets allows external reuse of those audio files in native or web applications. Lingua Libre's periodic generation of dumps is currently stalled; volunteer developers are working on it (Jan. 2022). Current, past and future alternatives are documented below. Other tutorials deal with how to clean up the resulting folders and how to rename the files to the more practical {language}-{word}.ogg pattern. Be aware that the overall data size is estimated at 40GB in WAV format.
Latest revision as of 18:47, 20 November 2022
Data size (2022/02) | |
---|---|
Audio files | 800,000+ |
Avg. size | 100 kB |
Total size (est.) | 80 GB |
Download datasets by clicking
Download by language:
- Open https://lingualibre.org/datasets/
- Find your language. The naming convention is: {qId}-{iso639-3}-{language_English_name}.zip
- Click to download.
- Unzip the archive on your device.
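The manual steps above can also be scripted. The following is only a minimal sketch: it assumes the archives are served directly under https://lingualibre.org/datasets/ with the file name shown in the listing, which this page does not confirm, and the values `Q21`, `fra` and `French` are hypothetical placeholders to replace with the name you actually see.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumption: each archive is reachable at DATASETS_URL + <file name>.
DATASETS_URL = "https://lingualibre.org/datasets/"


def dataset_filename(qid: str, iso: str, english_name: str) -> str:
    """Build an archive name following {qId}-{iso639-3}-{language_English_name}.zip."""
    return f"{qid}-{iso}-{english_name}.zip"


def download_and_unzip(filename: str, dest: Path) -> Path:
    """Download one archive into dest and extract it into dest/<archive stem>/."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / filename
    urllib.request.urlretrieve(DATASETS_URL + filename, archive)
    target = dest / archive.stem
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


if __name__ == "__main__":
    # Hypothetical values; copy the real file name from the listing page.
    print(dataset_filename("Q21", "fra", "French"))
    # download_and_unzip(dataset_filename("Q21", "fra", "French"), Path("datasets"))
```

The download call is left commented out so that nothing is fetched until the file name has been checked against the listing.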
Post-processing
See the relevant tutorials in #Related for bulk renaming, bulk converting, or bulk denoising of the downloaded audio files.
Programmatic tools
The tools below first fetch the list of audio files from one or several Wikimedia Commons categories. Some of them let you filter that list further to focus on a single speaker, either by editing their code or by post-processing the resulting .csv list of audio files. The listed targets are then downloaded at a rate of 500 to 15,000 files per hour. Items already present locally and matching the latest Commons version are generally not re-downloaded.
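To get a feel for what these rates mean in practice, here is a small back-of-the-envelope sketch. It only uses figures quoted on this page (800,000+ files, about 100 kB each, 500 to 15,000 files per hour); the function names are illustrative, and decimal units (1 GB = 1,000,000 kB) are assumed.

```python
def total_size_gb(n_files: int, avg_size_kb: float) -> float:
    """Estimated total size in GB, assuming 1 GB = 1,000,000 kB."""
    return n_files * avg_size_kb / 1_000_000


def hours_needed(n_files: int, files_per_hour: float) -> float:
    """Hours required to download n_files at a steady rate."""
    return n_files / files_per_hour


if __name__ == "__main__":
    n_files = 800_000  # lower bound quoted in the data size table
    print(f"Estimated size: {total_size_gb(n_files, 100):.0f} GB")
    for rate in (500, 15_000):
        print(f"At {rate} files/hour: {hours_needed(n_files, rate):.1f} hours")
```

At the slow end of the range a full download takes weeks, which is why the faster tools and parallel runs described below matter.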
Find your target
Categories on Wikimedia Commons are organized as follows:
- Commons:Category:Lingua Libre pronunciation by user
- Commons:Category:Lingua Libre pronunciation (by language)
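Once a target category is chosen, its files can be listed with the standard MediaWiki API (`action=query`, `list=categorymembers`). This is a hedged sketch and not one of the tools described on this page; the category name in the usage comment and the User-Agent string are hypothetical and should be adapted.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterator, List, Optional

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_params(category: str, cmcontinue: Optional[str] = None) -> dict:
    """Query parameters listing up to 500 files of one category per request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return params


def extract_titles(response: dict) -> List[str]:
    """Pull the file titles out of one API response."""
    members = response.get("query", {}).get("categorymembers", [])
    return [member["title"] for member in members]


def list_category_files(category: str) -> Iterator[str]:
    """Yield every file title in the category, following API continuation."""
    cmcontinue = None
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(build_params(category, cmcontinue))
        # Hypothetical User-Agent; identify your own script here.
        request = urllib.request.Request(url, headers={"User-Agent": "dataset-listing-sketch/0.1"})
        with urllib.request.urlopen(request) as reply:
            data = json.load(reply)
        yield from extract_titles(data)
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break

# Usage (hypothetical category name):
# for title in list_category_files("Lingua Libre pronunciation-fra"):
#     print(title)
```

This only lists direct members of one category; subcategories would need a separate pass with `cmtype` set to `subcat`.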
Python (current)
Dependencies: Python 3.6+
Petscan and Wikiget together allow downloading about 15,000 audio files per hour.
- Select your category: see Category:Lingua Libre pronunciation and Category:Lingua Libre pronunciation by user, then find your target category.
- List target files with Petscan: given a target category on Commons, it provides the list of target files. Example.
- Download target files with Wikiget: it downloads the target files.
Comments:
- Successful in November 2021, with 730,000 audio files downloaded in 20 hours. Sustained average speed: 10 downloads/sec.
- Some files deleted on Commons may cause Wikiget to return an error and pause. The script then has to be resumed manually. This has been reported to occur for around 1 in 30,000 files. A fix is underway; support the request on GitHub.
- Wikiget therefore requires a volunteer to supervise the script while it runs.
- As of December 2021, Wikiget does not support multi-threaded downloads. To speed up the download, it is therefore recommended to run the Python script in 20 to 30 terminal windows simultaneously. Each terminal running Wikiget consumes an average of 20 Kb/s.
- Wikiget requires a stable internet connection. Any disruption of 1 second stops the download, and the Python script must then be restarted manually.
- Manual for PetScan
- Any question about downloading datasets can be asked on the Lingua Libre Discord server: https://discord.gg/2WECKUHj
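Running the download in 20 to 30 terminals requires splitting the list of target files first. The sketch below shows one way to do that, assuming the Petscan result has been saved as a plain text file with one file name per line; the file and folder names are hypothetical, and how each chunk is then passed to the downloader is left to you.

```python
from pathlib import Path
from typing import List


def split_round_robin(items: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute items over n_chunks lists of near-equal length."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def write_chunks(list_file: Path, out_dir: Path, n_chunks: int = 20) -> List[Path]:
    """Split a one-name-per-line file into n_chunks smaller list files."""
    lines = list_file.read_text(encoding="utf-8").splitlines()
    names = [line.strip() for line in lines if line.strip()]
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(split_round_robin(names, n_chunks)):
        path = out_dir / f"chunk_{index:02d}.txt"
        path.write_text("\n".join(chunk) + ("\n" if chunk else ""), encoding="utf-8")
        paths.append(path)
    return paths

# Usage (hypothetical paths): one chunk file per terminal window.
# write_chunks(Path("petscan_files.txt"), Path("chunks"), n_chunks=20)
```

Round-robin splitting keeps the chunks the same size to within one file, so the terminals should finish at roughly the same time.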
NodeJS (soon)
Dependencies: git, nodejs, npm.
A WikiapiJS script can download a target category's files, or a root category together with its subcategories and the files they contain. It downloads about 1,400 audio files per hour.
- WikiapiJS is the NodeJS / NPM package that allows scripted API calls to Wikimedia Commons and Lingua Libre.
- Specific scripts for given tasks:
- Given a category, download all files : https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download-many.js
- Given a root category, list subcategories, download all files: https://github.com/hugolpz/WikiapiJS-Eggs/blob/main/wiki-download_by_root_category-many.js
Comments, as of December 2021:
- Successful in December 2021, with 400 audio files downloaded in 16 minutes. Sustained average speed: 0.4 downloads/sec.
- Successfully processes a single category's files.
- Successfully processes a root category and its subcategories' files, generating ./isocode/ folders.
- Scalability tests are still required to check resilience under high request volumes (from more than 500 up to 100,000 items).
- Performance improvements are under consideration on GitHub (https://github.com/kanasimi/wikiapi/issues/51#issuecomment-1002267855).
Python (slow)
Dependencies: python.
CommonsDownloadTool.py is a Python script that formerly created the datasets for Lingua Libre. It can be hacked and tinkered with to fit your needs. To download all datasets as zips:
- Download the scripts:
- create_datasets.sh - creates CommonsDownloadTool's commands.
- CommonsDownloadTool/commons_download_tool.py - core script.
- Read them briefly, then move them to wherever they fit best on your computer so that they require the least editing.
- Edit as needed so that the paths are correct, and make it work.
- Run create_datasets.sh
- Check that the number of files in the downloaded zips matches the number of files in Commons:Category:Lingua Libre pronunciation
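The final check can be partly automated. The sketch below counts the files contained in every zip of a folder, using only the Python standard library; the folder name is hypothetical, and the total still has to be compared by hand with the count shown on the Commons category page.

```python
import zipfile
from pathlib import Path


def count_files_in_zip(zip_path: Path) -> int:
    """Number of regular files (directory entries excluded) in one archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())


def count_files_in_folder(folder: Path) -> int:
    """Total number of files across all .zip archives directly in folder."""
    return sum(count_files_in_zip(path) for path in sorted(folder.glob("*.zip")))

# Usage (hypothetical folder name):
# print(count_files_in_folder(Path("datasets")))
```

If an archive also contains non-audio files, such as a license or index file, those are counted too, so a small offset per zip may be expected.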
Notes:
- The lists launched in February 2021 stalled because of the slowness.
- This script is slow and was discontinued as Lingua Libre grew large.
- The page could be improved with some HTML and styling.
- Suggestions go to https://phabricator.wikimedia.org/tag/lingua_libre/ or to the Chat room.
Python with UI (Sulochanaviji)
User:Sulochanaviji coded a Django/Python tool with an HTML/CSS user interface. The description is still to be completed; see its GitHub repository (https://github.com/sulochanaviji/Wiki-bulk-downloader).
Python Script to Download a User's Pronunciations
This script downloads all the pronunciations added by a user into a folder by first querying the Lingua Libre database and then downloading the files from Commons. See its GitHub repository (https://github.com/rkosov/Lingua-Libre-User-Audio-Downloader). Languageseeker (talk) 01:57, 24 May 2022 (UTC)
Anki Extension for Lingua Libre
The Lingua Libre and Forvo Addon (https://ankiweb.net/shared/info/124265771) has a number of advanced options to improve search results and can run either as a batch operation or on an individual note.
By default, it first checks Lingua Libre and, if there are no results there, it then checks Forvo. To run it as a pure Lingua Libre extension, set "disable_Forvo" to True in your configuration section.
Please report bugs, issues, and ideas on GitHub (https://github.com/rkosov/Lingua-Libre-and-Forvo-Audio-Downloader).
Java (untested)
Dependencies:
sudo apt-get install default-jre # install Java environment
Usage:
- Open the GitHub Wiki-java-tools project page.
- Find the latest Imker release.
- Download the Imker_vxx.xx.xx.zip archive.
- Extract the .zip file.
- Run as follows:
- On Windows: start the .exe file.
- On Ubuntu, open a shell, then:
$ java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName' # Downloads all media within the Wikimedia Commons category "CategoryName"
Comments:
- Not currently used by any Lingua Libre member. If you use it, please share your experience with this tool.
Manual
Imker -- Wikimedia Commons batch downloading tool.
Usage: java -jar imker-cli.jar [options]
Options:
--category, -c
Use the specified Wiki category as download source.
--domain, -d
Wiki domain to fetch from
Default: commons.wikimedia.org
--file, -f
Use the specified local file as download source.
* --outfolder, -o
The output folder.
--page, -p
Use the specified Wiki page as download source.
The download source must be ONE of the following:
↳ A Wiki category (Example: --category="Denver, Colorado")
↳ A Wiki page (Example: --page="Sandboarding")
↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
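When Imker is called from a script, the rule that the download source must be exactly one of category, page, or file is easy to get wrong. The following sketch builds the command line from the options listed in the help text above and enforces that rule before anything runs. The function names are illustrative, the jar is assumed to be in the current directory, and, like Imker itself on this page, the sketch is untested against a real Imker release.

```python
import subprocess
from typing import List, Optional


def imker_command(
    outfolder: str,
    category: Optional[str] = None,
    page: Optional[str] = None,
    file: Optional[str] = None,
    domain: Optional[str] = None,
) -> List[str]:
    """Build an imker-cli argument list with exactly one download source."""
    sources = [("-c", category), ("-p", page), ("-f", file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) != 1:
        raise ValueError("exactly one of category, page, or file is required")
    flag, value = chosen[0]
    command = ["java", "-jar", "imker-cli.jar", "-o", outfolder, flag, value]
    if domain is not None:
        command += ["-d", domain]
    return command


def run_imker(command: List[str]) -> None:
    """Run the prepared command and raise if Imker exits with an error."""
    subprocess.run(command, check=True)

# Usage (mirrors the shell example in the Usage section):
# run_imker(imker_command("./myFolder/", category="CategoryName"))
```

Passing the arguments as a list avoids shell quoting problems with category names that contain spaces or commas, such as "Denver, Colorado".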