
Download datasets

This page covers downloading Lingualibre.org's media, both by hand and programmatically, as packaged zip archives with rich filenames. Separate tutorials explain how to clean up the resulting folders and how to rename the files into the more practical {language}-{word}.ogg pattern. Be aware that Lingualibre's data may take up hundreds of GB if you download all of it.

Data size (2021/02)
Audio files: 400,000+
Average size: 100 kB
Total size (est.): 40 GB
Safety factor: 5 to 10x
Required disk space: 200 to 400 GB
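
The figures above follow from simple arithmetic: number of files times average size, multiplied by the safety factor. The sketch below reproduces that estimate; the function name estimate_disk_gb is made up for illustration, and it assumes decimal units (1 GB = 1,000,000 kB).

```shell
# Estimate required disk space in GB (hypothetical helper, not a Lingua Libre tool).
# $1: number of files, $2: average file size in kB, $3: safety factor
estimate_disk_gb() {
  # kB to GB with decimal units: 1 GB = 1,000,000 kB
  echo $(( $1 * $2 * $3 / 1000000 ))
}

estimate_disk_gb 400000 100 1     # raw total: prints 40
estimate_disk_gb 400000 100 5     # low safety factor: prints 200
estimate_disk_gb 400000 100 10    # high safety factor: prints 400
```

To compare the result with the space actually available, `df -h .` shows the free space of the current folder.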

Context

Data clean up

See also: Convert files formats, Denoise files, Rename and mass rename.

By default, we provide both per-language zip archives and an all-Lingualibre zip archive, so downloading everything doubles the data size.

Find your target category

Hand downloading

  1. Open https://lingualibre.org/datasets/
  2. Download your target language's zip
  3. Unzip the archive on your device.

Go to the relevant tutorials to clean up or rename your data.
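
When several archives have been downloaded, the unzip step can be scripted. This is a minimal sketch, assuming the `unzip` command is installed; the function names and the example archive name fra.zip are hypothetical, not names used by Lingua Libre.

```shell
# Derive the extraction folder from an archive path,
# e.g. "downloads/fra.zip" gives "downloads/fra" (hypothetical names).
extract_dir_for() {
  printf '%s\n' "${1%.zip}"
}

# Extract every .zip in a folder (default: current folder) into its own subfolder.
unzip_all() {
  folder=${1:-.}
  for zip_path in "$folder"/*.zip; do
    [ -e "$zip_path" ] || continue    # the glob matched no archive
    unzip -q -o "$zip_path" -d "$(extract_dir_for "$zip_path")"
  done
}
```

For example, `unzip_all ~/Downloads` would extract each archive found in that folder next to its .zip file.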

Using Imker

Requirements

On Ubuntu, run:

sudo apt-get install default-jre    # install Java environment

Be aware of your target data size (see section above).

Install

  • Open the Wiki-java-tools project page on GitHub.
  • Find the latest Imker release.
  • Download the Imker_vxx.xx.xx.zip archive.
  • Extract the .zip file.
  • Run as follows:
    • On Windows: start the .exe file.
    • On Ubuntu, open a shell, then run:
java -jar imker-cli.jar -o ./myFolder/ -c 'CategoryName'     # Downloads all media files in the Wikimedia Commons category "CategoryName"

Manual

Imker -- Wikimedia Commons batch downloading tool.

Usage: java -jar imker-cli.jar [options]
  Options:
    --category, -c
       Use the specified Wiki category as download source.
    --domain, -d
       Wiki domain to fetch from
       Default: commons.wikimedia.org
    --file, -f
       Use the specified local file as download source.
  * --outfolder, -o
       The output folder.
    --page, -p
       Use the specified Wiki page as download source.

The download source must be ONE of the following:
 ↳ A Wiki category (Example: --category="Denver, Colorado")
 ↳ A Wiki page (Example: --page="Sandboarding")
 ↳ A local file (Example: --file="Documents/files.txt"; One filename per line!)
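
The manual requires exactly one download source. The sketch below wraps that rule in two helper functions, using only the short flags listed above; the function names are made up for illustration, and it assumes imker-cli.jar sits in the current folder.

```shell
# Map a source type to the matching Imker flag from the manual above.
imker_flag_for() {
  case "$1" in
    category) echo "-c" ;;
    page)     echo "-p" ;;
    file)     echo "-f" ;;
    *)        echo "unknown source type: $1" >&2; return 1 ;;
  esac
}

# Run Imker with a single download source (hypothetical wrapper).
# $1: category, page, or file   $2: source value   $3: output folder
imker_download() {
  flag=$(imker_flag_for "$1") || return 1
  java -jar imker-cli.jar -o "$3" "$flag" "$2"
}
```

For example, `imker_download category 'CategoryName' ./myFolder/` runs the same command as in the Install section, while an unsupported source type stops before Java is started.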

Using CommonsDownloadTool

To download all datasets as zips :


JavaScript and/or API queries

Another approach takes a category name as input, queries the API for the list of files in that category, then downloads them. For a starting point on API queries, see this pen, which gives some examples.
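
As an illustration of the first step, the sketch below lists the files of a category through the MediaWiki API on Wikimedia Commons (list=categorymembers). It is one possible approach under stated assumptions, not the method used by the pen mentioned above: the function names are hypothetical, and it assumes `curl` is installed.

```shell
# Wikimedia Commons API endpoint (commons.wikimedia.org is also Imker's default domain).
API_URL="https://commons.wikimedia.org/w/api.php"

# Print the query parameters, one per line, that list the files of a category.
# $1: category name without the "Category:" prefix
# $2: optional continuation token taken from a previous response
category_query_params() {
  printf '%s\n' \
    "action=query" \
    "format=json" \
    "list=categorymembers" \
    "cmtype=file" \
    "cmlimit=500" \
    "cmtitle=Category:$1"
  if [ -n "${2:-}" ]; then
    printf '%s\n' "cmcontinue=$2"
  fi
}

# Fetch one page of results as JSON.
# --data-urlencode lets curl escape spaces and other special characters.
fetch_category_page() {
  params=$(category_query_params "$@")
  set --
  while IFS= read -r param; do
    set -- "$@" --data-urlencode "$param"
  done <<EOF
$params
EOF
  curl --silent --get "$@" "$API_URL"
}
```

Calling `fetch_category_page 'CategoryName'` prints a JSON document whose file titles sit under query.categorymembers. When more results remain, the response carries a continue.cmcontinue token, which can be passed as the second argument to fetch the next page.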

Use HTML audio elements in web pages

See Audio 101.
