Difference between revisions of "Chat room"

Revision as of 23:02, 20 January 2019

Welcome to the Chat room!

This is the place to discuss any and all aspects of Lingua Libre: the project itself, operations, policy and proposals, technical issues, etc. Other forums include LinguaLibre:Technical board for code-oriented issues and LinguaLibre:Administrators' noticeboard.

Feel free to participate in any language you want to.


Chatroom FAQ

How to download all audio files for one language? By speaker?
- Datasets per language are available at https://lingualibre.fr/datasets/. A short server-side script runs automatically every 2 days; it uses lingua-libre/CommonsDownloadTool. For more, see Help:Download from LinguaLibre.

How to add missing languages?
- Administrators can add new languages, and usually do so within a few days. Users, please provide your language's ISO 639-3 code and a link to the en.wikipedia.org article. Optional information: the common English name and the Wikidata ID. For more, see Help:Add a new language.

How to archive sections which have been answered ?
- After reviewing the section, add `{{done}} -- can be closed ~~~~` to the top of the section. After a few days to 2 weeks, move the section's code to LinguaLibre:Chat_room/Archives/2018.

How to keep my wikimedia project up to date ?
- Contact User:0x010C, the botmaster of Lingua Libre Bot. For more, see Help:Bots.

Which IRL events are coming? When? Where?
- Paris's Events/Hackathon_15-16_décembre_2018 just finished. More events to come. For more, see LinguaLibre:Events.

How to translate LinguaLibre User Interface into a new language ?
- Go to translatewiki.net, change the url part fr into your language's ISO 639-2 code. For more, see Help:Translate.

Using Lingua Libre Bot in incubator:shy

Would it be possible to do the same thing for the Chaoui Wiktionary? I mean, is it possible to use your bot on our Wiktionary as well? I can provide the algorithm of the test wiki. Regards. -Reda Kerbouche (talk) 12:32, 8 July 2018 (UTC)

Yes, of course! Do you have a bistro / village pump / ... to discuss it over there? — 0x010C ^~talk~ 15:24, 8 July 2018 (UTC)

Yes, there is an empty village pump on the Chaoui Wiktionary that you can start using. Or there is the Incubator one, where we can discuss authorizing the bot with administrators. Regards. -Reda Kerbouche (talk) 18:26, 8 July 2018 (UTC)

I am on my way to Wikimania right now and will have very little time until then, but I will start the discussion when I get back. Regards — 0x010C ^~talk~ 11:43, 11 July 2018 (UTC)

Have a good trip.--Reda Kerbouche (talk) 21:48, 11 July 2018 (UTC)

0x010C I hope you haven't forgotten me =) In September we are launching a contest for the Chaoui Wiktionary, and if we can record words that go straight to the Incubator, I will promote Lingualibre along with the contest.--Reda Kerbouche (talk) 14:01, 16 August 2018 (UTC)

Reda Kerbouche, 0x010C, Is this {done} ? --Yug (talk) 11:04, 15 December 2018 (UTC)

Bots-related documentation could be gathered in Help:Bots Yug (talk) 11:01, 31 December 2018 (UTC)

List modeled on PetScan

Hi, would it be possible to generate a list on the fly, modeled on what PetScan can do? Here, we have the list of all Wiktionary lemmas that lack the category « Prononciations audio en français », which means that they do not have the « écouter » template that adds entries to this category. I think generating such a list would be really nice for the Wiktionaries. Pamputt (talk) 06:07, 12 July 2018 (UTC)

It is indeed a good idea, but integrating it into Lingua Libre is a big job. I think it would be worth discussing it a bit and drawing up a short specification of what we want to be able to do (not everything in PetScan is useful here). — 0x010C ^~talk~ 22:00, 14 July 2018 (UTC)

0x010C, do you think the example I gave above (French lemmas without a pronunciation) can be implemented based on MediaWiki:Gadget-Demo.js? Pamputt (talk) 14:23, 14 October 2018 (UTC)

Yes, exactly: it requires creating a new word generator. In my initial thinking above, I was considering how to implement PetScan's features in a generator. Except that, in terms of performance and speed, we could never make something usable with categories as large as "Lemmes en français". Let me explain. PetScan does its searching and cross-referencing on the server side, directly on a copy of the wikis' database (so it can explore all records in one go). Here, however, we have no access to the database, and the computation must be done on the client side, in JavaScript. We therefore depend on the API of the wikis in question to retrieve the data, an API that is not at all designed for working with very large categories (it can return only 500 members per request, etc.).

In short, it is not possible. However, we could rely on PetScan to do the tedious work for us (this generator would become completely dependent on that external tool: an outage of the latter would block the former). I see three options:

the generator reuses a number of PetScan's fields and builds a PetScan query from the values provided (complex for the average user, flexible for the experienced user);
the generator lets the user choose among a number of PetScan queries that we prepare in advance (for example, clicking "French words without a pronunciation on the French Wiktionary" would use the query you gave above), or paste the URL / identifier of a query they have prepared or found (simpler to implement, requires us to create many queries to support different languages, fairly flexible);
we build a specialized "words whose pronunciation is missing" generator that automatically forges the PetScan query to do as in your example for the selected language (easy to use, very specific but potentially very useful; it would require us to enter the corresponding Wiktionary categories manually, because I see no way to guess the name of a language's category from its code or its Wikidata ID...)

What do you think?

— 0x010C ^~talk~ 02:53, 16 October 2018 (UTC)
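To make the cost of the 500-members-per-request limit concrete, here is a minimal Python sketch of paging through a category with the MediaWiki API (`action=query`, `list=categorymembers`, continued with `cmcontinue`). It is not the gadget's code, which runs in JavaScript. The `fetch` callable, the `fake_fetch` stand-in and the 1200 made-up titles are assumptions for illustration only, so that the sketch runs without network access.

```python
from typing import Callable, Dict, Iterator

API_LIMIT = 500  # per-request cap mentioned in the discussion


def category_members(fetch: Callable[[Dict[str, str]], dict], category: str) -> Iterator[str]:
    """Yield every page title in a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": str(API_LIMIT),
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **cont}


def fake_fetch(params: Dict[str, str]) -> dict:
    """Hypothetical stand-in for an HTTP call: serves 1200 made-up titles in pages of 500."""
    titles = [f"mot{i}" for i in range(1200)]
    start = int(params.get("cmcontinue", "0"))
    end = start + API_LIMIT
    reply = {"query": {"categorymembers": [{"title": t} for t in titles[start:end]]}}
    if end < len(titles):
        reply["continue"] = {"cmcontinue": str(end), "continue": "-||"}
    return reply


requests_made = []


def counting_fetch(params: Dict[str, str]) -> dict:
    """Record each request so the number of round trips can be inspected."""
    requests_made.append(dict(params))
    return fake_fetch(params)


members = list(category_members(counting_fetch, "Catégorie:Lemmes en français"))
print(len(members), "titles fetched in", len(requests_made), "requests")
```

Each batch of 500 titles costs one round trip, so a category with hundreds of thousands of lemmas would need hundreds of sequential requests before any cross-referencing could start, which is the bottleneck described above.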

The first proposal seems overly complex to me and, although powerful, I do not think it suits Lingua Libre's audience. Between proposals 2 and 3, I prefer 2 because it is simple to use at first glance (pre-built queries) while still allowing advanced use (the advantage of solution 1). And compared with solution 3, it avoids the upkeep of mapping categories to languages, so it is more maintainable in the long run, in my opinion. Pamputt (talk) 06:23, 17 October 2018 (UTC)

@Pamputt: Between two flights, I have just finished a first version of the PetScan generator, which can be enabled via preferences > gadgets. Could you take a look and tell me what you think before I continue and announce it more widely?

Thanks — 0x010C ^~talk~ 08:39, 22 October 2018 (UTC)

0x010C, I enabled the gadget and I do see PetScan in the list. I ran a few tests and it works well. I tried the URL from the first message and it works perfectly. However, I tried with this one and it tells me "Petscan output something weired with this URL, check it and come back afterwards.". If I add « &doit= » at the end, though, it works correctly (is it really necessary?).

Another point: is it already possible to prepare ready-made queries ("French words without a pronunciation on the French Wiktionary", ...), or not yet? As it stands, it is already really cool. Pamputt (talk) 17:04, 22 October 2018 (UTC)
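The « &doit= » issue reported above can be illustrated with a small Python sketch that appends an empty `doit` parameter to a pasted URL when it is missing. This is only one possible way to normalize such URLs, not the gadget's actual fix; the function name `with_autorun` and the example URL are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_autorun(url: str) -> str:
    """Return the URL unchanged if it already has `doit`, else append an empty `doit`."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters such as "doit=" that have no value.
    query = parse_qsl(parts.query, keep_blank_values=True)
    if "doit" not in {key for key, _ in query}:
        query.append(("doit", ""))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical query URL used only for illustration.
example = "https://petscan.wmflabs.org/?language=fr&project=wiktionary"
print(with_autorun(example))
```

Applying the function twice gives the same result as applying it once, so it is safe to run on URLs that already trigger the query automatically.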

I had forgotten that some URLs might not have auto-run; it is fixed. I am currently thinking about the best way to do this, actually. My concern is that a query such as "French words without a pronunciation on the French Wiktionary" will only interest those who record in French; if a German speaker has to scroll past 25 queries that do not concern them (and that they probably do not understand) before finding one in German, that is not great for them.

From there, three ideas that come to mind as I write these lines:

One page per language, in the List namespace (List:fra? List:fra-external? List:fra-examples? ...), gathering all the URLs available for a language in a bulleted list;
Once that work is done, it is not very complicated to support other external tools that can be called through a URL and return their result as JSON; I am thinking in particular of query.wikidata.org;
And, more of a reflection: once it is stable, would it make sense to integrate it into the current "lists" generator (say, with two tabs in it, "static lists" and "dynamic/external/... lists"?), or to integrate it as a new generator in its own right in the core of the RecordWizard? (And in that case, what should it be called?)

An outside opinion would be very useful to help me settle all this :) — 0x010C ^~talk~ 19:52, 22 October 2018 (UTC)

Geographical variations

Hello,

Congratulations on this very interesting project.

I have a question about pronunciations. I am from the south of France and, unlike much of the rest of France, we make heavy use of tonic stress (Italian and Spanish influence, I imagine). As a result, some words, and especially phrases, have a different rhythm where I live.

How should these pronunciation variations be handled? Are they welcome or, as with Quebecers, should we favor a neutral "international French"?

Finally, some words are pronounced differently here: lait, mas, moins (with an s!), etc. How can this be integrated into Wiktionary or Wikipedia?

Jpgibert (talk) 12:02, 13 July 2018 (UTC)

Hello,

Thank you for your interest!

No, a "neutral" French should definitely not be favored. Every local variation / accent is interesting. In fact, just before you start recording, you are asked to fill in your speaker profile, in which you can indicate where you live / where you learned a language.

When a recording is then added to Wiktionary, for example, this information is included. If several people have recorded the same words, we will thus be able to hear the differences in the pronunciation of « lait » in Alsace, Quebec, Occitanie, Île-de-France, Mali, ... And that is cool :)

Does that answer your questions?

Regards — 0x010C ^~talk~ 21:55, 14 July 2018 (UTC)

Hello User:0x010C

Thanks for the reply. I was worried about this because, while there are language codes for the varieties of French in Quebec (fr-CA) or Belgium (fr-BE), accent is not taken into account.

Glad to learn that, despite my accent, I will be welcome. For now, I need to buy a good microphone before doing anything, but as soon as I have one, I will try to share my southern accent.

Jpgibert (talk) 12:31, 23 July 2018 (UTC)

Thesaurus

Hello,

During Lyokoï's presentation video of the project (LetsContribute6), I learned that word lists can be generated from categories. Would it be possible to do the same kind of thing from a thesaurus? Follow-up question: would that be of any use?

Jpgibert (talk) 12:39, 23 July 2018 (UTC)

That could indeed be interesting, even if it is more complicated to code (I imagine). Just to give an example for those who do not see what this is about, have a look here. Pamputt (talk) 21:30, 23 July 2018 (UTC)

@Jpgibert The simplest way to do that is to copy and paste the thesaurus content and separate the words with a #. Formatting it should take a few minutes, but it is no big deal. Lyokoï (talk) 14:22, 15 December 2018 (UTC)
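The copy-paste approach can be partly automated. The Python sketch below turns pasted thesaurus text into one `# word` line per entry and drops repeated words. The exact list layout (one `# ` item per line) and the separators handled (commas, semicolons, newlines) are assumptions, since the thread only says to separate the words with a #.

```python
import re


def to_wordlist(raw: str) -> str:
    """Turn pasted text (words separated by commas, semicolons or newlines) into '# word' lines."""
    seen = set()
    lines = []
    for word in re.split(r"[,;\n]+", raw):
        word = word.strip()
        # Skip empty fragments and words that were already emitted.
        if word and word not in seen:
            seen.add(word)
            lines.append(f"# {word}")
    return "\n".join(lines)


print(to_wordlist("chat, chien; cheval\nchat\n  poule "))
```

Multi-word expressions are kept intact because spaces are not treated as separators.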

General issues + issues with Odia and Asian writing systems

Done, all issue tracked on phabricator or explained below. Ready to archive. Yug (talk) 23:22, 23 December 2018 (UTC)

I loved the current version! Truly admire the changes you all have made over time. I have also done a few recordings in my own language Odia to check for any error. Below are a few:

Tag already recorded items (T212580): When a word has already been recorded and uploaded to Commons, does it not make sense to flag it instead of letting any user upload it directly?
Add custom commons categories (T201135): Also, different languages have different additional categories which Lingua Libre does not let one to add. For instance, I generally add a user category to know how many audio files I have uploaded. For the files recorded using Lingua Libre, I don't see an option to add that optional category.
Remove duplicated words (in same session: explanation below ; across time: T212580): If I am adding a wordlist before recording, is that possible to keep only one word if the same word is used multiple times? This would save some time for the uploader.
Monitor suspect cracking sound in audios (T201136): There is a bit of crackling sound that is heard while monitoring the recorded words. Any particular reason?
Some words fail anyway (T212584): Even though I am correctly pronouncing every word, I see a lot of red-labelled words.
Allow click-play-listen while recording (T212583): While recording, I cannot check what the recording sounds like. I can only choose to re-record after hearing the recorded sound. Otherwise even having that option is of no use.
Remove underline (done): While recording, each word is shown as a green button, and during the recording the word is underlined. This works well for Latin-based scripts. However, for my script, Odia, and for many other Asian scripts, this is a problem, as we have diacritics and accent marks below the characters. It becomes too hard to read at times when underlined. Also, light green text on a white background is not accessible to people with vision correction or color blindness. Maybe a black background with white text would create more contrast and make it easier to read.
Last word cannot be re-recorded (explanation below): When you reach the last word of a batch and want to re-record that word, it doesn't allow you to click on the word button and re-record.

Also, requesting to add the Warang Citi (used for Ho language) and Ol Chiki (used for Santali language).

Thank you much again. I would really love to contribute more myself, and involve other community members. --Psubhashish (talk) 07:21, 26 July 2018 (UTC)

Hi!

First of all, thanks for your feedback, it is really helpful. Here are some details about your remarks:

In my opinion, it is interesting to have several records of the same word by different users; the naming convention takes this into account to prevent records from being overwritten by another user. But as I'm not sure I understood this point very well, don't hesitate to clarify it if my answer is mistaken.
T201135
If I have understood your point correctly, that's already the case. You can't add duplicate words in the same record batch (if you try to do so, the second one will be dismissed).
It's just a small file-loading issue; it will be fixed soon, see T201136
This is a major issue I'm already aware of. In some cases (~ 1 word out of 100), for some unknown reason, MediaWiki mistakes WAV files for executable files, so it refuses them...
I'll try to add a way to listen to the records while still in the recording studio.
I wasn't aware of those particularities; I'll remove the underline. I'm not so fond of white text on black, but I'll try to find something more accessible.
Hmm, this works well for me. When you have recorded the last word, the recording automatically stops; did you click on the big red button to enable it again?

I've imported the Ho language, which was missing from Lingua Libre, but the two writing systems you mentioned are part of Unicode and should work, am I wrong?

Best regards — 0x010C ^~talk~ 08:37, 3 August 2018 (UTC)

+1 for point 7, the underline is also troublesome for Chinese. Yug (talk) 13:08, 6 August 2018 (UTC)

Hi! Continuing the cleanup and issue-tracking effort, and to keep things short and concise, I enhanced the initial post with titles and statuses (Phabricator issues). Sorry for that, it is just cleaner. Yug (talk) 11:33, 24 December 2018 (UTC)

Note: I pointed out to Psubhashish the work done on his earlier feedback. See the positive discussion on EN. Yug (talk) 13:45, 9 January 2019 (UTC)

First use: a few questions

Hello!

First of all, thank you very much for this great tool!

I noticed a few difficulties while using it. Maybe it is just because I am new and not aware of all the options, but here is my list:

On a list of 20 words, I usually have to restart the recording manually three or four times because the tool suddenly decides to stop recording. When I select a word, even after clicking the big red button, there is roughly a one-in-two chance that the recording starts.
My words are very often cut off at the beginning and at the end (especially proper nouns made of two or three words): maybe it would make sense to have a small "next" button to mark the end of a word manually? Out of 20 recorded words, between those the tool will not let me record (see #1) and those that are cut off, few are left. Out of 3 lists of about twenty names, I got 2, 5 and 7 usable recordings.
On a recording page such as Gwendoline Daudet (Q44570), the link to the Wikipedia page puts a + instead of a _ between the words, so you land on a red link on Wikipedia.

In case it helps, I am using the latest version of Firefox as of 10/10/18 and Windows 10.

As for the rest: it is really great, congratulations on all this work! I will keep playing with the tool until I am thoroughly familiar with it.

Exilexi (talk) 06:22, 10 October 2018 (UTC)
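The link problem in the third point (a + instead of a _ between words) is the kind of result one gets when a page title is encoded like a form field rather than like a MediaWiki title. The Python sketch below shows the difference. It is a guess at the cause for illustration, not the actual Lingua Libre code, and the function name `wikipedia_url` is hypothetical.

```python
from urllib.parse import quote, quote_plus


def wikipedia_url(title: str, lang: str = "fr") -> str:
    """Build an article URL the MediaWiki way: spaces become underscores, the rest is percent-encoded."""
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


title = "Gwendoline Daudet"
print(quote_plus(title))     # form-style encoding: Gwendoline+Daudet
print(wikipedia_url(title))  # https://fr.wikipedia.org/wiki/Gwendoline_Daudet
```

On a wiki, `Gwendoline+Daudet` is read as a different title containing a literal plus sign, which explains the red link.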

Problems 1 and 2 are in fact almost solved with a better microphone. For some unknown reason, Lingua Libre asks for permission to use a microphone that is not my default one.

New problem with the upload: all words but one are uploaded fine. The Commons button is greyed out and nothing happens when I click the small cross next to a word: apparently, publishing to Commons is all or nothing, so I just lost 29 words because a single one refused to upload. Exilexi (talk) 06:44, 10 October 2018 (UTC)

One more: I had recorded 20 words "around me". I just launched 20 more... and they are the same ones. It could be useful to add an option to avoid recording the same thing several times (my accent does not change from one day to the next). Exilexi (talk) 05:36, 11 October 2018 (UTC)

Hi Exilexi, a few remarks and partial answers to your comments:

When you describe the tool stopping the recording, I think the problem comes from the quality of the microphone. That seems to be what you concluded as well.
Lingua Libre splits words automatically as soon as it detects a silence. For very long names, we could consider adding a button to move on to the next word manually. That said, the tool would lose some of its appeal, since it would become much slower.
Regarding the link to Wikipedia (with a « + »), it does look like a bug. I have opened a ticket on Phabricator.
For the upload problems, when an upload fails: a ticket already exists on this subject.
For word lists, you can create your own. Several already exist in French (a few dozen), and fewer in other languages. How to proceed is explained here. If you need help, let us know. Pamputt (talk) 20:13, 11 October 2018 (UTC)
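To illustrate why names made of several words can get cut, here is a toy Python sketch of silence-based splitting. It is not Lingua Libre's actual algorithm, which the thread does not describe. The function name, the amplitude `threshold` and the `min_silence` length are all assumptions chosen for the example.

```python
from typing import List, Tuple


def split_on_silence(samples: List[float], threshold: float = 0.1,
                     min_silence: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of segments separated by runs of quiet samples."""
    segments = []
    start = None  # index where the current segment began, or None between segments
    quiet = 0     # length of the current run of quiet samples inside a segment
    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # Close the segment just after its last loud sample.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, len(samples) - quiet))
    return segments


# Made-up amplitudes: a two-part name with a short pause, a long silence, then another word.
samples = [0, 0, 0.5, 0.6, 0, 0.4, 0, 0, 0, 0.7, 0.8, 0]
print(split_on_silence(samples, min_silence=3))
print(split_on_silence(samples, min_silence=1))
```

With a tolerant `min_silence`, the short pause inside the first name is kept within one segment. With a strict one, the name is split in two, which matches the symptom reported for proper nouns of two or three words.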

Formosan languages workshop

Hi there, I had an email exchange with Vicky, the NCCU language researcher involved in Formosan languages protection. Some of her questions are beyond my skills :

1. I couldn't find ais(Sakizaya), ami(Amis), trv(Truku) in the language list. Please add, thanks!
2. Can I add the dialect information in the speaker file? 
Because there are 42 dialects under 16 aboriginal languages, I had record Squliq dialect not C’uli’ dialect of Atayal language today.
3. I had add the Chinese translation after the aboriginal languages, is that ok for lingua libre? 
Or I only can type in aboriginal languages?

I broke the questions into several subsections so that a quick discussion can take place for each. Please note that Vicky's workshop is coming this week, so it would be good to forward her practical solutions early. Yug (talk) 09:38, 29 November 2018 (UTC)

1) Requesting languages additions

Amis_language (iso: ami; wikidata: Q35132).
Sakizaya has no iso639, from my understanding. Sakizaya_language (iso: none, wikidata: Q718269), Nataoran_language (iso: ais, wikidata: Q42508148).
Truku (no ISO, no Wikidata): described in Wikipedia as the main component of the Seediq language (iso: trv, wikidata: Q716686), already in LinguaLibre. Taiwanese linguists, the most experienced in the matter, make a distinction.

If I understand well, LL only requires wikidata ID. If so, I would recommend to add Q35132 (amis), Q718269 (Sakizaya). Q42508148 (Nataorans) and Q716686 (Seediq) are already in I think. Truku may require a wikidata item creation, then integration in LL. Yug (talk) 09:38, 29 November 2018 (UTC)

The four languages have been imported here: Seediq (Q51311) Seediq, Amis (Q51870) Amis, Sakizaya (Q51871) Sakizaya and Nataoran (Q51872) Nataoran and can be used for recording. Pamputt (talk) 04:15, 30 November 2018 (UTC)

2) "There are 42 dialects under 16 aboriginal languages".

We previously added 15 or 16 of these recognized languages into LinguaLibre (thanks x0 and Pamputt). Again, Taiwanese linguists are the experts on the matter, so what can we (LL) recommend for these 42 variants ? Two ideas came to me.

Add the information in the speaker name or place of learning. For example: Paul Martin (Breton north) ; Paul Martin (Breton south).
Add the Wikidata items following Taiwanese linguists' recommendations, even though no Wikipedia articles or ISO 639 codes exist.

What do you think ? Yug (talk) 09:38, 29 November 2018 (UTC)

As far as I understand, if no Wikidata item exists for a given language, we have two options: create it on Wikidata (provided it is notable) and then import it here, or create it by hand directly here. As for dialects, I would say they are notable enough to be created on Wikidata, but I have no time to do it myself before the end of the year (I have no regular Internet connection for now). Pamputt (talk) 04:18, 30 November 2018 (UTC)

In fact, the second option mentioned above by Pamputt won't work. For a language to be recognised by the RecordWizard, it has to have a Wikidata ID. The right way to do it imho is (as also suggested by Pamputt) to create the corresponding item on Wikidata, and then ask for an import here. — 0x010C ^~talk~ 14:46, 3 December 2018 (UTC)

3) "Is it ok to use `mhway su (谢谢)` ?" (target word + translation)

Technically, you can enter both the aboriginal language and Chinese: de facto, the target word together with its translation in the closest macro-language, here Chinese.
Be extremely consistent in your practice, so as to ease later uses (learning apps). If the rule is

{aboriginal}{white_space}{opening_round_bracket_(}{Chinese}{closing_round_bracket_)}

stick to it, and avoid round brackets elsewhere in your entry. Early consistency makes later uses easier. Yug (talk) 09:38, 29 November 2018 (UTC)

@x0, devs, here again we have the question of word lists with translations. I previously suggested that word lists support an ISO 639 syntax or a Wikidata ID syntax so as to push the translation into a different metadata field. Example of list :

mhway su [cmn:谢谢,eng:Thank]

Then "mhway su" is the target recorded word. "谢谢" is the translation in the meta data "cmn" (Chinese). "Thank" is the translation in the meta data "eng" (English). I guess I should open a ticket on Phabricator. Yug (talk) 10:19, 29 November 2018 (UTC)
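The proposed syntax is simple enough to parse with a regular expression. The Python sketch below separates the target word from its tagged translations. It only illustrates the proposal, which is not supported by Lingua Libre at this point. The names `ENTRY` and `parse_entry` are hypothetical, and the sketch assumes that translations contain no commas or square brackets.

```python
import re
from typing import Dict, Tuple

# word, optionally followed by "[code:translation,code:translation]"
ENTRY = re.compile(r"^(?P<word>[^\[\]]+?)\s*(?:\[(?P<meta>[^\[\]]*)\])?\s*$")


def parse_entry(line: str) -> Tuple[str, Dict[str, str]]:
    """Split 'word [code:translation,...]' into the word and a {code: translation} dict."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    translations = {}
    for item in filter(None, (match.group("meta") or "").split(",")):
        code, _, text = item.partition(":")
        translations[code.strip()] = text.strip()
    return match.group("word").strip(), translations


print(parse_entry("mhway su [cmn:谢谢,eng:Thank]"))
print(parse_entry("mhway su"))
```

Entries without a bracketed part are still accepted, so existing monolingual lists would keep working under this scheme.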

Multilingual word lists (word lists that include translations of the target words) are not supported at the moment. An issue has been opened on LinguaLibre's development and bug tracking system (T211086). Yug (talk) 09:29, 4 December 2018 (UTC)

Thesaurus (2)

I archived the core of the discussion between Benoit & 0x010C, but this other topic deserves its own section:

"Unrelated. I was thinking that a small tool to generate a list from a fr.wikt thesaurus would be great. Instead of choosing a category from a wikiproject, you would choose a thesaurus. Just an idea. --Benoît 21:36, 20 December 2018 (UTC)"

--Yug (talk) 10:41, 24 December 2018 (UTC)

Feature request: ask to reuse existing identical audio if available

Done, can be archived. 12:08, 31 December 2018 (UTC)

I waste a lot of time because Lingua Libre Bot needs a new audio file for every lexeme form. For example, I had to record this audio https://commons.wikimedia.org/wiki/File:LL-Q809_(pol)-KaMan-Bizancjum.wav 10 times (https://lingualibre.fr/index.php?title=Q55850&action=history). Many forms in Polish are identical across cases. It would be great if the word generator (+ExternalTools) in the Record Wizard could ask whether duplicates (identical speaker, language and lexeme) should be recorded, and if Lingua Libre Bot could propagate existing audio. It could save time. KaMan (talk) 14:28, 25 December 2018 (UTC)

KaMan, where do your word list(s) come from? How are they created? Do you use the LinguaLibre word generator? Yug (talk) 00:12, 27 December 2018 (UTC)

If I understand correctly, you have the same issue as the one raised in Warn the user when they try to record a file that they already made. Namely, you keep running into words that you have already recorded. If so, we have started to look for technical solutions (T212580). As of now, for long series, it is important to stick to large frequency lists, so as not to re-record similar words multiple times. Yug (talk) 00:17, 27 December 2018 (UTC)

I took a look online for available frequency lists in Polish.

Subtlex-pl : article, http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-pl data, available but "for research usage".
Worldlex : article, data, available but unstated license
Hermit Dave, 2016 : page, data, CC-by-sa

So Hermit Dave's data would do. We have tutorials on how to clean up frequency lists, how to split such long files, other tricks, and how to create a list on LinguaLibre to help.

Some commands will need minor changes if your input differs. If you have some basic shell skills, you can do it and quickly learn the exact commands needed. Yug (talk) 01:30, 27 December 2018 (UTC)

No Yug. He's talking about word lists generated with a SPARQL query from Lexemes on Wikidata, and about the fact that Lingua Libre Bot only associates audio recordings with a Lexeme when there is a direct link, causing him to re-record, many times over, homographs that are also homonyms.

But the main issue I pointed out in T212580 applies here too: I don't have any idea for an easy and effective implementation right now.

(and no Yug, it is not "important to stick to large frequency list": we already have other, simpler solutions, such as Wikimedia categories or external tool imports).

Best regards — 0x010C ^~talk~ 11:10, 27 December 2018 (UTC)

0x010C is right. It's not a problem of a wrong list; the list of words is correct. If there is no easy solution, I can work with it as is, but I admit I feel pain ;) before recording 14 identical forms of https://www.wikidata.org/wiki/Lexeme:L19356 :) KaMan (talk) 13:22, 27 December 2018 (UTC)

"Who doesn't try cannot be wrong." One really has to read between the lines to find the Wikidata reference. "Lexeme" is a lexicology term before being a Wikidata item type. The current SPARQL query doesn't seem to save time.

And yes, generally speaking, frequency lists of unique words save our speakers' energy. First, each form is recorded only once: this is what human speakers are for, and they shouldn't have to record the same form multiple times. Second, in natural language, word frequencies follow Zipf's law. Thus, the 135 most frequent English items cover 50% of written text. Conversely, recording Wikipedia categories is not representative of human language and thus not time-efficient. One volunteer can record audio for 2000 categories and it will still barely account for 1% of the language. This only has internal value, by Wikipedians for Wikipedians, which is positive but sub-optimal.

As for KaMan's case, I would still recommend using a frequency list: it would save valuable human time. A later bot could dispatch the audio files to the various Wikidata items of this language and form. So I just used Hermit Dave's CC-by-sa data to create Polish frequency lists on LinguaLibre for the first 20k words; they are now available in the Record Studio > Details step : Local list > "pol". Yug (talk) 13:51, 27 December 2018 (UTC)
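The coverage argument can be checked on any frequency list with a few lines of Python. The sketch below computes the share of all tokens accounted for by the N most frequent words. The `toy` counts are an idealised Zipf-like distribution invented for the example, not real English or Polish data, so the printed figure only shows the shape of the curve.

```python
from typing import Dict


def coverage(frequencies: Dict[str, int], top_n: int) -> float:
    """Share of all tokens accounted for by the top_n most frequent words."""
    counts = sorted(frequencies.values(), reverse=True)
    return sum(counts[:top_n]) / sum(counts)


# Hypothetical Zipf-like counts: the word of rank r occurs about 1/r as often as the first one.
toy = {f"w{rank}": 60000 // rank for rank in range(1, 1001)}

for n in (10, 135, 500):
    print(f"top {n:>3} of {len(toy)} words: {coverage(toy, n):.0%} of tokens")
```

Run against a real list (for instance one line per word with its count), the same function tells how many recordings are needed to reach a given coverage.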

Yug, it's not a problem of frequency lists but a feature of the language. I record all FORMS of words. Every noun in Polish has at least 14 forms, every adjective has 30-80 forms, and the same goes for verbs. Every form has an entry in Wikidata and needs a recording. But many of these forms are identical, so in the end I have to record the same audio several times. This is independent of whether the word comes from a frequency list. In other words, a word from a frequency list has the same problem in Wikidata. BTW: I already follow a frequency list when creating lexemes in Wikidata, but thanks :) KaMan (talk) 16:27, 27 December 2018 (UTC)
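The feature requested here amounts to grouping forms by speaker, language and spelling, recording once per group, and attaching that one audio file to every form in the group. The Python sketch below shows the grouping step only. It is an illustration of the idea, not Lingua Libre Bot's behaviour, and the form identifiers and spellings are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical (form identifier, spelling) pairs for one lexeme.
forms = [
    ("L0-F1", "forma"),
    ("L0-F2", "formy"),
    ("L0-F3", "forma"),
    ("L0-F4", "forma"),
    ("L0-F5", "formy"),
]


def group_forms(forms: List[Tuple[str, str]], speaker: str,
                language: str) -> Dict[Tuple[str, str, str], List[str]]:
    """Map each (speaker, language, spelling) key to the forms one recording could serve."""
    groups = defaultdict(list)
    for form_id, spelling in forms:
        groups[(speaker, language, spelling)].append(form_id)
    return dict(groups)


groups = group_forms(forms, speaker="KaMan", language="pol")
for (_, _, spelling), form_ids in groups.items():
    print(f"record '{spelling}' once, reuse for {len(form_ids)} forms: {form_ids}")
```

Identical spelling does not always mean identical pronunciation, which is why the request is for the Record Wizard to ask the speaker before reusing an audio file rather than to reuse it silently.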

I think I get your process now. Learning ongoing ! Still seems weird you are recording 14 times the same form. Yug (talk) 16:58, 28 December 2018 (UTC)

Homonymy

How are homonyms treated? Will they be overwritten by new recordings? Infovarius (talk) 17:42, 27 December 2018 (UTC)

Yes, if a new word has the same transcription, the same language and the same speaker as an old one, the old one will be overwritten. If you want to record two homonyms that have different pronunciations, you can add a small qualifier in brackets just after the word when you type it in the 3rd step of the RecordWizard. Everything inside the brackets will be set aside, as on this record: File:LL-Q150 (fra)-0x010C-fils (enfant).wav. — 0x010C ^~talk~ 21:26, 27 December 2018 (UTC)
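As an illustration of the rule above, the Python sketch below splits an entry into its transcription and bracketed qualifier, and builds a file name on the pattern visible in the example (`LL-<language item> (<ISO code>)-<speaker>-<entry>.wav`). The pattern is inferred from that single file name, and both function names are hypothetical, so treat this as a sketch rather than the RecordWizard's actual logic.

```python
import re
from typing import Optional, Tuple


def split_qualifier(entry: str) -> Tuple[str, Optional[str]]:
    """Separate 'fils (enfant)' into the transcription 'fils' and the qualifier 'enfant'."""
    match = re.fullmatch(r"(.*?)\s*\(([^()]*)\)\s*", entry)
    if match:
        return match.group(1), match.group(2)
    return entry.strip(), None


def record_filename(language_item: str, iso: str, speaker: str, entry: str) -> str:
    """Build a file name following the pattern inferred from the example above."""
    return f"LL-{language_item} ({iso})-{speaker}-{entry}.wav"


print(split_qualifier("fils (enfant)"))
print(record_filename("Q150", "fra", "0x010C", "fils (enfant)"))
print(record_filename("Q150", "fra", "0x010C", "fils"))
```

Because the qualifier is part of the file name, the two recordings of "fils" end up in different files and neither overwrites the other.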

It is good that this is possible in principle. But how can I know that I am recording a homonym of something already recorded? Infovarius (talk) 21:51, 27 December 2018 (UTC)

How to properly credit lists

Done : no built in solution as of now, issue opened (T212671), current hack: put source in talk page. Yug (talk) 10:53, 31 December 2018 (UTC)

(T212671) I attempted this List:Pol/words-by-frequency-2001-to-4000#Source, but loading the list in the Record Studio keeps the source section as a word to record. Is there a known trick to hide this source section in the Record Studio ? Yug (talk) 16:56, 28 December 2018 (UTC)

Erreur de téléversements

Salut, je rencontre un problème assez curieux. Lorsque j'ai fini de m'enregistrer, je choisis de publier sur Commons et là, une partie de mes enregistrements sont publiés et puis ça se met à planter. Après quoi, je ne peux plus en ré-upload pour une certaine période de temps. Que dois-je faire ? Lepticed7 (talk) 21:17, 29 December 2018 (UTC)

Hi,

Sorry for the slow reply, I was away from my computer over the holidays.

Has it happened again since the 29th? If so, I see two possibilities: either you were logged out of Lingua Libre in the middle of the upload, or a filter on Commons is blocking your uploads for some mysterious reason. If it happens again, can you try opening Lingua Libre in a new tab and check in that tab whether you are still logged in? If that is the problem (though it should normally no longer happen), simply logging in again in the other tab is enough to then resume uploading the failed recordings.

— 0x010C ^~talk~ 15:36, 2 January 2019 (UTC)

Menu and naming

2019 Prague Wikimedia Hackathon and scholarship (bourse)

Event: 2019 Prague Wikimedia Hackathon
Place: Prague, Czech Republic
Date: 17-19 May, 2019
Objective: push wikimedia dev projects forward, via coding, networking, documentation.
Scholarship : possible ! Please apply by January 8th (inclusive), and forward this information to potential candidates.
Link: mediawiki.org:Wikimedia_Hackathon_2019/Register_and_Attend

Please spread the word around the world ! Yug (talk) 20:21, 4 January 2019 (UTC)

Word frequencies for prioritizing, UNILEX and licence

Would it make sense to prioritize the data entry, so that users would start recording the most frequent words of a language, and then proceed to the less important words? If you’d like to do this, here’s the word frequencies for 1000+ languages, mostly from crawled corpora. Language codes are IETF BCP47. — Sascha (talk) 08:56, 8 January 2019 (UTC)

This would be indeed useful. To be available on Lingua Libre, we have to create manually (or using bots) lists with these words. I will try to find some time to do it. Pamputt (talk) 12:04, 8 January 2019 (UTC)

Lol. Sascha is in computational lexicology since 1993 ^^ #Boss Yug (talk) 16:14, 8 January 2019 (UTC)

Welcome Sascha, Happy to have your inputs,
We do encourage the use of frequency lists (see Help:Why wordlists matter ?). LinguaLibre is still in its open beta infancy.

Process and quality : We started to add some frequency lists (Polish) by hand, based on the Hermit Dave project (50k list, github, wordpress announcement). Hermit Dave's free data is helpful yet quite raw, namely polluted by foreign-language words. So when available, we use cleaner lists from academic research. Ex: Chinese is planned via Subtlex-ch. These raw text lists are then copy-pasted into LL wikipages, and one of these lists is then loaded in the record wizard to provide words for the speaker to read aloud. There is no interactive sorting; the list is simply loaded as text.

Licence : The other issue we have is that half of the frequency lists around have odd semi-free licenses that are not, or not clearly, compatible with Wikimedia projects. UNILEX's licence is the UNICODE licence.
@LL team : Any idea how we handle data whose license asks for :

provided that either
(a) this copyright and permission notice appear with all copies of the Data Files or Software, or
(b) this copyright and permission notice appear in associated documentation.

Do we copy the notice to the talk page as well ? --Yug (talk) 16:48, 8 January 2019 (UTC)

Good point about the license. Theoretically I could ask the Unicode Consortium to change the license for Unilex to CC0; but like any relicensing discussion, this would take forever. As the person who started the Unilex project at Unicode, I currently have the impression that Wikidata Lexemes is going to be the better (more scalable, faster progressing, eventually higher quality) approach for collecting lexical data about the world’s languages. So, instead of starting a painful relicensing debate, I think it’ll be easier to simply run corpus crawler to build these word lists from scratch. I’ve written that crawler a while ago to get started with the Unilex project; the Unilex word frequencies were built by running 1000 crawls (one for each language), and then segmenting their plaintext output with ICU word break iterators. I’ve now placed a link to the Corpus Crawler sources on Help:How_to_create_a_frequency_list, in case someone here wants to give it a try. If anything’s broken there, or to support additional languages in the crawler beyond the current 1000, just send a pull request via GitHub. You can also fork the crawler project if you want; the source code is a pretty dull Python script with a regular Apache-2.0 license. — Sascha (talk) 17:02, 14 January 2019 (UTC)
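For readers who want to see what "building a word list from crawled plaintext" amounts to, here is a minimal Python sketch. It is not the Corpus Crawler's code: the ICU word break iterator mentioned above is replaced by a plain `\w+` regular expression (an assumption that is too crude for languages written without spaces), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional


def word_frequencies(text: str) -> Counter:
    """Count lowercase word occurrences in a plaintext corpus.

    Assumption: a simple regex stands in for ICU word segmentation.
    """
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


def frequency_list(text: str, top: Optional[int] = None) -> List[str]:
    """Return words ordered from most to least frequent."""
    return [word for word, _count in word_frequencies(text).most_common(top)]


sample = "to be or not to be to"
print(frequency_list(sample, 2))
```

The resulting list, one word per line, is the kind of raw text that can be pasted into a List: page.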

Enable all human languages in bulk?

Would it be possible to support all existing human languages at once? Currently, one needs to file a request for each and every language. It’s not very clear how to do this (which of the admins to contact, and how exactly to contact them?). Also, the LinguaLibre admins surely can make better use of their time than by handling single language requests... For a list of all languages, see the IANA language subtag registry for IETF BCP47. There’s only a few thousand languages, so it might be easy to do this in one single bulk, and then be done. If it helps, I’ll gladly generate a list of (IETF-BCP47-Code, Wikidata-ID) with the mapping, or any other information you’d need for this; feel free to contact me. — Sascha (talk) 09:32, 8 January 2019 (UTC)

+1. I think there are some technical issues with search fields... any way to move forward ? Yug (talk) 17:07, 8 January 2019 (UTC)

Hi Sascha,

For now, I have only imported languages with an ISO 639-2 tag, to test Lingua Libre's software with a smaller set of languages at its start (Lingua Libre is still in beta). Importing every language in the world is planned, but not in the short term, because I still have to check whether the database and the software can smoothly handle thousands and thousands of languages.

Best regards — 0x010C ^~talk~ 18:24, 8 January 2019 (UTC)

Use IETF BCP47 instead of ISO 639?

Currently, LinguaLibre seems to use ISO 639 language codes internally. Consider switching to IETF BCP47; all modern computing standards such as HTML, XML or PDF have moved from ISO 639 to IETF BCP47. For example, BCP47 syntax supports regional variants such as Canadian French fr-CA; language variants such as Sursilvan Romansh rm-sursilv; regional subdivisions such as the Berne variant of Swiss German gsw-u-sd-chbe; and other fine-grained distinctions. See this article for an introduction, and the IANA registry of valid subtags for the complete list. Specifically, the proposal would be to add property IETF BCP47 language tag (Q1059900) to LinguaLibre’s copy of the Wikidata schema, and to use that property instead of ISO 639-3 code (Q56217712). — Sascha (talk) 10:40, 8 January 2019 (UTC)

Hi Sascha!

In fact, Lingua Libre uses neither ISO 639-3 nor BCP47 but Wikidata Qids as the internal identifier for a language. Currently, if I remember correctly, ISO 639-3 codes are used in two cases:

For the name of pages containing lists in the list namespace (in the format [[List:ISO/List name]], with ISO the ISO 639-3 code);
To forge Wikimedia Commons's category names

Changing the code would affect only those two parts of the process. If we switch from one language tag to another, we would have to:

Add a new property BCP47 locally as you suggested (a bot can import them from Wikidata);
Rename all local lists (can be done by hand, we don't have many lists for now);
Rename all existing Wikimedia Commons categories and move all the audio recordings (a bot is required there);

I personally have no opinion on this question, but if several people agree that it would be a good move, I'll add it to the development todo-list :).

Best regards — 0x010C ^~talk~ 18:19, 8 January 2019 (UTC)

Cool! I wasn’t aware that you’re internally using Wikidata IDs. This is great, because (other than ISO 639-3) it can model arbitrary languages and dialects.

Regarding the lists, would it perhaps be an option to key them by Wikidata ID? Then, arbitrary languages/dialects could be queried, and also regional variants such as Australian English. I don’t know how your server is implemented, but perhaps you could map language codes to Wikidata IDs in your frontend server, so it would not even have to be a user-visible change (apart from supporting more languages).
Regarding the names of categories on Wikimedia Commons, what would you think of the proposal to use IETF language codes instead of “other”?

Best, — Sascha (talk) 06:14, 11 January 2019 (UTC)

Documenting langtag usages on LL

See Help:Langtags and Wikipedia Language code#Common_schemes

In our Help:Main, we surely could have a page Help:Langtags (Languages codes and LinguaLibre) to expose our current / planned approaches on the matter. Yug (talk) 13:23, 9 January 2019 (UTC)

Help:Langtags (Languages codes and LinguaLibre) has been started. So, for now we are based on LL Qids, ok. Then,

Should these local LL pages contain ISO 639-3 and BCP47 properties, or should they go into the Wikidata page ONLY ? Or both.
Audio files could contain all of these as metadata tags. Should they ?
If someone could write a SPARQL query which lists all our active languages on LL, with English name, LL-qid, WD-qid, ISO 639-3, BCP47, it could be a helpful conversion table. Yug (talk) 13:50, 9 January 2019 (UTC)

Yug, here is your query :

select ?languageLabel ?language ?WD ?isoCode (COUNT(?record) AS ?count)
where {
?record prop:P2 entity:Q2 .
?record prop:P4 ?language .
?language prop:P12 ?WD .
?language prop:P13 ?isoCode .
SERVICE wikibase:label {bd:serviceParam wikibase:language "en" .} 
}
GROUP BY ?languageLabel ?language ?WD ?isoCode
ORDER BY DESC(?count)

As far as I can tell, there is no BCP47 property on LL, and I added the number of records in these languages. And I don't know how to share a direct link to the query on https://lingualibre.fr/bigdata/#query. Cheers, VIGNERON (talk) 09:47, 11 January 2019 (UTC)

I created T213530 to ask for implementing a direct link to a query. Pamputt (talk) 10:23, 11 January 2019 (UTC)

Support variants of Romansh

Done -- can be closed Sascha (talk) 20:31, 11 January 2019 (UTC)

Would it be possible to add support for the various variants of the Romansh language?

In the IETF BCP47 language subtag registry, rm-rumgr is the language code for Rumantsch Grischun; rm-surmiran for Rumantsch Surmiran; rm-sutsilv for Rumantsch Sutsilvan; rm-sursilv for Rumantsch Sursilvan; rm-vallader for Rumantsch Vallader; rm-puter for Rumantsch Puter.

In Wikidata, rm-rumgr is Q688873; rm-surmiran is Q690216; rm-sutsilv is Q688272; rm-sursilv is Q688348; rm-vallader is Q690226; rm-puter is Q688309.

In Wikimedia commons, the category tags are subtags of Category:Romansh_pronunciation but they are not very organized; I’ll gladly create new categories if needed.

I’m currently uploading a couple thousand Sursilvan pronunciations, such as “acceptar ezatgei”. It would be great to use LinguaLibre for recording additional variants of the Romansh language, and for recording the missing Sursilvan words. Your toolchain is so much nicer than my bot, so I’d love to switch over. :-)

See also Phabricator ticket T210293 for a related request to support them for monolingual text in Wikidata, which isn’t really related to LinguaLibre but might be interesting as context.

— Sascha (talk) 20:18, 9 January 2019 (UTC)

@Sascha it's done!

Note that the Wikidata Qid is enough: we have a script that automatically extracts all the other information needed from Wikidata :).

Best regards — 0x010C ^~talk~ 09:26, 10 January 2019 (UTC)

For easy access:

wikidata:Q688309 : Putèr (Q74907): Putèr : rm-puter
wikidata:Q690226 : Vallader (Q74906): Vallader : rm-vallader
wikidata:Q688348 : Sursilvan (Q74905): Sursilvan : rm-sursilv
wikidata:Q688272 : Sutsilvan (Q74904): Sutsilvan : rm-sutsilv
wikidata:Q690216 : Surmiran (Q74903): Surmiran : rm-surmiran
wikidata:Q688873 : Rumantsch Grischun (Q74902): Rumantsch Grischun : rm-rumgr

So from this live import of pointers~examples I understand how we are rolling : most properties are in Wikidata only ;) (It answers my question 1 in the section above.)

Thanks to 0x010C ! Yug (talk) 12:57, 10 January 2019 (UTC)

Thank you! — Sascha (talk) 14:59, 10 January 2019 (UTC)

Chakma

File:Screenshot 2019-01-10-22-28-54.jpg

Audio screenshot

I’ve tried to add support for the Chakma language by adding https://lingualibre.fr/wiki/Q74105. My Chakma contact (Bivuti Chakma, bsereye@hotmail.com) was able to record Chakma pronunciations, but he reports that the final step (uploading the files to Wikimedia servers) has failed. Probably it’s my fault; I should have asked you instead of trying to do this myself... Apologies for the nuisance, and thanks for your help. — Sascha (talk) 14:52, 10 January 2019 (UTC)

Hi, I am Bivuti Chakma from Bangladesh. I am working to bring my language into technology around the globe.

On your site I have recorded some audio, but it is not published correctly. Why?

I include a screenshot of the audio in this regard.

Thanks, Bivuti

It is not clear to me now, but it seems that creating a language "by hand" does not work. So I imported https://lingualibre.fr/wiki/Q75180. Help:Add_a_new_language should be updated. Bivuti, could you try again on a few words and copy here any error message you get? Pamputt (talk) 06:42, 11 January 2019 (UTC)

Thanks Pamputt

When I try to record audio, the site shows me this screen:

File:Screenshot 2019-01-11-23-18-25.jpg

Unable to connect

Hi Bivuti!

Thanks for your participation.

I've fixed the language-import thing, which was causing the "Unable to contact the server" error.

Concerning the publishing issue: this question may be odd, but did you actually click on the big blue "Publish on Wikimedia Commons" button?

Best regards — 0x010C ^~talk~ 04:41, 12 January 2019 (UTC)

Compress audio?

Should LinguaLibre upload its pronunciations in FLAC format instead of uncompressed Wave files? FLAC is a lossless compression, so it would save space (and bandwidth for users) without losing quality. The only downside is that LinguaLibre’s server would use a bit more CPU, but that’s probably a very minor issue since it’s only needed once per file. To convert to FLAC at maximal compression, you can use something like `ffmpeg -i input.wav -compression_level 12 output.flac`. Wikimedia Commons automatically transcodes FLAC to Vorbis and to MP3; see example for an uploaded FLAC file. Just a thought. — Sascha (talk) 15:09, 10 January 2019 (UTC)
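If the conversion were scripted on the server, it could look roughly like the Python sketch below. It only wraps the ffmpeg command quoted above; the helper names are hypothetical, ffmpeg is assumed to be installed and on the PATH, and nothing here reflects how Lingua Libre's server actually works.

```python
import subprocess
from pathlib import Path
from typing import List


def flac_command(wav_path: str) -> List[str]:
    """Build the ffmpeg call quoted above for one WAV file."""
    src = Path(wav_path)
    dst = src.with_suffix(".flac")  # same name, .flac extension
    return ["ffmpeg", "-i", str(src), "-compression_level", "12", str(dst)]


def convert_to_flac(wav_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(flac_command(wav_path), check=True)


print(" ".join(flac_command("input.wav")))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces or brackets, which Lingua Libre file names do.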

Sascha, could you open a Phabricator ticket to track this proposal? Pamputt (talk) 06:50, 11 January 2019 (UTC)

Sure, filed T213534. — Sascha (talk) 11:11, 11 January 2019 (UTC)

Category “Lingua Libre pronunciation-other”

In this test, LinguaLibre has assigned a Commons category Lingua Libre pronunciation-other. Instead of “other”, could it use the IETF language tag (if present in Wikidata)? To get it, retrieve property P305 from the Wikidata record for the language. And perhaps fall back to the Wikidata ID for languages that don’t have an IETF code. Then, the recordings from unrelated languages wouldn’t get conflated. — Sascha (talk) 15:17, 10 January 2019 (UTC)
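The proposed fallback can be written down as a small Python sketch. It assumes the category name keeps the "Lingua Libre pronunciation-" prefix seen above and that the IETF tag (Wikidata property P305) has already been fetched, or is None when missing; the function names are hypothetical.

```python
from typing import Optional


def category_suffix(wikidata_qid: str, ietf_tag: Optional[str] = None) -> str:
    """Prefer the IETF language tag (P305); fall back to the Wikidata ID."""
    return ietf_tag if ietf_tag else wikidata_qid


def category_name(wikidata_qid: str, ietf_tag: Optional[str] = None) -> str:
    """Assumed pattern: 'Lingua Libre pronunciation-<suffix>'."""
    return f"Lingua Libre pronunciation-{category_suffix(wikidata_qid, ietf_tag)}"


print(category_name("Q688348", "rm-sursilv"))  # language with an IETF tag
print(category_name("Q36759"))                 # no IETF tag: falls back to the Qid
```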

Indeed, this point has to be improved on Lingua Libre. See T208641 on Phabricator. About IETF codes, the problem is they do not cover all the languages/dialects spoken on earth. So the problem remains for languages that do not have an IETF code. Pamputt (talk) 06:47, 11 January 2019 (UTC)

Thanks for the pointer; I’ve added a comment to T208641. — Sascha (talk) 11:03, 11 January 2019 (UTC)

Normalize loudness

Should LinguaLibre normalize the loudness of recordings to EBU R 128, so that pronunciations are perceived equally loud irrespective of user microphones? ffmpeg can do this, either if you call it directly (rather painful), or via the ffmpeg-normalize wrapper script. It’s also possible to embed metadata with measured loudness, which some (but not all) players recognize; but in the context of LinguaLibre, it might be best to normalize loudness on the server and resample the signal accordingly. — Sascha (talk) 16:53, 10 January 2019 (UTC)
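As a rough idea of the "call ffmpeg directly" route, here is a Python sketch that builds a single-pass call to ffmpeg's `loudnorm` filter. The target values (-23 LUFS integrated loudness, -1 dBTP true peak) are my reading of EBU R 128 and should be checked before any real use; the function name is hypothetical, and a two-pass run or the ffmpeg-normalize wrapper would likely give more accurate results.

```python
import subprocess
from typing import List


def loudnorm_command(src: str, dst: str,
                     target_lufs: float = -23.0,
                     true_peak_db: float = -1.0) -> List[str]:
    """Build a single-pass ffmpeg loudnorm call (assumed R 128 targets)."""
    audio_filter = f"loudnorm=I={target_lufs}:TP={true_peak_db}"
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]


def normalize(src: str, dst: str) -> None:
    """Run ffmpeg; requires ffmpeg on the PATH."""
    subprocess.run(loudnorm_command(src, dst), check=True)


print(" ".join(loudnorm_command("in.wav", "out.wav")))
```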

I would like this normalization for my own usage as well (language learning).
Note @Sascha : relevant loudness normalization, denoising and fade-in/fade-out cleanup commands should be documented in Help:Main#Download,_clean,_web_use > Help:SoX (to rename?). Denoising and fading are not used server-side so far. 0x010C coded the recorder JS and can give specifics. I am of the opinion that such cleanup scripts will come in handy sooner (server side) or later (after dataset download). Yug (talk) 18:33, 10 January 2019 (UTC)

Google:EBU R128 Loudness Normalisation ffmpeg > Audio Loudness Normalization With FFmpeg, Answer: How can I normalize audio using ffmpeg?. Yug (talk) 18:38, 10 January 2019 (UTC)

If normalization was done before uploading to Wikimedia Commons, all Wikipedia users would benefit (eg. when someone clicks on pronunciation icon on Wikipedia, they’d hear the recording in uniform loudness, denoised, etc.). If normalization is done in utility scripts called by end users, the set of people who benefit from this will be much smaller. The trade-off is that the recordings wouldn’t get preserved in their original form, but that’s probably not much an issue for LinguaLibre? — Sascha (talk) 06:24, 11 January 2019 (UTC)

Sascha, could you open a Phabricator ticket to track this proposal? Pamputt (talk) 06:49, 11 January 2019 (UTC)

Sure, filed T213535. — Sascha (talk) 11:17, 11 January 2019 (UTC)

Phabricator is starting to accumulate a lot of server-side development to do. Not sure volunteers and the open-source model will be productive enough. Maybe we should ask for funding for 2 months of dev work. In France that is about 6~8k€. Any leads ? Wikimedia France ? Grants ? Yug (talk) 16:00, 11 January 2019 (UTC)

Request for Comment: Moving from ISO 639-3 language codes to IETF BCP47

Hi Lingua Libre users,

Sascha suggested several times that Lingua Libre should switch from ISO 639-3 language codes to IETF BCP47 language tags. If we do that, it will be a major change in the Lingua Libre code-base. I will summarize here the different usages, pros & cons of such a switch.

Please share your opinion on this below!

Thank you all for your participation. — 0x010C ^~talk~ 16:59, 12 January 2019 (UTC)

Overview

Lingua Libre uses Wikidata Qids as the internal identifier of a language. So the proposed change will not affect the core of the Record Wizard. Currently, ISO 639-3 codes are used in five cases:

For the name of pages containing lists in the list namespace (in the format [[List:ISO/List name]], with ISO the ISO 639-3 code);
In the name of the datasets archives;
In the description of the local item of each audio recording;
To forge Wikimedia Commons's category names;
To forge each file name that is uploaded on Wikimedia Commons;

Technical considerations

If we switch from one language tag to another, to be consistent and use the new language tag everywhere, we would have to:

Create a new property BCP47, and add it to every language item locally, so that the Record Wizard can use it (a bot can import them from Wikidata);
Rename all local word lists (can be done by hand, we don't have many lists for now);
Make a quick adaptation in the script that generates the datasets;
Rename all existing Wikimedia Commons categories and move all the audio recordings (a bot is required there);
Update the description of the item of every audio recording in our database (a bot can do it);
Change the way the Record Wizard manages the recording of duplicate words in two different recording sessions: it currently checks whether a file with the forged name already exists on Wikimedia Commons, but as the format of the name would change, we won't be able to rely on that anymore.

Pros

BCP47 is widely used in computing standards;
It has codes for way more languages and dialects;
It will solve the categorization issue we have currently on Wikimedia Commons (see T208641);
We will have a language code to display for way more languages and dialects (we currently only show the Wikidata Qid in file names for small languages, which is not very user-friendly, e.g. File:LL-Q36759-Assassas77-歡喜.wav);
It would allow word lists to work as expected for small languages / dialects;
Some Wiktionaries (like the French Wiktionary) use this standard to refer to a language in their templates; this is also the case of Wikibase (and so Wikidata) for the language of labels and description.

Cons

As we cannot rename 60,000+ files on Wikimedia Commons, two different file name formats will have to coexist (but this is not an issue if you use the SPARQL endpoint to extract the metadata);
As of today, only 3003 languages have their IETF language tag filled in on Wikidata (we currently have 8028 languages with an ISO 639-3 code listed);
Once the changes to the Record Wizard are made and the migration scripts are ready to run, we would have to turn off the Record Wizard for one or several days while the different bots are running, to avoid unsynchronized items and conflicts.

— 0x010C ^~talk~ 16:59, 12 January 2019 (UTC)

Comments

Support: This will be a hard change but if it has to be done, it's better to do it now rather than in several years. — 0x010C ^~talk~ 16:59, 12 January 2019 (UTC)
Oppose: French Wiktionary doesn't use IETF codes. IETF handles regional languages haphazardly; it's worse than ISO 639-3. We do not use IETF codes, ever, at any point. Either we take the ISO code, which works for about 5000 languages, or we take the name of the language in French, or in English if absent. Today, Wiktionary contributors tend to move further and further away from codes and toward language names as primary keys. The only organization with unanimous support on languages, because it is run solely by linguists, is Glottolog; if need be, we can align with it, they are the most neutral. Lyokoï (talk) 17:46, 12 January 2019 (UTC)
Oppose (weakly) IETF seems to assign codes to more languages than ISO 639-3. Yet, it does not solve all the issues, because I guess it is possible to find languages/dialects that have neither an ISO 639-3 nor an IETF code. In such cases, all the issues we have with ISO 639-3 remain the same. If we have to switch to another code system, it has to solve some issues, not only postpone them. For now, the only code that is flexible and can describe all languages/dialects is the Wikidata code, but there are probably other issues if we decide to use it. But since I do not know IETF codes precisely, I may be wrong, so I do not want to oppose strongly. Pamputt (talk) 19:37, 12 January 2019 (UTC)
Oppose per Pamputt: if it does not cover all the dialects then we still have the same problem. Also I don't feel comfortable with two systems in filenames on Commons. I have lots of homonyms in Polish and I am afraid I would have two files for the same pronunciation from Lingua Libre for one transcription. That would be a nightmare for bot operators adding audio files to Wiktionaries. KaMan (talk) 09:00, 13 January 2019 (UTC)
Support: There are a couple of misunderstandings here. IETF BCP47 is actually not yet another random codelist that would be different from ISO codes. Rather, BCP47 is a standardized system (and very widely used, eg. in HTML, XML and HTTP) that combines subtags from other standards. For languages, subtags are taken from ISO 639; for countries, from ISO 3166-1; for provinces/states, from ISO 3166-2; etc. Also, you can add custom information into BCP47 tags without breaking the syntax; this could be used for embedding Wikidata IDs. Here’s a few examples: `en` for English (from ISO 639-1); `haw` for Hawaiian (from ISO 639-3, because Hawaiian has no two-letter code in ISO 639-1); `fr-CA` for Canadian French (language + country); `pt-AO` for Angolan Portuguese; `es-419` for Latin American Spanish (419 is the United Nations M.49 code for Latin America). There is a registry for standardized variants, for example the BCP47 code `rm-sursilv` stands for the Sursilvan variant of Romansh. When a language does not fit into the scheme, you can always append (short) pieces of “private” data after `-x-`. For example, you could encode Verlan (which doesn’t have an ISO language code) as an IETF BCP47 language tag `mis-x-Q1429662` or so. Admittedly, the Wikipedia article about BCP47 is not very helpful at the moment, and the standard itself is very technical. — Sascha (talk) 20:10, 13 January 2019 (UTC)
As I said, I do not know a lot about IETF BCP47 so I may be wrong. Yet, from the examples you give, you say that the language code comes from ISO 639, so actually if a language does not have an ISO 639 code, then BCP47 will not have one either. The only advantage I see, compared to ISO 639, is that it can represent dialects and regional languages (Canadian French for example). You write that if a language does not have an ISO 639 code, then we can use something like `mis-x-Q1429662`. I do not see what the advantage is compared to simply using the Wikidata ID (Q1429662 instead of mis-x-Q1429662). Pamputt (talk) 22:21, 13 January 2019 (UTC)
In names of dataset archives, names of uploaded pronunciation files, and in the other places where Lingua Libre currently uses ISO 639-3 codes, a BCP 47 tag would be easier to understand than just the Wikidata ID alone. For example, an IETF BCP 47 tag nan-x-Q36759 would identify Teochew as a variant of Southern Min (ISO 639-3: nan) while still pointing to Wikidata Q36759 for the exact identification. — Sascha (talk) 07:03, 14 January 2019 (UTC)
Support: I already have problems because it is impossible to distinguish variants in Occitan. For instance, if a Gascon Occitan speaker wants to record words from a predefined list (because he has no idea which words to record), he can't search for a list in his variant. He will click randomly on list names until he gets one in his variant (which can take a long time and cause him to give up).
Second, on Commons, it will be easier for people who don't know Wikidata (for instance a teacher who wants to download words in a variety for his pupils to listen to) to get the variety of the word directly in the results of the search page (from the filenames).

Third, for compatibility with developers' programs. At Lo Congrès, we work with RFC5646 (we needed a way to indicate variants). If we make a program which queries Lingua Libre, we need to add a query via Wikidata to get the variety code compatible with our programs. It slows the page down and makes the work longer.

I work every day on a language with variants, and for the sort of work I (and others) do, it would be a real improvement. So maybe IETF is theoretically problematic in its language classification, but ISO 639-3 is pragmatically problematic. As a developer, I prefer a usable system that doesn't exactly fit reality to a theoretically right system that can't be used without a lot of difficulty. — Unuaiga (talk) 16:21, 14 January 2019 (UTC)
Support (for human friendly filenames): I was slow to answer because it's indeed a tricky issue. For all recordings, the values of their langtags −Qids, ISO639-3, BCP47− exist or can be created. Qids are always new creations assigned when creating the language on LL's wiki, whereas ISO639-3 and BCP47 codes can exist OR be extended. Each langtag family can cover 5000+ languages and do the job we need them for up to 2025~2030, with custom extensions when required, which is easier for Qids (still normal creation) and BCP47 (custom extension). Then, the equivalences between these 3 or more langtags can be found by Wikimedia editors or outsiders via the Qid or Wikidata pages in a few clicks. After that, each langtag and its value can find its way back into the filename via some replacement script. So for me these Qid, ISO639-3 and BCP47 langtags are technically equivalent : each can do the work and they are quite interchangeable.
The question is about HUMAN USAGE. Three groups of humans will manage these files and filenames : 1) LL speakers, organizers and editors ; 2) Wikimedia users ; 3) outsiders like Android app developers and non-recording linguists. Who is more important ? For whom do we want to make access, readability and work easier ? What is their spontaneous knowledge ?
- The current way: opaque Qid-based filenames online, post-download processing to make them readable. We have filenames with unreadable Qids, with the actual human-friendly value on the LinguaLibre Qid's page. So for us LL editors and maintainers, if we find out our language definition is obsolete, we just update the LL Qid's page, and new people coming there for reference will see the corrected values. End users on Wikimedia cannot directly recognize the language. After downloading files or datasets, batch renaming commands documented on LL can help end users rename files as they wish.
- Datasets and filenames should be human-readable pre-download. If so, then the ISO 639-3-based IETF BCP-47 can cover 99% of our easy usages, and BCP-47 has native flexibility to create codes for the 1% weird cases. Wikimedia users and outside-Wikimedia users will appreciate it. If we make mistakes, the vitality of open data will spread wrongly named files and get us into trouble.
- We will need better LinguaLibre-Commons maintenance bots and more bot masters, so we don't always rely on 0x010C, who thereby becomes our bottleneck. We also need a way to massively rename or remove files from Commons.
- I personally think we have to ease readability for outsiders, app developers and linguists who won't find their way through LL documentation. So I support a move toward human-friendly filenames, from the LL website down to Wikimedia sites and post-download outsiders' desktop computers. Yug (talk) 21:18, 14 January 2019 (UTC) -note: I have a cold so my English seems worse than usual, my apologies.

Implications
Approach	For LL editors	Wikimedia editors	Outsiders
Custom Qid codes, created as needed, opaque (LinguaLibre Qids)	Correcting language scope/definition : easy, only change value of fields IETF BCP-47. Existing files with this Qid, wherever they are, implicitly follow the corrected value.	Opaque filenames, not editable because by convention. Readable value to find on LinguaLibre. Commons page can have a def and links.	Opaque filenames. Value on LinguaLibre's Qid page Post-download: commands to rename batch of files, documented in Help:Main.
Existing codes, extensible, readable (ex:ISO 639-3-based IETF BCP-47)	Correcting language scope/definition : Hard, only new recording affected. Existing files with this code will each require correction.	Readable filenames, no need to rename.	Readable filenames. Ready to go.

Tricky example

Let's take a concrete example, what would be the code for the Gudjal language, a Pama-Nyungan language spoken in Australia? This language has neither ISO 639 code nor BCP47 code. It has a Glottolog, AUSTLANG and endangeredlanguages.com identifiers. So if we decide to switch to BCP47, what would be the advantage compared to the existing one (ISO 639) because there is no code in both systems? We simply delay the discussion on the problem of languages or dialects without code. Pamputt (talk) 12:20, 15 January 2019 (UTC)

Since Gudjal is a dialect/variant of Warrungu whose BCP47 code is wrg, the BCP47 code for Gudjal would be wrg-x-Q60610865. To find the prefix for arbitrary languages in Wikidata, it looks like we’ll have to clean up Wikidata a bit. For example, currently there’s no statement in Wikidata linking Verlan to French; we’d need that to come up with the code fr-x-Q1429662 for Verlan. — Sascha (talk) 19:23, 15 January 2019 (UTC)

Indeed, wrong example, because some works say this "language" is actually a dialect of the Warrungu language.

So let us consider the Bunwurrung language, another Pama-Nyungan language spoken in Australia. This one does not seem to be (yet?) a dialect of another language. So, what would be its BCP47 code? The same for the Bwenelang language, an Austronesian language spoken in Vanuatu. Pamputt (talk) 20:11, 15 January 2019 (UTC)

In the short term, their codes would be mis-x-Q4997965 and mis-x-Q56261010. In the long term, it would be good to assign ISO 639-3 codes to these languages. This is actually quite easy (if there are references about the language). See FAQ, or this example registration request. Requests are reviewed once per year. All changes to ISO 639-3 also go into the registry for BCP 47. — Sascha (talk) 11:23, 16 January 2019 (UTC)
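
As a rough illustration of the two cases discussed above (a dialect whose parent language has a code, and a language with no code at all), the tags could be assembled as in the Python sketch below. The function name `make_langtag` and its parameters are hypothetical and not part of LinguaLibre. The sketch only concatenates subtags and does not validate the result against the BCP47 specification.

```python
from typing import Optional


def make_langtag(qid: str, parent_code: Optional[str] = None) -> str:
    """Build a private-use tag of the form '<prefix>-x-<Qid>'.

    parent_code is the existing code of the language or parent language
    (for example 'wrg' for Warrungu). When no code exists, the thread
    suggests falling back to 'mis'.
    """
    prefix = parent_code if parent_code else "mis"
    return f"{prefix}-x-{qid}"


# Gudjal, treated as a variant of Warrungu (wrg)
print(make_langtag("Q60610865", "wrg"))
# Bunwurrung, no existing code
print(make_langtag("Q4997965"))
```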

Thanks for the examples. In my opinion, "mis-x-Q4997965" is more cryptic than just "Q4997965". If a language has an ISO 639-3 code, then the BCP47 code is indeed easier to understand than a Qid. So, as I already said, the advantage is that codes for dialects become clearer, but it does not solve all the problems (such as these two languages). Since such a code change will not be done every month, I would prefer to have a better (more universal) solution before breaking/changing everything. Pamputt (talk) 17:42, 16 January 2019 (UTC)

Encoding Wikidata IDs into BCP47

By the way, in a BCP47 language tag such as wrg-x-Q60610865, anyone can put anything after -x-, which is flexible but not ideal for an identification scheme. I’m now preparing a formal proposal for encoding Wikidata IDs into BCP47 language tags. BCP47 already draws subtags from many other registries such as ISO 639, ISO 3166, UN M.49 and others, so why not treat Wikidata as yet another “registration authority”? If the proposal is accepted, the official syntax would be something other than -x-. This is just for your information; I have no idea whether the proposal will be accepted, and it usually takes a long time to make changes. — Sascha (talk) 12:20, 16 January 2019 (UTC)
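
Because anything may follow `-x-`, a consumer of such tags has to check the shape itself before trusting the suffix as a Wikidata ID. Here is a minimal Python sketch of that check. The name `split_langtag` and the pattern are assumptions made for illustration; they cover only a 2 or 3 letter prefix followed by `-x-Q<digits>`, as in the examples of this thread, and not the full BCP47 grammar.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: language prefix, "-x-", then a Wikidata Qid.
QID_TAG = re.compile(r"^([A-Za-z]{2,3})-x-(Q[1-9][0-9]*)$")


def split_langtag(tag: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, qid) if the tag follows the '<prefix>-x-<Qid>' shape,
    or None when the private-use part is something else."""
    match = QID_TAG.match(tag)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_langtag("wrg-x-Q60610865"))
print(split_langtag("wrg-x-whatever"))
```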

Do not offer terms for which a recording has already been uploaded

Hello.

Everything is in the title: if I reuse the terms of a list that has already been partially recorded, LinguaLibre offers to re-record all of its entries, which hardly seems relevant to me. It should instead offer only the terms for which I have not yet recorded anything.

Regards. Penegal (talk) 17:36, 20 January 2019 (UTC)

Hello Penegal! This feature has been requested before. We have a Phabricator task for it (T212580), which defines the problem and is tracked on the LinguaLibre developers' dashboard. Previous discussions concluded that this feature isn't easy to provide. We are calling for volunteer developers with the required skills to jump in and develop a script providing this service.

Which word lists do you work with? You could compare the lists before starting, using comm. An alternative is to proceed not via thematic lists or extracts from texts, as is currently done for FRA, but methodically, by recording words from the most frequent to the least frequent. We currently don't have a large frequency list for FRA. If this would meet your needs, please message me. Yug (talk) 22:12, 20 January 2019 (UTC)
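
For those who do not have `comm` at hand, the same comparison can be sketched in Python. The function name `words_left_to_record` and the sample words are hypothetical, and the sketch assumes both lists are plain sequences of words, one entry per item. Unlike `comm`, it does not need sorted input and it keeps the order of the original list.

```python
from typing import Iterable, List


def words_left_to_record(word_list: Iterable[str],
                         already_recorded: Iterable[str]) -> List[str]:
    """Return the words of word_list that are absent from already_recorded,
    in their original order and without duplicates."""
    done = {word.strip() for word in already_recorded}
    seen = set()
    remaining = []
    for word in word_list:
        word = word.strip()
        if word and word not in done and word not in seen:
            remaining.append(word)
            seen.add(word)
    return remaining


full_list = ["chat", "chien", "oiseau", "chien"]
recorded = ["chien"]
print(words_left_to_record(full_list, recorded))
```

In practice the two inputs could be read from text files, for example the list to record and an export of the words already uploaded.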



Contents

Chatroom FAQ

Using the Lingua Libre Bot on incubator:shy

List modeled on PetScan

Geographical variations

Thesaurus

General issues + issues with Odia and Asian writing systems

First use: a few questions

Formosan languages workshop

1) Requesting languages additions

2) "There are 42 dialects under 16 aboriginal languages".

3) "Is it ok to use `mhway su (谢谢)` ?" (target word + translation)

Thesaurus (2)

Feature request: ask to reuse existing identical audio if available

Homonymy

Categories

How to properly credit lists

Upload errors

Menu and naming

2019 Prague Wikimedia Hackathon and scholarship (bourse)

Word frequencies for prioritizing, UNILEX and licence

Enable all human languages in bulk?

Use IETF BCP47 instead of ISO 639?

Documenting langtag usages on LL

Support variants of Romansh

Chakma

Compress audio?

Category “Lingua Libre pronunciation-other”

Normalize loudness

Request for Comment: Moving from ISO 639-3 language codes to IETF BCP47

Overview

Comments

Tricky example

Encoding Wikidata IDs into BCP47

Do not offer terms for which a recording has already been uploaded
