Difference between revisions of "Psubhashish"
Psubhashish (talk | contribs) (→Lists) |
Psubhashish (talk | contribs) m (I AM ON A BREAK) |
(36 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
− | + | {{Userboxtop|Rewards}} | |
+ | {{50k barnstar}} | ||
+ | {{Speaker of the month|07/2021|5960}} | ||
+ | {{Speaker of the month|08/2021|8381}} | ||
+ | {{Speaker of the month|10/2021|5464}} | ||
+ | {{Speaker of the month|11/2021|5536}} | ||
+ | {{Speaker of the month|01/2022|7259}} | ||
+ | {{Speaker of the month|08/2022|5003}} | ||
+ | {{Speaker of the month|03/2023|3346}} | ||
+ | {{Speaker of the month|06/2023|3974}} | ||
+ | {{Userboxbottom}} | ||
+ | {{#babel:records-ori}} | ||
− | == | + | :: ''I AM ON A BREAK to recalibrate, focus on other life priorities, regain energy to come back to this beautiful project soon.'' |
− | + | ||
+ | I am a Wikimedian, documentary filmmaker and [https://www.nationalgeographic.org/find-explorers/subhashish-panigrahi National Geographic Explorer]. I am interested in studying access, decolonization of knowledges and the free-culture movement. I have been active in language documentation with a focus on endangered languages and the use of multimedia as a democratic tool. I have also been an organizational leader and have served in both professional and volunteer-advisory roles at the Internet Society, Wikimedia Foundation, Mozilla, Centre for Internet Society, Creative Commons, Digital Language Diversity Project (DLDP), Wikitongues and the now-defunct ScholarlyHub. | ||
+ | |||
+ | I am very interested in publicly owned and publicly governed multimedia archives, and I put some volunteer time into action. I have contributed over [https://lingualibre.org/wiki/LinguaLibre:Stats/Speakers 68,000 pronunciation recordings on Lingua Libre] and over 4,000 sentence recordings on Mozilla Common Voice. My primary contribution to Lingua Libre is in the [[:w:Odia_language#Standardization_and_dialects|Central]] (''Mugalbandi'') and [[:w:Baleswari Odia|Baleswari]] dialects of the Odia language. | ||
+ | |||
+ | == LinguaLibre/other pronunciation-related publications == | ||
+ | * Subhashish Panigrahi (2022), [https://dl.acm.org/doi/10.1145/3487553.3524931 Building a Public Domain Voice Database for Odia], Companion Proceedings of the Web Conference 2022, Virtual Event, Lyon, France, pp. 1331–1338, DOI: 10.1145/3487553.3524931, ISBN: 978-1-4503-9130-6 | ||
+ | * Subhashish Panigrahi (2022), [https://diff.wikimedia.org/2022/03/10/building-a-50000-pronunciation-data-repository-in-the-odia-language/ Building a 50,000 pronunciation data repository in the Odia language], Diff | ||
+ | |||
+ | == Things I have made/broken == | ||
+ | * [[/tools/Prepare words for Lingua Libre|Prepare words for Lingua Libre]]: a tool to copy text from any source and clean it up to create a list of words ready to be used in the RecordWizard of Lingua Libre. | ||
+ | * [https://github.com/ofdn/Kathabhidhana Kathabhidhana]: an open-source toolkit to record a large number of words in any language (inspired by another open project by T. Shrinivasan) (see [https://twitter.com/i/events/898061810217213956 tweet thread], [https://rising.globalvoices.org/blog/2017/03/28/a-new-audio-uploading-tool-for-crowdsourced-wiktionary-project-in-odia-language/ coverage on Rising Voices], [https://opensource.com/article/17/5/simple-command-line-tool-recording-audio blog], [https://wikimania2017.wikimedia.org/wiki/Submissions/Kathabhidhana:_Recording_words_for_Wiktionary_and_preparing_for_an_AI_assistant selected talk at Wikimania 2017] and coverage on [https://fr.m.wikipedia.org/wiki/Wikip%C3%A9dia:RAW/2017-05-25 French Wikipedia newsletter RAW]) | ||
+ | |||
+ | == Personal lists == | ||
+ | * [https://w.wiki/77Hs All lexeme forms in Odia missing a pronunciation] (inspired by Adithya K's [https://w.wiki/77Hu query]) | ||
+ | * [[List:Ory/All standard Odia|All standard Odia]] ([https://lingualibre.org/index.php?search=&search=List%3AOry all lists]) | ||
+ | * [[List:Ory/Baleswaria]] ([[:w:Baleswari Odia|Baleswaria]] dialect of Odia; ongoing, total #words: 1046 by June 1, 2020; [[List:Ory/Baleswaria/recording_complete|words with recording completed]]) | ||
+ | * TBD: [[:or:wikt:ଶ୍ରେଣୀ:ବାଲେଶ୍ୱରୀ ଶବ୍ଦ|Baleswari words from Ordia Purnachandra Bhashakosha]] | ||
+ | * [[List:Ori/Places of Odisha|List of Places in Odisha]] (villages, towns, administrative blocks, etc.). Words collected from: | ||
+ | ** [https://kalahandi.nic.in/od/%e0%ac%97%e0%ad%8d%e0%ac%b0%e0%ac%be%e0%ac%ae-%e0%ac%93-%e0%ac%aa%e0%ac%9e%e0%ad%8d%e0%ac%9a%e0%ac%be%e0%ad%9f%e0%ac%a4/ Kalahandi district official site] | ||
+ | ** [https://malkangiri.nic.in/od/ Malkangiri] (village names missing) | ||
+ | ** [https://koraput.nic.in/od/ Koraput] (village names exist) | ||
+ | ** [https://bhadrak.nic.in/od/ Bhadrak] (village names exist) | ||
+ | ** [https://rayagada.nic.in/od/ Rayagada] (village names missing) | ||
+ | |||
+ | == Potential bugs or required features == | ||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | ! Kind (issue/new feature request) !! Summary !! Context/Steps to reproduce !! Response | ||
+ | |- | ||
+ | | Suspected issue | ||
+ | || Words already uploaded using LL do not get removed while creating a new list | ||
+ | || | ||
+ | # Included a word "ଉଦ୍ଦେଶ୍ୟରେ" in a new batch and selected "Remove words already recorded" while loading words from a [[List:Ory/All standard Odia|local list]] | ||
+ | # Even though the word already exists in two places on Commons ([[:commons:File:Or-ଉଦ୍ଦେଶ୍ୟରେ.wav|first]], [[:commons:File:Or-ଉଦ୍ଦେଶ୍ୟରେ 01.wav|second]]; both uploaded using LL), it does not appear as a duplicate in the LL Record Wizard | ||
+ | || | ||
+ | Hi {{ping|Psubhashish}} could you please try to reproduce this issue with recordings that were not renamed? Just to be sure: the Record wizard can only remove words that the current speaker already recorded, for the moment it can't remove words recorded by other speakers (there is a [[phabricator:T231559|ticket on phabricator]] asking for this feature). — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 12:13, 18 August 2021 (UTC) | ||
+ | |- | ||
+ | | Feature | ||
+ | || LL helps remove words recorded already. But there is no way to download that word. This would help a lot in creating a list locally. | ||
+ | || | ||
+ | || | ||
+ | :Could you develop a little bit your idea please? You would like to export a textual file containing the words that you already recorded? Or you would like to download the sound files you uploaded? — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 12:15, 18 August 2021 (UTC) | ||
+ | :: {{ping|WikiLucas00}} ha ha you caught off guard! I was trying to make a rough list as I am discovering new things here first before fleshing out suggestions for improvement. By listing, I mean a text file containing the words recorded, not the audio file. I guess one can download audio files from Commons in bulk too. But that's another question and do share if there is a way that you might know. --[[User:Psubhashish|Subhashish]] ([[User talk:Psubhashish|talk]]) 09:32, 21 August 2021 (UTC) | ||
+ | :::{{ping|Psubhashish}} Using Petscan and your Lingua Libre category on Commons, you can export the text list of all your recorded files. [https://petscan.wmflabs.org/?psid=19878687 Here is the query]. You can change the output to plain text, wikicode, json etc if you want to (in the Output tab). I hope this fits to your needs. All the best — '''[[User:WikiLucas00|WikiLucas]]''' [[User talk:WikiLucas00|(🖋️)]] 15:17, 21 August 2021 (UTC) | ||
+ | |- | ||
+ | | Feature | ||
+ | || Number counter while reviewing recorded audio | ||
+ | || While reviewing recorded audio it is not possible to see the change in the counter at the bottom. For instance, I am reviewing the recorded audio number 10 and the total number of recorded sounds is 300. I cannot see the exact number of a particular sound in the counter. | ||
+ | || | ||
+ | |- | ||
+ | | Issue | ||
+ | || RecordWizard field "Spoken languages" is confusing. | ||
+ | || ''Should one add all the languages/dialects they know or the one they are going to speak in the next step in a particular batch? If I am a speaker who is multilingual (which is the case for most people in South Asia), I'd prefer that the form asks me the specific dialect/language I am going to speak in a batch. I might speak six languages but they are not relevant for each word in a particular batch.'' | ||
+ | || | ||
+ | |- | ||
+ | | Issue | ||
+ | || "Place of residence" is meaningless without the "place of language learning". | ||
+ | || ''One might have learned a language in one place but might be living in another. The latter might or might not have an impact on the language that they speak. However, where they learned the language is very important (in most cases).'' | ||
+ | || | ||
+ | |- | ||
+ | | Feature | ||
+ | || Need an option to record offline and upload/sync when connected to the internet | ||
+ | || | ||
+ | * I am planning a workshop to record the pronunciation of words in an indigenous language in a remote place. This would mean traveling to places that probably have no internet connectivity, recording there offline, and uploading to LL later when connected to the internet. | ||
+ | * It might be possible to run a MediaWiki + Wikibase environment locally by forking LL. There are two challenges: | ||
+ | :: a. I don't know yet how to set up such an environment locally. | ||
+ | :: b. I don't know how to enable the local wiki to talk to LL once connected to the internet. | ||
+ | || | ||
+ | |- | ||
+ | || Potential feature | ||
+ | || How to record words in a language with no writing system/script? | ||
+ | || | ||
+ | * When a language is only oral and has no formal writing system/script of its own, the International Phonetic Alphabet (IPA) is often used by linguists to "write" the pronunciations. Will IPA-based word listing work on LL? | ||
+ | * Another possibility in such a case is to rely on the speaker's familiarity with a neighboring dominant script. This can be problematic on many levels (for starters, colonization by users of dominant scripts) but can be a temporary fix just for the field recording. If such recordings are made and uploaded, how can they be converted into IPA later so that the file names do not show the dominant script? | ||
+ | || | ||
+ | |- | ||
+ | || Feature | ||
+ | || Parsing words from any public web page | ||
+ | || Legally and technically, words per se are not copyrighted. Hence, parsing a page and creating a list of words is a great way to enable recording words from different topics. Wikipedia categories or Wiktionary entries are not always diverse: their scope is limited to the personal interests of active Wikimedians, and a good amount of content does not make its way to these projects because of citation issues (not everything that is public is citable, yet such sources might contain many words on a particular topic and hence are of interest to LL). | ||
+ | || | ||
+ | |- | ||
+ | || Bug | ||
+ | || All words under a dialect (e.g. Baleswari-Odia) should be listed under the language (e.g. Odia) in Statistics | ||
+ | || A language being a superset of its dialects, all words recorded under a dialect should be listed under the language as well. Right now each dialect has its own category in the Statistics page, which is great. But these words are not counted in the total number of recordings under the respective language name. | ||
+ | || | ||
+ | |} |
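As a rough local workaround for the duplicate-removal issue in the first row, a word list can be filtered against a plain-text export of already uploaded file names (for instance, the PetScan export mentioned in the responses). The sketch below is only an illustration in Python: it assumes file titles shaped like the two examples above (`Or-<word>.wav` and `Or-<word> 01.wav`), and the function names are hypothetical. It is not how the Record Wizard itself detects duplicates.

```python
import re

# Assumed title shape, taken from the two examples in the table:
#   "File:Or-<word>.wav" and "File:Or-<word> 01.wav"
# Everything up to the first hyphen is treated as a prefix, and a trailing
# " <digits>" before the extension is treated as a rename suffix.
_TITLE = re.compile(r"^(?:File:)?[^-]+-(?P<word>.+?)(?: \d+)?\.\w+$")


def word_from_title(title):
    """Return the word encoded in a file title, or None if the shape differs."""
    match = _TITLE.match(title)
    return match.group("word") if match else None


def remove_recorded(words, recorded_titles):
    """Drop every word that already appears in the list of recorded file titles."""
    recorded = {w for w in map(word_from_title, recorded_titles) if w}
    return [w for w in words if w not in recorded]


if __name__ == "__main__":
    titles = ["File:Or-ଉଦ୍ଦେଶ୍ୟରେ.wav", "File:Or-ଉଦ୍ଦେଶ୍ୟରେ 01.wav"]
    print(remove_recorded(["ଉଦ୍ଦେଶ୍ୟରେ", "ଭାଷା"], titles))
```

One caveat of this simplification: a word that genuinely ends in a space followed by digits would lose that ending, so the suffix rule may need adjusting for such lists.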
Latest revision as of 18:29, 19 November 2023
- I AM ON A BREAK to recalibrate, focus on other life priorities, regain energy to come back to this beautiful project soon.
I am a Wikimedian, documentary filmmaker and National Geographic Explorer. I am interested in studying access, decolonization of knowledges and the free-culture movement. I have been active in language documentation with a focus on endangered languages and the use of multimedia as a democratic tool. I have also been an organizational leader and have served in both professional and volunteer-advisory roles at the Internet Society, Wikimedia Foundation, Mozilla, Centre for Internet Society, Creative Commons, Digital Language Diversity Project (DLDP), Wikitongues and the now-defunct ScholarlyHub.
I am very interested in publicly owned and publicly governed multimedia archives, and I put some volunteer time into action. I have contributed over 68,000 pronunciation recordings on Lingua Libre and over 4,000 sentence recordings on Mozilla Common Voice. My primary contribution to Lingua Libre is in the Central (Mugalbandi) and Baleswari dialects of the Odia language.
LinguaLibre/other pronunciation-related publications
- Subhashish Panigrahi (2022), Building a Public Domain Voice Database for Odia, Companion Proceedings of the Web Conference 2022, Virtual Event, Lyon, France, pp. 1331–1338, DOI: 10.1145/3487553.3524931, ISBN: 978-1-4503-9130-6
- Subhashish Panigrahi (2022), Building a 50,000 pronunciation data repository in the Odia language, Diff
Things I have made/broken
- Prepare words for Lingua Libre: a tool to copy text from any source and clean it up to create a list of words ready to be used in the RecordWizard of Lingua Libre.
- Kathabhidhana: an open-source toolkit to record a large number of words in any language (inspired by another open project by T. Shrinivasan) (see tweet thread, coverage on Rising Voices, blog, selected talk at Wikimania 2017 and coverage on French Wikipedia newsletter RAW)
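The kind of clean-up that a tool like "Prepare words for Lingua Libre" performs can be sketched in a few lines of Python. This is not the tool's actual code: the separator set, the function names and the `# word` output format are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed separators: whitespace, digits, common punctuation and the danda
# marks used in Odia and other Indic scripts.
_SEPARATORS = re.compile(r"""[\s\d.,;:!?"'()\[\]{}<>/\\|।॥-]+""")


def prepare_words(text):
    """Return the unique words of `text`, in the order they first appear."""
    # Normalize so that visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    seen = set()
    words = []
    for token in _SEPARATORS.split(text):
        if token and token not in seen:
            seen.add(token)
            words.append(token)
    return words


def to_wiki_list(words):
    """Format the words as one numbered wiki list item per line."""
    return "\n".join(f"# {word}" for word in words)


if __name__ == "__main__":
    print(to_wiki_list(prepare_words("Hello, world! Hello again (world).")))
```

Deduplication here is case-sensitive and keeps first-seen order, so the resulting list follows the source text; sorting or lowercasing would be a separate choice.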
Personal lists
- All lexeme forms in Odia missing a pronunciation (inspired by Adithya K's query)
- All standard Odia (all lists)
- List:Ory/Baleswaria (Baleswaria dialect of Odia; ongoing, total #words: 1046 by June 1, 2020; words with recording completed)
- TBD: Baleswari words from Ordia Purnachandra Bhashakosha
- List of Places in Odisha (villages, towns, administrative blocks, etc.). Words collected from:
- Kalahandi district official site
- Malkangiri (village names missing)
- Koraput (village names exist)
- Bhadrak (village names exist)
- Rayagada (village names missing)
Potential bugs or required features
Kind (issue/new feature request) | Summary | Context/Steps to reproduce | Response |
---|---|---|---|
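Suspected issue | Words already uploaded using LL do not get removed while creating a new list | 1. Included a word "ଉଦ୍ଦେଶ୍ୟରେ" in a new batch and selected "Remove words already recorded" while loading words from a local list. 2. Even though the word already exists in two places on Commons (File:Or-ଉଦ୍ଦେଶ୍ୟରେ.wav and File:Or-ଉଦ୍ଦେଶ୍ୟରେ 01.wav; both uploaded using LL), it does not appear as a duplicate in the LL Record Wizard. | Hi @Psubhashish could you please try to reproduce this issue with recordings that were not renamed? Just to be sure: the Record wizard can only remove words that the current speaker already recorded, for the moment it can't remove words recorded by other speakers (there is a ticket on phabricator asking for this feature). — WikiLucas (🖋️) 12:13, 18 August 2021 (UTC) |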
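Feature | LL helps remove words recorded already. But there is no way to download that word. This would help a lot in creating a list locally. | | Could you develop a little bit your idea please? You would like to export a textual file containing the words that you already recorded? Or you would like to download the sound files you uploaded? — WikiLucas (🖋️) 12:15, 18 August 2021 (UTC) @WikiLucas00 ha ha you caught off guard! I was trying to make a rough list as I am discovering new things here first before fleshing out suggestions for improvement. By listing, I mean a text file containing the words recorded, not the audio file. I guess one can download audio files from Commons in bulk too. But that's another question and do share if there is a way that you might know. --Subhashish (talk) 09:32, 21 August 2021 (UTC) @Psubhashish Using Petscan and your Lingua Libre category on Commons, you can export the text list of all your recorded files. Here is the query (https://petscan.wmflabs.org/?psid=19878687). You can change the output to plain text, wikicode, json etc if you want to (in the Output tab). I hope this fits to your needs. All the best — WikiLucas (🖋️) 15:17, 21 August 2021 (UTC) |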
Feature | Number counter while reviewing recorded audio | While reviewing recorded audio it is not possible to see the change in the counter at the bottom. For instance, I am reviewing the recorded audio number 10 and the total number of recorded sounds is 300. I cannot see the exact number of a particular sound in the counter. | |
Issue | RecordWizard field "Spoken languages" is confusing. | Should one add all the languages/dialects they know or the one they are going to speak in the next step in a particular batch? If I am a speaker who is multilingual (which is the case for most people in South Asia), I'd prefer that the form asks me the specific dialect/language I am going to speak in a batch. I might speak six languages but they are not relevant for each word in a particular batch. | |
Issue | "Place of residence" is meaningless without the "place of language learning". | One might have learned a language in one place but might be living in another. The latter might or might not have an impact on the language that they speak. However, where they learned the language is very important (in most cases). | |
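Feature | Need an option to record offline and upload/sync when connected to the internet | I am planning a workshop to record the pronunciation of words in an indigenous language in a remote place. This would mean traveling to places that probably have no internet connectivity, recording there offline, and uploading to LL later when connected to the internet. It might be possible to run a MediaWiki + Wikibase environment locally by forking LL. There are two challenges: (a) I don't know yet how to set up such an environment locally; (b) I don't know how to enable the local wiki to talk to LL once connected to the internet. | |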
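Potential feature | How to record words in a language with no writing system/script? | When a language is only oral and has no formal writing system/script of its own, the International Phonetic Alphabet (IPA) is often used by linguists to "write" the pronunciations. Will IPA-based word listing work on LL? Another possibility in such a case is to rely on the speaker's familiarity with a neighboring dominant script. This can be problematic on many levels (for starters, colonization by users of dominant scripts) but can be a temporary fix just for the field recording. If such recordings are made and uploaded, how can they be converted into IPA later so that the file names do not show the dominant script? | |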
Feature | Parsing words from any public web page | Legally and technically, words per se are not copyrighted. Hence, parsing a page and creating a list of words is a great way to enable recording words from different topics. Wikipedia categories or Wiktionary entries are not always diverse: their scope is limited to the personal interests of active Wikimedians, and a good amount of content does not make its way to these projects because of citation issues (not everything that is public is citable, yet such sources might contain many words on a particular topic and hence are of interest to LL). | |
Bug | All words under a dialect (e.g. Baleswari-Odia) should be listed under the language (e.g. Odia) in Statistics | A language being a superset of its dialects, all words recorded under a dialect should be listed under the language as well. Right now each dialect has its own category in the Statistics page, which is great. But these words are not counted in the total number of recordings under the respective language name. | |
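The roll-up requested in the last row can be illustrated with a small Python sketch. It assumes the statistics are available as a mapping from language or dialect name to a recording count, plus a mapping from each dialect to its parent language. Both mappings, the function name and the numbers in the example are made up for illustration and do not reflect how Lingua Libre stores its statistics.

```python
from collections import defaultdict


def roll_up(counts, parent_of):
    """Add each variety's recording count to all of its ancestor languages."""
    totals = defaultdict(int)
    for variety, count in counts.items():
        totals[variety] += count
        # Walk up the dialect -> language chain, guarding against cycles.
        seen = {variety}
        parent = parent_of.get(variety)
        while parent is not None and parent not in seen:
            totals[parent] += count
            seen.add(parent)
            parent = parent_of.get(parent)
    return dict(totals)


if __name__ == "__main__":
    # Illustrative numbers only.
    counts = {"Odia": 100, "Baleswari Odia": 40}
    parent_of = {"Baleswari Odia": "Odia"}
    print(roll_up(counts, parent_of))
```

Each dialect keeps its own total, so the per-dialect categories stay as they are while the language total also includes them.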