User:Titodutta/Bengali words from pages
This page narrates how I extract the Bengali word list from a page of a book.
Caveat lector
- Please use your common sense if you are using this for your own language. Each language may have its own issues. For example, in Bengali I struggle with hyphens/dashes and । (the Bengali full stop) in this process, although those are fixable too.
- A language community may have its own workflow or word priority list. If you or your community have one, you may want to work on that first.
- Copyright is important. Do not add copyrighted content.
Extracting from a page
I sometimes use this process:
- Go to a page which you want to use, such as: s:bn:গল্পগুচ্ছ/ঘাটের কথা
- Copy the entire text. Three things now need to be done: a) punctuation removal, b) converting spaces to new lines, c) removing duplicates (and possibly alphabetical sorting, which I prefer).
- One step is skipped here: how I remove ।, which is specific to Bengali and does not apply universally.
- Punctuation removal: Paste the entire text into a tool like this and remove the punctuation (be careful, hyphens get removed too: "intra-word" becomes "intraword").
- Spaces to new lines: Now you have a lot of text without punctuation and with a lot of spaces. Use a spaces-to-new-lines tool to put each word on its own line.
- Duplicate removal: Now there are many words, so I add another step. Take the resulting text to this tool, remove the duplicates, and I also click on "Sort alphabetically". (Note: Lingua Libre itself removes duplicates, but I keep this step to clean the word list. For example, if you have 10,000 words, then every word counts. Lingua Libre (or my computer :-D) starts getting slower for a list of more than ~4,000 words.)
- You have the clean list. Copy and paste it onto a List page. Additional tip: once the list is sorted alphabetically, you will mostly find the problematic entries at the beginning or at the end.
- I gave an example from Wikisource just above. This was a real example. See what the result page looks like: List:Ben/রবি ঠাকুর/ঘাটের কথা
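For anyone who prefers a script to the online tools, the three steps above (punctuation removal, spaces to new lines, duplicate removal with sorting) can be sketched in a few lines of Python. This is only an illustration, not the tools linked above: the function name `extract_words` is made up, and "punctuation" is assumed to mean every character in a Unicode punctuation category, which covers । and also hyphens.

```python
import unicodedata


def extract_words(text):
    """Return the unique words of `text`, sorted, with punctuation removed."""
    # a) Punctuation removal: drop every character whose Unicode category
    #    starts with "P". This includes the danda "।" and, as with the
    #    online tools, hyphens ("intra-word" becomes "intraword").
    cleaned = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # b) Spaces to new lines: split on any run of whitespace.
    words = cleaned.split()
    # c) Duplicate removal and sorting. Note that sorted() orders by
    #    Unicode code point, which is only an approximation of Bengali
    #    dictionary order.
    return sorted(set(words))


if __name__ == "__main__":
    sample = "আমি ভাত খাই। আমি intra-word দেখি।"
    # One word per line, ready to paste onto a List page.
    print("\n".join(extract_words(sample)))
```

Punctuation is deleted rather than replaced by a space, so two words separated only by a punctuation mark (with no space) would be joined; check the beginning and end of the sorted list for such entries.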
Questions
Answers to these questions could greatly improve the process.
- Step 4 above shows that I cannot easily remove ।. Sometimes I also do not want to remove hyphens. Do you know of any tool or process where I can say: please remove ।, but do not remove -?
- From Step 4 to Step 6, I actually go to three different tools to clean the data. Do you know of one tool that can do all of these together?
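One possible answer to both questions is a small script instead of an online tool. The sketch below is an assumption-laden illustration, not an existing tool: `strip_punctuation` and its parameters `keep` and `to_space` are hypothetical names. It keeps the characters listed in `keep` (the hyphen by default), turns the characters listed in `to_space` (। by default) into spaces so that neighbouring words are not joined, and deletes all other Unicode punctuation.

```python
import unicodedata


def strip_punctuation(text, keep="-", to_space="।"):
    """Remove punctuation, except characters in `keep`; map `to_space` to spaces."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)          # e.g. keep the hyphen in "intra-word"
        elif ch in to_space:
            out.append(" ")         # e.g. "কথা।আমি" becomes "কথা আমি"
        elif unicodedata.category(ch).startswith("P"):
            continue                # any other punctuation is dropped
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    text = "intra-word, কথা।আমি"
    # Steps 4 to 6 in one go: clean, split into words, de-duplicate, sort.
    print("\n".join(sorted(set(strip_punctuation(text).split()))))
```

Passing `keep=""` makes the hyphen disappear again, which matches what the online punctuation remover does.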
See also
- User:Titodutta/Bengali Lexeme workflow (the power-process of Bengali)