User:Titodutta/Bengali words from pages

Revision as of 21:04, 18 January 2021 by Titodutta

This page describes how I extract a Bengali word list from a page of a book.

Caveat lector

  • Please use your common sense if you are using this for your own language. Each language may have its own issues; for example, in Bengali I struggle with hyphens/dashes and । (the Bengali full stop) in this process, although those are fixable too.
  • A language community may have its own workflow or word priority list. If you or your community has one, you may want to work on that first.
  • Copyright is important. Do not add copyrighted content.

Extracting from a page

I sometimes use this process:

  1. Go to the page you want to use, such as: s:bn:গল্পগুচ্ছ/ঘাটের কথা
  2. Copy the entire text. Three things now need to be done: a) remove punctuation, b) convert spaces to new lines, c) remove duplicates (and optionally sort alphabetically, which I prefer).
  3. One step is skipped here: how I remove ।, which is specific to Bengali and not universally applicable.
  4. Punctuation removal: paste the entire text into a punctuation-removal tool and remove the punctuation (be careful, hyphens get removed too: "intra-word" becomes "intraword").
  5. Spaces to new lines: you now have a lot of text without punctuation but with many spaces. Use a spaces-to-new-lines tool to put each word on its own line.
  6. Duplicate removal: there are now many words, so I add another step. Take the resulting text to a duplicate-removal tool, remove the duplicates, and also click "Sort alphabetically". (Note: Lingua Libre itself removes duplicates, but I keep this step to clean the word list. If you have 10,000 words, every word counts. Lingua Libre (or my computer :-D) starts slowing down for a list of more than ~4,000 words.)
  7. You now have the clean list. Copy and paste it onto a List page. Additional tip: once the list is sorted alphabetically, you will mostly find the problematic entries at the beginning or at the end.
  8. The Wikisource page in step 1 is a real example. See what the resulting page looks like: List:Ben/রবি ঠাকুর/ঘাটের কথা
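
Steps 4 to 6 can also be sketched as a small script. This is only a minimal illustration in Python, not the tools I actually use: the function names are made up for this example, and it assumes that "punctuation" means every character in a Unicode punctuation category (which includes both the hyphen and ।). Like the online tool, it deletes punctuation outright, so "intra-word" becomes "intraword".

```python
import unicodedata


def remove_punctuation(text: str) -> str:
    """Step 4: delete every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


def spaces_to_lines(text: str) -> str:
    """Step 5: put each whitespace-separated word on its own line."""
    return "\n".join(text.split())


def dedupe_and_sort(lines: str) -> str:
    """Step 6: drop duplicate lines and sort the rest by code point."""
    unique = {line for line in lines.splitlines() if line}
    return "\n".join(sorted(unique))


if __name__ == "__main__":
    sample = "intra-word, the the cat। cat"
    print(dedupe_and_sort(spaces_to_lines(remove_punctuation(sample))))
```

Note that `sorted` orders words by Unicode code point, which may differ from dictionary order in Bengali; it is good enough for spotting odd entries at the top and bottom of the list.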

Questions

Answers to these questions could greatly improve the process.

  1. Step 4 above shows that I cannot easily remove ।. Sometimes I also do not want to remove the hyphen. Do you know of any tool or process where I can say "please remove ।, but do not remove -"?
  2. From step 4 to step 6, I am actually going to three different tools to clean the data. Do you know of a single tool that can do all of these together?
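
One possible answer to both questions is a short script. The following is a hedged sketch in Python, not an existing tool: `extract_words` and its `keep` parameter are hypothetical names chosen for this example. It assumes that every character in a Unicode punctuation category should be removed unless it is listed in `keep`. Since । falls in such a category, it is removed, while the hyphen survives when `keep="-"`. Unlike the online tool in step 4, removed punctuation is replaced by a space, so words separated only by a comma or । do not get glued together.

```python
import unicodedata


def extract_words(text: str, keep: str = "-") -> list[str]:
    """Return the sorted, de-duplicated words of `text`.

    Punctuation (Unicode categories starting with 'P') is replaced by a
    space, except for the characters listed in `keep`.
    """
    cleaned = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        cleaned.append(" " if is_punct and ch not in keep else ch)

    words = "".join(cleaned).split()
    # Trim kept characters from word edges, so a free-standing "-" used as a
    # dash disappears while the hyphen inside "intra-word" stays.
    words = [w.strip(keep) for w in words]
    return sorted({w for w in words if w})


if __name__ == "__main__":
    sample = "intra-word কথা। কথা - end."
    print("\n".join(extract_words(sample)))
```

Calling `extract_words(text, keep="")` removes hyphens as well, which is closer to what the three separate tools do today. The output, one word per line, can be pasted directly onto a List page.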


See also