Titodutta/Bengali words from pages

This page describes how I extract a Bengali word list from a page of a book.

''Caveat lector''
* Please use your common sense if you adapt this for your own language. Each language may have its own issues. For example, in Bengali I struggle with hyphens/dashes and । (the Bengali full stop) in this process, although those are fixable too.
* A language (community) may have its own workflow or word priority list. If you or your community has one, you may want to work on that first.
* Copyright is important. Do not use copyrighted content.

== Extracting from a page == 

I ''sometimes'' use this process:
# Go to a page which you want to use, such as: [[:s:bn:গল্পগুচ্ছ/ঘাটের কথা]]
# Copy the entire text. Three things now need to be done: a) punctuation removal, b) converting spaces to new lines, and c) duplicate removal (and possibly alphabetical sorting, which I prefer).
# <small>One step is skipped here: how I remove ।, which is specific to Bengali and is not universally applicable.</small>
# '''Punctuation removal:''' Paste the entire text into a tool like [https://www.browserling.com/tools/remove-punctuation this one] and remove the punctuation (be careful: hyphens are removed too, so "intra-word" becomes "intraword").
# '''Spaces to new lines:''' You now have a lot of text without punctuation but with many spaces. Use a [https://www.browserling.com/tools/spaces-to-newlines spaces to new lines tool] to put each word on its own line.
# '''Duplicate removal:''' There are now many words, so I add another step. Take the resulting text to [https://www.textfixer.com/tools/remove-duplicate-lines.php this tool]. Remove the duplicates; I also click on "Sort alphabetically". (Note that Lingua Libre itself removes duplicates, but I keep this step to get a clean word list. For example, if you have 10,000 words, every word counts. Lingua Libre (or my computer :-D) starts getting slower with a list of more than ~4,000 words.)
# You now have the clean list. Copy and paste it onto a List page. Additional tip: once the list is sorted alphabetically, the problematic entries mostly appear at the beginning or at the end.
# The Wikisource page given above is a real example. See what the resulting page looks like: [[List:Ben/রবি ঠাকুর/ঘাটের কথা]]
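
The three cleaning steps (punctuation removal, spaces to new lines, duplicate removal with sorting) could also be sketched as a short Python script. This is only an illustration, not the process I actually use: the function name <code>words_from_text</code> is made up for this example, and it assumes that "punctuation" means every character whose Unicode general category starts with "P". That covers । (U+0964), but, like the online tool, it also removes hyphens. Sorting is by Unicode code point, which may differ from dictionary order.

```python
import unicodedata


def words_from_text(text: str) -> list:
    """Return the unique words of `text`, sorted by Unicode code point.

    Hypothetical helper for illustration only.
    """
    # a) Punctuation removal: drop every character whose Unicode general
    #    category starts with "P". This includes the Bengali full stop
    #    (U+0964) and, as with the online tool, the hyphen.
    no_punct = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # b) Spaces to new lines: split() breaks on any run of whitespace,
    #    so each word becomes one list item.
    words = no_punct.split()
    # c) Duplicate removal and sorting.
    return sorted(set(words))


if __name__ == "__main__":
    sample = "আমি ভাত খাই। তুমি ভাত খাও।"
    # One word per line, ready to paste onto a List page.
    print("\n".join(words_from_text(sample)))
```

The printed output has one word per line, so it can be pasted onto a List page in the same way as the output of the online tools.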

== Questions ==
Answers to these questions could greatly improve the process.
# Step 4 above shows that I cannot easily remove ।. Sometimes I also do not want to remove the hyphen. Do you know of any tool or process that I can instruct to remove । but keep -?
# From Step 4 to Step 6, I go to three different tools to clean the data. Do you know of "one tool" that can do all of this together?
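
One possible direction for the first question is to list the characters to remove explicitly, so that anything not listed (such as the hyphen) is left alone. The sketch below is only an assumption about how this could be done in Python. The function name <code>strip_chars</code> and the default character list are made up for this example and would need adjusting for each language.

```python
def strip_chars(text: str, remove: str = "।,;:!?\"'()") -> str:
    """Replace each character listed in `remove` with a space.

    Hypothetical helper: characters that are not listed, such as the
    hyphen, are kept unchanged.
    """
    # Build a translation table that maps every unwanted character
    # to a single space, then apply it in one pass.
    table = str.maketrans({ch: " " for ch in remove})
    return text.translate(table)


if __name__ == "__main__":
    cleaned = strip_chars("ঘাটের কথা। intra-word!")
    # split() then gives one word per list item; the hyphen survives.
    print(cleaned.split())
```

Replacing with a space instead of deleting means that two words separated only by a removed character do not get glued together.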


== See also ==
* [[User:Titodutta/Bengali Lexeme workflow]] (the power-process of Bengali)