This page describes how I extract a Bengali word list from a page of a book.

''Caveat lector''
* Please use common sense if you adapt this process for your own language. Each language may have its own issues. For example, in Bengali I struggle with the hyphen/dash and । (the Bengali full stop) in this process, although those are fixable too.
* A language community may have its own workflow or word priority list. If your community has one, work on that first.
* Copyright is important. Do not add copyrighted content.

== Extracting from a page ==
I ''sometimes'' use this process:
# Go to the page you want to use, such as [[:s:bn:গল্পগুচ্ছ/ঘাটের কথা]].
# Copy the entire text. Three things now need to be done: a) punctuation removal, b) conversion of spaces to new lines, c) duplicate removal (and possibly alphabetical sorting, which I prefer).
# <small>One step is skipped here: how I remove ।. It is specific to Bengali and does not have universal scope.</small>
# '''Punctuation removal:''' Paste the entire text into a tool like [https://www.browserling.com/tools/remove-punctuation this one] and remove the punctuation. Be careful: hyphens are removed too, so "intra-word" becomes "intraword".
# '''Spaces to new lines:''' You now have a lot of text without punctuation but with a lot of spaces. Use a [https://www.browserling.com/tools/spaces-to-newlines spaces to new lines tool] to put each word on its own line.
# '''Duplicate removal:''' The list now contains many repeated words, so I add another step. Take the resulting text to [https://www.textfixer.com/tools/remove-duplicate-lines.php this tool], remove duplicates, and also click "Sort alphabetically". (Note: Lingua Libre itself removes duplicates, but I keep this step to get a clean word list. For example, if you have 10,000 words, every word counts. Lingua Libre (or my computer :-D) starts slowing down for a list of more than ~4,000 words.)
# You have the clean list. Copy and paste it onto a List page. Additional tip: once you sort the list alphabetically, you will mostly find the problematic entries at the beginning or at the end.
# The Wikisource page in step 1 is a real example. See what the resulting page looks like: [[List:Ben/রবি ঠাকুর/ঘাটের কথা]].

== Questions ==
Answers to these questions could greatly improve the process.
# Step 4 above shows that I cannot easily remove ।. Sometimes I also do not want to remove the hyphen. Do you know of any tool or process where I can say "remove ।, but do not remove -"?
# In steps 4 to 6, I go to three different tools to clean the data. Do you know of one tool that can do all of this together?

== See also ==
* [[User:Titodutta/Bengali Lexeme workflow]] (the power-process of Bengali)
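
== Sketch: one script for steps 4 to 6 ==
One possible answer to both questions above is a short script that does steps 4 to 6 in one pass. This is a minimal sketch, not a tested workflow. The names <code>extract_words</code> and <code>KEEP</code> are hypothetical and chosen only for illustration. It assumes that "punctuation" means any character in a Unicode punctuation category, which includes । (U+0964), and that the plain hyphen "-" is the only punctuation character to keep.

```python
import unicodedata

# Assumption: punctuation characters that should survive inside words.
KEEP = frozenset("-")


def extract_words(text, keep=KEEP):
    """Return the sorted, de-duplicated words found in text."""
    cleaned = []
    for ch in text:
        if ch in keep:
            # Keep explicitly whitelisted characters such as the hyphen.
            cleaned.append(ch)
        elif unicodedata.category(ch).startswith("P"):
            # Any Unicode punctuation (including the Bengali full stop)
            # becomes a space, so the words around it stay separate.
            cleaned.append(" ")
        else:
            cleaned.append(ch)

    # Splitting on whitespace plays the role of "spaces to new lines".
    words = "".join(cleaned).split()

    # Drop kept characters that dangle at word edges or stand alone.
    edge_chars = "".join(keep)
    words = [w.strip(edge_chars) for w in words]

    # A set removes duplicates; sorted() orders by Unicode code point.
    return sorted({w for w in words if w})


if __name__ == "__main__":
    sample = "আমি ভাত খাই। আমি intra-word, খাই - না!"
    # One word per line, ready to paste onto a List page.
    print("\n".join(extract_words(sample)))
```

Two design choices differ from the online tools. Punctuation is replaced by a space instead of being deleted, so two words joined only by । do not merge into one. Sorting is by code point, which may not match the alphabetical order a Bengali dictionary would use. Characters classified as symbols rather than punctuation (for example currency signs) are left untouched.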