User:Titodutta/Bengali words from pages
Latest revision as of 21:31, 18 January 2021
This page describes how I extract a Bengali word list from a page of a book.
Caveat lector
- Please use your common sense if you apply this process to your own language. Each language may have its own issues; for example, in Bengali I struggle with the hyphen/dash and । (the Bengali full stop) in this process, although those are fixable too.
- A language (community) may have its own workflow or word priority list. If you or your community have one, you may want to work on that first.
- Copyright is important. Do not upload copyrighted content.
Extracting from a page
I sometimes use this process:
1. Go to the page you want to use, such as s:bn:গল্পগুচ্ছ/ঘাটের কথা
2. Copy the entire text. Three things now need to be done: a) punctuation removal, b) converting spaces to new lines, and c) removing duplicates (and possibly alphabetical sorting, which I prefer).
3. One step is skipped here: how I remove ।, which is specific to Bengali and not universally applicable.
4. Punctuation removal: Paste the entire text into a tool like this one (https://www.browserling.com/tools/remove-punctuation) and remove the punctuation. Be careful: hyphens get removed too, so "intra-word" becomes "intraword".
5. Spaces to new lines: You now have text without punctuation but with a lot of spaces. Use a spaces to new lines tool (https://www.browserling.com/tools/spaces-to-newlines) to put each word on its own line.
6. Duplication removal: The list now contains many repeated words, so I add another step. Take the resulting text to this tool (https://www.textfixer.com/tools/remove-duplicate-lines.php), remove duplicates, and also click "Sort alphabetically". (Note: Lingua Libre itself has a duplicate remover, but I keep this step to clean the word list. If you have 10,000 words, every word counts: Lingua Libre (or my computer :-D) starts getting slower for a list of more than ~4,000 words.)
7. You now have the clean list. Copy and paste it onto a List page. Additional tip: once the list is sorted alphabetically, you will mostly find the problematic entries at the beginning or at the end.
8. I gave an example from Wikisource just above. This was a real example. See what the resulting page looks like: List:Ben/রবি ঠাকুর/ঘাটের কথা
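The three cleaning steps (punctuation removal, spaces to new lines, duplicate removal with sorting) can also be sketched as a short Python script. This is only an illustration of the same idea, not the tools linked above: the function names are made up for this example, and "punctuation" is assumed to mean every character whose Unicode category starts with "P", which includes both the hyphen and ।. Sorting is by Unicode code point, which may differ from a proper Bengali dictionary order.

```python
import unicodedata


def remove_punctuation(text):
    # Delete every character in a Unicode punctuation category ("P...").
    # Like the online tool, this also deletes hyphens, so "intra-word"
    # becomes "intraword". The Bengali full stop । is deleted as well.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def spaces_to_newlines(text):
    # Split on any run of whitespace and put one word per line.
    return "\n".join(text.split())


def remove_duplicates_sorted(lines_text):
    # Keep each line once and sort the result (by code point).
    return "\n".join(sorted(set(lines_text.splitlines())))


def page_to_word_list(text):
    # Chain the three steps in the same order as the manual process.
    return remove_duplicates_sorted(spaces_to_newlines(remove_punctuation(text)))


if __name__ == "__main__":
    sample = "খাই ভাত। খাই, intra-word"
    print(page_to_word_list(sample))
```

The output has one word per line and can be pasted onto a List page in the same way as the output of the online tools.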
Questions
Answers to these questions could greatly improve the process.
- Step 4 above shows that I cannot easily remove ।. Sometimes I also do not want to remove the hyphen. Do you know of any tool or process that I can instruct to remove । but keep -?
- From Step 4 to Step 6, I actually go to three different tools to clean the data. Do you know of one tool that can do all of these together?
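One possible answer to the first question is a small script in which the characters to keep are a parameter. The sketch below is a hypothetical example, not an existing tool: `word_list` and its `keep` argument are names invented here. It assumes that every character in a Unicode punctuation category (which includes ।) should be replaced by a space unless it is listed in `keep`, and it produces the finished word list in one pass.

```python
import unicodedata


def word_list(text, keep="-"):
    # Replace punctuation with a space, except characters listed in `keep`.
    # With the default keep="-", । is removed but "intra-word" stays intact.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") and ch not in keep else ch
        for ch in text
    )
    # Split on whitespace, then trim kept characters from word edges so a
    # free-standing "-" between words does not end up as an entry.
    words = (w.strip(keep) for w in cleaned.split())
    # Drop empty strings, remove duplicates, sort by Unicode code point.
    return sorted({w for w in words if w})


if __name__ == "__main__":
    sample = "আমি ভাত খাই। intra-word, test - test"
    print("\n".join(word_list(sample)))
```

Passing `keep=""` removes hyphens as well, which matches what the online punctuation remover does.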
See also
- User:Titodutta/Bengali Lexeme workflow (the power-process of Bengali)