Compound word splitter for enchant-supported languages
Compound Word Splitter (cwsplit) for any language supported by enchant.
Make sure the enchant dictionary for your language is installed.
You can list the languages that have an installed dictionary by running:
import enchant
print(enchant.list_languages())
from cwsplit import split
For German (Default)
split('Rindfleisch') # ['rind', 'fleisch']
split('blackboard', 'en_en') # ['black', 'board']
from cwsplit import load_dict
load_dict('en_en')
split('blackboard') # ['black', 'board']
Sometimes the word is misspelled or simply doesn’t exist. By default, the unrecognized part is split into single characters until a longer word shows up.
A positive effect of this behaviour is that connecting letters, like the ‘s’ in überwachungsaufgaben, are isolated.
On the other hand, imagine a non-existent word such as gibberishfleisch: it will be decompounded into the words gib, b, e, r, i, s, h and fleisch.
split('gibberishfleisch', language='de_de') # ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']
This does not look good at all. This is why you can set a minimum word size, so that consecutive words shorter than that size are concatenated. For example, let’s define the shortest word as 4 characters long:
split('gibberishfleisch', language='de_de', min_word_size=4) # ['gibberish', 'fleisch']
Now we get two words gibberish and fleisch, which is something you would expect.
This will not affect the correct words that have a connecting ‘s’.
split('übertragungsgesetz', min_word_size=4) # ['übertragung', 's', 'gesetz']
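The effect of min_word_size described above can be pictured as a post-processing step that joins runs of consecutive short fragments. The following is only a minimal sketch under that assumption; the helper name merge_short_fragments is hypothetical and is not part of the cwsplit API, and the package may implement this differently.

```python
def merge_short_fragments(parts, min_word_size):
    """Join each run of consecutive fragments shorter than min_word_size.

    Hypothetical helper for illustration only, not part of cwsplit.
    """
    merged = []
    run = []  # current run of consecutive short fragments
    for part in parts:
        if len(part) < min_word_size:
            run.append(part)
        else:
            if run:
                # Close the pending run before emitting the long word.
                merged.append(''.join(run))
                run = []
            merged.append(part)
    if run:
        merged.append(''.join(run))
    return merged


print(merge_short_fragments(['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch'], 4))
# ['gibberish', 'fleisch']
print(merge_short_fragments(['übertragung', 's', 'gesetz'], 4))
# ['übertragung', 's', 'gesetz']
```

A lone connecting ‘s’ forms a run of length one, so joining it leaves it unchanged, which matches the behaviour shown above.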
This is a very simple recursive algorithm that looks for the longest word inside the provided word by checking whether it exists in the enchant dictionary. The output is always returned as a list of strings. If no shorter words are found, the input word is returned as a single-element list.
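To make the idea concrete, here is a rough sketch of a recursive longest-word lookup. It is not the actual cwsplit source: the function name split_longest is hypothetical, and a plain Python set stands in for the enchant dictionary so the example runs without any dictionary installed.

```python
def split_longest(word, vocab):
    """Recursively split word around its longest known proper substring.

    Illustrative sketch only: vocab is a plain set standing in for the
    enchant dictionary lookup.
    """
    word = word.lower()
    n = len(word)
    # Try proper substrings from longest to shortest, left to right.
    for size in range(n - 1, 0, -1):
        for start in range(n - size + 1):
            candidate = word[start:start + size]
            if candidate in vocab:
                left = word[:start]
                right = word[start + size:]
                parts = []
                if left:
                    parts += split_longest(left, vocab)
                parts.append(candidate)
                if right:
                    parts += split_longest(right, vocab)
                return parts
    # No shorter word found: return the input as a single-element list.
    return [word]


vocab = {'rind', 'fleisch', 'black', 'board'}
print(split_longest('Rindfleisch', vocab))  # ['rind', 'fleisch']
print(split_longest('blackboard', vocab))   # ['black', 'board']
print(split_longest('xyz', vocab))          # ['xyz']
```

With a real dictionary, the membership test would be replaced by a dictionary check, and very short entries accepted by the dictionary would explain the single-character fragments shown earlier.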
The upload script uses pandoc to convert README.md to README in rst format, which is needed to create the package. Make sure pandoc is installed if you plan to use the script.