Compound word splitter for enchant-supported languages
Project description
Compound Word Splitter (cwsplit) for any language supported by enchant.
Installation
Make sure you have an enchant dictionary installed for the language you want to split.
You can check the list of installed languages by running:
import enchant
print(enchant.list_languages())
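If you want to check for one specific language rather than read the whole list, a small helper along these lines may be convenient. This is only a sketch: `find_language` and `installed_languages` are hypothetical names, not part of cwsplit or enchant, and the case-insensitive comparison is an assumption made here because language tags are often written with different capitalisation.

```python
def find_language(tag, installed):
    """Return the entry of `installed` that matches `tag`, ignoring case.

    Hypothetical helper (not part of cwsplit or enchant).
    Returns None when no installed language matches.
    """
    wanted = tag.lower()
    for candidate in installed:
        if candidate.lower() == wanted:
            return candidate
    return None


def installed_languages():
    """Ask enchant for its installed languages (requires pyenchant)."""
    import enchant  # imported lazily so find_language works without it
    return enchant.list_languages()


# Example with a hand-written list standing in for enchant.list_languages():
print(find_language("de_de", ["de_DE", "en_US"]))
```

With pyenchant installed, `find_language('de_de', installed_languages())` would run the same check against the real list.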
Usage
Import module:
from cwsplit import split
For German (default):
split('Rindfleisch')
# ['rind', 'fleisch']
For English:
split('blackboard', 'en_en')
# ['black', 'board']
or
from cwsplit import load_dict
load_dict('en_en')
split('blackboard')
# ['black', 'board']
Sometimes the word is misspelled or simply doesn’t exist. By default, the unknown part of the word is split into single characters until a longer known word shows up.
A positive effect of this behaviour is that connecting letters, such as the ‘s’ in überwachungsaufgaben, are isolated.
On the other hand, consider a non-existent word such as gibberishfleisch: it is decompounded into the words gib, b, e, r, i, s, h and fleisch.
split('gibberishfleisch', language='de_de')
# ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']
This does not look good at all. This is why you can select the shortest word size, so that all consecutive shorter words are concatenated. For example, let’s define the shortest word as 4 characters long:
split('gibberishfleisch', language='de_de', min_word_size=4)
# ['gibberish', 'fleisch']
Now we get two words gibberish and fleisch, which is something you would expect.
This will not affect the correct words that have a connecting ‘s’.
For example:
split('übertragungsgesetz', min_word_size=4)
# ['übertragung', 's', 'gesetz']
remains correct.
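The effect of min_word_size on the two examples above can be reproduced with a small post-processing step. The following is a sketch of one way to get that behaviour, not the actual cwsplit code; `merge_short_parts` is a hypothetical name, and the rule it uses (join every run of consecutive parts shorter than the minimum size) is an assumption based on the description above.

```python
def merge_short_parts(parts, min_word_size):
    """Join each run of consecutive parts shorter than `min_word_size`.

    Hypothetical helper illustrating the described behaviour.
    A short part with no short neighbours (such as a connecting 's')
    forms a run of length one and is therefore left unchanged.
    """
    merged = []
    run = []  # consecutive short parts collected so far
    for part in parts:
        if len(part) < min_word_size:
            run.append(part)
            continue
        if run:
            merged.append("".join(run))
            run = []
        merged.append(part)
    if run:  # flush a run that ends the word
        merged.append("".join(run))
    return merged


print(merge_short_parts(['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch'], 4))
# ['gibberish', 'fleisch']
print(merge_short_parts(['übertragung', 's', 'gesetz'], 4))
# ['übertragung', 's', 'gesetz']
```

Note that gib is itself shorter than 4 characters, which is why it is absorbed into gibberish, while the lone ‘s’ between two long words has nothing to merge with.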
Algorithm
This is a very simple recursive algorithm that looks for the longest word inside the provided word by checking whether it exists in the enchant dictionary. The output is always returned as a list of strings. If no shorter words are found, the input word is returned as a single-element list.
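The description above can be illustrated with a short sketch. This is one possible reading of the longest-match idea, not the cwsplit source: the function names are hypothetical, a plain Python set stands in for the enchant dictionary, and the choices to ignore candidates shorter than two characters and to break unknown remainders into single characters are assumptions taken from the examples in the Usage section.

```python
def longest_known_part(text, is_word, max_len):
    """Return (start, end) of the longest substring of `text` accepted by
    `is_word`, at most `max_len` characters long, or None if there is none.
    Candidates shorter than 2 characters are ignored (an assumption)."""
    for size in range(min(max_len, len(text)), 1, -1):
        for start in range(len(text) - size + 1):
            if is_word(text[start:start + size]):
                return start, start + size
    return None


def _split_rest(text, is_word):
    """Split a remainder; here the whole remainder may itself be a word."""
    if not text:
        return []
    found = longest_known_part(text, is_word, len(text))
    if found is None:
        return list(text)  # unknown text falls apart into characters
    start, end = found
    return (_split_rest(text[:start], is_word)
            + [text[start:end]]
            + _split_rest(text[end:], is_word))


def split_compound(word, is_word):
    """Split `word` around its longest known part that is strictly shorter
    than the word itself; return [word] when no such part exists."""
    word = word.lower()
    found = longest_known_part(word, is_word, len(word) - 1)
    if found is None:
        return [word]
    start, end = found
    return (_split_rest(word[:start], is_word)
            + [word[start:end]]
            + _split_rest(word[end:], is_word))


# Toy vocabulary standing in for an enchant dictionary.
known = {'rind', 'fleisch', 'rindfleisch', 'gib', 'übertragung', 'gesetz'}
print(split_compound('Rindfleisch', known.__contains__))
# ['rind', 'fleisch']
print(split_compound('gibberishfleisch', known.__contains__))
# ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']
```

With pyenchant installed, the `check` method of an `enchant.Dict` object could be passed as `is_word` instead of the toy set, although a real dictionary knows many more short words and may therefore split differently.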
Developers
The upload script uses pandoc to convert README.md to a README in rst format, which is needed in order to create the package. Make sure you have pandoc installed if you plan to use the script.
Source Distribution
File details
Details for the file cwsplit-0.4.1.tar.gz.
File metadata
- Download URL: cwsplit-0.4.1.tar.gz
- Upload date:
- Size: 3.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 66f52f9fcc6f5095c3c3539cc3a0fc2a7998ac620eda0412158381afe76c6c4a
MD5 | 7b504f02f2871d68354c8c806ab7885f
BLAKE2b-256 | 289905c7d0316fb3aed3065fb35b9a9182c4fb4242524668aafbbff9a6dac85b