
Compound word splitter for enchant-supported languages



Compound Word Splitter (cwsplit) for any language supported by enchant.

Installation

Make sure you have an enchant dictionary installed for the language you want to split.

You can check the list of available languages by running:

import enchant
print(enchant.list_languages())

See the pyenchant and enchant documentation for more information.

Usage

Import module:

from cwsplit import split

For German (the default):

split('Rindfleisch')
# ['rind', 'fleisch']

For English:

split('blackboard', 'en_en')
# ['black', 'board']

or

from cwsplit import load_dict
load_dict('en_en')
split('blackboard')
# ['black', 'board']

Sometimes the word is misspelled or just doesn’t exist. By default, the word is split into single characters until a longer word shows up.

A positive effect of this behaviour is that connecting letters, like the ‘s’ in überwachungsaufgaben, are isolated.

On the other hand, imagine a non-existent word such as gibberishfleisch: it is decompounded into the words gib, b, e, r, i, s, h and fleisch.

split('gibberishfleisch', language='de_de')
# ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']

This does not look good at all. For this reason you can set a minimum word size, so that all consecutive shorter words are concatenated. For example, let’s define the shortest word as 4 characters long:

split('gibberishfleisch', language='de_de', min_word_size=4)
# ['gibberish', 'fleisch']

Now we get two words gibberish and fleisch, which is something you would expect.

This will not affect the correct words that have a connecting ‘s’.

For example:

split('übertragungsgesetz', min_word_size=4)
# ['übertragung', 's', 'gesetz']

remains correct.
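
The concatenation step can be illustrated with a small standalone sketch. It is not the package's code: the helper name merge_short_parts is hypothetical, and the rule it applies (join every run of consecutive parts shorter than min_word_size) is an assumption inferred from the two examples above.

```python
from itertools import groupby


def merge_short_parts(parts, min_word_size):
    """Join each run of consecutive parts shorter than min_word_size.

    Hypothetical helper: the rule is inferred from the README examples,
    not taken from the cwsplit source.
    """
    merged = []
    # groupby yields maximal runs of parts that share the same "is short" flag.
    for is_short, run in groupby(parts, key=lambda part: len(part) < min_word_size):
        if is_short:
            merged.append("".join(run))  # glue the short fragments together
        else:
            merged.extend(run)           # long enough parts pass through unchanged
    return merged


print(merge_short_parts(['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch'], 4))
# ['gibberish', 'fleisch']
print(merge_short_parts(['übertragung', 's', 'gesetz'], 4))
# ['übertragung', 's', 'gesetz']
```

A lone connecting ‘s’ forms a run of length one, so joining it leaves it unchanged, which would explain why correct compounds are not affected.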

Algorithm

This is a very simple recursive algorithm that looks for the longest word inside the provided word by checking whether it exists in the enchant dictionary. The output is always returned as a list of strings. If no shorter words are found, the input word is returned as a single-element list.
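
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: split_sketch and known_words are hypothetical names, a plain Python set stands in for the enchant dictionary lookup, and two details are assumptions: candidates shorter than 2 characters are ignored, and leftover pieces that contain no known word are broken into single characters.

```python
def split_sketch(word, known_words, _top_level=True):
    """Recursively split `word` around the longest substring in `known_words`.

    Hypothetical sketch: a set replaces the enchant dictionary check.
    """
    word = word.lower()
    # At the top level the whole word is skipped, so a compound that is
    # itself in the dictionary is still decomposed.
    longest = len(word) - 1 if _top_level else len(word)
    for size in range(longest, 1, -1):              # longest candidates first
        for start in range(len(word) - size + 1):   # scan left to right
            candidate = word[start:start + size]
            if candidate in known_words:
                left, right = word[:start], word[start + size:]
                parts = split_sketch(left, known_words, False) if left else []
                parts.append(candidate)
                if right:
                    parts.extend(split_sketch(right, known_words, False))
                return parts
    # Nothing found: keep the input whole at the top level,
    # otherwise fall back to single characters.
    return [word] if _top_level else list(word)


KNOWN = {'rind', 'fleisch', 'gib'}  # toy word list for the demo
print(split_sketch('Rindfleisch', KNOWN))
# ['rind', 'fleisch']
print(split_sketch('gibberishfleisch', KNOWN))
# ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']
print(split_sketch('xyz', KNOWN))
# ['xyz']
```

With a real dictionary, the membership test would be replaced by a lookup in the loaded enchant dictionary; the results then depend on which words that dictionary contains.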

Developers

The upload script uses pandoc to convert README.md to a README in rst format, which is needed to create the package. Make sure you have pandoc installed if you plan to use the script.

