Compound word splitter for enchant-supported languages

Compound Word Splitter (cwsplit) for any language supported by enchant.

Installation

Make sure you have an enchant dictionary installed for each language you want to split.

You can list the languages for which enchant finds a dictionary by running:

import enchant
print(enchant.list_languages())

See the pyenchant and enchant documentation for more information.

Usage

Import module:

from cwsplit import split

For German (default):

split('Rindfleisch')
# ['rind', 'fleisch']

For English:

split('blackboard', 'en_en')
# ['black', 'board']

or

from cwsplit import load_dict
load_dict('en_en')
split('blackboard')
# ['black', 'board']

Sometimes a word is misspelled or simply does not exist. By default, the unrecognized part is split into single characters until a longer known word shows up.

A positive effect of this behaviour is that connecting letters, such as the ‘s’ in überwachungsaufgaben, are isolated.

On the other hand, consider a non-existent word such as gibberishfleisch. It is decompounded into the words gib, b, e, r, i, s, h and fleisch:

split('gibberishfleisch', language='de_de')
# ['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch']

This does not look good at all. To avoid it, you can set a minimum word size, so that consecutive shorter words are concatenated. For example, let’s set the minimum word size to 4 characters:

split('gibberishfleisch', language='de_de', min_word_size=4)
# ['gibberish', 'fleisch']

Now we get two words, gibberish and fleisch, which is what you would expect.

This does not affect correct words that contain a connecting ‘s’.

For example:

split('übertragungsgesetz', min_word_size=4)
# ['übertragung', 's', 'gesetz']

remains correct.
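
The two examples above are consistent with a simple rule: every run of consecutive pieces shorter than min_word_size is joined into one piece, and a lone short piece such as ‘s’ is left as it is. The sketch below illustrates that reading only. The helper name merge_short_runs is hypothetical and is not part of cwsplit, and the package may implement the option differently.

```python
def merge_short_runs(parts, min_word_size):
    """Join each run of consecutive parts shorter than min_word_size.

    Hypothetical helper for illustration; not part of the cwsplit API.
    """
    merged = []
    run = []  # consecutive short parts collected so far
    for part in parts:
        if len(part) < min_word_size:
            run.append(part)
            continue
        if run:
            # A long part ends the current run of short parts.
            merged.append(''.join(run))
            run = []
        merged.append(part)
    if run:
        # Flush a run that reaches the end of the list.
        merged.append(''.join(run))
    return merged


print(merge_short_runs(['gib', 'b', 'e', 'r', 'i', 's', 'h', 'fleisch'], 4))
# ['gibberish', 'fleisch']
print(merge_short_runs(['übertragung', 's', 'gesetz'], 4))
# ['übertragung', 's', 'gesetz']
```

Note that a joined run can still be shorter than min_word_size, which is why the single connecting ‘s’ survives unchanged.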

Algorithm

This is a very simple recursive algorithm that looks for the longest word inside the provided word by checking whether it exists in the enchant dictionary. The output is always returned as a list of strings. If no shorter words are found, the input word is returned as a single-element list.
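
As a rough illustration of the idea, here is a minimal sketch of a recursive longest-word search. It is not the cwsplit source: the function name split_longest is hypothetical, a plain Python set stands in for the enchant dictionary so the example runs without any dictionary installed, and substrings shorter than two characters are never treated as words. Unlike the default behaviour described in Usage, this sketch keeps an unrecognized remainder as one piece instead of splitting it into characters.

```python
def split_longest(word, known_words):
    """Recursively split `word` around the longest known word inside it.

    Hypothetical sketch; `known_words` is a set standing in for a
    dictionary lookup.
    """
    word = word.lower()
    if not word:
        return []
    # Try substrings from longest to shortest, excluding the whole word
    # and single characters.
    for size in range(len(word) - 1, 1, -1):
        for start in range(len(word) - size + 1):
            candidate = word[start:start + size]
            if candidate in known_words:
                # Split what is left on either side of the match.
                left = split_longest(word[:start], known_words)
                right = split_longest(word[start + size:], known_words)
                return left + [candidate] + right
    # No shorter known word found: return the input as a single-element list.
    return [word]


print(split_longest('Rindfleisch', {'rind', 'fleisch'}))
# ['rind', 'fleisch']
print(split_longest('überwachungsaufgaben', {'überwachung', 'aufgaben'}))
# ['überwachung', 's', 'aufgaben']
```

Searching from the longest candidate downwards means a long word is preferred over the shorter words it may contain, which matches the "longest word" wording above.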

Developers

The upload script uses pandoc to convert README.md to README in rst format, which is needed to create the package. Make sure pandoc is installed if you plan to use the script.
