Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.

C Word Ninja

Slice your munged-together words! Seriously, take anything: 'imateapot', for example, becomes ['im', 'a', 'teapot']. Useful for humanizing stuff (like database tables when people don't like underscores).

This project is repackaging the excellent work from here: http://stackoverflow.com/a/11642687/2449774

cwordninja is a Cython rewrite of wordninja.
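
For readers curious how the splitting works, here is a minimal pure-Python sketch of the idea described in the linked Stack Overflow answer: assign each word a cost from its frequency rank (Zipf's law), then use dynamic programming to find the cheapest segmentation. This is not cwordninja's actual Cython code. The tiny word list, the name split_sketch, and the finite penalty for unknown characters are all assumptions made for illustration.

```python
from math import log

# Toy vocabulary ordered from most to least frequent. This list is an
# assumption for illustration; the real package uses a far larger one.
WORDS = ["the", "a", "in", "i", "is", "tea", "pot", "teapot",
         "im", "derek", "anderson"]

# Zipf's law: the word at rank r has probability of roughly
# 1 / (r * log N), so its cost (negative log probability) is log(r * log N).
WORD_COST = {w: log((rank + 1) * log(len(WORDS)))
             for rank, w in enumerate(WORDS)}
MAX_WORD_LEN = max(len(w) for w in WORDS)

# Assumption of this sketch: unknown chunks get a large finite penalty
# per character, so unfamiliar text degrades to single characters.
UNKNOWN_CHAR_COST = 100.0


def split_sketch(text):
    """Return the lowest-cost segmentation of text into known words."""
    cost = [0.0]  # cost[i]: cheapest cost of segmenting text[:i]
    back = [0]    # back[i]: length of the last word in that segmentation
    for i in range(1, len(text) + 1):
        best_cost, best_len = float("inf"), 1
        for length in range(1, min(i, MAX_WORD_LEN) + 1):
            candidate = text[i - length:i]
            c = cost[i - length] + WORD_COST.get(
                candidate, UNKNOWN_CHAR_COST * length)
            if c < best_cost:
                best_cost, best_len = c, length
        cost.append(best_cost)
        back.append(best_len)

    # Walk backwards through the recorded word lengths to recover the words.
    words = []
    i = len(text)
    while i > 0:
        words.append(text[i - back[i]:i])
        i -= back[i]
    return words[::-1]


print(split_sketch("imateapot"))      # ['im', 'a', 'teapot']
print(split_sketch("derekanderson"))  # ['derek', 'anderson']
```

With this toy vocabulary, 'teapot' wins over 'tea' + 'pot' because one moderately rare word costs less than two words added together.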

Usage

$ python
>>> import cwordninja
>>> cwordninja.split('derekanderson')
['derek', 'anderson']
>>> cwordninja.split('imateapot')
['im', 'a', 'teapot']
>>> cwordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> cwordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
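
The "humanizing" use case mentioned earlier could look something like the sketch below. The helper name humanize is hypothetical, and it takes the splitter as a parameter so the example runs without cwordninja installed; in practice you would presumably pass cwordninja.split.

```python
def humanize(identifier, split):
    """Turn a run-together identifier into a readable label.

    `split` is any callable that maps a string to a list of words,
    for example cwordninja.split.
    """
    words = split(identifier)
    return " ".join(word.capitalize() for word in words)


# Stand-in splitter (hypothetical) so the sketch runs on its own.
def fake_split(text):
    known = {"customerorders": ["customer", "orders"]}
    return known.get(text, [text])


print(humanize("customerorders", fake_split))  # Customer Orders
```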

Performance

It's super fast!

Code:

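import cwordninja
import wordninja
import timeit

# Time 10,000 calls to the Cython implementation
def a():
    cwordninja.split("derek anderson")

c_time = int(timeit.timeit(a, number=10000) * 1000)  # seconds -> whole ms
print("cwordninja:", c_time, "ms")

# Time 10,000 calls to the pure-Python implementation
def b():
    wordninja.split("derek anderson")

r_time = int(timeit.timeit(b, number=10000) * 1000)  # seconds -> whole ms
print("wordninja:", r_time, "ms")

# Speedup factor, truncated to an integer
print(int(r_time / c_time), "x")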

Result:

cwordninja: 2 ms
wordninja: 1507 ms
753 x
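
The script above truncates each timing to whole milliseconds, so on a fast machine the smaller number can round down to 0 and the final division fails. A slightly more defensive variant is sketched below. The helper name speedup and its parameters are assumptions, not part of either package; it accepts any two callables, so it can be tried without cwordninja installed.

```python
import timeit


def speedup(fast, slow, number=10000, repeat=5):
    """Return how many times faster `fast` runs than `slow`."""
    # Taking the minimum over several repeats reduces noise from
    # other processes running on the machine.
    fast_s = min(timeit.repeat(fast, number=number, repeat=repeat))
    slow_s = min(timeit.repeat(slow, number=number, repeat=repeat))
    # Compare float seconds directly instead of truncated milliseconds.
    return slow_s / fast_s


# Stand-in workloads; swap in calls to cwordninja.split and
# wordninja.split to reproduce the comparison above.
ratio = speedup(lambda: None, lambda: sum(range(1000)), number=200, repeat=3)
print(f"{ratio:.1f} x")
```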

It can handle long strings:

>>> cwordninja.lsplit('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

It also scales well: this string takes about 0.1 ms to split.

How to Install

pip3 install cwordninja

Custom Language Models

#1 most requested feature! If you want to do something other than English (or want to specify your own model of English), this is how you do it.

>>> lm = cwordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']

Language files must be gzipped text files with one word per line, in decreasing order of probability.
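
One way to produce a file in that format from your own corpus is sketched below, using only the standard library. The helper name write_language_file is hypothetical, and lowercasing every word is an assumption of this sketch; adapt it to whatever normalization your text needs.

```python
import gzip
import os
import tempfile
from collections import Counter


def write_language_file(corpus_words, path):
    """Write words to a gzipped file, one per line, most frequent first."""
    counts = Counter(word.lower() for word in corpus_words)
    # most_common() returns (word, count) pairs sorted by descending count.
    ranked = [word for word, _ in counts.most_common()]
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for word in ranked:
            fh.write(word + "\n")
    return ranked


# Example with a tiny corpus, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_lang.txt.gz")
ranked = write_language_file("the cat saw the dog and the cat".split(), path)
print(ranked)  # ['the', 'cat', 'saw', 'dog', 'and']
```

The resulting path can then be handed to cwordninja.LanguageModel as shown above.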

If you want to make your model the default, set:

cwordninja.DEFAULT_LANGUAGE_MODEL = cwordninja.LanguageModel('my_lang.txt.gz')
