Probabilistically split concatenated words using NLP based on English Wikipedia uni-gram frequencies.

These details have not been verified by PyPI

Project links

Homepage

Project description

C Word Ninja

Slice your munged-together words! Take anything: 'imateapot', for example, becomes ['im', 'a', 'teapot']. Useful for humanizing identifiers (like database table names when people don't like underscores).

This project is repackaging the excellent work from here: http://stackoverflow.com/a/11642687/2449774

cwordninja is a Cython rewrite of wordninja.
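
For readers curious how this kind of splitting can work, here is a minimal pure-Python sketch of a dynamic-programming splitter in the spirit of the linked Stack Overflow answer. It is not cwordninja's actual code: the tiny word list, the Zipf-style cost formula, and the name sketch_split are all assumptions made for illustration.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
WORDS = ["the", "a", "im", "teapot", "tea", "pot", "mate"]

# Assumed cost model: under Zipf's law the word at rank r has probability
# of roughly 1 / (r * log(N)), so its cost is the negative log of that.
COST = {
    word: math.log((rank + 1) * math.log(len(WORDS)))
    for rank, word in enumerate(WORDS)
}
MAX_LEN = max(len(word) for word in WORDS)


def sketch_split(text):
    """Split text into the lowest-cost sequence of vocabulary words."""
    # best[i] = (lowest cost of splitting text[:i], length of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_LEN) + 1):
            word = text[i - k:i]
            # Unknown substrings get infinite cost.
            cost = COST.get(word, float("inf"))
            candidates.append((best[i - k][0] + cost, k))
        best.append(min(candidates))

    # Walk backwards from the end to recover the chosen words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return words[::-1]


print(sketch_split("imateapot"))  # ['im', 'a', 'teapot']
```

With this toy vocabulary, 'teapot' wins over 'tea' + 'pot' because one moderately frequent word costs less than two separate ones.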

Usage

$ python
>>> import cwordninja
>>> cwordninja.split('derekanderson')
['derek', 'anderson']
>>> cwordninja.split('imateapot')
['im', 'a', 'teapot']
>>> cwordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> cwordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
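
As an illustration of the "humanizing" use case mentioned earlier, the sketch below turns a munged identifier into a readable label. The helper name humanize is hypothetical, and the splitter is passed in as a parameter so the example runs without cwordninja installed; a stand-in splitter is used here.

```python
def humanize(name, split):
    """Turn a munged identifier into a title-cased label.

    `split` is any callable mapping a string to a list of words,
    for example cwordninja.split (assumed to be installed by the caller).
    """
    words = split(name.lower())
    return " ".join(word.capitalize() for word in words)


# Stand-in splitter so the sketch runs on its own.
def fake_split(text):
    known = {"customerorders": ["customer", "orders"]}
    return known.get(text, [text])


print(humanize("CustomerOrders", fake_split))  # Customer Orders
```

In real use you would pass cwordninja.split in place of fake_split.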

Performance

It's fast: in the benchmark below, cwordninja comes out roughly 750 times faster than wordninja.

Code:

import cwordninja
import wordninja
import timeit

def a():
    cwordninja.split("derek anderson")

c_time = int(timeit.timeit(a, number=10000) * 1000)
print("cwordninja:", c_time, "ms")

def b():
    wordninja.split("derek anderson")

r_time = int(timeit.timeit(b, number=10000) * 1000)
print("wordninja:", r_time, "ms")

print(int(r_time / c_time), "x")

Result:

cwordninja: 2 ms
wordninja: 1507 ms
753 x
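
The benchmark above rounds each timing to whole milliseconds, so a very fast run could round down to 0 and break the final division. A slightly more defensive variant, sketched here with hypothetical helper names (per_call_us, speedup), takes the best of several repeats and reports per-call time in microseconds. The functions under test are passed in, so the sketch itself does not depend on either package.

```python
import timeit


def per_call_us(func, arg, number=10000, repeat=5):
    """Return the best observed time per call, in microseconds."""
    timer = timeit.Timer(lambda: func(arg))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


def speedup(fast, slow, arg, **kwargs):
    """Return how many times faster `fast` is than `slow` on `arg`."""
    fast_us = per_call_us(fast, arg, **kwargs)
    slow_us = per_call_us(slow, arg, **kwargs)
    return slow_us / fast_us
```

With both packages installed, speedup(cwordninja.split, wordninja.split, 'derekanderson') would return the ratio as a float rather than a truncated integer.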

It can handle long strings:

>>> cwordninja.lsplit('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

It also scales well: splitting this string takes about 0.1 ms.

How to Install

pip3 install cwordninja

Custom Language Models

The #1 most requested feature! If you want to split text in a language other than English (or want to supply your own model of English), this is how you do it.

>>> lm = cwordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der', 'ek']

Language files must be gzipped text files with one word per line, in decreasing order of probability.

If you want to make your model the default, set:

cwordninja.DEFAULT_LANGUAGE_MODEL = cwordninja.LanguageModel('my_lang.txt.gz')
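
One way to produce a language file in the format described above is to count words in a corpus and write them out most frequent first. The sketch below uses only the standard library. The function name write_language_file, the UTF-8 encoding, and the filtering of non-alphabetic tokens are assumptions, not documented requirements of cwordninja.

```python
import gzip
from collections import Counter


def write_language_file(corpus_words, path):
    """Write words to a gzipped text file, one per line, most frequent first."""
    # Assumption: lowercase everything and drop tokens that are not purely alphabetic.
    counts = Counter(w.lower() for w in corpus_words if w.isalpha())
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        # most_common() yields words in descending order of count.
        for word, _ in counts.most_common():
            fh.write(word + "\n")
    return len(counts)
```

The resulting path could then be passed to cwordninja.LanguageModel as shown above.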

Project details

Release history

2.0.4 (this version): Dec 6, 2024
2.0.3: Dec 6, 2024
2.0.2: Dec 6, 2024
2.0.1: Dec 6, 2024

Download files

Source Distribution

cwordninja-2.0.4.tar.gz (601.7 kB)

Uploaded Dec 6, 2024 Source

File details

Details for the file cwordninja-2.0.4.tar.gz.

File metadata

Download URL: cwordninja-2.0.4.tar.gz
Upload date: Dec 6, 2024
Size: 601.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for cwordninja-2.0.4.tar.gz
Algorithm	Hash digest
SHA256	`48ce679363906bcbd04bc083f975c1f3028f25ced96c08a5992864c345863de6`
MD5	`3d32d6cac211ce81aeebb5fdc6f1dba2`
BLAKE2b-256	`24620749ecf8752f6f755186350304b93db9cd82d4fcc808d8cfe49563e2a2ce`
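
To check a downloaded archive against the SHA256 digest listed above, something like the following standard-library sketch could be used. The helper names sha256_of and verify are hypothetical, and the file path is whatever location you saved the archive to.

```python
import hashlib

# SHA256 digest published above for cwordninja-2.0.4.tar.gz.
EXPECTED = "48ce679363906bcbd04bc083f975c1f3028f25ced96c08a5992864c345863de6"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """Return True if the file's SHA256 digest matches `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, verify('cwordninja-2.0.4.tar.gz') should return True for an intact download.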
