Probabilistically split concatenated words using NLP based on English Wikipedia uni-gram frequencies.
Project description
C Word Ninja
Slice your munged together words! Seriously, Take anything, 'imateapot' for example, would become ['im', 'a', 'teapot']. Useful for humanizing stuff (like database tables when people don't like underscores).
This project is repackaging the excellent work from here: http://stackoverflow.com/a/11642687/2449774
cwordninja is rewritten using cython based on wordninja.
Usage
$ python
>>> import cwordninja
>>> cwordninja.split('derekanderson')
['derek', 'anderson']
>>> cwordninja.split('imateapot')
['im', 'a', 'teapot']
>>> cwordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> cwordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Performance
It's super fast!
Code:
import cwordninja
import wordninja
import timeit
def a():
cwordninja.split("derek anderson")
c_time = int(timeit.timeit(a, number=10000) * 1000)
print("cwordninja:", c_time, "ms")
def b():
wordninja.split("derek anderson")
r_time = int(timeit.timeit(b, number=10000) * 1000)
print("wordninja:", r_time, "ms")
print(int(r_time / c_time), "x")
Result:
cwordninja: 2 ms
wordninja: 1507 ms
753 x
It can handle long strings:
>>> cwordninja.lsplit('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']
And scales well. (This string takes ~0.1ms to compute.)
How to Install
pip3 install cwordninja
Custom Language Models
#1 most requested feature! If you want to do something other than english (or want to specify your own model of english), this is how you do it.
>>> lm = cwordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der','ek']
Language files must be gziped text files with one word per line in decreasing order of probability.
If you want to make your model the default, set:
cwordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file cwordninja-2.0.4.tar.gz.
File metadata
- Download URL: cwordninja-2.0.4.tar.gz
- Upload date:
- Size: 601.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
48ce679363906bcbd04bc083f975c1f3028f25ced96c08a5992864c345863de6
|
|
| MD5 |
3d32d6cac211ce81aeebb5fdc6f1dba2
|
|
| BLAKE2b-256 |
24620749ecf8752f6f755186350304b93db9cd82d4fcc808d8cfe49563e2a2ce
|