De-concatenate strings that do not have white-spaces.

Decat

thisisawesome --> ['this', 'is', 'awesome']



Decat is a Python package that de-concatenates strings containing no white-space; in other words, it lets you infer the missing spaces programmatically. It is a simple utility that comes in handy for Natural Language Processing (NLP) tasks such as cleaning, exploring, or manipulating text. Zipf's Law is at the core of this project. The aim is to provide an easy interface for programmers to extract meaningful information from malformed pieces of text.
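
To give a feel for how Zipf's Law can drive this kind of splitting, here is a minimal sketch. It is not Decat's actual implementation: the names `TOY_VOCAB`, `build_costs` and `segment` are hypothetical, and the tiny vocabulary only stands in for a real frequency-ranked word list. The sketch assumes a word at frequency rank r has probability roughly 1 / (r * ln N), then uses dynamic programming to pick the split with the lowest total cost.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
# A real word list would hold hundreds of thousands of ranked entries.
TOY_VOCAB = ["the", "is", "a", "this", "some", "text", "awesome", "weird"]


def build_costs(ranked_words):
    """Give each word a cost that grows with its frequency rank."""
    n = len(ranked_words)
    # Zipf approximation: P(rank r) ~ 1 / (r * ln n), so cost = ln(r * ln n).
    return {
        word: math.log((rank + 1) * math.log(n))
        for rank, word in enumerate(ranked_words)
    }


def segment(text, costs):
    """Return the lowest-cost split of `text`, or None if no split exists."""
    max_len = max(len(word) for word in costs)
    # best[i] holds (total cost, start index of the last word) for text[:i].
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            word_cost = costs.get(text[j:i], math.inf)
            candidates.append((best[j][0] + word_cost, j))
        best.append(min(candidates))
    if math.isinf(best[-1][0]):
        return None  # some part of the text is not covered by the vocabulary
    # Walk backwards through the stored start indices to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]


COSTS = build_costs(TOY_VOCAB)
print(segment("thisisawesome", COSTS))  # ['this', 'is', 'awesome']
print(segment("someweirdtext", COSTS))  # ['some', 'weird', 'text']
```

In this sketch frequent words are cheap and rare words are expensive, so when several splits are possible the one made of more common words is chosen.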

Get Started

Install It

>> pip install decat

Play With It

>> decat -i someweirdtext
>> ['some', 'weird', 'text']

or

>> python -m decat -i justanotherstring
>> ['just', 'another', 'string']

Use It In Your Projects

Sample Code

from decat import decat


weird_text = '“AnyfoolcanwritecodethatacomputercanunderstandGoodprogrammerswritecodethathumanscanunderstand.”–MartinFowler'
weird_text_simplified = decat(weird_text)
print(weird_text_simplified)

Console

['any', 'fool', 'can', 'write', 'code', 'that', 'a', 'computer', 'can', 'understand', 'good', 'programmers', 'write', 'code', 'that', 'humans', 'can', 'understand', 'martin', 'fowler']

Features

🪶 A lightweight package, built around the features available in the standard library

📚 An ever-expanding vocabulary that covers more than 300K English words

🪃 Simple design, allows for easy expansion to new languages and custom vocabulary sets

Dependencies

⭕️ None 🎉

Limitations

❗ Requires Python >= 3.6

❗ All input will be treated as lower-case

>> ATitleCaseString --> ['a', 'title', 'case', 'string']

❗️ Punctuation marks, numbers, and special characters are stripped from the input and do not appear in the output

>> dummy.email1234@gmail.com --> ['dummy', 'email', 'gmail', 'com']
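
If you want to preview what survives these two limitations before handing text to Decat, a rough sketch like the one below may help. The `normalize` helper is hypothetical and only approximates the behaviour described above, assuming that everything outside the letters a-z is dropped; Decat's own clean-up may differ in detail.

```python
import re


def normalize(text):
    """Lower-case `text` and drop every character that is not a letter a-z."""
    # Assumption: digits, punctuation and other symbols are simply discarded.
    return re.sub(r"[^a-z]", "", text.lower())


print(normalize("ATitleCaseString"))           # atitlecasestring
print(normalize("dummy.email1234@gmail.com"))  # dummyemailgmailcom
```

If you need the original casing, digits, or punctuation later on, keep a copy of the raw string alongside the split result.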

License

MIT
