
get rid of unicode punctuation and other garbage from strings


Micro-library to quickly clean up text for textmining purposes. Mostly devoted to getting rid of Unicode punctuation, random extra spaces, and so on.

This is meant as a very quick and dirty cleanup process (hence the name). It throws out numbers, currency symbols, math symbols, punctuation, superscripts, subscripts, etc. The objective is primarily to eliminate an annoying step in bag-of-words style textmining.

Basic Usage

Usage is really simple:

from dirtyclean import clean
clean("some string, full of punctuation & stuff!")
# expected: "some string full of punctuation stuff"

Right now, the only option given is whether or not to convert combined letters (letters with accent marks, umlauts, etc.) into their Unicode primitives. The default behavior is not to, and hence to retain all characters that might actually distinguish different words.

If you want to get rid of umlauts and such, you can try clean("somestring", simplify_letters=True). However, I don’t guarantee that letter simplification will work. If it breaks, please file an issue or, better yet, a PR.
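
For example (a hedged sketch; this assumes the decomposition strips accents in the usual NFKD style, which, per the above, is not guaranteed):

from dirtyclean import clean
# Default: accented characters are kept, so distinct words stay distinct.
clean("naïve café")  # expected: "naïve café"
# With simplification, accents should be stripped:
clean("naïve café", simplify_letters=True)  # expected: "naive cafe"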

More Details

This micro-library just tries to take a string with arbitrary Unicode in it and emit a string with all the words, separated by single spaces, where “word” is defined as “continuous sequence of letters” and “letter” is defined as “stuff in the Unicode categories Lu, Ll, Lt, and Lo.” It does not lowercase or anything like that; that’s a single method call you can do yourself (also, I don’t want to throw away case information for users who need it).
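
To make that concrete, here’s a minimal sketch of the same idea (an illustration of the category-based definition above, not the library’s actual implementation):

import re
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo"}

def sketch_clean(text):
    # Replace every character outside the letter categories with a
    # space, then collapse runs of whitespace into single spaces.
    letters_only = "".join(
        c if unicodedata.category(c) in LETTER_CATEGORIES else " "
        for c in text
    )
    return re.sub(r"\s+", " ", letters_only).strip()

sketch_clean("It costs $5.99, really!")  # "It costs really"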

One known problem is that hyphenated words and contractions will be split up. English as actually used (not sure about other languages) employs “-”, “’”, and various other characters with similar shapes both to split words and to serve as ordinary punctuation, and there’s no simple and quick way to tell those uses apart, at least not without some kind of dictionary. So I just let stuff like “don’t” be two words. If this is a problem for your use case, maybe try a fancy NLTK tokenizer?
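
For instance (assuming NLTK and its punkt tokenizer data are installed; the output below is NLTK’s usual Treebank-style tokenization):

from nltk.tokenize import word_tokenize
# NLTK keeps hyphenated words intact and keeps contraction pieces
# attached to their apostrophes instead of dropping them:
word_tokenize("don't be injury-prone")
# ['do', "n't", 'be', 'injury-prone']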

One possible heuristic solution would be to replace the punctuation and such with a placeholder on the translation pass, and then, on the regular expression pass, replace the placeholder with a space iff there’s a space (but not a newline, because of line-break hyphenation) on either side. This could work, but would make the regex a bit more convoluted, and would smush together rather than separate compound constructions that are normally hyphenated. For example, “injury-prone” would become “injuryprone” rather than the more sensible “injury prone.” Nonetheless, it’s worth a try as an alternative implementation, perhaps under a flag like heuristic_hyphens=True. I might implement this down the line, but I’d welcome a PR to do it before I get there.
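
Here’s a rough sketch of what that heuristic might look like (hypothetical code, not part of the library; the heuristic_hyphens name is just the proposed flag above):

import re

PLACEHOLDER = "\x00"  # assumes the input never contains a NUL character

def heuristic_hyphens(text):
    # Translation pass: swap hyphens and apostrophes for a placeholder.
    text = re.sub(r"[-'\u2019]", PLACEHOLDER, text)
    # Line-break hyphenation: drop the placeholder and the newline to
    # rejoin the word ("exam-\nple" becomes "example").
    text = re.sub(PLACEHOLDER + r"\n", "", text)
    # A placeholder with a plain space on either side is ordinary
    # punctuation, so it becomes a space (any extra spaces would get
    # collapsed by the main cleaning pass anyway).
    text = re.sub(" " + PLACEHOLDER + "|" + PLACEHOLDER + " ", " ", text)
    # Whatever is left sits between letters, so removing it smushes the
    # word together: "injury-prone" becomes "injuryprone", "don't" "dont".
    return text.replace(PLACEHOLDER, "")

heuristic_hyphens("don't be injury-prone")  # "dont be injuryprone"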

Contributing, license, tests, etc.

In general, this is a work in progress, and pull requests are very much desired. Please add a test for any functionality you add, and please keep the default behavior the same (except for bugfixes) when no new arguments are passed (i.e., for new functionality, just add another optional argument).

Tests are bare unittest; I use nose2 for testing, but you can use whatever. Run the tests from the test directory so they can find the text file in there.

MIT license.
