
text-scrubber is a Python package that offers text scrubbing functionality, providing building blocks for string cleaning as well as normalizing geographical text (countries/states/cities).

Full documentation is available at https://slimmer-ai.github.io/text-scrubber/.

TextScrubber

The TextScrubber class cleans a single string or a collection of strings. It can be easily constructed and configured with building blocks:

from text_scrubber import TextScrubber

ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())

which can then be used as

ts.transform('héLlô there, WòrlD')  # outputs 'hello world'

or

ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI'])  # outputs ['hello world', 'slimmer AI']
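Each building-block method returns the scrubber itself, which is what makes the chained construction above possible. A minimal, hypothetical sketch of that builder pattern (not the library's actual implementation; the stop-word list and punctuation handling here are illustrative only) could look like:

```python
import unicodedata

class MiniScrubber:
    """Illustrative sketch of a chainable text-scrubbing pipeline."""

    STOP_WORDS = {'there', 'the', 'a', 'an'}  # toy stop-word list

    def __init__(self):
        self._ops = []

    def _add(self, op):
        # Append an operation and return self so calls can be chained
        self._ops.append(op)
        return self

    def to_ascii(self):
        # Decompose accented characters and drop the non-ASCII marks
        return self._add(lambda s: unicodedata.normalize('NFKD', s)
                         .encode('ascii', 'ignore').decode('ascii'))

    def lowercase(self):
        return self._add(str.lower)

    def tokenize(self):
        return self._add(str.split)

    def remove_stop_words(self):
        return self._add(lambda tokens: [t for t in tokens
                                         if t not in self.STOP_WORDS])

    def join(self):
        return self._add(' '.join)

    def transform(self, text):
        # Run the configured operations in order
        for op in self._ops:
            text = op(text)
        return text

ts = MiniScrubber().to_ascii().lowercase().tokenize().remove_stop_words().join()
ts.transform('héLlô WòrlD')  # 'hello world'
```

Note that this sketch does no punctuation stripping; the real TextScrubber offers a much richer set of building blocks.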

Geo

The geo module contains functions for normalizing geographical data, which handle spelling errors, country name variations, and the like:

from text_scrubber.geo import normalize_country, normalize_state, normalize_city

# Countries
normalize_country('Peoples rep. of China')  # ['China']
normalize_country('Deutschland')            # ['Germany']
normalize_country('st Nevis and Kitties')   # ['Saint Kitts and Nevis']
normalize_country('ira')                    # ['Iran', 'Iraq']

# States
normalize_state('Qld')         # [('Queensland', 'Australia')]
normalize_state('AR')          # [('Arkansas', 'United States'), ('Arunachal Pradesh', 'India')]
normalize_state('King Kong')   # [('Hong Kong', 'China')]

# Cities
normalize_city('Leibnitz')    # [('Leibnitz', 'Austria')]
normalize_city('heidelberg')  # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
                              #  ('Heidelberg', 'South Africa'), ('Heidelberg', 'United States')]
normalize_city('texas')       # [('Texas City', 'United States')]
normalize_city('Pari')        # [('Parai', 'Brazil'), ('Paris', 'Canada'), ('Paris', 'France'),
                              #  ('Paris', 'United States'), ('Parit', 'Malaysia'),
                              #  ('Pariz', 'Czech Republic')]
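Note that these functions return lists: an exact or variant lookup yields a single candidate (e.g. 'Deutschland' maps to Germany), while an inexact query like 'ira' falls back to fuzzy matching and can yield several. The general idea can be approximated with the standard library's difflib (an illustrative sketch only; the variant map and candidate list below are toy data, not the package's actual algorithm or resources):

```python
from difflib import get_close_matches

# Toy lookup data, for illustration only
COUNTRY_VARIANTS = {
    'deutschland': 'Germany',
    'peoples rep. of china': 'China',
}
COUNTRY_NAMES = ['Germany', 'China', 'Iran', 'Iraq', 'France']

def normalize_country_sketch(query):
    """Return candidate canonical country names for a raw query string."""
    q = query.strip().lower()
    # First try known spelling variants for an exact hit
    if q in COUNTRY_VARIANTS:
        return [COUNTRY_VARIANTS[q]]
    # Otherwise fall back to fuzzy matching against canonical names
    return sorted(get_close_matches(q.title(), COUNTRY_NAMES, n=3, cutoff=0.6))

normalize_country_sketch('Deutschland')  # ['Germany']
normalize_country_sketch('ira')          # ['Iran', 'Iraq']
```

The real package also handles tokenized and cleaned comparisons (which is how 'st Nevis and Kitties' still resolves), so this sketch only conveys the variant-plus-fuzzy-fallback shape of the lookup.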

Documentation

If you want to build the documentation, please install the documentation dependencies by executing:

pip install .[docs]

The documentation can then be built by executing:

python setup.py build_docs

The documentation can also be built directly from the docs folder. In that case, text-scrubber must be installed and available in your current working environment. Execute:

make html

in the docs folder.
