Project description
text-scrubber is a Python package that offers text scrubbing functionality, providing building blocks for string cleaning as well as normalizing geographical text (countries/states/cities).
Full documentation is available at https://slimmer-ai.github.io/text-scrubber/.
TextScrubber
The TextScrubber class cleans a single string or a collection of strings. It can easily be constructed and configured with building blocks:
from text_scrubber import TextScrubber
ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())
which can then be used as
ts.transform('héLlô there, WòrlD') # outputs 'hello world'
or
ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI']) # outputs ['hello world', 'slimmer AI']
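A configured scrubber can be reused across calls, which makes it convenient as a preprocessing step. As a minimal sketch (only the transform call shown above is used; the duplicate counting is purely illustrative):
from collections import Counter
from text_scrubber import TextScrubber

# Same pipeline as above: strip accents, lowercase, tokenize,
# drop stop words, and join the tokens back into a single string
ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())

raw = ['héLlô there, WòrlD', 'slímm̀er ÀI', 'héLlô there, WòrlD']
cleaned = ts.transform(raw)  # ['hello world', 'slimmer AI', 'hello world']
print(Counter(cleaned))      # Counter({'hello world': 2, 'slimmer AI': 1})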
Geo
The geo module contains functions that normalize geographical data, handling spelling errors, country name variations, etc.:
from text_scrubber.geo import normalize_country, normalize_state, normalize_city
# Countries
normalize_country('Peoples rep. of China') # ['China']
normalize_country('Deutschland') # ['Germany']
normalize_country('st Nevis and Kitties') # ['Saint Kitts and Nevis']
normalize_country('ira') # ['Iran', 'Iraq']
# States
normalize_state('Qld') # [('Queensland', 'Australia')]
normalize_state('AR') # [('Arkansas', 'United States'), ('Arunachal Pradesh', 'India')]
normalize_state('King Kong') # [('Hong Kong', 'China')]
# Cities
normalize_city('Leibnitz') # [('Leibnitz', 'Austria')]
normalize_city('heidelberg') # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
# ('Heidelberg', 'South Africa'), ('Heidelberg', 'United States')]
normalize_city('texas') # [('Texas City', 'United States')]
normalize_city('Pari') # [('Parai', 'Brazil'), ('Paris', 'Canada'), ('Paris', 'France'),
# ('Paris', 'United States'), ('Parit', 'Malaysia'),
# ('Pariz', 'Czech Republic')]
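Each normalize_* function returns a list of candidate matches, so a bit of glue code is typically needed to pick one. A minimal sketch, assuming an empty list is returned when nothing matches (this behaviour is an assumption, not shown above):
from text_scrubber.geo import normalize_country

def best_country(raw):
    # normalize_country returns a list of candidates, e.g. ['Iran', 'Iraq'] for 'ira';
    # here we simply take the first one, or None when the list is empty (assumed behaviour)
    candidates = normalize_country(raw)
    return candidates[0] if candidates else None

best_country('Deutschland')            # 'Germany'
best_country('Peoples rep. of China')  # 'China'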
Documentation
If you want to build the documentation, please install the documentation dependencies by executing:
pip install .[docs]
Documentation can be built by executing:
python setup.py build_docs
Documentation can also be built from the docs folder directly. In that case, text-scrubber should be installed and available in your current working environment. Execute:
make html
in the docs folder.