Project description
text-scrubber is a Python package that offers text scrubbing functionality, providing building blocks for string cleaning as well as normalizing geographical text (countries/states/cities).
Full documentation is available at https://slimmer-ai.github.io/text-scrubber/.
TextScrubber
The TextScrubber class cleans a single string or a collection of strings. It can easily be constructed and configured with building blocks:
from text_scrubber import TextScrubber
ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())
which can then be used as
ts.transform('héLlô there, WòrlD') # outputs 'hello world'
or
ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI']) # outputs ['hello world', 'slimmer AI']
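A configured scrubber can be reused across calls, which makes it convenient as a preprocessing step. As a minimal sketch (only the transform call shown above is used; the duplicate counting is purely illustrative):
from collections import Counter
from text_scrubber import TextScrubber

# Same pipeline as above: strip accents, lowercase, tokenize,
# drop stop words, and join the tokens back into a single string
ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())

raw = ['héLlô there, WòrlD', 'slímm̀er ÀI', 'héLlô there, WòrlD']
cleaned = ts.transform(raw)  # ['hello world', 'slimmer AI', 'hello world']
print(Counter(cleaned))      # Counter({'hello world': 2, 'slimmer AI': 1})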
Geo
The geo module contains functions that normalize geographical data, handling spelling errors, country name variations, etc.:
from text_scrubber.geo import normalize_country, normalize_state, normalize_city
# Countries
normalize_country('Peoples rep. of China') # ['China']
normalize_country('Deutschland') # ['Germany']
normalize_country('st Nevis and Kitties') # ['Saint Kitts and Nevis']
normalize_country('ira') # ['Iran', 'Iraq']
# States
normalize_state('Qld') # [('Queensland', 'Australia')]
normalize_state('AR') # [('Arkansas', 'United States'), ('Arunachal Pradesh', 'India')]
normalize_state('King Kong') # [('Hong Kong', 'China')]
# Cities
normalize_city('Leibnitz') # [('Leibnitz', 'Austria')]
normalize_city('heidelberg') # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
# ('Heidelberg', 'South Africa'), ('Heidelberg', 'United States')]
normalize_city('texas') # [('Texas City', 'United States')]
normalize_city('Pari') # [('Parai', 'Brazil'), ('Paris', 'Canada'), ('Paris', 'France'),
# ('Paris', 'United States'), ('Parit', 'Malaysia'),
# ('Pariz', 'Czech Republic')]
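Each normalize_* function returns a list of candidate matches, so a bit of glue code is typically needed to pick one. A minimal sketch, assuming an empty list is returned when nothing matches (this behaviour is an assumption, not shown above):
from text_scrubber.geo import normalize_country

def best_country(raw):
    # normalize_country returns a list of candidates, e.g. ['Iran', 'Iraq'] for 'ira';
    # here we simply take the first one, or None when the list is empty (assumed behaviour)
    candidates = normalize_country(raw)
    return candidates[0] if candidates else None

best_country('Deutschland')            # 'Germany'
best_country('Peoples rep. of China')  # 'China'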
Documentation
If you want to build the documentation, please install the documentation dependencies by executing:
pip install .[docs]
Documentation can be built by executing:
python setup.py build_docs
Documentation can also be built from the docs folder directly. In that case, text-scrubber should be installed and available in your current working environment. Execute:
make html
in the docs folder.