text-scrubber is a Python package that offers text scrubbing functionality, providing building blocks for string cleaning as well as normalizing geographical text (countries/states/cities).
Full documentation is available at https://slimmer-ai.github.io/text-scrubber/.
TextScrubber
The TextScrubber class cleans a single string or a collection of strings. It can be easily constructed and configured with building blocks:
from text_scrubber import TextScrubber
ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())
which can then be used as:
ts.transform('héLlô there, WòrlD') # outputs 'hello world'
or with an iterable of input:
ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI']) # outputs ['hello world', 'slimmer AI']
For a complete list of building blocks please refer to the TextScrubber API reference.
Geo
The text_scrubber.geo module contains functions to normalize geographical data, which deal with spelling errors, country name variations, etc.:
from text_scrubber.geo import normalize_country, normalize_region, normalize_city
# Countries
normalize_country('Peoples rep. of China') # [('China', 1.0)]
normalize_country('Deutschland') # [('Germany', 1.0)]
normalize_country('st Nevis and Kitties') # [('Saint Kitts and Nevis', 0.75)]
normalize_country('ira') # [('Iran', 0.857), ('Iraq', 0.857)]
# Cities
normalize_city('Leibnitz', ['Austria']) # [('Leibnitz', 'Austria', 1.0)]
normalize_city('heidelberg') # [('Heidelberg', 'Germany', 1.0),
# ('Heidelberg', 'South Africa', 1.0),
# ('Heidelberg', 'United States', 1.0)]
normalize_city('ohioo', ['US']) # [('Ohio', 'United States', 0.889)]
normalize_city('Madri', ['Spain', 'US', 'Brazil']) # [('Madrid', 'Spain', 0.909),
# ('Madrid', 'United States', 0.909),
# ('Mari', 'Brazil', 0.889)]
# Regions
normalize_region('triangle park', ['US']) # [('The Triangle Park', 'United States', 1.0)]
normalize_region('Fur', ['Denmark']) # [('Fur', 'Denmark', 1.0)]
normalize_region('texel', ['NL']) # [('Texel', 'Netherlands', 1.0)]
Each of the above normalization functions returns the match score as the last entry in each tuple. Scores are always between 0.0 and 1.0, where 1.0 is a perfect match. If a known mapping exists, like Deutschland to Germany, the match score will be 1.0.
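These scores make it easy to filter out weak candidates or rank the remaining ones. A minimal sketch, using hypothetical candidate data shaped like the documented output of normalize_country (the helper function and threshold are illustrative, not part of the library):

```python
# Hypothetical candidates, shaped like normalize_country('ira') output
candidates = [('Iran', 0.857), ('Iraq', 0.857), ('Ireland', 0.5)]

def filter_candidates(candidates, threshold=0.8):
    """Keep only candidates whose match score meets the threshold,
    best matches first."""
    kept = [(name, score) for name, score in candidates if score >= threshold]
    return sorted(kept, key=lambda c: c[1], reverse=True)

filter_candidates(candidates)  # [('Iran', 0.857), ('Iraq', 0.857)]
```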
The text_scrubber.geo module also contains functions to find the names of places (countries, regions, and cities) in text, again dealing with spelling errors, country name variations, etc.:
from text_scrubber.geo import (find_city_in_string, find_country_in_string,
find_region_in_string)
# Countries
find_country_in_string("Institute of German study, Accra, Ghana")
# Returns: [Match(substring_range=(34, 39), substring='Ghana',
# normalized='Ghana', score=1.0),
# Match(substring_range=(13, 19), substring='German',
# normalized='Germany', score=0.923)]
find_country_in_string("Peking University, 5 Yiheyuan Rd, "
"Haidian District, Beijing, CH, 100871")
# Returns: [Match(substring_range=(61, 63), substring="CH",
# normalized="China", score=1.0)]
# Cities
find_city_in_string("Météorage Pau France", {"France"})
# Returns: [Match(substring_range=(10, 13), substring="Pau",
# normalized=("Pau", "France"), score=1.0),
# Match(substring_range=(14, 20), substring="France",
# normalized=("La Frasnée", "France"), score=0.909)]
find_city_in_string("Bavarian Environment Agency, Hans Högn Straße 12, "
                    "95030 Hof Saale, Bavaria, Germany", {"Germany"})
# Returns: [Match(substring_range=(56, 59), substring='Hof',
# normalized=('Hof', 'Germany'), score=1.0),
# Match(substring_range=(39, 45), substring="Straße",
# normalized=("Trassem", "Germany"), score=0.857)]
# Regions
find_region_in_string("Fur Museum, 7884 Fur, Denmark.", {"Denmark"})
# Returns: [Match(substring_range=(0, 3), substring='Fur',
# normalized=('Fur', 'Denmark'), score=1.0),
# Match(substring_range=(17, 20), substring='Fur',
# normalized=('Fur', 'Denmark'), score=1.0),
# Match(substring_range=(22, 29), substring='Denmark',
# normalized=('Kingdom of Denmark', 'Denmark'), score=1.0)]
find_region_in_string("Department of Biological Oceanography, Royal Netherlands Institute "
"for Sea Research (NIOZ), Texel, The Netherlands", {"Netherlands"})
# Returns: [Match(substring_range=(45, 56), substring='Netherlands',
# normalized=('Kingdom of the Netherlands', 'Netherlands'), score=1.0),
# Match(substring_range=(92, 97), substring='Texel',
# normalized=('Texel', 'Netherlands'), score=1.0),
# Match(substring_range=(103, 114), substring='Netherlands',
# normalized=('Kingdom of the Netherlands', 'Netherlands'), score=1.0)]
Cleaning
Clean functions are available for countries/regions/cities, all of which follow the same cleaning pipeline:
from text_scrubber.geo import clean_country, clean_region, clean_city
clean_country('cent afr rep.') # 'central african republic'
clean_region('Hyōgo') # 'hyogo'
clean_city('płońsk') # 'plonsk'
clean_city('neustadt/westerwald') # 'neustadt westerwald'
Documentation
If you want to build the documentation, please install the documentation dependencies by executing:
pip install .[docs]
The documentation can then be built by executing:
python setup.py build_docs
The documentation can also be built from the docs folder directly. In that case, text-scrubber should be installed and available in your current working environment. Execute:
make html
in the docs folder.