
text-scrubber is a Python package that offers text scrubbing functionality, providing building blocks for string cleaning as well as normalizing geographical text (countries/states/cities).

Full documentation is available at https://sybrenjansen.github.io/text-scrubber/.

TextScrubber

The TextScrubber class cleans a single string or a collection of strings. It is easily constructed and configured from building blocks:

from text_scrubber import TextScrubber

ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())

which can then be used as:

ts.transform('héLlô there, WòrlD')  # outputs 'hello world'

or with an iterable of input:

ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI'])  # outputs ['hello world', 'slimmer AI']

For a complete list of building blocks please refer to the TextScrubber API reference.
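The chained building blocks above are roughly equivalent to the following plain-Python sketch (a simplification for intuition only; the stop-word set here is an illustrative stand-in, not the library's actual list):

```python
import unicodedata

STOP_WORDS = {'there', 'the', 'a', 'an'}  # illustrative subset

def scrub(text: str) -> str:
    # to_ascii: decompose accented characters and drop the combining marks
    ascii_text = (unicodedata.normalize('NFKD', text)
                  .encode('ascii', 'ignore').decode('ascii'))
    # lowercase + tokenize (punctuation handling kept minimal here)
    tokens = ascii_text.lower().replace(',', ' ').split()
    # remove stop words, then join back into a single string
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

scrub('héLlô there, WòrlD')  # 'hello world'
```

The real building blocks are configurable (e.g., custom stop-word lists and tokenizers); see the API reference for the options.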

Geo

The text_scrubber.geo module contains functions to normalize geographical names, handling spelling errors, country name variations, etc.:

from text_scrubber.geo import normalize_country, normalize_region, normalize_city

"""
Countries
"""

normalize_country('Peoples rep. of China')
# [Location(canonical_name='China', matched_name='Peoples Republic of China', country=None,
#           score=1.0)]

normalize_country('Deutschland')
# [Location(canonical_name='Germany', matched_name='Deutschland', country=None, score=1.0)]

normalize_country('st Nevis and Kitties')
# [Location(canonical_name='Saint Kitts and Nevis', matched_name='Saint Kitts and Nevis',
#           country=None, score=0.75)]

normalize_country('ira')
# [Location(canonical_name='Iran', matched_name='Iran', country=None, score=0.857...),
#  Location(canonical_name='Iraq', matched_name='Iraq', country=None, score=0.857...)]

"""
Cities
"""

normalize_city('Leibnitz', ['Austria'])
# [Location(canonical_name='Leibnitz', matched_name='Leibnitz', country='Austria', score=1.0)]

normalize_city('heidelberg')
# [Location(canonical_name='Heidelberg', matched_name='Heidelberg', country='Germany',
#           score=1.0),
#  Location(canonical_name='Heidelberg', matched_name='Heidelberg', country='South Africa',
#           score=1.0),
#  Location(canonical_name='Heidelberg', matched_name='Heidelberg', country='United States',
#           score=1.0)]

normalize_city('ohioo', ['US'])
# [Location(canonical_name='Ohio', matched_name='Ohio', country='United States',
#           score=0.888...)]

normalize_city('Madri', ['Spain', 'US', 'Brazil'])
# [Location(canonical_name='Madrid', matched_name='Madrid', country='Spain',
#           score=0.909...),
#  Location(canonical_name='Madrid', matched_name='Madrid', country='United States',
#           score=0.909...),
#  Location(canonical_name='Mari', matched_name='Mari', country='Brazil',
#           score=0.888...)]

"""
Regions
"""

normalize_region('triangle park', ['US'])
# [Location(canonical_name='The Triangle Park', matched_name='The Triangle Park',
#           country='United States', score=1.0)]

normalize_region('Fur', ['Denmark'])
# [Location(canonical_name='Fur', matched_name='Fur', country='Denmark', score=1.0)]

normalize_region('texel', ['NL'])
# [Location(canonical_name='Texel', matched_name='Texel', country='Netherlands', score=1.0)]

Each of the above normalization functions returns the canonical name, the matched name, and the match score; when normalizing cities or regions, the result also contains the corresponding country. Canonical and matched names can differ because some countries, cities, and regions have alternative names: e.g., NYC maps to New York City. For the (misspelled) query NYCC, the canonical name will be New York City, but the matched name NYC. Match scores are always between 0.0 and 1.0, where 1.0 is a perfect match. If a known mapping exists, like Deutschland to Germany, the match score is 1.0.
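The fractional scores shown above (0.857..., 0.888..., 0.909...) are consistent with a normalized character-overlap ratio such as difflib's SequenceMatcher.ratio(); this is an assumption about the scoring, shown here only to build intuition for the numbers:

```python
from difflib import SequenceMatcher

def match_score(query: str, candidate: str) -> float:
    # ratio = 2 * matching_chars / (len(query) + len(candidate))
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

match_score('ira', 'iran')      # 0.857... = 2*3 / (3+4)
match_score('ohioo', 'ohio')    # 0.888... = 2*4 / (5+4)
match_score('Madri', 'Madrid')  # 0.909... = 2*5 / (5+6)
```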

The text_scrubber.geo module also contains functions to find the name of places (country, region, and city) in text dealing with spelling errors, country name variations, etc.:

from text_scrubber.geo import (find_city_in_string, find_country_in_string,
                               find_region_in_string)

"""
Countries
"""

find_country_in_string("Institute of German study, Accra, Ghana")
# [ExtractedLocation(location=Location(canonical_name='Ghana', matched_name='Ghana',
#                                      country=None, score=1.0),
#                    substring='Ghana', substring_range=Range(start=34, end=39)),
#  ExtractedLocation(location=Location(canonical_name='Germany', matched_name='Germany',
#                                      country=None, score=0.923...),
#                    substring='German', substring_range=Range(start=13, end=19))]

find_country_in_string("Peking University, 5 Yiheyuan Rd, "
                       "Haidian District, Beijing, CH, 100871")
# This was a trick question though, as CH=Switzerland. China is CN
# [ExtractedLocation(location=Location(canonical_name='Switzerland', matched_name='CH',
#                                      country=None, score=1.0),
#                    substring='CH', substring_range=Range(start=61, end=63))]

"""
Cities
"""

find_city_in_string("Météorage Pau France", {"France"})
# [ExtractedLocation(location=Location(canonical_name='Pau', matched_name='Pau',
#                                      country='France', score=1.0),
#                    substring='Pau', substring_range=Range(start=10, end=13)),
#  ExtractedLocation(location=Location(canonical_name='La Frasnée', matched_name='Фране',
#                                      country='France', score=0.909...),
#                    substring='France', substring_range=Range(start=14, end=20))]

find_city_in_string("Bavarian Environment Agency, Hans Högn Straße 12, "
                    "95030 Hof Saale, Bavaria, Germany", {"Germany"})
# [ExtractedLocation(location=Location(canonical_name='Hof', matched_name='Hof',
#                                      country='Germany', score=1.0),
#                    substring='Hof', substring_range=Range(start=56, end=59)),
#  ExtractedLocation(location=Location(canonical_name='Saal', matched_name='Saal',
#                                      country='Germany', score=0.888...),
#                    substring='Saale', substring_range=Range(start=60, end=65)),
#  ExtractedLocation(location=Location(canonical_name='Trassem', matched_name='Trassem',
#                                      country='Germany', score=0.857...),
#                    substring='Straße', substring_range=Range(start=39, end=45))]

"""
Regions
"""

find_region_in_string("Fur Museum, 7884 Fur, Denmark.", {"Denmark"})
# [ExtractedLocation(location=Location(canonical_name='Fur', matched_name='Fur',
#                                      country='Denmark', score=1.0),
#                    substring='Fur', substring_range=Range(start=0, end=3)),
#  ExtractedLocation(location=Location(canonical_name='Fur', matched_name='Fur',
#                                      country='Denmark', score=1.0),
#                    substring='Fur', substring_range=Range(start=17, end=20)),
#  ExtractedLocation(location=Location(canonical_name='Kingdom of Denmark',
#                                      matched_name='Denmark', country='Denmark', score=1.0),
#                    substring='Denmark', substring_range=Range(start=22, end=29))]

find_region_in_string("Department of Biological Oceanography, Royal Netherlands Institute "
                      "for Sea Research (NIOZ), Texel, The Netherlands", {"Netherlands"})
# [ExtractedLocation(location=Location(canonical_name='Kingdom of the Netherlands',
#                                      matched_name='Netherlands', country='Netherlands',
#                                      score=1.0),
#                    substring='Netherlands', substring_range=Range(start=45, end=56)),
#  ExtractedLocation(location=Location(canonical_name='Texel', matched_name='Texel',
#                                      country='Netherlands', score=1.0),
#                    substring='Texel', substring_range=Range(start=92, end=97)),
#  ExtractedLocation(location=Location(canonical_name='Kingdom of the Netherlands',
#                                      matched_name='Netherlands', country='Netherlands',
#                                      score=1.0),
#                    substring='Netherlands', substring_range=Range(start=103, end=114))]
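Conceptually, the find_* functions scan substrings of the input and fuzzy-match each against the known names for the given countries, reporting the matched substring and its character range. The following is a rough, hypothetical sketch of that idea using a tiny hard-coded name set and a simple per-token scan (the library's actual candidate generation and indexing are not shown in this README):

```python
from difflib import SequenceMatcher

KNOWN_CITIES = {'Pau', 'Paris'}  # illustrative subset

def find_city_candidates(text: str, threshold: float = 0.8):
    results = []
    start = 0
    for token in text.split():
        # locate this token in the original string to recover its range
        start = text.index(token, start)
        for city in KNOWN_CITIES:
            score = SequenceMatcher(None, token.lower(), city.lower()).ratio()
            if score >= threshold:
                results.append((city, token, (start, start + len(token)), score))
        start += len(token)
    return results

find_city_candidates('Météorage Pau France')
# [('Pau', 'Pau', (10, 13), 1.0)]
```

Note that the recovered range (10, 13) matches the Range(start=10, end=13) in the find_city_in_string example above.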

Resource loading

Resources for cities and regions aren’t all loaded when you import text_scrubber; they’re loaded on the fly, per country. This means the first query involving a given country can take a while; subsequent queries involving the same country (or countries) will be much faster. You can load resources for specific countries in advance by using:

from text_scrubber.geo import (add_city_resources, add_region_resources,
                               normalize_country_to_country_codes)

country_codes = normalize_country_to_country_codes(['Netherlands', 'China', 'USA'])
add_city_resources(country_codes)
add_region_resources(country_codes, progress_bar=True)
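This lazy per-country loading follows a common memoization pattern, sketched below in plain Python (the loader function and its return value are hypothetical stand-ins, not the library's internals):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_city_resources(country_code: str) -> dict:
    # stand-in for reading a per-country data file from disk
    print(f'loading resources for {country_code}...')
    return {'code': country_code}

load_city_resources('NL')  # slow the first time: the loader actually runs
load_city_resources('NL')  # cached: returns immediately without reloading
```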

Cleaning

There are clean functions available for countries/regions/cities, which all follow the same cleaning pipeline:

from text_scrubber.geo import clean_country, clean_region, clean_city

clean_country('cent afr rep.')     # 'central african republic'
clean_region('Hyōgo')              # 'hyogo'
clean_city('płońsk')               # 'plonsk'
clean_city('neustadt/westerwald')  # 'neustadt westerwald'
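A rough stdlib approximation of this pipeline is sketched below: transliterate, lowercase, and normalize separators. Note that some characters (like the Polish ł) have no Unicode decomposition, so they need an explicit mapping; the table here is illustrative, and abbreviation expansion (as in 'cent afr rep.') is a separate step not shown:

```python
import unicodedata

# characters with no NFKD decomposition, mapped explicitly (illustrative)
EXTRA = str.maketrans({'ł': 'l', 'Ł': 'L', 'ß': 'ss'})

def clean_place(name: str) -> str:
    name = name.translate(EXTRA)
    # strip accents: decompose, then drop the non-ASCII combining marks
    name = (unicodedata.normalize('NFKD', name)
            .encode('ascii', 'ignore').decode('ascii'))
    # treat separators as spaces and collapse whitespace
    for sep in '/-':
        name = name.replace(sep, ' ')
    return ' '.join(name.lower().split())

clean_place('płońsk')               # 'plonsk'
clean_place('neustadt/westerwald')  # 'neustadt westerwald'
clean_place('Hyōgo')                # 'hyogo'
```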

Documentation

If you want to build the documentation, please install the documentation dependencies by executing:

pip install .[docs]

Documentation can be built by executing:

python setup.py build_docs

Documentation can also be built from the docs folder directly. In that case, text-scrubber should be installed and available in your current working environment. Execute:

make html

in the docs folder.
