A small text cleaning package

ngtextpreprocess

ngtextpreprocess is a simple Python package that removes noise from input text and extracts the meaningful information.

Plain tokenization and de-tokenization can make useful information such as sentences, dates, percentages, and monetary values unidentifiable. ngtextpreprocess goes a step further and preserves this information while removing noisy data.

Installation:

To install the package in your local environment, open a terminal inside your project directory and type:

pip install ngtextpreprocess

To upgrade an existing installation, run:

pip install -U ngtextpreprocess
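
To confirm which version ended up in your environment, a small check with the standard library can help. This is a sketch and not part of the package: `installed_version` is a hypothetical helper name, and it assumes the installed distribution is registered under the same name used with pip, `ngtextpreprocess`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("ngtextpreprocess")
    print(found if found else "ngtextpreprocess is not installed")
```

`importlib.metadata` is available from Python 3.8 onwards.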

Usage:

The package comes with a cleaning pipeline that performs all the text cleaning steps in a single call.
The individual methods can also be used on their own for specific text cleaning tasks.

Cleaning pipeline

# import the package
from ngtextpreprocess import CleanText

# initialize the input text
input_text = """
                This is a #1234 sampl writtn 100% on 2022/04/14 ___
                <a href=#> with $100.50 on my @abcd table.</a>
              """

# instantiate the class object by passing the input text
ct = CleanText(input_text=input_text)

# call the cleaning pipeline and get the output
output_text = ct.cleaning_pipeline()

print(output_text)

>> This is a sample written 100% on 2022/04/14 with $100.50 on my table.

You can customize the pipeline by choosing which functions to run; the ones you keep still run in the same sequence.

This works by backward elimination: to drop a function from the pipeline, set its parameter to False.
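
The elimination idea can be illustrated with a toy pipeline in which every step runs unless a keyword flag switches it off. This is only a sketch of the pattern: the step functions and the flag names `set_strip_tags` and `set_squeeze_spaces` are hypothetical and are not parameters of ngtextpreprocess.

```python
import re


def strip_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def squeeze_spaces(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Steps run in this fixed order; each one is tied to a (hypothetical) flag.
STEPS = [
    ("set_strip_tags", strip_tags),
    ("set_squeeze_spaces", squeeze_spaces),
]


def toy_pipeline(text, **flags):
    """Run every step unless its flag is explicitly passed as False."""
    unknown = set(flags) - {name for name, _ in STEPS}
    if unknown:
        raise TypeError(f"unknown flags: {sorted(unknown)}")
    for flag_name, step in STEPS:
        if flags.get(flag_name, True):
            text = step(text)
    return text
```

Calling `toy_pipeline(text)` runs both steps, while `toy_pipeline(text, set_strip_tags=False)` leaves the tags in place and still squeezes the whitespace.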

You can also enable the set_logging parameter to write the logging details to a log file in a dynamically created logging directory.

Here is how it is done.

Using required functions in the pipeline

In this example, we want the name to stay intact in the output, so we disable the remove_name function. We also enable logging to get the log details in the logging directory.

# import
from ngtextpreprocess import CleanText

# initialize the input text
input_text = "This is John Doe from U.S. ."

# instantiate
cleaner = CleanText(input_text)

# call the cleaning_pipeline method
output_text = cleaner.cleaning_pipeline(set_logging=True, set_remove_name=False)

print(output_text)

>> This is John Doe from

As you can see, the name has been preserved and all other possible corrections have been made. The log files have also been generated.
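
When several texts need the same settings, the constructor and `cleaning_pipeline` call from the example can be wrapped in a small helper. The sketch below relies only on those two calls; `clean_many` is a hypothetical name, and the cleaner class is passed in as an argument so the helper does not depend on how the package is imported.

```python
def clean_many(texts, cleaner_cls, **pipeline_flags):
    """Run cleaner_cls(input_text=...).cleaning_pipeline(**flags) on each text."""
    cleaned = []
    for text in texts:
        cleaner = cleaner_cls(input_text=text)
        cleaned.append(cleaner.cleaning_pipeline(**pipeline_flags))
    return cleaned


# Intended use with the package (assumes ngtextpreprocess is installed):
# from ngtextpreprocess import CleanText
# results = clean_many(["first text", "second text"], CleanText,
#                      set_remove_name=False)
```

Passing the flags through as keyword arguments keeps the helper unaware of which parameters the pipeline accepts, so it does not need to change when you disable a different function.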

Individual methods

The following are the individual functions used within the pipeline.

For Text Beautification

  1. Cleaning HTML
  2. Fixing ASCII decoding errors
  3. Removing Bullets
  4. Replacing Hexcodes
  5. Removing Symbols and Emojis
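
To give a feel for what the beautification steps involve, here is a rough standard-library approximation of three of them: HTML cleaning, bullet removal, and symbol and emoji removal. It is not the package's implementation; the function names and the character rules are assumptions made for illustration.

```python
import html
import re
import unicodedata

# Common bullet glyphs at the start of a line (an illustrative, incomplete set).
BULLET_RE = re.compile(r"^[ \t]*[\u2022\u25E6\u25AA\u2023*\-][ \t]+", flags=re.MULTILINE)
HTML_TAG_RE = re.compile(r"<[^>]+>")


def clean_html(text):
    """Replace tags with spaces, then decode entities such as &amp;."""
    return html.unescape(HTML_TAG_RE.sub(" ", text))


def remove_bullets(text):
    """Strip a leading bullet marker from every line."""
    return BULLET_RE.sub("", text)


def remove_symbols(text):
    """Drop characters in Unicode category 'So', which covers most pictographic emoji."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")
```

Filtering on the `So` category leaves currency signs and percent signs alone, since they belong to other Unicode categories. That matches the goal of keeping monetary values and percentages readable.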

For Personal Information Removal

  1. Removing Personal Names
  2. Removing Contact Addresses
  3. Removing Contact Numbers
  4. Removing e-mail addresses
  5. Removing social-media tags
  6. Removing URLs
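
Some of the items in this list are pattern-based, so a regular-expression sketch can convey the idea. The patterns below for e-mail addresses, URLs, and social-media tags are deliberately simple assumptions, not the package's own rules, and `strip_contact_patterns` is a hypothetical name. Name and address removal usually needs more than regular expressions and is left out.

```python
import re

# Order matters: e-mails go first so the "@domain" part is not mistaken for a tag.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"\b(?:https?://|www\.)\S+", flags=re.IGNORECASE)
SOCIAL_TAG_RE = re.compile(r"(?<!\w)[@#]\w+")


def strip_contact_patterns(text):
    """Remove e-mail addresses, URLs, and @mentions / #hashtags, then tidy spaces."""
    for pattern in (EMAIL_RE, URL_RE, SOCIAL_TAG_RE):
        text = pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

These patterns will miss unusual formats and can over-match, for example when a URL is followed directly by a full stop, so treat them as a starting point only.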

For Text Correction

  1. Expanding Domain-specific short-forms (currently, the financial domain is covered)
  2. Expanding General short-forms
  3. Fixing Contractions
  4. Removing Punctuations
  5. Removing Extra Whitespaces
  6. Fixing Spelling errors
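
Two of the correction steps, fixing contractions and removing extra whitespace, are easy to approximate with a lookup table and a regular expression. The sketch below shows how such steps can work in general and is not the package's code; the tiny `CONTRACTIONS` table and the function names are made up for illustration.

```python
import re

# A deliberately tiny lookup table; a real one would be far larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def fix_contractions(text):
    """Replace known contractions, keeping a leading capital letter."""
    def expand(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        return expanded.capitalize() if word[0].isupper() else expanded
    return CONTRACTION_RE.sub(expand, text)


def remove_extra_whitespace(text):
    """Collapse whitespace runs and drop spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)
```

Spelling correction and short-form expansion need dictionaries or domain word lists, so they are not sketched here.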


