Skip to main content

A package to help quickly clean text data

Project description

TextHelp

This is a simple library to help you clean your textual data.

Why do I need this?

Honestly there are several packages out there which do similar things, but they've never really worked well for my use cases or don't have all the features I need. So I decided to make my own.

The API design is made to be readable, and I don't hesitate to create functions even for trivial tasks as they make reaching the goal easier.

Sample usage

from pipeline import Pipeline
from steps import *

# Create a pipeline with a list of preprocessing steps
pipeline = Pipeline([
    RemoveAllPunctuations(),
    RemoveAllNonAlphabeticCharacters(),
    RemoveTokensWithMajorityNonAlphabeticCharacters(),
], track_diffs=True)

# Process text
text = ['.', 'this', 'is', 'a', '.', 'test', '.', '....', '99a']
print(pipeline.process(text))

# Output:
# ['this', 'is', 'a', 'test']

pipeline.explain(show_diffs=True)

# Output:
# Step 1: Remove all punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['.', 'this', 'is', 'a', '.', 'test', '.', '....', '99a'] -> ['this', 'is', 'a', 'test', '....', '99a']
# Step 2: Remove all non alphabetic characters from a list of words
# Diff: ['this', 'is', 'a', 'test', '....', '99a'] -> ['this', 'is', 'a', 'test']
# Step 3: Remove tokens with majority non alphabetic characters from a list of words | Threshold: 0.5
# Diff: ['this', 'is', 'a', 'test'] -> ['this', 'is', 'a', 'test']

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cleansetext-0.0.1.tar.gz (5.7 kB view hashes)

Uploaded Source

Built Distribution

cleansetext-0.0.1-py3-none-any.whl (6.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page