A package to help quickly clean text data
Project description
CleanseText
NOTE: THE LIBRARY IS CURRENTLY IN PRERELEASE, AND SEVERAL FEATURES MIGHT STILL BE BROKEN.
This is a simple library to help you clean your textual data.
Why do I need this?
Honestly, there are several packages out there that do similar things, but they have never really worked well for my use cases, or they lack features I need. So I decided to make my own.
The API is designed to be readable, and I don't hesitate to create functions even for trivial tasks, since they make reaching the goal easier.
How to Install?
pip install cleansetext
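To confirm the install worked, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for illustration and is not part of cleansetext.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed.

    Hypothetical helper for illustration; not part of cleansetext.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("cleansetext"))
```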
Sample usage
from cleansetext.pipeline import Pipeline
from cleansetext.steps import *
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()
# Create a pipeline with a list of preprocessing steps
pipeline = Pipeline([
    RemoveEmojis(),
    RemoveAllPunctuations(),
    RemoveTokensWithOnlyPunctuations(),
    ReplaceURLsandHTMLTags(),
    ReplaceUsernames(),
    RemoveWhiteSpaceOrChunksOfWhiteSpace()
], track_diffs=True)
# Process text
text = "@Mary I hate you and everything about you ...... 🎉🎉 google.com"
text = tk.tokenize(text)
print(text)
# Output: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com']
print(pipeline.process(text))
# Output:
# ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
pipeline.explain(show_diffs=True)
# Output:
# Step 1: Remove emojis from text | Language: en
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 2: Remove all punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 3: Remove tokens with only punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com']
# Step 4: Remove URLs and HTML tags from a sentence | Replace with: <URL>
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
# Step 5: Remove usernames from a sentence | Replace with: <USER>
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>'] -> ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
# Step 6: Remove whitespace from a sentence or chunks of whitespace
# Diff: ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>'] -> ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
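The idea behind the sample above (steps applied in order, with a before/after diff recorded per step) can be sketched in a few lines of plain Python. This is not cleansetext's implementation: `Step`, `MiniPipeline`, `drop_punctuation_only`, and `replace_usernames` are hypothetical names invented for illustration, and the username rule (any token starting with `@`) is an assumption. The punctuation set is `string.punctuation`, which matches the set printed in the explain output above.

```python
import string
from typing import Callable, List, Tuple

Tokens = List[str]


class Step:
    """A described transformation over a token list (hypothetical, not the cleansetext API)."""

    def __init__(self, description: str, func: Callable[[Tokens], Tokens]) -> None:
        self.description = description
        self.func = func


class MiniPipeline:
    """Runs steps in order; optionally records a (before, after) pair per step."""

    def __init__(self, steps: List[Step], track_diffs: bool = False) -> None:
        self.steps = steps
        self.track_diffs = track_diffs
        self.diffs: List[Tuple[Tokens, Tokens]] = []

    def process(self, tokens: Tokens) -> Tokens:
        self.diffs = []  # reset diffs from any previous run
        for step in self.steps:
            before = list(tokens)
            tokens = step.func(before)
            if self.track_diffs:
                self.diffs.append((before, list(tokens)))
        return tokens

    def explain(self) -> List[str]:
        """Return one description line per step, plus a diff line when tracked."""
        lines = []
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Step {i}: {step.description}")
            if self.track_diffs and i <= len(self.diffs):
                before, after = self.diffs[i - 1]
                lines.append(f"Diff: {before} -> {after}")
        return lines


def drop_punctuation_only(tokens: Tokens) -> Tokens:
    # Drops tokens made up entirely of punctuation (empty tokens are dropped too).
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


def replace_usernames(tokens: Tokens) -> Tokens:
    # Assumption: a username is any token longer than one character that starts with "@".
    return ["<USER>" if t.startswith("@") and len(t) > 1 else t for t in tokens]


mini = MiniPipeline(
    [
        Step("Drop punctuation-only tokens", drop_punctuation_only),
        Step("Replace usernames with <USER>", replace_usernames),
    ],
    track_diffs=True,
)
result = mini.process(["@Mary", "I", "hate", "you", "...", "google.com"])
print(result)
for line in mini.explain():
    print(line)
```

Keeping each step as a small function makes it easy to reorder steps or inspect exactly which step changed a token, which is the purpose of the diff tracking shown in the sample.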
Download files
Source Distribution
cleansetext-0.0.6.tar.gz (6.5 kB)
Built Distribution
Hashes for cleansetext-0.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c47014358b3897b567cae5db7c86a8e8e0825d3891d94223572813f93a58fcbb
MD5 | 76fa3d996786319cfe13ce9da0589665
BLAKE2b-256 | 0e99f15649d87772ff738b0595b7d7d012e2a6732fed172fbcbda0ba024b5fb0