A Python library for cleaning text data
Project description
CleanseText
This is a simple library to help you clean your textual data.
Why do I need this?
Honestly there are several packages out there which do similar things, but they've never really worked well for my use cases or don't have all the features I need. So I decided to make my own.
The API design is made to be readable, and I don't hesitate to create functions even for trivial tasks as they make reaching the goal easier.
How to Install?
pip install cleansetext
Sample usage
from cleansetext.pipeline import Pipeline
from cleansetext.steps import *
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()
# Create a pipeline with a list of preprocessing steps
pipeline = Pipeline([
RemoveEmojis(),
RemoveAllPunctuations(),
RemoveTokensWithOnlyPunctuations(),
ReplaceURLsandHTMLTags(),
ReplaceUsernames(),
RemoveWhiteSpaceOrChunksOfWhiteSpace()
], track_diffs=True)
# Process text
text = "@Mary I hate you and everything about you ...... 🎉🎉 google.com"
text = tk.tokenize(text)
print(text)
# Output: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com']
print(pipeline.process(text))
# Output:
# ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
pipeline.explain(show_diffs=True)
# Output:
# Step 1: Remove emojis from text | Language: en
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', '🎉', '🎉', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 2: Remove all punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com']
# Step 3: Remove tokens with only punctuations from a list of words | Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '...', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com']
# Step 4: Remove URLs and HTML tags from a sentence | Replace with: <URL>
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', 'google.com'] -> ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
# Step 5: Remove usernames from a sentence | Replace with: <USER>
# Diff: ['@Mary', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>'] -> ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
# Step 6: Remove whitespace from a sentence or chunks of whitespace
# Diff: ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>'] -> ['<USER>', 'I', 'hate', 'you', 'and', 'everything', 'about', 'you', '<URL>']
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cleansetext-1.1.0.tar.gz.
File metadata
- Download URL: cleansetext-1.1.0.tar.gz
- Upload date:
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2824a7e22157aad92b373000b4e9cf78d1923ae12f7d79ae0507588eb500d42e
|
|
| MD5 |
2722bb784ea0b45bc8a1d00a8962b194
|
|
| BLAKE2b-256 |
e44bf771f45e4e58de612d0d05f0fd94b9e8707ea888f3ce9b23495d1ef27c5e
|
File details
Details for the file cleansetext-1.1.0-py3-none-any.whl.
File metadata
- Download URL: cleansetext-1.1.0-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
36f5fa4349556cc864b47a63215bdc3fb8a158fb5caa97ed5a12bb3e809398be
|
|
| MD5 |
5b7d3821247bd08d252b106a4ef8098d
|
|
| BLAKE2b-256 |
3918858e827595dc186067c2a4e2c756c66a444e1acd21a4656aed9e6356f79b
|