A modular, fully-configurable NLP text cleaning function with 15+ toggleable steps.
AdvancedTextCleaner
A single, fully-configurable AdvancedTextCleaner() function for all your NLP preprocessing needs.
Toggle any combination of 15+ cleaning steps — no pipeline boilerplate, no class inheritance, just one function call.
Installation
pip install AdvancedTextCleaner
NLTK data (first run only)
The package downloads required NLTK corpora automatically on first import. If you prefer to do it manually:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
Quick Start
from AdvancedTextCleaner import AdvancedTextCleaner
text = "<p>Hello! Visit https://example.com 😊 Don't you love NLP? #AI @user</p>"
# Defaults — basic cleaning
AdvancedTextCleaner(text)
# → "hello visit don't you love nlp ai user"
# Full normalisation
AdvancedTextCleaner(text,
    expand_contractions=True,
    remove_stopwords=True,
    remove_emojis=True,
    remove_hashtags=True,
    remove_mentions=True,
    lemmatize=True
)
# → "hello visit love nlp"
Features
- 15+ cleaning steps — all individually toggled via keyword arguments
- Three morphological reducers — POS-aware lemmatization, Porter Stemmer, Snowball Stemmer
- Stopword control — NLTK defaults + custom additions + a `keep_words` whitelist
- Social media ready — handles `@mentions`, `#hashtags`, emojis, and URLs
- Sentiment-safe mode — preserves `!`, `?`, `'` even when stripping all other punctuation
- Stateless & pipeline-safe — safe to use with `.apply()`, multiprocessing, and inference pipelines
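The sentiment-safe idea can be pictured with a short stdlib sketch. This is an illustration of the concept, not the package's internal code:

```python
import string

def strip_punct_keep_sentiment(text: str, keep_sentiment_markers: bool = True) -> str:
    """Remove punctuation, optionally preserving the sentiment-bearing !, ?, '."""
    keep = "!?'" if keep_sentiment_markers else ""
    # Characters to delete: all ASCII punctuation minus the kept markers.
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_delete))

print(strip_punct_keep_sentiment("Wow!!! Isn't this great?"))         # Wow!!! Isn't this great?
print(strip_punct_keep_sentiment("Wow!!! Isn't this great?", False))  # Wow Isnt this great
```

With the markers kept, exclamation and question marks (strong sentiment cues for models like VADER) survive cleaning.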
API Reference
AdvancedTextCleaner(
    text: str,
    # Normalisation
    to_lowercase = True,
    remove_accents = False,
    expand_contractions = False,
    # Noise removal
    remove_html = True,
    remove_urls = True,
    remove_emails = True,
    remove_mentions = False,
    remove_hashtags = False,
    remove_numbers = False,
    remove_punctuation = True,
    remove_extra_spaces = True,
    # Special characters
    keep_sentiment_markers = False,
    remove_emojis = False,
    remove_special_chars = False,
    # Stopwords
    remove_stopwords = False,
    custom_stopwords = None,  # set
    keep_words = None,        # set
    # Morphological reduction
    lemmatize = False,
    stem_porter = False,
    stem_snowball = False,
) -> str
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `to_lowercase` | `bool` | `True` | `"Hello"` → `"hello"` |
| `remove_accents` | `bool` | `False` | `"café"` → `"cafe"` |
| `expand_contractions` | `bool` | `False` | `"don't"` → `"do not"` |
| `remove_html` | `bool` | `True` | `<b>hi</b>` → `"hi"` |
| `remove_urls` | `bool` | `True` | Strips `http://`, `https://`, `www.` |
| `remove_emails` | `bool` | `True` | Strips `user@mail.com` |
| `remove_mentions` | `bool` | `False` | Strips `@username` |
| `remove_hashtags` | `bool` | `False` | Strips `#hashtag` |
| `remove_numbers` | `bool` | `False` | Strips digit sequences |
| `remove_punctuation` | `bool` | `True` | Strips `.,!?` etc. |
| `remove_extra_spaces` | `bool` | `True` | Collapses multiple spaces into one |
| `keep_sentiment_markers` | `bool` | `False` | Preserves `!`, `?`, `'` even when removing punctuation |
| `remove_emojis` | `bool` | `False` | Strips 😊🔥 etc. |
| `remove_special_chars` | `bool` | `False` | Keeps only `[a-zA-Z0-9 ]` — strictest mode |
| `remove_stopwords` | `bool` | `False` | Strips NLTK English stopwords |
| `custom_stopwords` | `set` | `None` | Extra domain-specific words to strip |
| `keep_words` | `set` | `None` | Whitelist — these words are never removed |
| `lemmatize` | `bool` | `False` | POS-aware WordNet lemmatization |
| `stem_porter` | `bool` | `False` | Porter Stemmer |
| `stem_snowball` | `bool` | `False` | Snowball Stemmer |
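Accent removal of the `"café"` → `"cafe"` kind is conventionally done via Unicode decomposition. A minimal stdlib sketch of that idea (the package's exact implementation may differ):

```python
import unicodedata

def remove_accents(text: str) -> str:
    # NFKD decomposes accented characters (é → e + combining acute accent),
    # after which the combining marks can be filtered out.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_accents("café naïve Müller"))  # cafe naive Muller
```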
Processing Order
Steps always execute in this fixed order:
1. Expand contractions
2. Lowercase
3. Remove accents
4. Strip HTML
5. Remove URLs
6. Remove emails
7. Remove @mentions
8. Remove #hashtags
9. Remove emojis
10. Remove numbers
11. Remove punctuation / special chars
12. Normalise whitespace
13. Remove stopwords ← token-level
14. Lemmatize / Stem ← token-level
15. Final whitespace pass
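The fixed order is what makes flag combinations predictable; in particular, contraction expansion (step 1) must run before stopword removal (step 13). A minimal stdlib sketch with tiny hand-rolled contraction and stopword sets (illustrative only, not the package's internals) shows why:

```python
# Tiny illustrative maps; the real package uses the contractions library and NLTK.
CONTRACTIONS = {"don't": "do not", "can't": "can not"}
STOPWORDS = {"do", "not", "you", "i"}

def mini_clean(text: str, expand_first: bool) -> str:
    tokens = text.lower().split()
    if expand_first:
        # Expand each contraction, splitting it into its component words.
        tokens = [w for t in tokens for w in CONTRACTIONS.get(t, t).split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

print(mini_clean("Don't you love NLP", expand_first=True))   # love nlp
print(mini_clean("Don't you love NLP", expand_first=False))  # don't love nlp
```

Without expansion, `"don't"` is not in the stopword list and slips through, exactly the failure mode the Tips section warns about.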
Preset Recipes
Sentiment Analysis
AdvancedTextCleaner(text,
    expand_contractions=True,
    keep_sentiment_markers=True,  # preserve ! ?
    remove_stopwords=False,       # keep "not", "never"
    remove_emojis=False,          # emojis carry sentiment
)
Topic Modelling / TF-IDF
AdvancedTextCleaner(text,
    remove_stopwords=True,
    lemmatize=True,
    remove_numbers=True,
    remove_emojis=True,
)
Bag-of-Words / Classical ML
AdvancedTextCleaner(text,
    remove_accents=True,
    expand_contractions=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_stopwords=True,
    remove_emojis=True,
    stem_porter=True,
)
Social Media Text
AdvancedTextCleaner(text,
    remove_mentions=True,
    remove_hashtags=True,
    remove_emojis=True,
    expand_contractions=True,
    keep_sentiment_markers=True,
)
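The mention and hashtag stripping in this recipe can be approximated with a single regular expression. This is an illustrative stand-in, not necessarily the package's actual pattern:

```python
import re

def strip_social(text: str) -> str:
    # Drop @mentions and #hashtags: a handle character followed by word characters.
    text = re.sub(r"[@#]\w+", "", text)
    # Collapse whatever whitespace the removals left behind.
    return re.sub(r"\s+", " ", text).strip()

print(strip_social("Loving this! @user check #NLP #AI"))  # Loving this! check
```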
pandas DataFrame
import pandas as pd
from AdvancedTextCleaner import AdvancedTextCleaner
df['clean'] = df['text'].apply(lambda x: AdvancedTextCleaner(x,
    remove_stopwords=True,
    lemmatize=True
))
Priority Rules
| Conflicting flags | Winner |
|---|---|
| `lemmatize=True` + `stem_porter=True` | `lemmatize` |
| `lemmatize=True` + `stem_snowball=True` | `lemmatize` |
| `stem_porter=True` + `stem_snowball=True` | `stem_porter` |
| `remove_special_chars=True` + `remove_punctuation=True` | `remove_special_chars` (stricter) |
| `keep_sentiment_markers=True` + `remove_punctuation=True` | `!`, `?`, `'` are kept |
| `keep_words={'not'}` + `remove_stopwords=True` | `"not"` is never removed |
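The last rule (the whitelist beats stopword removal) is easy to sketch in plain Python with an illustrative stopword subset, not the package's code:

```python
STOPWORDS = {"not", "the", "is", "a"}  # illustrative subset of NLTK's list

def remove_stopwords(tokens, keep_words=None):
    keep = keep_words or set()
    # Whitelist wins: a word in keep_words survives even if it is a stopword.
    return [t for t in tokens if t in keep or t not in STOPWORDS]

print(remove_stopwords("this is not a drill".split()))                      # ['this', 'drill']
print(remove_stopwords("this is not a drill".split(), keep_words={"not"}))  # ['this', 'not', 'drill']
```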
Tips
Protect negations in sentiment tasks — "not", "no", "never" flip sentiment entirely. Whitelist them:
AdvancedTextCleaner(text,
    remove_stopwords=True,
    keep_words={'not', 'no', 'never', "n't"}
)
Always expand contractions before stopword removal — without it "don't" may slip past the stopword list.
Lemmatize vs Stem — lemmatize produces real dictionary words (running → run); stemmers are faster but rougher (running → runn). Use lemmatize for interpretable output, stem for speed.
Dependencies
| Package | Purpose |
|---|---|
| `nltk` | Tokenization, stopwords, lemmatization, stemming |
| `contractions` | Expanding English contractions |
Download files
File details
Details for the file advancedtextcleaner-0.1.tar.gz.
File metadata
- Download URL: advancedtextcleaner-0.1.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `44c8b57c0a4b876d8cd615950e9aa1c334ce7e41df5b76f68def813f9c4d07fd` |
| MD5 | `af5c2ed2903a7a788f9f81c9ff0ea434` |
| BLAKE2b-256 | `f2b86e0bf8157fd605239773d0602e5d9fca27a93ca2b8afe55cd043f4412a0d` |
File details
Details for the file advancedtextcleaner-0.1-py3-none-any.whl.
File metadata
- Download URL: advancedtextcleaner-0.1-py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `566580f9187dcfe917fcf67bf0f37061ee1c990184aa9bbb15b3f8af5c238da7` |
| MD5 | `5f1cdd02b0eb6ddb66085c59cd4a1c78` |
| BLAKE2b-256 | `ddb72f2645507b54818eda49ad9cd35e187effcfd04c492f4bd6275bd7eacb2c` |