
A modular, fully-configurable NLP text cleaning function with 15+ toggleable steps.


AdvancedTextCleaner

A single, fully-configurable AdvancedTextCleaner() function for all your NLP preprocessing needs.
Toggle any combination of 15+ cleaning steps — no pipeline boilerplate, no class inheritance, just one function call.


Installation

pip install AdvancedTextCleaner

NLTK data (first run only)

The package downloads required NLTK corpora automatically on first import. If you prefer to do it manually:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
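
If the corpora may already be cached (say, in a Docker image or CI environment), a minimal sketch along these lines probes each resource at its standard NLTK path and downloads only on a miss:

import nltk

# Probe each resource at its standard NLTK location; download only on a miss.
RESOURCES = [
    ('stopwords', 'corpora/stopwords'),
    ('wordnet', 'corpora/wordnet'),
    ('punkt', 'tokenizers/punkt'),
    ('punkt_tab', 'tokenizers/punkt_tab'),
    ('averaged_perceptron_tagger_eng', 'taggers/averaged_perceptron_tagger_eng'),
]

for name, path in RESOURCES:
    try:
        nltk.data.find(path)          # raises LookupError if absent
    except LookupError:
        nltk.download(name, quiet=True)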

Quick Start

from AdvancedTextCleaner import AdvancedTextCleaner

text = "<p>Hello! Visit https://example.com 😊 Don't you love NLP? #AI @user</p>"

# Defaults — basic cleaning
AdvancedTextCleaner(text)
# → "hello visit don't you love nlp ai user"

# Full normalisation
AdvancedTextCleaner(text,
    expand_contractions=True,
    remove_stopwords=True,
    remove_emojis=True,
    remove_hashtags=True,
    remove_mentions=True,
    lemmatize=True
)
# → "hello visit love nlp"

Features

  • 15+ cleaning steps — all individually toggled via keyword arguments
  • Three morphological reducers — POS-aware Lemmatization, Porter Stemmer, Snowball Stemmer
  • Stopword control — NLTK defaults + custom additions + a keep_words whitelist
  • Social media ready — handles @mentions, #hashtags, emojis, and URLs
  • Sentiment-safe mode — preserves !, ?, ' even when stripping all other punctuation
  • Stateless & pipeline-safe — no hidden state, so it works with .apply(), multiprocessing, and inference pipelines

API Reference

AdvancedTextCleaner(
    text: str,

    # Normalisation
    to_lowercase          = True,
    remove_accents        = False,
    expand_contractions   = False,

    # Noise removal
    remove_html           = True,
    remove_urls           = True,
    remove_emails         = True,
    remove_mentions       = False,
    remove_hashtags       = False,
    remove_numbers        = False,
    remove_punctuation    = True,
    remove_extra_spaces   = True,

    # Special characters
    keep_sentiment_markers = False,
    remove_emojis          = False,
    remove_special_chars   = False,

    # Stopwords
    remove_stopwords      = False,
    custom_stopwords      = None,   # set
    keep_words            = None,   # set

    # Morphological reduction
    lemmatize             = False,
    stem_porter           = False,
    stem_snowball         = False,
) -> str

Parameters

Parameter               Type  Default  Description
to_lowercase            bool  True     "Hello" → "hello"
remove_accents          bool  False    "café" → "cafe"
expand_contractions     bool  False    "don't" → "do not"
remove_html             bool  True     "<b>hi</b>" → "hi"
remove_urls             bool  True     Strips http://, https://, www.
remove_emails           bool  True     Strips user@mail.com
remove_mentions         bool  False    Strips @username
remove_hashtags         bool  False    Strips #hashtag
remove_numbers          bool  False    Strips digit sequences
remove_punctuation      bool  True     Strips .,!? etc.
remove_extra_spaces     bool  True     Collapses multiple spaces into one
keep_sentiment_markers  bool  False    Preserves !, ?, ' even when removing punctuation
remove_emojis           bool  False    Strips 😊🔥 etc.
remove_special_chars    bool  False    Keeps only [a-zA-Z0-9 ] — strictest mode
remove_stopwords        bool  False    Strips NLTK English stopwords
custom_stopwords        set   None     Extra domain-specific words to strip
keep_words              set   None     Whitelist — these words are never removed
lemmatize               bool  False    POS-aware WordNet lemmatization
stem_porter             bool  False    Porter Stemmer
stem_snowball           bool  False    Snowball Stemmer
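
The stopword controls combine as follows; a hedged sketch (the exact output depends on the tokeniser, but the set semantics are as documented above):

# Strip NLTK stopwords plus domain noise, but never drop the negation.
AdvancedTextCleaner(
    "I do not like the new app update at all",
    remove_stopwords=True,
    custom_stopwords={'app', 'update'},   # domain-specific additions
    keep_words={'not'},                   # the whitelist beats both lists
)
# expected → "not like new"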

Processing Order

Steps always execute in this fixed order:

1.  Expand contractions
2.  Lowercase
3.  Remove accents
4.  Strip HTML
5.  Remove URLs
6.  Remove emails
7.  Remove @mentions
8.  Remove #hashtags
9.  Remove emojis
10. Remove numbers
11. Remove punctuation / special chars
12. Normalise whitespace
13. Remove stopwords        ← token-level
14. Lemmatize / Stem        ← token-level
15. Final whitespace pass
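
One practical consequence of this order: contraction expansion (step 1) runs before punctuation stripping (step 11), so the apostrophe is still intact when the contraction lookup happens. A small sketch, with the output hedged on the usual behaviour of the contractions package:

AdvancedTextCleaner("I can't wait!", expand_contractions=True)
# expected → "i cannot wait"
# ("can't" expands at step 1; the "!" and the apostrophe are only
#  stripped at step 11, after expansion has already seen them)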

Preset Recipes

Sentiment Analysis

AdvancedTextCleaner(text,
    expand_contractions=True,
    keep_sentiment_markers=True,   # preserve ! ?
    remove_stopwords=False,        # keep "not", "never"
    remove_emojis=False,           # emojis carry sentiment
)

Topic Modelling / TF-IDF

AdvancedTextCleaner(text,
    remove_stopwords=True,
    lemmatize=True,
    remove_numbers=True,
    remove_emojis=True,
)

Bag-of-Words / Classical ML

AdvancedTextCleaner(text,
    remove_accents=True,
    expand_contractions=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_stopwords=True,
    remove_emojis=True,
    stem_porter=True,
)

Social Media Text

AdvancedTextCleaner(text,
    remove_mentions=True,
    remove_hashtags=True,
    remove_emojis=True,
    expand_contractions=True,
    keep_sentiment_markers=True,
)

pandas DataFrame

import pandas as pd
from AdvancedTextCleaner import AdvancedTextCleaner

df['clean'] = df['text'].apply(lambda x: AdvancedTextCleaner(x,
    remove_stopwords=True,
    lemmatize=True
))
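
Because the function carries no state, it can also be fanned out across processes for large corpora. A minimal sketch using functools.partial to bind the flags once (the partial pickles cleanly because AdvancedTextCleaner is a module-level function):

from functools import partial
from multiprocessing import Pool

from AdvancedTextCleaner import AdvancedTextCleaner

# Bind the cleaning flags once, then map over the corpus in parallel.
clean = partial(AdvancedTextCleaner, remove_stopwords=True, lemmatize=True)

if __name__ == '__main__':
    texts = ["First document ...", "Second document ..."]
    with Pool() as pool:
        cleaned = pool.map(clean, texts)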

Priority Rules

Conflicting flags                                        Winner
lemmatize=True + stem_porter=True                        lemmatize
lemmatize=True + stem_snowball=True                      lemmatize
stem_porter=True + stem_snowball=True                    stem_porter
remove_special_chars=True + remove_punctuation=True      remove_special_chars (stricter)
keep_sentiment_markers=True + remove_punctuation=True    !, ?, ' are kept
keep_words={'not'} + remove_stopwords=True               "not" is never removed
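
For instance, requesting both reducers is not an error; per the table, lemmatize simply wins:

# Both reducers set: lemmatize takes priority and stem_porter is ignored,
# so this call behaves exactly like passing lemmatize=True alone.
AdvancedTextCleaner("running studies", lemmatize=True, stem_porter=True)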

Tips

Protect negations in sentiment tasks: "not", "no", "never" flip sentiment entirely. Whitelist them:

AdvancedTextCleaner(text,
    remove_stopwords=True,
    keep_words={'not', 'no', 'never', "n't"}
)

Always expand contractions before stopword removal: otherwise punctuation stripping turns "don't" into "dont", which is not on the stopword list and slips through.
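
A before/after sketch (outputs hedged; they assume NLTK's standard English stopword list, which contains "do" and "not" but not "dont"):

# Without expansion: punctuation stripping yields "dont", which is not
# a stopword and survives.
AdvancedTextCleaner("don't panic", remove_stopwords=True)
# expected → "dont panic"

# With expansion: "do not" is produced first, and both tokens are stopwords.
AdvancedTextCleaner("don't panic", expand_contractions=True, remove_stopwords=True)
# expected → "panic"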

Lemmatize vs Stem: lemmatize produces real dictionary words ("studies" → "study"); stemmers are faster but rougher ("studies" → "studi"). Use lemmatize for interpretable output, stem for speed.
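
Side by side (outputs hedged; the stems shown are the classic Porter results):

AdvancedTextCleaner("she studies daily", lemmatize=True)
# expected → "she study daily"   (real dictionary words)

AdvancedTextCleaner("she studies daily", stem_porter=True)
# expected → "she studi daili"   (truncated stems)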


Dependencies

Package       Purpose
nltk          Tokenization, stopwords, lemmatization, stemming
contractions  Expanding English contractions
