
A modular, fully-configurable NLP text cleaning function with 15+ toggleable steps.


AdvancedTextCleaner

A single, fully-configurable AdvancedTextCleaner() function for all your NLP preprocessing needs.
Toggle any combination of 15+ cleaning steps — no pipeline boilerplate, no class inheritance, just one function call.


Installation

pip install AdvancedTextCleaner

NLTK data (first run only)

The package downloads required NLTK corpora automatically on first import. If you prefer to do it manually:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
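
If the corpora may already be cached (say, in a Docker image or CI environment), a minimal sketch along these lines probes each resource at its standard NLTK path and downloads only on a miss:

import nltk

# Probe each resource at its standard NLTK location; download only on a miss.
RESOURCES = [
    ('stopwords', 'corpora/stopwords'),
    ('wordnet', 'corpora/wordnet'),
    ('punkt', 'tokenizers/punkt'),
    ('punkt_tab', 'tokenizers/punkt_tab'),
    ('averaged_perceptron_tagger_eng', 'taggers/averaged_perceptron_tagger_eng'),
]

for name, path in RESOURCES:
    try:
        nltk.data.find(path)          # raises LookupError if absent
    except LookupError:
        nltk.download(name, quiet=True)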

Quick Start

from AdvancedTextCleaner import AdvancedTextCleaner

text = "<p>Hello! Visit https://example.com 😊 Don't you love NLP? #AI @user</p>"

# Defaults — basic cleaning
AdvancedTextCleaner(text)
# → "hello visit don't you love nlp ai user"

# Full normalisation
AdvancedTextCleaner(text,
    expand_contractions=True,
    remove_stopwords=True,
    remove_emojis=True,
    remove_hashtags=True,
    remove_mentions=True,
    lemmatize=True
)
# → "hello visit love nlp"

Features

  • 15+ cleaning steps — all individually toggled via keyword arguments
  • Three morphological reducers — POS-aware Lemmatization, Porter Stemmer, Snowball Stemmer
  • Stopword control — NLTK defaults + custom additions + a keep_words whitelist
  • Social media ready — handles @mentions, #hashtags, emojis, and URLs
  • Sentiment-safe mode — preserves !, ?, ' even when stripping all other punctuation
  • Stateless & pipeline-safe — no hidden state, so it works with .apply(), multiprocessing, and inference pipelines

API Reference

AdvancedTextCleaner(
    text: str,

    # Normalisation
    to_lowercase          = True,
    remove_accents        = False,
    expand_contractions   = False,

    # Noise removal
    remove_html           = True,
    remove_urls           = True,
    remove_emails         = True,
    remove_mentions       = False,
    remove_hashtags       = False,
    remove_numbers        = False,
    remove_punctuation    = True,
    remove_extra_spaces   = True,

    # Special characters
    keep_sentiment_markers = False,
    remove_emojis          = False,
    remove_special_chars   = False,

    # Stopwords
    remove_stopwords      = False,
    custom_stopwords      = None,   # set
    keep_words            = None,   # set

    # Morphological reduction
    lemmatize             = False,
    stem_porter           = False,
    stem_snowball         = False,
) -> str

Parameters

Parameter               Type  Default  Description
to_lowercase            bool  True     "Hello" → "hello"
remove_accents          bool  False    "café" → "cafe"
expand_contractions     bool  False    "don't" → "do not"
remove_html             bool  True     "<b>hi</b>" → "hi"
remove_urls             bool  True     Strips http://, https://, www.
remove_emails           bool  True     Strips user@mail.com
remove_mentions         bool  False    Strips @username
remove_hashtags         bool  False    Strips #hashtag
remove_numbers          bool  False    Strips digit sequences
remove_punctuation      bool  True     Strips .,!? etc.
remove_extra_spaces     bool  True     Collapses multiple spaces into one
keep_sentiment_markers  bool  False    Preserves !, ?, ' even when removing punctuation
remove_emojis           bool  False    Strips 😊🔥 etc.
remove_special_chars    bool  False    Keeps only [a-zA-Z0-9 ] — strictest mode
remove_stopwords        bool  False    Strips NLTK English stopwords
custom_stopwords        set   None     Extra domain-specific words to strip
keep_words              set   None     Whitelist — these words are never removed
lemmatize               bool  False    POS-aware WordNet lemmatization
stem_porter             bool  False    Porter Stemmer
stem_snowball           bool  False    Snowball Stemmer
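
The stopword controls combine as follows; a hedged sketch (the exact output depends on the tokeniser, but the set semantics are as documented above):

# Strip NLTK stopwords plus domain noise, but never drop the negation.
AdvancedTextCleaner(
    "I do not like the new app update at all",
    remove_stopwords=True,
    custom_stopwords={'app', 'update'},   # domain-specific additions
    keep_words={'not'},                   # the whitelist beats both lists
)
# expected → "not like new"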

Processing Order

Steps always execute in this fixed order:

1.  Expand contractions
2.  Lowercase
3.  Remove accents
4.  Strip HTML
5.  Remove URLs
6.  Remove emails
7.  Remove @mentions
8.  Remove #hashtags
9.  Remove emojis
10. Remove numbers
11. Remove punctuation / special chars
12. Normalise whitespace
13. Remove stopwords        ← token-level
14. Lemmatize / Stem        ← token-level
15. Final whitespace pass
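
One practical consequence of this order: contraction expansion (step 1) runs before punctuation stripping (step 11), so the apostrophe is still intact when the contraction lookup happens. A small sketch, with the output hedged on the usual behaviour of the contractions package:

AdvancedTextCleaner("I can't wait!", expand_contractions=True)
# expected → "i cannot wait"
# ("can't" expands at step 1; the "!" and the apostrophe are only
#  stripped at step 11, after expansion has already seen them)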

Preset Recipes

Sentiment Analysis

AdvancedTextCleaner(text,
    expand_contractions=True,
    keep_sentiment_markers=True,   # preserve ! ?
    remove_stopwords=False,        # keep "not", "never"
    remove_emojis=False,           # emojis carry sentiment
)

Topic Modelling / TF-IDF

AdvancedTextCleaner(text,
    remove_stopwords=True,
    lemmatize=True,
    remove_numbers=True,
    remove_emojis=True,
)

Bag-of-Words / Classical ML

AdvancedTextCleaner(text,
    remove_accents=True,
    expand_contractions=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_stopwords=True,
    remove_emojis=True,
    stem_porter=True,
)

Social Media Text

AdvancedTextCleaner(text,
    remove_mentions=True,
    remove_hashtags=True,
    remove_emojis=True,
    expand_contractions=True,
    keep_sentiment_markers=True,
)

pandas DataFrame

import pandas as pd
from AdvancedTextCleaner import AdvancedTextCleaner

df['clean'] = df['text'].apply(lambda x: AdvancedTextCleaner(x,
    remove_stopwords=True,
    lemmatize=True
))
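
Because the function carries no state, it can also be fanned out across processes for large corpora. A minimal sketch using functools.partial to bind the flags once (the partial pickles cleanly because AdvancedTextCleaner is a module-level function):

from functools import partial
from multiprocessing import Pool

from AdvancedTextCleaner import AdvancedTextCleaner

# Bind the cleaning flags once, then map over the corpus in parallel.
clean = partial(AdvancedTextCleaner, remove_stopwords=True, lemmatize=True)

if __name__ == '__main__':
    texts = ["First document ...", "Second document ..."]
    with Pool() as pool:
        cleaned = pool.map(clean, texts)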

Priority Rules

Conflicting flags                                        Winner
lemmatize=True + stem_porter=True                        lemmatize
lemmatize=True + stem_snowball=True                      lemmatize
stem_porter=True + stem_snowball=True                    stem_porter
remove_special_chars=True + remove_punctuation=True      remove_special_chars (stricter)
keep_sentiment_markers=True + remove_punctuation=True    !, ?, ' are kept
keep_words={'not'} + remove_stopwords=True               "not" is never removed
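
For instance, requesting both reducers is not an error; per the table, lemmatize simply wins:

# Both reducers set: lemmatize takes priority and stem_porter is ignored,
# so this call behaves exactly like passing lemmatize=True alone.
AdvancedTextCleaner("running studies", lemmatize=True, stem_porter=True)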

Tips

Protect negations in sentiment tasks: "not", "no", "never" flip sentiment entirely. Whitelist them:

AdvancedTextCleaner(text,
    remove_stopwords=True,
    keep_words={'not', 'no', 'never', "n't"}
)

Always expand contractions before stopword removal: otherwise punctuation stripping turns "don't" into "dont", which is not on the stopword list and slips through.
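
A before/after sketch (outputs hedged; they assume NLTK's standard English stopword list, which contains "do" and "not" but not "dont"):

# Without expansion: punctuation stripping yields "dont", which is not
# a stopword and survives.
AdvancedTextCleaner("don't panic", remove_stopwords=True)
# expected → "dont panic"

# With expansion: "do not" is produced first, and both tokens are stopwords.
AdvancedTextCleaner("don't panic", expand_contractions=True, remove_stopwords=True)
# expected → "panic"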

Lemmatize vs Stem: lemmatize produces real dictionary words ("studies" → "study"); stemmers are faster but rougher ("studies" → "studi"). Use lemmatize for interpretable output, stem for speed.
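
Side by side (outputs hedged; the stems shown are the classic Porter results):

AdvancedTextCleaner("she studies daily", lemmatize=True)
# expected → "she study daily"   (real dictionary words)

AdvancedTextCleaner("she studies daily", stem_porter=True)
# expected → "she studi daili"   (truncated stems)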


Dependencies

Package       Purpose
nltk          Tokenization, stopwords, lemmatization, stemming
contractions  Expanding English contractions
