A lightweight and reusable text preprocessing package for NLP tasks

🧹 textcleaner-partha

A lightweight and reusable text preprocessing package for NLP tasks. It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.

✨ Features

•	✅ HTML tag and emoji removal
•	✅ Stopword removal
•	✅ Contraction expansion (e.g., “can’t” → “cannot”)
•	✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
•	✅ Spelling correction with autocorrect
•	✅ Lemmatization using spaCy (en_core_web_sm)
•	✅ Filters out stopwords, punctuation, and numbers
•	✅ Retains only nouns, verbs, adjectives, and adverbs
•	✅ Returns the cleaned text as tokens via get_tokens()
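
To make the first few features concrete, here is a minimal standard-library sketch of tag removal, emoji removal, and contraction and abbreviation expansion. It is not the package's implementation: `sketch_clean`, the two lookup tables, and the regular expressions are hypothetical stand-ins, and the real package relies on contractions, autocorrect, and spaCy for these steps.

```python
import html
import re

# Tiny illustrative lookup tables (hypothetical; the package itself uses the
# contractions library and its own abbreviation handling).
CONTRACTIONS = {"can't": "cannot", "it's": "it is"}
ABBREVIATIONS = {"asap": "as soon as possible"}

TAG_RE = re.compile(r"<[^>]+>")
# Covers the main emoji blocks only; this is not an exhaustive emoji matcher.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def sketch_clean(text: str) -> str:
    """Strip tags and emojis, lowercase, then expand known short forms."""
    text = html.unescape(TAG_RE.sub(" ", text))  # drop tags, decode entities
    text = EMOJI_RE.sub(" ", text)
    text = text.lower().replace("\u2019", "'")   # normalise curly apostrophes
    words = []
    for word in text.split():                    # split() also collapses whitespace
        word = CONTRACTIONS.get(word, word)
        word = ABBREVIATIONS.get(word, word)
        words.append(word)
    return " ".join(words)


print(sketch_clean("I can't believe it's already raining! 😞 <p>Click here</p>"))
# i cannot believe it is already raining! click here
```

The sketch stops before spelling correction and lemmatization, which need the external libraries listed under Dependencies.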

🚀 Installation

From PyPI:

pip install --upgrade textcleaner-partha

Install directly from GitHub:

pip install git+https://github.com/partha6369/textcleaner.git

🧠 Usage

from textcleaner_partha import preprocess

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)

# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(cleaned_partial)

To get tokens instead of cleaned text, use get_tokens() with the same options:

from textcleaner_partha import get_tokens

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
tokens = get_tokens(text)
print(tokens)

# Custom usage with optional features disabled
tokens_partial = get_tokens(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(tokens_partial)
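
The calls above handle one string at a time. A small wrapper for cleaning a whole list of documents might look like the sketch below. `clean_corpus` is a hypothetical helper, not part of the package, and it takes the cleaning function as an argument so that it runs even without the package installed.

```python
from typing import Callable, Iterable, List


def clean_corpus(docs: Iterable[str], cleaner: Callable[[str], str]) -> List[str]:
    """Apply cleaner to every document, dropping results that end up empty."""
    cleaned = []
    for doc in docs:
        result = cleaner(doc)
        if result:  # skip documents that were cleaned away entirely
            cleaned.append(result)
    return cleaned


# Stand-in cleaner so the sketch runs on its own.
demo = clean_corpus(["  Hello World  ", "   "], lambda s: s.strip().lower())
print(demo)  # ['hello world']
```

With the package installed, passing preprocess as the cleaner (clean_corpus(docs, preprocess)) should work the same way, assuming preprocess returns a string as the printed examples suggest.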

🔧 Parameters

The preprocess() and get_tokens() functions offer flexible control over each text cleaning step. You can selectively enable or disable operations using the parameters below:

def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)

def get_tokens(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
)
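
Because both functions share the same keyword arguments, one way to reuse a set of choices is to build the options once and unpack them into either call. The sketch below copies the defaults from the signatures above. `DEFAULT_OPTIONS` and `make_options` are hypothetical names for illustration, not part of the package.

```python
# Defaults copied from the signatures shown above.
DEFAULT_OPTIONS = {
    "lowercase": True,
    "remove_html": True,
    "remove_emoji": True,
    "remove_whitespace": True,
    "remove_punct": False,
    "expand_contraction": True,
    "expand_abbrev": True,
    "correct_spelling": True,
    "lemmatise": True,
}


def make_options(**overrides):
    """Return a full keyword-argument dict, rejecting unknown option names."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


# A lighter profile that skips spelling correction and the spaCy stage.
light = make_options(correct_spelling=False, lemmatise=False)
print(light["lemmatise"])  # False
```

The result can then be unpacked into a call, for example preprocess(text, **light). Validating the names up front catches slips such as lemmatize (American spelling) before any text is processed; the parameter is spelled lemmatise.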

📦 Dependencies

•	spacy
•	autocorrect
•	contractions

You can install them manually or via the included requirements.txt:

pip install -r requirements.txt

Then download the required spaCy model:

python -m spacy download en_core_web_sm
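
A quick way to confirm that the dependencies and the spaCy model are visible to the current interpreter is to look them up without importing them. This is a sketch using only the standard library. `missing_modules` and `REQUIRED` are hypothetical names, and the check assumes each dependency is importable under the name listed above.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above, plus the spaCy model,
# which is installed as an importable package of the same name.
REQUIRED = ["spacy", "autocorrect", "contractions", "en_core_web_sm"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

If en_core_web_sm shows up as missing, rerun the download command above in the same environment.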

📄 License

MIT License © Dr. Partha Majumdar

Download files

Source Distribution

textcleaner_partha-1.1.1.tar.gz (17.4 kB)

Uploaded Source

Built Distribution

textcleaner_partha-1.1.1-py3-none-any.whl (12.3 kB)

Uploaded Python 3

File details

Details for the file textcleaner_partha-1.1.1.tar.gz.

File metadata

  • Download URL: textcleaner_partha-1.1.1.tar.gz
  • Upload date:
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.12

File hashes

Hashes for textcleaner_partha-1.1.1.tar.gz
Algorithm Hash digest
SHA256 68c0ca83491c110698e1517570c4e46ef4f39f2e5b86fa7d3a37c8f1eb5cbd39
MD5 8de7870ed5d9268ac3d500be93addc0f
BLAKE2b-256 200307cc5991cd427c545e929ccb83c4de9f354e3b573ffc4a9337d5dc3aad4c
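
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do it. `sha256_of` is a hypothetical helper, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest listed above for the source distribution.
EXPECTED = "68c0ca83491c110698e1517570c4e46ef4f39f2e5b86fa7d3a37c8f1eb5cbd39"

# Usage, assuming the archive is in the current directory:
# assert sha256_of("textcleaner_partha-1.1.1.tar.gz") == EXPECTED
```

Reading in chunks keeps memory use flat regardless of file size, although it makes little difference for an archive this small.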

File details

Details for the file textcleaner_partha-1.1.1-py3-none-any.whl.

File hashes

Hashes for textcleaner_partha-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 039822a604ab86deeab9b4a37eaea2c0169f70f543590aa28d0af1725838a14b
MD5 278ce2c5efdba4437b5bec8b073aad0e
BLAKE2b-256 9fd0c6422c7caeed8efeeee2c1365e71028ebf768d47a3ace97e2abea551edf2
