
A lightweight and reusable text preprocessing package for NLP tasks


🧹 textcleaner-partha


A lightweight and reusable text preprocessing package for NLP tasks. It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.

✨ Features

•	✅ HTML tag and emoji removal
•	✅ Contraction expansion (e.g., “can’t” → “cannot”)
•	✅ Spelling correction with autocorrect
•	✅ Lemmatization using spaCy (en_core_web_sm)
•	✅ Filters out stopwords, punctuation, and numbers
•	✅ Retains only nouns, verbs, adjectives, and adverbs
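
To make the first feature concrete, here is a minimal standard-library sketch of HTML tag and emoji removal. It is not the package's own implementation: the regular expressions, the Unicode ranges, and the helper name `strip_html_and_emoji` are assumptions chosen for illustration.

```python
import re

# Assumption: a simple regex is enough for well-formed tags; it is not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")

# Assumption: these Unicode ranges cover most common emoji, not every one.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "]+"
)


def strip_html_and_emoji(text: str) -> str:
    """Replace tags and emoji with spaces, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


print(strip_html_and_emoji("I can't believe it's already raining! 😞 <p>Click here</p>"))
# prints: I can't believe it's already raining! Click here
```

Replacing matches with a space rather than an empty string keeps words on either side of a tag from being glued together.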

🚀 Installation

From PyPI:

pip install textcleaner-partha

Install directly from GitHub:

pip install git+https://github.com/partha6369/textcleaner.git

🧠 Usage

from textcleaner_partha import preprocess

text = "I can't believe it's already raining! 😞 <p>Click here</p>"

# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)

# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,            # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,     # Skip spelling correction
    expand_contraction=False    # Skip contraction expansion
)
print(cleaned_partial)
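
For more than one document, a small wrapper can apply the same options to every text. The sketch below is hypothetical: `clean_corpus` and `demo_clean` are illustrative names, not part of the package, and the wrapper assumes the cleaning function returns a string.

```python
from typing import Callable, Iterable, List


def clean_corpus(texts: Iterable[str], clean_fn: Callable[..., str], **options) -> List[str]:
    """Apply clean_fn to each text, dropping entries that are empty after cleaning."""
    cleaned = []
    for text in texts:
        result = clean_fn(text, **options)
        if result:
            cleaned.append(result)
    return cleaned


# Stand-in cleaner so the sketch runs without the package installed.
def demo_clean(text: str, lowercase: bool = True) -> str:
    text = text.strip()
    return text.lower() if lowercase else text


print(clean_corpus(["  Hello World ", "   ", "Rainy Day"], demo_clean))
# prints: ['hello world', 'rainy day']
```

With the package installed, passing `preprocess` as `clean_fn` (for example `clean_corpus(docs, preprocess, correct_spelling=False)`) should work the same way, provided `preprocess` returns a string as the usage example suggests.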

🔧 Parameters

The preprocess() function offers flexible control over each text cleaning step. You can selectively enable or disable operations using the parameters below:

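def preprocess(
    text,
    lowercase=True,            # Convert text to lowercase
    remove_html=True,          # Strip HTML tags
    remove_emoji=True,         # Remove emojis
    expand_contraction=True,   # Expand contractions (e.g., "can't" → "cannot")
    correct_spelling=True,     # Correct spelling with autocorrect
    lemmatise=True,            # spaCy processing (lemmatisation, POS filtering)
): ...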
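
Because the flag names are easy to mistype (`expand_contraction` is singular, and `lemmatise` uses the British spelling), it can help to build the keyword arguments in one place. The helper below is a hypothetical sketch, not part of the package; the names and defaults are copied from the signature above.

```python
# Flags and defaults copied from the documented preprocess() signature.
DEFAULT_FLAGS = {
    "lowercase": True,
    "remove_html": True,
    "remove_emoji": True,
    "expand_contraction": True,
    "correct_spelling": True,
    "lemmatise": True,
}


def build_options(**overrides: bool) -> dict:
    """Return the documented defaults with overrides applied; reject unknown flags."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise TypeError(f"Unknown preprocess flag(s): {sorted(unknown)}")
    return {**DEFAULT_FLAGS, **overrides}


# A faster preset that skips the two slowest-looking steps (an assumption about cost).
fast = build_options(correct_spelling=False, lemmatise=False)
print(fast)
```

The resulting dictionary can then be unpacked into the call, as in `preprocess(text, **fast)`.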

📦 Dependencies

•	spacy
•	autocorrect
•	contractions

You can install them manually or via the included requirements.txt:

pip install -r requirements.txt

And download the required spaCy model:

python -m spacy download en_core_web_sm
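
spaCy models downloaded this way are installed as regular Python packages, so their presence can be checked before running the lemmatisation step. This is a minimal sketch using only the standard library; `is_model_installed` is a hypothetical helper name, not something the package provides.

```python
import importlib.util


def is_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if a package with this name can be found by the import system."""
    return importlib.util.find_spec(model_name) is not None


if not is_model_installed():
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```

The check only confirms that the package is importable; it does not verify that the model version matches the installed spaCy version.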

📄 License

MIT License © Dr. Partha Majumdar

