🧹 textcleaner-partha
A lightweight and reusable text preprocessing package for NLP tasks. It cleans text by removing HTML tags and emojis, expanding contractions and abbreviations, correcting spelling, and lemmatizing with spaCy.
✨ Features
• ✅ HTML tag and emoji removal
• ✅ Contraction expansion (e.g., “can’t” → “cannot”)
• ✅ Abbreviation expansion (e.g., “asap” → “as soon as possible”)
• ✅ Spelling correction with autocorrect
• ✅ Lemmatization using spaCy (en_core_web_sm)
• ✅ Filters out stopwords, punctuation, and numbers
• ✅ Retains only nouns, verbs, adjectives, and adverbs
• ✅ Returns the tokens of a text via get_tokens()
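To make the first few features concrete, here is a minimal standard-library sketch of HTML tag removal, emoji removal, and abbreviation expansion. It is not the package's actual implementation: the function names, the emoji ranges, and the one-entry abbreviation table are all assumptions chosen for illustration.

```python
import html
import re

# Illustrative patterns only; the package's own rules may differ.
TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)
WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical lookup table; a real one would hold many more entries.
ABBREVIATIONS = {"asap": "as soon as possible"}


def strip_html_and_emoji(text):
    """Drop HTML tags and emoji, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)  # turn entities such as &amp; back into characters
    text = EMOJI_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace whole words found in `table` (case-insensitive lookup)."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)

    return re.sub(r"\b\w+\b", replace, text)


if __name__ == "__main__":
    raw = "Reply asap! 😞 <p>Click here</p>"
    print(expand_abbreviations(strip_html_and_emoji(raw)))
```

A dictionary lookup keeps the expansion step predictable, at the cost of only handling abbreviations that are listed explicitly.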
🚀 Installation
From PyPI:
pip install textcleaner-partha
Install directly from GitHub:
pip install git+https://github.com/partha6369/textcleaner.git
🧠 Usage
from textcleaner_partha import preprocess
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
# Default usage (all features enabled)
cleaned = preprocess(text)
print(cleaned)
# Custom usage with optional features disabled
cleaned_partial = preprocess(
    text,
    lemmatise=False,           # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,    # Skip spelling correction
    expand_contraction=False,  # Skip contraction expansion
)
print(cleaned_partial)
from textcleaner_partha import get_tokens
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
# Default usage (all features enabled)
tokens = get_tokens(text)
print(tokens)
# Custom usage with optional features disabled
tokens_partial = get_tokens(
    text,
    lemmatise=False,           # Skip spaCy processing (lemmatisation, POS filtering)
    correct_spelling=False,    # Skip spelling correction
    expand_contraction=False,  # Skip contraction expansion
)
print(tokens_partial)
🔧 Parameters
The preprocess() and get_tokens() functions offer flexible control over each text cleaning step. You can selectively enable or disable operations using the parameters below:
def preprocess(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
): ...

def get_tokens(
    text,
    lowercase=True,
    remove_html=True,
    remove_emoji=True,
    remove_whitespace=True,
    remove_punct=False,
    expand_contraction=True,
    expand_abbrev=True,
    correct_spelling=True,
    lemmatise=True,
): ...
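Because both functions take the same keyword flags, one convenient pattern is to fix a set of flags once and reuse the resulting callable. The sketch below shows this with functools.partial. The helpers make_cleaner and clean_many, the FAST_OPTIONS profile, and the demo_clean stand-in are all hypothetical names introduced here so the example runs without the package installed; only the flag names come from the signatures above.

```python
from functools import partial

# Flag names copied from the signatures above; the values are one possible
# "fast" profile that skips the slower steps.
FAST_OPTIONS = {"correct_spelling": False, "lemmatise": False}


def make_cleaner(clean_fn, **options):
    """Return clean_fn with the given keyword flags pre-applied."""
    return partial(clean_fn, **options)


def clean_many(cleaner, texts):
    """Apply a one-argument cleaner to each text, preserving order."""
    return [cleaner(text) for text in texts]


def demo_clean(text, lowercase=True, correct_spelling=True, lemmatise=True):
    """Hypothetical stand-in: it only normalises whitespace and case."""
    text = " ".join(text.split())
    return text.lower() if lowercase else text


if __name__ == "__main__":
    fast = make_cleaner(demo_clean, **FAST_OPTIONS)
    print(clean_many(fast, ["Hello   World", "  Second  TEXT "]))
```

With the package installed, passing preprocess or get_tokens in place of demo_clean should work the same way, assuming they accept the keyword arguments listed above.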
📦 Dependencies
• spacy
• autocorrect
• contractions
You can install them manually or via the included requirements.txt:
pip install -r requirements.txt
And download the required spaCy model:
python -m spacy download en_core_web_sm
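Before running the cleaner, it can help to confirm that the dependencies and the spaCy model are importable. The following is a minimal standard-library sketch. It assumes each dependency is importable under the name listed above and that the model is installed as the Python package en_core_web_sm, which is how the spaCy download command normally installs it. The helper missing_requirements is hypothetical and not part of textcleaner-partha.

```python
import importlib.util

# Import names assumed to match the dependency names listed above.
REQUIRED = ("spacy", "autocorrect", "contractions", "en_core_web_sm")


def missing_requirements(modules=REQUIRED):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_requirements()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

The check only looks the modules up without importing them, so it stays fast even though loading spaCy itself can take a moment.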
📄 License
MIT License © Dr. Partha Majumdar
File details
Details for the file textcleaner_partha-0.3.1.tar.gz.
File metadata
- Download URL: textcleaner_partha-0.3.1.tar.gz
- Size: 8.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e84d91d3e0be67241af273f0b05c1c7502ec82fa747b8e6c4bd5b78a8fbe8e93 |
| MD5 | 447829bcbbb2fb6583e4cd923f636578 |
| BLAKE2b-256 | 97f1d81117d247e7a7afc2830ebc2b247b25c59ba3d7663ab9b26e74104817ef |
File details
Details for the file textcleaner_partha-0.3.1-py3-none-any.whl.
File metadata
- Download URL: textcleaner_partha-0.3.1-py3-none-any.whl
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e4ddd0aa13a9942b3f9c53d572b380c7c412061ac841f25b06df3e71fbe49b0 |
| MD5 | 78fbfc17879cd8a7aa5c02221c089234 |
| BLAKE2b-256 | 659b841c9112e1c2cfdddcd08febd6dd3b9d9fc6045ce575eb4abc24e307b047 |