🧹 textcleaner

A lightweight and reusable text preprocessing package for NLP tasks. It cleans text by removing HTML tags and emojis, expanding contractions, correcting spelling, and performing lemmatization using spaCy.

✨ Features

•	✅ HTML tag and emoji removal
•	✅ Contraction expansion (e.g., “can’t” → “cannot”)
•	✅ Spelling correction with autocorrect
•	✅ Lemmatization using spaCy (en_core_web_sm)
•	✅ Filters out stopwords, punctuation, and numbers
•	✅ Retains only nouns, verbs, adjectives, and adverbs
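
The first few steps in the list above can be illustrated with a standard-library-only sketch. This is not the package's implementation: the function names, the tiny `CONTRACTIONS` table, and the emoji ranges are assumptions made for illustration, and the package itself relies on the `contractions`, `autocorrect`, and spaCy libraries for the real work.

```python
import html
import re

# Hypothetical, deliberately tiny contraction table (assumption for this sketch);
# the `contractions` library used by the package covers far more cases.
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "won't": "will not"}

TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji ranges (assumption: not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)
WORD_WITH_APOSTROPHE_RE = re.compile(r"[A-Za-z]+'[A-Za-z]+")


def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def strip_emojis(text):
    """Replace characters in the emoji ranges above with spaces."""
    return EMOJI_RE.sub(" ", text)


def expand_contractions(text):
    """Expand contractions found in the small lookup table; leave others as-is."""
    def replace(match):
        word = match.group(0)
        return CONTRACTIONS.get(word.lower(), word)

    # Normalize typographic apostrophes so "can’t" matches "can't".
    return WORD_WITH_APOSTROPHE_RE.sub(replace, text.replace("\u2019", "'"))


def basic_clean(text):
    """Run the three steps in order, then collapse runs of whitespace."""
    text = strip_html(text)
    text = strip_emojis(text)
    text = expand_contractions(text)
    return re.sub(r"\s+", " ", text).strip()
```

Spelling correction, lemmatization, and the stopword and part-of-speech filters are left out here because they depend on `autocorrect` and the spaCy model.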

🚀 Installation

Install directly from GitHub:

pip install git+https://github.com/partha6369/textcleaner.git

🧠 Usage

from textcleaner import preprocess

# Sample input containing contractions, an emoji, and HTML markup
text = "I can't believe it's already raining! 😞 <p>Click here</p>"
cleaned = preprocess(text)
print(cleaned)
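
To clean many documents at once, a small wrapper along these lines may be convenient. `clean_corpus` is a hypothetical helper, not part of the package; it accepts any callable that maps a string to a string and assumes that `preprocess` behaves that way.

```python
def clean_corpus(texts, cleaner):
    """Apply `cleaner` to each text and keep only non-empty results.

    `cleaner` is any callable taking a string and returning a string.
    Non-string and blank inputs are skipped rather than passed through.
    """
    results = []
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            continue
        cleaned = cleaner(text)
        if cleaned:
            results.append(cleaned)
    return results
```

With the package installed, this would be called as `clean_corpus(documents, preprocess)` after `from textcleaner import preprocess`.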

📦 Dependencies

•	spacy
•	autocorrect
•	contractions

You can install them manually or via the included requirements.txt:

pip install -r requirements.txt

And download the required spaCy model:

python -m spacy download en_core_web_sm
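
Since spaCy models are installed as ordinary Python packages, one way to check for the model before running the pipeline is to look it up with the standard library. The helper name `model_installed` is an assumption for this sketch and is not provided by the package.

```python
import importlib.util


def model_installed(name="en_core_web_sm"):
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if not model_installed():
        print("Model missing. Run: python -m spacy download en_core_web_sm")
```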

📄 License

MIT License © Partha Majumdar
