An advanced text preprocessor
Text Preprocessor
📝 Overview
Text Preprocessor is a Python library for cleaning and filtering raw text. It is aimed at natural language processing (NLP) work such as sentiment analysis and text classification.
✨ Features
- 🧹 Remove HTML tags
- 🛑 Remove stopwords
- 🔤 Lemmatization
- 🧼 Remove special characters
- 🔍 Remove duplicates
- 📏 Remove short texts
- 📊 Basic outlier detection
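The feature list does not say how outliers are detected. As an illustration only, one common approach is to flag texts whose word count falls outside the interquartile-range (IQR) fences. The function name `length_outliers` and the factor `k=1.5` are assumptions made for this sketch, not part of the library.

```python
from statistics import quantiles


def length_outliers(texts, k=1.5):
    """Return texts whose word count lies outside the IQR fences.

    Hypothetical helper: the library's own outlier rule is not documented here.
    """
    lengths = [len(t.split()) for t in texts]
    # quantiles(..., n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = quantiles(lengths, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [t for t, n in zip(texts, lengths) if n < low or n > high]
```

A text of 50 words among texts of 3 to 6 words would be flagged, while the short ones would pass. Note that `quantiles` needs at least two texts.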
🚀 Installation
Using pip
pip install sentence-preprocessor
Using Poetry
poetry add sentence-preprocessor
💻 Usage
As a Python Module
from text_preprocessor.preprocessor import TextPreprocessor
# Preprocess a CSV file
preprocessor = TextPreprocessor('input.csv', 'output.csv')
preprocessed_df = preprocessor.preprocess()
preprocessor.save_preprocessed_data()
Command Line Interface
# Basic usage
text_preprocessor input.csv [output.csv]
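The `input.csv [output.csv]` form indicates a required input path and an optional output path. The following is a minimal sketch of how such an interface can be parsed with `argparse`. It is not the package's actual entry point: `build_parser`, `resolve_paths`, and the fallback output name `<input>_preprocessed.csv` are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one required and one optional positional argument."""
    parser = argparse.ArgumentParser(
        prog="text_preprocessor",
        description="Preprocess the text in a CSV file.",
    )
    parser.add_argument("input", type=Path, help="path to the input CSV file")
    # nargs="?" makes the second positional argument optional.
    parser.add_argument(
        "output", type=Path, nargs="?", default=None,
        help="path to the output CSV file (optional)",
    )
    return parser


def resolve_paths(argv):
    """Return (input, output); derive an output name if none was given."""
    args = build_parser().parse_args(argv)
    # Assumed fallback: write next to the input with a "_preprocessed" suffix.
    output = args.output or args.input.with_name(
        f"{args.input.stem}_preprocessed.csv"
    )
    return args.input, output
```

Calling `resolve_paths(["input.csv"])` yields the input path plus the derived output path, while passing two arguments uses the second one unchanged.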
🛠 Development Setup
Prerequisites
- Python 3.12+
- Poetry
Installation Steps
# Install Poetry (if not already installed)
pip install poetry
# Install dependencies
make install
# Activate virtual environment
make shell
🧪 Running Tests
# Run tests
make test
# Run tests with coverage
make test-cov
📦 Build and Publish
# Build distribution
make build
# Publish to PyPI
make publish
🔍 Preprocessing Pipeline
The preprocessor applies the following transformations:
- Remove HTML tags
- Convert to lowercase
- Remove special characters
- Remove stopwords
- Lemmatize text
- Remove single characters
- Remove texts with fewer than 3 words
- Remove duplicates
- Basic outlier detection
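The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `clean_text`, `preprocess`, and the tiny `STOPWORDS` set are hypothetical, only ASCII letters are kept, and lemmatization and outlier detection are omitted because they normally rely on external resources such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "and", "or", "of", "to", "in", "this"}


def clean_text(text: str) -> str:
    """Apply the per-text steps: HTML, case, special characters, stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.lower()                    # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # keep ASCII letters and whitespace only
    # Drop stopwords and single-character tokens in one pass.
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 1]
    return " ".join(tokens)


def preprocess(texts, min_words=3):
    """Clean each text, then drop short texts and exact duplicates."""
    seen = set()
    result = []
    for raw in texts:
        cleaned = clean_text(raw)
        if len(cleaned.split()) < min_words:
            continue  # fewer than min_words words remain
        if cleaned in seen:
            continue  # duplicate of an earlier cleaned text
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

Duplicates are detected after cleaning, so two inputs that differ only in markup, case, or punctuation collapse into a single entry.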
📝 Configuration
You can customize preprocessing by modifying the preprocess() method parameters or extending the TextPreprocessingUtils class.
🤝 Contributing
Contributions are welcome! If you'd like to contribute to this project, please open an issue or submit a pull request.
📜 License
Distributed under the MIT License. See LICENSE for more information.
🛡 Disclaimer
This library is provided as-is. Review and test it thoroughly before using it in production environments.
📞 Contact
Your Name - your.email@example.com
Project Link:
🙌 Acknowledgements
Star ⭐ the repository if this project helps you!
Source Distribution
Built Distribution
File details
Details for the file text_preprocessor_v1-0.1.0.tar.gz.
File metadata
- Download URL: text_preprocessor_v1-0.1.0.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.12.4 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8429c67c3b07cb02714512fa37fc9fcc687ea1f27cbebd52d27ba8bb3f3d9f37 |
| MD5 | 6221da267b2d90a42f50abeb31dc4b3e |
| BLAKE2b-256 | 6bff9f93ca438ceaad1b8b152181623c3ae33a6f3499ac31d18e10319e7cb793 |
File details
Details for the file text_preprocessor_v1-0.1.0-py3-none-any.whl.
File metadata
- Download URL: text_preprocessor_v1-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.12.4 Darwin/23.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aaf21cae3efab6d47fc46cc064f86684e0cec6da4d83d0fa35d563d2db5f602a |
| MD5 | cfebf20fe4f6fefa316c0e809d08f9e7 |
| BLAKE2b-256 | fe34279ae0fa80219539faf180d08e76fbc2bf8e0aa37ebeac1e1876ecd40b1c |