An advanced text preprocessor

Project description

Text Preprocessor

📝 Overview

Text Preprocessor is a Python library that cleans raw text before it is used in natural language processing (NLP), sentiment analysis, or text classification projects. It reads a CSV file, applies a fixed sequence of cleaning steps, and writes the result to a new CSV file.

✨ Features

  • 🧹 Remove HTML tags
  • 🛑 Remove stopwords
  • 🔤 Lemmatization
  • 🧼 Remove special characters
  • 🔍 Remove duplicates
  • 📏 Remove short texts
  • 📊 Basic outlier detection

🚀 Installation

Using pip

pip install text-preprocessor-v1

Using Poetry

poetry add text-preprocessor-v1

💻 Usage

As a Python Module

from text_preprocessor.preprocessor import TextPreprocessor

# Preprocess a CSV file
preprocessor = TextPreprocessor('input.csv', 'output.csv')
preprocessed_df = preprocessor.preprocess()
preprocessor.save_preprocessed_data()
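Because the pipeline removes short texts and duplicates, the output file usually has fewer rows than the input. The helper below is a minimal sketch, using only the standard library, for checking how many rows were dropped after a run. The names `count_rows` and `dropped_rows` are illustrative and not part of the package, and the code assumes both CSV files begin with a header row.

```python
import csv


def count_rows(path):
    """Return the number of data rows in a CSV file, excluding the header.

    Assumption: the first row of the file is a header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def dropped_rows(input_path, output_path):
    """Return how many data rows are missing from the output compared to the input."""
    return count_rows(input_path) - count_rows(output_path)
```

Calling `dropped_rows('input.csv', 'output.csv')` after `save_preprocessed_data()` gives a quick sanity check that the filters behaved as expected on your data.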

Command Line Interface

# Basic usage
text_preprocessor input.csv [output.csv]

🛠 Development Setup

Prerequisites

  • Python 3.12+
  • Poetry

Installation Steps

# Install Poetry (if not already installed)
pip install poetry

# Install dependencies
make install

# Activate virtual environment
make shell

🧪 Running Tests

# Run tests
make test

# Run tests with coverage
make test-cov

📦 Build and Publish

# Build distribution
make build

# Publish to PyPI
make publish

🔍 Preprocessing Pipeline

The preprocessor applies the following transformations:

  1. Remove HTML tags
  2. Convert to lowercase
  3. Remove special characters
  4. Remove stopwords
  5. Lemmatize text
  6. Remove single characters
  7. Remove texts with fewer than 3 words
  8. Remove duplicates
  9. Basic outlier detection
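The steps above can be approximated in plain Python as follows. This is an illustrative sketch, not the library's implementation: the stopword list, the `naive_lemmatize` placeholder, and the z-score rule on word counts are all assumptions, since this README does not say which stopword list, lemmatizer, or outlier method the package uses, or whether outliers are removed or only flagged.

```python
import re
import statistics

# Assumption: a tiny illustrative stopword list, not the one the library uses.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "and", "or", "of", "to", "in", "it", "this"}


def naive_lemmatize(word):
    """Placeholder lemmatizer that only strips a trailing plural 's'.

    A real lemmatizer relies on a dictionary; this stand-in is an assumption.
    """
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_text(text):
    """Apply steps 1-6 to a single string."""
    text = re.sub(r"<[^>]+>", " ", text)                      # 1. remove HTML tags
    text = text.lower()                                        # 2. convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)                   # 3. remove special characters
    words = [w for w in text.split() if w not in STOPWORDS]    # 4. remove stopwords
    words = [naive_lemmatize(w) for w in words]                # 5. lemmatize (placeholder)
    words = [w for w in words if len(w) > 1]                   # 6. remove single characters
    return " ".join(words)


def clean_corpus(texts, min_words=3, z_threshold=3.0):
    """Apply steps 7-9 to a list of strings and return the surviving texts."""
    cleaned = [clean_text(t) for t in texts]
    # 7. drop texts with fewer than min_words words
    cleaned = [t for t in cleaned if len(t.split()) >= min_words]
    # 8. remove duplicates while preserving first-seen order
    cleaned = list(dict.fromkeys(cleaned))
    # 9. assumed outlier rule: drop texts whose word count lies more than
    #    z_threshold standard deviations from the mean word count
    lengths = [len(t.split()) for t in cleaned]
    if len(lengths) > 1:
        mean = statistics.mean(lengths)
        stdev = statistics.pstdev(lengths)
        if stdev > 0:
            cleaned = [t for t, n in zip(cleaned, lengths)
                       if abs(n - mean) / stdev <= z_threshold]
    return cleaned
```

Note that the order matters: duplicates are removed only after cleaning, so two reviews that differ only in markup or capitalization collapse into one entry.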

📝 Configuration

You can customize preprocessing by modifying the preprocess() method parameters or extending the TextPreprocessingUtils class.
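Extending a utility class typically means overriding one step and leaving the rest untouched. The sketch below shows that pattern with a local stand-in class, because this README does not list the methods of `TextPreprocessingUtils`. The stand-in class name and the method name `remove_special_characters` are hypothetical; check the package source for the real names before adapting this.

```python
import re


class TextPreprocessingUtilsStandIn:
    """Local stand-in for the library's utility class (hypothetical API)."""

    def remove_special_characters(self, text):
        # Hypothetical default step: keep only letters, digits, and whitespace.
        return re.sub(r"[^A-Za-z0-9\s]", "", text)


class HashtagFriendlyUtils(TextPreprocessingUtilsStandIn):
    """Subclass that overrides one step so '#' and '@' survive cleaning."""

    def remove_special_characters(self, text):
        # Same rule as the parent, but '#' and '@' are also allowed.
        return re.sub(r"[^A-Za-z0-9\s#@]", "", text)
```

A subclass like this is useful for social media text, where hashtags and mentions often carry signal that a default cleaning step would discard.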

🤝 Contributing

Contributions are welcome! If you'd like to contribute to this project, please open an issue or submit a pull request.

📜 License

Distributed under the MIT License. See LICENSE for more information.

🛡 Disclaimer

This library is provided as-is. Always review and test thoroughly before using in production environments.

📞 Contact

Your Name - your.email@example.com

Project Link:

🙌 Acknowledgements


Star ⭐ the repository if this project helps you!

Download files

Source Distribution

text_preprocessor_v1-0.1.0.tar.gz (13.1 kB view details)

Uploaded Source

Built Distribution

text_preprocessor_v1-0.1.0-py3-none-any.whl (6.9 kB view details)

Uploaded Python 3

File details

Details for the file text_preprocessor_v1-0.1.0.tar.gz.

File metadata

  • Download URL: text_preprocessor_v1-0.1.0.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.12.4 Darwin/23.6.0

File hashes

Hashes for text_preprocessor_v1-0.1.0.tar.gz
  • SHA256: 8429c67c3b07cb02714512fa37fc9fcc687ea1f27cbebd52d27ba8bb3f3d9f37
  • MD5: 6221da267b2d90a42f50abeb31dc4b3e
  • BLAKE2b-256: 6bff9f93ca438ceaad1b8b152181623c3ae33a6f3499ac31d18e10319e7cb793

File details

Details for the file text_preprocessor_v1-0.1.0-py3-none-any.whl.

File hashes

Hashes for text_preprocessor_v1-0.1.0-py3-none-any.whl
  • SHA256: aaf21cae3efab6d47fc46cc064f86684e0cec6da4d83d0fa35d563d2db5f602a
  • MD5: cfebf20fe4f6fefa316c0e809d08f9e7
  • BLAKE2b-256: fe34279ae0fa80219539faf180d08e76fbc2bf8e0aa37ebeac1e1876ecd40b1c
