
A package that automates text preprocessing

Project description

Text Preprocessing Toolkit

This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.

Features

  • Lowercasing: Convert all text to lowercase.
  • Punctuation Removal: Remove punctuation marks from text.
  • Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
  • Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
  • Spell Correction: Correct misspelled words in the text.
  • URL and HTML Tag Removal: Clean URLs and HTML tags from text.
  • Special Character Removal: Remove non-alphanumeric characters.
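
For illustration, the following sketch shows roughly what the lowercasing, punctuation-removal, and stopword-removal steps do, using only the Python standard library. It is not the toolkit's actual implementation; the STOPWORDS set and the basic_clean helper are made up for this example.

import string

# Tiny illustrative stopword list; the real toolkit uses a much fuller set.
STOPWORDS = {"the", "and", "a", "an", "is", "of", "to"}

def basic_clean(text: str) -> str:
    """Rough sketch of lowercasing, punctuation removal, and stopword removal."""
    text = text.lower()                                                 # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))    # punctuation removal
    tokens = [word for word in text.split() if word not in STOPWORDS]   # stopword removal
    return " ".join(tokens)

print(basic_clean("The quick brown fox AND the lazy dog!"))
# quick brown fox lazy dog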

Requirements

  • Python 3.8 or higher
  • flake8 for linting
  • pytest for testing
  • Any dependencies defined in requirements.txt

Installation

To install the package, clone the repository and install the necessary dependencies.

Clone the repository:

git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit

Install dependencies:

pip install -r requirements.txt

Alternatively, to install the package itself into your current environment:

pip install .
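
To verify the installation, try importing the package (this assumes the top-level module name text_preprocessing_toolkit, which is how it is imported in the Usage section below):

python -c "from text_preprocessing_toolkit import processor"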

Usage

You can use this toolkit in your Python project by importing the preprocessing functions:

from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings"
])

print(cleaned_text)

Available Preprocessing Steps:

  • lowercase: Convert text to lowercase.
  • remove_punctuation: Remove punctuation characters.
  • remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
  • lemmatize_text: Lemmatize words (reduce to base form).
  • remove_special_characters: Remove special characters from text.
  • remove_url: Remove URLs from text.
  • remove_html_tags: Remove HTML tags.
  • correct_spellings: Correct common spelling mistakes.
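
The steps argument lets you apply only the transformations you need, in the order listed. As a rough example (using only the step names documented above; the exact output depends on the toolkit's behavior), a lighter pipeline that skips spell correction might look like this:

from text_preprocessing_toolkit import processor

raw = "Check out https://example.com <b>NOW</b>!!!"

# Apply only a subset of steps, in the order given.
cleaned = processor.preprocess(raw, steps=[
    "remove_url",
    "remove_html_tags",
    "lowercase",
    "remove_punctuation",
])

print(cleaned)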

Running Tests

This repository includes unit and integration tests using pytest. To run the tests:

  1. Install pytest if you haven't already:

pip install pytest

  2. Run the tests:

pytest

Tests are located in the tests/ directory.
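
If you add your own tests, a minimal example could look like the following (a hypothetical tests/test_processor.py; it assumes preprocess returns a string, as it is used in the Usage section):

# tests/test_processor.py (hypothetical example)
from text_preprocessing_toolkit import processor

def test_lowercase_step():
    result = processor.preprocess("HELLO World", steps=["lowercase"])
    assert isinstance(result, str)
    assert result == result.lower()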

Code Linting

This project uses flake8 for linting. To check the code for style issues:

flake8 text_preprocessing_toolkit
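
flake8 also accepts options on the command line; for example, to allow lines up to 100 characters (an illustrative choice, not necessarily the project's configured limit):

flake8 --max-line-length 100 text_preprocessing_toolkit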

CI/CD

This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a commit is pushed to the main branch or a pull request targeting it is opened, the following steps are performed automatically (the same commands can be run locally, as shown after this list):

  • Linting: Code will be checked for style issues using flake8.
  • Testing: Unit tests will be run using pytest.
  • Build: The package will be built using python -m build.
  • Publish: The package will be uploaded to PyPI (if a release is created).
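
You can reproduce the same checks locally before pushing. The publish step is shown with twine purely as an illustration; the actual workflow may use a different publishing mechanism:

pip install flake8 pytest build twine
flake8 text_preprocessing_toolkit
pytest
python -m build
twine upload dist/*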

Contributing

We welcome contributions! If you'd like to contribute to the project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-name).
  3. Make your changes and commit them (git commit -m 'Add feature').
  4. Push to your forked repository (git push origin feature-name).
  5. Create a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Notes:

  • Replace the repository URL in the git clone command with your actual GitHub repository URL.
  • Update any project-specific features or configurations that might be necessary.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

text_preprocessing_toolkit-0.0.1.tar.gz (6.2 kB)

Built Distribution

Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl

File details

Details for the file text_preprocessing_toolkit-0.0.1.tar.gz.

File metadata

File hashes

Hashes for text_preprocessing_toolkit-0.0.1.tar.gz:

  • SHA256: 73692ce178b69c2324b29bef627df63b06c6db398fac0b02ed9fd8ea1a0af606
  • MD5: 03f1e453fd8290330ebd44cb63444c61
  • BLAKE2b-256: 52f80959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4


File details

Details for the file Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl:

  • SHA256: 316fd8a70bc092b33a839c7099715ef0aaf53d54aadbfb3aa9cc8fea351a7cef
  • MD5: cdd035c66f2fef0061e8d96b0a81d453
  • BLAKE2b-256: 7ba637c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9

