
A package that automates text preprocessing

Project description

Text Preprocessing Toolkit

This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.

Features

  • Lowercasing: Convert all text to lowercase.
  • Punctuation Removal: Remove punctuation marks from text.
  • Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
  • Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
  • Spell Correction: Correct misspelled words in the text.
  • URL and HTML Tag Removal: Clean URLs and HTML tags from text.
  • Special Character Removal: Remove non-alphanumeric characters.
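
For illustration, the following sketch shows roughly what the lowercasing, punctuation-removal, and stopword-removal steps do, using only the Python standard library. It is not the toolkit's actual implementation; the STOPWORDS set and the basic_clean helper are made up for this example.

import string

# Tiny illustrative stopword list; the real toolkit uses a much fuller set.
STOPWORDS = {"the", "and", "a", "an", "is", "of", "to"}

def basic_clean(text: str) -> str:
    """Rough sketch of lowercasing, punctuation removal, and stopword removal."""
    text = text.lower()                                                 # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))    # punctuation removal
    tokens = [word for word in text.split() if word not in STOPWORDS]   # stopword removal
    return " ".join(tokens)

print(basic_clean("The quick brown fox AND the lazy dog!"))
# quick brown fox lazy dog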

Requirements

  • Python 3.8 or higher
  • flake8 for linting
  • pytest for testing
  • Any dependencies defined in requirements.txt

Installation

To install the package, clone the repository and install the necessary dependencies.

Clone the repository:

git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit

Install dependencies:

pip install -r requirements.txt

Alternatively, to install the package itself into your current environment:

pip install .
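
To verify the installation, try importing the package (this assumes the top-level module name text_preprocessing_toolkit, which is how it is imported in the Usage section below):

python -c "from text_preprocessing_toolkit import processor"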

Usage

You can use this toolkit in your Python project by importing the preprocessing functions:

from text_preprocessing_toolkit import processor

text = "Your sample text goes here!"

# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings"
])

print(cleaned_text)

Available Preprocessing Steps:

  • lowercase: Convert text to lowercase.
  • remove_punctuation: Remove punctuation characters.
  • remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
  • lemmatize_text: Lemmatize words (reduce to base form).
  • remove_special_characters: Remove special characters from text.
  • remove_url: Remove URLs from text.
  • remove_html_tags: Remove HTML tags.
  • correct_spellings: Correct common spelling mistakes.
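
The steps argument lets you apply only the transformations you need, in the order listed. As a rough example (using only the step names documented above; the exact output depends on the toolkit's behavior), a lighter pipeline that skips spell correction might look like this:

from text_preprocessing_toolkit import processor

raw = "Check out https://example.com <b>NOW</b>!!!"

# Apply only a subset of steps, in the order given.
cleaned = processor.preprocess(raw, steps=[
    "remove_url",
    "remove_html_tags",
    "lowercase",
    "remove_punctuation",
])

print(cleaned)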

Running Tests

This repository includes unit and integration tests using pytest. To run the tests:

  1. Install pytest if you haven't already:

pip install pytest

  2. Run the tests:

pytest

Tests are located in the tests/ directory.
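
If you add your own tests, a minimal example could look like the following (a hypothetical tests/test_processor.py; it assumes preprocess returns a string, as it is used in the Usage section):

# tests/test_processor.py (hypothetical example)
from text_preprocessing_toolkit import processor

def test_lowercase_step():
    result = processor.preprocess("HELLO World", steps=["lowercase"])
    assert isinstance(result, str)
    assert result == result.lower()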

Code Linting

This project uses flake8 for linting. To check the code for style issues:

flake8 text_preprocessing_toolkit
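
flake8 also accepts options on the command line; for example, to allow lines up to 100 characters (an illustrative choice, not necessarily the project's configured limit):

flake8 --max-line-length 100 text_preprocessing_toolkit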

CI/CD

This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a commit is pushed to the main branch or a pull request targeting it is opened, the following steps are performed automatically (the same commands can be run locally, as shown after this list):

  • Linting: Code will be checked for style issues using flake8.
  • Testing: Unit tests will be run using pytest.
  • Build: The package will be built using python -m build.
  • Publish: The package will be uploaded to PyPI (if a release is created).
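
You can reproduce the same checks locally before pushing. The publish step is shown with twine purely as an illustration; the actual workflow may use a different publishing mechanism:

pip install flake8 pytest build twine
flake8 text_preprocessing_toolkit
pytest
python -m build
twine upload dist/*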

Contributing

We welcome contributions! If you'd like to contribute to the project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-name).
  3. Make your changes and commit them (git commit -m 'Add feature').
  4. Push to your forked repository (git push origin feature-name).
  5. Create a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Notes:

  • Replace the repository URL in the git clone command with your actual GitHub repository URL.
  • Update any project-specific features or configurations that might be necessary.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

text_preprocessing_toolkit-0.0.1.tar.gz (6.2 kB)

Built Distribution

Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl

File details

Details for the file text_preprocessing_toolkit-0.0.1.tar.gz.

File metadata

File hashes

Hashes for text_preprocessing_toolkit-0.0.1.tar.gz:

  • SHA256: 73692ce178b69c2324b29bef627df63b06c6db398fac0b02ed9fd8ea1a0af606
  • MD5: 03f1e453fd8290330ebd44cb63444c61
  • BLAKE2b-256: 52f80959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4


File details

Details for the file Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl:

  • SHA256: 316fd8a70bc092b33a839c7099715ef0aaf53d54aadbfb3aa9cc8fea351a7cef
  • MD5: cdd035c66f2fef0061e8d96b0a81d453
  • BLAKE2b-256: 7ba637c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9

