A package that automates text preprocessing
Project description
Text Preprocessing Toolkit
This repository contains a Python package for text preprocessing tasks. The toolkit includes functions for various preprocessing steps such as tokenization, lemmatization, stopword removal, text normalization, and more. It aims to provide a convenient and customizable solution for preparing text data for downstream tasks like natural language processing (NLP) and machine learning.
Features
- Lowercasing: Convert all text to lowercase.
- Punctuation Removal: Remove punctuation marks from text.
- Stopword Removal: Remove common words (e.g., "and", "the") that do not contribute much meaning.
- Lemmatization: Reduce words to their base or root form (e.g., "running" -> "run").
- Spell Correction: Correct misspelled words in the text.
- URL and HTML Tag Removal: Clean URLs and HTML tags from text.
- Special Character Removal: Remove non-alphanumeric characters.
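To make a few of these features concrete, here is a minimal standard-library sketch of lowercasing, punctuation removal, URL removal, and HTML tag removal. The function names and regular expressions below are assumptions chosen for illustration. They are not the toolkit's own implementation, which may handle edge cases differently.

```python
import re
import string

# Hypothetical stand-ins written for illustration only; these are NOT the
# toolkit's own functions and may behave differently from them.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")


def remove_urls(text: str) -> str:
    """Delete http(s):// and www. style URLs up to the next whitespace."""
    return URL_PATTERN.sub("", text)


def remove_html_tags(text: str) -> str:
    """Delete anything shaped like an HTML tag, keeping the text between tags."""
    return HTML_TAG_PATTERN.sub("", text)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace left behind by the removals above."""
    return " ".join(text.split())


sample = "<p>Visit https://example.com NOW, it's great!</p>"
cleaned = collapse_whitespace(
    remove_punctuation(remove_html_tags(remove_urls(sample)).lower())
)
print(cleaned)  # visit now its great
```

A simple pattern like this URL regex consumes everything up to the next whitespace, so a URL written directly against a closing tag would take the tag with it. Real-world text usually needs more careful patterns.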
Requirements
- Python 3.8 or higher
- flake8 for linting
- pytest for testing
- Any dependencies defined in requirements.txt
Installation
To install the package, clone the repository and install the necessary dependencies.
Clone the repository:
git clone https://github.com/your-username/text-preprocessing-toolkit.git
cd text-preprocessing-toolkit
Install dependencies:
pip install -r requirements.txt
Alternatively, to install the package itself into your current Python environment, run this from the repository root:
pip install .
Usage
You can use this toolkit in your Python project by importing the preprocessing functions:
from text_preprocessing_toolkit import processor
text = "Your sample text goes here!"
# Preprocess text
cleaned_text = processor.preprocess(text, steps=[
    "lowercase",
    "remove_punctuation",
    "remove_stopwords",
    "lemmatize_text",
    "remove_special_characters",
    "remove_url",
    "remove_html_tags",
    "correct_spellings",
])
print(cleaned_text)
Available Preprocessing Steps:
- lowercase: Convert text to lowercase.
- remove_punctuation: Remove punctuation characters.
- remove_stopwords: Remove stopwords (common words like 'the', 'and', etc.).
- lemmatize_text: Lemmatize words (reduce to base form).
- remove_special_characters: Remove special characters from text.
- remove_url: Remove URLs from text.
- remove_html_tags: Remove HTML tags.
- correct_spellings: Correct common spelling mistakes.
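If the steps run in the order they are listed (an assumption; this description does not say how the toolkit orders them), then that order can change the result. The sketch below uses a hypothetical registry of plain-Python stand-ins, not the toolkit's functions, to show that stripping punctuation before removing URLs leaves the URL text behind.

```python
import re
import string
from typing import Callable, Dict, List

# Hypothetical registry mapping step names to simple stand-in functions.
# The names mirror the list above, but the implementations are assumptions.
STEPS: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "remove_url": lambda t: re.sub(r"https?://\S+", "", t),
}


def run_steps(text: str, steps: List[str]) -> str:
    """Apply the named steps one after another, in the order given."""
    for name in steps:
        if name not in STEPS:
            raise ValueError(f"Unknown step: {name}")
        text = STEPS[name](text)
    # Tidy up whitespace left behind by removals.
    return " ".join(text.split())


text = "Docs at https://example.com/guide today"

url_first = run_steps(text, ["remove_url", "remove_punctuation"])
punct_first = run_steps(text, ["remove_punctuation", "remove_url"])

print(url_first)    # Docs at today
print(punct_first)  # Docs at httpsexamplecomguide today
```

In this sketch the second ordering fails because the URL pattern relies on "://", which punctuation removal has already deleted. Listing remove_url and remove_html_tags ahead of the punctuation and special-character steps avoids that in a pipeline that applies steps sequentially.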
Running Tests
This repository includes unit and integration tests using pytest. To run the tests:
- Install pytest if you haven't already:
pip install pytest
- Run the tests:
pytest
Tests are located in the tests/ directory.
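For orientation, a pytest test is a plain function whose name starts with test_ and which makes bare assert statements. The sketch below is a hypothetical example, not a file from this repository. It tests a local stand-in function so that it runs on its own; in a real test you would import the function under test from the package instead.

```python
# Hypothetical test module, e.g. tests/test_lowercase_sketch.py.
# The lowercase() function here is a stand-in defined only for this
# sketch; replace it with an import from the package under test.


def lowercase(text: str) -> str:
    """Stand-in for the step under test."""
    return text.lower()


def test_lowercase_converts_mixed_case():
    assert lowercase("Hello WORLD") == "hello world"


def test_lowercase_is_idempotent():
    once = lowercase("MiXeD Case")
    assert lowercase(once) == once


def test_lowercase_keeps_non_letters():
    assert lowercase("ABC 123 !?") == "abc 123 !?"
```

Running pytest from the repository root collects files named test_*.py and reports each test function as passed or failed.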
Code Linting
This project uses flake8 for linting. To check the code for style issues:
flake8 text_preprocessing_toolkit
CI/CD
This repository is integrated with GitHub Actions for continuous integration and continuous deployment (CI/CD). Every time a commit is pushed to the main branch, or a pull request is opened against it, the following steps run automatically:
- Linting: Code is checked for style issues using flake8.
- Testing: Unit tests are run using pytest.
- Build: The package is built using python -m build.
- Publish: The package is uploaded to PyPI (if a release is created).
Contributing
We welcome contributions! If you'd like to contribute to the project, please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature-name).
- Make your changes and commit them (git commit -m 'Add feature').
- Push to your forked repository (git push origin feature-name).
- Create a pull request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Notes:
- Replace the repository URL in the git clone command with your actual GitHub repository URL.
- Update any project-specific features or configurations that might be necessary.
Source Distribution
Built Distribution
File details
Details for the file text_preprocessing_toolkit-0.0.1.tar.gz.
File metadata
- Download URL: text_preprocessing_toolkit-0.0.1.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 73692ce178b69c2324b29bef627df63b06c6db398fac0b02ed9fd8ea1a0af606
MD5 | 03f1e453fd8290330ebd44cb63444c61
BLAKE2b-256 | 52f80959e1e6b46742f0b598851c48d50e0977ffc7ae0fd09fc20c38a3811ff4
File details
Details for the file Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl.
File metadata
- Download URL: Text_Preprocessing_Toolkit-0.0.1-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 316fd8a70bc092b33a839c7099715ef0aaf53d54aadbfb3aa9cc8fea351a7cef
MD5 | cdd035c66f2fef0061e8d96b0a81d453
BLAKE2b-256 | 7ba637c88f7eff8b61e528dd6be4e948c076f172f94612857ebc2e24576dabc9