Text Preprocessing Library
nlpprepkit
nlpprepkit is a Python library for text preprocessing, designed to simplify and accelerate the preparation of text data for natural language processing (NLP) tasks.
Features
- Text Cleaning: Remove extra whitespace, special characters, emojis, HTML tags, URLs, numbers, and social tags.
- Contraction Expansion: Expand common English contractions (e.g., "don't" → "do not").
- Unicode Normalization: Normalize text to ASCII representation.
- Pipeline Support: Create customizable pipelines for sequential text processing.
- Profiling: Measure the execution time of each step in the pipeline.
- Caching: Avoid redundant processing with built-in caching.
- Parallel Processing: Process large text datasets efficiently.
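To make a few of these features concrete, here is a minimal sketch using only the Python standard library. It is not nlpprepkit's API: the names `to_ascii` and `process_batch` are hypothetical, and the library's own caching and parallel-processing interfaces may look different. It illustrates Unicode-to-ASCII normalization, result caching, and processing a batch of texts concurrently.

```python
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1024)
def to_ascii(text: str) -> str:
    """Hypothetical helper: decompose accented characters, then drop non-ASCII."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards everything outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def process_batch(texts, workers=4):
    """Hypothetical helper: apply to_ascii to each text, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(to_ascii, texts))


print(process_batch(["Café", "naïve", "Café"]))  # ['Cafe', 'naive', 'Cafe']
```

Characters without an ASCII decomposition (emojis, for example) are dropped entirely by this approach. Repeated inputs such as the second "Café" can be served from the `lru_cache` instead of being recomputed.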
Installation
Install the library using pip:
pip install nlpprepkit
Or install from source:
git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .
Quick Start
Using the Pipeline
from nlpprepkit.pipeline import Pipeline
# Define a custom processing step
def lowercase(text):
    return text.lower()
# Create a pipeline and add the step
pipeline = Pipeline()
pipeline.add_step(lowercase)
# Process text
result = pipeline.process("This is a TEST.")
print(result) # Output: "this is a test."
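For readers curious how sequential steps and per-step timing can fit together, the sketch below uses a small stand-in class. `MiniPipeline`, its `timings` attribute, and `collapse_spaces` are hypothetical and written for illustration only. The sketch borrows the `add_step` and `process` method names from the example above but is not nlpprepkit's implementation.

```python
import time
from typing import Callable, Dict, List


class MiniPipeline:
    """Hypothetical stand-in; not nlpprepkit's Pipeline class."""

    def __init__(self) -> None:
        self.steps: List[Callable[[str], str]] = []
        self.timings: Dict[str, float] = {}

    def add_step(self, step: Callable[[str], str]) -> "MiniPipeline":
        self.steps.append(step)
        return self  # returning self allows chained add_step calls

    def process(self, text: str) -> str:
        # Feed the output of each step into the next one, in insertion order.
        for step in self.steps:
            start = time.perf_counter()
            text = step(text)
            elapsed = time.perf_counter() - start
            # Accumulate elapsed seconds under the step's function name.
            name = step.__name__
            self.timings[name] = self.timings.get(name, 0.0) + elapsed
        return text


def lowercase(text: str) -> str:
    return text.lower()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


pipeline = MiniPipeline().add_step(lowercase).add_step(collapse_spaces)
result = pipeline.process("This   is a   TEST.")
print(result)  # this is a test.
print(sorted(pipeline.timings))  # ['collapse_spaces', 'lowercase']
```

Steps run in the order they were added, so the order can change the result when one step depends on another's output.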
Text Cleaning Functions
from nlpprepkit.functions import remove_extra_whitespace, remove_special_characters
text = "This   is  a    test!!!"
cleaned_text = remove_extra_whitespace(text)
print(cleaned_text) # Output: "This is a test!!!"
cleaned_text = remove_special_characters(cleaned_text)
print(cleaned_text) # Output: "This is a test"
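As a rough illustration of what cleaning functions like these typically do, here is a regex-based sketch. The helper names (`strip_html`, `strip_urls`, `strip_special`, `squeeze_whitespace`) and the patterns are assumptions made for this example. The library's own functions may use different rules.

```python
import re

# Deliberately simple, illustrative patterns; real-world text often needs stricter ones.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def strip_html(text: str) -> str:
    return TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    return URL_RE.sub("", text)


def strip_special(text: str) -> str:
    return SPECIAL_RE.sub("", text)


def squeeze_whitespace(text: str) -> str:
    # Collapse runs of whitespace to one space and trim both ends.
    return SPACE_RE.sub(" ", text).strip()


raw = "<p>Visit   https://example.com   now!!!</p>"
cleaned = squeeze_whitespace(strip_special(strip_urls(strip_html(raw))))
print(cleaned)  # Visit now
```

The order matters in this sketch. URLs are removed before special characters, because otherwise the punctuation inside a URL would be stripped first and leave fragments such as "httpsexamplecom" behind.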
Expanding Contractions
from nlpprepkit.functions import expand_contractions
text = "I'm going to the store."
expanded_text = expand_contractions(text)
print(expanded_text) # Output: "I am going to the store."
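Contraction expansion is commonly implemented as a dictionary lookup driven by a regular expression. The sketch below shows that general technique with a four-entry mapping. The `CONTRACTIONS` table and the `expand` function are hypothetical and do not reflect how `expand_contractions` is implemented.

```python
import re

# Tiny illustrative mapping; a real table would hold many more entries.
CONTRACTIONS = {
    "i'm": "I am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand(text: str) -> str:
    def replace(match):
        word = match.group(0)
        expansion = CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. "Don't" becomes "Do not".
        if word[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return PATTERN.sub(replace, text)


print(expand("I'm sure they don't mind."))  # I am sure they do not mind.
print(expand("Don't worry, it's fine."))  # Do not worry, it is fine.
```

A plain lookup cannot resolve ambiguous contractions: "it's" may mean "it is" or "it has", and this sketch always picks the former.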
Running Tests
To run the tests, use pytest:
pytest
Contributing
Contributions are welcome! Feel free to submit a pull request or open an issue on GitHub.
License
This project is licensed under the MIT License.
Download files
File details
Details for the file nlpprepkit-1.2.2.tar.gz.
File metadata
- Download URL: nlpprepkit-1.2.2.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4aac2d3c157d25d3955f74a9161122567c6d492a5ad7ec802301de82adf9375e |
| MD5 | dc08a3f07719613b6e238ff441db96b8 |
| BLAKE2b-256 | 8528251f0ab39b5baf40eecc3e4e78150495dfcdb8bca2defcd2e666ad152984 |
File details
Details for the file nlpprepkit-1.2.2-py3-none-any.whl.
File metadata
- Download URL: nlpprepkit-1.2.2-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a1d781a63450acf1bb8fd04910c2b0c738421f18b6fb8d152ee8eef7e75e081a |
| MD5 | 3c1c35b605fb2c3cf0a16ed39fa7f8e7 |
| BLAKE2b-256 | 856323e669c90d0fda00289a58a4cbbe4c7c6fc9247818fbeab25e575327b147 |