Text Preprocessing Library
nlpprepkit
This can't be the best library for text preprocessing, but it's definitely a library!
Installation
```bash
pip install nlpprepkit
```
Or install from source:
```bash
git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .
```
Features
- Flexible cleaning options: Control which cleaning operations to apply
- Parallel processing: Process large text collections efficiently with multi-core support
- Caching: Avoid redundant processing with built-in caching system
- NLTK integration: Easy access to stemming, lemmatization, and stopword removal
- Configurable: Save and load configuration settings for reproducible workflows
Quick Start
```python
from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text)  # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't",
]
results = preprocessor.process_text(texts)
```
Customizing Configuration
```python
from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True,  # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15,
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)
```
Saving and Loading Configuration
```python
# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
```
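Under the hood, persisting a configuration like this usually amounts to a JSON round-trip over the config's fields. A minimal sketch of the pattern, using a hypothetical `MiniConfig` stand-in (not the library's `CleaningConfig`) whose fields mirror a few of the options above:

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict

# Hypothetical stand-in for illustration only; the real CleaningConfig
# has many more options and may serialize differently.
@dataclass
class MiniConfig:
    lowercase: bool = True
    remove_urls: bool = True
    min_word_length: int = 2

def save_config(config: MiniConfig, path: str) -> None:
    # Serialize the dataclass fields to a JSON file
    with open(path, "w") as f:
        json.dump(asdict(config), f, indent=2)

def load_config(path: str) -> MiniConfig:
    # Rebuild the dataclass from the stored key/value pairs
    with open(path) as f:
        return MiniConfig(**json.load(f))

path = os.path.join(tempfile.gettempdir(), "my_config.json")
save_config(MiniConfig(min_word_length=3), path)
print(load_config(path))  # -> MiniConfig(lowercase=True, remove_urls=True, min_word_length=3)
```

Because the file is plain JSON, a saved configuration can also be inspected or edited by hand before reloading it.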
Caching
The library includes a caching system to avoid redundant processing:
```python
# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
```
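The idea behind such a cache can be sketched with the standard library's memoization: identical inputs are processed once and then served from memory. Here `clean` is a hypothetical stand-in for the full pipeline, and the library's own cache implementation may differ; only `max_size` mirrors the parameter above.

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def clean(text: str) -> str:
    # Stand-in for the full cleaning pipeline
    return " ".join(text.lower().split())

clean("Hello   World")   # computed
clean("Hello   World")   # served from the cache
print(clean.cache_info().hits)  # -> 1
```

This is why caching pays off most when the same texts recur across runs of the pipeline, and why the cache is bounded by `max_size` to keep memory in check.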
Parallel Processing
```python
# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8,    # Number of parallel workers
    batch_size=1000,  # Batch size for processing
)
```
Available Cleaning Operations
- Expand contractions: Convert contractions like "don't" to "do not"
- Lowercase: Convert text to lowercase
- Remove URLs: Remove web links from text
- Remove newlines: Replace newline characters with spaces
- Remove numbers: Remove digits from text
- Remove punctuation: Remove punctuation marks
- Remove emojis: Remove emoji characters
- Tokenization: Split text into tokens
- Remove stopwords: Remove common words like "the", "a", "is"
- Stemming/Lemmatization: Reduce words to their root forms
- Unicode normalization: Normalize accented characters
- Word length filtering: Filter words by length
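Several of the operations above can be sketched with the standard library alone. The function below is a hedged illustration of how such a pipeline chains together, not the library's actual code, and it covers only a subset of the operations (no stopword removal, stemming, or lemmatization, which rely on NLTK):

```python
import re
import unicodedata

def clean(text: str) -> str:
    # Illustrative pipeline: each step corresponds to one operation above
    text = unicodedata.normalize("NFKD", text)         # unicode normalization
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    text = text.lower()                                # lowercase
    text = re.sub(r"\d+", " ", text)                   # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    tokens = text.split()                              # tokenize
    tokens = [t for t in tokens if 2 <= len(t) <= 15]  # word length filtering
    return " ".join(tokens)

print(clean("Check https://example.com and numbers 12345!"))
# -> "check and numbers"
```

The order matters: URLs must be removed before punctuation stripping, or the URL's slashes and dots would leave stray fragments behind.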
Supported Languages
- English
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Download files
Source Distribution
nlpprepkit-1.1.3.tar.gz (40.3 kB)
Built Distribution
nlpprepkit-1.1.3-py3-none-any.whl (14.6 kB)
File details
Details for the file nlpprepkit-1.1.3.tar.gz.
File metadata
- Download URL: nlpprepkit-1.1.3.tar.gz
- Upload date:
- Size: 40.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0cc1bdc5d010eab8c30f30463440ec2f920068efe07bec59b8e787b21edfef6a` |
| MD5 | `e95b8a7306dea3d415e3d9d634ea3e9e` |
| BLAKE2b-256 | `93170e9345c3e6e982e07f3f906158d789798e84a4853ecbc9a8102a9b77b71c` |
File details
Details for the file nlpprepkit-1.1.3-py3-none-any.whl.
File metadata
- Download URL: nlpprepkit-1.1.3-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `8b8e3c14530cf5f8d441723b66c8432a3e00e3ef31ab9c2739546f7e6286cf04` |
| MD5 | `8c800f25d96a87e6d4a501a084b2d76c` |
| BLAKE2b-256 | `95c954fae523213489cd65db07070de9404ba838e7e5126d98f23a5efb398016` |