nlpprepkit

Text Preprocessing Library
This may not be the best library for text preprocessing, but it's definitely a library!
Installation
```shell
pip install nlpprepkit
```
Or install from source:
```shell
git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .
```
Features
- Flexible cleaning options: Control which cleaning operations to apply
- Parallel processing: Process large text collections efficiently with multi-core support
- Caching: Avoid redundant processing with built-in caching system
- NLTK integration: Easy access to stemming, lemmatization, and stopword removal
- Configurable: Save and load configuration settings for reproducible workflows
Quick Start
```python
from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text)  # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)
```
Customizing Configuration
```python
from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True,  # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)
```
Saving and Loading Configuration
```python
# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
```
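Under the hood, this kind of save/load feature usually amounts to a JSON round-trip of the configuration fields. As a minimal sketch of the idea (the `Config` dataclass below is illustrative, not nlpprepkit's actual `CleaningConfig`):

```python
import json
from dataclasses import dataclass, asdict

# Illustrative stand-in for a cleaning config; a sketch of the JSON
# round-trip idea, not nlpprepkit's real class.
@dataclass
class Config:
    lowercase: bool = True
    remove_urls: bool = True
    language: str = "english"

cfg = Config(language="spanish")
payload = json.dumps(asdict(cfg))       # serialize to JSON text
loaded = Config(**json.loads(payload))  # rebuild an equal config
print(loaded == cfg)                    # True
```

Because the round-tripped object compares equal to the original, a pipeline rebuilt from a saved config file reproduces the same preprocessing behavior.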
Caching
The library includes a caching system to avoid redundant processing:
```python
# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
```
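The caching idea is simple memoization: because cleaning is a pure function of its input text, repeated inputs can be served from a cache instead of reprocessed. A minimal sketch with Python's standard `functools.lru_cache` (illustrative only; it does not show nlpprepkit's internals):

```python
from functools import lru_cache

# Memoize a pure cleaning function so repeated inputs skip reprocessing.
# The cleaning body here is a trivial stand-in for the real pipeline.
@lru_cache(maxsize=1000)
def cached_clean(text: str) -> str:
    return " ".join(text.lower().split())

cached_clean("Some  Text")        # first call: computed and stored
cached_clean("Some  Text")        # second call: served from the cache
info = cached_clean.cache_info()
print(info.hits, info.misses)     # 1 1
```

The `maxsize` bound plays the same role as `max_size` above: once the cache is full, the least recently used entries are evicted.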
Parallel Processing
```python
# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8,   # Number of parallel workers
    batch_size=1000  # Batch size for processing
)
```
Available Cleaning Operations
- Expand contractions: Convert contractions like "don't" to "do not"
- Lowercase: Convert text to lowercase
- Remove URLs: Remove web links from text
- Remove newlines: Replace newline characters with spaces
- Remove numbers: Remove digits from text
- Remove punctuation: Remove punctuation marks
- Remove emojis: Remove emoji characters
- Tokenization: Split text into tokens
- Remove stopwords: Remove common words like "the", "a", "is"
- Stemming/Lemmatization: Reduce words to their root forms
- Unicode normalization: Normalize accented characters
- Word length filtering: Filter words by length
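Several of the operations above can be sketched in a few lines of plain Python. The pipeline below is a hedged illustration of the general approach (lowercasing, URL and number removal, punctuation stripping, stopword removal, length filtering) with a tiny made-up stopword set; it is not nlpprepkit's implementation, which also handles contractions, emojis, NLTK stemming/lemmatization, and Unicode normalization.

```python
import re
import string

# Illustrative stopword subset, not a real stopword list
STOPWORDS = {"the", "a", "is", "and", "this", "out", "these"}

def clean(text: str, min_len: int = 2, max_len: int = 15) -> list[str]:
    text = text.lower()                              # lowercase
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = re.sub(r"\d+", " ", text)                 # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                            # naive whitespace tokenizer
    return [t for t in tokens                        # stopword + length filter
            if t not in STOPWORDS and min_len <= len(t) <= max_len]

print(clean("Check out this URL: https://example.com and these numbers 12345!"))
```

Order matters in such a pipeline: URLs must be removed before punctuation stripping, otherwise `https://example.com` degrades into ordinary-looking tokens that survive cleaning.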
Supported Languages
- English
- Spanish
- French
- German
- Italian
- Dutch
- Portuguese
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Download files

Source Distribution: nlpprepkit-1.0.3.tar.gz (13.1 kB)
Built Distribution: nlpprepkit-1.0.3-py3-none-any.whl (14.5 kB)
File details
Details for the file nlpprepkit-1.0.3.tar.gz.
File metadata
- Download URL: nlpprepkit-1.0.3.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9c8f1950f94608639d58b9a28452e3cb61908b7f6420d14b7b3c42e316a24270 |
| MD5 | dc528a9d2505fe252ef4aad5e5b1a401 |
| BLAKE2b-256 | 40edf0627d60e128bdcb31f6cd290f975ba6cdb8b9423c77dbb9e0f00e203da1 |
File details
Details for the file nlpprepkit-1.0.3-py3-none-any.whl.
File metadata
- Download URL: nlpprepkit-1.0.3-py3-none-any.whl
- Upload date:
- Size: 14.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bdf60d7e678c1fc33a65f37d627404b8c8067086969499b4c7762bfd8ae07264 |
| MD5 | 98e33e636dd6251d2eed3216c0539af0 |
| BLAKE2b-256 | f086f81d395ccc66a96a4c90fe701ccb60d2681e3d9b5682805355bca128d7a5 |