nlpprepkit

Text Preprocessing Library
This can't be the best library for text preprocessing, but it's definitely a library!
Installation
```shell
pip install nlpprepkit
```
Or install from source:
```shell
git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .
```
Features
- Flexible cleaning options: Control which cleaning operations to apply
- Parallel processing: Process large text collections efficiently with multi-core support
- Caching: Avoid redundant processing with built-in caching system
- NLTK integration: Easy access to stemming, lemmatization, and stopword removal
- Configurable: Save and load configuration settings for reproducible workflows
Quick Start
```python
from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text)  # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't",
]
results = preprocessor.process_text(texts)
```
Customizing Configuration
```python
from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True,  # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15,
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)
```
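The note that only one of stemming or lemmatization can be enabled is the kind of constraint a config object can enforce at construction time. As an illustrative sketch (not nlpprepkit's actual implementation; the class and field names here only mirror the README), a dataclass might validate it like this:

```python
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    """Hypothetical config object mirroring two CleaningConfig flags."""
    stemming: bool = False
    lemmatization: bool = False

    def __post_init__(self):
        # Stemming and lemmatization both reduce words to root forms,
        # so enabling both at once would be ambiguous.
        if self.stemming and self.lemmatization:
            raise ValueError("Enable only one of stemming or lemmatization")


cfg = ExampleConfig(lemmatization=True)  # valid: exactly one enabled
```

Validating at construction time surfaces a conflicting configuration immediately, rather than partway through a long preprocessing run.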
Saving and Loading Configuration
```python
# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
```
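Persisting the configuration as JSON is what makes a pipeline reproducible: the same file yields the same preprocessor. A minimal stdlib sketch of such a round trip (field names and helpers here are illustrative, not nlpprepkit's actual schema):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleConfig:
    """Hypothetical config with a few fields from the README."""
    lowercase: bool = True
    remove_urls: bool = True
    min_word_length: int = 2


def save_config(cfg: ExampleConfig, path: str) -> None:
    # A dataclass serializes cleanly once converted to a plain dict.
    with open(path, "w") as f:
        json.dump(asdict(cfg), f, indent=2)


def load_config(path: str) -> ExampleConfig:
    with open(path) as f:
        return ExampleConfig(**json.load(f))


save_config(ExampleConfig(min_word_length=3), "example_config.json")
restored = load_config("example_config.json")
```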
Caching
The library includes a caching system to avoid redundant processing:
```python
# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
```
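Caching pays off because the same strings often recur across a corpus. As a rough sketch of the idea, using the stdlib's `functools.lru_cache` rather than nlpprepkit's own cache:

```python
from functools import lru_cache


@lru_cache(maxsize=1000)  # analogous to enable_cache(max_size=1000)
def clean(text: str) -> str:
    # Stand-in for an expensive cleaning pipeline.
    return " ".join(text.lower().split())


clean("Hello   World")  # computed on the first call
clean("Hello   World")  # identical input: served from the cache
clean.cache_info()      # CacheInfo(hits=1, misses=1, ...)
```

The same trade-off applies in either implementation: a bounded cache size caps memory use, and clearing the cache (`clean.cache_clear()` here) frees it when inputs stop repeating.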
Parallel Processing
```python
# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8,    # Number of parallel workers
    batch_size=1000,  # Batch size for processing
)
```
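Batched parallelism is a common pattern for large corpora: splitting the input into chunks amortizes per-task overhead across many texts. A hedged stdlib sketch of the same shape (the real method's internals may differ; `clean` is a stand-in for the cleaning pipeline):

```python
from concurrent.futures import ThreadPoolExecutor


def clean(text: str) -> str:
    # Stand-in for an expensive cleaning pipeline.
    return " ".join(text.lower().split())


def process_in_batches(texts, max_workers=8, batch_size=1000):
    # Chunk the input so each worker handles a whole batch at a time.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order, so results line up with inputs.
        for batch_result in pool.map(lambda b: [clean(t) for t in b], batches):
            results.extend(batch_result)
    return results


out = process_in_batches(["  A  b ", "C   d"], max_workers=2, batch_size=1)
```

For CPU-bound cleaning in Python, a process pool tends to scale better than threads; a thread pool keeps this sketch simple and picklable-function-free.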
Available Cleaning Operations
- Expand contractions: Convert contractions like "don't" to "do not"
- Lowercase: Convert text to lowercase
- Remove URLs: Remove web links from text
- Remove newlines: Replace newline characters with spaces
- Remove numbers: Remove digits from text
- Remove punctuation: Remove punctuation marks
- Remove emojis: Remove emoji characters
- Tokenization: Split text into tokens
- Remove stopwords: Remove common words like "the", "a", "is"
- Stemming/Lemmatization: Reduce words to their root forms
- Unicode normalization: Normalize accented characters
- Word length filtering: Filter words by length
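Most of the operations above reduce to regex substitutions and simple token filters. An illustrative stdlib sketch of a few of them chained together (not nlpprepkit's actual code):

```python
import re
import string
import unicodedata


def clean(text: str, min_len: int = 2, max_len: int = 15) -> list[str]:
    text = text.lower()                                    # lowercase
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = unicodedata.normalize("NFKD", text)             # normalize unicode
    text = text.encode("ascii", "ignore").decode()         # drop accents and emojis
    text = re.sub(r"\d+", " ", text)                       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                  # tokenize on whitespace
    return [t for t in tokens if min_len <= len(t) <= max_len]  # length filter


clean("Café at https://example.com costs 12 dollars!")
```

Order matters in such a pipeline: URLs must be removed before punctuation stripping, or `https://example.com` degrades into stray tokens like `httpsexamplecom`.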
Supported Languages
- English
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Download files
Source Distribution
- nlpprepkit-1.0.5.tar.gz (13.2 kB)

Built Distribution
- nlpprepkit-1.0.5-py3-none-any.whl (14.7 kB)
File details
Details for the file nlpprepkit-1.0.5.tar.gz.
File metadata
- Download URL: nlpprepkit-1.0.5.tar.gz
- Upload date:
- Size: 13.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 2be104cc5b89d3dd8f1225be6bd4762b41608f6b9a6d78c37371be7e25cc48f7 |
| MD5 | 23fedca2a8e64c0128e959d0a8972788 |
| BLAKE2b-256 | 3dcb37fcb2584915b92db08af82fac962d24407170c2de0b529ea8ff4564c534 |
File details
Details for the file nlpprepkit-1.0.5-py3-none-any.whl.
File metadata
- Download URL: nlpprepkit-1.0.5-py3-none-any.whl
- Upload date:
- Size: 14.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ee1a7ee14d824c481b97bad51654ea5714601bdce7606b993d1600bc872982de |
| MD5 | 5e88ad3c3816a2b8bb7fabaaf0c1d43a |
| BLAKE2b-256 | e42397ed4a8bbf1026fd9e5cc37fe936285cb75524ddeadf814db0d6089dcff8 |