
Text Preprocessing Library

Project description

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with a built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
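
The round-trip idea behind `save_config`/`from_config_file` can be sketched without the library. The `Config` dataclass, `save_config`, and `load_config` below are hypothetical stand-ins (they mirror only a few of the options shown above), not the library's actual implementation:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for CleaningConfig with a small subset of fields.
@dataclass
class Config:
    lowercase: bool = True
    remove_urls: bool = True
    min_word_length: int = 2

def save_config(cfg: Config, path: str) -> None:
    # Serialize the dataclass fields to JSON, much like save_config() would.
    with open(path, "w") as f:
        json.dump(asdict(cfg), f)

def load_config(path: str) -> Config:
    # Rebuild the dataclass from the stored JSON keys.
    with open(path) as f:
        return Config(**json.load(f))
```

Because the fields are plain JSON-serializable values, a saved configuration can be checked into version control for reproducible workflows.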

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
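
To illustrate what a bounded cache buys you, here is a minimal sketch using the standard library's `functools.lru_cache` over a stand-in cleaning function (`normalize` is hypothetical, not part of nlpprepkit):

```python
from functools import lru_cache

# Stand-in for an expensive cleaning pipeline; repeated inputs are
# served from the cache instead of being reprocessed.
@lru_cache(maxsize=1000)
def normalize(text: str) -> str:
    return " ".join(text.lower().split())
```

After the first call on a given input, identical inputs hit the cache, which is what makes caching worthwhile when the same texts recur in a dataset.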

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
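
The batching pattern implied by `max_workers` and `batch_size` can be sketched with the standard library. This uses a thread pool for simplicity (a CPU-bound pipeline would more likely use processes); `batched`, `clean_batch`, and `process_all` are illustrative names, not the library's API:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, batch_size):
    # Yield successive fixed-size chunks from an iterable.
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

def clean_batch(batch):
    # Placeholder per-batch work: normalize whitespace and case.
    return [" ".join(t.lower().split()) for t in batch]

def process_all(texts, max_workers=8, batch_size=1000):
    # Fan batches out to workers; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = []
        for out in pool.map(clean_batch, batched(texts, batch_size)):
            results.extend(out)
        return results
```

Batching amortizes per-task overhead: submitting one task per text would swamp the workers with scheduling cost on large collections.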

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
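
Several of the operations above can be sketched with plain Python and regular expressions. This is an illustration of the concepts, not the library's implementation: the contraction table is a tiny illustrative subset, and stopword removal and stemming/lemmatization are omitted because they require NLTK data:

```python
import re
import string
import unicodedata

# Tiny illustrative contraction table (a real one is much larger).
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}

def clean_text(text: str) -> list[str]:
    text = text.lower()                                  # lowercase
    for contraction, expanded in CONTRACTIONS.items():   # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = text.replace("\n", " ")                       # remove newlines
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = unicodedata.normalize("NFKD", text)           # normalize unicode
    # tokenize and filter by word length (2–15 characters here)
    return [t for t in text.split() if 2 <= len(t) <= 15]
```

Note that operation order matters: contractions must be expanded before punctuation removal strips the apostrophes, and URLs must be removed before punctuation removal mangles them.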

Supported Languages

  • English

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License


Download files

Download the file for your platform.

Source Distribution

nlpprepkit-1.2.0.tar.gz (16.6 kB)

Built Distribution


nlpprepkit-1.2.0-py3-none-any.whl (10.6 kB)

File details

Details for the file nlpprepkit-1.2.0.tar.gz.

File metadata

  • Download URL: nlpprepkit-1.2.0.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.10

File hashes

Hashes for nlpprepkit-1.2.0.tar.gz:

  • SHA256: 87464a2caeedc71f1506f0eb5d96bafd047e68b2fb924de1f6b44734eed82fbc
  • MD5: 0f0f79cb844e767d67bc275d48a54e64
  • BLAKE2b-256: a6c6072b3cd5cc2a7d66940d46512ccd2e766181e05f5dddbebb0640c4af7831


File details

Details for the file nlpprepkit-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: nlpprepkit-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.10

File hashes

Hashes for nlpprepkit-1.2.0-py3-none-any.whl:

  • SHA256: 0ad475d40d12a38f8d451ea54b607506d95de34548a9f72047c3945c151121b6
  • MD5: b6ec39956d6f89f9c72253305d91f79f
  • BLAKE2b-256: 332732c503194bd464f91f719b23eb3c70556a23e60aa57c7258f24110dc1e0b

