Text Preprocessing Library

Project description

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
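The effect of such a cache can be illustrated with Python's built-in `functools.lru_cache`. This is a sketch of the idea, not the library's actual implementation; `clean` here is a stand-in function:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def clean(text: str) -> str:
    # Stand-in for an expensive cleaning pipeline; repeated
    # inputs are answered from the cache instead of recomputed.
    return " ".join(text.lower().split())

clean("Hello   World")         # computed
clean("Hello   World")         # served from cache
print(clean.cache_info())      # one hit, one miss so far
```

A bounded cache like this pays off when the same texts recur, e.g. duplicated rows in a dataset.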

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
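Batched parallel processing of this kind generally looks like the sketch below. The helper names are hypothetical, and a thread pool is used to keep the example self-contained; for CPU-bound cleaning across cores, a `ProcessPoolExecutor` would be the usual choice:

```python
from concurrent.futures import ThreadPoolExecutor

def clean_one(text: str) -> str:
    # Stand-in for the real cleaning pipeline.
    return text.lower().strip()

def clean_in_batches(texts, max_workers=8, batch_size=1000):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            # pool.map preserves input order within each batch
            results.extend(pool.map(clean_one, batch))
    return results

print(clean_in_batches(["  Hello ", "WORLD"]))  # ['hello', 'world']
```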

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
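Several of these operations can be approximated in plain Python with the standard library. This is a rough sketch of the kind of pipeline described above, not nlpprepkit's implementation; the stopword set and contraction map are tiny illustrative stand-ins:

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "is", "and", "this"}          # tiny illustrative set
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}  # partial map

def clean(text: str, min_len: int = 2, max_len: int = 15) -> list[str]:
    text = unicodedata.normalize("NFKD", text)        # unicode normalization
    text = text.lower()                               # lowercase
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)   # expand contractions
    text = re.sub(r"https?://\S+", " ", text)         # remove URLs
    text = re.sub(r"\d+", " ", text)                  # remove numbers
    tokens = re.findall(r"[a-z]+", text)              # tokenize, drop punctuation
    return [t for t in tokens
            if t not in STOPWORDS and min_len <= len(t) <= max_len]

print(clean("Don't visit https://example.com 12345, it is THE worst!"))
# ['do', 'not', 'visit', 'it', 'worst']
```

Note that the order matters: lowercasing must precede contraction lookup here, and URL removal must precede number removal so digits inside URLs do not leave fragments behind.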

Supported Languages

  • English

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Download files

Download the file for your platform.

Source Distribution

nlpprepkit-1.1.3.tar.gz (40.3 kB)

Built Distribution


nlpprepkit-1.1.3-py3-none-any.whl (14.6 kB)

File details

Details for the file nlpprepkit-1.1.3.tar.gz.

File metadata

  • Download URL: nlpprepkit-1.1.3.tar.gz
  • Upload date:
  • Size: 40.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.9

File hashes

Hashes for nlpprepkit-1.1.3.tar.gz
Algorithm Hash digest
SHA256 0cc1bdc5d010eab8c30f30463440ec2f920068efe07bec59b8e787b21edfef6a
MD5 e95b8a7306dea3d415e3d9d634ea3e9e
BLAKE2b-256 93170e9345c3e6e982e07f3f906158d789798e84a4853ecbc9a8102a9b77b71c

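The published hashes can be checked locally after downloading. A minimal sketch using Python's `hashlib` (the filename is taken from the listing above; the comparison line is illustrative):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 8192) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the published digest for the file you downloaded:
# sha256_of_file("nlpprepkit-1.1.3.tar.gz") == "<published SHA256>"
```

`pip` performs the same check automatically when a hash is pinned in a requirements file (`nlpprepkit==1.1.3 --hash=sha256:...`).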

File details

Details for the file nlpprepkit-1.1.3-py3-none-any.whl.

File metadata

  • Download URL: nlpprepkit-1.1.3-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.9

File hashes

Hashes for nlpprepkit-1.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 8b8e3c14530cf5f8d441723b66c8432a3e00e3ef31ab9c2739546f7e6286cf04
MD5 8c800f25d96a87e6d4a501a084b2d76c
BLAKE2b-256 95c954fae523213489cd65db07070de9404ba838e7e5126d98f23a5efb398016

