
Text Preprocessing Library

Project description

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with a built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
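
The round-trip idea behind `save_config`/`from_config_file` can be sketched without the library. The `Config` dataclass, `save_config`, and `load_config` below are hypothetical stand-ins (they mirror only a few of the options shown above), not the library's actual implementation:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for CleaningConfig with a small subset of fields.
@dataclass
class Config:
    lowercase: bool = True
    remove_urls: bool = True
    min_word_length: int = 2

def save_config(cfg: Config, path: str) -> None:
    # Serialize the dataclass fields to JSON, much like save_config() would.
    with open(path, "w") as f:
        json.dump(asdict(cfg), f)

def load_config(path: str) -> Config:
    # Rebuild the dataclass from the stored JSON keys.
    with open(path) as f:
        return Config(**json.load(f))
```

Because the fields are plain JSON-serializable values, a saved configuration can be checked into version control for reproducible workflows.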

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
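
To illustrate what a bounded cache buys you, here is a minimal sketch using the standard library's `functools.lru_cache` over a stand-in cleaning function (`normalize` is hypothetical, not part of nlpprepkit):

```python
from functools import lru_cache

# Stand-in for an expensive cleaning pipeline; repeated inputs are
# served from the cache instead of being reprocessed.
@lru_cache(maxsize=1000)
def normalize(text: str) -> str:
    return " ".join(text.lower().split())
```

After the first call on a given input, identical inputs hit the cache, which is what makes caching worthwhile when the same texts recur in a dataset.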

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
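
The batching pattern implied by `max_workers` and `batch_size` can be sketched with the standard library. This uses a thread pool for simplicity (a CPU-bound pipeline would more likely use processes); `batched`, `clean_batch`, and `process_all` are illustrative names, not the library's API:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, batch_size):
    # Yield successive fixed-size chunks from an iterable.
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

def clean_batch(batch):
    # Placeholder per-batch work: normalize whitespace and case.
    return [" ".join(t.lower().split()) for t in batch]

def process_all(texts, max_workers=8, batch_size=1000):
    # Fan batches out to workers; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = []
        for out in pool.map(clean_batch, batched(texts, batch_size)):
            results.extend(out)
        return results
```

Batching amortizes per-task overhead: submitting one task per text would swamp the workers with scheduling cost on large collections.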

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
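
Several of the operations above can be sketched with plain Python and regular expressions. This is an illustration of the concepts, not the library's implementation: the contraction table is a tiny illustrative subset, and stopword removal and stemming/lemmatization are omitted because they require NLTK data:

```python
import re
import string
import unicodedata

# Tiny illustrative contraction table (a real one is much larger).
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}

def clean_text(text: str) -> list[str]:
    text = text.lower()                                  # lowercase
    for contraction, expanded in CONTRACTIONS.items():   # expand contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = text.replace("\n", " ")                       # remove newlines
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = unicodedata.normalize("NFKD", text)           # normalize unicode
    # tokenize and filter by word length (2–15 characters here)
    return [t for t in text.split() if 2 <= len(t) <= 15]
```

Note that operation order matters: contractions must be expanded before punctuation removal strips the apostrophes, and URLs must be removed before punctuation removal mangles them.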

Supported Languages

  • English

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License


Download files

Download the file for your platform.

Source Distribution

nlpprepkit-1.2.0.tar.gz (16.6 kB)

Built Distribution


nlpprepkit-1.2.0-py3-none-any.whl (10.6 kB)

File details

Details for the file nlpprepkit-1.2.0.tar.gz.

File metadata

  • Download URL: nlpprepkit-1.2.0.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.10

File hashes

Hashes for nlpprepkit-1.2.0.tar.gz:

  • SHA256: 87464a2caeedc71f1506f0eb5d96bafd047e68b2fb924de1f6b44734eed82fbc
  • MD5: 0f0f79cb844e767d67bc275d48a54e64
  • BLAKE2b-256: a6c6072b3cd5cc2a7d66940d46512ccd2e766181e05f5dddbebb0640c4af7831


File details

Details for the file nlpprepkit-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: nlpprepkit-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.10

File hashes

Hashes for nlpprepkit-1.2.0-py3-none-any.whl:

  • SHA256: 0ad475d40d12a38f8d451ea54b607506d95de34548a9f72047c3945c151121b6
  • MD5: b6ec39956d6f89f9c72253305d91f79f
  • BLAKE2b-256: 332732c503194bd464f91f719b23eb3c70556a23e60aa57c7258f24110dc1e0b

