Text Preprocessing Library

Project description

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
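The saved file is plain JSON, so a configuration written by save_config might look roughly like the sketch below. The exact field names are an assumption here, mirroring the CleaningConfig arguments shown above:

```json
{
  "expand_contractions": true,
  "lowercase": true,
  "remove_urls": true,
  "remove_numbers": true,
  "remove_stopwords": true,
  "stemming": false,
  "lemmatization": true,
  "language": "english",
  "min_word_length": 2,
  "max_word_length": 15
}
```

Keeping the configuration in version control alongside your code makes a preprocessing pipeline reproducible across machines.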

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
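The idea behind the cache can be sketched with the standard library. This is an illustrative stand-in, not nlpprepkit's actual implementation; clean_text is a made-up toy function:

```python
import functools
import re

@functools.lru_cache(maxsize=1000)  # bounded cache, analogous to max_size=1000
def clean_text(text: str) -> str:
    """Toy cleaning step: strip digits, lowercase, trim whitespace."""
    return re.sub(r"\d+", "", text).lower().strip()

clean_text("Hello 123")  # computed on the first call
clean_text("Hello 123")  # identical input: served from the cache
print(clean_text.cache_info().hits)
```

Because cleaning is a pure function of its input, repeated inputs (common in real-world corpora) can skip the work entirely.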

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
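The batching pattern behind max_workers and batch_size can be sketched with only the standard library. This is not nlpprepkit's internal code; process_batch is a placeholder for the real per-batch cleaning work:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Placeholder per-batch work: lowercase each text
    return [t.lower() for t in batch]

def process_in_batches(texts, max_workers=8, batch_size=1000):
    # Split the input into batches, then clean batches concurrently
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch_result in pool.map(process_batch, batches):
            results.extend(batch_result)  # pool.map preserves input order
    return results

print(process_in_batches(["A", "B", "C"], max_workers=2, batch_size=2))
```

Batching amortizes per-task overhead: submitting one task per text would spend more time on scheduling than on cleaning.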

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
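To make the list concrete, a few of these operations can be sketched with plain regular expressions. This illustrates the concepts only; it is not the library's implementation:

```python
import re
import string

def sketch_clean(text: str) -> str:
    text = text.lower()                       # lowercase
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())             # collapse leftover whitespace

print(sketch_clean("Visit https://example.com NOW, 42 times!"))
```

Order matters: URLs must be removed before punctuation, or stripping "://" first would leave URL fragments behind as ordinary tokens.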

Supported Languages

  • English

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License


Download files

Download the file for your platform.

Source Distribution

nlpprepkit-1.0.8.post0.tar.gz (40.3 kB)

Uploaded Source

Built Distribution


nlpprepkit-1.0.8.post0-py3-none-any.whl (14.7 kB)

Uploaded Python 3

File details

Details for the file nlpprepkit-1.0.8.post0.tar.gz.

File metadata

  • File name: nlpprepkit-1.0.8.post0.tar.gz
  • Size: 40.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for nlpprepkit-1.0.8.post0.tar.gz

  • SHA256: 0555c9c7cc92a3914dc6ed7c9234496615426fb38993956492b36c2725bf31f2
  • MD5: 75429bd8e6094547ca47e607c8923e8c
  • BLAKE2b-256: d58cfb4001eda4b859242238f40376da2824e4cb52012761ee97b7db7ccc1165

File details

Details for the file nlpprepkit-1.0.8.post0-py3-none-any.whl.

File hashes

Hashes for nlpprepkit-1.0.8.post0-py3-none-any.whl

  • SHA256: d3467862d679cdf06d2d64e05e27157f3af02caf056059af46e085ea62e62a47
  • MD5: 018017f834a8dcb805033e7c8f683ff8
  • BLAKE2b-256: c48fd3b542b3748266205f0c069d9b4636b9b0d20d3f5eee0e0f28bb846016d3
