Text Preprocessing Library

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)
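
As the comment above notes, stemming and lemmatization are mutually exclusive. A minimal sketch of how such a constraint can be enforced with a dataclass (illustrative only, using a hypothetical `ExampleConfig`; this is not nlpprepkit's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class ExampleConfig:
    """Illustrative config with a mutual-exclusion check (not nlpprepkit's code)."""
    stemming: bool = False
    lemmatization: bool = False

    def __post_init__(self):
        # Reject configs that enable both word-normalization strategies,
        # since stemming and lemmatizing the same token would conflict.
        if self.stemming and self.lemmatization:
            raise ValueError("Enable either stemming or lemmatization, not both")

ExampleConfig(stemming=False, lemmatization=True)  # valid
```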

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")
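
The general save/load pattern can be sketched with the standard library alone; this assumes a flat JSON serialization of config fields and a hypothetical `ExampleConfig`, not nlpprepkit's actual file format:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ExampleConfig:
    lowercase: bool = True
    remove_urls: bool = True
    language: str = "english"

# Save: serialize the config's fields to a JSON file
cfg = ExampleConfig()
with open("example_config.json", "w") as f:
    json.dump(asdict(cfg), f)

# Load: rebuild an equivalent config from the saved fields
with open("example_config.json") as f:
    restored = ExampleConfig(**json.load(f))

assert restored == cfg  # round-trip preserves all settings
```

Persisting settings this way is what makes a preprocessing pipeline reproducible across runs and machines.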

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
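
The idea behind such a cache can be sketched with `functools.lru_cache`: repeated inputs are processed once and then served from a bounded in-memory store. This is an illustration of the concept, not nlpprepkit's internal cache:

```python
from functools import lru_cache

# A bounded cache keyed on the input text; identical inputs are
# cleaned only once, and the least recently used entries are evicted
# when the cache exceeds maxsize.
@lru_cache(maxsize=1000)
def clean(text: str) -> str:
    # Stand-in for an expensive cleaning step
    return " ".join(text.lower().split())

clean("Hello   World")  # computed
clean("Hello   World")  # served from cache
```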

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
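
The batching strategy behind `max_workers` and `batch_size` can be sketched with `concurrent.futures`; splitting inputs into fixed-size batches amortizes per-task overhead across many texts. This is a minimal stdlib sketch with a stand-in `clean` function, not the library's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def clean(text: str) -> str:
    # Stand-in for a real cleaning step
    return text.lower().strip()

def process_batched(texts, max_workers=8, batch_size=1000):
    # Split the inputs into fixed-size batches, then hand each batch
    # to a worker; order of results matches the input order.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda batch: [clean(t) for t in batch], batches)
    return [t for batch in results for t in batch]
```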

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
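
Several of these operations compose naturally into a single pass over the text. A minimal sketch of what such a pipeline typically does, written with the standard library (illustrative only; nlpprepkit's actual operations are configured via `CleaningConfig`):

```python
import re
import unicodedata

def clean(text: str) -> str:
    # Illustrative versions of a few operations listed above
    text = unicodedata.normalize("NFKD", text)               # unicode normalization
    text = re.sub(r"https?://\S+", " ", text)                # remove URLs
    text = text.lower()                                      # lowercase
    text = re.sub(r"\d+", " ", text)                         # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)                     # remove punctuation
    tokens = text.split()                                    # tokenize
    # word length filtering, analogous to min/max_word_length
    return " ".join(t for t in tokens if 2 <= len(t) <= 15)

clean("Visit https://example.com NOW!! 123")  # → "visit now"
```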

Supported Languages

  • English
  • Spanish
  • French
  • German
  • Italian
  • Dutch
  • Portuguese

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Download files

Source Distribution: nlpprepkit-1.0.3.tar.gz (13.1 kB)
Built Distribution: nlpprepkit-1.0.3-py3-none-any.whl (14.5 kB)

Both files were uploaded via twine/6.1.0 on CPython/3.12.9 (Trusted Publishing not used).

File hashes

nlpprepkit-1.0.3.tar.gz
  SHA256: 9c8f1950f94608639d58b9a28452e3cb61908b7f6420d14b7b3c42e316a24270
  MD5: dc528a9d2505fe252ef4aad5e5b1a401
  BLAKE2b-256: 40edf0627d60e128bdcb31f6cd290f975ba6cdb8b9423c77dbb9e0f00e203da1

nlpprepkit-1.0.3-py3-none-any.whl
  SHA256: bdf60d7e678c1fc33a65f37d627404b8c8067086969499b4c7762bfd8ae07264
  MD5: 98e33e636dd6251d2eed3216c0539af0
  BLAKE2b-256: f086f81d395ccc66a96a4c90fe701ccb60d2681e3d9b5682805355bca128d7a5
