Text Preprocessing Library

Project description

nlpprepkit

This can't be the best library for text preprocessing, but it's definitely a library!

Installation

pip install nlpprepkit

Or install from source:

git clone https://github.com/vnniciusg/nlpprepkit.git
cd nlpprepkit
pip install -e .

Features

  • Flexible cleaning options: Control which cleaning operations to apply
  • Parallel processing: Process large text collections efficiently with multi-core support
  • Caching: Avoid redundant processing with built-in caching system
  • NLTK integration: Easy access to stemming, lemmatization, and stopword removal
  • Configurable: Save and load configuration settings for reproducible workflows

Quick Start

from nlpprepkit import TextPreprocessor

# Create a preprocessor with default settings
preprocessor = TextPreprocessor()

# Process a single text
cleaned_text = preprocessor.process_text("Check out this URL: https://example.com and these numbers 12345!")
print(cleaned_text) # Output: "check url number"

# Process multiple texts in parallel
texts = [
    "First text with URL https://example.org",
    "Second text with numbers 12345",
    "Third text with emoji 😀 and contraction don't"
]
results = preprocessor.process_text(texts)

Customizing Configuration

from nlpprepkit import TextPreprocessor, CleaningConfig

# Create a custom configuration
config = CleaningConfig(
    expand_contractions=True,
    lowercase=True,
    remove_urls=True,
    remove_newlines=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_emojis=True,
    tokenize=True,
    remove_stopwords=True,
    stemming=False,
    lemmatization=True, # Only one of stemming or lemmatization can be enabled
    normalize_unicode=True,
    language="english",
    custom_stopwords=["custom1", "custom2"],
    keep_words=["important1", "important2"],
    min_word_length=2,
    max_word_length=15
)

# Create preprocessor with custom config
preprocessor = TextPreprocessor(config)

Saving and Loading Configuration

# Save configuration to a file
preprocessor.save_config("my_config.json")

# Load configuration from a file
preprocessor = TextPreprocessor.from_config_file("my_config.json")

Caching

The library includes a caching system to avoid redundant processing:

# Enable caching (enabled by default)
TextPreprocessor.enable_cache(max_size=1000)

# Clear cache if needed
TextPreprocessor.clear_cache()
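
The idea behind such a cache can be sketched with Python's built-in `functools.lru_cache`. This is an illustration of the concept only, not nlpprepkit's actual implementation; the `clean` function below is a hypothetical stand-in for a real cleaning pipeline:

```python
from functools import lru_cache

# Illustrative sketch: a bounded memoization cache so repeated inputs
# skip re-processing. nlpprepkit's built-in cache may work differently.
@lru_cache(maxsize=1000)
def clean(text: str) -> str:
    # stand-in for an expensive cleaning pipeline
    return " ".join(text.lower().split())

clean("Hello   World")          # computed and stored
clean("Hello   World")          # served from the cache
print(clean.cache_info().hits)  # 1
```

Like nlpprepkit's `max_size` parameter, `maxsize` bounds the cache so memory stays predictable on large corpora.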

Parallel Processing

# Process a large list of texts with parallel processing
results = preprocessor.process_text(
    large_text_list,
    max_workers=8, # Number of parallel workers
    batch_size=1000 # Batch size for processing
)
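
The batching-plus-workers pattern behind `max_workers` and `batch_size` can be sketched with the standard library. This is a generic illustration, not nlpprepkit's internals; `clean` and `process_in_batches` are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

def clean(text: str) -> str:
    # stand-in for a single-text cleaning step
    return " ".join(text.lower().split())

def process_in_batches(texts, max_workers=8, batch_size=1000):
    """Split texts into batches and clean each batch in a worker.

    Executor.map preserves input order, so the flattened result lines
    up with the original list. (CPU-bound cleaning would typically use
    a ProcessPoolExecutor instead of threads.)
    """
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        cleaned = pool.map(lambda batch: [clean(t) for t in batch], batches)
    return [text for batch in cleaned for text in batch]
```

Batching amortizes the per-task overhead: each worker handles `batch_size` texts at a time instead of one, which matters when individual texts are cheap to clean.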

Available Cleaning Operations

  • Expand contractions: Convert contractions like "don't" to "do not"
  • Lowercase: Convert text to lowercase
  • Remove URLs: Remove web links from text
  • Remove newlines: Replace newline characters with spaces
  • Remove numbers: Remove digits from text
  • Remove punctuation: Remove punctuation marks
  • Remove emojis: Remove emoji characters
  • Tokenization: Split text into tokens
  • Remove stopwords: Remove common words like "the", "a", "is"
  • Stemming/Lemmatization: Reduce words to their root forms
  • Unicode normalization: Normalize accented characters
  • Word length filtering: Filter words by length
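
For intuition, a few of the operations above can be approximated with plain regular expressions. This is an illustrative sketch of what each step does, not nlpprepkit's actual code, and the toy contraction handling covers only a single case:

```python
import re

def sketch_clean(text: str) -> str:
    text = text.lower()                       # lowercase
    text = text.replace("don't", "do not")    # toy contraction expansion
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return " ".join(text.split())             # normalize whitespace

print(sketch_clean("Don't visit https://example.com 12345!"))  # do not visit
```

Note that operation order matters: lowercasing must run before the contraction lookup here, and URL removal must run before punctuation stripping, or the URL's `://` would be mangled before the pattern can match it.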

Supported Languages

  • English

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Download files

Download the file for your platform.

Source Distribution

nlpprepkit-1.1.1.tar.gz (40.3 kB)

Uploaded Source

Built Distribution

nlpprepkit-1.1.1-py3-none-any.whl (14.6 kB)

Uploaded Python 3

File details

Details for the file nlpprepkit-1.1.1.tar.gz.

File metadata

  • Download URL: nlpprepkit-1.1.1.tar.gz
  • Upload date:
  • Size: 40.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for nlpprepkit-1.1.1.tar.gz:

  • SHA256: 0ceff28390c899c4ba7851de17d0daeb0ce520777eb04c9bdc0160d08e9ecd12
  • MD5: 004d56bb03e9b534c088f7679c5c68d5
  • BLAKE2b-256: 441f874bc424dbda1c345b092cf534ec530173451384bb8a7df4981f278f61f3

File details

Details for the file nlpprepkit-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: nlpprepkit-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for nlpprepkit-1.1.1-py3-none-any.whl:

  • SHA256: dcd6ae4fc4bc4531fc27f343c3a8bc0a96882917202824635a168535f55781d1
  • MD5: a472906984d6bf1c2e0cc628ab117114
  • BLAKE2b-256: ec36bf5c5d7b6ccfbcb25e07e65a90a8eb01d9b4f41091e99f7d9ae6983fed99
