
A robust NLP pipeline for stemming, lemmatization, and vectorization


pun_nlp


Overview

pun_nlp is an NLP abstraction layer that simplifies text processing and vectorization. It handles dependency management, resource downloading, and text preprocessing automatically, so you don't have to write boilerplate code.

It works around common NLTK download and path errors with a lazy-loading resource manager that is designed to work in restricted environments such as Kaggle and corporate servers.
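The lazy-loading idea can be illustrated with a small, standard-library-only sketch. This is not pun_nlp's actual implementation: `LazyResource`, `loader`, and `fallback` are hypothetical names chosen for illustration. The `LookupError` branch mirrors the exception that `nltk.data.find` raises when a resource is missing locally, which is the point where a real manager would likely attempt a download.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Hypothetical helper: defer an expensive load until first use."""

    def __init__(self, loader: Callable[[], T],
                 fallback: Optional[Callable[[], T]] = None) -> None:
        self._loader = loader        # cheap path, e.g. read from a local cache
        self._fallback = fallback    # slow path, e.g. download then load
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Load once; later calls return the cached value.
        if not self._loaded:
            try:
                self._value = self._loader()
            except LookupError:
                # Resource not available locally: use the fallback if given.
                if self._fallback is None:
                    raise
                self._value = self._fallback()
            self._loaded = True
        return self._value  # type: ignore[return-value]


calls = []


def load_stopwords():
    # Stand-in for an expensive load; records each invocation.
    calls.append("load")
    return {"the", "are"}


stopwords = LazyResource(load_stopwords)
first = stopwords.get()    # triggers the load
second = stopwords.get()   # served from the cache, no second load
```

Constructing `LazyResource` costs nothing; the loader only runs on the first `get()` call, which is what keeps start-up fast when a pipeline never touches a given resource.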

Features

  • Robust Resource Management: Automatically handles NLTK/spaCy downloads and SSL errors.
  • Lazy Loading: Resources are only loaded into memory when needed.
  • Type Safety: Prevents invalid combinations of operations (like vectorizing POS tuples).
  • Unified API: Process single strings, lists, or 2D arrays of text with one method.
  • Seamless Vectorization: Integrates directly with scikit-learn's TF-IDF and count vectorizers.
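As a rough illustration of how a single method can accept strings, lists, and nested lists, the sketch below dispatches recursively on the input type. It is an assumption about the general technique, not pun_nlp's code; `apply_nested` is a hypothetical helper name.

```python
from typing import Any, Callable


def apply_nested(fn: Callable[[str], Any], data: Any) -> Any:
    """Apply fn to a string, or to every string in a (nested) list or tuple."""
    if isinstance(data, str):
        return fn(data)
    if isinstance(data, (list, tuple)):
        # Recurse so 2D (or deeper) inputs keep their shape.
        return [apply_nested(fn, item) for item in data]
    raise TypeError(f"Unsupported input type: {type(data).__name__}")


single = apply_nested(str.lower, "Hello World")
batch = apply_nested(str.lower, ["Hello", "World"])
grid = apply_nested(str.lower, [["A", "B"], ["C"]])
```

The output mirrors the input's nesting, so callers do not need separate code paths for one document versus a batch.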

Installation

pip install pun_nlp

Usage

Basic Pipeline

from pun_nlp import NLPProcessor

# Initialize with desired flags
p = NLPProcessor(
    tokenize=True,
    stem=True,
    remove_stopwords=True,
    normalize=True
)

text = "The QUICK brown foxes are running fast!"

# Automatically handles downloads and processing
print(p.process(text))
# Output: ['quick', 'brown', 'fox', 'run', 'fast']
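To make the order of operations concrete, here is a standard-library-only sketch of the normalize, tokenize, and stopword-removal steps. It deliberately omits stemming, uses a tiny hypothetical stopword set (`TOY_STOPWORDS`) instead of NLTK's list, and is not pun_nlp's implementation, so its result differs from the stemmed output shown above.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
TOY_STOPWORDS = frozenset({"the", "are", "is", "a"})

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(_STRIP_PUNCT)


def simple_pipeline(text: str) -> list:
    """Normalize, split on whitespace, then drop stopwords."""
    tokens = normalize(text).split()
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]


result = simple_pipeline("The QUICK brown foxes are running fast!")
# result == ['quick', 'brown', 'foxes', 'running', 'fast']
```

Running a stemmer such as NLTK's `PorterStemmer` over these tokens would be the remaining step needed to reduce "foxes" and "running" to their stems.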

NER & POS Tagging

# NER (case-sensitive entity detection runs before normalization)
p_ner = NLPProcessor(ner=True)
print(p_ner.process("Apple Inc. is hiring in California."))
# Output: [('Apple Inc.', 'ORG'), ('California', 'GPE')]

# POS tagging (tokens are tagged before stemming, so tags reflect the original words)
p_pos = NLPProcessor(pos_tagging=True, stem=True)
print(p_pos.process("The boys are likely running."))
# Output: [('the', 'DT'), ('boy', 'NNS'), ('are', 'VBP'), ('like', 'RB'), ('run', 'VBG')]

Vectorization

p_vec = NLPProcessor(vectorize="tfidf", remove_stopwords=True)
corpus = [
    "Machine learning is fascinating.",
    "Natural language processing is a subset of AI."
]

p_vec.fit_vectorizer(corpus)
vectors = p_vec.transform_texts(corpus)
print(vectors.shape)
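For readers unsure what the shape of the result means, the following standard-library sketch builds a plain count matrix by hand. It is an illustration of the fit/transform idea only: it does not use pun_nlp or scikit-learn, applies no stopword removal or TF-IDF weighting, and `fit_vocabulary` and `transform` are hypothetical helper names. The token pattern (words of two or more characters) is borrowed from scikit-learn's default.

```python
import re
from collections import Counter

# Words of two or more characters, similar to scikit-learn's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text: str) -> list:
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(corpus: list) -> dict:
    """Map each distinct term to a column index (alphabetical order)."""
    terms = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(corpus: list, vocab: dict) -> list:
    """Return one row of term counts per document; unknown terms are ignored."""
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocab])
    return rows


corpus = [
    "Machine learning is fascinating.",
    "Natural language processing is a subset of AI.",
]

vocab = fit_vocabulary(corpus)
matrix = transform(corpus, vocab)
shape = (len(matrix), len(vocab))  # (documents, distinct terms)
```

Each row corresponds to one document and each column to one vocabulary term, which is the same layout the shape of a vectorizer's output describes. The number of columns depends on the preprocessing applied, so it will generally differ once stopwords are removed.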

Configuration

Parameter         Description
stem              Enable stemming (PorterStemmer).
lemmatize         Enable lemmatization (WordNet).
vectorize         "tfidf", "count", or None.
tokenize          Force return of token list.
remove_stopwords  Remove English stopwords (case-insensitive).
pos_tagging       Return (word, tag) tuples.
ner               Return entity tuples (uses spaCy).
normalize         Lowercase and remove punctuation.
backend           "nltk" (default) or "spacy".
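The kind of flag validation described under Type Safety could look roughly like the sketch below. `PipelineConfig` is a hypothetical class, not part of pun_nlp, and the specific rules are assumptions: the source only names vectorizing POS tuples as an invalid combination, so treating NER the same way is a guess.

```python
from dataclasses import dataclass
from typing import Optional

VALID_VECTORIZERS = (None, "tfidf", "count")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical config object that rejects incompatible flags early."""

    stem: bool = False
    lemmatize: bool = False
    vectorize: Optional[str] = None
    pos_tagging: bool = False
    ner: bool = False

    def __post_init__(self) -> None:
        if self.vectorize not in VALID_VECTORIZERS:
            raise ValueError(
                f"vectorize must be 'tfidf', 'count', or None, got {self.vectorize!r}"
            )
        # Assumption: tuple outputs (POS or NER) cannot be fed to a vectorizer.
        if self.vectorize is not None and (self.pos_tagging or self.ner):
            raise ValueError("vectorize cannot be combined with pos_tagging or ner")


ok = PipelineConfig(stem=True, vectorize="tfidf")

try:
    PipelineConfig(vectorize="tfidf", pos_tagging=True)
    rejected = False
except ValueError:
    rejected = True
```

Failing at construction time surfaces a bad combination before any text is processed or any resource is downloaded.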

License

MIT License. See LICENSE for details.
