An easy-to-use library with advanced preprocessing features to streamline and accelerate machine learning workflows.

NLPkit

A toolkit that wraps popular scikit-learn models with NLP-aware preprocessing and useful utilities. NLPkit simplifies text classification and clustering by offering:

  • Tokenization
  • Stop-word removal
  • Contraction expansion
  • Punctuation stripping
  • Rare word filtering
  • Vectorization
  • Automated model management and evaluation

Installation

Install via pip:

pip install nlpkit

Quick Start

from nlpkit import LogisticClassifier

# Sample data
docs = [
    "The food was absolutely fantastic!",
    "I can't stand the traffic here.",
    "What a beautiful experience that was.",
    "This product broke after one use.",
    "Totally worth the price!",
    "I wouldn't recommend this to anyone.",
    "Had a great time with my friends.",
    "The customer service was disappointing.",
    "Everything about this place was perfect.",
    "It was a complete waste of money."
]

labels = [
    "positive", "negative", "positive", "negative", "positive",
    "negative", "positive", "negative", "positive", "negative",
]

# Initialize and train the classifier
clf = LogisticClassifier(
    embedding='tfidf',       # 'count' or 'tfidf'
    n=1,                     # 1 = unigram
    stop='english',          # language for stop words
    punc=True,               # remove punctuation
    normalization_method='lemmatization',  # or 'stemming'
    rare_words=0.01          # drop words in bottom 1%
)
clf.fit(docs, labels)

# Predict and evaluate
print(clf.predict(["What a wonderful experience!"]))
print(clf.getClassificationReport())

Features

  • Text Preprocessing
    • Tokenization
    • Stop-word removal
    • Contraction expansion (e.g., "don't" → "do not")
    • Punctuation removal
    • Rare word filtering by proportion or count
    • Normalization: stemming or lemmatization
  • Vectorization
    • Bag-of-words (CountVectorizer)
    • TF-IDF (TfidfVectorizer)
  • Model Management
    • fit, predict, predict_proba, score
    • Export trained pipeline to file
    • Inspect coefficients and intercept
  • Evaluation
    • Confusion matrix
    • Classification report
    • Precision, recall, and F1 scores
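
To illustrate what the preprocessing steps above do, here is a minimal, self-contained sketch using only the standard library. The contraction and stop-word tables are tiny stand-ins (NLPkit uses NLTK's stop-word lists); the `preprocess` function is a hypothetical illustration, not NLPkit's internal implementation:

```python
import string

# Tiny stand-in tables; the real library uses fuller resources (e.g. NLTK).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "wouldn't": "would not"}
STOP_WORDS = {"the", "a", "an", "was", "is", "to", "this"}

def preprocess(text: str) -> list[str]:
    text = text.lower()
    # Contraction expansion ("can't" -> "cannot") before punctuation removal,
    # so the apostrophe is still present when we look the contraction up.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Punctuation stripping
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenization and stop-word removal
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("I can't stand the traffic here."))
# ['i', 'cannot', 'stand', 'traffic', 'here']
```

Note the ordering: expanding contractions before stripping punctuation matters, because removing the apostrophe first would turn "can't" into "cant" and the lookup would miss it.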

LogisticClassifier API

Initialization Parameters

Parameter              Type           Default    Description
penalty                str            'l2'       Regularization penalty (see LogisticRegression).
dual                   bool           False      Dual or primal formulation.
tol                    float          1e-4       Tolerance for the stopping criteria.
C                      float          1.0        Inverse of regularization strength.

All other sklearn.linear_model.LogisticRegression arguments (e.g. solver, max_iter) are also supported.

NLP-specific:

Parameter              Type           Default    Description
embedding              str            'count'    'count' or 'tfidf'.
n                      int            1          N-gram size (1 = unigram, 2 = bigram, etc.).
stop                   str            'english'  Stop-word language for NLTK.
punc                   bool           True       Remove punctuation if True.
extraction             bool           True       Expand contractions if True.
rare_words             float or int   0          Proportion or count threshold for dropping rare words.
normalization_method   str or None    None       'stemming', 'lemmatization', or None.
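
The rare_words parameter accepts either form of threshold: a float is read as a proportion of the corpus, an int as a minimum occurrence count. A small stdlib sketch of that dual behavior (filter_rare is a hypothetical helper mirroring the parameter's description, not NLPkit code):

```python
from collections import Counter

def filter_rare(docs_tokens, rare_words):
    """Drop tokens whose corpus frequency falls below a threshold.
    A float rare_words is treated as a proportion of all tokens;
    an int is treated as a minimum count."""
    counts = Counter(tok for doc in docs_tokens for tok in doc)
    total = sum(counts.values())
    min_count = rare_words * total if isinstance(rare_words, float) else rare_words
    return [[t for t in doc if counts[t] >= min_count] for doc in docs_tokens]

docs = [["good", "food"], ["good", "service"], ["good", "food", "bad"]]
print(filter_rare(docs, 2))  # keep tokens appearing at least twice
# [['good', 'food'], ['good'], ['good', 'food']]
```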

Model Methods

fit(X: List[str], y: List[Any]) -> None

Train on raw text and labels.

predict(X: List[str]) -> np.ndarray

Predict class labels.

predict_proba(X: List[str]) -> np.ndarray

Return class probabilities.

score(X: List[str], y: List[Any]) -> float

Mean accuracy on test data.

export_model(path: str, model_name: str) -> None

Save pipeline as <path>/<model_name>.pkl.
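
Given the .pkl extension, this presumably follows the standard pickle serialization pattern. A generic sketch of that pattern under that assumption (not NLPkit's internals):

```python
import os
import pickle

def export_model(model, path: str, model_name: str) -> None:
    # Save any picklable object as <path>/<model_name>.pkl
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, model_name + ".pkl"), "wb") as fh:
        pickle.dump(model, fh)
```

The saved file can later be reloaded with pickle.load; as with any pickle, only unpickle files you trust.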

getCoefficients() -> np.ndarray

Return learned feature coefficients.

getIntercept() -> np.ndarray

Return model intercept.

getConfusionMatrix() -> np.ndarray

Compute confusion matrix on the last predictions.

getClassificationReport() -> str

Detailed precision, recall, F1 by class.

getPrecisionScore() -> float

Precision score (binary or macro-averaged).
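
The evaluation methods above wrap standard metrics. For reference, a minimal from-scratch computation of the binary-case precision, recall, and F1 from confusion-matrix cells (made-up labels, not NLPkit's API):

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    # Confusion-matrix cells for the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative"]
print(binary_metrics(y_true, y_pred))  # precision ~0.667, recall 1.0, F1 ~0.8
```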


SupportVectorClassifier API

All parameters in the first section are forwarded directly to sklearn.svm.SVC and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

DTC API

All parameters in the first section are forwarded directly to sklearn.tree.DecisionTreeClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NeigbhorClassifier API

All parameters in the first section are forwarded directly to sklearn.neighbors.KNeighborsClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

SGD API

All parameters in the first section are forwarded directly to sklearn.linear_model.SGDClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GBC API

All parameters in the first section are forwarded directly to sklearn.ensemble.GradientBoostingClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GaussianProcess API

All parameters in the first section are forwarded directly to sklearn.gaussian_process.GaussianProcessClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NMeans API

All parameters in the first section are forwarded directly to sklearn.cluster.KMeans and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

Model Methods

fit(X: array-like) -> None

Preprocess texts and fit KMeans on the feature matrix.

predict(X: array-like) -> np.ndarray

Preprocess and predict cluster assignments.

export_model(path: str, model_name: str) -> None

Save the fitted KMeans model as a .joblib file at <path>/<model_name>.joblib.

get_centroids() -> np.ndarray

Return cluster centroids.

get_labels() -> np.ndarray

Return labels assigned to each sample.

get_inertia() -> float

Return the final inertia (sum of squared distances to nearest centroid).

get_n_iterations() -> int

Return the number of iterations run.

get_n_features() -> int

Return the number of features seen during fit.

get_groups() -> Dict[str, List[str]]

Return a dict mapping each cluster label to the list of original inputs assigned to that cluster.
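
The mapping get_groups() is described to return can be built from the fitted labels in a few lines. A stdlib sketch of that grouping step (group_by_cluster is a hypothetical helper, and the texts and labels are made up):

```python
from collections import defaultdict

def group_by_cluster(texts, labels):
    """Map each cluster label to the original inputs assigned to it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[str(label)].append(text)
    return dict(groups)

texts = ["cheap flights", "hotel deals", "flight delays"]
labels = [0, 1, 0]
print(group_by_cluster(texts, labels))
# {'0': ['cheap flights', 'flight delays'], '1': ['hotel deals']}
```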

