
An easy-to-use library with advanced preprocessing features to streamline and accelerate machine learning workflows.

Project description

NLPkit


A toolkit that wraps popular scikit-learn models with NLP-aware preprocessing and useful utilities. NLPkit simplifies text classification and clustering by offering:

  • Tokenization
  • Stop-word removal
  • Contraction expansion
  • Punctuation stripping
  • Rare word filtering
  • Vectorization
  • Automated model management and evaluation

Table of Contents

  1. Installation
  2. Quick Start
  3. Features
  4. LogisticClassifier API
  5. SupportVectorClassifier API
  6. DTC API
  7. NeigbhorClassifier API
  8. SGD API
  9. GBC API
  10. GaussianProcess API
  11. NMeans API

Installation

Install via pip:

pip install nlpbasekit

Quick Start

from nlpkit import LogisticClassifier

# Sample data
docs = [
    "The food was absolutely fantastic!",
    "I can't stand the traffic here.",
    "What a beautiful experience that was.",
    "This product broke after one use.",
    "Totally worth the price!",
    "I wouldn't recommend this to anyone.",
    "Had a great time with my friends.",
    "The customer service was disappointing.",
    "Everything about this place was perfect.",
    "It was a complete waste of money."
]

labels = [
    "positive","negative","positive","negative","positive",
    "negative","positive","negative","positive","negative"]
# Initialize and train the classifier
clf = LogisticClassifier(
    embedding='tfidf',       # 'count' or 'tfidf'
    n=1,                     # 1 = unigram
    stop='english',          # language for stop words
    punc=True,               # remove punctuation
    normalization_method='lemmatization',  # or 'stemming'
    rare_words=0.01          # drop words in bottom 1%
)
clf.fit(docs, labels)

# Predict and evaluate
print(clf.predict(["What a wonderful experience!"]))
print(clf.getClassificationReport())

Features

  • Text Preprocessing
    • Tokenization
    • Stop-word removal
    • Contraction expansion (e.g., "don't" → "do not")
    • Punctuation removal
    • Rare word filtering by proportion or count
    • Normalization: stemming or lemmatization
  • Vectorization
    • Bag-of-words (CountVectorizer)
    • TF-IDF (TfidfVectorizer)
  • Model Management
    • fit, predict, predict_proba, score
    • Export trained pipeline to file
    • Inspect coefficients and intercept
  • Evaluation
    • Confusion matrix
    • Classification report
    • Precision, recall, and F1 scores
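
The evaluation helpers above reduce to standard confusion-matrix arithmetic. As a point of reference (a stdlib-only sketch with hypothetical label lists, not NLPkit code), this is how precision, recall, and F1 are derived for one positive class in a binary task:

```python
from collections import Counter

def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute precision, recall, and F1 for one positive class."""
    # Count the four confusion-matrix cells in one pass.
    cells = Counter(
        (t == positive, p == positive) for t, p in zip(y_true, y_pred)
    )
    tp = cells[(True, True)]    # predicted positive, actually positive
    fp = cells[(False, True)]   # predicted positive, actually negative
    fn = cells[(True, False)]   # predicted negative, actually positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative"]
print(binary_metrics(y_true, y_pred))  # precision 2/3, recall 1.0, F1 0.8
```

NLPkit's getPrecisionScore and getClassificationReport expose the same quantities (macro-averaged across classes in the multiclass case).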

LogisticClassifier API

Initialization Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| penalty | str | 'l2' | Regularization penalty (see LogisticRegression). |
| dual | bool | False | Dual or primal formulation. |
| tol | float | 1e-4 | Tolerance for stopping criteria. |
| C | float | 1.0 | Inverse of regularization strength. |

All other sklearn.linear_model.LogisticRegression arguments (e.g., solver, max_iter) are supported as well.

NLP-specific parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| embedding | str | 'count' | 'count' or 'tfidf'. |
| n | int | 1 | N-gram size (1 = unigram, 2 = bigram, etc.). |
| stop | str | 'english' | Stop-word language for NLTK. |
| punc | bool | True | Remove punctuation if True. |
| extraction | bool | True | Expand contractions if True. |
| rare_words | float or int | 0 | Proportion or count threshold below which rare words are dropped. |
| normalization_method | str or None | None | 'stemming', 'lemmatization', or None. |
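
To make the n and rare_words parameters concrete, here is a stdlib-only illustration of the two ideas (a sketch of the general techniques, not NLPkit's internal code; the function names are hypothetical):

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def drop_rare(tokens, min_count=2):
    """Drop tokens appearing fewer than min_count times (count threshold)."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

toks = ["the", "food", "was", "great", "the", "service", "was", "great"]
print(ngrams(toks, 2))      # ['the food', 'food was', 'was great', ...]
print(drop_rare(toks, 2))   # 'food' and 'service' occur once and are dropped
```

With a float rare_words value, the threshold is a proportion of total token occurrences rather than an absolute count; the filtering logic is otherwise the same.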

Model Methods

fit(X: List[str], y: List[Any]) -> None

Train on raw text and labels.

predict(X: List[str]) -> np.ndarray

Predict class labels.

predict_proba(X: List[str]) -> np.ndarray

Return class probabilities.

score(X: List[str], y: List[Any]) -> float

Mean accuracy on test data.

export_model(path: str, model_name: str) -> None

Save pipeline as <path>/<model_name>.pkl.
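
Since the exported file is a standard pickle, reloading it is an ordinary pickle round-trip. A minimal sketch with a stand-in object (the export_model below mimics the documented behaviour; it is not NLPkit's own implementation, and the dict stands in for a fitted pipeline):

```python
import os
import pickle
import tempfile

def export_model(model, path, model_name):
    """Save model as <path>/<model_name>.pkl and return the file path."""
    target = os.path.join(path, f"{model_name}.pkl")
    with open(target, "wb") as fh:
        pickle.dump(model, fh)
    return target

# Round-trip a stand-in object the way a fitted pipeline would be saved.
with tempfile.TemporaryDirectory() as tmp:
    saved = export_model({"coef": [0.3, -1.2]}, tmp, "sentiment_clf")
    with open(saved, "rb") as fh:
        restored = pickle.load(fh)
    print(restored)  # {'coef': [0.3, -1.2]}
```

As with any pickle, only load model files from sources you trust.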

getCoefficients() -> np.ndarray

Return learned feature coefficients.

getIntercept() -> np.ndarray

Return model intercept.

getConfusionMatrix() -> np.ndarray

Compute confusion matrix on the last predictions.

getClassificationReport() -> str

Detailed precision, recall, F1 by class.

getPrecisionScore() -> float

Precision score (binary or macro-averaged).


SupportVectorClassifier API

All parameters in the first section are forwarded directly to sklearn.svm.SVC and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

DTC API

All parameters in the first section are forwarded directly to sklearn.tree.DecisionTreeClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NeigbhorClassifier API

All parameters in the first section are forwarded directly to sklearn.neighbors.KNeighborsClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

SGD API

All parameters in the first section are forwarded directly to sklearn.linear_model.SGDClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GBC API

All parameters in the first section are forwarded directly to sklearn.ensemble.GradientBoostingClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GaussianProcess API

All parameters in the first section are forwarded directly to sklearn.gaussian_process.GaussianProcessClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NMeans API

All parameters in the first section are forwarded directly to sklearn.cluster.KMeans and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

Model Methods

fit(X: array-like) -> None

Preprocess texts and fit KMeans on the feature matrix.

predict(X: array-like) -> np.ndarray

Preprocess and predict cluster assignments.

export_model(path: str, model_name: str) -> None

Save the fitted KMeans model as a .joblib file at <path>/<model_name>.joblib.

get_centroids() -> np.ndarray

Return cluster centroids.

get_labels() -> np.ndarray

Return labels assigned to each sample.

get_inertia() -> float

Return the final inertia (sum of squared distances to nearest centroid).

get_n_iterations() -> int

Return the number of iterations run.

get_n_features() -> int

Return the number of features seen during fit.

get_groups() -> Dict[str, List[str]]

Return a dict mapping each cluster label to the list of original inputs assigned to that cluster.
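
The grouping that get_groups performs is essentially a label-to-documents inversion. A stdlib-only sketch of the same idea (hypothetical inputs and function name, not NLPkit internals):

```python
from collections import defaultdict

def group_by_cluster(docs, labels):
    """Map each cluster label (as a string) to the docs assigned to it."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[str(label)].append(doc)
    return dict(groups)

docs = ["cheap flights", "hotel deals", "flight delays", "hotel booking"]
labels = [0, 1, 0, 1]
print(group_by_cluster(docs, labels))
# {'0': ['cheap flights', 'flight delays'], '1': ['hotel deals', 'hotel booking']}
```

This is often the most convenient view of a fitted clustering: you can skim each group's documents to decide what the cluster represents.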

