An easy-to-use library with advanced preprocessing features to streamline and accelerate machine learning workflows.

NLPkit

A toolkit that wraps popular scikit-learn models with NLP-aware preprocessing and useful utilities. NLPkit simplifies text classification and clustering by offering:

  • Tokenization
  • Stop-word removal
  • Contraction expansion
  • Punctuation stripping
  • Rare word filtering
  • Vectorization
  • Automated model management and evaluation

Installation

Install via pip:

pip install nlpkit

Quick Start

from nlpkit import LogisticClassifier

# Sample data
docs = [
    "The food was absolutely fantastic!",
    "I can't stand the traffic here.",
    "What a beautiful experience that was.",
    "This product broke after one use.",
    "Totally worth the price!",
    "I wouldn't recommend this to anyone.",
    "Had a great time with my friends.",
    "The customer service was disappointing.",
    "Everything about this place was perfect.",
    "It was a complete waste of money."
]

labels = [
    "positive", "negative", "positive", "negative", "positive",
    "negative", "positive", "negative", "positive", "negative",
]

# Initialize and train the classifier
clf = LogisticClassifier(
    embedding='tfidf',       # 'count' or 'tfidf'
    n=1,                     # 1 = unigram
    stop='english',          # language for stop words
    punc=True,               # remove punctuation
    normalization_method='lemmatization',  # or 'stemming'
    rare_words=0.01          # drop words in bottom 1%
)
clf.fit(docs, labels)

# Predict and evaluate
print(clf.predict(["What a wonderful experience!"]))
print(clf.getClassificationReport())

Features

  • Text Preprocessing
    • Tokenization
    • Stop-word removal
    • Contraction expansion (e.g., "don't" → "do not")
    • Punctuation removal
    • Rare word filtering by proportion or count
    • Normalization: stemming or lemmatization
  • Vectorization
    • Bag-of-words (CountVectorizer)
    • TF-IDF (TfidfVectorizer)
  • Model Management
    • fit, predict, predict_proba, score
    • Export trained pipeline to file
    • Inspect coefficients and intercept
  • Evaluation
    • Confusion matrix
    • Classification report
    • Precision, recall, and F1 scores
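
To illustrate what the preprocessing steps above do, here is a minimal, self-contained sketch using only the standard library. The contraction and stop-word tables are tiny stand-ins (NLPkit uses NLTK's stop-word lists); the `preprocess` function is a hypothetical illustration, not NLPkit's internal implementation:

```python
import string

# Tiny stand-in tables; the real library uses fuller resources (e.g. NLTK).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "wouldn't": "would not"}
STOP_WORDS = {"the", "a", "an", "was", "is", "to", "this"}

def preprocess(text: str) -> list[str]:
    text = text.lower()
    # Contraction expansion ("can't" -> "cannot") before punctuation removal,
    # so the apostrophe is still present when we look the contraction up.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Punctuation stripping
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenization and stop-word removal
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("I can't stand the traffic here."))
# ['i', 'cannot', 'stand', 'traffic', 'here']
```

Note the ordering: expanding contractions before stripping punctuation matters, because removing the apostrophe first would turn "can't" into "cant" and the lookup would miss it.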

LogisticClassifier API

Initialization Parameters

Parameter              Type           Default    Description
penalty                str            'l2'       Regularization penalty (see LogisticRegression).
dual                   bool           False      Dual or primal formulation.
tol                    float          1e-4       Tolerance for the stopping criteria.
C                      float          1.0        Inverse of regularization strength.

All other sklearn.linear_model.LogisticRegression arguments (e.g. solver, max_iter) are also supported.

NLP-specific:

Parameter              Type           Default    Description
embedding              str            'count'    'count' or 'tfidf'.
n                      int            1          N-gram size (1 = unigram, 2 = bigram, etc.).
stop                   str            'english'  Stop-word language for NLTK.
punc                   bool           True       Remove punctuation if True.
extraction             bool           True       Expand contractions if True.
rare_words             float or int   0          Proportion or count threshold for dropping rare words.
normalization_method   str or None    None       'stemming', 'lemmatization', or None.
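
The rare_words parameter accepts either form of threshold: a float is read as a proportion of the corpus, an int as a minimum occurrence count. A small stdlib sketch of that dual behavior (filter_rare is a hypothetical helper mirroring the parameter's description, not NLPkit code):

```python
from collections import Counter

def filter_rare(docs_tokens, rare_words):
    """Drop tokens whose corpus frequency falls below a threshold.
    A float rare_words is treated as a proportion of all tokens;
    an int is treated as a minimum count."""
    counts = Counter(tok for doc in docs_tokens for tok in doc)
    total = sum(counts.values())
    min_count = rare_words * total if isinstance(rare_words, float) else rare_words
    return [[t for t in doc if counts[t] >= min_count] for doc in docs_tokens]

docs = [["good", "food"], ["good", "service"], ["good", "food", "bad"]]
print(filter_rare(docs, 2))  # keep tokens appearing at least twice
# [['good', 'food'], ['good'], ['good', 'food']]
```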

Model Methods

fit(X: List[str], y: List[Any]) -> None

Train on raw text and labels.

predict(X: List[str]) -> np.ndarray

Predict class labels.

predict_proba(X: List[str]) -> np.ndarray

Return class probabilities.

score(X: List[str], y: List[Any]) -> float

Mean accuracy on test data.

export_model(path: str, model_name: str) -> None

Save pipeline as <path>/<model_name>.pkl.
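
Given the .pkl extension, this presumably follows the standard pickle serialization pattern. A generic sketch of that pattern under that assumption (not NLPkit's internals):

```python
import os
import pickle

def export_model(model, path: str, model_name: str) -> None:
    # Save any picklable object as <path>/<model_name>.pkl
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, model_name + ".pkl"), "wb") as fh:
        pickle.dump(model, fh)
```

The saved file can later be reloaded with pickle.load; as with any pickle, only unpickle files you trust.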

getCoefficients() -> np.ndarray

Return learned feature coefficients.

getIntercept() -> np.ndarray

Return model intercept.

getConfusionMatrix() -> np.ndarray

Compute confusion matrix on the last predictions.

getClassificationReport() -> str

Detailed precision, recall, F1 by class.

getPrecisionScore() -> float

Precision score (binary or macro-averaged).
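
The evaluation methods above wrap standard metrics. For reference, a minimal from-scratch computation of the binary-case precision, recall, and F1 from confusion-matrix cells (made-up labels, not NLPkit's API):

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    # Confusion-matrix cells for the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative"]
print(binary_metrics(y_true, y_pred))  # precision ~0.667, recall 1.0, F1 ~0.8
```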


SupportVectorClassifier API

All parameters in the first section are forwarded directly to sklearn.svm.SVC and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

DTC API

All parameters in the first section are forwarded directly to sklearn.tree.DecisionTreeClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NeigbhorClassifier API

All parameters in the first section are forwarded directly to sklearn.neighbors.KNeighborsClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

SGD API

All parameters in the first section are forwarded directly to sklearn.linear_model.SGDClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GBC API

All parameters in the first section are forwarded directly to sklearn.ensemble.GradientBoostingClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

GaussianProcess API

All parameters in the first section are forwarded directly to sklearn.gaussian_process.GaussianProcessClassifier and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

After fitting, you can call the following methods:

predict(X: List[str]) -> np.ndarray
predict_proba(X: List[str]) -> np.ndarray
score(X: List[str], y: List[Any]) -> float
getCoefficients() -> np.ndarray
getIntercept() -> np.ndarray
getConfusionMatrix() -> np.ndarray
getClassificationReport() -> str
getPrecisionScore() -> float
export_model(path: str, model_name: str) -> None

NMeans API

All parameters in the first section are forwarded directly to sklearn.cluster.KMeans and behave exactly as in scikit-learn’s documentation.

  • NLP-specific parameters (same as in the LogisticClassifier API):
    • embedding
    • n
    • stop
    • punc
    • extraction
    • rare_words
    • normalization_method

Model Methods

fit(X: array-like) -> None

Preprocess texts and fit KMeans on the feature matrix.

predict(X: array-like) -> np.ndarray

Preprocess and predict cluster assignments.

export_model(path: str, model_name: str) -> None

Save the fitted KMeans model as a .joblib file at <path>/<model_name>.joblib.

get_centroids() -> np.ndarray

Return cluster centroids.

get_labels() -> np.ndarray

Return labels assigned to each sample.

get_inertia() -> float

Return the final inertia (sum of squared distances to nearest centroid).

get_n_iterations() -> int

Return the number of iterations run.

get_n_features() -> int

Return the number of features seen during fit.

get_groups() -> Dict[str, List[str]]

Return a dict mapping each cluster label to the list of original inputs assigned to that cluster.
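
The mapping get_groups() is described to return can be built from the fitted labels in a few lines. A stdlib sketch of that grouping step (group_by_cluster is a hypothetical helper, and the texts and labels are made up):

```python
from collections import defaultdict

def group_by_cluster(texts, labels):
    """Map each cluster label to the original inputs assigned to it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[str(label)].append(text)
    return dict(groups)

texts = ["cheap flights", "hotel deals", "flight delays"]
labels = [0, 1, 0]
print(group_by_cluster(texts, labels))
# {'0': ['cheap flights', 'flight delays'], '1': ['hotel deals']}
```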

