# NLPkit

An easy-to-use library with advanced preprocessing features to streamline and accelerate machine learning workflows. NLPkit wraps popular scikit-learn models with NLP-aware preprocessing and useful utilities, simplifying text classification and clustering by offering:
- Tokenization
- Stop-word removal
- Contraction expansion
- Punctuation stripping
- Rare word filtering
- Vectorization
- Automated model management & evaluation
## Table of Contents
- Installation
- Quick Start
- Features
- LogisticClassifier API
- SupportVectorClassifier API
- DTC API
- NeigbhorClassifier API
- SGD API
- GBC API
- GaussianProcess API
- NMeans API
## Installation

Install via pip (the distribution is published as `nlpbasekit`; the import name is `nlpkit`):

```shell
pip install nlpbasekit
```
## Quick Start

```python
from nlpkit import LogisticClassifier

# Sample data
docs = [
    "The food was absolutely fantastic!",
    "I can't stand the traffic here.",
    "What a beautiful experience that was.",
    "This product broke after one use.",
    "Totally worth the price!",
    "I wouldn't recommend this to anyone.",
    "Had a great time with my friends.",
    "The customer service was disappointing.",
    "Everything about this place was perfect.",
    "It was a complete waste of money."
]
labels = [
    "positive", "negative", "positive", "negative", "positive",
    "negative", "positive", "negative", "positive", "negative"
]

# Initialize and train the classifier
clf = LogisticClassifier(
    embedding='tfidf',                     # 'count' or 'tfidf'
    n=1,                                   # 1 = unigram
    stop='english',                        # language for stop words
    punc=True,                             # remove punctuation
    normalization_method='lemmatization',  # or 'stemming'
    rare_words=0.01                        # drop words in bottom 1%
)
clf.fit(docs, labels)

# Predict and evaluate
print(clf.predict(["What a wonderful experience?"]))
print(clf.getClassificationReport())
```
## Features

- Text Preprocessing
  - Tokenization
  - Stop-word removal
  - Contraction expansion (e.g., "don't" → "do not")
  - Punctuation removal
  - Rare word filtering by proportion or count
  - Normalization: stemming or lemmatization
- Vectorization
  - Bag-of-words (`CountVectorizer`)
  - TF-IDF (`TfidfVectorizer`)
- Model Management
  - `fit`, `predict`, `predict_proba`, `score`
  - Export trained pipeline to file
  - Inspect coefficients and intercept
- Evaluation
  - Confusion matrix
  - Classification report
  - Precision, recall, and F1 scores
## LogisticClassifier API

### Initialization Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `penalty` | `str` | `'l2'` | Regularization penalty (see `LogisticRegression`). |
| `dual` | `bool` | `False` | Dual or primal formulation. |
| `tol` | `float` | `1e-4` | Tolerance for stopping criteria. |
| `C` | `float` | `1.0` | Inverse of regularization strength. |

All other `sklearn.linear_model.LogisticRegression` arguments (e.g., `solver`, `max_iter`) are also supported and forwarded unchanged.

NLP-specific parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `embedding` | `str` | `'count'` | `'count'` or `'tfidf'`. |
| `n` | `int` | `1` | N-gram size (1 = unigram, 2 = bigram, etc.). |
| `stop` | `str` | `'english'` | Stop-word language for NLTK. |
| `punc` | `bool` | `True` | Remove punctuation if `True`. |
| `extraction` | `bool` | `True` | Expand contractions if `True`. |
| `rare_words` | `float or int` | `0` | Proportion (float) or count (int) threshold for dropping rare words. |
| `normalization_method` | `str or None` | `None` | `'stemming'`, `'lemmatization'`, or `None`. |
### Model Methods

- `fit(X: List[str], y: List[Any]) -> None`: train on raw text and labels.
- `predict(X: List[str]) -> np.ndarray`: predict class labels.
- `predict_proba(X: List[str]) -> np.ndarray`: return class probabilities.
- `score(X: List[str], y: List[Any]) -> float`: mean accuracy on test data.
- `export_model(path: str, model_name: str) -> None`: save pipeline as `<path>/<model_name>.pkl`.
- `getCoefficients() -> np.ndarray`: return learned feature coefficients.
- `getIntercept() -> np.ndarray`: return model intercept.
- `getConfusionMatrix() -> np.ndarray`: compute confusion matrix on the last predictions.
- `getClassificationReport() -> str`: detailed precision, recall, and F1 by class.
- `getPrecisionScore() -> float`: precision score (binary or macro-averaged).
## SupportVectorClassifier API

All parameters in the first section are forwarded directly to `sklearn.svm.SVC` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## DTC API

All parameters in the first section are forwarded directly to `sklearn.tree.DecisionTreeClassifier` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## NeigbhorClassifier API

All parameters in the first section are forwarded directly to `sklearn.neighbors.KNeighborsClassifier` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## SGD API

All parameters in the first section are forwarded directly to `sklearn.linear_model.SGDClassifier` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## GBC API

All parameters in the first section are forwarded directly to `sklearn.ensemble.GradientBoostingClassifier` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## GaussianProcess API

All parameters in the first section are forwarded directly to `sklearn.gaussian_process.GaussianProcessClassifier` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.

After fitting, you can call the following methods:

- `predict(X: List[str]) -> np.ndarray`
- `predict_proba(X: List[str]) -> np.ndarray`
- `score(X: List[str], y: List[Any]) -> float`
- `getCoefficients() -> np.ndarray`
- `getIntercept() -> np.ndarray`
- `getConfusionMatrix() -> np.ndarray`
- `getClassificationReport() -> str`
- `getPrecisionScore() -> float`
- `export_model(path: str, model_name: str) -> None`
## NMeans API

All parameters in the first section are forwarded directly to `sklearn.cluster.KMeans` and behave exactly as in scikit-learn’s documentation.

NLP-specific parameters (same as in the LogisticClassifier API): `embedding`, `n`, `stop`, `punc`, `extraction`, `rare_words`, `normalization_method`.
### Model Methods

- `fit(X: array-like) -> None`: preprocess texts and fit KMeans on the feature matrix.
- `predict(X: array-like) -> np.ndarray`: preprocess and predict cluster assignments.
- `export_model(path: str, model_name: str) -> None`: save the fitted KMeans model as `<path>/<model_name>.joblib`.
- `get_centroids() -> np.ndarray`: return cluster centroids.
- `get_labels() -> np.ndarray`: return labels assigned to each sample.
- `get_inertia() -> float`: return the final inertia (sum of squared distances to the nearest centroid).
- `get_n_iterations() -> int`: return the number of iterations run.
- `get_n_features() -> int`: return the number of features seen during fit.
- `get_groups() -> Dict[str, List[str]]`: return a dict mapping each cluster label to the list of original inputs assigned to that cluster.