# slenps: Some Pipelines

## Project description

slenps is a collection of simple NLP pipelines:

- ecluster: cluster word embeddings
- llmclf: binary and multiclass classification using LLMs
## Installation

### PyPI

```
pip install slenps
```
## Example usage

### Embed documents and cluster their embeddings

Cluster some documents with your chosen embedding and clustering models.
```python
from slenps.eclusters import load_embedding_model, get_clustering_model_dict, load_clustering_model, cluster
import numpy as np

# load documents
with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file.readlines()])

# load an embedding model
embedding_model = load_embedding_model(
    model_name='all-MiniLM-L6-v2', mode='huggingface',
)
# embedding_model = load_embedding_model(model_name='word2vec')  # or 'tfidf', 'doc2vec'

# obtain embeddings
embeddings = embedding_model.encode(documents)

# load a clustering model and set its parameters
clustering_model = load_clustering_model('kmeans')
clustering_model = clustering_model.set_params(n_clusters=3)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings, clustering_model,
    metrics=['dbs', 'calinski'],
)
print(metrics)

# print sample results
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f'{document} -> label {label}')
```
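The metric names passed to `cluster` ('dbs', 'calinski', 'silhouette') are not defined in this README; assuming they correspond to scikit-learn's Davies-Bouldin, Calinski-Harabasz, and silhouette scores, the values can be reproduced directly from the embeddings and labels. This is a minimal sketch, not part of the slenps API:

```python
# Sketch: assumes 'dbs' and 'calinski' map to sklearn's Davies-Bouldin
# and Calinski-Harabasz scores computed on the same embeddings and labels.
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

dbs = davies_bouldin_score(embeddings, labels)          # lower is better
calinski = calinski_harabasz_score(embeddings, labels)  # higher is better
print(f'dbs={dbs:.3f}, calinski={calinski:.1f}')
```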
### Find the best algorithm and number of clusters
```python
from slenps.eclusters import find_best_algorithm
import pandas as pd

# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find the best algorithm and number of clusters using test_metric
results = find_best_algorithm(
    embeddings, model_names=model_names,
    test_metric='dbs', metrics=['dbs', 'silhouette'],
    min_cluster_num=2, max_cluster_num=10,
    result_filepath='sample_result_metrics.csv',
    print_topk=True,
)

# view all results
print(pd.DataFrame(results))
```
## Supported models

### Embedding models

| embedding model | model_name | mode |
|---|---|---|
| sklearn.feature_extraction.text.TfidfVectorizer | tfidf | None |
| gensim.models.Word2Vec | word2vec | None |
| gensim.models.Doc2Vec | doc2vec | None |
| sentence_transformers.SentenceTransformer | Any | huggingface |
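As in the example above, the backend is chosen through `model_name` and `mode`. The snippet below is a short sketch based on the documented calls; any behavior beyond those calls is an assumption:

```python
from slenps.eclusters import load_embedding_model

# tfidf, word2vec, and doc2vec are selected by model_name alone
tfidf_model = load_embedding_model(model_name='tfidf')
word2vec_model = load_embedding_model(model_name='word2vec')
doc2vec_model = load_embedding_model(model_name='doc2vec')

# any sentence-transformers checkpoint can be used with mode='huggingface'
st_model = load_embedding_model(model_name='all-MiniLM-L6-v2', mode='huggingface')
```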
### Clustering models

| clustering model | model_name | default params |
|---|---|---|
| sklearn.cluster.KMeans | kmeans | n_init='auto' |
| sklearn.cluster.AgglomerativeClustering | agglomerative_clustering | None |
| sklearn.cluster.SpectralClustering | spectral_clustering | None |
| sklearn.cluster.MeanShift | mean_shift | None |
| sklearn.cluster.AffinityPropagation | affinity_propagation | None |
| sklearn.cluster.Birch | birch | threshold=0.2 |
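Each of these loads by name and accepts scikit-learn parameters through `set_params`, as in the k-means example above. The listing call below assumes `get_clustering_model_dict` takes no arguments and is keyed by `model_name`; treat it as a sketch:

```python
from slenps.eclusters import get_clustering_model_dict, load_clustering_model

# list the default model names (assumption: the dict is keyed by model_name)
print(list(get_clustering_model_dict().keys()))

# load any supported model by name and override its parameters
birch = load_clustering_model('birch').set_params(threshold=0.5)
agglo = load_clustering_model('agglomerative_clustering').set_params(n_clusters=5)
```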
## Project details

### Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

#### Source Distribution

slenps-0.1.1.tar.gz (9.9 kB)

#### Built Distribution

slenps-0.1.1-py3-none-any.whl (10.3 kB)
### File details

Details for the file slenps-0.1.1.tar.gz.

#### File metadata

- Download URL: slenps-0.1.1.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7

#### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 25c43f0d699d3a8105e0a4888e14f30855df8bcdcb2f963699450fc3b1c1d7bd |
| MD5 | 816c7939254d901b8d2ecf3e1f582120 |
| BLAKE2b-256 | a552e99038885208989721c6cae695ba1bd819066d987c2404d15667e9708405 |
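To verify a downloaded archive against the published digest, a standard-library sketch (it assumes the sdist sits in the current directory):

```python
import hashlib

# published SHA256 for slenps-0.1.1.tar.gz (from the table above)
expected_sha256 = '25c43f0d699d3a8105e0a4888e14f30855df8bcdcb2f963699450fc3b1c1d7bd'

# hash the downloaded sdist and compare against the published digest
with open('slenps-0.1.1.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected_sha256, 'hash mismatch'
```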
### File details

Details for the file slenps-0.1.1-py3-none-any.whl.

#### File metadata

- Download URL: slenps-0.1.1-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.7

#### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7c06dfadb991141e44f05801fd3a0fcf147b5a426a9b903fc36233f0f5160a40 |
| MD5 | 1d297d5995eede16bfad03d71795ae5a |
| BLAKE2b-256 | d88cb0f498434a6ed366b740b86d8c75e8312f4ede6bf945e1d4036239ffeb53 |