Some Pipelines

slenps

slenps is a collection of simple NLP pipelines:

  • eclusters: embed documents and cluster the embeddings
  • llmclf: binary and multiclass classification using an LLM

Installation

PyPI

pip install slenps

Example usage

Embed documents and cluster their embeddings

Cluster a set of documents with a chosen embedding model and clustering model.

from slenps.eclusters import load_embedding_model, get_clustering_model_dict, load_clustering_model, cluster
import numpy as np 

# load documents, one per line
with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file])

# embedding model
embedding_model = load_embedding_model(
    model_name = 'all-MiniLM-L6-v2', mode = 'huggingface',
) 
# embedding_model = load_embedding_model(model_name='word2vec') # or 'tfidf', 'doc2vec'

# obtain embeddings 
embeddings = embedding_model.encode(documents)

# clustering model
clustering_model = load_clustering_model('kmeans')
clustering_model = clustering_model.set_params(n_clusters=3)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings, clustering_model, 
    metrics = ['dbs', 'calinski'],
)
print(metrics)

# print sample result
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f'{document} -> label {label}')
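
The metric name 'dbs' used above is not spelled out in this README. Assuming it refers to the Davies-Bouldin score (as computed by sklearn.metrics.davies_bouldin_score), lower values mean tighter, better separated clusters. The following is a minimal standard-library sketch of that score, not slenps code; the function name davies_bouldin and the toy points are illustrative only.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index for a list of coordinate tuples and their cluster labels."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # centroid of each cluster (coordinate-wise mean)
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in clusters.items()
    }
    # scatter of each cluster: mean distance from its members to its centroid
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # for each cluster, keep the largest (worst) similarity ratio to any other cluster
    worst = []
    for i in clusters:
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


# toy data: two tight groups far apart on the x axis
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin(points, [0, 0, 1, 1])  # labels follow the two groups
bad = davies_bouldin(points, [0, 1, 0, 1])   # each label spans both groups
print(good, bad)
```

The labelling that follows the two groups scores lower than the one that mixes them, which is why a search driven by this metric would pick the smallest value.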

Find the best algorithm and number of clusters

from slenps.eclusters import find_best_algorithm
import pandas as pd

# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find the best algorithm and number of clusters according to test_metric
results = find_best_algorithm(
    embeddings, model_names=model_names,
    test_metric='dbs', metrics=['dbs', 'silhouette'],
    min_cluster_num=2, max_cluster_num=10,
    result_filepath='sample_result_metrics.csv',
    print_topk=True,
)

# view all results
print(pd.DataFrame(results))
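
For readers who want to see the idea behind such a search, here is a rough sketch written directly against scikit-learn. It is not the implementation of find_best_algorithm: the candidates dictionary, the synthetic data, and the reading of 'dbs' as the Davies-Bouldin score (lower is better) are all assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# synthetic stand-in for real embeddings: three well separated blobs
embeddings, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5, random_state=0,
)

# hypothetical candidate factories, keyed by a model name
candidates = {
    'kmeans': lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    'agglomerative_clustering': lambda k: AgglomerativeClustering(n_clusters=k),
}

# score every (model, number of clusters) combination
rows = []
for name, make_model in candidates.items():
    for k in range(2, 7):
        labels = make_model(k).fit_predict(embeddings)
        rows.append({
            'model': name,
            'n_clusters': k,
            'dbs': davies_bouldin_score(embeddings, labels),
        })

# Davies-Bouldin: the smallest score marks the best combination
best = min(rows, key=lambda row: row['dbs'])
print(best)
```

A metric where higher is better, such as the silhouette score, would need max instead of min when choosing the winner.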

Supported models

Embedding models

embedding model                                  model_name  mode
sklearn.feature_extraction.text.TfidfVectorizer  tfidf       None
gensim.models.Word2Vec                           word2vec    None
gensim.models.Doc2Vec                            doc2vec     None
sentence_transformers.SentenceTransformer        Any         huggingface

Clustering models

clustering model                         model_name                default params
sklearn.cluster.KMeans                   kmeans                    n_init='auto'
sklearn.cluster.AgglomerativeClustering  agglomerative_clustering  None
sklearn.cluster.SpectralClustering       spectral_clustering       None
sklearn.cluster.MeanShift                mean_shift                None
sklearn.cluster.AffinityPropagation      affinity_propagation      None
sklearn.cluster.Birch                    birch                     threshold=0.2
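
Since the listed models are plain scikit-learn estimators, the default params in the table can be overridden or extended with set_params, as in the usage example. The sketch below imitates two rows of the table with a hypothetical registry dictionary; it only illustrates how defaults and overrides combine and does not show how slenps stores its models.

```python
from sklearn.cluster import Birch, KMeans

# hypothetical registry mirroring two rows of the table above
registry = {
    'kmeans': lambda: KMeans(n_init='auto'),
    'birch': lambda: Birch(threshold=0.2),
}

# build a model from its name, then override parameters;
# set_params returns the estimator itself, so the call can be chained
model = registry['kmeans']().set_params(n_clusters=3)
params = model.get_params()
print(params['n_clusters'], params['n_init'])

birch_params = registry['birch']().get_params()
print(birch_params['threshold'])
```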
