Some Pipelines

slenps

slenps is a collection of simple NLP pipelines:

  • eclusters: embed documents and cluster the embeddings
  • llmclf: binary and multiclass classification using an LLM

Installation

PyPI

pip install slenps

Example usage

Embed documents and cluster their embeddings

Cluster a set of documents with the embedding and clustering models of your choice.

from slenps.eclusters import load_embedding_model, get_clustering_model_dict, load_clustering_model, cluster
import numpy as np

# load documents, one per line
with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file])

# embedding model
embedding_model = load_embedding_model(
    model_name='all-MiniLM-L6-v2', mode='huggingface',
)
# embedding_model = load_embedding_model(model_name='word2vec') # or 'tfidf', 'doc2vec'

# obtain embeddings
embeddings = embedding_model.encode(documents)

# clustering model
clustering_model = load_clustering_model('kmeans')
clustering_model = clustering_model.set_params(n_clusters=3)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings, clustering_model,
    metrics=['dbs', 'calinski'],
)
print(metrics)

# print sample result
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f'{document} -> label {label}')
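
To make the reported numbers easier to read, here is a minimal sketch of what a Davies-Bouldin score measures. It assumes that the 'dbs' metric name refers to the Davies-Bouldin index (the README does not spell this out), and it is a plain-Python illustration on toy 2-D points, not the code slenps runs. The function name `davies_bouldin` and the sample points are made up for this example.

```python
import math
from collections import defaultdict

def davies_bouldin(points, labels):
    """Davies-Bouldin index: for each cluster take the worst (largest) ratio of
    combined within-cluster scatter to centroid distance, then average."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # centroid of each cluster, computed axis by axis
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in clusters.items()
    }
    # scatter = mean Euclidean distance from members to their centroid
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)

# hypothetical toy data: two tight groups, 10 units apart
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin(points, [0, 0, 1, 1])  # groups match the geometry
bad = davies_bouldin(points, [0, 1, 0, 1])   # groups cut across the geometry
print(f"good split: {good:.2f}, bad split: {bad:.2f}")
# prints "good split: 0.20, bad split: 5.00"
```

Lower values indicate tighter, better-separated clusters, which is why a search that optimizes this metric would pick the smallest score.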

Find the best algorithm and number of clusters

from slenps.eclusters import find_best_algorithm
import pandas as pd

# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find the best algorithm and number of clusters, ranked by test_metric
results = find_best_algorithm(
    embeddings, model_names=model_names,
    test_metric='dbs', metrics=['dbs', 'silhouette'],
    min_cluster_num=2, max_cluster_num=10,
    result_filepath='sample_result_metrics.csv',
    print_topk=True,
)

# view all results
print(pd.DataFrame(results))

Supported models

Embedding models

| Embedding model | model_name | mode |
| --- | --- | --- |
| sklearn.feature_extraction.text.TfidfVectorizer | tfidf | None |
| gensim.models.Word2Vec | word2vec | None |
| gensim.models.Doc2Vec | doc2vec | None |
| sentence_transformers.SentenceTransformer | any (e.g. all-MiniLM-L6-v2) | huggingface |

Clustering models

| Clustering model | model_name | Default params |
| --- | --- | --- |
| sklearn.cluster.KMeans | kmeans | n_init='auto' |
| sklearn.cluster.AgglomerativeClustering | agglomerative_clustering | None |
| sklearn.cluster.SpectralClustering | spectral_clustering | None |
| sklearn.cluster.MeanShift | mean_shift | None |
| sklearn.cluster.AffinityPropagation | affinity_propagation | None |
| sklearn.cluster.Birch | birch | threshold=0.2 |

