Some Pipelines

slenps

slenps is a collection of simple NLP pipelines:

  • ecluster: cluster word embeddings
  • llmclf: binary and multiclass classification using an LLM

Installation

pip install slenps

Example usage

Embed documents and cluster their embeddings

Cluster documents with a chosen embedding model and clustering model.

from slenps.eclusters import load_embedding_model, get_clustering_model_dict, load_clustering_model, cluster
import numpy as np 

# load documents

with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file.readlines()])

# embedding model
embedding_model = load_embedding_model(
    model_name = 'all-MiniLM-L6-v2', mode = 'huggingface',
)
# embedding_model = load_embedding_model(model_name='word2vec') 

# obtain embeddings 
embeddings = embedding_model.encode(documents)

# clustering model
clustering_model = load_clustering_model('kmeans')
clustering_model = clustering_model.set_params(n_clusters=3)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings, clustering_model, 
    metrics = ['dbs', 'calinski'],
)
print(metrics)

# print sample result
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f'{document} -> label {label}')
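
The metric names passed to cluster are not defined in this README. Assuming 'dbs' stands for the Davies-Bouldin score (an assumption, not confirmed by the source), the following standard-library sketch shows what such a score measures: the average, over clusters, of the worst ratio of within-cluster scatter to between-centroid distance. Lower values indicate tighter, better separated clusters. The function name davies_bouldin is hypothetical and is not part of slenps.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for points (sequences of floats) and cluster labels."""
    # group points by cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # centroid and mean distance to centroid (scatter) for each cluster
    centroids, scatters = {}, {}
    for label, members in clusters.items():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        centroids[label] = centroid
        scatters[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    # for each cluster, keep its largest similarity ratio to any other cluster
    # (this sketch does not guard against two clusters sharing a centroid)
    worst = []
    for i in clusters:
        worst.append(max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)


# toy data: two well separated pairs of 2-D points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
print(davies_bouldin(points, labels))
```

In practice the embeddings would be high-dimensional vectors rather than 2-D points, but the computation is the same.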

Find the best algorithm and number of clusters

from slenps.eclusters import find_best_algorithm
import pandas as pd
# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find the best algorithm and number of clusters according to test_metric
# (reuses `embeddings` from the previous example)
results = find_best_algorithm(
    embeddings, model_names=model_names,
    test_metric='dbs', metrics=['dbs', 'silhouette'],
    min_cluster_num=2, max_cluster_num=10,
    result_filepath='sample_result_metrics.csv',
    print_topk=True,
)

# view all results
print(pd.DataFrame(results))
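
The search above amounts to scoring every (algorithm, number of clusters) pair and keeping the one with the best test_metric. The sketch below illustrates only that selection step. The record layout, the best_config helper, and the numbers are hypothetical and purely illustrative; the actual structure returned by find_best_algorithm may differ. It also assumes that a lower 'dbs' value is better, which holds if 'dbs' is the Davies-Bouldin score.

```python
# Hypothetical records with made-up scores, not real results.
results = [
    {"model_name": "kmeans", "n_clusters": 2, "dbs": 1.10},
    {"model_name": "kmeans", "n_clusters": 3, "dbs": 0.85},
    {"model_name": "agglomerative_clustering", "n_clusters": 3, "dbs": 0.80},
    {"model_name": "spectral_clustering", "n_clusters": 4, "dbs": 0.95},
]


def best_config(records, test_metric, lower_is_better=True):
    """Return the record with the best value of test_metric."""
    pick = min if lower_is_better else max
    return pick(records, key=lambda record: record[test_metric])


best = best_config(results, test_metric="dbs")
print(best["model_name"], best["n_clusters"])
```

For a metric where higher is better, such as the silhouette score, the same helper would be called with lower_is_better=False.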

Supported models

| embedding model | model_name | mode |
| --- | --- | --- |
| sklearn.feature_extraction.text.TfidfVectorizer | tfidf | None |
| gensim.models.Word2Vec | word2vec | None |
| gensim.models.Doc2Vec | doc2vec | None |
| sentence_transformers.SentenceTransformer | Any | huggingface |

| clustering model | model_name | default params |
| --- | --- | --- |
| sklearn.cluster.KMeans | kmeans | n_init='auto' |
| sklearn.cluster.AgglomerativeClustering | agglomerative_clustering | None |
| sklearn.cluster.SpectralClustering | spectral_clustering | None |
| sklearn.cluster.MeanShift | mean_shift | None |
| sklearn.cluster.AffinityPropagation | affinity_propagation | None |
| sklearn.cluster.Birch | birch | threshold=0.2 |
