Some Pipelines

Project description

slenps

slenps is a collection of simple NLP pipelines:

  • eclusters: embed documents and cluster the resulting embeddings
  • llmclf: binary and multiclass classification using an LLM

Installation

PyPI

pip install slenps

Example usage

Embed documents and cluster their embeddings

Cluster a set of documents with an embedding model and a clustering model of your choice.

from slenps.eclusters import load_embedding_model, get_clustering_model_dict, load_clustering_model, cluster
import numpy as np 

## Obtain documents and embeddings
# sample_documents.txt is expected to hold one document per line
with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file])

# get embedding model
# alternative: a Sentence Transformers model, loaded with mode='huggingface'
# embedding_model = load_embedding_model(
#     model_name='all-MiniLM-L6-v2', mode='huggingface'
# )
embedding_model = load_embedding_model(model_name="Word2VecEM")

# embed documents
embeddings = embedding_model.encode(documents)
print(f"Embedding shape: {embeddings.shape}\nDocuments shape: {documents.shape}")


## Clustering

# Select a clustering model and number of clusters
model_name = "kmeans"
num_cluster = 3

# create a clustering model and show its parameters
clustering_model = load_clustering_model(model_name).set_params(n_clusters=num_cluster)
print(clustering_model)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings,
    clustering_model,
    metrics=["dbs", "silhouette", "calinski"],
    return_model=False,
)
print(f"Clustering metrics: {metrics}")

# print sample result
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f"{document} --> Label {label}")
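
The metric names passed to cluster above ("dbs", "silhouette", "calinski") look like abbreviations of the Davies-Bouldin, silhouette, and Calinski-Harabasz scores. That mapping is an assumption, not something this page confirms. The sketch below computes the three scores directly with scikit-learn on synthetic data, which may help when reading the returned metrics: a lower Davies-Bouldin score indicates better separation, while higher silhouette and Calinski-Harabasz scores do.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for document embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
embeddings = np.vstack(
    [center + rng.normal(scale=0.5, size=(30, 2)) for center in centers]
)

# n_init is set explicitly so the sketch also runs on older scikit-learn versions.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

# Keys mirror the metric names used with slenps; the mapping to these
# scikit-learn functions is an assumption for illustration only.
metrics = {
    "dbs": davies_bouldin_score(embeddings, labels),  # lower is better
    "silhouette": silhouette_score(embeddings, labels),  # higher is better, range [-1, 1]
    "calinski": calinski_harabasz_score(embeddings, labels),  # higher is better
}
print(metrics)
```
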

Find the best algorithm and num_cluster

from slenps.eclusters import find_best_algorithm
import pandas as pd

# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find best algo and num_cluster using test_metric
results = find_best_algorithm(
    embeddings,
    model_names=model_names,
    metrics=["dbs", "silhouette"],
    test_metric="dbs",
    min_cluster_num=2,
    max_cluster_num=10,
    result_filepath="sample_result_metric.csv",
    print_topk=True,
)

# view all results
pd.DataFrame(results)
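
To make the idea behind the search above concrete, here is a minimal sketch of such a model selection loop written with scikit-learn only. It is not the slenps implementation: the `model_factories` dict, the result keys, and the choice to pick the lowest Davies-Bouldin score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for document embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
embeddings = np.vstack(
    [center + rng.normal(scale=0.5, size=(30, 2)) for center in centers]
)

# Hypothetical factories keyed by model name; each builds a model for k clusters.
model_factories = {
    "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "agglomerative_clustering": lambda k: AgglomerativeClustering(n_clusters=k),
}

# Try every (model, number of clusters) pair and record its score.
results = []
for name, make_model in model_factories.items():
    for k in range(2, 7):
        labels = make_model(k).fit_predict(embeddings)
        results.append(
            {
                "model_name": name,
                "num_cluster": k,
                "dbs": davies_bouldin_score(embeddings, labels),
            }
        )

# A lower Davies-Bouldin score means better-separated clusters.
best = min(results, key=lambda row: row["dbs"])
print(best)
```

Metrics where higher is better, such as the silhouette score, would need `max` instead of `min` in the last step.
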

Supported models

Embedding models

| Embedding model | model_name | mode |
| --- | --- | --- |
| sklearn.feature_extraction.text.TfidfVectorizer | TfidfEM | None |
| gensim.models.Word2Vec | Word2VecEM | None |
| gensim.models.Doc2Vec | Doc2VecEM | None |
| sentence_transformers.SentenceTransformer | Any | huggingface |
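
As an illustration of what an embedding model produces, the sketch below embeds a few documents with scikit-learn's TfidfVectorizer directly. How TfidfEM wraps this class inside slenps is not shown on this page, so the snippet is only an assumption about the general shape of the output: one row per document and one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, one string per document.
documents = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer()
# fit_transform returns a sparse matrix; convert it to a dense array for clustering.
embeddings = vectorizer.fit_transform(documents).toarray()

print(f"Embedding shape: {embeddings.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
```

With the default settings each row is L2-normalised, so every document vector has unit length.
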

Clustering models

| Clustering model | model_name | Default params |
| --- | --- | --- |
| sklearn.cluster.KMeans | kmeans | n_init='auto' |
| sklearn.cluster.AgglomerativeClustering | agglomerative_clustering | None |
| sklearn.cluster.SpectralClustering | spectral_clustering | None |
| sklearn.cluster.MeanShift | mean_shift | None |
| sklearn.cluster.AffinityPropagation | affinity_propagation | None |
| sklearn.cluster.Birch | birch | threshold=0.2 |


Download files

Source Distribution

slenps-0.1.3.tar.gz (11.3 kB view details)

Uploaded Source

Built Distribution

slenps-0.1.3-py3-none-any.whl (11.6 kB view details)

Uploaded Python 3

File details

Details for the file slenps-0.1.3.tar.gz.

File metadata

  • Download URL: slenps-0.1.3.tar.gz
  • Upload date:
  • Size: 11.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for slenps-0.1.3.tar.gz
Algorithm Hash digest
SHA256 8889f4f1882174ad099edb5a041235aa79a63ebe28abb643cd068fef5d232f5a
MD5 1563bd3dfc68640a648605b495884e0e
BLAKE2b-256 63a8dbadef327806ffcee2d97a87fdaff06baf467b49e5643c64769626b00507

File details

Details for the file slenps-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: slenps-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 11.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for slenps-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 01b762cb69d556f15f6d272d434f479f0e5518dca412e7a225547d922b55d832
MD5 fffac43e27791d98d3d0de72aace8fab
BLAKE2b-256 928f0e3644a46796c45699f14d40eaae2effb291cf4894077abd748287672e88
