slenps

slenps is a collection of simple NLP pipelines:

  • ecluster: cluster word embeddings
  • llmclf: binary and multiclass classification using LLMs

Installation

pip install slenps

Example usage

Embed documents and cluster their embeddings

Cluster documents with a chosen embedding model and clustering model:

from slenps.eclusters import load_embedding_model, load_clustering_model, cluster
import numpy as np

# load documents
with open('sample_documents.txt', 'r') as file:
    documents = np.array([line.strip() for line in file.readlines()])

# embedding model
embedding_model = load_embedding_model(
    model_name='all-MiniLM-L6-v2', mode='huggingface',
)
# embedding_model = load_embedding_model(model_name='word2vec') 

# obtain embeddings 
embeddings = embedding_model.encode(documents)

# clustering model
clustering_model = load_clustering_model('kmeans')
clustering_model = clustering_model.set_params(n_clusters=3)

# fit the model and retrieve labels and metrics
labels, metrics = cluster(
    embeddings, clustering_model,
    metrics=['dbs', 'calinski'],
)
print(metrics)

# print sample result
n_samples = 10
for document, label in zip(documents[:n_samples], labels[:n_samples]):
    print(f'{document} -> label {label}')

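The metric keys above presumably map to scikit-learn's Davies-Bouldin, Calinski-Harabasz, and silhouette scores. A minimal cross-check that recomputes them directly from the embeddings and labels (that mapping is an assumption, not part of the slenps API):

from sklearn.metrics import (
    davies_bouldin_score,
    calinski_harabasz_score,
    silhouette_score,
)

# recompute the scores directly from the embeddings and cluster labels
print('davies_bouldin   :', davies_bouldin_score(embeddings, labels))    # lower is better
print('calinski_harabasz:', calinski_harabasz_score(embeddings, labels)) # higher is better
print('silhouette       :', silhouette_score(embeddings, labels))        # higher is better
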
Find the best algorithm and number of clusters

from slenps.eclusters import find_best_algorithm
import pandas as pd
# define a list of clustering models to evaluate
# all default models are included in get_clustering_model_dict
model_names = ['kmeans', 'agglomerative_clustering', 'spectral_clustering']

# find best algo and num_cluster using test_metric
results = find_best_algorithm(
    embeddings, model_names=model_names,
    test_metric='dbs', metrics=['dbs', 'silhouette'],
    min_cluster_num=2, max_cluster_num=10,
    result_filepath='sample_result_metrics.csv',
    print_topk=True,
)

# view all results
print(pd.DataFrame(results))

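To pick a configuration out of the returned results, sort by the test metric. A minimal sketch, assuming each result row carries the model name, the cluster count, and a 'dbs' column (the column names are an assumption; inspect the DataFrame first):

df = pd.DataFrame(results)
print(df.columns.tolist())  # check the actual column names first

# Davies-Bouldin score: lower is better, so take the smallest value
# ('dbs' as a column name is an assumption based on the metrics argument above)
best = df.sort_values('dbs').iloc[0]
print(best)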

Supported models

Embedding models

  embedding model                                   model_name   mode
  sklearn.feature_extraction.text.TfidfVectorizer   tfidf        None
  gensim.models.Word2Vec                            word2vec     None
  gensim.models.Doc2Vec                             doc2vec      None
  sentence_transformers.SentenceTransformer         Any          huggingface

Clustering models

  clustering model                          model_name                 default params
  sklearn.cluster.KMeans                    kmeans                     n_init='auto'
  sklearn.cluster.AgglomerativeClustering   agglomerative_clustering   None
  sklearn.cluster.SpectralClustering        spectral_clustering        None
  sklearn.cluster.MeanShift                 mean_shift                 None
  sklearn.cluster.AffinityPropagation       affinity_propagation       None
  sklearn.cluster.Birch                     birch                      threshold=0.2

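Any of the embedding backends above can be selected through load_embedding_model. A minimal sketch, assuming every wrapper exposes the same encode interface used in the clustering example (the Hugging Face checkpoint name here is only an illustration):

from slenps.eclusters import load_embedding_model

# TF-IDF vectors (no mode needed, per the table)
tfidf_model = load_embedding_model(model_name='tfidf')

# gensim word2vec / doc2vec vectors
w2v_model = load_embedding_model(model_name='word2vec')
d2v_model = load_embedding_model(model_name='doc2vec')

# any sentence-transformers checkpoint via mode='huggingface'
# (the checkpoint name below is an arbitrary example)
st_model = load_embedding_model(
    model_name='paraphrase-MiniLM-L3-v2', mode='huggingface',
)

embeddings = tfidf_model.encode(documents)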
