A small package to do Sentence Clustering with BERT (SCBert)

Sentence Clustering with BERT (SCB)

Sentence Clustering with BERT is a project that aims to use state-of-the-art BERT models to compute vectors for sentences. A few tools are also implemented to explore those vectors and how sentences relate to each other in the latent space.

Demonstration

  • Create vectors from raw data:
# Transform raw French texts into vectors using a BERT model.
# `data` holds your raw texts and must be defined beforehand.
from SCBert.SCBert import Vectorizer

vectorizer = Vectorizer("flaubert")
text_vectors = vectorizer.vectorize(data)
  • Explore the embedded space:
# Explore the relations in your data.
from SCBert.SCBert import EmbeddingExplorer

ee = EmbeddingExplorer(data, text_vectors)
labels = ee.cluster(k=3)      # Cluster with k-means
ee.extract_keywords()         # Extract keywords with the RAKE algorithm; results are then available as ee.keywords
ee.explore(color=labels)      # Plot a PCA projection of the embedded vectors, colored by cluster label
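
To make the clustering step more concrete, here is a minimal, standard-library-only sketch of k-means on a handful of toy 2-D vectors. It is not SCBert's implementation, which this README does not show. The function name `kmeans`, the toy data, and the deterministic farthest-point initialisation are all assumptions made for illustration; library implementations typically use a randomised initialisation instead.

```python
import math


def kmeans(vectors, k, n_iter=20):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per vector.

    Illustrative sketch only, not the SCBert implementation.
    """
    # Deterministic farthest-point initialisation (an assumption of this
    # sketch): start from the first vector, then repeatedly pick the vector
    # that is farthest from every centroid chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for sentence vectors: three well-separated groups.
toy_vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0],
    [10.0, 0.0], [10.1, 0.1],
]
labels = kmeans(toy_vectors, k=3)
print(labels)
```

Real sentence vectors have hundreds of dimensions rather than two, but the assignment and update steps are the same. The resulting labels play the role of the `labels` passed to `ee.explore(color=labels)` above.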

Built-in example

There is a built-in example in the example folder. It comes with its own data, the CLS-fr dataset, which is composed of Amazon reviews from different sources (DVD, CD, books).

Installation

You can either download the zip file or install the PyPI package with the following command:

> pip install SCBert

If you encounter problems during the installation, it may be because of the multi-rake dependency and its cld2-cffi requirement. I will try to address this later on. To work around it, run the following commands:

> export CFLAGS="-Wno-narrowing"
> pip install cld2-cffi
> pip install multi-rake
