Interactive tree-maps for text corpora with SBERT & Hierarchical Clustering (HAC)

PictureText

PictureText converts a list of short documents into an interactive tree map with minimal code. By default it uses SBERT to embed the text, Hierarchical Agglomerative Clustering (HAC) to group it, and a tree map to visualize the groups interactively.

Given a corpus of short documents (think news headlines), it arranges them into a hierarchy of groups whose members belong together semantically. The reader can explore each group in more detail by going deeper into the hierarchy and can back out again when needed.

The approach is intended for grouping large sets of short texts that are not domain-specific. Good candidates include news headlines, natural-language questions, and social media posts.

Getting started

Installation

conda create --name pt python=3.6
conda install -n pt nb_conda_kernels
conda activate pt
pip install picture_text

A simple example

Consider the default values and their result

txt=['The cat sits outside',
     'A man is playing guitar',
     'I love pasta',
     'The new movie is awesome',
     'The cat plays in the garden',
     'A woman watches TV',
     'The new movie is so great',
     'Do you like pizza?',
     'Burgers are my favorite',
     'I like chips',
     'I will have french fries with my burger'
     ]
from picture_text.picture_text import PictureText

# initializing just sets the text corpus
pt = PictureText(txt) 

# Calling the instance does the heavy lifting:
# 1. text embedding
# 2. HAC on the embeddings
pt() 

# This step puts it all together:
# 1. converts the HAC result into a treemap format,
# 2. determines a summary for each cluster, and
# 3. draws and returns a treemap
pt.make_picture() 

Demo

Check out the Colab notebook for interactive examples.

Outline of approach

  • Perform any required preprocessing to get to a list of document strings
  • Embed / encode all documents with the method of choice; by default I use SBERT
  • Use HAC to get a “linkage” table of hierarchical assignments of each point to the rest of the data. Here I use fastcluster with ward linkage by default.
  • Convert to layers for the treemap. Iteratively create “layers” by selecting a set number of splits for each layer
  • Summarize. Generate a summary for each cluster in a layer. In the default setting, I use the point closest to the average of the cluster. Representing a cluster by the average of its members (its centroid) is common in few-shot and unsupervised settings
  • Draw treemap. Use plotly's treemap for interactive visualization
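
To make the HAC step in the outline above more concrete, here is a minimal sketch of Ward-style agglomerative clustering in plain Python. It is not the fastcluster implementation that PictureText uses and it is far too slow for real corpora; the function names (centroid, ward_cost, naive_ward_hac) and the toy embeddings are made up for illustration. Cluster ids follow the usual linkage convention: the original points are 0..n-1 and every merge creates the next free id.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def naive_ward_hac(embeddings):
    """Return the merge history as (id_a, id_b, new_id) tuples."""
    clusters = {i: [vec] for i, vec in enumerate(embeddings)}
    next_id = len(embeddings)
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose merge is cheapest under Ward's criterion
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges


# Hypothetical 2-D "embeddings": two tight pairs that are far apart
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
merges = naive_ward_hac(toy_embeddings)
print(merges)
```

The merge history is the raw material for the later steps: cutting it at different heights yields the coarser and finer groupings that become treemap layers.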

Customization

Consider the default values and their result

from picture_text.picture_text import PictureText
pt = PictureText(txt)
pt(hac_method='ward', hac_metric='euclidean')
pt.make_picture(layer_depth=6,
                layer_min_size=0.1,
                layer_max_extension=1,
                treemap_average_score=None,
                treemap_maxdepth=3)

Selecting Layer Settings

The layer_depth parameter sets the number of layers produced by the split.

pt.make_picture(layer_depth = 1)

The layer_min_size parameter sets the minimum acceptable relative size of a new cluster in each layer. The default is 0.1 (10%): if a layer contains a cluster smaller than 10%, we try to add another cluster to the layer, hoping that the next one will be bigger. We keep doing so until the relative number of additional clusters reaches layer_max_extension, which defaults to 1 (100%). Increasing both values significantly means that we get a lot more clusters a lot earlier.

pt.make_picture(layer_depth=1,
                layer_min_size=0.9,
                layer_max_extension=3)

Selecting Clustering Settings

The defaults are the following

pt = PictureText(txt)
pt(hac_method='ward', hac_metric='euclidean')

Both arguments are passed directly to fastcluster, so every option listed in the fastcluster documentation is available here too.

BYO-NLP

The key ingredients of this approach are the embeddings and the method of multi-document summarization. You can use your NLP tools of choice for both.

Text embeddings

The default text embeddings come from SBERT's distilbert-base-nli-stsb-mean-tokens model.

from picture_text.picture_text import PictureText, sbert_encoder
pt = PictureText(txt)
pt(encoder=sbert_encoder)

However, any function that maps a list of texts to a list of embeddings can be used instead.

def silly_encoder(text_list):
    # Every document gets the same one-dimensional embedding
    return [[1]]*len(text_list)

pt(encoder=silly_encoder)
pt.make_picture()
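
As a slightly less silly illustration, the sketch below builds a hashed bag-of-words encoder using only the standard library. The name hashed_bow_encoder and the dim parameter are hypothetical; the only thing taken from the text above is the contract that an encoder maps a list of texts to a list of vectors. It captures word overlap rather than meaning, so it is no substitute for SBERT.

```python
import hashlib


def hashed_bow_encoder(text_list, dim=32):
    """Map each text to a fixed-length vector of hashed token counts."""
    vectors = []
    for text in text_list:
        vec = [0.0] * dim
        for token in text.lower().split():
            # md5 gives a hash that is stable across Python runs,
            # unlike the built-in hash() for strings
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            vec[bucket] += 1.0
        vectors.append(vec)
    return vectors


vectors = hashed_bow_encoder(["The cat sits outside", "I love pasta"])
print(len(vectors), len(vectors[0]))
```

Assuming the encoder is called with the text list alone, as silly_encoder is, this could be passed as pt(encoder=hashed_bow_encoder).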

Summarizer

The default summary method is to take the cluster member closest to the cluster average. However, any function that maps a list of texts and their embeddings to a text summary can be used instead.

def silly_summarizer(txt, txt_embeddings):
    # Ignores its inputs and labels every cluster the same way
    return "All the same to me", 0

pt.make_picture(summarizer=silly_summarizer)
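
For comparison, the closest-to-average idea described as the default can be sketched in plain Python as follows. This is not the library's own code: closest_to_mean_summarizer is a hypothetical name, and returning the squared distance as the second element is an assumption made only to mirror the (text, number) pair that silly_summarizer returns.

```python
def closest_to_mean_summarizer(txt, txt_embeddings):
    """Return the text whose embedding is nearest to the cluster mean."""
    n = len(txt_embeddings)
    mean = [sum(col) / n for col in zip(*txt_embeddings)]

    def sq_dist(vec):
        # Squared Euclidean distance to the cluster mean
        return sum((x - m) ** 2 for x, m in zip(vec, mean))

    best = min(range(n), key=lambda i: sq_dist(txt_embeddings[i]))
    # Assumption: the second element is a numeric score for the summary
    return txt[best], sq_dist(txt_embeddings[best])


# Hypothetical cluster of three documents with 2-D embeddings
summary, score = closest_to_mean_summarizer(
    ["a", "b", "c"],
    [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
)
print(summary, score)
```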
