Tools to help uncover `whatlies` in word embeddings.

whatlies

A library that tries to help you to understand (note the pun).

"What lies in word embeddings?"

This small library offers tools that make it easier to visualise both word embeddings and operations on them.

Feedback is welcome.

Produced

This project was initiated at Rasa as a by-product of our efforts in the developer advocacy and research teams. It's an open source project and community contributions are very welcome!

Features

This library has tools to help you understand what lies in word embeddings. This includes:

  • simple tools to create (interactive) visualisations
  • an API for vector arithmetic that you can visualise
  • support for many dimensionality reduction techniques like PCA, UMAP and t-SNE
  • support for many language backends including spaCy, fasttext, tfhub, huggingface and bpemb
  • lightweight scikit-learn featurizer support for all these backends

Getting Started

For a quick overview, check out our introductory video on YouTube. More in-depth getting-started guides can be found on the documentation page.

Examples

The idea is that you can load embeddings from a language backend and apply mathematical operations to them.

from whatlies import EmbeddingSet
from whatlies.language import SpacyLanguage

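# Load a spaCy model as the language backend (the model must be installed).
lang = SpacyLanguage("en_core_web_md")
words = ["cat", "dog", "fish", "kitten", "man", "woman",
         "king", "queen", "doctor", "nurse"]

# Look up the embedding of each word and collect them in a set.
emb = EmbeddingSet(*[lang[w] for w in words])
# Plot every word against the "man" and "woman" embeddings as axes.
emb.plot_interactive(x_axis=emb["man"], y_axis=emb["woman"])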

You can even do fancy operations, like projecting onto and away from vector embeddings! You can perform these on single embeddings as well as on sets of embeddings. In the example below we attempt to filter away gender bias using linear algebra operations.

orig_chart = emb.plot_interactive('man', 'woman')

new_ts = emb | (emb['king'] - emb['queen'])
new_chart = new_ts.plot_interactive('man', 'woman')
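
The linear algebra behind "projecting onto" and "projecting away" can be sketched in plain Python. This is an illustration and not the library's implementation: the helper names and the three-dimensional toy vectors are made up, and the sketch assumes that `emb | v` removes from every embedding its component along `v`, as the prose above suggests.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def scalar_projection(v: Vector, axis: Vector) -> float:
    """Coordinate of v along axis, measured in units of the axis length."""
    return dot(v, axis) / dot(axis, axis)


def project_onto(v: Vector, axis: Vector) -> Vector:
    """The part of v that points along axis."""
    s = scalar_projection(v, axis)
    return [s * a for a in axis]


def project_away(v: Vector, axis: Vector) -> Vector:
    """The part of v that is orthogonal to axis."""
    along = project_onto(v, axis)
    return [x - y for x, y in zip(v, along)]


# Toy vectors, invented for this example (real embeddings have hundreds of dimensions).
toy: Dict[str, Vector] = {
    "king": [1.0, 2.0, 0.5],
    "queen": [1.0, 0.0, 0.5],
    "doctor": [0.2, 1.5, 2.0],
}

# The direction we want to remove: king - queen.
direction = [k - q for k, q in zip(toy["king"], toy["queen"])]

# Remove that direction from every vector in the toy set.
cleaned = {word: project_away(vec, direction) for word, vec in toy.items()}

for word, vec in cleaned.items():
    print(word, vec)
```

In this toy set "king" and "queen" differ only along the removed direction, so they end up at the same point afterwards. Real embeddings rarely separate this cleanly, which is why the original example speaks of an attempt to filter bias.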

There are also transformers such as PCA and UMAP.

from whatlies.transformers import Pca, Umap

orig_chart = emb.plot_interactive('man', 'woman')
pca_plot = emb.transform(Pca(2)).plot_interactive(x_label='pca_0', y_label='pca_1')
umap_plot = emb.transform(Umap(2)).plot_interactive(x_label='umap_0', y_label='umap_1')

pca_plot | umap_plot
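
For intuition about what a two-component PCA reduction does before plotting, here is a dependency-free sketch. It is an illustration only and not how `Pca(2)` is implemented: all function names and the toy points are hypothetical, and power iteration with deflation stands in for the numerical routines a real library would use.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def center(rows: List[Vector]) -> List[Vector]:
    """Subtract the per-dimension mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_direction(rows: List[Vector], iterations: int = 200) -> Vector:
    """Approximate the unit direction of largest variance by power iteration."""
    dim = len(rows[0])
    v = [1.0 / (i + 1) for i in range(dim)]  # fixed, non-random start vector
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]  # X v
        # X^T (X v), computed without building the covariance matrix
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def pca(rows: List[Vector], n_components: int = 2) -> List[Vector]:
    """Return the coordinates of each row along the leading components."""
    centered = center(rows)
    residual = centered
    components: List[Vector] = []
    for _ in range(n_components):
        v = top_direction(residual)
        components.append(v)
        # Deflation: remove the direction just found before searching for the next one.
        residual = [[x - dot(row, v) * vi for x, vi in zip(row, v)] for row in residual]
    return [[dot(row, c) for c in components] for row in centered]


# Toy 3-d points that vary most along the first axis and less along the second.
points = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
reduced = pca(points)
print(reduced)
```

Each row of `reduced` holds two numbers, which is the kind of two-column result that the `pca_0` and `pca_1` axis labels in the example above refer to.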

We even allow for BERT-style embeddings. Just use the square brackets.

lang = SpacyLanguage("en_trf_robertabase_lg")
lang['programming in [python]']

You'll now get the embedding for the token "python", but in the context of "programming in python".
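
One way to read the bracket notation is "embed the whole text, then return the vector for the bracketed part". The hypothetical helper below only illustrates that reading by splitting a query into the full text and the selected span. It is not part of the whatlies API and performs no embedding.

```python
import re
from typing import Tuple


def split_bracketed(query: str) -> Tuple[str, str]:
    """Return (text without brackets, bracketed span).

    If the query has no brackets, the whole query is treated as the span.
    """
    match = re.search(r"\[(.*?)\]", query)
    if match is None:
        return query, query
    target = match.group(1)
    text = query[:match.start()] + target + query[match.end():]
    return text, target


text, target = split_bracketed("programming in [python]")
print(text)    # programming in python
print(target)  # python
```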

Documentation

To learn more and for a getting started guide, check out the documentation.

Installation

To install the package along with all its dependencies, run:

pip install whatlies

Similar Projects

Some other projects offer similar tools, and we figured it fair to mention and compare them here.

Julia Bazińska & Piotr Migdal Web App

The original inspiration for this project came from this web app and this PyData talk. The web app takes a while to load, but it is really fun to play with. The goal of this project is to make it easier to create similar charts from Jupyter using different language backends.

Tensorflow Projector

From Google there's the TensorFlow Projector project. It offers highly interactive 3d visualisations as well as some transformations via TensorBoard.

  • The TensorFlow Projector creates projections in TensorBoard, which you can also load into a Jupyter notebook, whereas whatlies makes visualisations directly.
  • The tensorflow projector supports interactive 3d visuals, which whatlies currently doesn't.
  • Whatlies offers lego bricks that you can chain together to get a visualisation started. This also means that you're more flexible when it comes to transforming data before visualising it.

Parallax

From Uber AI Labs there's parallax, which is described in a paper here. The two tools share a common mindset: the goal is to use arbitrary user-defined projections to understand embedding spaces better. That said, there are some differences worth mentioning.

  • Parallax relies on bokeh as a visualisation backend and offers a lot of visualisation types (like radar plots). Whatlies uses altair and tries to stick to simple scatter charts. Altair can export interactive html/svg, but it will not scale as well if you're drawing many points at the same time.
  • Parallax is meant to be run as a stand-alone app from the command line while Whatlies is meant to be run from the jupyter notebook.
  • Parallax gives a full user interface while Whatlies offers lego bricks that you can chain together to get a visualisation started.
  • Whatlies relies on language backends to fetch word embeddings. Parallax instead lets you load raw files from disk.
  • Parallax has been around for a while; Whatlies is newer and therefore more experimental.

Local Development

If you want to develop locally you can start by running this command.

make develop

Documentation

This is generated via

make docs

Citation

Please use the following citation if you found whatlies helpful for any of your work (find the whatlies paper here):

@misc{Warmerdam2020whatlies,
	Archiveprefix = {arXiv},
	Author = {Vincent D. Warmerdam and Thomas Kober and Rachael Tatman},
	Eprint = {2009.02113},
	Primaryclass = {cs.CL},
	Title = {Going Beyond T-SNE: Exposing \texttt{whatlies} in Text Embeddings},
	Year = {2020}
}
