Skip to main content

EmbCompare is a small python package that helps you compare your embeddings both visually and numerically.

Project description

EmbCompare

A simple python tool for embedding comparison

EmbCompare is a small python package highly inspired by the Embedding Comparator tool that helps you compare your embeddings both visually and numerically.

Key features :

  • Visual comparison : GUI for comparison of two embeddings
  • Numerical comparison : Calculation of comparison indicators between two embeddings for monitoring purposes

EmbCompare keeps things simples. All computations are made in memory and the package does not bring any embedding storage management.

If you need a tool to store, compare and track your experiments, you may like the vectory project.

Table of content

🛠️ Installation

# basic install
pip install embcompare

# installation with the gui tool
pip install embcompare[gui]

👩‍💻 Usage

EmbCompare provides a CLI with three sub-commands :

  • embcompare add is used to create or update a yaml file containing all embeddings infos : path, format, labels, term-frequencies, ... ;
  • embcompare report is used to generate json reports containing comparison metrics ;
  • embcompare gui is used to start a streamlit webapp to compare your embeddings visually.

Config file

EmbCompare use a yaml file for referencing embeddings and relevant informations. By default, EmbCompare is looking for a file named embcompare.yaml in the current working directory.

embeddings: 
    first_embedding:
        name: My first embedding
        path: /abspath/to/firstembedding.json
        format: json
        frequencies: /abspath/to/freqs.json
        frequencies_format: json
        labels: /abspath/to/labels.pkl
        labels_format: pkl
    second_embedding:
        name: My second embedding
        path: /abspath/to/secondembedding.json
        format: word2vec
        frequencies: /abspath/to/freqs.pkl
        frequencies_format: pkl
        labels: /abspath/to/labels.json
        labels_format: json

The embcompare add command allow to update this file programatically (and even create it if it does not exist).

JSON comparison report generation

EmbCompare aims to help to compare embedding thanks to numerical metrics that can be used to check if a new generated embedding is very different from the last one. The command embcompare report can be used in two ways :

  • With a single embedding. In this case it generate a small report about the embedding :
    embcompare report first_embedding
    # creates a first_embedding_report.json file containing some infos about the embedding
    
  • With two embeddings. In this case it generate a comparison report about the two embeddings :
    embcompare report first_embedding second_embedding
    # creates a first_embedding_second_embedding_report.json file containing comparison metrics
    

GUI

A short video overview of embcompare graphical user interface

The GUI is also very handy to compare embeddings. To start the GUI, use the commande embcompare gui. It will launch a streamlit app that will allow you to visually compare the embeddings you added in the configuration file.

🐍 Python API

EmbCompare provide several classes to load and compare embeddings.

Embedding

The Embedding class is child of the gensim.KeyedVectors class.

It add few functionalities :

  • You can provide term frequencies so you can filter the elements later
  • You can easily compute all elements nearest neighbors (thanks to sklearn.neighbors.NearestNeighbors)
import json
import gensim.downloader as api
from embcompare import Embedding

word_vectors = api.load("glove-wiki-gigaword-100")
with open("frequencies.json", "r") as f:
  word_frequencies = json.load(f)

embedding = Embedding.load_from_keyedvectors(word_vectors, frequencies=word_frequencies)
neigh_dist, neigh_ind = embedding.compute_neighborhoods()

EmbeddingComparison

The EmbeddingComparison class is meant to compare two Embedding objects :

from embcompare import EmbeddingComparison, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2}, n_neighbors=25)
comparison.neighborhoods_similarities["word"]
# 0.867

JSON reports

EmbeddingReport

The EmbeddingReport class is used to generate small report about an embedding :

from embcompare import EmbeddingReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
report = EmbeddingReport(emb1)
report.to_dict()
# { 
#   "vector_size": 300,
#   "mean_frequency": 0.00012,
#   "mean_distance_neighbors": 0.023,
#   ...
# }

EmbeddingComparisonReport

The EmbeddingComparisonReport class is used to generate small comparison report from two embedding :

from embcompare import EmbeddingComparison, EmbeddingComparisonReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2})
report = EmbeddingComparisonReport(comparison)

report.to_dict()
# {
#   "embeddings" : [
#     { 
#       "vector_size": 300,
#       "mean_frequency": 0.00012,
#       "mean_distance_neighbors": 0.023,
#       ...
#     },
#     ...
#   ],
#   "neighborhoods_similarities_median": 0.012,
#   ...
# }

📊 Create your custom streamlit app

The GUI is built with streamlit. We tried to modularized the app so you can more easily reuse some features for your custom streamlit app :

# embcompare/gui/app.py

from embcompare.gui.features import (
    display_custom_elements_comparison,
    display_elements_comparison,
    display_embeddings_config,
    display_frequencies_comparison,
    display_neighborhoods_similarities,
    display_numbers_of_elements,
    display_parameters_selection,
    display_spaces_comparison,
    display_statistics_comparison,
)
from embcompare.gui.helpers import create_comparison

def main():
    """Streamlit app for embeddings comparison"""
    config_embeddings = config[CONFIG_EMBEDDINGS]

    (
        tab_infos,
        tab_stats,
        tab_spaces,
        tab_neighbors,
        tab_compare,
        tab_compare_custom,
        tab_frequencies,
    ) = st.tabs(
        [
            "Infos",
            "Statistics",
            "Spaces",
            "Similarities",
            "Elements",
            "Search elements",
            "Frequencies",
        ]
    )

    # Embedding selection (inside the sidebar)
    with st.sidebar:
        parameters = display_parameters_selection(config_embeddings)

    # Display informations about embeddings
    with tab_infos:
        display_embeddings_config(
            config_embeddings, parameters.emb1_id, parameters.emb2_id
        )

    comparison = create_comparison(
        config_embeddings,
        emb1_id=parameters.emb1_id,
        emb2_id=parameters.emb2_id,
        n_neighbors=parameters.n_neighbors,
        max_emb_size=parameters.max_emb_size,
        min_frequency=parameters.min_frequency,
    )

    # Display number of element in both embedding and common elements
    with tab_infos:
        display_numbers_of_elements(comparison)

    # Display statistics
    with tab_stats:
        display_statistics_comparison(comparison)

    if not comparison.common_keys:
        st.warning("The embeddings have no element in common")
        st.stop()

    # Comparison below are based on common elements comparison
    with tab_spaces:
        display_spaces_comparison(comparison)

    with tab_neighbors:
        display_neighborhoods_similarities(comparison)

    with tab_compare:
        display_elements_comparison(comparison)

    with tab_compare_custom:
        display_custom_elements_comparison(comparison)

    with tab_frequencies:
        display_frequencies_comparison(comparison)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embcompare-1.0.1.tar.gz (29.8 MB view details)

Uploaded Source

Built Distribution

embcompare-1.0.1-py3-none-any.whl (56.5 kB view details)

Uploaded Python 3

File details

Details for the file embcompare-1.0.1.tar.gz.

File metadata

  • Download URL: embcompare-1.0.1.tar.gz
  • Upload date:
  • Size: 29.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.1

File hashes

Hashes for embcompare-1.0.1.tar.gz
Algorithm Hash digest
SHA256 0351f82f5667c445bf97414ad876e2935bb00b79b9603e1065faec3fa45f81cd
MD5 580de05f63fab0d6da6d26078c57f011
BLAKE2b-256 c51cf7420fee3ddbcfc1bdd5585042cfae88013ae48d414c5e1ee1506dffe410

See more details on using hashes here.

File details

Details for the file embcompare-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: embcompare-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 56.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.1

File hashes

Hashes for embcompare-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 9881e044873609ebc497862cfe51fa5da35808635ed9ad767b90a457bc2bf78d
MD5 5c285f6c883d6933610eb7d862dcc4d7
BLAKE2b-256 c35f224472a4e951924e21eeeeaaf3f652262976156406c6c60c1c0fc83f3bee

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page