
Presto: Projected Embedding Similarity based on Topological Overlays


Presto 🎶

Confidently and efficiently navigate the multiverse 🧭🚀

The world of machine learning research is riddled with small decisions, from data collection and cleaning to model selection and parameter tuning. Each combination of data, implementation, and modeling decisions leads to a potential universe where we can analyze and interpret results. Together, these form a multiverse! 🌌

In this project, we focus on mapping out the multiverse of latent representations, specifically targeting machine learning tasks that produce embeddings. In our ICML 2024 paper, we develop topological tools to efficiently measure the structural variation between representations that arise from different choices in a machine learning workflow. Our main contribution is a custom score developed specifically for multiverse analysis:

Presto, our Projected Embedding Similarity based on Topological Overlays. 🔍✨

Installation

You can install Presto using pip:

pip install presto-multiverse

Basic Usage

Presto

import numpy as np
from presto import Presto
from sklearn.random_projection import GaussianRandomProjection as Gauss

# Compare two embeddings X and Y using a collection of low-dimensional random projections

X = np.random.rand(1000, 10)
Y = np.random.rand(1000, 12)

presto = Presto(projector=Gauss)

dist = presto.fit_transform(X, Y, n_projections=20, n_components=2, normalize=True)

print(f"Presto distance between X and Y: {dist}")

Atom

We also provide an Atom class that supports Approximate Topological Operations in the Multiverse. When considering a collection of representations that arise from a multiverse, we provide functionality for producing Multiverse Metric Spaces (MMS) that we introduce in our work. Given a collection of embeddings, an MMS encodes the pairwise distance between topological descriptors as computed by Presto.

import numpy as np
from presto import Atom

# A list of embeddings in the multiverse
data = [np.random.rand(100, 10) for _ in range(3)]

atom = Atom(data)
atom.compute_MMS(parallelize=True)
print(atom.MMS)

Features

Normalization & Projection

Presto provides a method to normalize spaces by approximating their diameter. This ensures the embeddings are scaled appropriately before further processing. The class also lets you project high-dimensional embeddings into a lower-dimensional space using your method of choice! We recommend methods like PCA or Gaussian Random Projection. When using random projections, we encourage you to produce many random embeddings: Presto uses the distribution of these low-dimensional representations to build topological descriptors of each space.
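To make the two steps above concrete, here is a minimal NumPy sketch of diameter-based normalization followed by repeated Gaussian random projections. It is an illustration under stated assumptions, not Presto's internal code: the helper names approximate_diameter and random_projections are hypothetical, and the farthest-point sweep is just one common way to estimate a diameter cheaply.

```python
import numpy as np

def approximate_diameter(X, n_seeds=5, rng=None):
    """Estimate (from below) the diameter of a point cloud with farthest-point sweeps.

    Hypothetical helper for illustration; Presto's own approximation may differ.
    """
    rng = np.random.default_rng(rng)
    best = 0.0
    for start in rng.integers(0, len(X), size=n_seeds):
        # Jump to the point farthest from the seed, then measure how far
        # the farthest point from *that* point is.
        far = np.argmax(np.linalg.norm(X - X[start], axis=1))
        best = max(best, np.linalg.norm(X - X[far], axis=1).max())
    return best

def random_projections(X, n_projections=20, n_components=2, rng=None):
    """Return a list of independent Gaussian random projections of X."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    projections = []
    for _ in range(n_projections):
        # Entries drawn from N(0, 1 / n_components), the usual scaling
        # for Gaussian random projection matrices.
        R = rng.normal(scale=1.0 / np.sqrt(n_components), size=(d, n_components))
        projections.append(X @ R)
    return projections

X = np.random.default_rng(0).random((200, 10))
X_normalized = X / approximate_diameter(X, rng=0)
projected = random_projections(X_normalized, n_projections=20, n_components=2, rng=0)
print(len(projected), projected[0].shape)
```

Each entry of projected is one low-dimensional view of the same embedding; a topological descriptor would then be fitted to every view.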

Presto Distance

Presto first fits a topological descriptor to each high-dimensional embedding: in particular, we build persistence landscapes based on the projections of the embedding. When dealing with a distribution of projections, we fit a landscape to each projection and aggregate the topological information into an average landscape, a well-defined notion thanks to the great work by Bubenik et al. The Presto distance between embeddings X and Y is then computed as the landscape distance between the fitted topological descriptors.
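The pipeline described above (landscape per projection, average landscape, landscape distance) can be sketched in plain NumPy. This is a rough illustration rather than Presto's implementation: the functions landscape and landscape_distance are hypothetical, the persistence diagrams are hand-written toy values standing in for diagrams that a persistent homology library would compute from each projection, and the distance is a simple Riemann-sum approximation on a uniform grid.

```python
import numpy as np

def landscape(diagram, grid, n_layers=3):
    """Evaluate the first n_layers persistence landscape functions on a grid.

    diagram: array-like of shape (n_points, 2) holding (birth, death) pairs.
    Returns an array of shape (n_layers, len(grid)).
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    births, deaths = diagram[:, [0]], diagram[:, [1]]
    # Each (birth, death) pair contributes a tent function peaking at its midpoint.
    tents = np.clip(np.minimum(grid - births, deaths - grid), 0.0, None)
    # Layer k is the k-th largest tent value at each grid point.
    tents = -np.sort(-tents, axis=0)
    out = np.zeros((n_layers, len(grid)))
    k = min(n_layers, tents.shape[0])
    out[:k] = tents[:k]
    return out

def landscape_distance(L1, L2, grid, p=2):
    """Approximate the L^p distance between two landscapes sampled on a uniform grid."""
    step = grid[1] - grid[0]
    return (np.sum(np.abs(L1 - L2) ** p) * step) ** (1.0 / p)

grid = np.linspace(0.0, 1.0, 101)

# Toy diagrams, one per projection (made-up values for illustration only).
diagrams_X = [np.array([[0.0, 0.4], [0.1, 0.3]]), np.array([[0.0, 0.5]])]
diagrams_Y = [np.array([[0.2, 0.9]]), np.array([[0.1, 0.8], [0.3, 0.5]])]

# Landscapes live in a vector space, so the average is taken pointwise.
mean_X = np.mean([landscape(d, grid) for d in diagrams_X], axis=0)
mean_Y = np.mean([landscape(d, grid) for d in diagrams_Y], axis=0)

print(landscape_distance(mean_X, mean_Y, grid))
```

Because landscapes are ordinary functions, averaging them pointwise is meaningful, which is what makes the aggregation over many projections possible.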

Presto Sensitivity

Presto also offers methods to evaluate the sensitivity of the embeddings' topological structure. By using the statistical properties afforded by persistence landscapes, we compute variance-based scores to evaluate the sensitivity and stability of embeddings with respect to different data, implementation, and modeling choices. Presto supports functionality to calculate various versions of sensitivity within a multiverse.
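As a rough illustration of a variance-based score, the sketch below treats each universe's descriptor as a sampled landscape and measures the mean squared L2 distance to the average landscape. The function landscape_variance and the toy tent-shaped landscapes are assumptions made for this example; the exact sensitivity scores that Presto computes may be defined differently.

```python
import numpy as np

def landscape_variance(landscapes, step):
    """Mean squared L^2 distance from each sampled landscape to their average.

    landscapes: array-like of shape (n_universes, ...) sampled on a uniform grid
    with spacing `step`. Hypothetical helper, not Presto's API.
    """
    landscapes = np.asarray(landscapes, dtype=float)
    mean = landscapes.mean(axis=0)
    # Sum squared differences over every axis except the first (the universes).
    axes = tuple(range(1, landscapes.ndim))
    squared_distances = np.sum((landscapes - mean) ** 2, axis=axes) * step
    return squared_distances.mean()

grid = np.linspace(0.0, 1.0, 101)
step = grid[1] - grid[0]

def tent(birth, death):
    """Single-layer landscape of a diagram with one (birth, death) pair."""
    return np.clip(np.minimum(grid - birth, death - grid), 0.0, None)

# Toy multiverses: one where choices barely change the topology, one where they do.
stable = [tent(0.20, 0.80), tent(0.20, 0.80), tent(0.21, 0.79)]
unstable = [tent(0.00, 0.30), tent(0.20, 0.80), tent(0.50, 1.00)]

print(landscape_variance(stable, step), landscape_variance(unstable, step))
```

A score near zero suggests that the choices being varied leave the topological structure largely unchanged, while a larger score flags a sensitive part of the multiverse.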

Clustering and Compression

Using the Atom class, you can compute Multiverse Metric Spaces (MMS) that encode pairwise Presto distances between embeddings. Once you have an MMS, there are lots of cool things to do in this space: we provide functionality for clustering your embeddings with Scikit-learn's AgglomerativeClustering and compressing your embedding space using a greedy approximation of a set cover algorithm.
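The compression idea can be sketched as a greedy set cover over a distance matrix: each embedding "covers" every embedding within a radius epsilon, and representatives are picked until everything is covered. The function greedy_representatives, the radius, and the toy matrix below are hypothetical and only illustrate the technique; they are not the package's own interface.

```python
import numpy as np

def greedy_representatives(D, epsilon):
    """Greedy set cover: pick embeddings until every one lies within epsilon of a pick.

    D: symmetric (n, n) distance matrix with a zero diagonal, e.g. an MMS.
    Returns the indices of the chosen representatives, in the order picked.
    """
    D = np.asarray(D, dtype=float)
    covers = D <= epsilon                 # covers[i, j]: embedding i covers embedding j
    uncovered = np.ones(len(D), dtype=bool)
    chosen = []
    while uncovered.any():
        # Pick the embedding that covers the most still-uncovered embeddings.
        gains = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        chosen.append(best)
        uncovered &= ~covers[best]
    return chosen

# Toy distance matrix with two tight groups: {0, 1, 2} and {3, 4}.
D = np.array([
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
])

print(greedy_representatives(D, epsilon=0.15))  # → [1, 3]
```

Since every embedding covers itself when the diagonal is zero and epsilon is non-negative, the loop always terminates; a smaller epsilon keeps more representatives, a larger one compresses more aggressively.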

Generating Embeddings

Coming soon! We will also support a framework for generating various embeddings from models like VAEs, Transformers, and dimensionality reduction algorithms.

License

Presto is licensed under the BSD-3 License. This permissive license allows you to use, modify, and distribute the software with minimal restrictions. You are free to incorporate Presto into your projects, whether they are open-source or proprietary, as long as you cite us! Please include the original license and copyright notice in any distributions of the software. For more detailed information, please refer to the LICENSE file included in the repository.

Contributing

We welcome contributions and suggestions for our Presto package! Here are some basic guidelines for contributing:

How to Submit an Issue

  1. Check Existing Issues: Before submitting a new issue, please check if it has already been reported.

  2. Open a New Issue: If your issue is new, open a new issue in the repository. Provide a clear and detailed description of the problem, including steps to reproduce the issue if applicable.

  3. Include Relevant Information: Include any relevant information, such as system details, version numbers, and screenshots, to help us understand and resolve the issue more efficiently.

How to Contribute

If you're unfamiliar with contributing to open source repositories, here is a basic roadmap:

  1. Fork the Repository: Start by forking the repository to your own GitHub account.

  2. Clone the Repository: Clone the forked repository to your local machine.

    git clone https://github.com/your-username/presto.git
    
  3. Create a Branch: Create a new branch for your feature or bug fix.

    git checkout -b feature/your-feature-name
    
  4. Make Changes: Implement your changes in the new branch.

  5. Commit Changes: Commit your changes with a descriptive commit message.

    git commit -m "Description of your changes"
    
  6. Push Changes: Push the changes to your forked repository.

    git push origin feature/your-feature-name
    
  7. Submit a Pull Request: Open a pull request to the main repository with a clear description of your changes and the purpose of the contribution.

Need Help?

If you need any help or have questions, feel free to reach out to the authors or open an issue. We appreciate your contributions and are happy to assist!
