Presto: Projected Embedding Similarty based on Topological Overlays
Project description
Presto 🎶
Confidently and efficiently navigate the multiverse 🧭🚀
The world of machine learning research is riddled with small decisions, from data collection, cleaning, into model selection and parameter tuning. Each combination of data, implementation, and modeling decisions leads to a potential universe where we can analyze and interpret results. Together, these form a multiverse! 🌌
In this project, we focus on mapping out the multiverse of latent representations, specifically targeting machine learning tasks that produce embeddings. In our ICML 2024 paper, we develop topological tools to efficiently measure the structural variation between representations that arise from different choices in a machine learning workflow. Our main contribution is a custom score developed specifically for multiverse analysis:
Presto
, our Projected Embedding Similarity based on Topological Overlays. 🔍✨
Installation
You can install Presto using pip:
pip install presto-multiverse
Basic Usage
Presto
import numpy as np
from presto import Presto
from sklearn.random_projection import GaussianRandomProjection as Gauss
#Compare Two Embeddings X,Y based on a collection of low-dimensional random embeddings
X = np.random.rand(1000,10)
Y = np.random.rand(1000,12)
presto = Presto(projector=Gauss)
dist = presto.fit_transform(X,Y,n_projections = 20,n_components=2,normalize=True)
print(f"Presto Distance between X & Y : {dist}")
Atom
We also provide an Atom
class that supports Approximate Topological Operations in the Multiverse.
When considering a collection of representations that arise from a multiverse, we provide functionality for producing Multiverse Metric Spaces (MMS) that we introduce in our work. Given a collection of embeddings, an MMS encodes the pairwise distance between topological descriptors as computed by Presto
.
import numpy as np
from presto import Atom
# A list of embeddings in the multiverse
data = [np.random.rand(100, 10) for _ in range(3)]
atom = Atom(data)
atom.compute_MMS(parallelize=True)
print(atom.MMS)
Features
Normalization & Projection
Presto provides a method to normalize spaces by approximating their diameter. This ensures the embeddings are scaled appropriately before further processing. This class also allows a user to you to project high-dimensional embeddings into a lower-dimensional space using your method of choice! We recommend methods like PCA or Gaussian Random Projection. When using random projections, we encourage the user to produce many random embeddings– Presto
uses the distrubution of these low-dimensional representations to build topological descriptors of each space.
Presto Distance
Presto first fits a topological descriptor to each high dimensional embedding: in particular we build persistence landscapes based on the projections of the embedding. When dealing with a distribution of projections, we fit landscapes to each projection and aggregate the topological information into an average landscape, a well defined notion thanks to the great work by Bubenik et al. The presto distance between embeddings X,Y is then computed as the landscape distance between the fitted topological descriptors.
Presto Sensitivity
Presto also offers methods to evaluate the sensitivity of the embeddings' topological structure. By using the statistical properties afforded by persistence landscapes, we compute variance-based scores to evaluate the sensitivity and stability of embeddings with respect to different data, implementation, and modeling choices. Presto
supports functionality to calculate various versions of sensitivity within a multiverse.
Clustering and Compression
Using the Atom
class, you can compute Multiverse Metric Spaces (MMS) that encode pairwise Presto distances between embeddings. Once you have an MMS, there are lots of cool things to do in this space: we provide functionality for clustering your embeddings with Scikit-learn's AgglomerativeClustering and compressing your embedding space using a greedy approximation of a set cover algorithm.
Generating Embeddings
Coming soon! We will also support a framework for generating various embeddings from models like VAEs, Transformers, and Dimensionality reduction algorithms.
License
Presto is licensed under the BSD-3 License. This permissive license allows you to use, modify, and distribute the software with minimal restrictions. You are free to incorporate Presto into your projects, whether they are open-source or proprietary, as long as you cite us! Please include the original license and copyright notice in any distributions of the software. For more detailed information, please refer to the LICENSE file included in the repository.
Contributing
We welcome contributions and suggestions for our Presto package! Here are some basic guidelines for contributing:
How to Submit an Issue
-
Check Existing Issues: Before submitting a new issue, please check if it has already been reported.
-
Open a New Issue: If your issue is new, open a new issue in the repository. Provide a clear and detailed description of the problem, including steps to reproduce the issue if applicable.
-
Include Relevant Information: Include any relevant information, such as system details, version numbers, and screenshots, to help us understand and resolve the issue more efficiently.
How to Contribute
If you're unfamiliar with contributing to open source repositories, here is a basic roadmap:
-
Fork the Repository: Start by forking the repository to your own GitHub account.
-
Clone the Repository: Clone the forked repository to your local machine.
git clone https://github.com/your-username/presto.git
-
Create a Branch: Create a new branch for your feature or bug fix.
git checkout -b feature/your-feature-name
-
Make Changes: Implement your changes in the new branch.
-
Commit Changes: Commit your changes with a descriptive commit message.
git commit -m "Description of your changes"
-
Push Changes: Push the changes to your forked repository.
git push origin feature/your-feature-name
-
Submit a Pull Request: Open a pull request to the main repository with a clear description of your changes and the purpose of the contribution.
Need Help?
If you need any help or have questions, feel free to reach out to the authors or submit a pull request. We appreciate your contributions and are happy to assist!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file presto_multiverse-0.1.2.tar.gz
.
File metadata
- Download URL: presto_multiverse-0.1.2.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.9.13 Darwin/23.4.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 93a86064ceaf7895ea528e5468fef2539c4ce36d6dc296a632ddd080d2273d35 |
|
MD5 | dcf5550470bce9803c7c1b89c38cbefb |
|
BLAKE2b-256 | 525b1ba9116d625d284a4f26fc77e95bb45254ccde16f5a3c9de4706bf5aa28b |
File details
Details for the file presto_multiverse-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: presto_multiverse-0.1.2-py3-none-any.whl
- Upload date:
- Size: 14.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.9.13 Darwin/23.4.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ef1235939503d8890b56768db14fbdf6b1bb4d727f097e5bfce53317b61d0458 |
|
MD5 | 8b1b4d36e9d7419341d3ec6a3c6a0eb6 |
|
BLAKE2b-256 | e9ae91c0f626644da5bf6fb85826da8de1515e32b50da30d8375306fded95577 |