The data manifold under the microscope.
[!CAUTION] This project is under active construction. The code runs and can be used to explore the framework, but it still needs more updates before it can be relied on to reproduce the original paper results or to support new research projects.
The Data Manifold Under the Microscope
Library transition status
This repository is being transitioned from a paper-review artifact into a reusable Python library for working with grid-sampled data manifolds and geometric measurements. The current library-facing core is the dataset generation and finite-difference geometry code. The manifold fitting and bounds evaluation code is still included as reference material from the accompanying paper and as example usage of the framework, but it is not yet a polished or stable public API.
Here is a map of the main components:
- Datasets
  - There are two notebooks under `notebooks/datasets_and_measures` which contain examples of loading and visualizing the toy and image datasets.
  - The toy datasets code is in `microscope/datasets/toy_manifolds.py`.
  - The image datasets code is in `microscope/datasets/custom_dsprites.py` (dSprites) and `microscope/datasets/coil20.py` (COIL-20).
- Geometric measures
  - The notebook `notebooks/datasets_and_measures/toy_manifold_datasets.ipynb` contains example computation and visualization of the measures on the image datasets.
  - The finite-difference computations of the measures are under `microscope/computations_grid`. `basic.py` contains basic computations such as partial derivatives and the Riemannian metric; `volume.py`, `curvature.py` and `reach.py` are built on top of it.
  - The functions are thoroughly tested; see `microscope/computations_grid/tests`. The tests rely on a separate part of the codebase, `microscope/manifold_examples`, where the measures are computed symbolically on simple manifolds.
- Manifold fitting and bounds
  - The MMLS fitting method is in `experiment_scripts/manifold_fitting/mmls.py`.
  - The denoising autoencoder used for the toy datasets is in `experiment_scripts/toy_manifolds_experiment/manifold_fitting_denoising_autoencoder.py`.
  - The beta-VAE components are under `representation_learning/beta_vae`.
  - The scripts running the main experiments are `experiment_scripts/manifold_fitting/training.py` for the image datasets and `experiment_scripts/toy_manifolds_experiment/fit_and_get_measures.py` for the toy datasets.
  - The three notebooks under `notebooks/manifold_fitting` show how to generate some of the results of the paper, namely the bound curves for all datasets on MMLS, the curves on dSprites with MMLS on different dimensions, and the curves for all methods on dSprites. Please note that the code and plots there are not very polished. The final plots were generated separately using the curves exported from the notebooks.
manifold-microscope provides a framework for studying and benchmarking data manifolds through densely sampled
grid-based datasets and finite-difference geometric computations. The goal of the library is to make it practical to
construct measurable reference manifolds, compute geometric quantities such as curvature, reach, and volume, and use
those quantities when evaluating manifold fitting, generalization bounds, and geometric estimation methods.
Why Use the Microscope?
In most research settings, one must choose between idealized mathematical manifolds (e.g., spheres, ellipsoids), which are unrealistically simple, and real-world datasets, whose true geometric quantities are unknown or hard to measure accurately enough for testing. This framework bridges that gap by offering datasets that are structured and realistic, yet fully measurable.
For instance, if you derive a new generalization or manifold fitting bound involving curvature or volume, you can directly test how tight it is under controlled geometric conditions. Similarly, if you develop a curvature estimation algorithm, you can benchmark its performance on datasets where the true curvature is exactly known. Randomly sampling points from the provided grids lets you simulate realistic sparse sampling scenarios and directly compare estimates to ground truth.
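The benchmarking workflow described above can be sketched with plain NumPy. This is only an illustration and does not use the library's API: the circle grid, the noise level, and the algebraic least-squares circle fit standing in for "a curvature estimation algorithm" are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense reference grid on a circle of known radius (ground-truth curvature 1 / radius).
radius = 2.0
theta = np.linspace(0.0, 2.0 * np.pi, 10_000, endpoint=False)
grid = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
true_curvature = 1.0 / radius

# Simulate sparse, noisy sampling by drawing a random subset of grid points.
idx = rng.choice(len(grid), size=50, replace=False)
sample = grid[idx] + rng.normal(scale=0.01, size=(50, 2))

# Stand-in estimator under test (an assumption, not the library's method):
# algebraic least-squares circle fit, solving
#   x^2 + y^2 = 2*a*x + 2*b*y + c,  with c = r^2 - a^2 - b^2.
A = np.column_stack([2.0 * sample, np.ones(len(sample))])
rhs = np.sum(sample**2, axis=1)
(a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
estimated_curvature = 1.0 / np.sqrt(c + a**2 + b**2)

# Compare the estimate against the exactly known value.
relative_error = abs(estimated_curvature - true_curvature) / true_curvature
print(f"true={true_curvature:.4f} estimated={estimated_curvature:.4f} rel_err={relative_error:.2e}")
```

Varying the subset size or the noise scale shows how the estimator degrades as sampling becomes sparser, which is the kind of comparison the grid datasets are meant to support.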
Setup and requirements
To install the current development version of the library, run `pip install .` at the top level of the project. A machine with a GPU with at least 5 GB of memory and 30 GB of RAM is recommended, especially for dense image-dataset experiments.
To run the unit tests, run `pytest` at the top level of the project.
Datasets
The two main image datasets included are:
- dSprites (grid generator) – generates synthetic images with controllable grid density, image size, and transformations.
- COIL-20 (augmented generator) – extends the original dataset (to be downloaded separately) with controlled xy-rotations and rescaling.
Additionally, four toy datasets are provided for smaller experiments:
- Circle – A circle embedded in 2D.
- Moons – Two semicircles close to each other. Practically the same as the moons dataset from scikit-learn.
- Sphere – A sphere embedded in 3D.
- Torus – A torus in 3D. Has slightly more complex topology and nonuniform geometric measures.
All datasets are densely sampled on a grid, where each grid direction corresponds to a transformation axis. This limits practical dimensionality to about 4–5 directions but provides highly accurate geometric values which can be used as ground truth. Datasets can be loaded as full grids or sampled uniformly with respect to the local volume element. For the toy datasets, the geometric measures are computed directly using the corresponding closed-form formulas.
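To illustrate what sampling "uniformly with respect to the local volume element" means, here is a minimal NumPy sketch on a unit sphere. It is not the library's sampling code: the grid layout, the variable names, and the use of the closed-form volume element sin(theta) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid over the two transformation axes of a unit sphere (polar and azimuthal angle).
n_theta, n_phi = 200, 400
theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta  # cell midpoints, avoids the poles
phi = np.arange(n_phi) * 2.0 * np.pi / n_phi
T, P = np.meshgrid(theta, phi, indexing="ij")
points = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], axis=-1)
flat_points = points.reshape(-1, 3)

# Closed-form volume element of the unit sphere in these coordinates: sqrt(det g) = sin(theta).
volume_element = np.sin(T)

# Draw grid cells with probability proportional to the local volume element.
probs = (volume_element / volume_element.sum()).ravel()
weighted_idx = rng.choice(flat_points.shape[0], size=20_000, p=probs)
weighted_sample = flat_points[weighted_idx]

# Baseline: drawing grid cells uniformly over-represents the regions near the poles.
naive_idx = rng.integers(0, flat_points.shape[0], size=20_000)
naive_sample = flat_points[naive_idx]

# For points uniform on the unit sphere E[z^2] = 1/3, while sampling theta
# uniformly gives E[cos^2(theta)] = 1/2.
mean_z2_weighted = np.mean(weighted_sample[:, 2] ** 2)
mean_z2_naive = np.mean(naive_sample[:, 2] ** 2)
print(mean_z2_weighted, mean_z2_naive)
```

The second moment of the z coordinate is used here only as a quick check that the weighted draw is uniform on the surface rather than uniform in grid coordinates.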
For example usage and visualizations of the datasets, see the notebooks in `notebooks/datasets_and_measures`.
Geometric Measures
Finite-difference operators are used to compute geometric quantities directly on the grid, which keeps the computations accurate, stable, and differentiable.
Available measures include:
- Volume and volume element
- Tangent spaces and the Riemannian metric tensor
- Scalar curvature
- Reach along with a pointwise version of it.
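As a rough idea of how such quantities can be obtained from a grid, the sketch below computes tangent vectors, the metric tensor, and the volume of a unit sphere with `numpy.gradient`. It is a generic illustration under assumed names and grid sizes, not the code in `microscope/computations_grid`.

```python
import numpy as np

# Unit sphere sampled on a regular (theta, phi) grid; the embedding has shape (n_theta, n_phi, 3).
n_theta, n_phi = 400, 800
theta = np.linspace(0.0, np.pi, n_theta)
phi = np.linspace(0.0, 2.0 * np.pi, n_phi)
T, P = np.meshgrid(theta, phi, indexing="ij")
X = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], axis=-1)

# Finite-difference partial derivatives along each grid axis (tangent vectors).
d_theta = np.gradient(X, theta, axis=0)
d_phi = np.gradient(X, phi, axis=1)
jacobian = np.stack([d_theta, d_phi], axis=-1)  # shape (n_theta, n_phi, 3, 2)

# Riemannian metric g = J^T J at every grid point, shape (n_theta, n_phi, 2, 2).
metric = np.einsum("...ki,...kj->...ij", jacobian, jacobian)

# Volume element sqrt(det g); the clip guards against tiny negative determinants at the poles.
volume_element = np.sqrt(np.clip(np.linalg.det(metric), 0.0, None))

# Integrate the volume element over the grid with the trapezoidal rule.
inner = ((volume_element[:, 1:] + volume_element[:, :-1]) * 0.5 * np.diff(phi)).sum(axis=1)
area = ((inner[1:] + inner[:-1]) * 0.5 * np.diff(theta)).sum()

print(area, 4.0 * np.pi)  # the numerical area should be close to the exact value 4*pi
```

The same pattern (Jacobian from finite differences, then `J^T J`) extends to grids with more transformation axes; only the number of derivative directions changes.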
All computations can be executed on GPU, allowing fast analysis even for dense grids.
Example computations of the measures can be found in the following notebook: notebooks/datasets_and_measures/toy_manifold_datasets.ipynb.
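For intuition about the reach, the following sketch evaluates Federer's formula, reach = inf over p ≠ q of ‖q − p‖² / (2 · dist(q − p, T_p M)), on a sampled circle, where the exact answer is the radius. Treating the infimum over q for a fixed p as the "pointwise" value is an assumption made for this example; the library's own definition and implementation may differ.

```python
import numpy as np

# Circle of known radius, sampled uniformly; its reach equals the radius.
radius = 1.5
n = 400
t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X = radius * np.stack([np.cos(t), np.sin(t)], axis=1)

# Unit tangent at each point from central differences with periodic wrap-around.
tangent = np.roll(X, -1, axis=0) - np.roll(X, 1, axis=0)
tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)

# diff[i, j] = x_j - x_i for all pairs of sample points.
diff = X[None, :, :] - X[:, None, :]
sq_dist = np.sum(diff**2, axis=-1)

# Distance from x_j - x_i to the tangent line at x_i (norm of the normal component).
along = np.einsum("ijk,ik->ij", diff, tangent)
residual = diff - along[..., None] * tangent[:, None, :]
normal_dist = np.linalg.norm(residual, axis=-1)

# Federer's ratio for every pair; the diagonal (q == p) is excluded.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = sq_dist / (2.0 * normal_dist)
ratio[np.arange(n), np.arange(n)] = np.inf

pointwise_reach = ratio.min(axis=1)  # infimum over q for each fixed p
reach = pointwise_reach.min()        # global reach estimate
print(reach, radius)
```

On a circle every pair of points gives the same ratio, so the pointwise values are constant; on manifolds such as the torus they vary from point to point.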
Long term, the goal is to extend this module to include geodesic distances, exponential maps, and other advanced differential quantities.
Manifold Fitting Bounds
This section provides reference experiments used in the accompanying paper to validate theoretical manifold fitting bounds. These scripts and notebooks are kept in this repository for now so that the original experiments remain visible, but this part of the codebase is still being reorganized and may move to a separate public reproducibility repository.
Two types of models are used to approximate the reference manifolds:
- Manifold Moving Least Squares (MMLS): A classical local manifold fitting algorithm used to recover smooth embeddings from sampled data.
- $\beta$-VAE: A deep generative model trained to learn a low-dimensional latent manifold consistent with the data geometry, which is then used to reconstruct the full data manifold.
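For reference, the $\beta$-VAE objective as introduced by Higgins et al. (2017) is the standard VAE evidence lower bound with the KL term weighted by $\beta$. The symbols below ($q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior) follow the usual VAE notation and are not taken from this repository's code:

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
$$

Setting $\beta = 1$ recovers the standard VAE; larger values put more weight on matching the prior at the cost of reconstruction quality.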
The $\beta$-VAE implementation in representation_learning/beta_vae is copied and adapted from
1Konny/Beta-VAE, a PyTorch reproduction of the $\beta$-VAE models from Higgins
et al. (2017) and Burgess et al. (2018). The upstream project is distributed under the MIT License; the original
copyright and license text are included in representation_learning/beta_vae/LICENSE, and a package-level attribution
notice is included in representation_learning/beta_vae/NOTICE.md.
The results are compared to theoretical bounds proposed by Fefferman, Narayanan & Mitter (2016) and Genovese et al. (2012), assessing their tightness and dependence on curvature, reach, and sample density.
To reproduce the fitting of the manifolds:
- For the toy datasets, run:

      python experiment_scripts/toy_manifolds_experiment/fit_and_get_measures.py \
          --output-path <path to output dir>/toy_datasets_fitting_mmls \
          --n-range 25 505 5 \
          --n-examples-per-size 20 \
          --n-ground-truth 1_000 \
          --max-workers 5 \
          --fitting-method MMLS

      python experiment_scripts/toy_manifolds_experiment/fit_and_get_measures.py \
          --output-path <path to output dir>/toy_dataset_fitting_denoising_autoencoder \
          --n-examples-per-size 5 \
          --n-ground-truth 1_000 \
          --max-workers 5 \
          --fitting-method denoising_autoencoder_random_noise

- For the image datasets, run:

      COIL20_PATH=<path where you extracted coil-20-proc> python experiment_scripts/manifold_fitting/training.py --output-path <output path>

Examples of generated results can be found in the three notebooks under notebooks/manifold_fitting.