Embedding Drift Monitor (EDM)

Quantify drift between two embedding spaces over the same corpus.
Quantify how much an embedding space changes when the underlying model is updated, swapped, or retrained. Given the same corpus embedded under two models (or two versions of the same model), EDM computes a six-metric battery and produces an opinionated report identifying where the space shifted and how severely.


Installation

pip install embedding-drift-monitor

Dev install from source:

git clone https://github.com/Datasculptures/embedding-drift-monitor
cd embedding-drift-monitor
pip install -e ".[dev]"

Requires Python 3.11+ and a C compiler for the HDBSCAN dependency. On Windows, install Microsoft C++ Build Tools first.


Quickstart

edm compare reference.npy candidate.npy

That's it. EDM loads both matrices, runs all six metrics, classifies severity, and prints a report. The matrices must have the same number of rows (one row per corpus item) and can have different column counts.

For a fast single-metric check:

edm quick reference.npy candidate.npy
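Both commands expect two aligned matrices already saved to disk. A minimal sketch of producing suitable inputs with NumPy — the shapes, values, and file names here are placeholders, not EDM requirements beyond the matching row count:

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per corpus item; the two models may use different dimensions.
reference = rng.normal(size=(1000, 768)).astype(np.float32)  # placeholder "old model" output
candidate = rng.normal(size=(1000, 384)).astype(np.float32)  # placeholder "new model" output

# Row counts must match; column counts may differ.
np.save("reference.npy", reference)
np.save("candidate.npy", candidate)
```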

CLI Reference

edm compare

Full drift analysis. Runs all six metrics and produces a report.

edm compare EMBEDDINGS_A EMBEDDINGS_B [OPTIONS]
Option Default Description
-k, --k-values 5,10,25,50 Comma-separated k values for neighbourhood analysis
-f, --format text Output format: text, json, or markdown
-o, --output stdout Write output to this file
-q, --quiet off Suppress progress messages on stderr
--force off Overwrite existing output file
-l, --labels (none) Labels file (one label per line, corpus row order)
-m, --metadata (none) Metadata file (.csv or .json) with per-item attributes
--include-per-point off Embed per-point arrays in JSON output (required for identify-regions)
--exclude-nan off Drop rows containing NaN or Inf before analysis
--sample-size 5000 Max pairs sampled for distance/geometry metrics
--seed 0 RNG seed for deterministic sampling
--config (none) TOML file with custom severity thresholds
--no-distance off Skip the KS distance-distribution metric
--no-geometry off Skip the Mantel global-geometry metric
--no-clusters off Skip the HDBSCAN cluster-stability metric
--no-hubs off Skip the hubness N_k shift metric
--min-cluster-size 5 Minimum cluster size for HDBSCAN

Examples:

# Text report to stdout
edm compare ref.npy cand.npy

# JSON output with per-point arrays, saved to file
edm compare ref.npy cand.npy --format json --include-per-point -o results.json

# Markdown report with region breakdown
edm compare ref.npy cand.npy --labels labels.txt --format markdown -o report.md

# Skip slow metrics on a large corpus
edm compare ref.npy cand.npy --no-clusters --no-geometry -k 5,10

# Exclude rows with NaN/Inf before analysis
edm compare ref.npy cand.npy --exclude-nan

edm quick

Fast single-metric check: Jaccard stability at k=10 only. No distance, geometry, cluster, or hubness metrics.

edm quick EMBEDDINGS_A EMBEDDINGS_B

No options. Prints one line:

Neighbourhood stability (k=10): 0.7234 [MODERATE DRIFT]

edm report

Convert a saved JSON results file to text or markdown.

edm report JSON_RESULTS [OPTIONS]
Option Default Description
-f, --format markdown Output format: text or markdown
-o, --output stdout Write to this file
--force off Overwrite existing output file
edm report results.json --format markdown -o report.md

edm identify-regions

Apply a labels file to saved JSON results (which must have been produced with --include-per-point) to compute a per-region drift breakdown.

edm identify-regions JSON_RESULTS -l LABELS_FILE [OPTIONS]
Option Default Description
-l, --labels required Labels file — one label per line
-f, --format text Output format: text, json, or markdown
-o, --output stdout Write to this file
--force off Overwrite existing output file
edm identify-regions results.json -l labels.txt --format markdown

Input Formats

Embedding matrices

Accepted formats: .npy, .npz, .csv, .tsv

  • .npy: NumPy binary format. Fastest. No pickle — allow_pickle=False.
  • .npz: NumPy compressed archive with exactly one array.
  • .csv / .tsv: Numeric data, optional header row. 10 MB cap.

Both matrices must have the same number of rows. Column counts may differ (different model dimensions are valid). File size cap: 50 MB per embedding file.
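These rules can be checked before invoking the CLI. A hypothetical helper (check_pair is illustrative, not part of the EDM API) that mirrors them for .npy inputs:

```python
import numpy as np

def check_pair(path_a: str, path_b: str) -> tuple[int, int, int]:
    # Validate two embedding files per the rules above: same row count
    # required, column counts free to differ. Loads without pickle,
    # matching the documented .npy handling.
    a = np.load(path_a, allow_pickle=False)
    b = np.load(path_b, allow_pickle=False)
    if a.ndim != 2 or b.ndim != 2:
        raise ValueError("expected 2-D matrices")
    if a.shape[0] != b.shape[0]:
        raise ValueError(f"row mismatch: {a.shape[0]} vs {b.shape[0]}")
    return a.shape[0], a.shape[1], b.shape[1]
```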

Labels file

Plain text, one label per line, UTF-8 encoding. Blank lines are not allowed. The label count must match the number of embedding rows exactly. Leading/trailing whitespace is stripped. Both LF and CRLF line endings are accepted.

electronics
electronics
books
clothing
books
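A reader following these rules is a few lines of Python. The function name load_labels is hypothetical, shown only to make the constraints concrete:

```python
def load_labels(path: str, n_rows: int) -> list[str]:
    # UTF-8, strip surrounding whitespace (which also handles CRLF),
    # reject blank lines, and require an exact count match.
    with open(path, encoding="utf-8") as f:
        labels = [line.strip() for line in f]
    if any(not lab for lab in labels):
        raise ValueError("blank lines are not allowed")
    if len(labels) != n_rows:
        raise ValueError(f"{len(labels)} labels for {n_rows} embedding rows")
    return labels
```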

Metadata file

CSV or JSON with one record per corpus item.

CSV: First row is a header. Subsequent rows are per-item values.

JSON: Top-level array of objects, one per item.

Numeric columns are stored as float64; columns that cannot be parsed as floats are stored as string arrays.
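The column-typing rule can be sketched as follows; type_columns is a hypothetical illustration of the described behaviour, not EDM's actual loader:

```python
import numpy as np

def type_columns(rows: list[dict]) -> dict:
    # A column becomes float64 only if every value parses as a float;
    # otherwise the whole column is kept as a string array.
    cols = {key: [str(r[key]) for r in rows] for key in rows[0]}
    out = {}
    for name, values in cols.items():
        try:
            out[name] = np.array([float(v) for v in values], dtype=np.float64)
        except ValueError:
            out[name] = np.array(values, dtype=str)
    return out
```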


Metric Battery

  1. Neighbourhood stability (Jaccard): what fraction of each point's k nearest neighbours are the same in both spaces? A score of 1.0 means the local structure is perfectly preserved; a score near 0 means most neighbours changed.
  2. Neighbourhood rank correlation (Spearman): among the common neighbours, do the distance rankings agree? Measures whether relative proximity is preserved, not just set membership.
  3. Distance distribution shift (KS statistic): do pairwise distances follow the same statistical distribution in both spaces? A high KS statistic means the global scale has shifted.
  4. Global geometry (Mantel correlation): is the pairwise distance matrix similar in both spaces? Measures overall geometric correspondence. A high Mantel r means items that were far apart remain far apart.
  5. Cluster membership stability (HDBSCAN + ARI): do the natural clusters found in one space correspond to those in the other? Adjusted Rand Index: 1.0 is a perfect match, 0.0 is random.
  6. Hubness shift (N_k Spearman): do the same items dominate as hub nodes (appearing frequently as a nearest neighbour)? A high correlation means hub structure is preserved.

All six metrics run by default. Skip expensive ones with --no-clusters, --no-geometry, etc.
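The first metric follows directly from its definition. A toy brute-force sketch (Euclidean distances, self excluded) that mirrors the description above — EDM's internal implementation may differ:

```python
import numpy as np

def jaccard_stability(a: np.ndarray, b: np.ndarray, k: int = 10) -> float:
    # Mean Jaccard overlap of each point's k nearest neighbours in the
    # two spaces. 1.0 = local structure perfectly preserved.
    def knn(x):
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn(a), knn(b)
    scores = []
    for i in range(len(a)):
        sa, sb = set(na[i]), set(nb[i])
        scores.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(scores))
```

The O(n^2) distance matrix keeps this sketch to small corpora; it is for understanding the metric, not for production use.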


Severity Interpretation

Each metric is classified as LOW, MODERATE, HIGH, or CRITICAL based on configurable thresholds. The overall severity is the worst individual metric.

Severity Meaning Recommended action
LOW Minimal drift Safe to deploy
MODERATE Notable drift Review before deploying
HIGH Significant drift Full review required
CRITICAL Severe drift Do not deploy without investigation

Default thresholds (Jaccard stability, higher is better):

Threshold Severity
>= 0.80 LOW
>= 0.50 MODERATE
>= 0.20 HIGH
< 0.20 CRITICAL
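The table above is a simple threshold cascade. A sketch with the default Jaccard values, plus the inverted comparison used for lower-is-better metrics such as the KS statistic (the function name classify is illustrative, not EDM's API):

```python
def classify(score: float, low: float = 0.80, moderate: float = 0.50,
             high: float = 0.20, higher_is_better: bool = True) -> str:
    # Default thresholds match the Jaccard table above.
    if higher_is_better:
        if score >= low:
            return "LOW"
        if score >= moderate:
            return "MODERATE"
        if score >= high:
            return "HIGH"
        return "CRITICAL"
    # Lower-is-better metrics (e.g. KS statistic) invert the comparison.
    if score <= low:
        return "LOW"
    if score <= moderate:
        return "MODERATE"
    if score <= high:
        return "HIGH"
    return "CRITICAL"
```

Applied to the quickstart output, a Jaccard stability of 0.7234 falls between 0.50 and 0.80, hence [MODERATE DRIFT].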

Configuration

Override thresholds via a TOML file named .edm.toml:

[thresholds.jaccard]
low = 0.85
moderate = 0.60
high = 0.30
higher_is_better = true

[thresholds.ks_statistic]
low = 0.05
moderate = 0.15
high = 0.30
higher_is_better = false

Pass it to any compare run:

edm compare ref.npy cand.npy --config .edm.toml

Output Formats

Text (default)

Human-readable. Sections separated by dividers. Suitable for terminal output and log files.

JSON

Machine-readable. Contains all metrics, per-metric severities, overall severity, and the recommendation. Use --include-per-point to add per-point score arrays (required for identify-regions).

JSON schema version: 4.0. Top-level keys: version, reference, candidate, corpus, overall_severity, metric_severities, metrics, regions, recommendation, warnings.

NaN values are serialized as null.
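Note that Python's json module emits the non-standard token NaN by default, so code producing or post-processing compatible JSON needs to substitute None first. One hedged way to reproduce the documented behaviour (nan_to_null is a hypothetical helper):

```python
import json
import math

def nan_to_null(obj):
    # Recursively replace non-finite floats (NaN, Inf) with None so that
    # json.dumps emits null, matching the serialization rule above.
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, list):
        return [nan_to_null(v) for v in obj]
    if isinstance(obj, dict):
        return {k: nan_to_null(v) for k, v in obj.items()}
    return obj

payload = {"scores": [0.9, float("nan"), 0.4]}
print(json.dumps(nan_to_null(payload)))   # {"scores": [0.9, null, 0.4]}
```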

Markdown

Full structured report suitable for GitHub, Notion, or documentation systems. Includes a metadata table, one section per metric, a drift regions table (if labels were provided), and a recommendation.


Performance Notes

Corpus size Dimensions Approx. time (all metrics)
1K items 768d < 5s
10K items 768d < 30s
100K items 768d < 5 min (with default --sample-size)

HDBSCAN cluster analysis is O(n^2) in memory. For corpora larger than ~23K items, EDM will warn that it may require > 4 GB RAM. Use --no-clusters to skip it.

Increasing --sample-size beyond 10,000 yields diminishing returns in statistical accuracy at significant computational cost. The default of 5,000 is appropriate for most corpora.
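Seeded sampling is what makes --seed reproducible: the same seed yields the same sampled pairs on every run. A hypothetical sketch of the idea (sample_pairs is illustrative, not EDM's internal routine):

```python
import numpy as np

def sample_pairs(n_items: int, sample_size: int = 5000, seed: int = 0) -> np.ndarray:
    # Draw index pairs with a seeded generator so repeated runs with the
    # same seed sample exactly the same pairs.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, n_items, size=sample_size)
    j = rng.integers(0, n_items, size=sample_size)
    keep = i != j                      # drop self-pairs
    return np.stack([i[keep], j[keep]], axis=1)
```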


Portfolio Context

EDM is the third tool in the datasculptures embedding space trilogy:

  • LLE — What structures exist in this embedding space? (Exploration)
  • RQB — When I reduce this space, how much structure is preserved? (Evaluation)
  • EDM — When the underlying model changes, how much structure shifts? (Monitoring)

datasculptures.com


License

MIT License — Copyright 2026 Sean Patrick Morris / datasculptures
