VectorPin

Verifiable integrity for AI embedding stores.

License: Apache 2.0 | Python 3.11+ | Rust stable | Node 20+ | Status: alpha

Vector databases are the new soft underbelly of the AI stack. Models trust them. Agents query them. Compliance audits don't yet ask about them. VectorPin pins every embedding to its source content and the model that produced it, then continuously verifies the store has not been tampered with — including covert steganographic modifications invisible to traditional DLP.

Part of the ThirdKey Trust Stack, alongside Symbiont (policy-governed agent runtime) and SchemaPin (cryptographic tool verification).

Why this matters

Modern RAG systems convert sensitive content into high-dimensional vectors and store them in databases that:

  • Don't inspect what gets written
  • Don't verify integrity on read
  • Treat embeddings as opaque numerical artifacts

That's a giant attack surface. The companion VectorSmuggle research project demonstrates that an attacker with write access to a vector pipeline can hide arbitrary data inside embeddings using techniques that pass standard observability:

  • Noise injection, rotation, scaling, and offset perturbations
  • Cross-model fragmentation
  • Steganographic encoding that survives database quantization

Cryptographic pinning is the kill shot for these attacks. Each of the techniques above works by modifying the vector after the model produces it. If every vector ships with a signed attestation binding it to its source text and the producing model, any later modification changes the vector's hash and verification fails.
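To make the tamper-evidence argument concrete, here is a minimal standard-library sketch. It assumes the vector commitment is a SHA-256 over little-endian float32 bytes, as described under "What VectorPin guarantees". The helper names `vector_digest` and `flip_lowest_bit` are illustrative and are not part of the vectorpin API, and the Ed25519 signature that would cover the digest is left out.

```python
import hashlib
import struct


def vector_digest(vector):
    """SHA-256 over the vector packed as little-endian float32 (assumed layout)."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(raw).hexdigest()


def flip_lowest_bit(value):
    """Flip the least significant mantissa bit of a float32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (tampered,) = struct.unpack("<f", struct.pack("<I", bits ^ 1))
    return tampered


original = [0.25, -0.5, 0.125, 0.75]
pinned_digest = vector_digest(original)  # the value a signature would cover

tampered = list(original)
tampered[2] = flip_lowest_bit(tampered[2])  # smallest possible change to one component

print(vector_digest(original) == pinned_digest)  # True: untouched vector still matches
print(vector_digest(tampered) == pinned_digest)  # False: one flipped bit is detected
```

Even a single-bit change, far below what a similarity search would notice, yields a different digest, which is why a signed digest leaves no room for post-hoc encoding.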

Quick start

Python

pip install vectorpin

import numpy as np
from vectorpin import Signer, Verifier

# At ingestion time
signer = Signer.generate(key_id="prod-2026-05")
embedding = my_model.embed("The quick brown fox.")
pin = signer.pin(
    source="The quick brown fox.",
    model="text-embedding-3-large",
    vector=embedding,
)
# Store pin.to_json() alongside the embedding in your vector DB metadata.

# At read/audit time
verifier = Verifier({"prod-2026-05": signer.public_key_bytes()})
result = verifier.verify(pin, source="The quick brown fox.", vector=embedding)
if not result.ok:
    print(f"INTEGRITY FAILURE: {result.error.value}: {result.detail}")

Rust

# Cargo.toml
[dependencies]
vectorpin = "0.1"

use vectorpin::{Signer, Verifier};

let signer = Signer::generate("prod-2026-05".to_string());
let embedding: Vec<f32> = my_model_embed("The quick brown fox.");
let pin = signer.pin(
    "The quick brown fox.",
    "text-embedding-3-large",
    embedding.as_slice(),
)?;

let mut verifier = Verifier::new();
verifier.add_key(signer.key_id(), signer.public_key_bytes());

let result = verifier.verify_full::<&[f32]>(
    &pin,
    Some("The quick brown fox."),
    Some(embedding.as_slice()),
    None,
);
assert!(result.is_ok());

TypeScript / JavaScript

npm install vectorpin

import { Signer, Verifier } from 'vectorpin';

const signer = Signer.generate('prod-2026-05');
const embedding = new Float32Array(/* ... 3072 floats from your model ... */);
const pin = signer.pin({
  source: 'The quick brown fox.',
  model: 'text-embedding-3-large',
  vector: embedding,
});

const verifier = new Verifier({ [signer.keyId]: signer.publicKeyBytes() });
const result = verifier.verify(pin, {
  source: 'The quick brown fox.',
  vector: embedding,
});
if (!result.ok) throw new Error(`integrity failure: ${result.error}`);

The Python, Rust, and TypeScript implementations are byte-for-byte compatible. A pin produced by any of them verifies on the other two, enforced by shared test vectors at testvectors/v1.json consumed in all three test suites. The TS port is pure JavaScript via @noble/ed25519 and @noble/hashes, so it also runs in Deno, Bun, and edge runtimes.

What VectorPin guarantees

Each Pin commits to:

  • The source text, by SHA-256 of UTF-8 NFC-normalized bytes.
  • The model, by identifier (and optionally by content hash).
  • The vector itself, by SHA-256 of canonical little-endian bytes.
  • The producer, by Ed25519 signing key.
  • The time, by RFC 3339 timestamp.
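The two hash commitments can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: the float32 element type, hex-encoded digests, and the function names `source_digest` and `vector_digest` are all guesses, and docs/spec.md remains the authority on the wire format.

```python
import hashlib
import struct
import unicodedata


def source_digest(text: str) -> str:
    """SHA-256 of the NFC-normalized, UTF-8 encoded source text."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def vector_digest(vector) -> str:
    """SHA-256 of the vector packed as little-endian float32 (dtype is an assumption)."""
    return hashlib.sha256(struct.pack(f"<{len(vector)}f", *vector)).hexdigest()


# "é" as a single code point vs. "e" plus a combining accent:
# different raw bytes, identical NFC form.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed.encode("utf-8") == decomposed.encode("utf-8"))    # False
print(source_digest(composed) == source_digest(decomposed))      # True
print(len(vector_digest([1.0, 2.0, 3.0])))                       # 64 hex characters
```

Normalizing before hashing means visually identical text produces the same digest regardless of how the ingestion path composed its characters, and fixing the byte order means the vector digest does not depend on the host's endianness.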

Verification distinguishes failure modes so callers can route them differently:

| Outcome | Meaning |
| --- | --- |
| OK | Signature valid, vector intact, source matches. |
| SIGNATURE_INVALID | Pin was forged or re-signed by an attacker. |
| VECTOR_TAMPERED | Embedding modified after pinning. This is the steganography kill shot. |
| SOURCE_MISMATCH | Source text differs from what was pinned. |
| MODEL_MISMATCH | Pin was produced by a different embedding model than expected. |
| UNKNOWN_KEY | Pin signed by a key not in the verifier's registry. |
| SHAPE_MISMATCH / UNSUPPORTED_VERSION | Structural problems with the data. |
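As an illustration of routing outcomes differently, the sketch below maps each outcome in the table above to a handling policy. The `Outcome` enum here is a local stand-in rather than the library's own type, its string values are assumed, and the policy names (`serve`, `quarantine_and_alert`, and so on) are hypothetical choices an operator might make.

```python
from enum import Enum


class Outcome(Enum):
    """Local stand-in mirroring the outcome names in the table (values assumed)."""
    OK = "ok"
    SIGNATURE_INVALID = "signature_invalid"
    VECTOR_TAMPERED = "vector_tampered"
    SOURCE_MISMATCH = "source_mismatch"
    MODEL_MISMATCH = "model_mismatch"
    UNKNOWN_KEY = "unknown_key"
    SHAPE_MISMATCH = "shape_mismatch"
    UNSUPPORTED_VERSION = "unsupported_version"


# Hypothetical policy: security-relevant failures page someone,
# drift-like failures trigger re-embedding, structural ones are rejected.
ROUTES = {
    Outcome.OK: "serve",
    Outcome.SIGNATURE_INVALID: "quarantine_and_alert",
    Outcome.VECTOR_TAMPERED: "quarantine_and_alert",
    Outcome.SOURCE_MISMATCH: "re_embed",
    Outcome.MODEL_MISMATCH: "re_embed",
    Outcome.UNKNOWN_KEY: "check_key_registry",
    Outcome.SHAPE_MISMATCH: "reject",
    Outcome.UNSUPPORTED_VERSION: "reject",
}


def route(outcome: Outcome) -> str:
    """Return the handling policy for a verification outcome."""
    return ROUTES[outcome]


print(route(Outcome.VECTOR_TAMPERED))  # quarantine_and_alert
print(route(Outcome.MODEL_MISMATCH))   # re_embed
```

Keeping the mapping in one table makes it easy to review that every outcome has an explicit handler.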

CLI

# Generate a signing key pair
vectorpin keygen --key-id prod-2026-05 --output ./keys

# Pin a single (text, vector) pair (debug/demo)
vectorpin pin \
    --private-key ./keys/prod-2026-05.priv \
    --key-id prod-2026-05 \
    --model text-embedding-3-large \
    --source ./doc.txt \
    --vector ./embedding.npy

# Verify a pin
vectorpin verify-pin \
    --public-key ./keys/prod-2026-05.pub \
    --key-id prod-2026-05 \
    --pin ./pin.json \
    --source ./doc.txt \
    --vector ./embedding.npy

# Audit an entire Qdrant collection
vectorpin audit-qdrant \
    --url http://localhost:6333 \
    --collection my-rag \
    --public-key ./keys/prod-2026-05.pub \
    --key-id prod-2026-05

Vector store integrations

| Backend | Status | Install / notes |
| --- | --- | --- |
| LanceDB (default) | Alpha | pip install 'vectorpin[default]' |
| Chroma | Alpha | pip install 'vectorpin[chroma]' |
| Pinecone | Alpha | pip install 'vectorpin[pinecone]' |
| Qdrant | Alpha | pip install 'vectorpin[qdrant]' |
| pgvector | Planned | |
| FAISS | Planned | Use LanceDBAdapter (embedded, has metadata column natively). |

LanceDB is the recommended default: embedded, file-based, no daemon, with a typed schema column that holds the Pin natively — matching the Symbiont runtime's default vector backend. Choose Chroma or Pinecone if you already run those; Qdrant if you need server-side payload filtering.

For Symbiont deployments, the source text the embedding was produced from lives in Symbiont's content column (Symbiont's column literally named source is upstream provenance like a URL, not VectorPin's source argument). Pass source=record.metadata["content"] when calling signer.pin. See tests/test_adapter_lancedb_symbiont.py for an end-to-end example against the Symbiont schema.

from vectorpin import Signer, Verifier
from vectorpin.adapters import LanceDBAdapter

adapter = LanceDBAdapter.connect("./data/vector_db", "rag-corpus")
signer = Signer.generate(key_id="prod-2026-05")
verifier = Verifier(public_keys={signer.key_id: signer.public_key_bytes()})

# Replace "text" below with whichever column on your table holds
# the source text the embedding was produced from. On Symbiont's
# default schema, that column is named "content".
for record in adapter.iter_records():
    pin = signer.pin(
        source=record.metadata["text"],
        model="text-embedding-3-large",
        vector=record.vector,
    )
    adapter.attach_pin(record.id, pin)

The adapter protocol is intentionally thin; community contributions for new backends are welcome.
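For anyone considering a new backend, the sketch below shows roughly what a thin adapter could look like. Only `iter_records`, `attach_pin`, and the record fields `id`, `vector`, and `metadata` come from the example above; the `Record` dataclass, the `PinAdapter` protocol, the `InMemoryAdapter` class, and the `metadata["pin"]` storage key are hypothetical, and the real protocol may require more than this.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, Protocol


@dataclass
class Record:
    """Hypothetical record shape: an id, a vector, and free-form metadata."""
    id: str
    vector: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


class PinAdapter(Protocol):
    """Assumed minimal surface a backend would expose."""

    def iter_records(self) -> Iterator[Record]: ...

    def attach_pin(self, record_id: str, pin: Any) -> None: ...


class InMemoryAdapter:
    """Toy backend: keeps records in a dict and stores the pin in metadata."""

    def __init__(self, records: list[Record]) -> None:
        self._records = {r.id: r for r in records}

    def iter_records(self) -> Iterator[Record]:
        # Iterate over a snapshot so callers may attach pins while looping.
        return iter(list(self._records.values()))

    def attach_pin(self, record_id: str, pin: Any) -> None:
        self._records[record_id].metadata["pin"] = pin


adapter = InMemoryAdapter([Record("doc-1", [0.1, 0.2], {"text": "hello"})])
for record in adapter.iter_records():
    # A real caller would pass the object returned by signer.pin(...).
    adapter.attach_pin(record.id, {"placeholder": "pin JSON goes here"})

print(next(adapter.iter_records()).metadata["pin"])
```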

Performance

Pinning and verification are sub-millisecond per vector on commodity hardware, well below the embedding-model latency they sit alongside. Microbenchmarks for the Rust and Python implementations live at rust/vectorpin/benches/perf.rs (criterion) and scripts/bench_python.py (time.perf_counter_ns).

# Rust (criterion writes a report to target/criterion/)
cd rust && cargo bench --bench perf

# Python (standalone, no extra deps)
python scripts/bench_python.py --iters 5000

Indicative numbers on a modern x86_64 laptop, 3072-dim vectors (matching text-embedding-3-large):

| Operation | Rust (µs) | Python (µs) |
| --- | --- | --- |
| hash_vector | 6.4 | 5.8 |
| sign (pin) | 35 | 35 |
| verify_full | 42 | 79 |
| verify_signature_only | 22 | 75 |

Re-run on your own hardware before quoting numbers.
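For a quick sanity check without the repository, a rough timing of the hashing step alone can be sketched as follows. This is not scripts/bench_python.py: it times only SHA-256 over a pre-packed 3072-dimensional float32 buffer, and the names `DIM`, `ITERS`, and `time_hash` are illustrative.

```python
import hashlib
import random
import struct
import time

DIM = 3072    # matches text-embedding-3-large
ITERS = 2000  # arbitrary iteration count for this sketch

# Deterministic pseudo-random vector, packed once as little-endian float32.
rng = random.Random(0)
raw = struct.pack(f"<{DIM}f", *(rng.uniform(-1.0, 1.0) for _ in range(DIM)))


def time_hash(payload: bytes, iters: int) -> float:
    """Return mean microseconds per SHA-256 over `payload`."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        hashlib.sha256(payload).digest()
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iters / 1000.0


print(f"sha256 over {len(raw)} bytes: {time_hash(raw, ITERS):.2f} µs/op")
```

The result excludes array conversion and signing, so expect it to differ from the table above.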

Statistical detectors

Pinning catches modifications made after a vector is pinned. Detectors catch what pinning cannot: ingestion-time tampering and poisoning campaigns that inject new, already-tampered vectors. The two are complementary defenses:

from vectorpin.detectors.isolation_forest import IsolationForestDetector

detector = IsolationForestDetector().fit(clean_embeddings)
flagged = detector.decide(suspect_embeddings)

In the VectorSmuggle empirical study, this single line of defense flagged every operating point of every distribution-shifting steganographic technique that hides a non-trivial amount of data — but it does not catch orthogonal rotation (which preserves every density feature the detector fits on) and is brittle against attackers who know the detector. Cryptographic pinning is the durable layer; statistical detection is defense-in-depth.
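The contrast between the two layers can be illustrated with a small standard-library sketch. It rotates a vector in one coordinate plane (a Givens rotation, which is orthogonal), then checks the vector's length and its digest. The length stands in for the kind of feature a density-based detector might rely on; the features the actual detector uses are not specified here, and the helper names are illustrative.

```python
import hashlib
import math
import struct


def vector_digest(vector):
    """SHA-256 over little-endian float32 bytes (assumed layout)."""
    return hashlib.sha256(struct.pack(f"<{len(vector)}f", *vector)).hexdigest()


def rotate_plane(vector, i, j, angle):
    """Apply a Givens rotation in the (i, j) coordinate plane."""
    out = list(vector)
    c, s = math.cos(angle), math.sin(angle)
    out[i] = c * vector[i] - s * vector[j]
    out[j] = s * vector[i] + c * vector[j]
    return out


def norm(vector):
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in vector))


original = [0.3, 0.4, 0.5, 0.6]
rotated = rotate_plane(original, 0, 1, angle=0.01)

print(math.isclose(norm(original), norm(rotated)))        # True: length preserved
print(vector_digest(original) == vector_digest(rotated))  # False: digest changed
```

A rotation leaves the length untouched, so a statistic built on it has nothing to flag, while the digest comparison fails immediately.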

Threat model

VectorPin is designed against an attacker who can:

  • Modify vectors after they are produced (via a poisoned ingestion pipeline, a compromised vector DB, or backup-level access)
  • See the public verification key, but not the private signing key
  • Replay or selectively delete pins

VectorPin does not defend against:

  • An attacker with the private signing key (out of scope; key custody is the user's responsibility)
  • An attacker who modifies the source documents before embedding (use upstream content integrity controls)
  • An attacker who uses a legitimate signing key to attest a malicious vector at ingestion time (use upstream input validation)

Status

Alpha (v0.1). Core protocol (Pin, Signer, Verifier) is stable and tested. The Python, Rust, and TypeScript ports are byte-for-byte compatible and locked together by shared test vectors in CI. Adapter coverage is partial. Hosted attestation service is not yet available.

The protocol version field (v: 1) lets future revisions break compatibility cleanly. We will not break existing pins without bumping the major version. See docs/spec.md for the wire-format specification.

Citation

If you reference VectorPin or the threat model it defends against, please cite the companion preprint:

Wanger, J. (2026). VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense. Zenodo. https://doi.org/10.5281/zenodo.20058256

@misc{wanger2026vectorsmuggle,
  title  = {{VectorSmuggle}: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense},
  author = {Wanger, Jascha},
  year   = {2026},
  publisher = {Zenodo},
  doi    = {10.5281/zenodo.20058256},
  url    = {https://doi.org/10.5281/zenodo.20058256}
}

Related work

  • VectorSmuggle — companion threat-research project demonstrating the attacks VectorPin defends against. Empirical results in the linked Zenodo preprint.
  • Symbiont — policy-governed agent runtime; consumes VectorPin attestations to enforce "agents may only retrieve from verified vector stores."
  • SchemaPin — sister project doing the same kind of cryptographic provenance for tool schemas in MCP.
  • sigstore — inspired our approach to OSS-friendly cryptographic provenance.

Contributing

Issues and PRs welcome. For security-sensitive findings, please email security@thirdkey.ai rather than filing public issues.

License

Apache 2.0. See LICENSE.
