Skip to main content

PQC-native Merkle-tree commitments for AI training datasets. Prove what a model was trained on without revealing the data. SHA3-256 tree + ML-DSA signatures.

Project description

PQC Training Data Transparency

PQC Native Merkle SHA3-256 ML-DSA License Version

Cryptographic transparency for AI training data. Build an SHA3-256 Merkle tree over every record in your training set, sign the root with ML-DSA (FIPS 204), and publish it. Anyone who holds a single document can later receive an O(log n) inclusion proof showing that the record was in the training set — without revealing any of the other records. The audit trail survives the transition to post-quantum cryptography, so commitments made today remain verifiable in 2035 and beyond.

The Problem

AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: what exactly was used to train this model? Model creators today have no cryptographic answer.

  • "Prove this document was NOT in your training set" — requires revealing the entire training set (impossible for proprietary or licensed data).
  • "Prove your model wasn't trained on PII" — requires deleting, then proving a negative.
  • "Which records were used for fine-tune v2 vs v3?" — no binding commitment exists, so claims are unfalsifiable.

And the few audit trails that do exist are typically RSA- or ECDSA-signed. A cryptographically relevant quantum computer breaks those signatures, and the entire audit chain collapses retroactively. Training data provenance has a 15-20 year shelf life; the crypto under it must survive that long.

The Solution

Commit once, prove selectively:

  • Hash every record into a leaf: SHA3-256(content || canonical(metadata)).
  • Build an SHA3-256 Merkle tree over the leaves.
  • Wrap the root in a TrainingCommitment (dataset name, version, record count, timestamps, licenses, tags).
  • Sign the canonical commitment with ML-DSA at model-release time.
  • Publish the commitment anywhere — on-chain, in a transparency log, on quantamrkt.com, stapled to the model card.

Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (log₂(n) sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.

Installation

pip install pqc-training-data-transparency

Development:

pip install -e ".[dev]"

Quick Start

Build and sign a commitment

from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, DataRecord,
)

identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))

# Publish commitment.to_json() — this is the public audit artifact.

Prove a single record is in the training set

# Auditor holds only one specific record + the public commitment.
proof = builder.tree.inclusion_proof(index=42)
result = CommitmentVerifier.verify(corpus[42], proof, commitment)

assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True

Detect a forged inclusion claim

forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0)  # hijack a real slot

result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified            # rejected
# result.error: "record leaf_hash ... does not match proof ..."

Architecture

  Training Pipeline (creator)                        Audit Path (third party)
  --------------------------                         ------------------------
                                                                |
  records = [doc1, doc2, ..., docN]                             |
         |                                                      |
         | 1. leaf_hash = SHA3-256(                              |
         |       SHA3-256(content) || canonical_json(metadata)) |
         v                                                      |
  [leaf_1, leaf_2, ..., leaf_N]                                 |
         |                                                      |
         | 2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)      |
         v                                                      |
       ROOT                                                     |
         |                                                      |
         | 3. wrap in TrainingCommitment                        |
         |    (id, dataset, version, created_at, ...)           |
         |                                                      |
         | 4. ML-DSA.sign(canonical(commitment))                |
         v                                                      |
  SIGNED COMMITMENT  -->  published (on-chain, log, model card) |
                                                                |
                                                                | 5. request
                                                                |    inclusion
                                                                |    proof for
                                                                |    record R
                                                                v
                          InclusionProof (leaf, siblings, dirs, root)
                                                                |
                                                                | 6. verify:
                                                                |    ML-DSA(commitment) OK?
                                                                |    leaf_hash(R) == proof.leaf?
                                                                |    walk siblings -> root?
                                                                |    proof.root == commitment.root?
                                                                v
                                                         VerificationResult
                                                         (fully_verified T/F)

Threat Model

Threat Handled Notes
Forged inclusion claim (attacker claims doc X is in the set) Yes Verifier recomputes leaf_hash(X) and compares to the proof; walk to root fails or mismatches.
Tampered commitment signature (attacker edits dataset_name, record_count, root) Yes Canonical bytes change, ML-DSA signature no longer verifies.
Tampered inclusion proof (attacker flips a sibling hash) Yes Root recomputation diverges from signed root.
Quantum forgery in 2035+ (CRQC forges the audit trail retroactively) Yes ML-DSA is a FIPS 204 post-quantum signature; not broken by Shor/Grover.
Proving NON-inclusion (prove a record was not in training) No Requires a sorted-tree / Verkle construction. Future work.
Revealing private training data No (by design) Commitment contains only the root; proofs reveal log₂(n) sibling hashes, never other records. The creator decides what to reveal.
Selective disclosure of metadata fields No A record's metadata is fully inside its leaf. Hashing over metadata is all-or-nothing; carve out separate fields into the leaf if you need partial reveals.
Re-publication of old commitment (attacker re-uses prior root for a new model release) Partial commitment_id + dataset_version + created_at are all signed; enforce freshness by policy.

API Reference

DataRecord

Frozen dataclass. One training example.

Field / Method Description
content: bytes Raw record payload (doc text, image bytes, serialized row, ...).
metadata: dict Arbitrary metadata — participates in the leaf hash.
canonical_bytes() Deterministic `SHA3-256(content)
leaf_hash() -> RecordHash SHA3-256 of canonical bytes — the Merkle leaf value.
to_dict() Safe serialization. Does not include raw content.

MerkleTree

SHA3-256 Merkle tree with RFC6962-style odd-node promotion.

Method Description
add(leaf_hash) / add_many(leaves) Append leaves.
root() -> str Hex Merkle root. Raises EmptyTreeError for empty trees.
inclusion_proof(index) -> InclusionProof O(log n) proof for leaf at index.
MerkleTree.verify_inclusion(proof) -> bool Static verification — independent of tree state.

InclusionProof

Frozen dataclass carried from prover to verifier.

Field Description
leaf_hash Hex of the leaf being proven.
index, tree_size Position and total size at time of proof.
root Hex root the prover claims.
siblings, directions log₂(n) sibling hashes + 'L'/'R' flags.

TrainingCommitment

The signed audit artifact.

Field Description
commitment_id urn:pqc-td:<uuid>.
dataset_name, dataset_version, description Human-readable identification.
record_count, root Cryptographic binding to the tree.
created_at, licenses, tags, extra Provenance metadata — all signed.
signer_did, algorithm, signature, public_key, signed_at ML-DSA signature block (populated by CommitmentSigner.sign).
to_json() / from_json() Network-safe round-trip.
canonical_bytes() Deterministic JSON covered by the signature.

CommitmentBuilder

Accumulator for records, emits an unsigned TrainingCommitment.

Method Description
CommitmentBuilder(dataset_name, dataset_version) Start a build.
add_record(record) / add_records(records) Queue records.
add_leaf_hash_hex(hex) Direct-add when caller pre-hashed the data.
build(description="") -> TrainingCommitment Produce unsigned commitment.
.tree Underlying MerkleTree — use to generate inclusion proofs later.

CommitmentSigner

ML-DSA sign + verify.

Method Description
CommitmentSigner(identity) Wrap a QuantumShield AgentIdentity.
sign(commitment) -> TrainingCommitment Populate signature fields.
CommitmentSigner.verify(commitment) -> bool Static — verify signature against embedded public key.

CommitmentVerifier + VerificationResult

End-to-end check of (record, proof, commitment).

Method Description
CommitmentVerifier.verify(record, proof, commitment) Returns a VerificationResult.
CommitmentVerifier.verify_or_raise(...) Raises CommitmentVerificationError on any failure.

VerificationResult fields: signature_valid, proof_valid, leaf_matches_record, commitment_id, record_leaf_hash, claimed_root, error, and the fully_verified property.

Exceptions

Exception When
TrainingDataError Base class.
EmptyTreeError Tree operation requires at least one leaf.
InclusionProofError Malformed or unverifiable proof.
CommitmentVerificationError Raised by verify_or_raise on failure.
IndexOutOfRangeError Leaf index outside [0, size).

Why PQC for Training Data

Training data provenance is a 15-to-20-year commitment:

  • Regulatory discovery can ask about training data decades after the model was released.
  • Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
  • Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.

A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer exists. An adversary with a CRQC can retroactively forge arbitrary "signed commitments" and "inclusion proofs", collapsing the entire audit trail.

ML-DSA (FIPS 204) is not broken by Shor's algorithm. Commitments minted today remain verifiable through the post-quantum transition.

Examples

See the examples/ directory:

  • commit_corpus.py — build a signed commitment over a small training corpus.
  • prove_inclusion.py — produce and verify an O(log n) inclusion proof.
  • detect_false_inclusion_claim.py — demonstrate rejection of a forged "my data was in training" claim.

Run them:

python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py

Development

pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/

Related

Part of the QuantaMrkt post-quantum tooling registry. See also:

  • QuantumShield — the PQC toolkit (AgentIdentity, SignatureAlgorithm, sign/verify).
  • PQC RAG Signing — sister tool for signing RAG corpus chunks with ML-DSA.
  • PQC Content Provenance — signed manifests for content authenticity.
  • PQC MCP Transport — signed JSON-RPC transport for Model Context Protocol.

License

Apache License 2.0. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pqc_training_data_transparency-0.1.0.tar.gz (18.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pqc_training_data_transparency-0.1.0-py3-none-any.whl (17.0 kB view details)

Uploaded Python 3

File details

Details for the file pqc_training_data_transparency-0.1.0.tar.gz.

File metadata

File hashes

Hashes for pqc_training_data_transparency-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2a41b821683a7cacc5774dc97f6b4c9e9d9c96ed417cbb7f261a4adb5b82e375
MD5 dcebfce3654305210d09bc85f6c04cb6
BLAKE2b-256 c78260b399cbb40c2029012eca0cb737ad681919c90e87e6863ac301fbb45796

See more details on using hashes here.

File details

Details for the file pqc_training_data_transparency-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pqc_training_data_transparency-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f50c77d9d14a7edb9accc30c4da2bf19e12d467227a87bb5252495b68c4fd1d4
MD5 f22779ae20493c897b66cc183cfe3066
BLAKE2b-256 20b2133fd67c0ebaf1de99d8f31ea18997c77a72444461fbd02a230bfb0596d5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page