PQC-native Merkle-tree commitments for AI training datasets. Prove what a model was trained on without revealing the data. SHA3-256 tree + ML-DSA signatures.
PQC Training Data Transparency
Cryptographic transparency for AI training data. Build an SHA3-256 Merkle tree over every record in your training set, sign the root with ML-DSA (FIPS 204), and publish it. Anyone who holds a single document can later receive an O(log n) inclusion proof showing that the record was in the training set — without revealing any of the other records. The audit trail survives the transition to post-quantum cryptography, so commitments made today remain verifiable in 2035 and beyond.
The Problem
AI copyright litigation, regulatory audits, and red-team requests keep asking the same question: what exactly was used to train this model? Model creators today have no cryptographic answer.
- "Prove this document was NOT in your training set" — requires revealing the entire training set (impossible for proprietary or licensed data).
- "Prove your model wasn't trained on PII" — requires deleting, then proving a negative.
- "Which records were used for fine-tune v2 vs v3?" — no binding commitment exists, so claims are unfalsifiable.
And the few audit trails that do exist are typically RSA- or ECDSA-signed. A cryptographically relevant quantum computer breaks those signatures, and the entire audit chain collapses retroactively. Training data provenance has a 15-20 year shelf life; the crypto under it must survive that long.
The Solution
Commit once, prove selectively:
- Hash every record into a leaf: `SHA3-256(SHA3-256(content) || canonical(metadata))`.
- Build an SHA3-256 Merkle tree over the leaves.
- Wrap the root in a `TrainingCommitment` (dataset name, version, record count, timestamps, licenses, tags).
- Sign the canonical commitment with ML-DSA at model-release time.
- Publish the commitment anywhere — on-chain, in a transparency log, on quantamrkt.com, stapled to the model card.
Later, anyone can ask "was record X in the training set?" The creator returns an inclusion proof (log₂(n) sibling hashes). The verifier checks the proof against the signed root. No other record is revealed.
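The verifier-side walk can be sketched in a few lines of plain `hashlib`. This is illustrative, not the package API: it assumes internal nodes hash as SHA3-256(0x01 || left || right), matching the 0x00/0x01 domain separation noted in the architecture section, and the function names here are invented for the sketch.

```python
import hashlib

def node(left: bytes, right: bytes) -> bytes:
    # Internal node with 0x01 domain-separation prefix (assumed layout).
    return hashlib.sha3_256(b"\x01" + left + right).digest()

def walk_to_root(leaf: bytes, siblings: list[bytes], directions: list[str]) -> bytes:
    """Recompute the root from a leaf plus its sibling path."""
    current = leaf
    for sib, d in zip(siblings, directions):
        # 'L' means the sibling sits to the left of the current node.
        current = node(sib, current) if d == "L" else node(current, sib)
    return current
```

The verifier accepts only if the recomputed root equals the root inside the ML-DSA-signed commitment; a flipped sibling or a substituted leaf diverges at the first hash.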
Installation
pip install pqc-training-data-transparency
Development:
pip install -e ".[dev]"
Quick Start
Build and sign a commitment
from quantumshield import AgentIdentity
from pqc_training_data import (
    CommitmentBuilder, CommitmentSigner, CommitmentVerifier, DataRecord,
)

identity = AgentIdentity.create("model-creator")
signer = CommitmentSigner(identity)

corpus = [
    DataRecord(content=doc_bytes, metadata={"source": "internal", "id": i})
    for i, doc_bytes in enumerate(your_documents)
]

builder = CommitmentBuilder(dataset_name="model-v1-train", dataset_version="1.0.0")
builder.add_records(corpus)
builder.licenses = ["cc-by-4.0"]
builder.tags = ["production"]

commitment = signer.sign(builder.build(description="Production training set"))
# Publish commitment.to_json() — this is the public audit artifact.
Prove a single record is in the training set
# Auditor holds only one specific record + the public commitment.
proof = builder.tree.inclusion_proof(index=42)
result = CommitmentVerifier.verify(corpus[42], proof, commitment)
assert result.fully_verified
# -> signature_valid=True, proof_valid=True, leaf_matches_record=True
Detect a forged inclusion claim
forged = DataRecord(content=b"never-in-training", metadata={"id": 999})
pretend_proof = builder.tree.inclusion_proof(index=0) # hijack a real slot
result = CommitmentVerifier.verify(forged, pretend_proof, commitment)
assert not result.fully_verified # rejected
# result.error: "record leaf_hash ... does not match proof ..."
Architecture
Training Pipeline (creator)                  Audit Path (third party)
--------------------------                   ------------------------

records = [doc1, doc2, ..., docN]
    |
    | 1. leaf_hash = SHA3-256(
    |      SHA3-256(content) || canonical_json(metadata))
    v
[leaf_1, leaf_2, ..., leaf_N]
    |
    | 2. Merkle fold (SHA3-256, 0x00/0x01 domain sep)
    v
  ROOT
    |
    | 3. wrap in TrainingCommitment
    |    (id, dataset, version, created_at, ...)
    |
    | 4. ML-DSA.sign(canonical(commitment))
    v
SIGNED COMMITMENT --> published (on-chain, log, model card)
                                                 |
                                                 | 5. request inclusion
                                                 |    proof for record R
                                                 v
                                   InclusionProof (leaf, siblings, dirs, root)
                                                 |
                                                 | 6. verify:
                                                 |    ML-DSA(commitment) OK?
                                                 |    leaf_hash(R) == proof.leaf?
                                                 |    walk siblings -> root?
                                                 |    proof.root == commitment.root?
                                                 v
                                   VerificationResult (fully_verified T/F)
Threat Model
| Threat | Handled | Notes |
|---|---|---|
| Forged inclusion claim (attacker claims doc X is in the set) | Yes | Verifier recomputes leaf_hash(X) and compares to the proof; walk to root fails or mismatches. |
| Tampered commitment signature (attacker edits dataset_name, record_count, root) | Yes | Canonical bytes change, ML-DSA signature no longer verifies. |
| Tampered inclusion proof (attacker flips a sibling hash) | Yes | Root recomputation diverges from signed root. |
| Quantum forgery in 2035+ (CRQC forges the audit trail retroactively) | Yes | ML-DSA is a FIPS 204 post-quantum signature; not broken by Shor/Grover. |
| Proving NON-inclusion (prove a record was not in training) | No | Requires a sorted-tree / Verkle construction. Future work. |
| Revealing private training data | No (by design) | Commitment contains only the root; proofs reveal log₂(n) sibling hashes, never other records. The creator decides what to reveal. |
| Selective disclosure of metadata fields | No | A record's metadata is fully inside its leaf. Hashing over metadata is all-or-nothing; carve out separate fields into the leaf if you need partial reveals. |
| Re-publication of old commitment (attacker re-uses prior root for a new model release) | Partial | commitment_id + dataset_version + created_at are all signed; enforce freshness by policy. |
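The "enforce freshness by policy" mitigation in the last row might look like the following sketch. The ISO-8601 `created_at` string mirrors the commitment schema, but the field format and the 90-day window are assumptions chosen for illustration, not package behavior.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(created_at_iso: str, max_age_days: int = 90) -> bool:
    """Reject commitments older than a deployment-policy window."""
    created = datetime.fromisoformat(created_at_iso)
    if created.tzinfo is None:
        # Treat naive timestamps as UTC (policy assumption).
        created = created.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - created <= timedelta(days=max_age_days)
```

A verifier would additionally pin the expected `dataset_version` and `commitment_id` for a given model release, since all three fields are covered by the signature.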
API Reference
DataRecord
Frozen dataclass. One training example.
| Field / Method | Description |
|---|---|
| `content: bytes` | Raw record payload (doc text, image bytes, serialized row, ...). |
| `metadata: dict` | Arbitrary metadata — participates in the leaf hash. |
| `canonical_bytes()` | Deterministic `SHA3-256(content) \|\| canonical_json(metadata)` bytes. |
| `leaf_hash() -> RecordHash` | SHA3-256 of canonical bytes — the Merkle leaf value. |
| `to_dict()` | Safe serialization. Does not include raw content. |
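As a rough illustration of the hashing just described — not the actual implementation — a leaf hash could be computed like this, assuming `canonical_json` means sorted-key, compact-separator JSON:

```python
import hashlib
import json

def canonical_json(metadata: dict) -> bytes:
    # Deterministic JSON: sorted keys, no whitespace (assumed convention).
    return json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode()

def leaf_hash(content: bytes, metadata: dict) -> str:
    inner = hashlib.sha3_256(content).digest()     # hash of the raw payload
    canonical = inner + canonical_json(metadata)   # cf. canonical_bytes()
    return hashlib.sha3_256(canonical).hexdigest() # the Merkle leaf value
```

The point of the canonical form is that two parties who hold the same record and metadata always derive the same leaf, regardless of dict insertion order.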
MerkleTree
SHA3-256 Merkle tree with RFC6962-style odd-node promotion.
| Method | Description |
|---|---|
| `add(leaf_hash)` / `add_many(leaves)` | Append leaves. |
| `root() -> str` | Hex Merkle root. Raises `EmptyTreeError` for empty trees. |
| `inclusion_proof(index) -> InclusionProof` | O(log n) proof for leaf at index. |
| `MerkleTree.verify_inclusion(proof) -> bool` | Static verification — independent of tree state. |
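A minimal sketch of the fold, assuming RFC 6962-style domain prefixes (0x00 for leaves, 0x01 for internal nodes) and promotion of an unpaired last node to the next level, as the "odd-node promotion" note suggests; the package's exact byte layout may differ.

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("empty tree has no root")  # cf. EmptyTreeError
    # Leaf level: 0x00 prefix separates leaves from internal nodes.
    level = [hashlib.sha3_256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # Internal node: 0x01 prefix over the concatenated children.
            nxt.append(hashlib.sha3_256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired last node promoted unchanged
        level = nxt
    return level[0]
```

The domain separation matters: without distinct leaf/node prefixes, an attacker could present an internal node as a "leaf" and forge inclusion of data that was never a record.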
InclusionProof
Frozen dataclass carried from prover to verifier.
| Field | Description |
|---|---|
| `leaf_hash` | Hex of the leaf being proven. |
| `index`, `tree_size` | Position and total size at time of proof. |
| `root` | Hex root the prover claims. |
| `siblings`, `directions` | log₂(n) sibling hashes + `'L'`/`'R'` flags. |
TrainingCommitment
The signed audit artifact.
| Field | Description |
|---|---|
| `commitment_id` | `urn:pqc-td:<uuid>`. |
| `dataset_name`, `dataset_version`, `description` | Human-readable identification. |
| `record_count`, `root` | Cryptographic binding to the tree. |
| `created_at`, `licenses`, `tags`, `extra` | Provenance metadata — all signed. |
| `signer_did`, `algorithm`, `signature`, `public_key`, `signed_at` | ML-DSA signature block (populated by `CommitmentSigner.sign`). |
| `to_json()` / `from_json()` | Network-safe round-trip. |
| `canonical_bytes()` | Deterministic JSON covered by the signature. |
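A `canonical_bytes()` of this kind typically resembles the sketch below: sorted-key, compact-separator JSON with the signature fields excluded, so signing and verifying hash the same bytes. Which fields are excluded is an assumption here, not the package's documented schema.

```python
import json

# Fields populated at signing time and therefore excluded from the
# bytes the signature covers (assumed field set).
SIGNATURE_FIELDS = {"signature", "signed_at"}

def canonical_bytes(commitment: dict) -> bytes:
    body = {k: v for k, v in commitment.items() if k not in SIGNATURE_FIELDS}
    # Deterministic serialization: key order and whitespace are fixed,
    # so any two honest serializers produce identical bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
```

This is why the "tampered commitment" row of the threat model holds: editing `dataset_name` or `root` changes the canonical bytes, and the ML-DSA verification fails.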
CommitmentBuilder
Accumulator for records, emits an unsigned TrainingCommitment.
| Method | Description |
|---|---|
| `CommitmentBuilder(dataset_name, dataset_version)` | Start a build. |
| `add_record(record)` / `add_records(records)` | Queue records. |
| `add_leaf_hash_hex(hex)` | Direct-add when caller pre-hashed the data. |
| `build(description="") -> TrainingCommitment` | Produce unsigned commitment. |
| `.tree` | Underlying `MerkleTree` — use to generate inclusion proofs later. |
CommitmentSigner
ML-DSA sign + verify.
| Method | Description |
|---|---|
| `CommitmentSigner(identity)` | Wrap a QuantumShield `AgentIdentity`. |
| `sign(commitment) -> TrainingCommitment` | Populate signature fields. |
| `CommitmentSigner.verify(commitment) -> bool` | Static — verify signature against embedded public key. |
CommitmentVerifier + VerificationResult
End-to-end check of (record, proof, commitment).
| Method | Description |
|---|---|
| `CommitmentVerifier.verify(record, proof, commitment)` | Returns a `VerificationResult`. |
| `CommitmentVerifier.verify_or_raise(...)` | Raises `CommitmentVerificationError` on any failure. |
VerificationResult fields: signature_valid, proof_valid, leaf_matches_record, commitment_id, record_leaf_hash, claimed_root, error, and the fully_verified property.
Exceptions
| Exception | When |
|---|---|
| `TrainingDataError` | Base class. |
| `EmptyTreeError` | Tree operation requires at least one leaf. |
| `InclusionProofError` | Malformed or unverifiable proof. |
| `CommitmentVerificationError` | Raised by `verify_or_raise` on failure. |
| `IndexOutOfRangeError` | Leaf index outside `[0, size)`. |
Why PQC for Training Data
Training data provenance is a 15-to-20-year commitment:
- Regulatory discovery can ask about training data decades after the model was released.
- Copyright plaintiffs litigate on timelines that long outlive a model's commercial life.
- Medical, legal, and financial AI systems are audited for the lifetime of the decisions they influenced.
A Merkle commitment signed today with RSA-2048 or ECDSA-P256 becomes forgeable the moment a cryptographically relevant quantum computer exists. An adversary with a CRQC can retroactively forge arbitrary "signed commitments" and "inclusion proofs", collapsing the entire audit trail.
ML-DSA (FIPS 204) is not broken by Shor's algorithm. Commitments minted today remain verifiable through the post-quantum transition.
Examples
See the examples/ directory:
- `commit_corpus.py` — build a signed commitment over a small training corpus.
- `prove_inclusion.py` — produce and verify an O(log n) inclusion proof.
- `detect_false_inclusion_claim.py` — demonstrate rejection of a forged "my data was in training" claim.
Run them:
python examples/commit_corpus.py
python examples/prove_inclusion.py
python examples/detect_false_inclusion_claim.py
Development
pip install -e ".[dev]"
pytest
ruff check src/ tests/ examples/
Related
Part of the QuantaMrkt post-quantum tooling registry. See also:
- QuantumShield — the PQC toolkit (`AgentIdentity`, `SignatureAlgorithm`, sign/verify).
- PQC RAG Signing — sister tool for signing RAG corpus chunks with ML-DSA.
- PQC Content Provenance — signed manifests for content authenticity.
- PQC MCP Transport — signed JSON-RPC transport for Model Context Protocol.
License
Apache License 2.0. See LICENSE.