Skip to main content

Epistemic infrastructure for AI scientists

Project description

Mareforma

Python Tests PyPI License: MIT

AI scientists are being deployed on real research problems before any infrastructure exists to know which of their findings can be trusted. Observability tools record what an agent did — they do not record what it means, whether it converges with independent evidence, or how far a conclusion is from its raw data.

Mareforma is a local epistemic graph that accumulates findings across agent runs, detects convergence automatically when independent agents reach the same conclusion through different data paths, and exposes the full provenance chain so trust can be derived from structure rather than from self-reported confidence.

Every individual capability mareforma uses — Ed25519 signing, DSSE envelopes, Sigstore-Rekor transparency, GRADE-shaped evidence vectors, local SQLite — exists in mature form elsewhere. What's missing in the OSS landscape is the combination: a runtime, opt-in Python library that bundles them as the place an agent writes claims to. See ARCHITECTURE.md for the lane.

Mareforma records what the agent claimed and how — so silent pipeline failures, prior-knowledge fallbacks, and self-reported convergence become visible. Trust signals are derived from graph structure (whose key signed what, which upstreams are cited, how many independent agents wrote the same conclusion), not from the agent's own confidence. The substrate is honest about what those signals do and do not prove — see "What mareforma is NOT" below.

import mareforma

with mareforma.open() as graph:

    # Query what is already established before asserting
    prior = graph.query("topic X", min_support="REPLICATED")

    claim_id = graph.assert_claim(
        "Cell type A exhibits property X under condition Y (n=842, p<0.001)",
        classification="ANALYTICAL",
        generated_by="agent/model-a/lab_a",
        supports=[c["claim_id"] for c in prior],
    )
graph LR
    P(["ESTABLISHED upstream<br/>(prior literature)"]) --> A["ANALYTICAL · lab_a"]
    P --> B["ANALYTICAL · lab_b"]
    A --> R(["REPLICATED ✓"])
    B --> R
    R -->|"graph.validate()"| E(["ESTABLISHED ✓"])

    style P fill:#713f12,stroke:#f59e0b,color:#fde68a
    style A fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
    style B fill:#1e3a5f,stroke:#3b82f6,color:#93c5fd
    style R fill:#14532d,stroke:#22c55e,color:#86efac
    style E fill:#713f12,stroke:#f59e0b,color:#fde68a

REPLICATED requires that the converging claims share an ESTABLISHED upstream in supports[] — matches Cochrane/GRADE evidence-chain methodology. On a fresh graph, bootstrap an ESTABLISHED anchor with seed=True (enrolled validator only); see Example 03 for the full seed-then-converge pattern.

Findings contradict — both stay in the graph

# ESTABLISHED consensus sits in the graph
prior = graph.query("Treatment X", min_support="ESTABLISHED")

# New larger study gets a different result — don't discard it, document it
graph.assert_claim(
    "Treatment X shows no effect (n=1240, p=0.21) — larger and more diverse cohort",
    classification="ANALYTICAL",
    contradicts=[c["claim_id"] for c in prior],
)

Science advances by documented contestation, not by one side disappearing. Both claims coexist. A human reviewer sees the tension explicitly in the graph.

The infrastructure gap

The current generation of agent systems provides no principled way to distinguish:

  • a finding backed by data from one backed by LLM prior knowledge
  • genuine independent replication from two agents repeating each other
  • an established consensus from a single speculative assertion

Without that structure, every output looks like a result.

  • Tracing and observability tools record what the agent did. They do not record what it means, whether it can be trusted, or whether another agent already found the same thing by a different path.

  • One-shot pipelines evaporate findings between runs. There is no memory of what was established, no way to detect convergence, no accumulated graph. Each run starts from scratch.

  • Silent pipeline failures look like results. If the data pipeline returns null and the agent falls back to LLM prior knowledge, the finding looks identical to a data-driven one — unless you record the classification at assertion time.

Architecture

graph = mareforma.open()                                # zero setup, no init required
graph.assert_claim(text, classification="ANALYTICAL", supports=[...])
graph.query(text, min_support="REPLICATED")
graph.validate(claim_id, evidence_seen=[...])           # human promotes to ESTABLISHED, citing reviewed claims
graph.health()                                          # HealthReport: per-status + per-support-level counters, traffic light
graph.refresh_unresolved()                              # retry DOI verification for offline claims
graph.refresh_all_dois()                                # force-re-check every DOI (retraction drift)
graph.refresh_unsigned()                                # retry Rekor (replays from rekor_inclusions sidecar if present)
graph.refresh_convergence()                             # retry detection for claims whose post-INSERT promotion errored
graph.find_dangling_supports()                          # audit UUID refs that point nowhere
graph.classify_supports([...])                          # tag each entry as claim/doi/external
graph.get_tools(generated_by="agent/model-a/lab_a")     # framework-ready callables

External verification, opt-in by component:

  • DOIs in supports[]/contradicts[] are HEAD-checked against Crossref and DataCite at assertion time. Failed verifications hold the claim out of REPLICATED until refresh_unresolved() succeeds.
  • Cryptographic signing is opt-in. mareforma bootstrap once to generate an Ed25519 keypair; every claim is then signed and tamper-evident.
  • Sigstore-Rekor transparency log is opt-in via mareforma.open(rekor_url=mareforma.signing.PUBLIC_REKOR_URL). Signed claims are submitted to the log; the entry uuid + logIndex + raw response bytes are persisted to the local rekor_inclusions sidecar. Submission failures are retried by refresh_unsigned().
  • RFC 6962 inclusion-proof verification is opt-in via mareforma.open(rekor_log_pubkey_pem=...) (or rekor_log_pubkey_path=). When the log operator's public key is supplied, every signed-claim submit and every refresh_unsigned() re-fetches the entry from Rekor and cryptographically verifies the Merkle audit path against the log's signed checkpoint. Verification failure refuses to mark the row transparency_logged=1. The key is persisted to .mareforma/rekor_log_pubkey.pem as a TOFU pin — silent rotation is refused on subsequent opens. See the Rekor limits below for what verification does and does not prove.

Trust levels — derived from graph topology, never self-reported:

Level Meaning
PRELIMINARY One agent asserted it. The substrate has cryptographic provenance but no convergence signal yet.
REPLICATED ≥2 enrolled keys signed claims sharing an ESTABLISHED upstream with different generated_by strings. As strong as the writers' honesty about independence — the substrate detects that two strings differ, not that two physical agents do.
ESTABLISHED An enrolled human-typed key signed a validation envelope. Pass evidence_seen=[...] to bind the validator's review citations (every cited claim must exist and predate validation). Without evidence_seen, the envelope records that a human pressed a button; with it, the envelope records what they reviewed.

Classification — declared by the agent, records epistemic origin:

Value When
INFERRED LLM reasoning (default)
ANALYTICAL Deterministic analysis against source data
DERIVED Explicitly built on ESTABLISHED or REPLICATED claims

Trust level and classification are independent axes. An INFERRED + ESTABLISHED claim is possible (a human validated an LLM-prior-knowledge assertion); so is an ANALYTICAL + PRELIMINARY claim (a data-grounded finding with no convergence yet). Trust answers "how many parties witnessed this?"; classification answers "what did the agent compute it from?" Use both when querying: graph.query(text, min_support="REPLICATED", classification="ANALYTICAL").

Storage: local SQLite, WAL mode, ACID guarantees. Network calls only for opt-in external verification: DOI lookups (Crossref + DataCite) and Sigstore-Rekor transparency log.

Silent pipeline failures become visible

The reproduction-worthy use case mareforma was built for. A real AI scientist agent runs a multi-step analysis: query a public dataset, regress a gene's expression against a phenotype, return the top hit. The data lookup silently returns null because of a stale EFO id. The agent's LLM reasoning fills the gap with prior knowledge and returns a plausible-sounding answer. The output looks identical to a data-driven result.

finding_text = run_pipeline(target_gene, phenotype)

graph.assert_claim(
    finding_text,
    # The one line that breaks the symmetry: classification depends on
    # whether real data flowed through. The substrate doesn't compute
    # this — the agent's wrapper inspects the pipeline state and tells
    # the truth at assertion time.
    classification="ANALYTICAL" if generated_code_ran else "INFERRED",
    generated_by="agent/gpt-4o/lab_a",
    source_name="depmap_24q2" if data_actually_loaded else None,
)

Now a downstream consumer querying min_support="REPLICATED", classification="ANALYTICAL" excludes the silent-fallback rows. The hallucinated finding is in the graph (auditable, signed), but it's NOT in the trustworthy result set. The wrapper that picks "ANALYTICAL" vs "INFERRED" is doing the work — the substrate makes that work visible and tamper-evident.

Example 05 — Drug Target Provenance wraps MEDEA (a real AI scientist agent published on arXiv), reproduces a real silent-failure mode in its EFO id lookup, and shows the classification gate catching it.

What mareforma is NOT

Honest scope, so the design choices land in the right frame:

  • Not a global trust system. Trust is local to a project's enrolled validators. Two projects' ESTABLISHED claims are not comparable across installations — there is no federation layer, no cross-project key directory, no shared ground truth.
  • Classification is self-declared. ANALYTICAL, INFERRED, DERIVED are the asserter's claims about epistemic origin. The substrate signs them tamper-evidently but does not verify them against the actual code path. A misclassified claim is a trust failure of the asserter, not the substrate.
  • generated_by is self-declared too. It is the substrate's primary independence signal — REPLICATED fires only when two claims sharing an upstream have different generated_by values. But the substrate cannot verify that agent/claude/lab_a and agent/claude/lab_b correspond to different physical agents; both are free-form strings supplied by the caller. An adversary running both labs can produce REPLICATED at will. Honest agents produce a useful signal; the signal is no stronger than the discipline of the agents writing to the graph.
  • GRADE EvidenceVector is GRADE-shaped storage, not GRADE evaluation. Each claim carries a 5-domain vector (risk_of_bias, inconsistency, indirectness, imprecision, publication_bias) plus three upgrade flags, bound into the signed predicate. The substrate stores and round-trips these values; it does not derive a single GRADE certainty rating (high/moderate/low/very-low), does not gate upgrade flags on study design (RCT vs. observational), and does not integrate the vector into the PRELIMINARY → REPLICATED → ESTABLISHED ladder. Full GRADE-style certainty derivation is out of scope for the current release and on the deferred-features backlog.
  • Rekor inclusion verification is opt-in, not on by default. When rekor_url= is configured, signed claims are submitted to the log and the response — entry uuid, logIndex, integratedTime, and the raw body bytes — is persisted to the local rekor_inclusions sidecar. The submit-time response is cryptographically checked (the returned entry must record OUR hash + OUR signature; mismatch → submission treated as a failure), so a hostile log operator cannot get the substrate to accept an unrelated entry as ours. When the caller additionally supplies the log operator's public key (rekor_log_pubkey_pem= or rekor_log_pubkey_path=), the substrate re-fetches the entry and verifies the RFC 6962 Merkle inclusion proof against the log's signed checkpoint — closing the residual "the log forked or rewrote history after we submitted" gap on the submit and refresh_unsigned() paths. Without an explicit pubkey, the substrate trusts the log operator's submit-time response (persisted locally for future re-verification); the operator who needs strict transparency must opt in. Restore-time re-verification of stored proofs is on the deferred-features list (it needs the rekor_inclusions sidecar round-tripped through claims.toml).
  • DOIs are HEAD-checked, not content-verified. External references resolve to a 200 response; the substrate does not parse, sign, or archive the referenced content. A DOI that resolves today may resolve to different content tomorrow. The positive cache holds resolved DOIs for 30 days — operators worried about retraction drift can call graph.refresh_all_dois() to force-re-check every DOI against the registry and surface the diff.
  • No semantic deduplication. Two claims with different text but identical meaning are distinct rows. Convergence detection runs on supports[] topology, not on text similarity.
  • Dangling references in supports[] are accepted. The substrate does not refuse UUID-shaped strings that don't point to any existing claim — a supports entry could legitimately reference a claim from another project, a not-yet-asserted upstream, or a DOI. REPLICATED detection requires the referenced ESTABLISHED claim to actually exist + be open, so a dangling reference cannot trigger spurious promotion. Operators auditing a graph for integrity can call graph.find_dangling_supports() to surface every such hanging arrow.
  • Contradiction is per-claim, not propagated downstream. A signed contradiction marks the older of two referenced claims; claims that cited the now-invalidated one via supports[] are unaffected. Deliberate boundary — see ARCHITECTURE.md.
  • Verdict-issuer method labels are self-declared. When an enrolled issuer signs a replication_verdict with method="semantic-cluster", the substrate stores the label and binds it into the signature. It does not verify that the labeled method was actually run, that two semantic-cluster verdicts from different issuers used the same algorithm or threshold, or that the issuer's confidence dict is calibrated. Like generated_by, method labels are as strong as the issuer's honesty. The verdict-issuer table is a typed-string registry with cryptographic stapling, not an epistemic primitive on its own.
  • No automated fraud detection. The substrate refuses self-validation and gates LLM-typed validators on both promotion and contradiction paths, but it cannot detect colluding human validators, manufactured datasets, or fabricated DOIs.

Related work mareforma does not replace: W3C PROV-O / PROV-AGENT (W3C-recommended provenance vocabulary), FAIRSCAPE's Evidence Graph Ontology (EVI — research-evidence ontology at w3id.org/EVI, MIT- licensed, not a W3C deliverable), IETF SCITT (signed supply-chain transparency architecture, currently draft-ietf-scitt-architecture-22). Mareforma is a runtime substrate for an agent's working graph, not a publication-grade provenance record.

Get started

uv add mareforma
mareforma bootstrap            # one-time: generate Ed25519 signing key

mareforma bootstrap is optional. Without it, claims are stored unsigned. With it, every claim carries a tamper-evident signature and can be published to a Sigstore-Rekor transparency log on demand.

See AGENTS.md — execution contract, forbidden patterns, signing and transparency log, idempotency convention, generated_by requirements.

See ARCHITECTURE.md for the substrate design (rails not trains), CONTRIBUTING.md for the dev workflow, SECURITY.md for the threat model and disclosure channel, CHANGELOG.md for release notes.

Full documentation: https://docs.mareforma.com

Examples

Example What it shows
01 API Walkthrough Full API reference
02 Compounding Agents Findings accumulate across agent runs
03 Documented Contestation Agent challenges established consensus
04 Private Data, Public Findings Two labs share provenance without sharing data
05 Drug Target Provenance Real AI scientist with honest epistemic status

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mareforma-0.3.0.tar.gz (258.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mareforma-0.3.0-py3-none-any.whl (152.4 kB view details)

Uploaded Python 3

File details

Details for the file mareforma-0.3.0.tar.gz.

File metadata

  • Download URL: mareforma-0.3.0.tar.gz
  • Upload date:
  • Size: 258.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for mareforma-0.3.0.tar.gz
Algorithm Hash digest
SHA256 77496feb2113d3cd0f57ddd7ced436b424af32777ac5eaac3d5c432fab1a1f9e
MD5 e13e87397ec83691660a0d30f1b8516b
BLAKE2b-256 46e6c153199dea263bc3c0b465a19d3d2cd44fd9fecaf4e2d3cbe33b3eaf2d02

See more details on using hashes here.

Provenance

The following attestation bundles were made for mareforma-0.3.0.tar.gz:

Publisher: publish.yml on mareforma/mareforma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mareforma-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: mareforma-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 152.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for mareforma-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c73641fbe8f39d331a6d07446e532c8bc44ad20d78ce49d6ef5e8daaeb6e0616
MD5 3fd012649e5a0d14c6146d1b1cc98c7a
BLAKE2b-256 b83c738350238b8400a40036cf04719043df3d84993b7cffb02d64a82c75d511

See more details on using hashes here.

Provenance

The following attestation bundles were made for mareforma-0.3.0-py3-none-any.whl:

Publisher: publish.yml on mareforma/mareforma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page