Verifiable knowledge graph for scientific experiments

SciTeX Clew (`scitex-clew`)

Full Documentation · uv pip install scitex-clew[all]

Problem

Scientific publications are growing exponentially, accelerated by LLM-assisted writing, yet peer review remains a manual bottleneck. Meanwhile, 70% of researchers report failed replication attempts, and only 11-36% of high-profile findings are successfully reproduced. Existing tools (pre-registration, containerization, workflow managers) address whether research could be reproduced, but not whether it has been.

Solution

SciTeX Clew records every artifact produced during research — code, data, figures, statistics — into a hash-linked DAG (directed acyclic graph). This creates a verifiable knowledge graph of scientific experiments, which can be explored by humans or AI agents.

Named after the thread Ariadne gave Theseus to trace his path through the labyrinth, Clew serves two purposes:

Reproducibility verification — confirm that outputs remain unchanged and that every step in the pipeline is intact.
Research logic comprehension — visualize and navigate the structural skeleton of a research project, from raw data through analysis to manuscript claims.

The DAG is a structured, machine-readable representation of an entire research project — enabling both human reviewers and AI agents to inspect, verify, and understand the logic programmatically. It lets you:

Verify that outputs remain consistent with recorded hashes
Trace provenance chains from any file back to its source
Visualize the structural logic of a research project as a navigable graph
Re-execute scripts in a sandbox to confirm reproducibility
Link manuscript claims to the computational sessions that produced them
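
The first capability in this list, checking outputs against recorded hashes, can be pictured with a minimal sketch. This is not Clew's implementation: `sha256_file`, `record_session`, and `verify_session` are hypothetical helpers, and the record layout is an assumption made for illustration. Each session record stores a SHA-256 digest per file, and verification re-hashes the files on disk and reports any that differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's current bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_session(session_id, inputs, outputs):
    """Build a hypothetical session record: file path -> recorded digest."""
    return {
        "session_id": session_id,
        "inputs": {str(p): sha256_file(p) for p in inputs},
        "outputs": {str(p): sha256_file(p) for p in outputs},
    }


def verify_session(record):
    """Re-hash every recorded file; return the paths whose bytes changed."""
    recorded = {**record["inputs"], **record["outputs"]}
    return sorted(p for p, digest in recorded.items()
                  if sha256_file(Path(p)) != digest)


def demo():
    """Record a toy session, then tamper with its output."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw_data.csv"
        out = Path(tmp) / "results.csv"
        raw.write_text("x\n1\n2\n")
        out.write_text("mean\n1.5\n")
        record = record_session("session_demo", [raw], [out])
        before = verify_session(record)       # nothing changed yet
        out.write_text("mean\n9.9\n")         # simulate a silent edit
        after = [Path(p).name for p in verify_session(record)]
        return before, after
```

Calling `demo()` returns an empty list for the untouched session and the name of the edited file afterwards, which is the kind of drift the hash check is meant to surface.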

Case Study: The Broken Twin

A real incident (NeuroVista seizure-forecasting analysis, 2026-06-30) shows why claim→source binding matters. Two scripts that both presented themselves as the source of "warning-metrics Table 03/04" coexisted in the repository:

Broken twin — compute_warning_tables.py fabricated timestamps (times = arange(n) * 60 s) from a block-ordered CSV that had no time column. On that surrogate timeline, a uniform-Poisson alarm beat the real model: AUC 0.46, IoC < 0.
Valid script — _table03_warning_fullwindow.py used the real window_datetime column with forecasting.evaluate_stream: sensitivity 0.70, specificity 0.96, 0.17 false positives/h, 10.7 min median lead time, IoC +0.56.

Without a claim→source→@stx.session binding, both scripts were equally plausible as "the source" of the table. Hours went into diagnosing the broken twin, and near-chance numbers were almost shipped. With Clew, "which code produced this value?" has exactly one answer: the claim resolves to its registered source session, and the broken twin, which has no registered claim, cannot masquerade as evidence. This incident drove ADR-0021: Clew registration is mandatory for every manuscript value.
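
The binding argument can be made concrete with a toy registry. Everything in it is hypothetical: the `ClaimRegistry` class, the `session_valid` id, and the `"Table 03"` key are illustrations, not Clew's API. The sketch only shows the property described above, namely that a claim resolves to exactly one registered session and that a script with no registered claim cannot be returned as evidence.

```python
class ClaimRegistry:
    """Toy claim -> source-session binding (hypothetical, not Clew's API)."""

    def __init__(self):
        self._source_of = {}  # claim id -> (session id, script path)

    def register(self, claim_id, session_id, script):
        # Refuse a second binding so a claim can never have two "sources".
        if claim_id in self._source_of:
            raise ValueError(f"claim {claim_id!r} is already bound")
        self._source_of[claim_id] = (session_id, script)

    def resolve(self, claim_id):
        """Return the single registered source, or raise if unregistered."""
        try:
            return self._source_of[claim_id]
        except KeyError:
            raise LookupError(f"no registered source for {claim_id!r}") from None

    def is_evidence(self, script):
        """A script counts as evidence only if some claim is bound to it."""
        return any(s == script for _, s in self._source_of.values())


registry = ClaimRegistry()
registry.register("Table 03", "session_valid", "_table03_warning_fullwindow.py")
```

With this registry, `registry.is_evidence("compute_warning_tables.py")` is false, and a later attempt to bind the same claim to the broken twin raises instead of silently creating a second candidate source.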

Five Node Classes

Every node in the DAG is classified into one of five semantic roles:

Class	Role	Examples
Source	Data acquisition scripts	`01_download.py`, `collect.sh`
Input	Raw data and configuration	`raw_data.csv`, `config.yaml`
Processing	Transform and analysis scripts	`03_analyze.py`, `train.R`
Output	Intermediate and final data products	`results.csv`, `figure1.png`
Claim	Manuscript assertions tied to evidence	`"Fig 1 shows p<0.05"`, `"Table 2"`

Table 1. Five node classes. Classification is inferred automatically from file extensions and session roles, or set explicitly via set_node_class().
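
How such an inference might work is sketched below. The rules are assumptions chosen to reproduce the examples in Table 1, not Clew's actual logic: `classify`, its keyword arguments, and `SCRIPT_EXTENSIONS` are hypothetical names. The file extension separates scripts from data, and session-role hints separate Source from Processing and Input from Output.

```python
from enum import Enum
from pathlib import PurePath


class NodeClass(Enum):
    SOURCE = "source"
    INPUT = "input"
    PROCESSING = "processing"
    OUTPUT = "output"
    CLAIM = "claim"


SCRIPT_EXTENSIONS = {".py", ".sh", ".R"}  # assumed list, for illustration only


def classify(name, *, reads_files=False, produced_by=None, is_claim=False):
    """Guess a node class from the file extension plus session-role hints."""
    if is_claim:
        return NodeClass.CLAIM
    if PurePath(name).suffix in SCRIPT_EXTENSIONS:
        # Assumption: a script that reads no tracked files acquires data.
        return NodeClass.PROCESSING if reads_files else NodeClass.SOURCE
    # Assumption: only files written by a processing script are outputs;
    # files fetched by a source script, or hand-written, are inputs.
    if produced_by is NodeClass.PROCESSING:
        return NodeClass.OUTPUT
    return NodeClass.INPUT
```

Under these assumed rules, `classify("01_download.py")` gives Source while `classify("03_analyze.py", reads_files=True)` gives Processing, which shows why the extension alone is not enough and session roles are needed.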

This classification turns the DAG into a navigable map of the research project. The key operation is backpropagation from claims to sources: starting from a manuscript assertion (claim), Clew traces backward through outputs, processing scripts, and inputs to the original raw data — verifying every hash along the way.
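
The backward walk can be sketched on a toy graph that mirrors the five-class chain. This is an illustration under stated assumptions, not Clew's code: `PARENTS`, `trace_back`, and `check_claim` are hypothetical names, and the digests are passed in as plain dictionaries rather than computed from files.

```python
from collections import deque

# Hypothetical DAG: each node maps to the nodes it was derived from.
PARENTS = {
    "claim:Fig 1": ["figure1.png"],
    "figure1.png": ["03_analyze.py"],
    "03_analyze.py": ["raw_data.csv"],
    "raw_data.csv": ["01_download.py"],
    "01_download.py": [],
}


def trace_back(start, parents):
    """Return every ancestor of `start`, nearest first (breadth-first)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def check_claim(claim, parents, recorded, live):
    """Compare recorded and live digests for every ancestor of a claim."""
    mismatched = [node for node in trace_back(claim, parents)
                  if recorded.get(node) != live.get(node)]
    return {"verified": not mismatched, "mismatched": mismatched}
```

If the live digest of `raw_data.csv` differs from the recorded one, `check_claim` reports the claim as unverified and names that file, even though the figure itself is unchanged.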

Three Verification Modes

Mode	Scope	API	Description
Project	Entire pipeline	`clew.dag()`	Verifies every session recorded in the database in topological order. A navigation map for ongoing project monitoring. Answers: "Is the whole project intact?"
Files	Specific outputs	`clew.dag(["output.csv"])`	Traces backward from target files through their dependency chain and verifies each session. Answers: "Can I trust this specific file?"
Claims	Manuscript assertions	`clew.verify_claim("Fig 1")`	Verifies individual claims linked to source sessions. Answers: "Is this figure/statistic still backed by the data?"

Table 2. Three verification modes. Each mode supports both cache verification (millisecond hash comparison) and re-run verification (sandbox re-execution with rerun_dag / rerun_claims).
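
Project mode is described as verifying sessions in topological order. A minimal sketch of that ordering, using the standard-library `graphlib` module, follows. The session names, the `verify_project` helper, and the "tainted" label for sessions downstream of a failure are assumptions for illustration and are not taken from Clew.

```python
from graphlib import TopologicalSorter

# Hypothetical sessions: each maps to the sessions whose outputs it consumed.
DEPENDS_ON = {
    "download": set(),
    "analyze": {"download"},
    "plot": {"analyze"},
    "stats": {"analyze"},
}


def verify_project(depends_on, is_intact):
    """Check sessions parents-first; mark descendants of a failure as tainted."""
    status = {}
    for session in TopologicalSorter(depends_on).static_order():
        if any(status[parent] != "verified" for parent in depends_on[session]):
            status[session] = "tainted"      # an upstream session failed
        elif is_intact(session):
            status[session] = "verified"
        else:
            status[session] = "mismatch"
    return status
```

Visiting parents first means a session is only judged after everything it depends on, so a single upstream mismatch is attributed once and its descendants are flagged rather than reported as independent failures.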

Verification Caching Guarantee

All caching in Clew is content-keyed: every cache key is a SHA-256 hash of live file bytes, and no mtime logic exists anywhere in the package. Three rules follow from this. Per-pass hash caches are created fresh for each verification pass and never persisted. The opt-in rerun_dag(skip_unchanged=True) path re-hashes the script and every recorded input before it skips a session, and skipped sessions are marked level=CACHE, never RERUN. Stored verdicts are an append-only history and are never read back to skip live hashing. A cache can therefore speed up verification but can never return "verified" for content that has changed. The full audited statement is in "Verification caching: correctness guarantee".
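
The difference between content-keyed and mtime-keyed skipping can be shown with a small sketch. It is an illustration only: `digest` and `can_skip` are hypothetical helpers and do not reproduce the internals of `rerun_dag(skip_unchanged=True)`. The point is that the skip decision is made from freshly computed SHA-256 digests, so an edit that keeps the file size is still detected.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path):
    """SHA-256 of the file's live bytes; modification time is never read."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def can_skip(script, inputs, recorded):
    """Allow a skip only if the script and every input still hash as recorded."""
    return all(digest(p) == recorded.get(str(p)) for p in [script, *inputs])


def demo():
    """Return the skip decision before and after an input is edited."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "03_analyze.py"
        data = Path(tmp) / "raw_data.csv"
        script.write_text("print('analysis')\n")
        data.write_text("x\n1\n")
        recorded = {str(p): digest(p) for p in (script, data)}
        unchanged = can_skip(script, [data], recorded)
        data.write_text("x\n2\n")  # same size, different content
        changed = can_skip(script, [data], recorded)
        return unchanged, changed
```

Here `demo()` allows the skip while the bytes match and refuses it once the input changes.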

Grouping for Readable DAGs

Large pipelines emit many per-patient / per-fold files. The grouping API collapses related files into a single DAG node while preserving every underlying hash via a Merkle root — aggregate verification remains cryptographically meaningful.

from scitex_clew.groupers import pattern_grouper, auto, compose
import scitex_clew as clew

clew.mermaid(claims=True, grouper=compose(
    pattern_grouper(r"P\d{2}"),   # collapse P01, P02, ..., P15
    auto(),                        # sensible directory + bundle fallbacks
))

A project-wide default can be set in <project_root>/.scitex/clew/config.yaml, which is loaded automatically:

grouper:
  type: compose
  steps:
    - {type: pattern, regex: 'P\d{2}'}
    - {type: directory, min_size: 10}
    - {type: auto}

The same JSON/dict schema works across Python, CLI (--grouper), MCP ({"grouper": {...}}), and the YAML config file. See the grouping skill.
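
Why a grouped node stays verifiable can be seen in a small sketch that collapses files by a regex and summarizes each group with a Merkle root. The construction below (sorted leaves, pairwise SHA-256, last node duplicated on odd levels) is one common convention chosen for illustration; Clew's actual Merkle layout is not documented here, and `merkle_root` and `group_by_pattern` are hypothetical helpers.

```python
import hashlib
import re
from collections import defaultdict


def merkle_root(leaf_hashes):
    """Fold hex digests pairwise into one root (last node duplicated if odd)."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = sorted(leaf_hashes)  # sort so the root ignores file ordering
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]


def group_by_pattern(file_hashes, regex):
    """Collapse files whose path matches `regex` into one node per template."""
    pattern = re.compile(regex)
    groups = defaultdict(list)
    for path, digest in file_hashes.items():
        key = pattern.sub("{}", path) if pattern.search(path) else path
        groups[key].append(digest)
    return {key: merkle_root(hashes) for key, hashes in groups.items()}
```

Because every member digest feeds the root, changing a single per-patient file changes the root of its group, so one collapsed node still detects drift in any file it hides.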

Installation

Requires Python >= 3.10. Zero dependencies: pure standard library (including sqlite3).

pip install scitex-clew

Architecture

graph LR
    S[Source<br/>01_download.py] --> I[Input<br/>raw_data.csv]
    I --> P[Processing<br/>03_analyze.py]
    P --> O[Output<br/>figure1.png]
    O --> C[Claim<br/>'Fig 1: p<0.05']
    classDef src fill:#cfe8ff,stroke:#1f6feb
    classDef inp fill:#e6ffec,stroke:#1a7f37
    classDef proc fill:#fff8c5,stroke:#9a6700
    classDef out fill:#ffe0b2,stroke:#bc4c00
    classDef cl fill:#ffd6cc,stroke:#cf222e
    class S src
    class I inp
    class P proc
    class O out
    class C cl

scitex-clew/
├── src/scitex_clew/
│   ├── __init__.py              # status, run, chain, dag, rerun, mermaid
│   ├── _db/                     # sqlite3 hash-linked DAG store (package)
│   │   ├── __init__.py
│   │   ├── _core.py             # VerificationDB, connection mgmt
│   │   ├── _chain.py            # ChainMixin: get_chain, get_children, set_parent
│   │   ├── _queries.py          # VerificationQueryMixin
│   │   └── _parents.py          # Parent-session operations
│   ├── _hash.py                 # file + directory Merkle hashing
│   ├── _chain/                  # chain/DAG verification (types, ops, routes)
│   ├── _claim/                  # Claim CRUD + verification
│   ├── _citation/               # \cite -> scholar-verified source gate
│   ├── _core/                   # config, logging, node classes, public-API registry
│   ├── _attest/                 # external attestation
│   │   ├── _stamp.py            # Temporal stamping backends (file/RFC3161/Zenodo)
│   │   └── _registry.py         # Clew Registry client (scitex.ai)
│   ├── _rerun.py                # Sandbox re-execution
│   ├── _tracker.py              # Session tracking
│   ├── _register_intermediate.py# Agentic intermediate-value registration
│   ├── _viz/                    # Mermaid/HTML/Graphviz DAG rendering
│   ├── _estimate.py             # Pre-flight runtime/success estimate
│   ├── _examples.py             # Bundled example locator
│   ├── _linter_plugin.py        # scitex-linter plugin entry point
│   ├── _observers/              # scitex-io / scitex-session hook subscribers
│   ├── _groupers/               # Pattern / directory / auto / compose
│   │   ├── __init__.py
│   │   └── _config.py           # Per-project grouper config loader
│   ├── groupers.py              # Public re-exports
│   ├── _cli/                    # clew entrypoint (recursive --help)
│   ├── _mcp/                    # MCP server for AI agents
│   │   ├── server.py            # FastMCP server
│   │   └── tools/               # Tool definitions (skills, claims, hashing, stamping, verification)
│   └── _skills/scitex-clew/     # Workflow skill pages
└── tests/

Quickstart

import scitex_clew as clew

# Git-status-like overview
clew.status()

# Verify a run (hash check)
result = clew.run("session_20250301_143022")

# Trace a file's provenance chain
chain = clew.chain("output/figure.png")

# Verify the DAG behind a target file (clew.dag() with no arguments covers the whole project)
dag_result = clew.dag(["output/figure.png"])

# Re-execute in sandbox and compare
rerun_result = clew.rerun("session_20250301_143022")

DAG verification example

Figure 1. Example DAG visualization. Green nodes indicate verified sessions; red nodes indicate hash mismatches. Clew traces the dependency graph backward from target files to raw data sources.

Four Interfaces

Python API

import scitex_clew as clew

clew.status()                              # overview
clew.run("session_id")                     # verify one run
clew.chain("output/figure.png")            # trace provenance
clew.dag(["output/figure.png"])            # verify the DAG behind target files
clew.rerun("session_id")                   # sandbox re-execution
clew.mermaid(claims=True)                  # Mermaid DAG diagram
clew.add_claim("Fig 1 shows p<0.05", source_files=["fig1.png"])

Full API reference

CLI Commands

clew --help-recursive                      # Show all commands
clew status                                # Git-status-like overview
clew verify <session_id>                   # Verify a run
clew list                                  # List tracked runs
clew stats                                 # Database statistics
clew mermaid                               # Generate Mermaid diagram
clew list-python-apis                      # List Python API tree
clew mcp list-tools                        # List MCP tools

# Claims, hashing, stamping (F1)
clew claim add --file-path paper.tex --type statistic --value "p=0.003"
clew claim list
clew claim verify <claim_id>
clew hash-file path/to/data.csv
clew hash-directory path/to/dir/
clew stamp --backend file
clew list-stamps
clew check-stamp [STAMP_ID]

# Universal --json on every command (F5)
clew --json status
clew status --json
clew --json list --limit 20

# Strict DAG verification with failure attribution (F2)
clew dag --strict --json --target results/figure.csv

Full CLI reference

MCP Server — for AI Agents

AI agents can verify reproducibility and trace provenance autonomously.

Tool	Description
`clew_status`	Git-status-like overview
`clew_run`	Verify a specific run
`clew_chain`	Trace file provenance chain
`clew_dag`	Verify full DAG (`strict=True` returns failure attribution, F2)
`clew_list_runs`	List tracked runs
`clew_stats`	Database statistics
`clew_mermaid`	Generate Mermaid DAG diagram
`clew_rerun_dag`	Rerun full DAG in sandbox
`clew_rerun_claims`	Rerun all claim-backing sessions
`clew_add_claim` / `clew_list_claims` / `clew_verify_claim`	Claim CRUD (F1)
`clew_hash_file` / `clew_hash_directory`	File/directory hashing (F1)
`clew_stamp` / `clew_list_stamps` / `clew_check_stamp`	Temporal stamping (F1)

Table 3. MCP tools available for AI-assisted verification. All tools accept JSON parameters and return JSON results.

clew mcp start
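
For orientation, the sketch below builds the kind of message an MCP client would send to invoke one of these tools. The envelope follows the JSON-RPC 2.0 `tools/call` request defined by the Model Context Protocol; the `tool_call` helper is hypothetical, and `strict` is the only argument name taken from Table 3. Any other parameter names would need to be checked against the MCP specification linked below.

```python
import json


def tool_call(name, arguments, request_id=1):
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Ask the server for strict DAG verification with failure attribution.
message = tool_call("clew_dag", {"strict": True})
```

In practice an MCP client library handles this framing; the sketch is only meant to show that tool parameters travel as a JSON object under `arguments`.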

Full MCP specification

Skills

Skills provide workflow-oriented guides that AI agents query to discover capabilities and usage patterns.

clew skills list              # List available skill pages
clew skills get SKILL         # Show main skill page
scitex-dev skills export --package scitex-clew  # Export to Claude Code

Skill	Content
`quick-start`	Basic API, session tracking, first verification
`cli-commands`	CLI reference (`clew status`, `clew verify`, etc.)
`mcp-tools-for-ai-agents`	MCP tool reference for AI agents
`common-workflows`	Claims, DAG patterns, stamps, reproducibility

Demo

DAG verification example

Figure 2. Live DAG verification. Green nodes are sessions whose recorded hashes still match disk; red nodes flag a drift. clew dag --strict walks claims back to raw data and prints the first failure.

Part of SciTeX

scitex-clew is part of SciTeX. Install it through the umbrella package with pip install scitex[clew] to use it as scitex.clew (Python) or scitex clew ... (CLI).

import scitex

@scitex.session
def main(CONFIG=scitex.INJECTED):
    data = scitex.io.load("input.csv")    # auto-tracked as input
    result = process(data)
    scitex.io.save(result, "output.csv")   # auto-tracked as output
    return 0

All file I/O through scitex.io is recorded in the clew database:

scitex.clew.status()              # overview
scitex.clew.run("session_id")     # verify
scitex.clew.mermaid(claims=True)  # DAG diagram
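
What "recorded in the clew database" could look like is sketched below with the standard-library `sqlite3` module. This is a toy model and not how scitex.io or Clew is implemented: the `TrackedIO` class, the `files` table, and its columns are all hypothetical. The wrapper hashes each file as it is read or written and stores the digest with a role.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


class TrackedIO:
    """Toy load/save wrapper that logs a SHA-256 per file (hypothetical)."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE files (session TEXT, path TEXT, role TEXT, sha256 TEXT)")

    def _log(self, path, role):
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        self.db.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                        (self.session_id, str(path), role, digest))

    def load(self, path):
        self._log(path, "input")          # hash the bytes that were read
        return Path(path).read_text()

    def save(self, text, path):
        Path(path).write_text(text)
        self._log(path, "output")         # hash the bytes just written

    def roles(self):
        rows = self.db.execute("SELECT path, role FROM files ORDER BY rowid")
        return [(Path(p).name, role) for p, role in rows]


def demo():
    """Run one load/save round trip and return the recorded roles."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "input.csv", Path(tmp) / "output.csv"
        src.write_text("x\n1\n")
        io = TrackedIO("session_demo")
        io.save(io.load(src).upper(), dst)
        roles = io.roles()
        io.db.close()
        return roles
```

Routing every read and write through one wrapper is what makes the input and output roles available later without any manual bookkeeping in the analysis script.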

The SciTeX system follows the Four Freedoms for Research below, inspired by the Free Software Definition:

Four Freedoms for Research

The freedom to run your research anywhere — your machine, your terms.

The freedom to study how every step works — from raw data to final manuscript.

The freedom to redistribute your workflows, not just your papers.

The freedom to modify any module and share improvements with the community.

AGPL-3.0 — because we believe research infrastructure deserves the same freedoms as the software it runs on.

Release history

This version: 0.19.1 (Jul 14, 2026)
