Clew (scitex-clew)

Hash-based reproducibility verification for scientific pipelines


Problem

Scientific publications are growing exponentially — accelerated by LLM-assisted writing — yet peer review remains a manual bottleneck. 70% of researchers report failed replication attempts, and only 11-36% of high-profile findings are successfully reproduced. Existing tools (pre-registration, containerization, workflow managers) address whether research could be reproduced, but not whether it has been.

Solution

Clew, named after the thread Ariadne gave Theseus to trace his path through the labyrinth, records a SHA-256 hash of every file your pipeline reads and writes and stores it in a local SQLite database. The resulting DAG (directed acyclic graph) is a structured, machine-readable representation of the logic of an entire research project, from raw data through processing scripts to final figures and manuscript claims, so both human reviewers and AI agents can verify reproducibility programmatically. It lets you:

  • Verify that outputs haven't changed since recording
  • Trace provenance chains from any file back to its source
  • Re-execute scripts in a sandbox to confirm reproducibility
  • Link manuscript claims to the sessions that produced them
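
The record-then-verify idea behind the first bullet can be sketched with the standard library alone. This is an illustrative sketch, not Clew's actual schema or API: the table name `file_hashes` and the helpers `sha256_of`, `record`, `verify`, and `demo` are all hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(db: sqlite3.Connection, session_id: str, path: Path, role: str) -> None:
    """Store the current hash of `path` for one session (hypothetical schema)."""
    db.execute(
        "INSERT OR REPLACE INTO file_hashes VALUES (?, ?, ?, ?)",
        (session_id, str(path), role, sha256_of(path)),
    )


def verify(db: sqlite3.Connection, session_id: str) -> dict[str, bool]:
    """Compare each recorded hash with the file's current hash."""
    rows = db.execute(
        "SELECT path, sha256 FROM file_hashes WHERE session_id = ?", (session_id,)
    ).fetchall()
    return {p: Path(p).exists() and sha256_of(Path(p)) == h for p, h in rows}


def demo() -> tuple[dict[str, bool], dict[str, bool]]:
    """Record one output, verify it, modify it, and verify again."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE file_hashes ("
        "session_id TEXT, path TEXT, role TEXT, sha256 TEXT, "
        "PRIMARY KEY (session_id, path))"
    )
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "results.csv"
        out.write_text("x,y\n1,2\n")
        record(db, "session_demo", out, "output")
        before = verify(db, "session_demo")
        out.write_text("x,y\n1,3\n")  # simulate an unexpected change
        after = verify(db, "session_demo")
    return before, after
```

Calling `demo()` returns two dictionaries: the first reports the file as matching its recorded hash, the second reports a mismatch after the file was rewritten.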

Five Node Classes

Every node in the DAG is classified into one of five semantic roles:

Class      | Role                                  | Examples
Source     | Data acquisition scripts              | 01_download.py, collect.sh
Input      | Raw data and configuration            | raw_data.csv, config.yaml
Processing | Transform and analysis scripts        | 03_analyze.py, train.R
Output     | Intermediate and final data products  | results.csv, figure1.png
Claim      | Manuscript assertions tied to evidence | "Fig 1 shows p<0.05", "Table 2"

Table 1. Five node classes. Classification is inferred automatically from file extensions and session roles, or set explicitly via set_node_class().

This classification turns the DAG into a navigable map of the research project. The key operation is backpropagation from claims to sources: starting from a manuscript assertion (claim), Clew traces backward through outputs, processing scripts, and inputs to the original raw data — verifying every hash along the way.
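
The backward trace described above can be sketched as a graph walk. This is a minimal sketch under stated assumptions, not Clew's implementation: the `PARENTS` mapping, the hash dictionaries, and the helpers `trace_back` and `mismatched_ancestors` are hypothetical, and the file names are borrowed from Table 1.

```python
from collections import deque

# Hypothetical DAG: each node maps to the nodes it directly depends on.
PARENTS = {
    "claim: Fig 1 shows p<0.05": ["figure1.png"],
    "figure1.png": ["03_analyze.py", "results.csv"],
    "results.csv": ["03_analyze.py", "raw_data.csv"],
    "03_analyze.py": [],
    "raw_data.csv": ["01_download.py"],
    "01_download.py": [],
}


def trace_back(start: str, parents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from a node to everything it depends on."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


def mismatched_ancestors(
    start: str,
    parents: dict[str, list[str]],
    recorded: dict[str, str],
    current: dict[str, str],
) -> list[str]:
    """Return traced nodes whose current hash differs from the recorded one."""
    return [
        node
        for node in trace_back(start, parents)
        if node in recorded and recorded[node] != current.get(node)
    ]
```

A claim is treated as backed by its evidence only when `mismatched_ancestors` returns an empty list for it. Nodes without a recorded hash, such as the claim itself, are skipped in this sketch.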

Three Verification Modes

Mode | Scope | API | Description
Project | Entire pipeline | clew.dag() | Verifies every session recorded in the database in topological order. A navigation map for ongoing project monitoring. Answers: "Is the whole project intact?"
Files | Specific outputs | clew.dag(["output.csv"]) | Traces backward from target files through their dependency chain and verifies each session. Answers: "Can I trust this specific file?"
Claims | Manuscript assertions | clew.verify_claim("Fig 1") | Verifies individual claims linked to source sessions. Answers: "Is this figure/statistic still backed by the data?"

Table 2. Three verification modes. Each mode supports both cache verification (millisecond hash comparison) and re-run verification (sandbox re-execution with rerun_dag / rerun_claims).
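
Verifying sessions in topological order means every upstream session is checked before anything that consumed its outputs. The sketch below illustrates that ordering with the standard-library `graphlib` module. The session names, the `DEPENDS_ON` mapping, the `verify_in_order` helper, and the "upstream-failed" status are assumptions made for illustration and are not taken from Clew.

```python
from graphlib import TopologicalSorter

# Hypothetical session graph: session -> sessions whose outputs it consumed.
DEPENDS_ON = {
    "download": set(),
    "clean": {"download"},
    "analyze": {"clean"},
    "plot": {"analyze"},
}


def verify_in_order(
    depends_on: dict[str, set[str]], hashes_match: dict[str, bool]
) -> dict[str, str]:
    """Visit sessions upstream-first and label each one.

    A session whose own hashes match is still labelled "upstream-failed"
    when any session it depends on did not verify (an illustrative choice).
    """
    status: dict[str, str] = {}
    for session in TopologicalSorter(depends_on).static_order():
        if not hashes_match[session]:
            status[session] = "mismatch"
        elif any(status[dep] != "verified" for dep in depends_on[session]):
            status[session] = "upstream-failed"
        else:
            status[session] = "verified"
    return status
```

Because `static_order()` yields predecessors first, the status of every dependency is already known when a session is visited, so a single pass is enough.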

Installation

Requires Python >= 3.10. Zero third-party dependencies: Clew uses only the Python standard library, including sqlite3.

pip install scitex-clew

SciTeX users: pip install scitex already includes Clew. Tracking is automatic via @scitex.session + scitex.io.

Quickstart

import scitex_clew as clew

# Git-status-like overview
clew.status()

# Verify a run (hash check)
result = clew.run("session_20250301_143022")

# Trace a file's provenance chain
chain = clew.chain("output/figure.png")

# Verify the full DAG
dag_result = clew.dag(["output/figure.png"])

# Re-execute in sandbox and compare
rerun_result = clew.rerun("session_20250301_143022")

Figure 1. Example DAG visualization. Green nodes indicate verified sessions; red nodes indicate hash mismatches. Clew traces the dependency graph backward from target files to raw data sources.
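
A colored diagram like the one described in Figure 1 can be expressed as Mermaid flowchart text. The sketch below is not the output format of `clew.mermaid()`, which is not shown here. The `to_mermaid` helper, the session names, and the color values are hypothetical. It only shows how verified and mismatched sessions could be styled differently.

```python
def to_mermaid(edges: list[tuple[str, str]], verified: dict[str, bool]) -> str:
    """Render (upstream, downstream) edges as a Mermaid flowchart string."""
    ids: dict[str, str] = {}
    lines = ["graph TD"]
    # Declare each node once, tagged with a style class for its status.
    for name in sorted({node for edge in edges for node in edge}):
        ids[name] = f"n{len(ids)}"
        css = "ok" if verified.get(name, False) else "bad"
        lines.append(f'    {ids[name]}["{name}"]:::{css}')
    # Draw edges from upstream session to downstream session.
    for src, dst in edges:
        lines.append(f"    {ids[src]} --> {ids[dst]}")
    lines.append("    classDef ok fill:#9f9,stroke:#333")
    lines.append("    classDef bad fill:#f99,stroke:#333")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid([("download", "analyze")], {"download": True, "analyze": False}))
```

The returned text can be pasted into any Mermaid renderer. Sessions missing from the `verified` mapping are styled as mismatches in this sketch.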

Three Interfaces

Python API
import scitex_clew as clew

clew.status()                              # overview
clew.run("session_id")                     # verify one run
clew.chain("output/figure.png")            # trace provenance
clew.dag(["output/figure.png"])            # verify full DAG
clew.rerun("session_id")                   # sandbox re-execution
clew.mermaid(claims=True)                  # Mermaid DAG diagram
clew.add_claim("Fig 1 shows p<0.05", source_files=["fig1.png"])

Full API reference

CLI Commands
clew --help-recursive                      # Show all commands
clew status                                # Git-status-like overview
clew verify <session_id>                   # Verify a run
clew list                                  # List tracked runs
clew stats                                 # Database statistics
clew mermaid                               # Generate Mermaid diagram
clew list-python-apis                      # List Python API tree
clew mcp list-tools                        # List MCP tools

Full CLI reference

MCP Server — for AI Agents

AI agents can verify reproducibility and trace provenance autonomously.

Tool              | Description
clew_status       | Git-status-like overview
clew_run          | Verify a specific run
clew_chain        | Trace file provenance chain
clew_dag          | Verify full DAG
clew_list         | List tracked runs
clew_stats        | Database statistics
clew_mermaid      | Generate Mermaid DAG diagram
clew_rerun_dag    | Rerun full DAG in sandbox
clew_rerun_claims | Rerun all claim-backing sessions

Table 3. Nine MCP tools available for AI-assisted verification. All tools accept JSON parameters and return JSON results.

clew mcp start

Full MCP specification

Part of SciTeX

Clew is part of SciTeX. When used inside the SciTeX framework, tracking is automatic:

import scitex

@scitex.session
def main(CONFIG=scitex.INJECTED):
    data = scitex.io.load("input.csv")    # auto-tracked as input
    result = process(data)
    scitex.io.save(result, "output.csv")   # auto-tracked as output
    return 0

All file I/O through scitex.io is recorded in the clew database:

scitex.clew.status()              # overview
scitex.clew.run("session_id")     # verify
scitex.clew.mermaid(claims=True)  # DAG diagram

The SciTeX ecosystem follows the Four Freedoms for researchers:

Four Freedoms for Research

  1. The freedom to run your research anywhere — your machine, your terms.
  2. The freedom to study how every step works — from raw data to final manuscript.
  3. The freedom to redistribute your workflows, not just your papers.
  4. The freedom to modify any module and share improvements with the community.

AGPL-3.0 — because research infrastructure deserves the same freedoms as the software it runs on.

