Git-first, hypothesis-forcing experiment tracking for agent-driven ML research. Vendors Limina as the research harness, uses signac for local execution/run state, and bridges to W&B for remote observability.
agentic-experiments
When your agent runs ML experiments, make it run them like a scientist.
Quick Start • How It Works • Why It's Different • Features • Architecture • Docs
agentic-experiments (import name aexp) is an opinionated research harness for ML experimentation done with an AI agent — typically Claude Code. It forces a Hypothesis → Experiment → Finding chain on every run, ties that chain to git commits, and validates citation integrity at every turn.
10 CLI verbs • 9 MCP tools • 3 slash commands • 4 research skills • 170+ tests
What this looks like in practice
- Your agent proposes a hypothesis and writes it to `kb/research/hypotheses/H001-*.md` — session-start hooks refuse work that skips this step
- It designs an experiment that explicitly cites the hypothesis; a pre-write hook blocks orphaned experiments before they land
- It creates signac-backed runs via the MCP tool `new_run` — each run records its git commit, experiment ID, and hypothesis ID on the job document
- A W&B run (optional) is bound to the signac job with a deterministic group slug derived from `(hypothesis, experiment, condition)`
- When it writes a finding, the `supporting_runs` array must cite real jobs — `aexp validate` flags dangling references
- Delete an experiment by accident? Every run pointing at it is flagged `run.broken_experiment_link` on the next validation pass
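The deterministic group slug mentioned above can be sketched as a pure function of the triple. This is an illustrative sketch only — the function name `group_slug` and the exact normalization rules are assumptions, not `aexp`'s actual implementation:

```python
import re

def group_slug(hypothesis_id: str, experiment_id: str, condition: str) -> str:
    """Derive a deterministic group slug from (hypothesis, experiment,
    condition). Same inputs always yield the same slug, so re-binding
    a run can never create a second W&B group."""
    def norm(part: str) -> str:
        # Lowercase; collapse runs of non-alphanumerics to single dashes.
        return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
    return f"{norm(hypothesis_id)}--{norm(experiment_id)}--{norm(condition)}"

print(group_slug("H001", "E003", "baseline lr=3e-4"))
# → h001--e003--baseline-lr-3e-4
```

Because the slug is a pure function of identity, retries and re-syncs land in the same group instead of spawning duplicates.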
Principles
- Hypothesis-first, not metric-first — you can't start a run without a live hypothesis; you can't ship a finding without cited runs
- Git is the source of truth — every run carries its commit SHA; the knowledge base lives in git; nothing load-bearing is ephemeral
- Integrate, don't reinvent — signac for run state, W&B for observability, Limina for the research-graph primitives (the H→E→F artifact model, templates, and methodology skills this project builds on). `aexp` is the glue and the discipline
- Portable by default — the MCP server runs via `uvx` from PyPI; `.mcp.json` is identical on every machine and committable to git
The Problem
Agents are great at running experiments. Left unattended, they are also great at running a lot of experiments with no shared thread — ablation sprawl, metric-chasing, findings with no clear question behind them, and a W&B workspace full of orphan runs nobody can reconstruct a month later.
The missing layer is not another tracker. It's a grammar — a structure the agent has to operate within, enforced deterministically by hooks rather than by reminder text in the prompt. Hypothesis before experiment. Experiment before run. Finding cites runs. Runs tied to commits.
aexp provides that grammar. Your agent proposes, designs, runs, and concludes; the harness makes sure the chain stays intact and the paper trail is reproducible.
How It Works
aexp stacks three concerns — research grammar, run state, and observability — glued together with a typed Python API and three agent-facing surfaces.
| Layer | What lives here |
|---|---|
| Research grammar | `kb/` artifact graph — Hypothesis → Experiment → Finding plus Literature / Challenge Review / Strategic Review. Claude Code hooks enforce the H→E→F chain at write time. Four research-methodology skills (experiment-rigor, exploratory-sota-research, research-devil-advocate, build-maintainable-software) install into `.claude/skills/` |
| Local run state (signac) | `.runs/.signac/` plus one `.runs/workspace/<job_id>/` directory per run. `job.sp` carries identity params; `job.doc` carries the artifact link, tracker IDs, status, and summary metrics |
| Observability (W&B, optional `[wandb]` extra) | Remote runs grouped by a deterministic slug derived from `(hypothesis_id, experiment_id, condition)`. Offline-by-default on HPC — `aexp sync-offline` walks the run store and syncs every pending run in one call from a login node |
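The identity-hashed, state-point-keyed run store in the signac layer can be sketched with the standard library. This is a conceptual sketch — signac's actual hashing scheme is similar in spirit (a hash of the canonicalized state point) but may differ in detail, and the state-point keys shown are illustrative:

```python
import hashlib
import json

def job_id(statepoint: dict) -> str:
    """Hash a state point into a stable job ID. Because the ID is a
    function of the canonicalized params alone, creating the same run
    twice resolves to the same workspace -- idempotent by construction."""
    canonical = json.dumps(statepoint, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()

sp = {"hypothesis_id": "H001", "experiment_id": "E003",
      "condition": "baseline", "seed": 0}
# Key order is irrelevant: the canonical form sorts keys first.
assert job_id(sp) == job_id(dict(reversed(list(sp.items()))))
```

Re-running at a new commit changes nothing here — the commit SHA lives in `job.doc`, not the state point, so identity stays stable while provenance is still recorded.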
Three surfaces, one canonical API
Every operation exists in three places, all thin wrappers over the same Python functions in aexp.*:
| Surface | Triggered by | Best for |
|---|---|---|
| MCP tools (`new_run`, `list_runs`, `validate`, …) | The agent during a turn | Structured queries, programmatic chaining, typed JSON returns |
| Slash commands (`/aexp-new-run`, `/aexp-close-run`, `/aexp-close-batch`) | User typing `/aexp-…` | Guided multi-step workflows |
| CLI (`aexp new-run ...`) | Human at a terminal | Scripts, CI, PowerShell sessions |
The hooks are a fourth surface — invisible to the user, they inject kb/ACTIVE.md at session start, block HEF-chain violations, validate KB writes, and run structural validation at turn end.
Why This Is Different
Most ML experiment infrastructure records what happened. aexp polices what's allowed to happen.
- Unlike generic trackers (W&B, MLflow, Aim) — they log the numbers beautifully, but they don't care whether those numbers answer a question. `aexp` refuses runs that don't name their hypothesis and experiment.
- Unlike notebook-driven research — no commit ties, no structural validation, no citation integrity when you share the notebook three months later.
- Unlike DIY harnesses — this ships with working MCP integration, hook-enforced chain discipline, and a validation pass that catches broken references before they rot.
The design bet: agents already know how to run experiments. What they need is a runtime that makes rigorous research the path of least resistance.
Features
Research grammar
| H→E→F artifact graph | Every run descends from an Experiment, which descends from a Hypothesis. Findings cite runs with strong references (either specific job IDs or batch selectors). |
| Hook-enforced discipline | SessionStart, PreToolUse, PostToolUse, and Stop hooks inject active context, block chain violations, and validate KB integrity at turn end. Hooks ship inside the installed package and upgrade via pip install -U. |
| Research methodology skills | Four SKILL.md files install into .claude/skills/ — experiment rigor, exploratory SOTA research, devil's advocate review, and build-maintainable-software. Trigger with $experiment-rigor etc. |
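A finding's "strong references" to runs might look like the frontmatter below. This shape is illustrative, not the harness's actual schema — only the `supporting_runs` field and the two reference kinds (specific job IDs and batch selectors) are named in the docs:

```yaml
# kb/research/findings/F001-….md frontmatter (field names other than
# supporting_runs are assumptions for illustration)
id: F001
experiment: E003
supporting_runs:
  - job_id: 9bf4e1c2…                                   # cite one signac job
  - batch: {experiment_id: E003, condition: baseline}   # cite a whole batch
```

`aexp validate` resolves each reference against the run store; a reference that no longer resolves becomes a dangling-citation issue rather than silently rotting.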
Run state + observability
| signac-backed runs | Identity-hashed workspaces; idempotent creation keyed on state point; status and summary metrics in job.doc. Re-run at a new commit produces a distinct persistent workspace, both preserved. |
| W&B tracker adapter | Optional, behind [wandb] extra. Group slug is deterministic so the same run is never double-created. Offline-first; co-locates with its signac workspace. |
| HPC-friendly sync | aexp sync-offline walks the run store and runs wandb sync on every offline run — one command from a login node, no shell gymnastics. |
| Tracker ABC | TrackerAdapter is a small ABC; the noop + wandb adapters are reference implementations. MLflow / Aim / DVC adapters reserved for v1.1. |
Agent surfaces
| MCP server | FastMCP with 9 tools covering the full run lifecycle. Runs via uvx --from agentic-experiments[mcp] aexp-mcp-server — no absolute paths, no per-machine config, .mcp.json committable to git. |
| Slash commands | /aexp-new-run, /aexp-close-run, /aexp-close-batch — guided multi-step workflows for the common cases. |
| CLI | 10 verbs: install, new-run, list-runs, list-batches, show-run, show-batch, link, bind-tracker, sync-offline, validate, install-slash-commands. Python API is a one-line from aexp import .... |
| Typed JSON contracts | Pydantic models (RunLink, BatchSelector, Issue, …) back the schema; MCP tools and CLI return the same shapes. |
Architecture
```mermaid
graph TB
    subgraph "Claude Code"
        CC[Claude Code Session]
        SC[Slash Commands<br/>/aexp-*]
        HOOKS[Hooks<br/>session_start, enforce_hef, kb_write_guard, stop_validate]
    end
    subgraph "aexp (Python package)"
        MCP[MCP Server<br/>FastMCP, 9 tools]
        CLI[CLI — typer<br/>10 verbs]
        API[Python API<br/>aexp.*]
    end
    subgraph "Research grammar"
        KB[(kb/<br/>H→E→F artifact graph)]
        SKILLS[research skills<br/>.claude/skills/]
        VALID[aexp.kb_validate<br/>structural check]
    end
    subgraph "Run state — signac"
        SIGNAC[(.runs/<br/>signac project)]
        JOBS[workspace/<job_id>/<br/>per-run directory]
    end
    subgraph "Observability — W&B (optional)"
        WB[wandb.ai<br/>grouped by slug]
        OFFLINE[offline-run-*/<br/>co-located]
    end
    CC --> MCP
    SC --> MCP
    CC -.hooks.-> HOOKS
    HOOKS --> KB
    HOOKS --> VALID
    MCP --> API
    CLI --> API
    API --> KB
    API --> SIGNAC
    API --> WB
    JOBS -.wandb sync.-> WB
    OFFLINE -.aexp sync-offline.-> WB
    SKILLS -.invoked.-> CC
```
The canonical Python API (aexp.*) is the narrow waist. MCP, CLI, and slash commands all delegate to it; they differ only in how they're triggered.
Quick Start
Prerequisites: Python 3.11+ and uv on PATH (Claude Code uses uvx to run the MCP server).
From inside your research repo, with a virtual environment active:
```bash
pip install "agentic-experiments[wandb,mcp]"
aexp install
aexp --help
```
Heads up — `aexp install` will modify your repo. It creates `.mcp.json`, merges into any existing `.claude/settings.json` (hooks + permissions are additive; yours are preserved), adds `.claude/skills/` with four research-methodology skills, copies a `kb/` scaffold plus `templates/` into the repo root, initializes `.runs/` as a signac project, and records the interpreter path in `.aexp/installed.json`. It prints the plan and asks for confirmation before writing — pass `--yes` to skip the prompt or `--dry-run` to preview only. No Python code you didn't write lands in your repo: hook scripts and validator logic live inside the installed `aexp` package and upgrade via `pip install -U`.
See docs/quickstart.md for a full worked example — hypothesis → experiment → runs → finding.
Extras
| Extra | Installs | When to use |
|---|---|---|
| `mcp` | `mcp` | Claude Code MCP server (almost always wanted) |
| `wandb` | `wandb` | W&B tracker adapter for remote observability |

`pip install agentic-experiments` alone gets you the CLI and Python API. The extras are additive.
Invoking the CLI from inside Claude Code
Three equivalent entry points, listed in order of robustness under agent runtimes:
| Form | Best when |
|---|---|
| `conda run -n <env> python -m aexp <verb>` | Most robust inside Claude Code. Works on Windows / macOS / Linux without shell activation. |
| `python -m aexp <verb>` | Works when `python` resolves to the env — e.g. an activated shell or a venv install. |
| `aexp <verb>` | Shortest; only on PATH in human terminals with the env active. |
.aexp/installed.json records the interpreter path and conda env name at install time, so slash commands + the MCP server never have to guess.
Stop-hook scope caveat
When a Claude Code session ends, the Stop hook runs aexp.kb_validate — a KB-structural check (frontmatter, aliases, wikilinks, bidirectional backlinks, H→E→F chain). It does not run aexp's run-link / finding-citation validator.
So a session can end cleanly with a broken supporting_runs citation still present. Run aexp validate explicitly for full-coverage validation; treat Stop hook success as "KB structurally sound" rather than "everything coherent."
Documentation
| Doc | What it covers |
|---|---|
| docs/concepts.md | The H→E→F grammar, batches, findings, validation layers |
| docs/quickstart.md | A full worked example — bootstrap to finding |
| docs/cli.md | Complete CLI reference, verb by verb |
| docs/mcp.md | MCP server tools, transport, verification prompt, troubleshooting |
| docs/mapping.md | kb/ ↔ signac ↔ W&B mapping in gory detail |
| docs/tracker-adapters.md | Writing a new tracker adapter; why Weave isn't in v1 |
Project layout
```text
src/aexp/
  __init__.py        # public API re-exports
  cli.py             # Typer app (aexp)
  __main__.py        # python -m aexp → CLI
  install.py         # apply the harness into a consumer repo
  runs.py            # signac wrappers: create_run, open_run, find_runs, run_lifecycle
  linking.py         # batch queries + retroactive run-to-experiment linking
  limina_io.py       # typed read wrappers for H/E/F/L/CR/SR artifacts
  validate.py        # composes KB structural + run-link + citation integrity
  kb_validate.py     # KB structural validator (frontmatter, aliases, chain)
  schema.py          # pydantic + dataclass types
  mcp_server.py      # FastMCP server — optional [mcp] extra
  hooks/             # Claude Code hooks (session_start, enforce_hef_chain, kb_write_guard, stop_validate)
  slash_commands/    # /aexp-* templates
  trackers/          # TrackerAdapter ABC + noop + wandb adapters
  utils/             # paths, git, atomic writes
  vendor/            # forked research-graph templates, skills, and kb/ scaffold
tests/               # pytest suite; CI on Ubuntu + Windows × Py 3.11/3.12/3.13
docs/                # concepts, quickstart, cli, mcp, mapping, tracker-adapters
```
Status
Pre-release (v0.1.0). Actively developed by one person and the agents they direct; used in the author's own ML research workflow. The API surface is not yet stable.
- Developed and primarily tested on Windows 11 / Python 3.12. Supports Python 3.11+. CI runs the full suite on Ubuntu + Windows × Py 3.11/3.12/3.13. macOS hasn't been exercised — issues welcome.
- MCP server is the only PyPI-gated surface — the CLI and Python API run from a local checkout without any PyPI round-trip.
- v1.1 backlog: artifact-creation CLI verbs (`aexp new-hypothesis` / `new-experiment` / `new-finding`), an `aexp index` dashboard, MLflow / Aim / DVC tracker adapters, OpenTelemetry extra.
If you run ML experiments with Claude Code and find yourself wanting a harness that holds your agent to scientific discipline, this is built for you. Feedback, bug reports, and PRs all welcome.
Contributing
For bugs and feature requests, open an issue.
To hack on the package itself, clone the repo and use Poetry:
```bash
git clone https://github.com/KadenMc/agentic-experiments.git
cd agentic-experiments
poetry install --with dev --extras "wandb mcp"
poetry run pytest        # `-m "not slow"` skips the e2e smoke
poetry run ruff check .
```
Python 3.11, 3.12, and 3.13 are all exercised in CI on Ubuntu and Windows.
License
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file agentic_experiments-0.1.0.tar.gz.
File metadata
- Download URL: agentic_experiments-0.1.0.tar.gz
- Upload date:
- Size: 115.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5c7289c8e7e17e1c73c5185378497d428572d00bfdc0ae91a1e73773472ab97a` |
| MD5 | `39b538150b87cb2dc6bd8f76399bdf94` |
| BLAKE2b-256 | `e26ce4f572ffb0be31e63d8dfe72352e1dd3d8c35d95d2eb776c19bbf77d06fb` |
File details
Details for the file agentic_experiments-0.1.0-py3-none-any.whl.
File metadata
- Download URL: agentic_experiments-0.1.0-py3-none-any.whl
- Upload date:
- Size: 135.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `699456c493ec571d8167ee355048fd0c4a82d8750f67350eab83fea7f99a5616` |
| MD5 | `23058c2368ba666d0f7d72717787b1c4` |
| BLAKE2b-256 | `283876d9ae89e259d26b8d746ec35d06631e7200b63989bb8d8cc0dd8ce2000b` |