AI-powered autonomous ML research framework — agent runs experiments against frozen evaluation contracts on your existing pipelines.
🥋 Dojo — An AI-powered autonomous ML research framework.
Run controlled, reproducible ML experiments on your existing pipelines and build a memory of what actually works.
What is Dojo?
You define a domain — a research area pointing at your data with a fixed evaluation contract. An AI agent runs experiments inside that contract: writing training code, calling frozen load_data and evaluate tools, logging metrics, and recording findings as durable knowledge atoms.
Domain (you define)
├── Task: the contract, load_data + evaluate (frozen, AI-generated at setup)
├── Workspace: your repo / pipeline (local path or git URL)
└── Experiments: agent-created, many per domain
    └── Knowledge atoms: linked across experiments, accumulating over time
The agent owns the training code. The framework owns evaluation. That separation is what makes the metrics trustworthy run-over-run, and what makes it safe to leave the agent unsupervised.
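The ownership split described above can be sketched in a few lines. This is a minimal illustration, not Dojo's actual code: the function signatures, the toy dataset, and the `run_experiment` helper are all assumptions made for the example. The point is only that `train` is swappable while `load_data` and `evaluate` stay fixed, and that the training function never sees the test split.

```python
import math
from typing import Callable

Row = tuple[float, float]  # (feature, target); a deliberately tiny dataset shape


def load_data() -> tuple[list[Row], list[Row]]:
    """Frozen side: always returns the same (train, test) split."""
    rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    return rows[:8], rows[8:]


def evaluate(predict: Callable[[float], float], test: list[Row]) -> dict[str, float]:
    """Frozen side: the only place where metrics are computed."""
    squared_errors = [(predict(x) - y) ** 2 for x, y in test]
    return {"rmse": math.sqrt(sum(squared_errors) / len(squared_errors))}


def train(rows: list[Row]) -> Callable[[float], float]:
    """Agent side: free to change between experiments (here, least squares)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def run_experiment(train_fn: Callable[[list[Row]], Callable[[float], float]]) -> dict[str, float]:
    """Hypothetical runner: wires the pieces so train_fn only sees training rows."""
    train_rows, test_rows = load_data()
    model = train_fn(train_rows)
    return evaluate(model, test_rows)
```

Swapping in a different `train` function changes the model but not how it is scored, which is what keeps metrics comparable across runs.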
Inspired by Karpathy's autoresearch — prepare.py is frozen, train.py is fair game, program.md is what the human iterates on. Dojo generalises that pattern to any well-defined ML problem class.
Current Status
⚠️ Proof of Concept — under active development. Open source. Single-tenant, local-first, by design.
- Agent: Claude Agent SDK (uses your local `claude` CLI auth; no API key needed for runs)
- Compute: Local only (in-process / subprocess); your data stays on your machine
- Storage: Local JSON files in `.dojo/`
- Tracking: File-based or MLflow (sits on top of an MLflow you already run)
- Tasks supported: `RegressionTask` (more types to come once regression is solid)
Prerequisites
- Python 3.13+
- uv
- just
- The `claude` CLI logged in (Claude Code); Dojo shells out to it, so no `ANTHROPIC_API_KEY` is needed
- Node.js 18+ (only if you want the web UI)
just dev # install backend + frontend deps
Getting Started — California Housing in 4 commands
The CLI is a peer of the HTTP API, not a thin wrapper around it. The whole happy path runs in-process — no server needed.
mkdir housing && cd housing
# 1. Scaffold the domain (creates .dojo/, the regression Task, PROGRAM.md, SETUP.md)
dojo init --name housing --task-type regression --non-interactive
# 2. Describe the research goal, target, success
$EDITOR PROGRAM.md
# 3. Describe the dataset and evaluation (read once by `dojo task setup`)
$EDITOR SETUP.md
# 4. AI generates load_data + evaluate from SETUP.md, verifies them against
# the regression contract, and freezes the task. Re-run after edits.
dojo task setup
# 5. Run the agent — events stream live to your terminal
dojo run --max-turns 30
If the AI keeps generating the wrong adapters (verification failures on real-world pipelines, e.g. unusual pandas multi-indexes, custom dataset APIs, or wrapping an existing evaluator), use Opus 4.7 for tool generation instead of the default Sonnet:
DOJO_AGENT__TOOL_GENERATION_MODEL=claude-opus-4-7 dojo task setup
Opus is slower (~30–60s vs 15–30s) but noticeably better at translating a messy `SETUP.md` into correct `load_data` / `evaluate` modules. Set it permanently in `.dojo/config.yaml` under `agent.tool_generation_model` if you want it as the default.
A reasonable starter PROGRAM.md for California housing:
## Goal
Predict California median house value (regression). Minimise RMSE on a 20% held-out test split.
## Target
Median house value (in $100,000s) for census blocks in California.
## Success
Beat a linear baseline. Try at least one tree-based model. Avoid overfitting.
A reasonable starter SETUP.md:
## Dataset
Use `sklearn.datasets.fetch_california_housing(return_X_y=True)`.
Features and target both come back as numpy arrays — no column names needed.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
## Evaluate
Use sklearn's mean_squared_error / r2_score / mean_absolute_error against y_test.
Save a residuals scatter plot to artifacts_dir/residuals.png.
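For reference, the three metrics named in the Evaluate section can be written out by hand. The sketch below is pure Python and only illustrates what the contract measures; a generated `evaluate` tool would more likely call the sklearn functions listed above. The function name `regression_metrics` and the choice to return NaN for R² on a constant target are assumptions made for this example.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, R^2 and MAE for paired targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when the target has zero variance (assumption: report NaN).
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "mae": sum(abs(r) for r in residuals) / n,
    }
```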
What happens under the hood:
- `dojo init` writes `.dojo/config.yaml`, creates the domain + regression task with `expected_metrics = [rmse, r2, mae]`, scaffolds `PROGRAM.md` and `SETUP.md`, and sets `current_domain_id`.
- `dojo task setup` reads `SETUP.md`, asks the AI to generate `load_data` + `evaluate`, runs each tool in a sandbox against its `ToolContract`, and freezes the task. Verification failures tell you which tool failed and why; fix `SETUP.md` (or the tool code) and re-run.
- `dojo run` starts the agent in-process. The agent writes training code; `load_data` and `evaluate` stay frozen. The metric dict from `evaluate` is the only source of truth: `complete_experiment` rejects metric keys outside the contract, so the agent can't smuggle in custom numbers.
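The metric-key check in the last step might look roughly like the following. This is a sketch under assumptions: `check_metric_keys` is a hypothetical helper, not the real `complete_experiment`, and raising `ValueError` is an illustrative choice. Only the `expected_metrics` list comes from the description above.

```python
EXPECTED_METRICS = ("rmse", "r2", "mae")  # the regression task's expected_metrics


def check_metric_keys(
    metrics: dict[str, float],
    expected: tuple[str, ...] = EXPECTED_METRICS,
) -> dict[str, float]:
    """Return a copy of metrics, or raise if any key is outside the contract."""
    unexpected = sorted(set(metrics) - set(expected))
    if unexpected:
        raise ValueError(f"metric keys outside the contract: {unexpected}")
    return dict(metrics)
```

With a check like this in place, a result such as `{"rmse": 0.5, "my_custom_score": 0.99}` fails fast instead of being recorded.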
Useful neighbours:
dojo task show # current task status, tools, frozen?
dojo runs ls # recent runs
dojo runs show # last run's events + cost
dojo program show # print the live PROGRAM.md
dojo domain use <name> # switch active domain
Stopping a run
dojo run blocks the foreground until the agent finishes. To stop it early:
- Ctrl-C in the running terminal: the canonical path. The orchestrator is interrupted, the framework asks the backend to summarise any durable findings as knowledge atoms (a small one-shot LLM call), then prints a final cost line. A second Ctrl-C aborts the cleanup immediately.
- `dojo stop [run_id]` from another terminal: marks the run `STOPPED` on disk. This does not halt an in-process foreground run (the orchestrator lives inside the other terminal's Python process); use it to recover records left `RUNNING` after a hard kill, or to stop server-mode runs.
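The two-stage Ctrl-C behaviour can be pictured with nested `KeyboardInterrupt` handling. This is a simplified, synchronous sketch; how Dojo's orchestrator actually handles signals is not shown in this README, and `run_with_graceful_stop` and its return strings are hypothetical.

```python
from typing import Callable


def run_with_graceful_stop(work: Callable[[], None], cleanup: Callable[[], None]) -> str:
    """Run work(); on Ctrl-C run cleanup(); on a second Ctrl-C give up at once."""
    try:
        work()
        return "completed"
    except KeyboardInterrupt:
        # First Ctrl-C: stop the main work and try to save what was learned.
        try:
            cleanup()
            return "stopped"
        except KeyboardInterrupt:
            # Second Ctrl-C during cleanup: abort without finishing it.
            return "aborted"
```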
Reviewing what happened
dojo experiments ls # rank experiments by the primary metric (best first)
dojo experiments best # show the single best experiment so far
dojo experiments show <id> # full detail: hypothesis, metrics, code path, errors
dojo runs show # last run's events + total cost
dojo experiments ls orders by the task's primary_metric and direction
(e.g. rmse minimised), so the leader sits on top regardless of run order.
The agent's training code is preserved per-experiment in the workspace as
__dojo_train_<experiment_id>.py — cat it to reproduce a run by hand.
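The ordering used by `dojo experiments ls` amounts to a sort on one metric with a direction flag. A minimal sketch, assuming experiments are plain dicts with a `metrics` mapping and that the direction is spelled `"minimize"` or `"maximize"` (both are assumptions for illustration, not Dojo's stored format):

```python
def rank_experiments(experiments: list[dict], primary_metric: str, direction: str) -> list[dict]:
    """Return experiments that report primary_metric, best first."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    # Experiments that never reported the primary metric cannot be ranked.
    scored = [e for e in experiments if primary_metric in e.get("metrics", {})]
    return sorted(
        scored,
        key=lambda e: e["metrics"][primary_metric],
        reverse=(direction == "maximize"),
    )
```

Because the sort key is the metric rather than the creation time, the best experiment stays on top no matter when it ran.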
Artifacts
Each experiment gets a fresh .dojo/domains/{id}/runs/{eid}/artifacts/ directory. The runner passes its path as artifacts_dir to both train() and evaluate().
- `evaluate(..., artifacts_dir)` writes durable per-run diagnostics: residual plots, calibration curves, error breakdowns. These are produced on every run and are part of the user-defined evaluation contract in `SETUP.md`.
- `train(..., artifacts_dir)` writes opportunistic artifacts: model checkpoints (`joblib.dump(model, artifacts_dir / "model.pkl")`), training curves, feature importances. The agent decides when an artifact is worth keeping; not every run will write here.
Everything written to artifacts_dir is:
- Copied into the durable Dojo archive at `.dojo/artifacts/experiments/{eid}/...`.
- Forwarded to the active tracking backend (`MlflowTracker.log_artifact` uploads to MLflow; `FileTracker` records a reference; `NoopTracker` drops it).
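The write-then-archive flow can be sketched with the standard library. Everything here is illustrative: `archive_artifacts` and `demo` are hypothetical helpers, `pickle` stands in for the `joblib.dump` call mentioned above, and the directory names only loosely mirror the layout described in this section.

```python
import pickle
import shutil
import tempfile
from pathlib import Path


def train(artifacts_dir: Path) -> dict:
    """Agent side: persist a checkpoint when it seems worth keeping."""
    model = {"slope": 2.0, "intercept": 1.0}  # stand-in for a fitted estimator
    with open(artifacts_dir / "model.pkl", "wb") as fh:
        pickle.dump(model, fh)
    return model


def archive_artifacts(artifacts_dir: Path, archive_root: Path, experiment_id: str) -> Path:
    """Framework side: copy everything the run wrote into a durable location."""
    destination = archive_root / "experiments" / experiment_id
    shutil.copytree(artifacts_dir, destination, dirs_exist_ok=True)
    return destination


def demo() -> list[str]:
    """Run one fake experiment in a temp dir and list the archived file names."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "runs" / "exp-1" / "artifacts"
        run_dir.mkdir(parents=True)
        train(run_dir)
        archived = archive_artifacts(run_dir, Path(tmp) / "archive", "exp-1")
        return sorted(p.name for p in archived.iterdir())
```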
Migrating from v0.0.10
If your domain has a v0.0.10 PROGRAM.md with mixed Goal/Dataset/Evaluate content:
- Create `SETUP.md` next to `PROGRAM.md` with the existing `## Dataset` and `## Evaluate` sections.
- Trim `PROGRAM.md` to `## Goal`, `## Target`, `## Success`, `## Notes`.
- Run `dojo task setup` again; the regression contract is now v4 (`train` receives `artifacts_dir`), so any frozen task needs re-verification anyway.
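If you have several domains to migrate, the first two steps can be scripted. The helper below is a hypothetical sketch, not part of Dojo: it assumes every section starts with a `## ` heading and routes `Dataset` and `Evaluate` sections to the new `SETUP.md` text while leaving everything else in `PROGRAM.md`.

```python
SETUP_SECTIONS = {"Dataset", "Evaluate"}


def split_program(text: str) -> tuple[str, str]:
    """Split mixed PROGRAM.md text into (program_text, setup_text)."""
    program_lines: list[str] = []
    setup_lines: list[str] = []
    target = program_lines  # content before the first heading stays in PROGRAM.md
    for line in text.splitlines():
        if line.startswith("## "):
            title = line[3:].strip()
            target = setup_lines if title in SETUP_SECTIONS else program_lines
        target.append(line)
    return "\n".join(program_lines) + "\n", "\n".join(setup_lines) + "\n"
```

Review the two outputs by hand before writing them to disk, then run `dojo task setup` as described in the last step.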
Running the server (optional)
If you want the web UI or HTTP API:
just run-stub # stub agent (no LLM, deterministic)
just run-claude # Claude agent (uses your local CLI auth)
Backend → http://localhost:8000 · Frontend → http://localhost:5173. The server reads the same .dojo/ your CLI commands write to, so a CLI-started run is visible in the UI and vice versa.
Config
Create .dojo/config.yaml in your project root:
agent:
  backend: stub # "stub" (no LLM) or "claude"
tracking:
  backend: file # "file" or "mlflow"
Or use environment variables:
DOJO_AGENT__BACKEND=claude
DOJO_TRACKING__BACKEND=mlflow
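The environment variable names follow a visible pattern: a `DOJO_` prefix, then the YAML path with `__` between levels. The sketch below shows that mapping with the standard library only. It is an illustration of the naming convention, not Dojo's loader (the project structure lists pydantic-settings for that), and `env_overrides` is a hypothetical name.

```python
def env_overrides(environ: dict[str, str], prefix: str = "DOJO_", delimiter: str = "__") -> dict:
    """Turn DOJO_SECTION__KEY=value pairs into a nested {section: {key: value}} dict."""
    config: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = name[len(prefix):].lower().split(delimiter)
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return config
```

For example, `DOJO_AGENT__TOOL_GENERATION_MODEL` lands at `agent.tool_generation_model`, the same key used in `.dojo/config.yaml`.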
Tests
just test # all tests
just lint # ruff check
just format # auto-fix lint + format
Project Structure
src/dojo/
  core/      # Domain, Task, Experiment, KnowledgeAtom, Workspace, state machine
  agents/    # AgentBackend ABC + Claude / Stub backends, orchestrator
  api/       # FastAPI app + routers (/domains, /experiments, /knowledge, /agent)
  cli/       # Typer CLI: init, run, task, runs, program, domain, config, start
  tools/     # Agent tools (experiments, knowledge, tracking) + AI tool generation
  runtime/   # LabEnvironment (DI), ExperimentService, ToolVerifier, program loader
  sandbox/   # LocalSandbox (subprocess); runs generated tools + agent code
  compute/   # Compute backends (LocalCompute today)
  storage/   # Local JSON adapters (domain, experiment, knowledge, run)
  tracking/  # FileTracker, MlflowTracker, NoopTracker
  config/    # pydantic-settings + YAML config
frontend/    # React 19 + Vite 7 + shadcn/ui (currently de-prioritized)
tests/       # unit, integration, e2e
Key API Endpoints
| Method | Path | Description |
|---|---|---|
| POST | `/domains` | Create a research domain |
| POST | `/domains/{id}/task` | Attach a Task (regression today) |
| POST | `/domains/{id}/tools/generate` | AI-generate `load_data` / `evaluate` from `SETUP.md`, verify against contract |
| POST | `/domains/{id}/task/freeze` | Freeze the task; gated on every required tool's verification |
| POST | `/domains/{id}/workspace/setup` | One-time workspace prep (venv + deps) |
| POST | `/agent/run` | Start an agent run on a domain (requires a frozen task) |
| GET | `/agent/runs/{id}/events` | Live SSE event stream |
| GET | `/experiments?domain_id=` | List experiments |
| GET | `/knowledge?domain_id=` | List knowledge atoms |
| GET | `/health` | Health check |
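The events endpoint streams Server-Sent Events, so a client has to reassemble `data:` lines into events. The parser below is a minimal sketch of the generic SSE framing and works on any iterable of text lines; it makes no assumption about the payload Dojo puts in each event, and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event in a line stream."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (": ...") and other fields (event:, id:, retry:) are ignored.
    # An unterminated trailing event is dropped, as SSE clients do.
```

Feed it the decoded lines of the HTTP response body from whichever client library you prefer.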
For architecture, conventions, and "how do I add X" recipes, see CLAUDE.md. For vision and the typed-Task design, see docs/MASTER_PLAN.md. For the ordered delivery punch-list, see docs/NEXT_STEPS.md. For the PyPI release process, see docs/RELEASING.md.
File details
Details for the file dojoml-0.0.12.tar.gz.
File metadata
- Download URL: dojoml-0.0.12.tar.gz
- Upload date:
- Size: 101.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc0b9c35a056a58a70c3f7358e3553d4a6706e451591bd3bd5f29dd3d3fa2939` |
| MD5 | `54790b33be8a506c2277e0bcf25a1617` |
| BLAKE2b-256 | `96ecf5c1ef7549db351bb7819044fe38efa5421eb49c2b4b62abb7da537b8fcb` |
Provenance
The following attestation bundles were made for dojoml-0.0.12.tar.gz:
Publisher: release.yml on Garsdal/Dojo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dojoml-0.0.12.tar.gz
- Subject digest: cc0b9c35a056a58a70c3f7358e3553d4a6706e451591bd3bd5f29dd3d3fa2939
- Sigstore transparency entry: 1453113370
- Sigstore integration time:
- Permalink: Garsdal/Dojo@da0554855d150f3695a31d13f3418b6ac3dfaf89
- Branch / Tag: refs/tags/v0.0.12
- Owner: https://github.com/Garsdal
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@da0554855d150f3695a31d13f3418b6ac3dfaf89
- Trigger Event: push
File details
Details for the file dojoml-0.0.12-py3-none-any.whl.
File metadata
- Download URL: dojoml-0.0.12-py3-none-any.whl
- Upload date:
- Size: 143.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1f7a90553cc4cf5a944720b5bc96acbe812274c67e578a3b09c6291919de7bd9` |
| MD5 | `eca0ce40c8d62b3923ea19fe861453c8` |
| BLAKE2b-256 | `c8f103e6f28899c94edb7bc3ebfbe25b135895b033afc79880e1b6efa03b358d` |
Provenance
The following attestation bundles were made for dojoml-0.0.12-py3-none-any.whl:
Publisher: release.yml on Garsdal/Dojo
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dojoml-0.0.12-py3-none-any.whl
- Subject digest: 1f7a90553cc4cf5a944720b5bc96acbe812274c67e578a3b09c6291919de7bd9
- Sigstore transparency entry: 1453113483
- Sigstore integration time:
- Permalink: Garsdal/Dojo@da0554855d150f3695a31d13f3418b6ac3dfaf89
- Branch / Tag: refs/tags/v0.0.12
- Owner: https://github.com/Garsdal
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@da0554855d150f3695a31d13f3418b6ac3dfaf89
- Trigger Event: push