AI-powered autonomous ML research framework — agent runs experiments against frozen evaluation contracts on your existing pipelines.

🥋 Dojo — An AI-powered autonomous ML research framework.

Run controlled, reproducible ML experiments on your existing pipelines and build a memory of what actually works.



What is Dojo?

You define a domain — a research area pointing at your data with a fixed evaluation contract. An AI agent runs experiments inside that contract: writing training code, calling frozen load_data and evaluate tools, logging metrics, and recording findings as durable knowledge atoms.

Domain (you define)
  ├── Task            — the contract: load_data + evaluate (frozen, AI-generated at setup)
  ├── Workspace       — your repo / pipeline (local path or git url)
  └── Experiments     — agent-created, many per domain
        └── Knowledge atoms — linked across experiments, accumulating over time

The agent owns the training code. The framework owns evaluation. That separation is what makes the metrics trustworthy run-over-run, and what makes it safe to leave the agent unsupervised.

Inspired by Karpathy's autoresearch: prepare.py is frozen, train.py is fair game, and program.md is what the human iterates on. Dojo generalises that pattern to any well-defined ML problem class.
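
The split between agent-owned training code and framework-owned evaluation can be pictured with a toy sketch. Everything below is illustrative: the bodies of load_data and evaluate are hypothetical stand-ins written by hand (in Dojo they are generated and frozen by dojo task setup), and train is the kind of code an agent would be free to rewrite between experiments.

```python
import math


# --- Frozen side (hypothetical stand-ins for the generated tools) ---
def load_data():
    """Return a fixed train/test split. Toy (x, y) pairs, not real data."""
    train_rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
    test_rows = [(5.0, 10.1), (6.0, 11.9)]
    return train_rows, test_rows


def evaluate(y_true, y_pred):
    """The only place metrics are computed; returns rmse, r2 and mae."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
        "mae": sum(abs(e) for e in errors) / n,
    }


# --- Agent side (free to change between experiments) ---
def train(train_rows):
    """Fit y = slope * x by least squares through the origin."""
    slope = sum(x * y for x, y in train_rows) / sum(x * x for x, _ in train_rows)
    return lambda x: slope * x


train_rows, test_rows = load_data()
model = train(train_rows)
metrics = evaluate([y for _, y in test_rows], [model(x) for x, _ in test_rows])
print(metrics)
```

Swapping train for a different model changes the numbers but not how they are measured, which is the property that makes results comparable across runs.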


Current Status

⚠️ Proof of Concept — under active development. Open source. Single-tenant, local-first, by design.

  • Agent: Claude Agent SDK (uses your local claude CLI auth — no API key needed for runs)
  • Compute: Local only (in-process / subprocess) — your data stays on your machine
  • Storage: Local JSON files in .dojo/
  • Tracking: File-based or MLflow (sits on top of an MLflow you already run)
  • Tasks supported: RegressionTask (more types to come once regression is solid)

Prerequisites

  • Python 3.13+
  • uv
  • just
  • The claude CLI logged in (Claude Code) — Dojo shells out to it; no ANTHROPIC_API_KEY needed
  • Node.js 18+ (only if you want the web UI)

Install dependencies:

just dev                     # install backend + frontend deps

Getting Started — California Housing in 4 commands

The CLI is a peer of the HTTP API, not a thin wrapper around it. The whole happy path runs in-process — no server needed.

mkdir housing && cd housing

# 1. Scaffold the domain (creates .dojo/, the regression Task, and PROGRAM.md)
dojo init --name housing --task-type regression --non-interactive

# 2. Describe the dataset, target, and what success looks like
$EDITOR PROGRAM.md

# 3. AI generates load_data + evaluate from PROGRAM.md, verifies them against
#    the regression contract, and freezes the task. Re-run after edits.
dojo task setup

# 4. Run the agent — events stream live to your terminal
dojo run --max-turns 30

If the AI keeps generating the wrong adapters (verification failures on real-world pipelines, e.g. unusual pandas multi-indexes, custom dataset APIs, or wrapping an existing evaluator), use Opus 4.7 for tool generation instead of the default Sonnet:

DOJO_AGENT__TOOL_GENERATION_MODEL=claude-opus-4-7 dojo task setup

Opus is slower (~30–60s vs 15–30s) but noticeably better at translating a messy PROGRAM.md into correct load_data / evaluate modules. Set it permanently in .dojo/config.yaml under agent.tool_generation_model if you want it as the default.

A reasonable starter PROGRAM.md for California housing:

## Goal
Predict California median house value (regression). Minimise RMSE on a 20% held-out test split.

## Dataset
Use `sklearn.datasets.fetch_california_housing(return_X_y=True)`.
Features and target both come back as numpy arrays — no column names needed.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

## Target
Median house value (in $100,000s) for census blocks in California.

## Success
Beat a linear baseline. Try at least one tree-based model. Avoid overfitting.

What happens under the hood:

  • dojo init writes .dojo/config.yaml, creates the domain + regression task with expected_metrics = [rmse, r2, mae], scaffolds PROGRAM.md, and sets current_domain_id.
  • dojo task setup reads PROGRAM.md, asks the AI to generate load_data + evaluate, runs each tool in a sandbox against its ToolContract, and freezes the task. Verification failures tell you which tool failed and why — fix PROGRAM.md (or the tool code) and re-run.
  • dojo run starts the agent in-process. The agent writes training code; load_data and evaluate stay frozen. The metric dict from evaluate is the only source of truth — complete_experiment rejects metric keys outside the contract, so the agent can't smuggle in custom numbers.
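
The key check described in the last bullet can be sketched as follows. This is a minimal illustration under assumptions, not Dojo's implementation: check_metrics and ContractViolation are hypothetical names, and only the expected_metrics list (rmse, r2, mae) comes from the text above.

```python
# From the text: the regression task is created with these expected metrics.
EXPECTED_METRICS = ("rmse", "r2", "mae")


class ContractViolation(ValueError):
    """Hypothetical error type for this sketch."""


def check_metrics(metrics, expected=EXPECTED_METRICS):
    """Reject any metric key that the task contract does not declare."""
    unexpected = sorted(set(metrics) - set(expected))
    if unexpected:
        raise ContractViolation(f"metric keys outside the contract: {unexpected}")
    return dict(metrics)


# Accepted: every key is declared by the contract.
check_metrics({"rmse": 0.52, "r2": 0.80, "mae": 0.37})

# Rejected: "my_score" is not part of the contract.
try:
    check_metrics({"rmse": 0.52, "my_score": 0.99})
except ContractViolation as exc:
    print(exc)
```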

Useful neighbours:

dojo task show               # current task status, tools, frozen?
dojo runs ls                 # recent runs
dojo runs show               # last run's events + cost
dojo program show            # print the live PROGRAM.md
dojo domain use <name>       # switch active domain

Stopping a run

dojo run blocks the foreground until the agent finishes. To stop it early:

  • Ctrl-C in the running terminal — the canonical path. The orchestrator is interrupted, the framework asks the backend to summarise any durable findings as knowledge atoms (a small one-shot LLM call), then prints a final cost line. A second Ctrl-C aborts the cleanup immediately.
  • dojo stop [run_id] from another terminal — marks the run STOPPED on disk. This does not halt an in-process foreground run (the orchestrator lives inside the other terminal's Python process); use it to recover records left RUNNING after a hard kill, or to stop server-mode runs.
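
The two-stage Ctrl-C behaviour can be sketched in plain Python. This is an assumption about the general pattern, not Dojo's orchestrator code; run_with_graceful_stop and both callbacks are hypothetical names.

```python
def run_with_graceful_stop(run, summarise):
    """Run until done; the first Ctrl-C triggers cleanup, a second one aborts it."""
    try:
        run()
    except KeyboardInterrupt:
        try:
            summarise()  # e.g. record durable findings before exiting
        except KeyboardInterrupt:
            return "aborted"  # second Ctrl-C arrived during cleanup
        return "stopped"
    return "completed"


def interrupted():
    """Stand-in for a step during which the user presses Ctrl-C."""
    raise KeyboardInterrupt


print(run_with_graceful_stop(lambda: None, lambda: None))   # completed
print(run_with_graceful_stop(interrupted, lambda: None))    # stopped
print(run_with_graceful_stop(interrupted, interrupted))     # aborted
```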

Reviewing what happened

dojo experiments ls          # rank experiments by the primary metric (best first)
dojo experiments best        # show the single best experiment so far
dojo experiments show <id>   # full detail: hypothesis, metrics, code path, errors
dojo runs show               # last run's events + total cost

dojo experiments ls orders by the task's primary_metric and direction (e.g. rmse minimised), so the leader sits on top regardless of run order. The agent's training code is preserved per-experiment in the workspace as __dojo_train_<experiment_id>.py; cat it to reproduce a run by hand.
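
Ordering by primary metric and direction amounts to a sort with a direction flag. The sketch below is illustrative only: ExperimentRecord, rank, and the "minimize" / "maximize" strings are hypothetical and may not match Dojo's own types or values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record: an id plus the metric dict returned by evaluate."""
    id: str
    metrics: dict


def rank(experiments, primary_metric="rmse", direction="minimize"):
    """Return experiments best-first for the given metric and direction."""
    return sorted(
        experiments,
        key=lambda e: e.metrics[primary_metric],
        reverse=(direction == "maximize"),
    )


runs = [
    ExperimentRecord("exp-1", {"rmse": 0.74, "r2": 0.58}),
    ExperimentRecord("exp-2", {"rmse": 0.47, "r2": 0.83}),
    ExperimentRecord("exp-3", {"rmse": 0.55, "r2": 0.77}),
]
print([e.id for e in rank(runs)])                    # lowest rmse first
print([e.id for e in rank(runs, "r2", "maximize")])  # highest r2 first
```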

Running the server (optional)

If you want the web UI or HTTP API:

just run-stub                # stub agent (no LLM, deterministic)
just run-claude              # Claude agent (uses your local CLI auth)

Backend → http://localhost:8000 · Frontend → http://localhost:5173. The server reads the same .dojo/ your CLI commands write to, so a CLI-started run is visible in the UI and vice versa.


Config

dojo init writes .dojo/config.yaml in your project root. Edit it, or create it by hand:

agent:
  backend: stub        # "stub" (no LLM) or "claude"
tracking:
  backend: file        # "file" or "mlflow"

Or use environment variables:

DOJO_AGENT__BACKEND=claude
DOJO_TRACKING__BACKEND=mlflow
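
The variable names suggest a DOJO_ prefix with a double underscore separating nesting levels, so DOJO_AGENT__BACKEND lines up with agent.backend in the YAML. The stdlib sketch below only illustrates that mapping under this assumption; env_overrides is a hypothetical helper, and Dojo itself loads config through pydantic-settings.

```python
def env_overrides(environ, prefix="DOJO_", delimiter="__"):
    """Turn DOJO_SECTION__KEY=value pairs into a nested dict of overrides."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated environment variables
        path = name[len(prefix):].lower().split(delimiter)
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return config


example_env = {
    "DOJO_AGENT__BACKEND": "claude",
    "DOJO_AGENT__TOOL_GENERATION_MODEL": "claude-opus-4-7",
    "DOJO_TRACKING__BACKEND": "mlflow",
    "HOME": "/home/me",
}
print(env_overrides(example_env))
```

In practice you would pass os.environ instead of the example dict.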

Tests

just test       # all tests
just lint       # ruff check
just format     # auto-fix lint + format

Project Structure

src/dojo/
  core/         # Domain, Task, Experiment, KnowledgeAtom, Workspace, state machine
  agents/       # AgentBackend ABC + Claude / Stub backends, orchestrator
  api/          # FastAPI app + routers (/domains, /experiments, /knowledge, /agent)
  cli/          # Typer CLI: init, run, task, runs, program, domain, config, start
  tools/        # Agent tools (experiments, knowledge, tracking) + AI tool generation
  runtime/      # LabEnvironment (DI), ExperimentService, ToolVerifier, program loader
  sandbox/      # LocalSandbox (subprocess); runs generated tools + agent code
  compute/      # Compute backends (LocalCompute today)
  storage/      # Local JSON adapters (domain, experiment, knowledge, run)
  tracking/     # FileTracker, MlflowTracker, NoopTracker
  config/       # pydantic-settings + YAML config
frontend/       # React 19 + Vite 7 + shadcn/ui (currently de-prioritized)
tests/          # unit, integration, e2e

Key API Endpoints

Method  Path                            Description
POST    /domains                        Create a research domain
POST    /domains/{id}/task              Attach a Task (regression today)
POST    /domains/{id}/tools/generate    AI-generate load_data / evaluate from PROGRAM.md, verify against contract
POST    /domains/{id}/task/freeze       Freeze the task (gated on every required tool's verification)
POST    /domains/{id}/workspace/setup   One-time workspace prep (venv + deps)
POST    /agent/run                      Start an agent run on a domain (requires a frozen task)
GET     /agent/runs/{id}/events         Live SSE event stream
GET     /experiments?domain_id=         List experiments
GET     /knowledge?domain_id=           List knowledge atoms
GET     /health                         Health check
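
To consume the SSE stream from /agent/runs/{id}/events, a client has to split the response into events. The sketch below follows the generic Server-Sent Events framing (data: lines, blank line ends an event, lines starting with a colon are comments). The payload shape of Dojo's events is not documented here, so the sample lines are made up and iter_sse_data is a hypothetical helper.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment or keep-alive line
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)


# Made-up sample stream; real events may carry different payloads.
sample = [
    'data: {"type": "start"}\n',
    "\n",
    ": keep-alive\n",
    "data: first line\n",
    "data: second line\n",
    "\n",
]
events = list(iter_sse_data(sample))
print(events)
```

With a live server, the lines would come from a streaming HTTP response (decoded to text) rather than a list.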

For architecture, conventions, and "how do I add X" recipes, see CLAUDE.md. For vision and the typed-Task design, see docs/MASTER_PLAN.md. For the ordered delivery punch-list, see docs/NEXT_STEPS.md. For the PyPI release process, see docs/RELEASING.md.
