Autonomous research-aware iteration agent for ML models and LLM prompts. Multi-LLM backend (Claude/GPT/Llama/Deepseek), persistent memory, literature-grounded experiments, human-approval safety gates.

iterate

Autonomous research-aware iteration agent for ML models and LLM prompts.

Every YC batch ships 200+ AI startups with teams of 2-3 engineers. Under shipping pressure, two things break: nobody re-iterates models against new baselines, and LLM prompts sit in production for months untouched. Engineers re-run failed experiments because nobody logged why they failed. Teams pay GPT-5 prices because nobody tested whether Haiku + better prompting would do the job at 1/50th the cost.

AutoML brute-forces. Experiment trackers only log. Prompt evals only evaluate. AIDE iterates Kaggle problems once. iterate is designed to run an autonomous, literature-aware, memory-persistent improvement loop on ML models, DL/vision models, and LLM prompts in production. It optimizes for the best model you can actually afford to serve (cheapest cloud, cost/month, requests/hour), pulls its own training data and context from your DBs, files, and docs (via MCP), and keeps human-approval safety gates and append-only reasoning logs written to wherever your team reads.

How this gets built: WORKFLOW.md (the method) · DECISIONS.md (every call I made against the AI's default)

Status

v0.1 is released: the working agentic loop on tabular ML. Install with pip install iterate-ai. Incremental releases follow through to the full v1.0 (~early Sep 2026).

Agent-first: the autonomous loop is the v0.1 milestone (Week 3), not a late-stage add-on. After that, two dials turn from release to release: the inputs you must give shrink (toward one-sentence input) and the problem types grow (tabular → prompts → DL/vision). The full roadmap and daily trail are in BUILD_LOG.md.

Week	Phase	Status
0	Scaffolding + scope lock	done
1	Foundation — schemas + LLM client (tool-calling) + config + CLI	done
2	Tabular execution substrate — `BenchmarkTarget` + data adapter + `ModelTarget` + model factory + local executor	done
3	The agentic loop — Proposer + Orchestrator + Terminator + Memory + CLI → first autonomous tabular run (v0.1)	done
4–5	Sandboxed code-gen + cheap interactive wins (live progress, streaming, Ctrl-C) (v0.2)	—
6	Full interactive CLI — pause, mid-run chat, resume (v0.3)	—
7	Agent picks the metric + starting model (v0.4)	—
8	`PromptTarget` — agentic prompt iteration (v0.5)	—
9	`DLModelTarget` — vision transfer learning, validated on local RTX 4050 (v0.6)	—
10	Cost-constrained recommendation + serving profile + `iterate cost` (v0.7)	—
11	Infer features/target from the data + a description (v0.8)	—
12	MCP discovery — find the data/code itself (v0.9)	—
13	Multi-backend benchmark + Streamlit chat UI + demos (v0.10)	—
14	Full minimum-viable-input + polish + launch (v1.0)	—

What it does

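The v1.0 goal: you give it one input, and it figures out the rest. (v0.1 still takes explicit --data, --target, and --metric flags; see Quick start.)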

> iterate "improve our customer churn baseline"

Everything else is discovered:

The agent autonomously finds	How
Which repo has the code	Filesystem + GitHub MCP — scan READMEs for keywords, fall back to recent commit activity
Training script + current baseline metric	Code parsing + MLflow / W&B MCP — extract from runs, comments, results JSON
Eval methodology + holdout split	Filesystem — find test/eval scripts
Relevant data tables	Postgres MCP — list, sample, infer relationships
Past experiment history (and why things failed)	Notion MCP — semantic search
Domain context	Synthesize from READMEs + commit messages

It then surfaces what it found and pauses for your gap-fill before iterating:

agent> I found:
   Repo:       customer-platform/ml-models/churn (last commit: 3d ago)
   Training:   train_churn.py — CatBoost, F1=0.78 baseline
   Past tries: 4 attempts in Notion. Best: March, tenure features, F1=0.78.
   Tables:     users, subscriptions, support_tickets, events
   Missing:    No eval script located. Where does evaluation live?

> Eval is in customer-platform/eval/churn_eval.py. Also new plan_tier column.

agent> Got it. My top recommendation:
   → LightGBM + focal loss (Lin et al. 2017): addresses the class imbalance
     that broke March's attempt. Est +0.04 F1, 4 min runtime. Go?

Then the autonomous loop:

1. Research — arXiv + Papers with Code for relevant 2024-2026 work
2. Propose — LLM ranks candidate experiments by expected score gain
3. Memory check — has this been tried? did conditions change since the last failure?
4. Run — execute in a sandboxed environment
5. Score — compare against baseline
6. Log — write a reasoning-trail card to your logging target
7. Decide — continue or terminate (deadline / patience / plateau / idea-exhaustion)

Every decision cites either a paper or a past experiment. Every failure is logged with the reason it failed so the agent can revisit when conditions change.
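The seven steps above can be read as one bounded loop. The sketch below is a minimal illustration under stated assumptions, not iterate's actual orchestrator: `run_loop`, `LoopResult`, and every parameter name are hypothetical, the research and proposal steps are reduced to a pre-ranked list of ideas, the memory check is a plain set, and plateau detection is folded into the patience counter through `min_gain`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class LoopResult:
    best_score: float
    stop_reason: str
    log: list  # one (idea, score, note) tuple per experiment


def run_loop(
    ideas: Iterable[str],                    # ranked proposals (steps 1-2)
    run_experiment: Callable[[str], float],  # steps 4-5: run and score
    baseline: float,
    deadline_s: float = 60.0,                # wall-clock budget
    patience: int = 3,                       # non-improving runs tolerated
    min_gain: float = 1e-3,                  # smaller gains count as a plateau
    already_tried: Optional[set] = None,     # step 3: stand-in for memory
) -> LoopResult:
    tried = set(already_tried or ())
    best, stale, log = baseline, 0, []
    start = time.monotonic()
    for idea in ideas:
        if time.monotonic() - start > deadline_s:
            return LoopResult(best, "deadline", log)
        if idea in tried:
            log.append((idea, None, "skipped: already tried"))
            continue
        tried.add(idea)
        score = run_experiment(idea)
        if score > best + min_gain:
            # Step 6: log the win together with the reason it counts as one.
            log.append((idea, score, f"win: +{score - best:.3f} over best"))
            best, stale = score, 0
        else:
            log.append((idea, score, "no improvement"))
            stale += 1
        # Step 7: stop once too many consecutive runs failed to improve.
        if stale >= patience:
            return LoopResult(best, "patience", log)
    return LoopResult(best, "idea-exhaustion", log)


# Illustrative scores only; a real run would train and evaluate a model.
scores = {"lightgbm": 0.80, "focal-loss": 0.82, "more-trees": 0.82, "deeper": 0.81}
result = run_loop(scores, scores.get, baseline=0.78, patience=2)
print(result.stop_reason, result.best_score)
```

Here the loop stops on patience after two consecutive experiments fail to beat the best score by at least `min_gain`. Whatever reason ends the run, the log keeps one entry per idea it reached.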

Three target families

Target	What it iterates on	Example demo (ships with the framework)
`ModelTarget`	Trains a tabular model, scores it on a holdout	Tabular churn prediction (Kaggle)
`DLModelTarget`	Transfer-learns a vision model (fine-tunes a pretrained backbone), scores it	Image classification (validated on local RTX 4050)
`PromptTarget`	Runs an LLM prompt in production, scores outputs (LLM-as-judge or labeled set)	Jigsaw toxicity classification

All inherit from BenchmarkTarget. Same iteration loop. Different execution path. (LLMs are prompt-iteration only — we don't fine-tune foundation models.)
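To make "same iteration loop, different execution path" concrete, here is a hedged sketch of what a shared target interface could look like. The method names `run` and `score`, the toy classes, and the `evaluate` helper are all assumptions for illustration; the real `BenchmarkTarget` may differ. The sketch uses `typing.Protocol`, so the toy targets satisfy the interface structurally rather than by inheritance.

```python
from typing import Any, Protocol


class BenchmarkTarget(Protocol):
    """Hypothetical interface: anything the loop can run and then score."""

    def run(self, config: dict[str, Any]) -> Any: ...
    def score(self, artifact: Any) -> float: ...


class ToyModelTarget:
    def __init__(self, rows):
        self.rows = rows  # (feature, label) pairs standing in for a holdout set

    def run(self, config):
        # "Training" here just picks a decision threshold from the config.
        return config["threshold"]

    def score(self, artifact):
        hits = sum((x >= artifact) == bool(y) for x, y in self.rows)
        return hits / len(self.rows)


class ToyPromptTarget:
    def __init__(self, labeled):
        self.labeled = labeled  # (text, is_toxic) pairs

    def run(self, config):
        # A real target would call an LLM; a keyword match keeps this offline.
        return [config["keyword"] in text for text, _ in self.labeled]

    def score(self, artifact):
        hits = sum(pred == label for pred, (_, label) in zip(artifact, self.labeled))
        return hits / len(self.labeled)


def evaluate(target: BenchmarkTarget, config: dict[str, Any]) -> float:
    # The loop only ever needs this call, whatever the target family is.
    return target.score(target.run(config))


model_score = evaluate(
    ToyModelTarget([(0.2, 0), (0.9, 1), (0.6, 1), (0.4, 0)]), {"threshold": 0.7}
)
prompt_score = evaluate(
    ToyPromptTarget([("you are awful", True), ("nice work", False)]), {"keyword": "awful"}
)
print(model_score, prompt_score)
```

The design point is that the orchestrator depends only on `evaluate`, so adding a new target family would not change the loop itself.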

Pluggable tools + data sources (via MCP)

iterate uses Model Context Protocol (MCP) servers as its tool + data layer. Adding a new data source is config-only, with no code changes.

Ships with:

MCP server	What it enables
`filesystem`	Read local notebooks, past experiment logs, internal docs
`postgres`	DB introspection + read-only sampling (for data discovery)
`notion`	Search past experiment pages, write new experiment cards

The discovery workflow (Week 12 feature, v0.9):

> iterate init --target churn_baseline --discover

[agent introspects via MCP]
  postgres.list_tables             → users, subs, tickets, ...
  postgres.describe_table("users") → schema
  notion.search("churn")           → 3 past experiment pages
  filesystem.search("churn|retention") → 2 local notebooks

Agent SUMMARY (paused for human review):
  Found 8 tables. Likely relevant: users, subscriptions, support_tickets.
  Past experiments in Notion: 3 attempts, best F1=0.78 (CatBoost, March).
  Inferred target: users.churned_30d. Inferred metric: F1.
  
  Any other artifacts I should know about?
  > [paste URLs, additional context, then 'go']

Add any other MCP server (Drive, GitHub, Slack, Sentry, custom) by editing one YAML file. The MCP-to-OpenAI-tool bridge layer means it works against Ollama, Groq, Together, DeepSeek, OpenAI, and Anthropic alike.
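The bridge idea can be sketched in a few lines. An MCP tool descriptor carries its argument JSON Schema under `inputSchema`, while OpenAI-compatible chat APIs expect a function tool with that schema under `parameters`. The function below is a minimal illustration, not iterate's actual bridge: `mcp_tool_to_openai`, the `server__tool` naming scheme, and the example schema for `describe_table` are assumptions. The double-underscore prefix is used because OpenAI-style function names allow only letters, digits, underscores, and hyphens, so a dotted name like `postgres.describe_table` would not pass as-is.

```python
import json


def mcp_tool_to_openai(server: str, tool: dict) -> dict:
    """Map one MCP tool descriptor onto an OpenAI-style function tool."""
    return {
        "type": "function",
        "function": {
            # Hypothetical namespacing so tools from different servers
            # cannot collide.
            "name": f"{server}__{tool['name']}",
            "description": tool.get("description", ""),
            # MCP: "inputSchema"  ->  OpenAI-compatible: "parameters"
            "parameters": tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }


# Illustrative descriptor; the real postgres server's schema may differ.
describe_table = {
    "name": "describe_table",
    "description": "Return the column names and types of one table.",
    "inputSchema": {
        "type": "object",
        "properties": {"table": {"type": "string"}},
        "required": ["table"],
    },
}

openai_tool = mcp_tool_to_openai("postgres", describe_table)
print(json.dumps(openai_tool, indent=2))
```

Because the output is plain OpenAI tool JSON, the same list can in principle be passed to any backend that accepts the OpenAI `tools` parameter.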

Quick start (v0.1)

Local-first. $0. No API keys required. v0.1 iterates tabular models — it chooses the best model + hyperparameters from scikit-learn / XGBoost / LightGBM for a prepared dataset.

# 1. Install Ollama + the tool-calling model (one-time)
brew install ollama
ollama pull qwen3:14b          # ~9.3 GB
ollama serve                   # background server at localhost:11434

# 2. Install iterate (heads-up: pulls scikit-learn / XGBoost / LightGBM)
pip install iterate-ai         # "iterate" was taken on PyPI; the command is still `iterate`

# 3. Prepare a tabular CSV (your standard ML data cleaning) and run
iterate run --data train.clean.csv --target churn --metric f1

# Seed the baseline from an existing notebook/script (read as text, never executed):
iterate run --data train.clean.csv --target churn --metric f1 \
            --source baseline_notebook.ipynb --baseline 0.78

# Use a cloud model instead of local Ollama:
iterate run --data train.clean.csv --target churn --metric f1 \
            --backend openai-compatible --base-url https://api.groq.com/openai/v1 \
            --model llama-3.3-70b --api-key "$GROQ_API_KEY"

The best model is saved to .iterate/runs/<run_id>/best_model.joblib (override with --output) — load and use it directly: joblib.load(path).predict(X). Every experiment persists in .iterate/memory.db, so the next run builds on it.
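How a persistent memory lets "the next run build on it" can be sketched with the standard-library `sqlite3` module. Everything below is an assumption for illustration: the table layout, the column names, the `data_fingerprint` idea (a label for the conditions an experiment ran under), and the helper functions are hypothetical, and the real schema of `.iterate/memory.db` is not documented here. The example uses an in-memory database so it runs anywhere.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS experiments (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    idea             TEXT NOT NULL,
    data_fingerprint TEXT NOT NULL,  -- conditions the run happened under
    score            REAL,
    failure_reason   TEXT,           -- NULL when the run succeeded
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, idea, data_fingerprint, score=None, failure_reason=None):
    # Append-only: rows are inserted, never updated or deleted.
    with conn:
        conn.execute(
            "INSERT INTO experiments (idea, data_fingerprint, score, failure_reason) "
            "VALUES (?, ?, ?, ?)",
            (idea, data_fingerprint, score, failure_reason),
        )


def should_run(conn, idea, data_fingerprint) -> bool:
    """True if the idea is new, or it last failed under different conditions."""
    row = conn.execute(
        "SELECT data_fingerprint, failure_reason FROM experiments "
        "WHERE idea = ? ORDER BY id DESC LIMIT 1",
        (idea,),
    ).fetchone()
    if row is None:
        return True
    last_fingerprint, failure_reason = row
    return failure_reason is not None and last_fingerprint != data_fingerprint


conn = open_memory()
record(conn, "focal-loss", "schema-v1", failure_reason="class imbalance")
print(should_run(conn, "focal-loss", "schema-v1"))       # same conditions
print(should_run(conn, "focal-loss", "schema-v2"))       # conditions changed
print(should_run(conn, "tenure-features", "schema-v1"))  # never tried
```

Storing the failure reason next to the conditions is what makes a retry decision possible later: a failed idea is only reconsidered once the fingerprint differs from the one it failed under.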

Full CLI reference: iterate run --help

Note on the one-line form. The iterate "improve our churn baseline" experience — where the agent discovers the data, baseline, and metric itself — is the v1.0 vision, not v0.1. Today you pass --data/--target/--metric explicitly; the inputs shrink release by release (see the roadmap). Auto-discovery, iterate history / why-failed / best, prompt + vision targets, and cost-constrained serving are all on the roadmap, not shipped in v0.1.

Demo UI (Week 13, v0.10)

A Streamlit-based chat interface that looks and feels like a desktop app — launches in your browser, runs entirely locally, screenshot-ready for demos:

Sidebar (live state):
  MCP status   filesystem, postgres, notion, github  (connected)
  Experiments  #001 win +0.04    #002 fail    #003 retry
  Memory       47 entries, 12 retried
  Cost         $0.03 today

Chat:
  > iterate "improve churn baseline"
    Scanning your repos... found 3 candidates.
    Reading customer-platform/... baseline CatBoost F1=0.78.
    Found 4 past experiments in Notion (best: March, tenure features).
    Anything else I should know about?
  > Eval lives in eval/churn_eval.py
    Got it. Top recommendation: ...

The CLI is the canonical interface. The Streamlit chat is the demo-ready one.

Architecture

iterate/
├── core/                # framework reasoning engine
│   ├── orchestrator     # the main loop
│   ├── researcher       # arXiv + Papers with Code retrieval
│   ├── proposer         # ranks candidates by score, within the serving budget
│   ├── serving_cost     # cheapest-cloud + cost/mo + req/hr estimator
│   ├── memory           # persistent store (sqlite)
│   ├── terminator       # deadline / patience / plateau gates
│   └── reporter         # PR-shaped report generator
├── targets/             # what gets iterated on
│   ├── base             # BenchmarkTarget protocol
│   ├── model            # ModelTarget (tabular)
│   ├── dl_model         # DLModelTarget (vision, transfer learning)
│   └── prompt           # PromptTarget (prompt-iteration)
├── adapters/            # pluggable I/O
│   ├── data/            # csv, kaggle, huggingface, postgres
│   ├── models/          # sklearn, xgboost, lightgbm, pytorch
│   ├── compute/         # local (MPS), gpu (RTX 4050), e2b, cloud (user's cloud / rented GPU)
│   └── logging/         # markdown, notion_mcp, slack
├── llm/                  # pluggable multi-backend LLM client
│   ├── base              # LLMClient protocol (provider-agnostic interface)
│   ├── openai_compatible # one client for ALL OpenAI-compatible backends
│   │                     #   (Ollama default · Groq · Together · Deepseek · OpenAI · vLLM)
│   └── anthropic_client  # Claude — the only non-OpenAI-compatible backend (optional, later)
└── schemas/             # Pydantic types

The LLM is plug-and-play: Claude, GPT, Llama 3.3, or DeepSeek, selected by a config flag. The moat is the agentic harness (memory + research + tools + bounded loop), not the model. The optimization target is the best model you can actually afford to serve: the highest score inside a hard serving-cost budget, reported with the cheapest cloud to host it on, its monthly cost, and its requests/hour throughput.
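"Highest score inside a hard serving-cost budget" amounts to a constrained selection: filter out candidates that cannot be served within the budget, then take the best score among what remains. The sketch below shows that rule under stated assumptions. `Candidate`, `best_affordable`, and the field names are hypothetical, and the numbers are made up for illustration rather than measured.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float              # holdout metric, higher is better
    monthly_cost_usd: float   # estimated cost to serve
    requests_per_hour: int    # estimated throughput


def best_affordable(
    candidates: Iterable[Candidate],
    budget_usd: float,
    min_requests_per_hour: int = 0,
) -> Optional[Candidate]:
    # The budget and throughput are hard constraints, not weighted terms.
    feasible = [
        c for c in candidates
        if c.monthly_cost_usd <= budget_usd
        and c.requests_per_hour >= min_requests_per_hour
    ]
    if not feasible:
        return None
    # Highest score wins; the cheaper candidate breaks ties.
    return max(feasible, key=lambda c: (c.score, -c.monthly_cost_usd))


# Illustrative figures only.
pool = [
    Candidate("large", score=0.86, monthly_cost_usd=900.0, requests_per_hour=5_000),
    Candidate("medium", score=0.84, monthly_cost_usd=120.0, requests_per_hour=20_000),
    Candidate("small", score=0.80, monthly_cost_usd=40.0, requests_per_hour=50_000),
]
pick = best_affordable(pool, budget_usd=200.0)
print(pick.name if pick else "nothing fits the budget")
```

Treating cost as a constraint instead of a term in a blended objective means a large score gain can never justify exceeding the budget.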

Multi-LLM backend (planned Week 13 benchmark)

Backend	Est. cost per run	Notes
Claude Opus 4.7	~$4	Best tool-use reliability
Claude Haiku 4.5	~$0.30	Recommended hosted default (the local default is Ollama)
Llama 3.3 70B (Together)	~$0.20	Free tier available via Groq
DeepSeek V3	~$0.10	Strong on code

Week 13 will ship the head-to-head matrix on identical tasks, scored on quality and serving cost.

Comparison with existing tools

Capability	AutoML (DataRobot/H2O)	W&B / MLflow	Braintrust / LangSmith	AIDE	iterate
Iterates ML models	✓	—	—	✓	✓
Iterates DL / vision models (transfer learning)	partial	—	—	partial	✓
Iterates LLM prompts	—	—	eval only	—	✓
Literature-aware	✗	✗	✗	partial	✓
Persistent memory across sessions	✗	log only	✗	✗	✓
Revisits failures when conditions change	✗	✗	✗	✗	✓
Bounded autonomy (deadline / patience)	✗	✗	✗	partial	✓
Auditable reasoning trail	✗	—	✗	basic	✓
Human-approval gate	✗	n/a	✗	✗	✓
Logs to Notion / Drive / MD	✗	own dashboard	own dashboard	✗	✓
Multi-LLM backend	✗	n/a	partial	✗	✓
Cost-to-serve–aware optimization (cheapest cloud, $/mo, req/hr — best score you can afford to serve)	✗	✗	✗	✗	✓
Auto-discovers training data + context (DB / MCP / Drive)	✗	✗	✗	partial	✓
Open-source	mostly ✗	MLflow yes	✗	✓	✓

Why this exists

Production AI teams forget what they've tried. So they keep retrying it. iterate is the institutional memory + research desk + experiment runner those teams don't have time to build.

License

MIT (planned). The framework is open-source. Adapters for proprietary data sources can be built on top.

Author

Anthony Rodrigues — GitHub

Release history

0.1.3 (this version): May 31, 2026
0.1.2: May 31, 2026
0.1.0: May 31, 2026
