mlcompass

An LLM agent that sits next to you through your whole ML pipeline: from data, through training, all the way to deployment.

🚧 Alpha (v0.6.0), under active development. APIs may change before v1.0.

What it does

mlcompass is a single CLI that follows your ML project from data to production, keeping context across every step.

data.csv         train.py          two runs        results.csv      production
   │                │                  │                │                │
   ▼                ▼                  ▼                ▼                ▼
 advise   ────►   audit   ────►    compare     ────► evaluate ─────►  deploy
                  watch

Each command writes to and reads from a shared project context (.mlcompass/), so by the time you reach deploy, the tool already knows your dataset, your model choice, your training history, and your evaluation results.

What's in v0.6

Eleven commands: every stage of the ML pipeline, post-deploy drift detection, hyperparameter optimization, a status inspector, and a self-driving agent with cross-session memory.

| Command | When you run it | What you get | Status |
|---|---|---|---|
| init | Starting a new project | A .mlcompass/ folder that tracks decisions | ✅ v0.1 |
| advise | You have a CSV, what now? | Models to try, features to derive, pitfalls to avoid | ✅ v0.1 |
| audit | Before you press train | Static analysis of training script (seed, val, optimizer, …) | ✅ v0.2 |
| watch | While training runs | Plateau / overfit / NaN / divergence (plain log / TB / W&B) | ✅ v0.2 |
| compare | After several runs | Side-by-side config + final-metric diff with verdict | ✅ v0.2 |
| evaluate | Training done | Metrics, threshold sweep, confusion matrix, leakage-smell | ✅ v0.3 |
| deploy | Going to production | Model + deps + target-specific checks + production checklist | ✅ v0.3 |
| status | Any time | Project metadata, active state, command activity, decisions | ✅ v0.3 |
| agent | "Just do it for me" | LLM-driven router across the other tools, with memory | ✅ v0.5 |
| monitor | Model deployed; new data flowing | PSI + KS + chi² drift across features, retrain verdict | ✅ v0.6 |
| optimize | You have a few runs; what's next? | HPO sub-agent: leaderboard, sensitivity, N suggested configs | ✅ v0.6 |

Every command except init, status, and agent keeps a fully deterministic default path and offers an opt-in --llm flag that adds a Claude-driven interpretation step on top. The agent command is the inverse: LLM-first by design, with the other tools as its hands, and it now remembers across runs via per-project memory.

Install

pip install mlcompass
export ANTHROPIC_API_KEY="sk-ant-..."   # only needed for --llm modes

Optional extras:

pip install "mlcompass[tensorboard]"          # adds tbparse for TB event files
pip install "mlcompass[mcp]"                  # adds the Claude / Cursor MCP server
pip install "mlcompass[agent]"                # adds the self-driving agent (anthropic API)
pip install "mlcompass[agent-claude-code]"    # alt agent backend via Claude Code CLI

Use from Claude Desktop / Cursor (MCP)

mlcompass ships a Model Context Protocol server, so any MCP-capable client (Claude Desktop, Claude Code, Cursor, Continue, …) can call its eight tools directly: you describe the situation in natural language, and the assistant picks the right mlcompass_* tool and feeds the result back into the conversation.

pip install "mlcompass[mcp]"

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "mlcompass": {
      "command": "mlcompass-mcp"
    }
  }
}

Cursor (.cursor/mcp.json in your project, or ~/.cursor/mcp.json):

{
  "mcpServers": {
    "mlcompass": {
      "command": "mlcompass-mcp"
    }
  }
}
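
If the client config file already lists other MCP servers, the mlcompass entry has to be merged in rather than pasted over the whole file. A minimal sketch of that merge, assuming the file is a JSON object as in the snippets above; the helper name `add_mcp_server` is hypothetical and not part of mlcompass:

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str = "mlcompass",
                   command: str = "mlcompass-mcp") -> dict:
    """Add (or overwrite) one entry under "mcpServers", keeping everything else."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8").strip()
        if text:
            config = json.loads(text)
    # Create the "mcpServers" object if the file did not have one yet.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call for a project-level Cursor config:
# add_mcp_server(Path(".cursor/mcp.json"))
```

Back up the existing config before running anything like this against a real client.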

Restart the client and the eight tools appear:

| Tool | Use it when… |
|---|---|
| mlcompass_init | Starting a new project |
| mlcompass_advise | Asking the assistant to look at a dataset |
| mlcompass_audit | Asking the assistant to review a training script |
| mlcompass_watch | Pointing the assistant at a training log / TB / W&B run |
| mlcompass_compare | "Which of these two runs is better, and why?" |
| mlcompass_evaluate | "Read these predictions and tell me what they mean" |
| mlcompass_deploy | "Is this model ready to ship to Lambda?" |
| mlcompass_status | "What does this project look like right now?" |

All tools are deterministic: the assistant reads their structured output and does its own interpretation, with full access to your conversation's context. The CLI stays available for scripted use and for the --llm reasoning modes.

Use as a self-driving agent (CLI)

When you're not in Claude Desktop (CI runs, cron jobs, an ssh session on a GPU box), you can let an agent drive the same eight tools from the terminal:

pip install "mlcompass[agent]"
export ANTHROPIC_API_KEY="sk-ant-..."

mlcompass agent "I have data.csv, take me from raw data to a model recommendation"

The agent picks tools (mlcompass_advise, then mlcompass_status, then …), streams every reasoning step, tool call, and tool result to the terminal, and writes a transcript under .mlcompass/agent_runs/<id>/transcript.jsonl plus a human-readable summary.md next to it.

Two backends

| Backend | Dependency | Best for |
|---|---|---|
| api (default) | mlcompass[agent] | Universal: API key + nothing else. |
| claude-code | mlcompass[agent-claude-code] + the claude CLI on PATH | Power users already on Claude Code; routes through Anthropic's official Agent SDK. |

# Default: talks straight to the Anthropic API.
mlcompass agent "Compare run-3 and run-7" --project-path .

# Alt: routes through your local Claude Code CLI.
pip install "mlcompass[agent-claude-code]"
mlcompass agent "Audit train.py and tell me what to fix" --backend claude-code

# Headless / CI: skip the y/N permission prompt for mutating tools.
mlcompass agent "Init a new churn project here" --auto-approve

# Cap the safety budget if you're worried about runaway loops.
mlcompass agent "Diagnose this run" --max-turns 10 --model claude-sonnet-4-5

The agent will ask before mutating by default; the only mutating tool is mlcompass_init. Read/compute tools (advise, audit, watch, compare, evaluate, deploy, status) auto-allow. Add --auto-approve to skip the prompt for headless runs.
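
The gating rule described above can be sketched in a few lines. This illustrates the policy only and is not mlcompass's or agentlite's actual code; `MUTATING_TOOLS` and `is_allowed` are hypothetical names, and the prompt text is an assumption:

```python
# Hypothetical policy table: per the text, only mlcompass_init mutates anything.
MUTATING_TOOLS = {"mlcompass_init"}


def is_allowed(tool_name: str, auto_approve: bool = False, ask=input) -> bool:
    """Return True if the tool call may proceed.

    Read/compute tools pass straight through; mutating tools need either
    auto_approve (the --auto-approve case) or an explicit "y" from the user.
    """
    if tool_name not in MUTATING_TOOLS:
        return True
    if auto_approve:
        return True
    answer = ask(f"Allow {tool_name}? [y/N] ")
    # Anything other than y/yes counts as "no", matching a y/N default.
    return answer.strip().lower() in {"y", "yes"}
```

Passing `ask` as a parameter keeps the check testable without a real terminal.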

Five-minute tour

mlcompass init my-project

# Pre-training
mlcompass advise data.csv --target churn

# Training-time
mlcompass audit train.py                     # static checks
mlcompass audit train.py --llm               # + prioritized synthesis
mlcompass watch train.log                    # one-shot plain-text scan
mlcompass watch runs/tb_run/                 # TensorBoard event files
mlcompass watch wandb/run-001/               # W&B local run directory
mlcompass watch train.log --follow           # live tail (plain-text only)
mlcompass watch train.log --llm              # + diagnostician
mlcompass watch train.log --llm \            # + permission-gated edits
  --apply --config train.yaml                #   (prompted per change)

# Comparing runs
mlcompass compare run-3 run-7                # deterministic diff
mlcompass compare run-3 run-7 --llm          # + hypothesis + next experiment

# Post-training
mlcompass evaluate results.csv               # metrics + threshold sweep
mlcompass evaluate results.csv --llm         # + assessment + next steps

# Deployment
mlcompass deploy model.pt                    # model + checklist
mlcompass deploy model.pt --requirements reqs.txt --target lambda
mlcompass deploy model.pt --llm              # + production verdict

# Any time: what does the project look like right now?
mlcompass status
mlcompass status --recent 10                 # last 10 decisions

# Let the agent drive the whole pipeline
mlcompass agent "I have data.csv, take me to a deployed model"
mlcompass agent "Compare run-3 and run-7" --backend claude-code
mlcompass agent "Init a new project here" --auto-approve
mlcompass agent "Continue what we started" --resume 20260530-150000

# Post-deploy drift check
mlcompass monitor reference.csv current.csv             # PSI/KS/chi² per feature
mlcompass monitor reference.csv current.csv --llm       # + LLM interpretation

# What hyperparameters to try next?
mlcompass optimize --metric val_acc                     # reads .mlcompass/runs/
mlcompass optimize --metric val_acc --llm               # + strategist plan
mlcompass optimize --metric val_acc \
  --constraints "lr:0.0001-0.1,batch_size:16-256"       # bound the search
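
For readers unfamiliar with PSI (Population Stability Index), one of the drift statistics the monitor command reports, here is a minimal standard-library sketch for a single numeric feature. How mlcompass bins its data is not documented in this README; the equal-width bins over the reference range, the bin count, and the `eps` floor are all assumptions:

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = list(range(100))
shifted = [v + 30 for v in reference]
print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, shifted))    # a large shift gives a clearly positive value
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift; whether mlcompass uses those cut-offs for its retrain verdict is not stated here.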

Example: advise

mlcompass advise examples/customer_churn.csv
📊 Dataset analysis
   Path:    examples/customer_churn.csv
   Shape:   500 rows × 8 columns
   Target:  churn (high confidence)
   Task:    binary classification (0=98%, 1=2%)

⚠ Warnings
  • Class imbalance detected (1.6% minority class). Don't optimise
    accuracy; use AUC/F1/recall@k. Consider class_weight='balanced'
    or focal loss.

✨ Recommended models  (with --llm)
  • XGBoost                 AUC 0.78 – 0.83
  • Logistic Regression     AUC 0.70 – 0.74
  • LightGBM                AUC 0.78 – 0.84
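
The class-imbalance warning in the output above comes down to a simple share calculation. A rough sketch of such a check, not mlcompass's implementation: the 10% threshold and both function names are assumptions, and the label counts are invented to match the 500-row, 1.6% figures shown:

```python
from collections import Counter


def minority_share(labels):
    """Return (minority_label, share_of_rows) for a sequence of class labels."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    label, n = min(counts.items(), key=lambda kv: kv[1])
    return label, n / sum(counts.values())


def imbalance_warning(labels, threshold=0.10):
    """Return a warning string if the minority class is rarer than `threshold`."""
    label, share = minority_share(labels)
    if share < threshold:
        return f"Class imbalance detected ({share:.1%} minority class, label={label!r})"
    return None


# Illustrative data: 500 rows with 8 positives, i.e. a 1.6% minority class.
labels = [0] * 492 + [1] * 8
print(imbalance_warning(labels))
```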

Example: audit

mlcompass audit train.py
🔎 Script audit
   Path: train.py | Lines: 23 | Frameworks: torch

   ✗ error    seed              No random seed set anywhere
   ✗ error    optimizer   L17   Adam does not accept momentum=
   ⚠ warning  val_split         No validation split detected
   ⚠ warning  grad_clipping L8  LSTM but no clip_grad_norm_
   ⚠ warning  dataloader  L20   DataLoader missing shuffle=
   ⚠ warning  loss_stability L23 log(x) without epsilon clipping
   ℹ info     batch_size  L20   batch_size=1 is very small

   Summary: 2 error   4 warning   1 info

Eight pure-AST rules:

| Rule | Catches |
|---|---|
| seed | No torch.manual_seed / np.random.seed / set_seed call |
| val_split | No split detected, or split implausibly small |
| optimizer | Adam-family + momentum=, weird lr, SGD without momentum |
| loss_stability | log(x) / np.log(x) without clamp or epsilon |
| dataloader | DataLoader(...) without explicit shuffle= |
| grad_clipping | RNN / Transformer built but clip_grad_norm_ never called |
| eval_mode | model.train() appears but .eval() never does |
| batch_size | Implausibly small (<4) or huge (>4096) |
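
To show what a "pure-AST" rule can look like, here is a minimal sketch of two of the checks (seed and dataloader) built on Python's standard `ast` module. It is a simplified illustration, not the rules mlcompass ships; the function names and finding tuples are hypothetical, and matching calls by their last name segment is a deliberate shortcut:

```python
import ast

# Call names treated as "sets a seed" (covers torch.manual_seed, np.random.seed, set_seed).
SEED_CALLS = {"manual_seed", "seed", "set_seed"}


def _call_name(node: ast.Call) -> str:
    """Last segment of the called name, e.g. 'manual_seed' for torch.manual_seed(...)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def audit_source(source: str) -> list:
    """Return (rule, line_or_None, message) findings without executing the script."""
    findings = []
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    if not any(_call_name(c) in SEED_CALLS for c in calls):
        findings.append(("seed", None, "No random seed set anywhere"))
    for call in calls:
        if _call_name(call) == "DataLoader":
            if not any(kw.arg == "shuffle" for kw in call.keywords):
                findings.append(("dataloader", call.lineno, "DataLoader missing shuffle="))
    return findings


script = "from torch.utils.data import DataLoader\nloader = DataLoader(ds, batch_size=1)\n"
for finding in audit_source(script):
    print(finding)
```

Because the script is only parsed and never imported or run, a check like this works even when torch is not installed.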

Example: watch

mlcompass watch train.log
👁  Watch report
   Log:        train.log
   Snapshots:  9
   Last epoch: 7
   Findings:   1 warning

Recent metrics (last 8)
┌───────┬────────────┬──────────┬─────────┐
│ Epoch │ train_loss │ val_loss │ val_acc │
├───────┼────────────┼──────────┼─────────┤
│   0   │       0.65 │     0.68 │   0.612 │
│   …   │        …   │      …   │    …    │
│   7   │       0.08 │     0.59 │   0.773 │
└───────┴────────────┴──────────┴─────────┘

⚠ warning  overfitting  L7  train_loss dropped -0.17 but val_loss
                            rose +0.11; current gap is 0.51

Four detectors:

| Rule | Triggers when |
|---|---|
| nan | Any loss-like metric becomes NaN or ±Inf |
| divergence | Train loss jumps ≥10× between consecutive snapshots |
| plateau | Primary loss flat across the last 5 snapshots |
| overfitting | Train falling, val rising, with a meaningful gap |

Add --follow to tail the log file and surface new findings live.
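
The four detectors can be approximated in a few lines. The sketch below is one plausible reading of the rules, not mlcompass's implementation: only the 10× jump and the 5-snapshot window come from the table, while the flatness tolerance, the minimum gap, the use of train loss as the "primary" loss, and the `detect` function itself are assumptions:

```python
import math


def detect(train_loss, val_loss, window=5, flat_tol=1e-3, jump=10.0, min_gap=0.1):
    """Return detector names that fire on parallel per-snapshot loss lists."""
    train_loss, val_loss = list(train_loss), list(val_loss)
    findings = []

    # nan: any loss value is NaN or +/-Inf.
    if any(not math.isfinite(x) for x in train_loss + val_loss):
        findings.append("nan")

    # divergence: train loss grows by at least `jump`x between consecutive snapshots.
    for prev, cur in zip(train_loss, train_loss[1:]):
        if math.isfinite(prev) and math.isfinite(cur) and prev > 0 and cur >= jump * prev:
            findings.append("divergence")
            break

    # plateau: primary (here: train) loss is flat across the last `window` snapshots.
    tail = train_loss[-window:]
    if len(tail) == window and all(math.isfinite(x) for x in tail):
        if max(tail) - min(tail) <= flat_tol:
            findings.append("plateau")

    # overfitting: over the recent window train falls, val rises, and the gap is large.
    w = min(window, len(train_loss), len(val_loss))
    if w >= 2:
        train_delta = train_loss[-1] - train_loss[-w]
        val_delta = val_loss[-1] - val_loss[-w]
        gap = val_loss[-1] - train_loss[-1]
        if train_delta < 0 and val_delta > 0 and gap >= min_gap:
            findings.append("overfitting")
    return findings


# Illustrative curves: train keeps falling while val turns upward.
train = [0.65, 0.50, 0.40, 0.30, 0.25, 0.18, 0.12, 0.08]
val = [0.68, 0.55, 0.50, 0.48, 0.48, 0.50, 0.55, 0.59]
print(detect(train, val))
```

With --follow, the same kind of checks would be re-run as new snapshots are appended to the log.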

Example: compare

mlcompass compare run-3 run-7
🆚 Run comparison
   Run A  run-3  (baseline)             · 20 epochs
   Run B  run-7  (lower-lr-more-dropout) · 20 epochs

Final-epoch metrics
   Metric      Run A    Run B    Δ (B − A)   Winner
   train_loss  0.18     0.24     +0.06       A
   val_acc     0.79     0.87     +0.08       B
   val_loss    0.42     0.28     -0.14       B

Config differences
   dropout     0.1      0.3
   lr          0.001    0.0003

⚖️ Mixed result: A wins 1, B wins 2, 0 tie(s).
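
The winner column above follows from the sign of the delta plus the direction of each metric. A small sketch of that logic, assuming that any metric with "loss" in its name is lower-is-better and everything else is higher-is-better; that naming convention and the function name are assumptions, not mlcompass behavior:

```python
def compare_final_metrics(a, b, lower_is_better=("loss",)):
    """Return (metric, a, b, delta, winner) rows for metrics present in both runs."""
    rows = []
    for name in sorted(set(a) & set(b)):
        delta = b[name] - a[name]
        minimize = any(tag in name for tag in lower_is_better)
        if delta == 0:
            winner = "tie"
        elif (delta < 0) == minimize:
            winner = "B"  # B moved in the good direction for this metric
        else:
            winner = "A"
        rows.append((name, a[name], b[name], round(delta, 4), winner))
    return rows


run_a = {"train_loss": 0.18, "val_acc": 0.79, "val_loss": 0.42}
run_b = {"train_loss": 0.24, "val_acc": 0.87, "val_loss": 0.28}
rows = compare_final_metrics(run_a, run_b)
wins = {w: sum(1 for r in rows if r[4] == w) for w in ("A", "B", "tie")}
print(rows)
print(wins)  # {'A': 1, 'B': 2, 'tie': 0}
```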

Why mlcompass

The ML ecosystem already has great tools, but each owns one slice of the pipeline, and none of them advise:

| | pandas-profiling | W&B / TensorBoard | Cursor / Devin | mlcompass |
|---|---|---|---|---|
| Analyzes raw data | ✅ | ❌ | ❌ | ✅ |
| Recommends models + features | ❌ | ❌ | partial | ✅ |
| Audits training scripts | ❌ | ❌ | reactive | ✅ |
| Watches training in real time | ❌ | dashboard | ❌ | ✅ |
| Diagnoses problems proactively | ❌ | ❌ | reactive | ✅ |
| Persistent project memory | ❌ | per-run | ❌ | ✅ |
| Permission-gated actions | ❌ | ❌ | partial | first-class |

mlcompass is the advisor that sits next to all of these tools, not a replacement for any.

How it works

Built on agentlite, a small Claude agent library, mlcompass uses one deterministic analyzer per command (pure pandas / pure AST / pure log parser) plus an optional LLM agent layer that runs on top of the analyzer's structured output.

        cli.py
          │
   ┌──────┼──────┬─────────┬──────────┐
   ▼      ▼      ▼         ▼          ▼
 init  advise  audit     watch     compare
                │         │           │
                ▼         ▼           ▼
            (--llm)    (--llm)     (--llm)
            priori-   diagnos-   hypothes-
            tizer     tician     izer

Every action that would modify your code or config, or start a training process, asks permission first; agentlite's permission system is first-class, not an afterthought.

See ARCHITECTURE.md for the full design.

Project context

Each mlcompass project keeps a small folder, similar in spirit to .git/:

.mlcompass/
├── project.yaml        # metadata
├── context.json        # decisions, recommendations, active state
├── datasets/           # registered datasets
├── runs/               # training run history (consumed by compare)
└── advice.log          # JSONL of every command run

This is what makes mlcompass more than a chat tool: by the time you run deploy, every earlier decision is still in memory.
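
As an illustration of how a JSONL command log like advice.log can carry context from one command to the next, here is a minimal sketch. Only the file location and the "one JSON object per line" idea come from the layout above; the record fields (`ts`, `command`, `summary`) and both helper names are hypothetical, and the real schema may differ:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_command(project_dir, command, summary):
    """Append one JSON record per command run (field names are assumptions)."""
    log_path = Path(project_dir) / ".mlcompass" / "advice.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "summary": summary,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def recent_commands(project_dir, n=10):
    """Read back the last `n` records, oldest first."""
    log_path = Path(project_dir) / ".mlcompass" / "advice.log"
    if not log_path.exists():
        return []
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[-n:]]
```

An append-only log of this shape is cheap to write and easy for a later command to scan for earlier decisions.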

Roadmap

| Phase | Commands | Status |
|---|---|---|
| Phase 1 (v0.1) | init, advise | ✅ Shipped |
| Phase 2 (v0.2) | audit, watch, compare + --llm | ✅ Shipped |
| Phase 2.2 (v0.3) | TensorBoard / W&B sources, --apply | ✅ Shipped |
| Phase 3 (v0.3) | evaluate + leakage-smell warning | ✅ Shipped |
| Phase 4 (v0.3) | deploy | ✅ Shipped |
| Phase 5 (v0.3) | status | ✅ Shipped |
| Phase 6 (v0.4) | MCP server (mlcompass-mcp) | ✅ Shipped |
| Phase 7 (v0.5) | agent: self-driving (api + claude-code backends) | ✅ Shipped |
| Phase 8 (v0.6) | monitor + optimize + agent memory | ✅ Shipped |

See CHANGELOG.md for the detailed log and ARCHITECTURE.md for the design.

Non-goals

To stay focused, mlcompass will not try to be:

  • AutoML (use AutoGluon, AutoSklearn)
  • Experiment tracker (use MLflow, W&B)
  • Code assistant (use Cursor, Copilot, aider)
  • Monitoring dashboard (use Grafana, Streamlit)

mlcompass advises; you decide.

Contributing

Alpha-stage: issues and discussions are welcome. See CONTRIBUTING.md for the dev setup.

License

MIT © 2026 Hakan Sabunis
