MLflow integration for Inspect AI: experiment tracking, execution tracing, and Scout analysis
inspect-mlflow
MLflow integration for Inspect AI. Provides experiment tracking, execution tracing, and artifact logging for Inspect AI evaluations.
Install
pip install inspect-mlflow
Quick Start
Hooks auto-register via entry points when the package is installed. No code changes needed.
```sh
# Start an MLflow server
mlflow server --port 5000

# Point the hooks at it
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"

# Run evals as usual; both hooks activate automatically
inspect eval my_task.py --model openai/gpt-4o
```
Then open http://localhost:5000 to see runs and traces.
What it does
This package provides two hooks that run automatically during Inspect AI evaluations. Both hooks use the MlflowClient API for full isolation from user MLflow state (no global mlflow.start_run calls). Thread-safe for concurrent sample processing.
Tracking Hook
Activated when MLFLOW_TRACKING_URI is set. Creates hierarchical MLflow runs with full evaluation telemetry.
What gets logged:
- Parent run per eval invocation with nested child runs per task
- Task configuration as parameters (model, dataset, solver, temperature, top_p, max_tokens)
- Per-sample scores as step metrics (accuracy, timing per sample)
- Aggregate metrics (total_samples, completed_samples, match/accuracy, match/stderr)
- Model token usage (input/output/total tokens per model)
- Real-time event counting (total_model_calls, total_tool_calls)
- Eval artifacts: per-sample results JSON + full eval log JSON
- Additional rich table artifacts for analysis (`inspect/*.json`)
- Trace assessments: eval scores logged as MLflow assessments via `mlflow.log_feedback()`, visible in the Traces UI assessment column
Screenshot: task run showing 17 metrics and parameters from a tool-using eval.
Screenshot: traces table with assessment column showing eval scores (match: AVG 1.0).
Tracing Hook
Activated when MLFLOW_INSPECT_TRACING=true is also set. Maps eval execution to MLflow trace spans, giving you a visual debugging view of every model call, tool invocation, and scoring step.
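The activation rules for the two hooks can be sketched as follows. The helper names here are illustrative, not the package's actual API; the value comparison for `MLFLOW_INSPECT_TRACING` is an assumption based on the `"true"` value shown in the Quick Start.

```python
def tracking_enabled(env: dict[str, str]) -> bool:
    """The tracking hook is active whenever a tracking URI is set."""
    return bool(env.get("MLFLOW_TRACKING_URI"))


def tracing_enabled(env: dict[str, str]) -> bool:
    """The tracing hook additionally requires MLFLOW_INSPECT_TRACING=true."""
    return (
        tracking_enabled(env)
        and env.get("MLFLOW_INSPECT_TRACING", "").lower() == "true"
    )


env = {"MLFLOW_TRACKING_URI": "http://localhost:5000"}
print(tracing_enabled(env))  # False: tracking only
env["MLFLOW_INSPECT_TRACING"] = "true"
print(tracing_enabled(env))  # True: both hooks active
```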
Span hierarchy:
```
eval_run:98h4b4KN (CHAIN)
└── task:task (CHAIN)
    ├── sample:keAdeL1U (CHAIN)
    │   ├── solvers (from SpanBeginEvent)
    │   │   ├── use_tools (solver span)
    │   │   │   ├── model:openai/gpt-4o-mini (LLM) - 5,167 tokens
    │   │   │   ├── tool:calculator (TOOL) - args: {"expression": "47 * 89"}, result: "4183"
    │   │   │   └── model:openai/gpt-4o-mini (LLM) - 5,263 tokens
    │   │   └── generate (solver span)
    │   │       └── model:openai/gpt-4o-mini (LLM) - 182 tokens
    │   └── scorers (from SpanBeginEvent)
    │       └── match (scorer span)
    │           └── score (EVALUATOR) - value: C
    ├── sample:HWl2wp2B (CHAIN)
    └── ...
```
Each span type captures different data:
| Span Type | Data Captured |
|---|---|
| CHAIN | eval run, task, and sample lifecycle with scores and timing |
| LLM | model name, input/output token counts, temperature, cache status, response text |
| TOOL | function name, arguments, result, working time, errors |
| EVALUATOR | score value, explanation, target |
Screenshot: traces list showing 3 eval runs (simple math + tool-using calculator eval).
Screenshot: full span tree showing solver/scorer hierarchy with tool calls.
Screenshot: LLM span detail with model name, token counts, and response.
Autolog
Autolog enables MLflow provider autologging integrations when a run starts. Supported providers: `openai`, `anthropic`, `langchain`, `litellm`, `mistral`, `groq`, `cohere`, `gemini`, `bedrock`. Each provider is enabled only when both its MLflow flavor module and the provider SDK are installed.
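That gating check can be sketched with `importlib.util.find_spec`. The function names and the `"mlflow.openai"` module path below are illustrative assumptions, not the package's internal code:

```python
from importlib.util import find_spec


def _importable(name: str) -> bool:
    """True if a module can be located without importing it."""
    try:
        return find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        # find_spec raises if a dotted name's parent package is missing
        return False


def autolog_available(flavor_module: str, sdk_module: str) -> bool:
    """A provider is autologged only when both its MLflow flavor
    module and the provider SDK are installed."""
    return _importable(flavor_module) and _importable(sdk_module)


# e.g. openai autolog would need both the flavor and the SDK present
print(autolog_available("mlflow.openai", "openai"))
```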
Artifact Tables
When artifact logging is enabled (INSPECT_MLFLOW_LOG_ARTIFACTS=true or
MLFLOW_INSPECT_LOG_ARTIFACTS=true), the tracking hook logs the following artifacts:
- `inspect/tasks.json`
- `inspect/samples.json`
- `inspect/messages.json`
- `inspect/sample_scores.json`
- `inspect/events.json`
- `inspect/model_usage.json`
- `sample_results/*.json`
- `eval_logs/*.json`
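The precedence between the two artifact flags (the `INSPECT_MLFLOW_` prefix wins over the legacy `MLFLOW_INSPECT_` one) can be sketched as below. This is a reading of the documented behavior, not the package's internal code, and the set of accepted truthy strings is an assumption:

```python
_TRUTHY = {"1", "true", "yes"}  # assumed accepted values


def artifact_logging_enabled(env: dict[str, str]) -> bool:
    """New-prefix variable takes priority; both default to true."""
    for var in ("INSPECT_MLFLOW_LOG_ARTIFACTS", "MLFLOW_INSPECT_LOG_ARTIFACTS"):
        if var in env:
            return env[var].strip().lower() in _TRUTHY
    return True  # default when neither variable is set
```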
Configuration
Configuration is loaded from environment variables. When pydantic-settings is installed (pip install inspect-mlflow[config]), settings are typed and validated with the INSPECT_MLFLOW_ prefix. Without it, standard os.getenv() is used.
| Env var | Required | Default | Description |
|---|---|---|---|
| `MLFLOW_TRACKING_URI` | Yes | - | MLflow server URL |
| `MLFLOW_EXPERIMENT_NAME` | No | `inspect_ai` | Experiment name |
| `MLFLOW_INSPECT_TRACING` | No | `false` | Enable execution tracing |
| `MLFLOW_INSPECT_LOG_ARTIFACTS` | No | `true` | Log eval artifacts |
| `INSPECT_MLFLOW_LOG_ARTIFACTS` | No | `true` | Same as above (new prefix, takes priority) |
| `INSPECT_MLFLOW_AUTOLOG_ENABLED` | No | `true` | Enable MLflow provider autolog integrations |
| `INSPECT_MLFLOW_AUTOLOG_MODELS` | No | `openai,anthropic,langchain,litellm` | CSV or JSON array of providers to autolog |
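Since `INSPECT_MLFLOW_AUTOLOG_MODELS` accepts either a CSV string or a JSON array, a parser for it might look like the sketch below (illustrative only; the package's own parsing may differ):

```python
import json


def parse_autolog_models(raw: str) -> list[str]:
    """Accept either a JSON array ('["openai", "anthropic"]')
    or a plain CSV string ('openai,anthropic')."""
    raw = raw.strip()
    if raw.startswith("["):
        return [str(m).strip() for m in json.loads(raw)]
    return [part.strip() for part in raw.split(",") if part.strip()]


print(parse_autolog_models("openai, anthropic"))    # ['openai', 'anthropic']
print(parse_autolog_models('["litellm", "groq"]'))  # ['litellm', 'groq']
```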
Examples
Basic eval (tracking + tracing)
```python
from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# No special imports needed. Hooks auto-register on install.

task = Task(
    dataset=[
        Sample(input="What is 2 + 2?", target="4"),
        Sample(input="What is 3 * 5?", target="15"),
        Sample(input="What is 10 - 7?", target="3"),
    ],
    solver=generate(),
    scorer=match(),
)

logs = eval(task, model="openai/gpt-4o-mini")
# MLflow now has: runs with metrics + traces with span tree
```
Eval with tool calls
```python
from inspect_ai import Task, eval as inspect_eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool


@tool
def calculator():
    """Perform arithmetic calculations."""

    async def run(expression: str) -> str:
        """Evaluate a math expression.

        Args:
            expression: A math expression to evaluate, e.g. "47 * 89"
        """
        # Strip builtins so the expression can only do arithmetic
        allowed = {"__builtins__": {}}
        return str(eval(expression, allowed))

    return run


task = Task(
    dataset=[
        Sample(
            input="Use the calculator to compute 47 * 89.",
            target="4183",
        ),
        Sample(
            input="Use the calculator to compute 1024 / 16.",
            target="64",
        ),
    ],
    solver=[use_tools([calculator()]), generate()],
    scorer=match(),
)

# Imported as inspect_eval above so the tool's call to the builtin
# eval() is not shadowed by inspect_ai.eval
logs = inspect_eval(task, model="openai/gpt-4o-mini")
# Traces now include TOOL spans for each calculator() call
# with function name, arguments, and result
```
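The calculator tool above leans on `eval()` with stripped builtins. A stricter alternative (not what the example uses) confines the expression to plain arithmetic by walking the `ast` tree:

```python
import ast
import operator

# Only plain arithmetic operators are allowed
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_calc(expression: str) -> str:
    """Evaluate an arithmetic expression without eval()."""

    def visit(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(visit(ast.parse(expression, mode="eval")))


print(safe_calc("47 * 89"))  # 4183
```

Dropping this in as the body of `run()` removes the risk of the model passing anything other than arithmetic to the tool.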
Development
```sh
git clone https://github.com/debu-sinha/inspect-mlflow.git
cd inspect-mlflow
uv sync --group dev
uv run pre-commit install
uv run pytest tests/ -v
```
See CONTRIBUTING.md for integration testing and PR guidelines.
Related
- Documentation - Full API reference and usage guide
- Inspect AI - AI evaluation framework by UK AI Security Institute
- MLflow - ML experiment tracking and model management
- Inspect AI hooks docs - How hooks work
- Issue #3547 - Original proposal
- Vector Institute inspect-mlflow - Related extension whose features are being consolidated here
Contributors
- Debu Sinha - Creator and maintainer
- Vector Institute / National Research Council of Canada (NRC) - Autolog provider support, contributed on behalf of the Canadian AI Safety Institute (CAISI). Consolidated from VectorInstitute/inspect-mlflow.
File details

Details for the file inspect_mlflow-0.6.0.tar.gz.

- Size: 3.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7e0dd5a6076cc8d1566d162f10a5dcd6df88b2e01142e3d4a6d6562835f8c44b` |
| MD5 | `45de0675306c7081847cf59c0e86a057` |
| BLAKE2b-256 | `de1ab22f2a0cba6dae6867225a67ecaa1236393c6ead5dafa07a14a7ca855398` |
Provenance

The following attestation bundle was made for inspect_mlflow-0.6.0.tar.gz:

- Publisher: release.yml on debu-sinha/inspect-mlflow
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inspect_mlflow-0.6.0.tar.gz
- Subject digest: `7e0dd5a6076cc8d1566d162f10a5dcd6df88b2e01142e3d4a6d6562835f8c44b`
- Sigstore transparency entry: 1185822459
- Permalink: debu-sinha/inspect-mlflow@8d88f37f13e3cafbb4cb9798b74936ad2d2a56cb
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/debu-sinha
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8d88f37f13e3cafbb4cb9798b74936ad2d2a56cb
- Trigger Event: push
File details

Details for the file inspect_mlflow-0.6.0-py3-none-any.whl.

- Size: 24.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bf0275dfb3bdb9a038f1234f342a989007ac1edef455fac72c10410d036b6200` |
| MD5 | `2bf4ebbe8e55a6c414f0de9e02742e18` |
| BLAKE2b-256 | `328c3a2f14a8ba21827348db2d03cea58da3369bcfa748c3d1a69ec13aaba065` |
Provenance

The following attestation bundle was made for inspect_mlflow-0.6.0-py3-none-any.whl:

- Publisher: release.yml on debu-sinha/inspect-mlflow
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inspect_mlflow-0.6.0-py3-none-any.whl
- Subject digest: `bf0275dfb3bdb9a038f1234f342a989007ac1edef455fac72c10410d036b6200`
- Sigstore transparency entry: 1185822460
- Permalink: debu-sinha/inspect-mlflow@8d88f37f13e3cafbb4cb9798b74936ad2d2a56cb
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/debu-sinha
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8d88f37f13e3cafbb4cb9798b74936ad2d2a56cb
- Trigger Event: push