
inspect-mlflow


MLflow integration for Inspect AI. Provides experiment tracking, execution tracing, and artifact logging for Inspect AI evaluations.

Install

pip install inspect-mlflow

Quick Start

Hooks auto-register via entry points when the package is installed. No code changes needed.

# Start MLflow server
mlflow server --port 5000

# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"

# Run evals as usual. Both hooks activate automatically.
inspect eval my_task.py --model openai/gpt-4o

Then open http://localhost:5000 to see runs and traces.

What it does

This package provides two hooks that run automatically during Inspect AI evaluations. Both hooks use the MlflowClient API, keeping them fully isolated from user MLflow state (no global mlflow.start_run calls), and both are thread-safe for concurrent sample processing.

Tracking Hook

Activated when MLFLOW_TRACKING_URI is set. Creates hierarchical MLflow runs with full evaluation telemetry.

What gets logged:

  • Parent run per eval invocation with nested child runs per task
  • Task configuration as parameters (model, dataset, solver, temperature, top_p, max_tokens)
  • Per-sample scores as step metrics (accuracy, timing per sample)
  • Aggregate metrics (total_samples, completed_samples, match/accuracy, match/stderr)
  • Model token usage (input/output/total tokens per model)
  • Real-time event counting (total_model_calls, total_tool_calls)
  • Eval artifacts: per-sample results JSON + full eval log JSON
  • Additional rich table artifacts for analysis (inspect/*.json)
  • Trace assessments: eval scores logged as MLflow assessments via mlflow.log_feedback(), visible in the Traces UI assessment column

Screenshot: task run detail showing 17 metrics and parameters from a tool-using eval.

Screenshot: traces table with an assessment column showing eval scores (match: AVG 1.0).

Tracing Hook

Activated when MLFLOW_INSPECT_TRACING=true is set in addition to MLFLOW_TRACKING_URI. Maps eval execution to MLflow trace spans, giving you a visual debugging view of every model call, tool invocation, and scoring step.

Span hierarchy:

eval_run:98h4b4KN (CHAIN)
  task:task (CHAIN)
    sample:keAdeL1U (CHAIN)
      solvers (from SpanBeginEvent)
        use_tools (solver span)
          model:openai/gpt-4o-mini (LLM) - 5,167 tokens
          tool:calculator (TOOL) - args: {"expression": "47 * 89"}, result: "4183"
          model:openai/gpt-4o-mini (LLM) - 5,263 tokens
        generate (solver span)
          model:openai/gpt-4o-mini (LLM) - 182 tokens
      scorers (from SpanBeginEvent)
        match (scorer span)
          score (EVALUATOR) - value: C
    sample:HWl2wp2B (CHAIN)
      ...

Each span type captures different data:

| Span type | Data captured |
| --- | --- |
| CHAIN | eval run, task, and sample lifecycle with scores and timing |
| LLM | model name, input/output token counts, temperature, cache status, response text |
| TOOL | function name, arguments, result, working time, errors |
| EVALUATOR | score value, explanation, target |

Screenshot: traces list showing 3 eval runs (simple math + tool-using calculator eval).

Screenshot: full span tree showing the solver/scorer hierarchy with tool calls.

Screenshot: LLM span detail with model name, token counts, and response.

Autolog

Autolog enables MLflow provider integrations at run start. Supported providers are: openai, anthropic, langchain, litellm, mistral, groq, cohere, gemini, bedrock. Each provider is enabled only when both the MLflow flavor module and provider SDK are installed.

Artifact Tables

When artifact logging is enabled (INSPECT_MLFLOW_LOG_ARTIFACTS=true or MLFLOW_INSPECT_LOG_ARTIFACTS=true), the tracking hook logs the following artifacts:

  • inspect/tasks.json
  • inspect/samples.json
  • inspect/messages.json
  • inspect/sample_scores.json
  • inspect/events.json
  • inspect/model_usage.json
  • sample_results/*.json
  • eval_logs/*.json

Configuration

Configuration is loaded from environment variables. When pydantic-settings is installed (pip install inspect-mlflow[config]), settings are typed and validated with the INSPECT_MLFLOW_ prefix. Without it, standard os.getenv() is used.

| Env var | Required | Default | Description |
| --- | --- | --- | --- |
| MLFLOW_TRACKING_URI | Yes | - | MLflow server URL |
| MLFLOW_EXPERIMENT_NAME | No | inspect_ai | Experiment name |
| MLFLOW_INSPECT_TRACING | No | false | Enable execution tracing |
| MLFLOW_INSPECT_LOG_ARTIFACTS | No | true | Log eval artifacts |
| INSPECT_MLFLOW_LOG_ARTIFACTS | No | true | Same as above (new prefix; takes priority) |
| INSPECT_MLFLOW_AUTOLOG_ENABLED | No | true | Enable MLflow provider autolog integrations |
| INSPECT_MLFLOW_AUTOLOG_MODELS | No | openai,anthropic,langchain,litellm | CSV or JSON array of providers to autolog |
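The os.getenv() fallback path can be sketched as follows (illustrative helper names, not the package's internals): booleans accept common truthy spellings, and the models list accepts either a CSV string or a JSON array.

```python
import json
import os


def env_bool(name: str, default: bool) -> bool:
    """Parse a boolean env var, accepting common truthy spellings."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_models(name: str, default: list[str]) -> list[str]:
    """Parse a provider list env var given as CSV or as a JSON array."""
    raw = os.getenv(name)
    if raw is None:
        return default
    raw = raw.strip()
    if raw.startswith("["):  # JSON array form, e.g. '["openai", "anthropic"]'
        return [str(m) for m in json.loads(raw)]
    return [m.strip() for m in raw.split(",") if m.strip()]  # CSV form
```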

Examples

Basic eval (tracking + tracing)

from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# No special imports needed. Hooks auto-register on install.

task = Task(
    dataset=[
        Sample(input="What is 2 + 2?", target="4"),
        Sample(input="What is 3 * 5?", target="15"),
        Sample(input="What is 10 - 7?", target="3"),
    ],
    solver=generate(),
    scorer=match(),
)

logs = eval(task, model="openai/gpt-4o-mini")
# MLflow now has: runs with metrics + traces with span tree

Eval with tool calls

import builtins

from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool


@tool
def calculator():
    """Perform arithmetic calculations."""

    async def run(expression: str) -> str:
        """Evaluate a math expression.

        Args:
            expression: A math expression to evaluate, e.g. "47 * 89"
        """
        # Call builtins.eval explicitly: the top-level import shadows the
        # built-in eval with inspect_ai.eval.
        allowed = {"__builtins__": {}}
        return str(builtins.eval(expression, allowed))

    return run


task = Task(
    dataset=[
        Sample(
            input="Use the calculator to compute 47 * 89.",
            target="4183",
        ),
        Sample(
            input="Use the calculator to compute 1024 / 16.",
            target="64",
        ),
    ],
    solver=[use_tools([calculator()]), generate()],
    scorer=match(),
)

logs = eval(task, model="openai/gpt-4o-mini")
# Traces now include TOOL spans for each calculator() call
# with function name, arguments, and result

Development

git clone https://github.com/debu-sinha/inspect-mlflow.git
cd inspect-mlflow
uv sync --group dev
uv run pre-commit install
uv run pytest tests/ -v

See CONTRIBUTING.md for integration testing and PR guidelines.

Contributors

  • Debu Sinha - Creator and maintainer
  • Vector Institute / National Research Council of Canada (NRC) - Autolog provider support, contributed on behalf of the Canadian AI Safety Institute (CAISI). Consolidated from VectorInstitute/inspect-mlflow.
