Portia Labs Eval framework for evaluating agentic workflows.

🧵 SteelThread: Agent Evaluation Framework

SteelThread is a flexible evaluation framework built around Portia, designed to support robust online and offline testing of agentic workflows. It enables configurable datasets, custom metric definitions, LLM-based judging, and stubbed tool behaviors for reproducible and interpretable scoring.


🚀 Getting Started

1. Install using your package manager of choice

pip

pip install steelthread

poetry

poetry add steelthread

uv

uv add steelthread

2. Create your datasets

SteelThread is designed around deep integration with Portia. It uses data from Portia Cloud to generate test cases and evals.

When running evals through SteelThread, you can choose between two distinct types:

  • Offline evals are static datasets designed to be run multiple times to allow you to analyze how changes to your agents affect performance.
  • Online evals are dynamic datasets that automatically include your latest plans and plan runs, allowing you to measure performance in production.

Both types of evals can be configured via the cloud dashboard. Once you've created a dataset, record its name — you'll need it to run evals.


3. Basic Usage

Run a full suite of online and offline evaluations using the dataset names from step 2. This uses the built-in set of evaluators to give you data out of the box.

from portia import Config, LogLevel, Portia
from steelthread.steelthread import SteelThread, OnlineEvalConfig, OfflineEvalConfig

# Setup
config = Config.from_default(default_log_level=LogLevel.CRITICAL)
runner = SteelThread()

# Online evals
runner.run_online(
    OnlineEvalConfig(data_set_name="online_evals", config=config)
)

# Offline evals
portia = Portia(config)
runner.run_offline(
    portia,
    OfflineEvalConfig(data_set_name="offline_evals_v1", config=config, iterations=4)
)

🛠️ Features

🧪 Custom Metrics

Define your own evaluators by subclassing OfflineEvaluator:

from steelthread.offline_evaluators.evaluator import OfflineEvaluator
from steelthread.metrics.metric import Metric

class EmojiEvaluator(OfflineEvaluator):
    def eval_test_case(self, test_case, final_plan_run, additional_data):
        output = final_plan_run.outputs.final_output.get_value() or ""
        count = output.count("😊")
        score = min(count / 2, 1.0)
        return Metric(score=score, name="emoji_score", description="Checks for emoji use")

🧩 Tool Stubbing

Stub tool responses deterministically for fast and reproducible testing:

from portia import Config, DefaultToolRegistry, Portia
from steelthread.portia.tools import ToolStubRegistry

portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": lambda i, ctx, args, kwargs: "20.0"  # Always returns 20.0
        }
    )
)

📊 Metric Reporting

SteelThread is designed around pluggable metrics backends. By default, metrics are logged and sent to Portia Cloud for visualization, but you can add additional backends via the config options.
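As a rough illustration of what an additional backend could look like, here is an in-memory collector useful in unit tests or CI summaries. Note that the `Metric` shape and the `save_metrics` hook shown here are hypothetical sketches — consult the SteelThread config options for the actual extension interface:

```python
from dataclasses import dataclass, field

# NOTE: hypothetical sketch — the real SteelThread backend interface
# may use different class and method names.

@dataclass
class Metric:
    name: str
    score: float
    description: str = ""

@dataclass
class InMemoryMetricsBackend:
    """Collects metrics locally instead of (or alongside) Portia Cloud."""

    records: list = field(default_factory=list)

    def save_metrics(self, metrics):
        # Append every metric so repeated eval iterations accumulate.
        self.records.extend(metrics)

    def average(self, name):
        # Mean score across all recorded metrics with the given name.
        scores = [m.score for m in self.records if m.name == name]
        return sum(scores) / len(scores) if scores else 0.0

backend = InMemoryMetricsBackend()
backend.save_metrics([Metric("emoji_score", 0.5), Metric("emoji_score", 1.0)])
print(backend.average("emoji_score"))  # 0.75
```

A backend like this can sit next to the default cloud reporting, letting you assert on aggregate scores in tests without a network round trip.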


📁 Project Structure

steelthread/
├── metrics/                 # Metric schema & backend logging
│   └── metric.py
├── offline_evaluators/     # Offline test runners and evaluators
│   ├── eval_runner.py
│   ├── evaluator.py
│   └── test_case.py
├── online_evaluators/      # Online test runners
│   └── eval_runner.py
├── portia/                 # Tool stubbing and integration with Portia
│   └── tools.py
├── shared/                 # Shared storage and model definitions
│   └── readonly_storage.py
└── steelthread.py          # Main runner entry point

🧪 Example: End-to-End Test Script

See how everything fits together:

from steelthread.steelthread import SteelThread, OfflineEvalConfig
from steelthread.portia.tools import ToolStubRegistry
from steelthread.metrics.metric import Metric
from steelthread.offline_evaluators.default_evaluator import DefaultOfflineEvaluator
from steelthread.offline_evaluators.evaluator import OfflineEvaluator
from portia import Config, Portia, DefaultToolRegistry, ToolRunContext

# Custom tool stub
def weather_stub_response(i, ctx, args, kwargs):
    return "33.28" if kwargs.get("city") == "sydney" else "2.00"

# Custom evaluator
class EmojiEvaluator(OfflineEvaluator):
    def eval_test_case(self, test_case, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("🌞")
        return Metric(score=min(count / 2, 1.0), name="emoji_score", description="Emoji usage")

# Setup
config = Config.from_default()
runner = SteelThread()
portia = Portia(
    config,
    tools=ToolStubRegistry(DefaultToolRegistry(config), {"weather_tool": weather_stub_response})
)

runner.run_offline(
    portia,
    OfflineEvalConfig(
        data_set_name="offline_evals_v1",
        config=config,
        iterations=1,
        evaluators=[DefaultOfflineEvaluator(config), EmojiEvaluator(config)],
    ),
)

🧪 Testing

Write tests for your metrics, plans, or evaluator logic using pytest:

uv run pytest tests/
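For example, the scoring logic from the custom evaluator above can be factored into a pure function and tested without running any Portia plans (the `emoji_score` helper here is illustrative, not part of SteelThread's API):

```python
# Illustrative pytest module: tests the emoji-scoring logic used by the
# EmojiEvaluator above as a pure function, with no Portia plan runs.

def emoji_score(text: str, target: str = "🌞", cap: int = 2) -> float:
    """Score from 0.0 to 1.0 based on occurrences of target, capped at `cap`."""
    return min(text.count(target) / cap, 1.0)

def test_no_emoji_scores_zero():
    assert emoji_score("hello") == 0.0

def test_score_is_capped_at_one():
    assert emoji_score("🌞🌞🌞🌞") == 1.0
```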


Download files

  • Source distribution: steel_thread-0.1.1a0.tar.gz (11.1 kB)
  • Built distribution: steel_thread-0.1.1a0-py3-none-any.whl (12.4 kB)

File details

steel_thread-0.1.1a0.tar.gz

  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3
  • SHA256: e76f3549c440bbdfd22a59184bae233b639ea26522611e314972437a9f548a3c
  • MD5: 38e8ea0ab2d7d43ee4d312b4a7fef70c
  • BLAKE2b-256: 51cb0289b01c2afefc4f0be8c3629477219acc92a754a36f5b9c166c086312dc

steel_thread-0.1.1a0-py3-none-any.whl

  • Size: 12.4 kB
  • Tags: Python 3
  • SHA256: be654bbc3e1e9a5f5680f97d67a018c5cfcb9960bd0fb93bfc48b836bbb845df
  • MD5: 6277d7617eb6e1b0773db91a64ae7293
  • BLAKE2b-256: 7a826b9826eba21b7b867e7740d949c77477070fd3c90ec38de806c2f39040d7
