Portia Labs Eval framework for evaluating agentic workflows.
SteelThread: Agent Evaluation Framework
SteelThread is a flexible evaluation framework built around Portia, designed to support robust evals and stream-based testing of agentic workflows. It enables configurable datasets, custom metric definitions (both deterministic and LLM-based judging), and stubbed tool behaviors for reproducible and interpretable scoring. Its strongest suit is that you can add successful agent runs from the dashboard directly into your datasets, rather than having to build that ground truth from scratch. This keeps your eval sets up to date and easy to maintain at all times.
We offer two distinct types of monitoring through SteelThread:
- Streams are dynamic datasets sampled automatically from your latest plans and plan runs, allowing you to measure performance in production.
- Evals are static datasets designed to be run multiple times to allow you to analyze how changes to your agents affect performance.
For the full documentation, please visit our docs.
SteelThread relies on access to agent activity in Portia cloud (queries, plans, plan runs). You will need a PORTIA_API_KEY to get started. Get one for free from your Portia dashboard's "Manage API keys" tab.
Install using your framework of choice:

**pip**
```shell
pip install steel-thread
```

**poetry**
```shell
poetry add steel-thread
```

**uv**
```shell
uv add steel-thread
```
Create a dataset
If you're new to Portia you may not have agent runs in the cloud just yet, so let's start by creating those. Run the query "Read the user feedback notes in local file {path}, and call out recurring themes in their feedback. Use lots of ⚠️ emojis when highlighting areas of concern." where `path` points to a local file containing a couple of lines of fictitious user feedback. Here's the script to save you some time:
```python
from portia import Portia

path = "./uxr/calorify.txt"  # TODO: change to your desired path
query = f"Read the user feedback notes in local file {path}, \
and call out recurring themes in their feedback. \
Use lots of ⚠️ emojis when highlighting areas of concern."

Portia().run(query=query)
```
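Before running that script, the feedback file needs to exist at the path you chose. A minimal stdlib-only snippet to create it (the path and the feedback lines are just examples):

```python
from pathlib import Path

path = Path("./uxr/calorify.txt")  # same path used in the query above
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(
    "Onboarding took me three tries to get through.\n"
    "I love the meal insights, but the app crashes when I log breakfast.\n"
    "Onboarding was confusing and the app crashed twice today.\n"
)
```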
Basic Usage with Streams
Below is example code to process a stream. Before running it, make sure you set up your stream from the Portia dashboard's Observability tab so you can pass its name to the `process_stream` method below. This method uses the built-in set of stream evaluators to give you data out of the box.
```python
from dotenv import load_dotenv
from portia import Config

from steelthread.steelthread import SteelThread, StreamConfig

load_dotenv(override=True)

config = Config.from_default()

# Set up a SteelThread instance and process the stream
st = SteelThread()
st.process_stream(
    StreamConfig(
        # The stream name is the name of the stream created in the dashboard.
        stream_name="your-stream-name-here",
        config=config,
    )
)
```
Features
Custom Metrics
Define your own evaluators by subclassing Evaluator:
```python
from steelthread.evals import Evaluator, EvalMetric


class EmojiEvaluator(Evaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("🌞")
        return EvalMetric.from_test_case(
            test_case=test_case,
            name="emoji_score",
            score=min(count / 2, 1.0),
            description="Emoji usage",
        )
```
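The `min(count / 2, 1.0)` rule above awards 0.5 per emoji and saturates at 1.0 once two or more appear. Here is that scoring logic in isolation (the `emoji_score` helper is ours, for illustration only, not part of SteelThread):

```python
def emoji_score(text: str, emoji: str = "🌞") -> float:
    """Score 0.5 per matching emoji, capped at 1.0."""
    return min(text.count(emoji) / 2, 1.0)


print(emoji_score("no emojis here"))       # 0.0
print(emoji_score("one 🌞"))               # 0.5
print(emoji_score("three 🌞🌞🌞, capped"))  # 1.0
```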
Tool Stubbing
Stub tool responses deterministically for fast and reproducible testing:
```python
from portia import Portia, Config, DefaultToolRegistry
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext

config = Config.from_default()


# Define stub behavior
def weather_stub_response(ctx: ToolStubContext) -> str:
    """Stub for the weather tool to return deterministic weather."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


# Run evals with stubs
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": weather_stub_response,
        },
    ),
)
```
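You can sanity-check a stub's determinism without running Portia at all by calling it with a stand-in context object. The `StubContext` class below is a hypothetical stand-in for `ToolStubContext` (it only mimics the `kwargs` attribute the stub reads), used purely for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class StubContext:
    """Minimal stand-in for ToolStubContext: just carries kwargs."""
    kwargs: dict = field(default_factory=dict)


def weather_stub_response(ctx: StubContext) -> str:
    """Same stub logic as above: a deterministic response per city."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


print(weather_stub_response(StubContext(kwargs={"city": "Sydney"})))  # 33.28
print(weather_stub_response(StubContext(kwargs={"city": "Paris"})))   # Unknown city: paris
```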
Metric Reporting
SteelThread is designed around pluggable metrics backends. By default, metrics are logged and sent to Portia Cloud for visualization, but you can add additional backends via the config options.
🧪 End-to-end example with Evals
Let's see how everything fits together. Create an Eval dataset in the dashboard from the plan run we made in the "Create a dataset" section: navigate to the "Evaluations" tab of the dashboard, create a new eval set from existing data, and select the relevant plan run. Record the name you gave your Eval dataset, as you will need to pass it to the evaluators in the code below, which you are now ready to run. This code:
- Uses a custom evaluator to count ⚠️ emojis in the output.
- Stubs the `file_reader_tool` with static text.
- Runs the evals for the dataset you created to compute the emoji count metric over it.

Feel free to mess around with the output from the tool stub and re-run these evals a few times to see the progression in scoring.
```python
from portia import Portia, Config, DefaultToolRegistry
from steelthread.steelthread import SteelThread, EvalConfig
from steelthread.evals import Evaluator, EvalMetric
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext


# Custom evaluator
class EmojiEvaluator(Evaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("⚠️")
        return EvalMetric.from_test_case(
            test_case=test_case,
            name="emoji_score",
            score=min(count / 2, 1.0),
            description="Emoji usage",
            explanation=f"Found {count} ⚠️ emojis in the output.",
            actual_value=str(count),
            expectation="2",
        )


# Define stub behavior
def file_reader_stub_response(ctx: ToolStubContext) -> str:
    """Stub response for the file reader tool to return static file content."""
    filename = ctx.kwargs.get("filename", "").lower()
    return f"Feedback from file {filename} suggests \
⚠️ 'One does not simply Calorify' \
and ⚠️ 'Calorify is not a diet' \
and ⚠️ 'Calorify is not a weight loss program' \
and ⚠️ 'Calorify is not a fitness program' \
and ⚠️ 'Calorify is not a health program' \
and ⚠️ 'Calorify is not a nutrition program' \
and ⚠️ 'Calorify is not a meal delivery service' \
and ⚠️ 'Calorify is not a meal kit service' "


config = Config.from_default()

# Run evals with stubs
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "file_reader_tool": file_reader_stub_response,
        },
    ),
)

SteelThread().run_evals(
    portia,
    EvalConfig(
        eval_dataset_name="your-dataset-name-here",  # TODO: replace with your dataset name
        config=config,
        iterations=5,
        evaluators=[EmojiEvaluator(config)],
    ),
)
```
🧪 Testing
Write tests for your metrics, plans, or evaluator logic using pytest:
```shell
uv run pytest tests/
```
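For instance, the ⚠️-counting rule from the custom evaluator above can be pulled into a plain helper and tested in isolation. The `emoji_score` helper and the test file layout below are illustrative, not part of SteelThread:

```python
# tests/test_emoji_score.py
def emoji_score(text: str) -> float:
    """Scoring rule from EmojiEvaluator: 0.5 per ⚠️, capped at 1.0."""
    return min(text.count("⚠️") / 2, 1.0)


def test_no_emojis_scores_zero():
    assert emoji_score("all clear") == 0.0


def test_two_emojis_hit_the_cap():
    assert emoji_score("⚠️ onboarding ⚠️ crashes") == 1.0


def test_score_never_exceeds_one():
    assert emoji_score("⚠️" * 10) == 1.0
```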
File details
Details for the file steel_thread-0.1.19.tar.gz.
File metadata
- Download URL: steel_thread-0.1.19.tar.gz
- Upload date:
- Size: 23.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1f0b2179b72cc678a54c571a319066b341036673d42b38fbd1bb9eb156f654bf |
| MD5 | 6b88f472e9722d3e3fd3f1008927aabc |
| BLAKE2b-256 | 0cc7fa2c8361e036f1f5b987b3f1a8ac48cca8cc4c6238bbf95e4ac340391f85 |
File details
Details for the file steel_thread-0.1.19-py3-none-any.whl.
File metadata
- Download URL: steel_thread-0.1.19-py3-none-any.whl
- Upload date:
- Size: 32.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71f21673f1ad25f5d8251d14e0019589690ed01acc4807d6779a12e377df740b |
| MD5 | c16b811f42f0420ba0756673ec28d1eb |
| BLAKE2b-256 | 09565a22aabe95d54f8a0ef279ea8e6a6b14daf440da6d515f55edd23c49e7ed |