Portia Labs Eval framework for evaluating agentic workflows.
SteelThread: Agent Evaluation Framework
SteelThread is a flexible evaluation framework built around Portia, designed to support robust evals and stream-based testing of agentic workflows. It enables configurable datasets, custom metric definitions (both deterministic and LLM-based judging), and stubbed tool behaviors for reproducible and interpretable scoring. Its strongest suit is that you can add successful agent runs from the dashboard directly into your datasets, rather than having to build that ground truth from scratch. This keeps your eval sets up to date and easy to maintain at all times.
We offer two distinct types of monitoring through SteelThread:
- Streams are dynamic datasets sampled automatically from your latest plans and plan runs, allowing you to measure performance in production.
- Evals are static datasets designed to be run multiple times to allow you to analyze how changes to your agents affect performance.
For the full documentation, please visit our docs.
SteelThread relies on access to agent activity in Portia cloud (queries, plans, plan runs). You will need a PORTIA_API_KEY to get started. Get one for free from your Portia dashboard's "Manage API keys" tab.
Install using your framework of choice:

**pip**
```shell
pip install steel-thread
```

**poetry**
```shell
poetry add steel-thread
```

**uv**
```shell
uv add steel-thread
```
Create a dataset
If you're new to Portia you may not have agent runs in the cloud just yet, so let's start by creating those. Run the query "Read the user feedback notes in local file {path}, and call out recurring themes in their feedback. Use lots of ⚠️ emojis when highlighting areas of concern." where `path` points to a local file containing a couple of lines of fictitious user feedback. Here's the script to save you some time:
```python
from portia import Portia

path = "./uxr/calorify.txt"  # TODO: change to your desired path
query = f"Read the user feedback notes in local file {path}, \
and call out recurring themes in their feedback. \
Use lots of ⚠️ emojis when highlighting areas of concern."

Portia().run(query=query)
```
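Before running that script, the feedback file needs to exist at the path you chose. A minimal stdlib-only snippet to create it (the path and the feedback lines are just examples):

```python
from pathlib import Path

path = Path("./uxr/calorify.txt")  # same path used in the query above
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(
    "Onboarding took me three tries to get through.\n"
    "I love the meal insights, but the app crashes when I log breakfast.\n"
    "Onboarding was confusing and the app crashed twice today.\n"
)
```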
Basic Usage with Streams
Below is example code to process a stream. Before running it, make sure you set up your stream from the Portia dashboard's Observability tab so you can pass its name to the `process_stream` method below. This method uses the built-in set of stream evaluators to give you data out of the box.
```python
from dotenv import load_dotenv
from portia import Config

from steelthread.steelthread import SteelThread, StreamConfig

load_dotenv(override=True)

config = Config.from_default()

# Set up a SteelThread instance and process the stream
st = SteelThread()
st.process_stream(
    StreamConfig(
        # The stream name is the name of the stream created in the dashboard.
        stream_name="your-stream-name-here",
        config=config,
    )
)
```
Features
Custom Metrics
Define your own evaluators by subclassing Evaluator:
```python
from steelthread.evals import Evaluator, EvalMetric


class EmojiEvaluator(Evaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("🌞")
        return EvalMetric.from_test_case(
            test_case=test_case,
            name="emoji_score",
            score=min(count / 2, 1.0),
            description="Emoji usage",
        )
```
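The `min(count / 2, 1.0)` rule above awards 0.5 per emoji and saturates at 1.0 once two or more appear. Here is that scoring logic in isolation (the `emoji_score` helper is ours, for illustration only, not part of SteelThread):

```python
def emoji_score(text: str, emoji: str = "🌞") -> float:
    """Score 0.5 per matching emoji, capped at 1.0."""
    return min(text.count(emoji) / 2, 1.0)


print(emoji_score("no emojis here"))       # 0.0
print(emoji_score("one 🌞"))               # 0.5
print(emoji_score("three 🌞🌞🌞, capped"))  # 1.0
```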
Tool Stubbing
Stub tool responses deterministically for fast and reproducible testing:
```python
from portia import Portia, Config, DefaultToolRegistry
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext

config = Config.from_default()


# Define stub behavior
def weather_stub_response(ctx: ToolStubContext) -> str:
    """Stub for the weather tool to return deterministic weather."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


# Run evals with stubs
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": weather_stub_response,
        },
    ),
)
```
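You can sanity-check a stub's determinism without running Portia at all by calling it with a stand-in context object. The `StubContext` class below is a hypothetical stand-in for `ToolStubContext` (it only mimics the `kwargs` attribute the stub reads), used purely for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class StubContext:
    """Minimal stand-in for ToolStubContext: just carries kwargs."""
    kwargs: dict = field(default_factory=dict)


def weather_stub_response(ctx: StubContext) -> str:
    """Same stub logic as above: a deterministic response per city."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


print(weather_stub_response(StubContext(kwargs={"city": "Sydney"})))  # 33.28
print(weather_stub_response(StubContext(kwargs={"city": "Paris"})))   # Unknown city: paris
```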
Metric Reporting
SteelThread is designed around pluggable metrics backends. By default, metrics are logged and sent to Portia Cloud for visualization, but you can add additional backends via the config options.
🧪 End-to-end example with Evals
Let's see how everything fits together. Create an Eval dataset in the dashboard from the plan run we made in the "Create a dataset" section: navigate to the "Evaluations" tab of the dashboard, create a new eval set from existing data, and select the relevant plan run. Record the name you gave your Eval dataset, as you will need to pass it to the evaluators in the code below, which you are now ready to run. This code:
- Uses a custom evaluator to count ⚠️ emojis in the output.
- Stubs the `file_reader_tool` with static text.
- Runs the evals for the dataset you created to compute the emoji count metric over it.

Feel free to mess around with the output from the tool stub and re-run these evals a few times to see the progression in scoring.
```python
from portia import Portia, Config, DefaultToolRegistry
from steelthread.steelthread import SteelThread, EvalConfig
from steelthread.evals import Evaluator, EvalMetric
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext


# Custom evaluator
class EmojiEvaluator(Evaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("⚠️")
        return EvalMetric.from_test_case(
            test_case=test_case,
            name="emoji_score",
            score=min(count / 2, 1.0),
            description="Emoji usage",
            explanation=f"Found {count} ⚠️ emojis in the output.",
            actual_value=str(count),
            expectation="2",
        )


# Define stub behavior
def file_reader_stub_response(ctx: ToolStubContext) -> str:
    """Stub response for the file reader tool to return static file content."""
    filename = ctx.kwargs.get("filename", "").lower()
    return f"Feedback from file {filename} suggests \
⚠️ 'One does not simply Calorify' \
and ⚠️ 'Calorify is not a diet' \
and ⚠️ 'Calorify is not a weight loss program' \
and ⚠️ 'Calorify is not a fitness program' \
and ⚠️ 'Calorify is not a health program' \
and ⚠️ 'Calorify is not a nutrition program' \
and ⚠️ 'Calorify is not a meal delivery service' \
and ⚠️ 'Calorify is not a meal kit service' "


config = Config.from_default()

# Run evals with stubs
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "file_reader_tool": file_reader_stub_response,
        },
    ),
)

SteelThread().run_evals(
    portia,
    EvalConfig(
        eval_dataset_name="your-dataset-name-here",  # TODO: replace with your dataset name
        config=config,
        iterations=5,
        evaluators=[EmojiEvaluator(config)],
    ),
)
```
🧪 Testing
Write tests for your metrics, plans, or evaluator logic using pytest:
```shell
uv run pytest tests/
```
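For instance, the ⚠️-counting rule from the custom evaluator above can be pulled into a plain helper and tested in isolation. The `emoji_score` helper and the test file layout below are illustrative, not part of SteelThread:

```python
# tests/test_emoji_score.py
def emoji_score(text: str) -> float:
    """Scoring rule from EmojiEvaluator: 0.5 per ⚠️, capped at 1.0."""
    return min(text.count("⚠️") / 2, 1.0)


def test_no_emojis_scores_zero():
    assert emoji_score("all clear") == 0.0


def test_two_emojis_hit_the_cap():
    assert emoji_score("⚠️ onboarding ⚠️ crashes") == 1.0


def test_score_never_exceeds_one():
    assert emoji_score("⚠️" * 10) == 1.0
```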
File details
Details for the file steel_thread-0.1.19.tar.gz.
File metadata
- Download URL: steel_thread-0.1.19.tar.gz
- Upload date:
- Size: 23.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1f0b2179b72cc678a54c571a319066b341036673d42b38fbd1bb9eb156f654bf |
| MD5 | 6b88f472e9722d3e3fd3f1008927aabc |
| BLAKE2b-256 | 0cc7fa2c8361e036f1f5b987b3f1a8ac48cca8cc4c6238bbf95e4ac340391f85 |
File details
Details for the file steel_thread-0.1.19-py3-none-any.whl.
File metadata
- Download URL: steel_thread-0.1.19-py3-none-any.whl
- Upload date:
- Size: 32.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71f21673f1ad25f5d8251d14e0019589690ed01acc4807d6779a12e377df740b |
| MD5 | c16b811f42f0420ba0756673ec28d1eb |
| BLAKE2b-256 | 09565a22aabe95d54f8a0ef279ea8e6a6b14daf440da6d515f55edd23c49e7ed |