Portia Labs Eval framework for evaluating agentic workflows.

🧵 SteelThread: Agent Evaluation Framework

SteelThread is a flexible evaluation framework built around Portia, designed to support robust evals and stream-based testing of agentic workflows. It provides configurable datasets, custom metric definitions, LLM-based judging, and stubbed tool behaviors for reproducible and interpretable scoring.

🚀 Getting Started

1. Install using your package manager of choice

`pip`

pip install steel-thread

`poetry`

poetry add steel-thread

`uv`

uv add steel-thread

2. Create your datasets

SteelThread is designed around deep integration with Portia. It uses data from Portia Cloud to generate test cases and evals.

SteelThread offers two distinct types of monitoring:

Evals are static datasets designed to be run multiple times, so you can analyze how changes to your agents affect performance.
Streams are dynamic datasets that automatically include your latest plans and plan runs, so you can measure performance in production.

Both types of monitoring can be configured via the cloud dashboard. Once you've created a dataset, record its name.

3. Basic Usage

Run a full suite of evals and streams using the dataset name from step 2. This uses the built-in set of evaluators to give you data out of the box.

from portia import Config, LogLevel, Portia
from steelthread.steelthread import SteelThread, StreamConfig, EvalConfig

# Setup
config = Config.from_default(default_log_level=LogLevel.CRITICAL)
st = SteelThread()

# Process stream
st.process_stream(
    StreamConfig(stream_name="stream_v1", config=config, additional_tags={"feeling": "neutral"})
)

# Run evals
portia = Portia(config)
st.run_evals(
    portia,
    EvalConfig(
        eval_dataset_name="evals_v1",
        config=config,
        iterations=4,
    ),
)

🛠️ Features

🧪 Custom Metrics

Define your own evaluators by subclassing Evaluator:

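from portia import Plan, PlanRun
from steelthread.evals.evaluator import Evaluator
from steelthread.metrics.metric import Metric

# EvalTestCase and PlanRunMetadata are SteelThread types; import them from
# the module that defines them in your installed version.


class EmojiEvaluator(Evaluator):
    """Scores a run by how many 😊 emoji appear in its final output."""

    def eval_test_case(
        self,
        test_case: EvalTestCase,
        final_plan: Plan,
        final_plan_run: PlanRun,
        additional_data: PlanRunMetadata,
    ):
        output = final_plan_run.outputs.final_output.get_value() or ""
        count = output.count("😊")
        # Two or more emoji earn the full score of 1.0.
        score = min(count / 2, 1.0)
        return Metric(score=score, name="emoji_score", description="Checks for emoji use")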

🧩 Tool Stubbing

Stub tool responses deterministically for fast and reproducible testing:

from portia import Config, DefaultToolRegistry, Portia
from steelthread.portia.tools import ToolStubRegistry, ToolStubContext

def weather_stub_response(
    ctx: ToolStubContext,
) -> str:
    """Stub for weather tool to return deterministic weather."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"

    return f"Unknown city: {city}"


# Build a Portia instance whose weather tool is replaced by the stub.
config = Config.from_default()
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": weather_stub_response,
        },
    ),
)
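
Because a stub is an ordinary function, it can also be exercised on its own before it is wired into Portia. The sketch below is illustrative only: `fake_ctx` is a hypothetical stand-in (not part of SteelThread) that exposes just the `kwargs` attribute the stub reads, and the stub body is repeated so the snippet runs without any Portia or SteelThread imports.

```python
from types import SimpleNamespace


def weather_stub_response(ctx) -> str:
    """Same logic as the stub above; ctx only needs a `kwargs` dict here."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"
    return f"Unknown city: {city}"


def fake_ctx(**kwargs):
    """Hypothetical stand-in for ToolStubContext exposing only `kwargs`."""
    return SimpleNamespace(kwargs=kwargs)


print(weather_stub_response(fake_ctx(city="Sydney")))  # 33.28
print(weather_stub_response(fake_ctx(city="Paris")))   # Unknown city: paris
```

Lower-casing the city before comparing means the stub returns the same value regardless of how the agent capitalizes the argument, which keeps runs reproducible.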

📊 Metric Reporting

SteelThread is designed around pluggable metrics backends. By default, metrics are logged and sent to Portia Cloud for visualization, but you can add further backends via the config options.

🧪 Example: End-to-End Test Script

See how everything fits together:

from portia import Config, DefaultToolRegistry, Portia
from steelthread.evals.evaluator import Evaluator
from steelthread.metrics.metric import Metric
from steelthread.portia.tools import ToolStubContext, ToolStubRegistry
from steelthread.steelthread import EvalConfig, SteelThread

# Custom tool stub
def weather_stub_response(
    ctx: ToolStubContext,
) -> str:
    """Stub for weather tool to return deterministic weather."""
    city = ctx.kwargs.get("city", "").lower()
    if city == "sydney":
        return "33.28"
    if city == "london":
        return "2.00"

    return f"Unknown city: {city}"


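# Custom evaluator. Note that it is not passed to the EvalConfig below;
# consult the EvalConfig options for how to supply custom evaluators.
class EmojiEvaluator(Evaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):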
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("🌞")
        return Metric(score=min(count / 2, 1.0), name="emoji_score", description="Emoji usage")

# Setup
config = Config.from_default()
st = SteelThread()
portia = Portia(
    config,
    tools=ToolStubRegistry(DefaultToolRegistry(config), {"weather_tool": weather_stub_response})
)

st.run_evals(
    portia,
    EvalConfig(
        eval_dataset_name="evals_v1",
        config=config,
        iterations=4,
    ),
)

🧪 Testing

Write tests for your metrics, plans, or evaluator logic using pytest:

uv run pytest tests/
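
One way to keep such tests fast is to factor scoring logic into a plain function that needs no Portia run. The sketch below assumes a hypothetical helper, `emoji_score`, which is not part of SteelThread and only mirrors the arithmetic used in `EmojiEvaluator`. The file name is likewise illustrative. The functions follow pytest's `test_*` naming convention, so `pytest` would collect them as written.

```python
# tests/test_emoji_score.py (hypothetical file name)


def emoji_score(output: str, emoji: str = "😊", target: int = 2) -> float:
    """Return the fraction of `target` emoji found in `output`, capped at 1.0."""
    return min(output.count(emoji) / target, 1.0)


def test_no_emoji_scores_zero():
    assert emoji_score("plain text") == 0.0


def test_one_emoji_scores_half():
    assert emoji_score("hello 😊") == 0.5


def test_score_is_capped_at_one():
    # Three emoji would give 1.5 without the cap.
    assert emoji_score("😊😊😊") == 1.0


def test_empty_output_scores_zero():
    assert emoji_score("") == 0.0
```

An evaluator can then call the helper from `eval_test_case`, so the arithmetic is covered by unit tests while the evaluator itself stays a thin wrapper.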

Release history

Version	Release date
0.1.19	Sep 3, 2025
0.1.18	Sep 3, 2025
0.1.17	Sep 3, 2025
0.1.16	Aug 28, 2025
0.1.15	Aug 15, 2025
0.1.14	Aug 14, 2025
0.1.13	Aug 13, 2025
0.1.12	Aug 12, 2025
0.1.11	Aug 8, 2025
0.1.10 (this version)	Aug 8, 2025
0.1.9	Aug 8, 2025
0.1.8	Aug 7, 2025
0.1.7	Aug 7, 2025
0.1.6a0 (pre-release)	Aug 6, 2025
0.1.5a0 (pre-release)	Aug 5, 2025
0.1.4a0 (pre-release)	Jul 30, 2025
0.1.3a0 (pre-release)	Jul 30, 2025
0.1.2a0 (pre-release)	Jul 29, 2025
0.1.1a0 (pre-release)	Jul 29, 2025

Download files

Source Distribution

steel_thread-0.1.10.tar.gz (21.3 kB)

Uploaded Aug 8, 2025 Source

Built Distribution

steel_thread-0.1.10-py3-none-any.whl (30.1 kB)

Uploaded Aug 8, 2025 Python 3

File details

Details for the file steel_thread-0.1.10.tar.gz.

File metadata

Download URL: steel_thread-0.1.10.tar.gz
Upload date: Aug 8, 2025
Size: 21.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.6

File hashes

Hashes for steel_thread-0.1.10.tar.gz
Algorithm	Hash digest
SHA256	`05a3262c0b2a5f9880f1fe83435e7c546bf9b503fd66db8b78ef7035aad954e6`
MD5	`e35bc3f2049f66cb4a6f8a551939b3a3`
BLAKE2b-256	`6faca74ab6540754a018a8706b1a430baa679b051fd634e0b92bdc70f458b86b`

File details

Details for the file steel_thread-0.1.10-py3-none-any.whl.

File metadata

Download URL: steel_thread-0.1.10-py3-none-any.whl
Upload date: Aug 8, 2025
Size: 30.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.6

File hashes

Hashes for steel_thread-0.1.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8be27c0fb5728a8275d6954aaa655158d10f5d57707510cc3657c166376091f`
MD5	`8b59940ae891c64f98afe49e8ef7173b`
BLAKE2b-256	`3313eac6297ccaffeb19b5de0e7c2c8e1f159bc451001ee4fa4f46583b8d2fbb`
