Topological runtime for declarative evaluations

Peven

Elevator Pitch

Peven is a rich topological engine for multi-agent evaluations.

Inspiration

Peven is inspired by a couple of different things. For starters, the name is taken from Patricia A. McKillip's Riddle-Master trilogy. Peven of Aum is a king, a ghost, and a master riddler who has only ever lost once. In the trilogy, riddles are made up of three parts: questions, answers, and strictures. My hope for Peven is that it can help you explore evaluations by providing a runtime where you can ask a question, iterate based on the stricture, and, eventually, get to an answer. "Beware the unanswered Riddle."

My second point of inspiration comes from my time working at The LLM Data Company, where I had the chance to learn and experiment to my heart's content. A lot of my work centered around environments and benchmarks. I often wished I had a reusable framework or package to support that work, something like pydantic (which I love) but for evaluations.

Most of the architectural decisions I made regarding the engine are because I thought the math was cool. Peven should give you a pretty clear sense of (1) how I think about evaluations and (2) what types of evaluations I'm interested in.

This is my first contribution to the open-source ecosystem (not sure I can claim that before I even have one star, so please star!), and selfishly I do hope it evolves into something more meaningful. If this package is useful or, better yet, interesting to one person, I'll be happy.

Bets

Petri nets are the best way to express multi-agent evaluations. Every agent loop, adversarial interaction, and multi-turn evaluation is a concurrent stateful system with shared resources. That's what Petri nets were invented to model.
Structure where you want it, chaos where you need it. Acyclic nets give you deterministic parallel experiments. Cyclic nets give you dynamic agent loops. One engine, one formalism.

Heritage

An older version of this package was tested extensively with frontier OpenAI and Anthropic models and used complex DAGs instead of Petri nets. Some examples of the tests include: Supreme Court simulations where multiple justices drafted memos and voted, code repair leagues where solutions were executed and critiqued in loops, red-team tournaments with adversarial pairwise comparison, diplomacy crisis negotiations between multi-agent panels, and incident response exercises with staged evidence packets. Those experiments shaped key design decisions in this engine: why colored tokens for batch isolation, why async guards for judge gates, why the consume-eagerly pattern avoids locks, and why the harness itself is the experiment.

What matters in serious evaluations, in my opinion, is the shape of interaction: who sees what, in what order, what actions they can take, and how we judge those actions across individual or shared states. Peven favors radical explicitness: evaluation work should never hide inside implicit assumptions.

Petri Nets

A Petri net is a mathematical model for concurrent systems. The key concepts:

Places — containers that hold tokens. Think of them as states or buffers.
Transitions — actions that consume tokens from input places and produce tokens in output places. In peven, transitions are LLM calls (agents, judges).
Arcs — directed edges connecting places to transitions (inputs) and transitions to places (outputs). A net is always bipartite: places only connect to transitions, never directly to each other. Each arc has a weight (default 1): an input arc with weight $w$ means the transition needs $w$ tokens from that place to fire; an output arc with weight $w$ deposits $w$ copies of the result.
Tokens — data flowing through the net. In peven, tokens carry text (GenerateOutput) or scores (JudgeOutput).
Firing rule — a transition $t$ is enabled when every input place $p$ has at least $w(p, t)$ tokens. Firing consumes those tokens and deposits outputs. Multiple transitions can fire concurrently if they have independent inputs.
Colored tokens — tokens tagged with a run_id for batch isolation. Multiple independent evaluations run through the same net simultaneously without interfering.
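
A minimal sketch of how color tags can keep runs apart, assuming tokens are stored per place and counted per run_id. The `Token` class and the `ready_runs` helper are hypothetical names made up for illustration; they are not peven's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token: a payload tagged with the run it belongs to."""
    run_id: str
    payload: str


def ready_runs(marking, inputs):
    """Return the run_ids for which every input place holds enough tokens.

    marking: place name -> list of Token
    inputs:  place name -> arc weight required by the transition
    """
    counts = defaultdict(lambda: defaultdict(int))  # place -> run_id -> count
    for place, tokens in marking.items():
        for tok in tokens:
            counts[place][tok.run_id] += 1
    run_ids = {tok.run_id for tokens in marking.values() for tok in tokens}
    return sorted(
        run_id for run_id in run_ids
        if all(counts[place][run_id] >= weight for place, weight in inputs.items())
    )


marking = {
    "draft": [Token("run-a", "essay A"), Token("run-b", "essay B")],
    "evidence": [Token("run-a", "source A")],
}
# A transition needing one token from each place is enabled only for run-a,
# because run-b has no token in "evidence".
print(ready_runs(marking, {"draft": 1, "evidence": 1}))  # ['run-a']
```

Counting per color rather than per place is what stops a token from one run satisfying a transition on behalf of another.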

A marking $M$ maps each place to its token count. Transition $t$ is enabled when every input place has enough:

$$M(p) \geq w(p, t) \quad \forall\, p \in \bullet t$$

$\bullet t$ is the set of input places. $w(p, t)$ is the arc weight from place $p$ to transition $t$ (0 if no arc). When $t$ fires, every place updates:

$$M'(p) = M(p) - w(p, t) + w(t, p)$$

What was there, minus what the transition consumed, plus what it produced. $w(t, p)$ is the arc weight from $t$ back to $p$ (0 if no arc).
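
The enabling condition and the update rule can be sketched in a few lines of plain Python. This is an illustration of the math only, assuming a marking is a dict of token counts; `is_enabled` and `fire` are hypothetical helpers, not functions exported by peven.

```python
def is_enabled(marking, inputs):
    """True when every input place p holds at least w(p, t) tokens."""
    return all(marking.get(p, 0) >= w for p, w in inputs.items())


def fire(marking, inputs, outputs):
    """Return M' where M'(p) = M(p) - w(p, t) + w(t, p) for every place p."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)  # leave the caller's marking untouched
    for p, w in inputs.items():
        new_marking[p] = new_marking.get(p, 0) - w
    for p, w in outputs.items():
        new_marking[p] = new_marking.get(p, 0) + w
    return new_marking


# A transition that takes 2 tokens from "budget", returns 1 to it, and logs 1.
before = {"budget": 3, "log": 0}
after = fire(before, inputs={"budget": 2}, outputs={"budget": 1, "log": 1})
print(after)  # {'budget': 2, 'log': 1}
```

The "budget" place appears on both sides of the transition, so both terms of the update apply to it: 3 - 2 + 1 = 2.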

Example: A simple generate net with two places and one transition:

from peven import NetBuilder, agent, GenerateOutput

n = NetBuilder()
prompt = n.place("prompt")      # holds the input token
response = n.place("response")  # receives the agent's output
generate = n.transition("generate", agent(model="openai:gpt-4o", prompt="{text}"))
prompt >> generate >> response   # arc weight = 1 (default)
prompt.token(GenerateOutput(text="hello"))  # initial marking: 1 token in prompt

Before:  prompt=1  response=0
         generate is enabled: prompt has 1 token >= arc weight 1 ✓

generate fires (calls gpt-4o, gets a response):
         prompt   = 1 - 1 + 0 = 0  (consumed 1 token, generate doesn't output here)
         response = 0 - 0 + 1 = 1  (generate deposits its output here)

After:   prompt=0  response=1

Most arcs in peven have weight 1, so it's just "take one, put one." Weights > 1 are for when a transition needs multiple tokens from the same place to fire, like a join that waits for two pieces of evidence to accumulate in one place.
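
As a worked instance of the update rule with a weight above 1, consider a hypothetical join $t$ that needs two tokens from a place $a$ and one from a place $b$, and deposits one token in $c$. The place names and weights are illustrative only:

$$w(a, t) = 2, \quad w(b, t) = 1, \quad w(t, c) = 1$$

Starting from $M(a) = 2$, $M(b) = 1$, $M(c) = 0$, the transition is enabled, and firing gives:

$$M'(a) = 2 - 2 + 0 = 0, \qquad M'(b) = 1 - 1 + 0 = 0, \qquad M'(c) = 0 - 0 + 1 = 1$$

With only one token in $a$, $M(a) = 1 < w(a, t) = 2$, so $t$ stays disabled no matter how many tokens sit in $b$.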

The engine uses an event-driven loop. Transitions fire as soon as they're enabled, and the loop wakes on the first finished task (asyncio.wait with return_when=FIRST_COMPLETED). Tokens are consumed eagerly at spawn time and deposited on completion. All marking mutations happen in a single-threaded central loop, so no locks are needed.
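
A minimal sketch of that loop shape, assuming markings are plain token counts and each transition is a coroutine function. `run_net` and the tuple layout for transitions are hypothetical choices for this sketch; peven's real engine also handles colors, guards, retries, and fuses, none of which appear here.

```python
import asyncio


async def run_net(marking, transitions):
    """Fire enabled transitions until nothing is enabled and nothing is running.

    marking:     place name -> token count (mutated only by this loop)
    transitions: name -> (inputs, outputs, coroutine function)
    """
    running = {}  # asyncio.Task -> name of the transition it is executing
    fired = []    # completion order, kept for inspection
    while True:
        # Spawn every enabled transition, consuming its inputs immediately so
        # no other transition can claim the same tokens.
        progress = True
        while progress:
            progress = False
            for name, (inputs, _outputs, work) in transitions.items():
                if inputs and all(marking.get(p, 0) >= w for p, w in inputs.items()):
                    for p, w in inputs.items():
                        marking[p] -= w
                    running[asyncio.create_task(work())] = name
                    progress = True
        if not running:
            return fired  # quiescent: nothing enabled, nothing in flight
        done, _pending = await asyncio.wait(
            set(running), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            name = running.pop(task)
            task.result()  # re-raise any error from the transition
            for p, w in transitions[name][1].items():
                marking[p] = marking.get(p, 0) + w  # deposit on completion
            fired.append(name)


async def main():
    async def work():
        await asyncio.sleep(0)  # stand-in for an LLM call

    marking = {"prompt": 2, "response": 0, "scored": 0}
    transitions = {
        "gen": ({"prompt": 1}, {"response": 1}, work),
        "jdg": ({"response": 1}, {"scored": 1}, work),
    }
    fired = await run_net(marking, transitions)
    return marking, fired


final_marking, fired = asyncio.run(main())
print(final_marking)  # {'prompt': 0, 'response': 0, 'scored': 2}
```

Because inputs are subtracted before the task is awaited, two transitions can never race for the same token, and every write to the marking happens between awaits in one coroutine.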

Tokens are the unit of inter-node communication, not agent state. An agent can run multi-turn conversations, use tools, and accumulate context internally. The net only sees the final result as a token passed to the next transition. This is sub-optimal for any actor that needs to perceive internal agent state: the experiment designer debugging a run, a monitor agent screening intermediate reasoning, or a judge that needs to evaluate process rather than output. This is something I will be thinking about improving immediately.

Install

uv add peven

Quickstart

from peven import NetBuilder, agent, judge, execute, GenerateOutput

n = NetBuilder()
prompt = n.place("prompt")
response = n.place("response")
scored = n.place("scored")

gen = n.transition("gen", agent(model="openai:gpt-4o", prompt="Write about {text}"))
jdg = n.transition("jdg", judge(model="openai:gpt-4o", rubric=[{"weight": 1.0, "requirement": "clear and engaging"}]))

prompt >> gen >> response >> jdg >> scored
prompt.token(GenerateOutput(text="the importance of testing"))

net = n.build()
results = await execute(net)  # in an async function

# Or from a script:
# import asyncio
# results = asyncio.run(execute(net))

If a net has exactly one judge transition, RunResult.score is inferred from it. If a net has multiple judges, designate the scalar score explicitly with n.score_from("transition_id") or n.score_from(transition_proxy).

Nodes

Every transition in a net is either an agent (generates text) or a judge (scores text). Each node gets its own model and configuration.

agent

agent(model, prompt, system=None, tools=None, model_settings=None)

gen = n.transition("gen", agent(
    model="openai:gpt-4o",
    prompt="Write a poem about {text}",
    system="You are a poet.",
))

revise = n.transition("revise", agent(
    model="anthropic:claude-sonnet-4-20250514",
    prompt="Revise this: {text}",
))

Models use pydantic-ai routing: "openai:gpt-4o", "anthropic:claude-sonnet-4-20250514", "ollama:qwen2.5:0.5b", etc.

judge

judge(model, rubric, strategy="per_criterion")

jdg = n.transition("jdg", judge(
    model="openai:gpt-4o",
    rubric=[
        {"weight": 1.0, "requirement": "clear and well-structured"},
        {"weight": 0.5, "requirement": "uses specific examples"},
    ],
))

Judges use the rubric package, built at The LLM Data Company. Three grading strategies:

Strategy	LLM calls	Output
`per_criterion` (default)	One per criterion	Per-criterion MET/UNMET with reasons
`oneshot`	One for all criteria	Per-criterion MET/UNMET in a single call
`rubric_as_judge`	One holistic call	Single score 0-100, no per-criterion breakdown

Different judges in the same net can use different models and strategies:

jdg_deep = n.transition("deep", judge(model="openai:gpt-4o", rubric=rubric))
jdg_fast = n.transition("fast", judge(model="openai:gpt-4o-mini", rubric=rubric, strategy="oneshot"))

Multiple judges are allowed. n.score_from(...) only designates which judge supplies RunResult.score; other judge outputs still remain in the trace and can continue to drive routing through places and guards. If the designated scorer fires multiple times in one run, RunResult.score is the mean of those scorer emissions.
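
The averaging rule can be illustrated with a small sketch. The flat list of (transition_id, score) pairs is a hypothetical shape chosen for this example, not peven's trace format, and returning None when the scorer never fired is an assumption.

```python
from statistics import mean


def run_score(trace, scorer_id):
    """Mean score of the designated scorer's emissions, or None if it never fired.

    trace: list of (transition_id, score) pairs in firing order.
    """
    scores = [score for tid, score in trace if tid == scorer_id]
    return mean(scores) if scores else None


# "deep" is the designated scorer and fired twice; "fast" is ignored.
trace = [("deep", 0.5), ("fast", 0.9), ("deep", 1.0)]
print(run_score(trace, "deep"))  # 0.75
```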

Routing remains explicit in the topology via when(...). Thresholds live in the net.

For common score-gated branches, use the package guard helper:

from peven import score_at_least

scored >> accept.when(score_at_least(0.8)) >> accepted
scored >> revise.when(lambda tokens: not score_at_least(0.8)(tokens)) >> prompt

For cross-judge or richer aggregation patterns, write a custom guard or an explicit aggregator transition.
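
A custom guard could look roughly like the sketch below. It assumes, as the lambda in the routing example suggests, that a guard is a callable receiving the input tokens, and it further assumes each token exposes a numeric `score` attribute. `Scored`, `all_scores_at_least`, and `negate` are hypothetical names for illustration, not part of peven.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    """Hypothetical stand-in for a judge token carrying a score in [0, 1]."""
    score: float


def all_scores_at_least(threshold):
    """Build a guard that passes only when every input token meets the threshold."""
    def guard(tokens):
        return bool(tokens) and all(tok.score >= threshold for tok in tokens)
    return guard


def negate(guard):
    """Wrap a guard so the complementary branch does not repeat the threshold."""
    return lambda tokens: not guard(tokens)


accept = all_scores_at_least(0.8)
retry = negate(accept)
print(accept([Scored(0.9), Scored(0.85)]))  # True
print(retry([Scored(0.9), Scored(0.4)]))    # True
```

Deriving the retry branch from the accept guard keeps the two branches mutually exclusive, so a token is never eligible for both.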

Trust Model

peven run and peven validate both execute the target eval file as Python. Only run trusted eval files.
Agent tools= are raw Python callables executed with the same permissions as the current process. Sandbox them yourself if needed.
peven run persists results to ~/.peven/runs.db by default. Use --no-save if you do not want local token payloads written to disk.

CLI

# Run an eval file
peven run eval.py
peven run eval.py --concurrency 5 --fuse 500
peven run eval.py --no-save

# Show per-transition execution trace (which transitions fired, outputs, scores)
peven run eval.py --trace

# Validate and inspect topology
peven validate eval.py

# Review stored runs
peven review all              # every run
peven review last 10          # last N runs
peven review <run_id>
peven review <run_id> --trace

By default peven run persists results to ~/.peven/runs.db (created on first run, no setup needed) so you can review them later with peven review. Pass --no-save to disable local persistence for a run.

Runs can finish as:

completed — execution reached quiescence with no active tokens outside sink places
failed — a transition executor failed after retries
incomplete — execution stopped with active tokens remaining, such as a guard error, deadlock, or fuse exhaustion
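
One way to read these three outcomes is as a function of the final marking. The sketch below assumes a marking of token counts and a known set of sink places; `RunStatus` and `classify` are hypothetical names, and peven's own status logic may differ.

```python
from enum import Enum


class RunStatus(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    INCOMPLETE = "incomplete"


def classify(marking, sink_places, executor_failed):
    """Map an end-of-run marking to one of the three outcomes."""
    if executor_failed:
        return RunStatus.FAILED
    # Tokens left anywhere except a sink place mean the run did not finish.
    active = {p: n for p, n in marking.items() if n > 0 and p not in sink_places}
    return RunStatus.INCOMPLETE if active else RunStatus.COMPLETED


print(classify({"prompt": 0, "scored": 1}, {"scored"}, False).value)    # completed
print(classify({"response": 1, "scored": 0}, {"scored"}, False).value)  # incomplete
```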

Repo Examples

If you are working from a repo checkout, the examples/ folder has toy nets to get started:

simple.py — Single generate. Minimal net.
refine.py — Generate-judge-revise loop. Cycles back until score passes threshold.
debate.py — Two agents argue in parallel, judge scores the result. Fork-join topology.

peven run examples/refine.py --trace
peven validate examples/debate.py

Releases

This is v0.1.1. See ROADMAP.md for what's next.

Tests

Repo checkout only:

# Unit + integration
uv run pytest tests/ --ignore=tests/test_e2e.py -v

# Optional live E2E with Ollama
uv run pytest tests/test_e2e.py -v

Contributing

See CONTRIBUTING.md.

Release history

Version	Released
0.2.2	Apr 28, 2026
0.2.0	Apr 23, 2026
0.1.1 (this version)	Apr 15, 2026
0.1.0	Mar 25, 2026

Download files

File	Type	Size	Uploaded	Tags
peven-0.1.1.tar.gz	Source distribution	75.1 kB	Apr 15, 2026	Source
peven-0.1.1-py3-none-any.whl	Built distribution	35.0 kB	Apr 15, 2026	Python 3

Both files were uploaded via twine/6.2.0 CPython/3.12.13, without Trusted Publishing.

Hashes for peven-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`46117de17d84ada4b1bcc7517a3b3195307144d52b1be1bf1a526d16a9f810d2`
MD5	`339535158bb1005140de4a6629151a0d`
BLAKE2b-256	`4a529a90ad6ed853b8f3bbf70e7d039a2951347ef279f87a499bf49a94c42648`

Hashes for peven-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d53e07a76861293e46878c9e51bf01be769e7bc64c654d195b905d31ed513134`
MD5	`be88d4152055679b2bbe490e24b751ab`
BLAKE2b-256	`7ec8b7bd46b8cb31c21729a88938ed96ecbb19d533daf934cfbd4f93a4dc9a05`