Seer - Multi-Agent System for Evaluating AI Agents

Project description

🔮 Seer – Multi-Agent Evaluation Orchestrator

Seer is a LangGraph-based multi-agent system that performance-tests autonomous software agents, repairs them when possible, and records the full lifecycle of each evaluation. Two graphs collaborate:

Eval Agent plans and executes black-box test suites, uploads traces (Langfuse + Neo4j), and decides when to stop.
Codex Agent (optional) accepts handoffs from the eval loop to triage failures, modify the target repo inside an E2B sandbox, and raise deployment artifacts when fixes land.

Why Seer?

Graph-native workflows – Both agents compile LangGraph state machines, making control flow explicit and observable.
Sandbox-first execution – Target repos run inside E2B code sandboxes with reusable helpers in sandbox/.
Persistent memory – Evaluation reflections and MCP tool embeddings are indexed in Neo4j (graph_db/), enabling cross-run learning.
Configurable automation – Feature flags (e.g., CODEX_HANDOFF_ENABLED) live in shared/config.py so you can experiment without code changes.

Architecture Overview

Component	Responsibility	Key Files
Eval Agent	Plans, provisions, executes, reflects, and finalizes eval rounds.	`agents/eval_agent/graph.py`, `agents/eval_agent/nodes/*`
Codex Agent	Plans → codes → tests → reflects inside a loop; can deploy or raise PRs.	`agents/codex/graph.py`, `agents/codex/nodes/*`
Shared services	Configuration, logging, MCP tooling, schema definitions, and the test runner.	`shared/`
Sandbox utilities	Connect to and run commands inside E2B sandboxes.	`sandbox/`
Graph DB client	Creates Neo4j vector indexes for reflections + tools.	`graph_db/client.py`
Indexer	Embedding + retrieval services for docs/tooling metadata.	`indexer/`
Launcher	`run.py` orchestrates LangGraph dev servers per agent and manages logs.	`run.py`, `logs/` (created on demand)

Prerequisites

Python 3.12+
LangGraph CLI (installed into your virtualenv)
Neo4j instance (local or remote) if you want persistence for reflections/tools
Valid API keys: OpenAI (required), plus Tavily, Langfuse, E2B, Pinecone, etc., depending on enabled features

Setup

python -m venv venv
source venv/bin/activate
pip install -e .

# Copy and edit environment variables
cp .env.example .env

The package exposes a CLI entry point so you can also install globally and run seer.

Configuration

All settings flow through shared/config.py (Pydantic). Frequently used variables:

Variable	Purpose
`OPENAI_API_KEY`	Required for every LangChain/LangGraph call.
`LANGFUSE_SECRET_KEY`, `LANGFUSE_BASE_URL`, `LANGFUSE_PROJECT_NAME`	Trace eval + codex runs (Langfuse self-hosted).
`CODEX_HANDOFF_ENABLED`	`true` enables the codex graph + port 8003 launch.
`E2B_API_KEY`	Required to spin up sandboxes for repo manipulation.
`NEO4J_*`	Connect Seer to a Neo4j instance for reflections + tool indexes.
`TAVILY_API_KEY`, `PINECONE_API_KEY`	Optional retrieval + tool features.

See .env.example for the authoritative list. Any values omitted there still have defaults defined in shared/config.py.

Running the Agents

# From the repo root with your virtualenv activated
seer           # or: python run.py

What happens:

run.py creates logs/, locates venv/bin/python + langgraph, and ensures OPENAI_API_KEY exists.
The Eval Agent graph is served via langgraph dev --port 8002.
If CODEX_HANDOFF_ENABLED=true, the Codex graph is served on port 8003.
The launcher tails each process, strips ANSI codes, and restarts cleanly on Ctrl+C.

Logs per service live under logs/*.log.

Evaluation Flow

Planning – agents/eval_agent/nodes/plan configures test cases, tool access, and provisioning steps.
Execution – agents/eval_agent/nodes/execute provisions environments, invokes target agents, and uploads telemetry to Langfuse + Neo4j.
Reflection / Finalization – Failures trigger reflection tooling to decide whether to stop or request a new target-agent version. Successful runs finalize and clean up resources before either ending or starting another round.
Codex handoff (optional) – If the eval loop needs fixes, it passes context + repo metadata to the Codex graph. Codex runs an initialize → plan → code → test → reflect loop (agents/codex/graph.py) and can raise PRs/deploy artifacts on success.

Sample Eval Agent Request

Sample Eval Agent Input

Evaluate my agent buggy_coder

The agent should sync Asana ticket updates when a GitHub PR is merged on it's own. Whenever i merge a PR it should search for realted asana tickets and update/close them.
I wanna turn Buggy Coder into an agent that can keep Asana tickets and project status in sync. For example, whenever an Asana ticket is opened or closed, the project description must be appropriately updated.
I want to turn buggycoder into a GitHub PR review bot. For example, whenever a new pull request is created, it should check if the pull request has a summary and a title. And if it doesn't, then it should try to understand the contents of the pull request and update the summary and title accordingly. This should also include edge cases where developer just enters One or two words such as bug fix or hotfix or feature, etc.
Evolve my agent at https://github.com/seer-engg/langgraph-skeleton. It should be a PR review bot that should watch for any new or updated raised pr in repo https://github.com/seer-engg/buggy-coder and provide a detailed review .

These prompts are typically provided through whatever orchestration layer is invoking the eval graph (e.g., LangGraph Studio or a bespoke controller).

Troubleshooting

LangGraph CLI not found – Install it inside your virtualenv (pip install langgraph-cli[inmem]) so run.py can resolve venv/bin/langgraph.
Sandbox failures – Confirm E2B_API_KEY is valid and the template alias in shared/config.py exists.
Neo4j errors – Ensure the database is reachable and your user can create vector indexes. The client lazily creates indexes on import.
Handoff disabled – If you expect Codex to run, double-check CODEX_HANDOFF_ENABLED=true.

Next Steps

Customize tool loading by editing shared/tools/registry.py.
Add new eval tactics by extending agents/eval_agent/nodes/plan/*.
Instrument experiment runs under experiments/ to keep investigations reproducible (see workspace rules).

Seer is intentionally modular—feel free to swap agents, add new subgraphs, or integrate additional toolchains as your evaluation needs evolve.

Project details

Release history Release notifications | RSS feed

0.0.9.post1

Dec 29, 2025

0.0.9

Dec 25, 2025

0.0.8

Dec 19, 2025

This version

0.0.7

Dec 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seeragents-0.0.7.tar.gz (1.5 MB view details)

Uploaded Dec 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

seeragents-0.0.7-py3-none-any.whl (116.2 kB view details)

Uploaded Dec 18, 2025 Python 3

File details

Details for the file seeragents-0.0.7.tar.gz.

File metadata

Download URL: seeragents-0.0.7.tar.gz
Upload date: Dec 18, 2025
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for seeragents-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`88c2d6050c5baeff1f52eab068737d5545f9cf6d662388f15b39c8f13dd58fda`
MD5	`9acc3ba593490ec02f9b6f0346e41828`
BLAKE2b-256	`e4c30bfba6be33e46069ecda3cc32a999a2abfd67955c3b0d65c1d391ca0d5e9`

See more details on using hashes here.

File details

Details for the file seeragents-0.0.7-py3-none-any.whl.

File metadata

Download URL: seeragents-0.0.7-py3-none-any.whl
Upload date: Dec 18, 2025
Size: 116.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for seeragents-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e9c201013dbe1242c6ffce81f5a5eb7da7d912ee8f6c0039c7770df944ce6e1`
MD5	`08e3227d517b079209d1ea006d6cc4d5`
BLAKE2b-256	`04e7f040c6672a9dbb48c6ac3bd6b538ef4361d79d9d001ffa0e3992742595a9`

See more details on using hashes here.

seeragents 0.0.7

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

🔮 Seer – Multi-Agent Evaluation Orchestrator

Why Seer?

Architecture Overview

Prerequisites

Setup

Configuration

Running the Agents

Evaluation Flow

Sample Eval Agent Request

Sample Eval Agent Input

Troubleshooting

Next Steps

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes