Reliability infrastructure for AI agents — evaluation, observability, and regression testing
CortexOps
Reliability infrastructure for AI agents.
Evaluate · Observe · Operate — for LangGraph, CrewAI, and AutoGen.
What's New in v0.4.0
LLM-as-judge evaluation
from cortexops.judge import LLMJudge
judge = LLMJudge(api_key="sk-...")
result = judge.evaluate(
    case_id="case-001",
    input="Process refund for order #4821",
    output="Refund of $49.99 approved and processed.",
    rubric="task_completion",
)
print(result.score, result.passed, result.reasoning)
Golden dataset API
from cortexops.dataset import GoldenDataset
ds = GoldenDataset(name="refund-agent-v1")
ds.add(input="Refund order #4821", expected="refund_approved")
ds.add(input="Cancel subscription", expected="subscription_cancelled")
ds.save("datasets/refund_agent.yaml")
results = ds.run(agent=your_agent, fail_on="task_completion < 0.90")
CI/CD eval gate
cortexops eval run \
  --dataset datasets/refund_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"
# Exit code 1 if regression detected — drop into GitHub Actions
The problem
You deployed an agent. You have no idea if it regressed overnight.
No standard eval format. No failure traces. No CI gate before the next prompt change ships.
CortexOps fixes that.
Quickstart
pip install cortexops # v0.4.0
from cortexops import CortexTracer, EvalSuite
# Wrap your LangGraph app — zero refactor required
tracer = CortexTracer(project="payments-agent")
graph = tracer.wrap(your_langgraph_app)
# Run evaluations against a golden dataset
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
)
print(results.summary())
# CortexOps eval — payments-agent
# Cases : 9 (7 passed, 2 failed)
# Task completion : 91.4%
# Tool accuracy : 97.0/100
# Latency p50/p95 : 42ms / 187ms
# Failed cases:
# - escalation_router: tool_call_mismatch (score 41)
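The summary reports latency as p50/p95. As a rough illustration of what those two numbers mean, the sketch below computes them from a list of per-case latencies using only the standard library. The function name `latency_percentiles` and the sample values are made up for this example; how CortexOps computes its percentiles internally is not documented here and may differ.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) for a list of per-case latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Made-up latencies for illustration, not real CortexOps output.
samples = [10, 20, 30, 40, 50]
p50, p95 = latency_percentiles(samples)
print(f"Latency p50/p95 : {p50:.0f}ms / {p95:.0f}ms")
```

The p95 value is usually the more useful one to compare against a per-case `max_latency_ms` budget, since it reflects the slow tail rather than the typical case.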
Golden dataset format
Define test cases in YAML. Run them locally or in CI.
# golden_v1.yaml
version: 1
project: payments-agent
cases:
  - id: refund_lookup_01
    input: "What is the status of refund REF-8821?"
    expected_tool_calls: [lookup_refund]
    expected_output_contains: ["approved", "REF-8821"]
    max_latency_ms: 3000
  - id: dispute_escalation_01
    input: "I was charged twice — this is unauthorized"
    expected_tool_calls: [classify_dispute, route_escalation]
    expected_output_contains: ["escalated"]
    max_latency_ms: 5000
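To make the three per-case fields concrete, here is a minimal sketch of how one case could be checked against a recorded agent run. It is not the CortexOps implementation: `AgentResult` and `check_case` are hypothetical names, and the matching rules (exact, ordered tool-call comparison and case-insensitive substring matching) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical record of one agent run (not a CortexOps type)."""
    output: str
    tool_calls: list[str] = field(default_factory=list)
    latency_ms: float = 0.0


def check_case(case: dict, result: AgentResult) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passed."""
    failures = []

    # Assumption: tool calls must match exactly, in order.
    expected_tools = case.get("expected_tool_calls", [])
    if result.tool_calls != expected_tools:
        failures.append(
            f"tool_call_mismatch: expected {expected_tools}, got {result.tool_calls}"
        )

    # Assumption: substring match, ignoring case.
    for needle in case.get("expected_output_contains", []):
        if needle.lower() not in result.output.lower():
            failures.append(f"missing_output: {needle!r}")

    budget = case.get("max_latency_ms")
    if budget is not None and result.latency_ms > budget:
        failures.append(f"latency_exceeded: {result.latency_ms}ms > {budget}ms")

    return failures


# The first case from golden_v1.yaml, written as a plain dict.
case = {
    "id": "refund_lookup_01",
    "expected_tool_calls": ["lookup_refund"],
    "expected_output_contains": ["approved", "REF-8821"],
    "max_latency_ms": 3000,
}

good = AgentResult("Refund REF-8821 was approved.", ["lookup_refund"], 120)
bad = AgentResult("I cannot help with that.", [], 4200)

print(check_case(case, good))
for reason in check_case(case, bad):
    print(reason)
```

Returning a list of reasons rather than a single boolean keeps the failure trace readable when a case breaks on more than one expectation.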
CI eval gate
Add to .github/workflows/eval.yml:
- name: CortexOps eval gate
  run: |
    python examples/langgraph_payments/run_eval.py \
      --dataset golden_v1.yaml \
      --fail-on "task_completion < 0.90"
If the score falls below the threshold, the job exits non-zero and the check fails, which blocks the PR when that check is required.
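The gate comes down to evaluating a `--fail-on` expression against the metrics from a run and turning the result into an exit code. The sketch below shows one way to do that. It is an illustration, not the code in `run_eval.py` or the `cortexops` CLI: the helper names `should_fail` and `exit_code`, the supported operators, and the assumption that metrics are fractions between 0 and 1 are all hypothetical.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption, not a spec).
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}
# Matches expressions such as: task_completion < 0.90
_PATTERN = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")


def should_fail(expression: str, metrics: dict[str, float]) -> bool:
    """Return True when the fail-on condition holds for the given metrics."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported fail-on expression: {expression!r}")
    name, op, threshold = match.groups()
    if name not in metrics:
        raise KeyError(f"unknown metric: {name}")
    return _OPS[op](metrics[name], float(threshold))


def exit_code(expression: str, metrics: dict[str, float]) -> int:
    """Map the gate result to a process exit code: 1 fails the CI job."""
    return 1 if should_fail(expression, metrics) else 0


# Made-up metric values for illustration.
print(exit_code("task_completion < 0.90", {"task_completion": 0.914}))
print(exit_code("task_completion < 0.90", {"task_completion": 0.85}))
```

A script would pass the result to `sys.exit()` as its last step, so the CI runner sees the non-zero status and marks the job as failed.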
Repo structure
cortexops/
├── sdk/                          # pip install cortexops  # v0.4.0
│   ├── cortexops/
│   │   ├── tracer.py             # CortexTracer: wraps LangGraph / CrewAI
│   │   ├── eval.py               # EvalSuite: golden dataset runner
│   │   ├── metrics.py            # task_completion, tool_accuracy, latency, hallucination
│   │   ├── models.py             # Pydantic data models
│   │   └── client.py             # HTTP client for hosted API
│   └── tests/
├── backend/                      # FastAPI + Celery + SQLite/Postgres
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/              # /v1/evals, /v1/traces
│   │   ├── models/               # DB records + API schemas
│   │   └── worker/               # Celery async eval tasks
│   └── Dockerfile
├── frontend/                     # React + TypeScript dashboard
├── examples/
│   └── langgraph_payments/       # Full runnable demo
│       ├── agent.py
│       ├── golden_v1.yaml
│       └── run_eval.py
└── docker-compose.yml
Run the full stack locally
git clone https://github.com/ashishodu2023/cortexops
cd cortexops
# Start API + worker + Redis
docker compose up --build
# In another terminal — run the demo eval
cd examples/langgraph_payments
pip install -e ../../sdk/
python run_eval.py
# API docs at http://localhost:8000/docs
# Dashboard at http://localhost:3000
Supported frameworks
| Framework | Status |
|---|---|
| LangGraph | Stable |
| CrewAI | Stable |
| AutoGen | Beta |
| LlamaIndex agents | Coming soon |
| Custom callables | Supported via CortexTracer.wrap() |
Built-in metrics
| Metric | What it checks |
|---|---|
| task_completion | Agent produced a valid, non-error output |
| tool_accuracy | Expected tool calls were actually made |
| latency | Response within max_latency_ms budget |
| hallucination | Detects fabrication signals in output |
Add custom metrics by subclassing cortexops.Metric.
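The interface of `cortexops.Metric` is not shown in this README, so the sketch below uses a local stand-in base class to illustrate the subclassing pattern only. The `Metric` class defined here, its `score` method, and the `NoCardNumberMetric` example are all hypothetical; check the SDK source for the real method names and signatures before adapting it.

```python
import re
from abc import ABC, abstractmethod


class Metric(ABC):
    """Local stand-in. NOT the real cortexops.Metric, whose interface may differ."""

    name: str = "metric"

    @abstractmethod
    def score(self, output: str) -> float:
        """Return a score between 0.0 (fail) and 1.0 (pass)."""


class NoCardNumberMetric(Metric):
    """Hypothetical custom metric: fail outputs that echo a full card number."""

    name = "no_card_number"
    # Assumption: any run of 13 to 16 digits is treated as a card number.
    _card_like = re.compile(r"\b\d{13,16}\b")

    def score(self, output: str) -> float:
        return 0.0 if self._card_like.search(output) else 1.0


metric = NoCardNumberMetric()
print(metric.name, metric.score("Refund of $49.99 approved and processed."))
print(metric.name, metric.score("Card 4111111111111111 was charged twice."))
```

Keeping each metric a small class with a single scoring method makes it easy to unit test in isolation before wiring it into an eval run.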
Contributing
git clone https://github.com/ashishodu2023/cortexops
cd cortexops/sdk
pip install -e ".[dev]"
pytest tests/ -v
See CONTRIBUTING.md. Issues labeled good first issue are a great place to start.
Citation
@software{cortexops2025,
author = {Ashish, et al.},
title = {CortexOps: Reliability Infrastructure for AI Agents},
year = {2025},
url = {https://github.com/ashishodu2023/cortexops},
}
License
MIT — see LICENSE.