Reliability infrastructure for AI agents — evaluation, observability, and regression testing

CortexOps

Reliability infrastructure for AI agents.
Evaluate · Observe · Operate — for LangGraph, CrewAI, and AutoGen.

Python 3.10+ · License: MIT


What's New in v0.4.0

LLM-as-judge evaluation

from cortexops.judge import LLMJudge

judge = LLMJudge(api_key="sk-...")
result = judge.evaluate(
    case_id="case-001",
    input="Process refund for order #4821",
    output="Refund of $49.99 approved and processed.",
    rubric="task_completion",
)
print(result.score, result.passed, result.reasoning)

Golden dataset API

from cortexops.dataset import GoldenDataset

ds = GoldenDataset(name="refund-agent-v1")
ds.add(input="Refund order #4821", expected="refund_approved")
ds.add(input="Cancel subscription", expected="subscription_cancelled")
ds.save("datasets/refund_agent.yaml")

results = ds.run(agent=your_agent, fail_on="task_completion < 0.90")

CI/CD eval gate

cortexops eval run \
  --dataset datasets/refund_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"
# Exit code 1 if regression detected — drop into GitHub Actions
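
To make the gate semantics concrete, here is a minimal sketch of how a threshold expression such as "task_completion < 0.90" could be turned into an exit code. It is not the CortexOps implementation: the function names should_fail and exit_code, the accepted operators, and the one-comparison grammar are all assumptions made for illustration.

```python
import operator
import re

# Comparison operators accepted in a threshold expression (an assumed grammar).
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

# metric name, operator, numeric threshold, e.g. "task_completion < 0.90"
_EXPR = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")


def should_fail(expression: str, scores: dict[str, float]) -> bool:
    """Return True when the failure condition in `expression` holds for `scores`."""
    match = _EXPR.match(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    metric, op, threshold = match.groups()
    if metric not in scores:
        raise KeyError(f"unknown metric: {metric}")
    return _OPS[op](scores[metric], float(threshold))


def exit_code(expression: str, scores: dict[str, float]) -> int:
    """Map the gate decision to a process exit code (1 = regression, 0 = pass)."""
    return 1 if should_fail(expression, scores) else 0
```

With these hypothetical helpers, a score of 0.914 against "task_completion < 0.90" yields exit code 0, while 0.85 yields 1, which is what lets a CI runner block the job.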

The problem

You deployed an agent. You have no idea if it regressed overnight.

No standard eval format. No failure traces. No CI gate before the next prompt change ships.
CortexOps fixes that.


Quickstart

pip install cortexops  # v0.4.0

from cortexops import CortexTracer, EvalSuite

# Wrap your LangGraph app — zero refactor required
tracer = CortexTracer(project="payments-agent")
graph  = tracer.wrap(your_langgraph_app)

# Run evaluations against a golden dataset
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
)

print(results.summary())
# CortexOps eval — payments-agent
#   Cases           : 9  (7 passed, 2 failed)
#   Task completion : 91.4%
#   Tool accuracy   : 97.0/100
#   Latency p50/p95 : 42ms / 187ms
#   Failed cases:
#     - escalation_router: tool_call_mismatch (score 41)
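
The latency line in the summary reports p50 and p95. As a reminder of what those figures mean, here is a small sketch that computes them with the nearest-rank method. Whether CortexOps uses nearest-rank or an interpolating method is not stated in this README, so treat the method as an assumption; the sample latencies are made up for illustration.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Made-up per-case latencies in milliseconds.
latencies_ms = [38, 40, 41, 42, 42, 45, 51, 60, 187, 35]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

A single slow case dominates p95 on a small dataset, so it is worth reading p95 alongside the case count.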

Golden dataset format

Define test cases in YAML. Run them locally or in CI.

# golden_v1.yaml
version: 1
project: payments-agent

cases:
  - id: refund_lookup_01
    input: "What is the status of refund REF-8821?"
    expected_tool_calls: [lookup_refund]
    expected_output_contains: ["approved", "REF-8821"]
    max_latency_ms: 3000

  - id: dispute_escalation_01
    input: "I was charged twice. This is unauthorized."
    expected_tool_calls: [classify_dispute, route_escalation]
    expected_output_contains: ["escalated"]
    max_latency_ms: 5000
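
The fields above suggest three checks per case: tool calls, output substrings, and a latency budget. The sketch below shows one way such a check could work on an already-parsed case. The AgentRun record, the check_case function, the exact-order comparison of tool calls, and the case-insensitive substring match are all assumptions for illustration, not the behavior of EvalSuite.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """What one agent invocation produced (a hypothetical record, not a CortexOps type)."""
    output: str
    tool_calls: list[str] = field(default_factory=list)
    latency_ms: float = 0.0


def check_case(case: dict, run: AgentRun) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passed."""
    failures = []
    # Assumption: tool calls must match exactly, in order.
    expected_tools = case.get("expected_tool_calls", [])
    if run.tool_calls != expected_tools:
        failures.append(f"tool_call_mismatch: expected {expected_tools}, got {run.tool_calls}")
    # Assumption: substring checks are case-insensitive.
    for needle in case.get("expected_output_contains", []):
        if needle.lower() not in run.output.lower():
            failures.append(f"missing_output: {needle!r}")
    budget = case.get("max_latency_ms")
    if budget is not None and run.latency_ms > budget:
        failures.append(f"latency_exceeded: {run.latency_ms}ms > {budget}ms")
    return failures


# The first case from golden_v1.yaml, written as the dict a YAML parser would produce.
case = {
    "id": "refund_lookup_01",
    "input": "What is the status of refund REF-8821?",
    "expected_tool_calls": ["lookup_refund"],
    "expected_output_contains": ["approved", "REF-8821"],
    "max_latency_ms": 3000,
}
```

Returning reasons rather than a bare boolean keeps the failure trace readable when a case fails for more than one reason.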

CI eval gate

Add to .github/workflows/eval.yml:

- name: CortexOps eval gate
  run: |
    python examples/langgraph_payments/run_eval.py \
      --dataset golden_v1.yaml \
      --fail-on "task_completion < 0.90"

If the eval drops below threshold, the job exits non-zero and the PR is blocked.


Repo structure

cortexops/
├── sdk/                        # pip install cortexops  # v0.4.0
│   ├── cortexops/
│   │   ├── tracer.py           # CortexTracer — wraps LangGraph / CrewAI
│   │   ├── eval.py             # EvalSuite — golden dataset runner
│   │   ├── metrics.py          # task_completion, tool_accuracy, latency, hallucination
│   │   ├── models.py           # Pydantic data models
│   │   └── client.py           # HTTP client for hosted API
│   └── tests/
├── backend/                    # FastAPI + Celery + SQLite/Postgres
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/            # /v1/evals, /v1/traces
│   │   ├── models/             # DB records + API schemas
│   │   └── worker/             # Celery async eval tasks
│   └── Dockerfile
├── frontend/                   # React + TypeScript dashboard
├── examples/
│   └── langgraph_payments/     # Full runnable demo
│       ├── agent.py
│       ├── golden_v1.yaml
│       └── run_eval.py
└── docker-compose.yml

Run the full stack locally

git clone https://github.com/ashishodu2023/cortexops
cd cortexops

# Start API + worker + Redis
docker compose up --build

# In another terminal — run the demo eval
cd examples/langgraph_payments
pip install -e ../../sdk/
python run_eval.py

# API docs at http://localhost:8000/docs
# Dashboard at http://localhost:3000

Supported frameworks

| Framework | Status |
|---|---|
| LangGraph | Stable |
| CrewAI | Stable |
| AutoGen | Beta |
| LlamaIndex agents | Coming soon |
| Custom callables | Supported via CortexTracer.wrap() |
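
For readers wondering what wrapping a custom callable involves in general, here is a generic sketch of the pattern: a wrapper that records latency and error status for each call while leaving the return value untouched. This is not CortexTracer's code; the traced function and the shape of each record are hypothetical.

```python
import time
from functools import wraps
from typing import Any, Callable


def traced(fn: Callable[..., Any], records: list[dict]) -> Callable[..., Any]:
    """Wrap `fn` so every call appends a record with latency and error status."""

    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise  # the caller still sees the original exception
        finally:
            # Runs on both success and failure, so every call leaves a trace.
            records.append({
                "name": fn.__name__,
                "latency_ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })

    return wrapper
```

Because the wrapper keeps the original signature and return value, the agent code does not need to change, which is the sense in which wrapping requires no refactor.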

Built-in metrics

| Metric | What it checks |
|---|---|
| task_completion | Agent produced a valid, non-error output |
| tool_accuracy | Expected tool calls were actually made |
| latency | Response within max_latency_ms budget |
| hallucination | Detects fabrication signals in output |

Add custom metrics by subclassing cortexops.Metric.
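
The README does not show the Metric interface, so the sketch below defines a local stand-in base class to illustrate the subclassing pattern. The score method, its arguments, and the name attribute are assumptions; check the SDK source for the real signature before adapting this.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Stand-in base class; the real cortexops.Metric interface may differ."""

    name: str = "metric"

    @abstractmethod
    def score(self, case: dict, output: str) -> float:
        """Return a score in [0.0, 1.0] for one case."""


class ContainsAllMetric(Metric):
    """Fraction of expected substrings that appear in the output."""

    name = "contains_all"

    def score(self, case: dict, output: str) -> float:
        expected = case.get("expected_output_contains", [])
        if not expected:
            return 1.0  # nothing to check, so the case trivially passes
        hits = sum(1 for s in expected if s.lower() in output.lower())
        return hits / len(expected)
```

A fractional score like this gives partial credit, which can be easier to track across runs than a pass/fail flag.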


Contributing

git clone https://github.com/ashishodu2023/cortexops
cd cortexops/sdk
pip install -e ".[dev]"
pytest tests/ -v

See CONTRIBUTING.md. Issues labeled good first issue are a great place to start.


Citation

@software{cortexops2025,
  author  = {Ashish, et al.},
  title   = {CortexOps: Reliability Infrastructure for AI Agents},
  year    = {2025},
  url     = {https://github.com/ashishodu2023/cortexops},
}

License

MIT — see LICENSE.


cortexops.ai · Issues · Discussions
