Reliability infrastructure for AI agents — evaluation, observability, and regression testing

CortexOps

Reliability infrastructure for AI agents.
Evaluate · Observe · Operate — for LangGraph, CrewAI, and AutoGen.

What's New in v0.4.0

LLM-as-judge evaluation

from cortexops.judge import LLMJudge

judge = LLMJudge(api_key="sk-...")
result = judge.evaluate(
    case_id="case-001",
    input="Process refund for order #4821",
    output="Refund of $49.99 approved and processed.",
    rubric="task_completion",
)
print(result.score, result.passed, result.reasoning)

Golden dataset API

from cortexops.dataset import GoldenDataset

ds = GoldenDataset(name="refund-agent-v1")
ds.add(input="Refund order #4821", expected="refund_approved")
ds.add(input="Cancel subscription", expected="subscription_cancelled")
ds.save("datasets/refund_agent.yaml")

results = ds.run(agent=your_agent, fail_on="task_completion < 0.90")

CI/CD eval gate

cortexops eval run \
  --dataset datasets/refund_agent.yaml \
  --judge \
  --fail-on "task_completion < 0.90"
# Exit code 1 if regression detected — drop into GitHub Actions

The problem

You deployed an agent. You have no idea if it regressed overnight.

No standard eval format. No failure traces. No CI gate before the next prompt change ships.
CortexOps fixes that.

Quickstart

pip install cortexops  # v0.4.0

from cortexops import CortexTracer, EvalSuite

# Wrap your LangGraph app — zero refactor required
tracer = CortexTracer(project="payments-agent")
graph  = tracer.wrap(your_langgraph_app)

# Run evaluations against a golden dataset
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
)

print(results.summary())
# CortexOps eval — payments-agent
#   Cases           : 9  (7 passed, 2 failed)
#   Task completion : 91.4%
#   Tool accuracy   : 97.0/100
#   Latency p50/p95 : 42ms / 187ms
#   Failed cases:
#     - escalation_router: tool_call_mismatch (score 41)
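The summary above reports latency as p50/p95. This README does not say how CortexOps computes those percentiles, so the following is only a sketch of one common approach (nearest-rank). The helper name `percentile` and the sample latencies are assumptions made for illustration, not part of the library.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Hypothetical per-case latencies in milliseconds (not taken from a real run).
latencies_ms = [38, 40, 41, 42, 42, 45, 51, 60, 187]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"Latency p50/p95 : {p50}ms / {p95}ms")
```

With only nine cases, p95 under nearest-rank is simply the slowest case, which is worth keeping in mind when reading tail latency on a small golden dataset.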

Golden dataset format

Define test cases in YAML. Run them locally or in CI.

# golden_v1.yaml
version: 1
project: payments-agent

cases:
  - id: refund_lookup_01
    input: "What is the status of refund REF-8821?"
    expected_tool_calls: [lookup_refund]
    expected_output_contains: ["approved", "REF-8821"]
    max_latency_ms: 3000

  - id: dispute_escalation_01
    input: "I was charged twice — this is unauthorized"
    expected_tool_calls: [classify_dispute, route_escalation]
    expected_output_contains: ["escalated"]
    max_latency_ms: 5000
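To make the fields above concrete, here is a minimal sketch of how a single case might be checked against one agent run. It is not the CortexOps implementation: the function `check_case`, the failure labels other than `tool_call_mismatch`, and the matching rules (every expected tool must appear, substrings compared case-insensitively) are assumptions.

```python
def check_case(case, tool_calls, output, latency_ms):
    """Return a list of failure reasons; an empty list means the case passed."""
    failures = []

    # Assumption: every expected tool must appear among the calls the agent made.
    expected_tools = case.get("expected_tool_calls", [])
    missing = [tool for tool in expected_tools if tool not in tool_calls]
    if missing:
        failures.append(f"tool_call_mismatch: missing {missing}")

    # Assumption: substring matching is case-insensitive.
    for needle in case.get("expected_output_contains", []):
        if needle.lower() not in output.lower():
            failures.append(f"output_missing: {needle!r}")

    budget = case.get("max_latency_ms")
    if budget is not None and latency_ms > budget:
        failures.append(f"latency_exceeded: {latency_ms}ms > {budget}ms")

    return failures


# The first case from golden_v1.yaml, written as the dict a YAML loader would produce.
case = {
    "id": "refund_lookup_01",
    "input": "What is the status of refund REF-8821?",
    "expected_tool_calls": ["lookup_refund"],
    "expected_output_contains": ["approved", "REF-8821"],
    "max_latency_ms": 3000,
}

ok = check_case(case, ["lookup_refund"], "Refund REF-8821 was approved.", 420)
bad = check_case(case, [], "I could not find that refund.", 3500)
print(ok)
print(bad)
```

Returning a list of reasons rather than a single boolean keeps each failed expectation visible in the report.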

CI eval gate

Add to .github/workflows/eval.yml:

- name: CortexOps eval gate
  run: |
    python examples/langgraph_payments/run_eval.py \
      --dataset golden_v1.yaml \
      --fail-on "task_completion < 0.90"

If task completion drops below the threshold, the job exits non-zero and the workflow check fails. The PR is blocked when that check is required by branch protection.
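The gate comes down to evaluating a threshold expression and mapping the result to a process exit code. The sketch below shows one way that could work for expressions shaped like `task_completion < 0.90`. The helpers `should_fail` and `exit_code` and the supported operators are assumptions for illustration; the README does not describe how CortexOps parses `--fail-on`.

```python
import operator
import re

# Comparison operators this sketch understands.
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
# Matches "<metric> <op> <number>", e.g. "task_completion < 0.90".
_PATTERN = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([0-9]*\.?[0-9]+)\s*$")


def should_fail(expression, metrics):
    """Return True when the named metric satisfies the failure condition."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported fail-on expression: {expression!r}")
    name, op, threshold = match.group(1), match.group(2), float(match.group(3))
    if name not in metrics:
        raise KeyError(f"unknown metric: {name}")
    return _OPS[op](metrics[name], threshold)


def exit_code(expression, metrics):
    """Map the gate result to a process exit code: 1 on failure, 0 otherwise."""
    return 1 if should_fail(expression, metrics) else 0


# Hypothetical metric values for a passing and a regressed run.
print(exit_code("task_completion < 0.90", {"task_completion": 0.914}))
print(exit_code("task_completion < 0.90", {"task_completion": 0.85}))
```

A script would pass the result to `sys.exit()` so that the CI runner sees the non-zero status.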

Repo structure

cortexops/
├── sdk/                        # pip install cortexops (v0.4.0)
│   ├── cortexops/
│   │   ├── tracer.py           # CortexTracer — wraps LangGraph / CrewAI
│   │   ├── eval.py             # EvalSuite — golden dataset runner
│   │   ├── metrics.py          # task_completion, tool_accuracy, latency, hallucination
│   │   ├── models.py           # Pydantic data models
│   │   └── client.py           # HTTP client for hosted API
│   └── tests/
├── backend/                    # FastAPI + Celery + SQLite/Postgres
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/            # /v1/evals, /v1/traces
│   │   ├── models/             # DB records + API schemas
│   │   └── worker/             # Celery async eval tasks
│   └── Dockerfile
├── frontend/                   # React + TypeScript dashboard
├── examples/
│   └── langgraph_payments/     # Full runnable demo
│       ├── agent.py
│       ├── golden_v1.yaml
│       └── run_eval.py
└── docker-compose.yml

Run the full stack locally

git clone https://github.com/ashishodu2023/cortexops
cd cortexops

# Start API + worker + Redis
docker compose up --build

# In another terminal — run the demo eval
cd examples/langgraph_payments
pip install -e ../../sdk/
python run_eval.py

# API docs at http://localhost:8000/docs
# Dashboard at http://localhost:3000

Supported frameworks

Framework	Status
LangGraph	Stable
CrewAI	Stable
AutoGen	Beta
LlamaIndex agents	Coming soon
Custom callables	Supported via `CortexTracer.wrap()`

Built-in metrics

Metric	What it checks
`task_completion`	Agent produced a valid, non-error output
`tool_accuracy`	Expected tool calls were actually made
`latency`	Response within `max_latency_ms` budget
`hallucination`	Detects fabrication signals in output

Add custom metrics by subclassing cortexops.Metric.
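This README does not show the `cortexops.Metric` interface, so the sketch below uses a local stand-in base class to illustrate the subclassing pattern. The `name` attribute, the `score(case, output)` signature, and the `OutputLengthMetric` example are all assumptions; check the SDK source for the real method names before adapting it.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Stand-in for cortexops.Metric; the real base class may look different."""

    name = "metric"

    @abstractmethod
    def score(self, case, output):
        """Return a score between 0.0 and 1.0 for one case."""


class OutputLengthMetric(Metric):
    """Hypothetical metric: full score when the output stays within a character budget."""

    name = "output_length"

    def __init__(self, max_chars=500):
        self.max_chars = max_chars

    def score(self, case, output):
        return 1.0 if len(output) <= self.max_chars else 0.0


metric = OutputLengthMetric(max_chars=40)
print(metric.name, metric.score({"id": "refund_lookup_01"}, "Refund REF-8821 was approved."))
```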

Contributing

git clone https://github.com/ashishodu2023/cortexops
cd cortexops/sdk
pip install -e ".[dev]"
pytest tests/ -v

See CONTRIBUTING.md. Issues labeled "good first issue" are a good place to start.

Citation

@software{cortexops2025,
  author  = {Ashish and others},
  title   = {CortexOps: Reliability Infrastructure for AI Agents},
  year    = {2025},
  url     = {https://github.com/ashishodu2023/cortexops},
}

License

MIT — see LICENSE.

cortexops.ai · Issues · Discussions

Release history

Version	Release date
0.4.0 (this version)	May 25, 2026
0.3.0	Apr 18, 2026
0.2.0	Apr 11, 2026
0.1.0	Apr 3, 2026

Download files

Source Distribution

cortexops-0.4.0.tar.gz (33.0 kB)

Uploaded May 25, 2026 Source

Built Distribution

cortexops-0.4.0-py3-none-any.whl (34.9 kB)

Uploaded May 25, 2026 Python 3

File details

Details for the file cortexops-0.4.0.tar.gz.

File metadata

Download URL: cortexops-0.4.0.tar.gz
Upload date: May 25, 2026
Size: 33.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for cortexops-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`d4ac9963a23d46ed59fcac751ae288bfde7aede0fa4ba0714b66a70bae71149a`
MD5	`5b6dd07ff5ce6f7fd75bf3ef01ac8924`
BLAKE2b-256	`322f6352e9967bf899efac521982c73fafc2ca7cf1b55881eee33b90a8d165cf`

File details

Details for the file cortexops-0.4.0-py3-none-any.whl.

File metadata

Download URL: cortexops-0.4.0-py3-none-any.whl
Upload date: May 25, 2026
Size: 34.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for cortexops-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd79eff52ff59de15e51cfabfb3fd08d1ad4ec1b3e575e97dbc33a320b27a4e6`
MD5	`4bb5569c1c358d44d2da2b2f79fa4ebc`
BLAKE2b-256	`3db933de60a397ed128c9a732f0814b0398d36bd3456ed70406f8579204df4b3`
