
CortexOps

Reliability infrastructure for AI agents.
Evaluate · Observe · Operate — for LangGraph, CrewAI, and AutoGen.

PyPI version · Python 3.10+ · CI · License: MIT


The problem

You deployed an agent. You have no idea if it regressed overnight.

No standard eval format. No failure traces. No CI gate before the next prompt change ships.
CortexOps fixes that.


Install

pip install cortexops

# With the HTTP client (for pushing traces to the hosted API).
# Quote the extras so brackets survive shells like zsh:
pip install "cortexops[http]"

# With LLM-judge support:
pip install "cortexops[llm]"

Quickstart

from cortexops import CortexTracer, EvalSuite

# Wrap your LangGraph app — zero refactor required
tracer = CortexTracer(project="payments-agent")
graph  = tracer.wrap(your_langgraph_app)

# Run evaluations against a golden dataset
results = EvalSuite.run(
    dataset="golden_v1.yaml",
    agent=graph,
)
print(results.summary())

Golden dataset (YAML)

version: 1
project: payments-agent

cases:
  - id: refund_lookup_01
    input: "What is the status of refund REF-8821?"
    expected_tool_calls: [lookup_refund]
    expected_output_contains: ["approved", "REF-8821"]
    max_latency_ms: 3000

  - id: open_ended_explanation_01
    input: "Why was my refund rejected?"
    judge: llm
    judge_criteria: >
      The response must explain the rejection reason clearly,
      be empathetic, and offer a concrete next step. No jargon.
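If you want to lint a dataset in this shape before handing it to the runner, plain PyYAML is enough. A minimal sketch — `load_golden` is a hypothetical helper written for this example, not part of the CortexOps API, and the required fields mirror the cases above:

```python
import yaml  # third-party: pip install pyyaml

def load_golden(path: str) -> list[dict]:
    """Parse a golden-dataset file and sanity-check its required fields."""
    with open(path) as f:
        data = yaml.safe_load(f)
    if data.get("version") != 1:
        raise ValueError(f"unsupported dataset version: {data.get('version')}")
    cases = data["cases"]
    for case in cases:
        # Every case needs at least an id and an input prompt.
        if "id" not in case or "input" not in case:
            raise ValueError(f"malformed case: {case}")
    return cases
```

Running this in pre-commit catches malformed cases before they ever reach an eval run.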

CI gate

cortexops eval run \
  --dataset golden_v1.yaml \
  --fail-on "task_completion < 0.90"

Exits non-zero if the threshold is not met — blocks the PR.
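At its core, a gate like this just compares an aggregate score against a threshold and sets the exit code. A sketch of that logic — the expression grammar assumed here (`<metric> <op> <value>`) matches the example above but is not the documented CortexOps syntax:

```python
import operator

# Comparison operators the gate expression may use.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def gate_tripped(scores: dict[str, float], expr: str) -> bool:
    """Return True when the scores violate a --fail-on expression
    such as "task_completion < 0.90"."""
    metric, op, threshold = expr.split()
    return OPS[op](scores[metric], float(threshold))

# A run scoring 0.84 trips a 0.90 completion gate;
# the CLI would then exit non-zero and fail the CI job.
tripped = gate_tripped({"task_completion": 0.84}, "task_completion < 0.90")
```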


Built-in metrics

  • task_completion: non-empty, non-error output with expected content
  • tool_accuracy: expected tool calls were actually made
  • latency: response within the max_latency_ms budget
  • hallucination: fabrication signals in the output
  • llm_judge: GPT-4o scores the output against natural-language criteria
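As a rough illustration of what the simpler checks compute, here is what a tool_accuracy-style metric could look like — a hypothetical sketch for intuition, not the library's implementation:

```python
def tool_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tool calls the agent actually made
    (order-insensitive)."""
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    made = set(actual)
    return sum(tool in made for tool in expected) / len(expected)
```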



Download files

Download the file for your platform.

Source Distribution

cortexops-0.1.0.tar.gz (31.2 kB, source)

Built Distribution


cortexops-0.1.0-py3-none-any.whl (41.4 kB, Python 3)

File details

Details for the file cortexops-0.1.0.tar.gz.

File metadata

  • Download URL: cortexops-0.1.0.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for cortexops-0.1.0.tar.gz
  • SHA256: beda30ec8b28123fae2035b395222c7585d6b1b01c9cabe02f6cecc9e4ca1591
  • MD5: ccd02fb9fdcba8535560137870c53a29
  • BLAKE2b-256: b4417505c5304425b1257c998e78cdf78c5c06c1966ca32cae4b9defd7ffb65f


File details

Details for the file cortexops-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: cortexops-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 41.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for cortexops-0.1.0-py3-none-any.whl
  • SHA256: c951e82b61d990d51aa25340c0c50501d4ff026b26a3099cd5583a67d7e2f299
  • MD5: f04aecceaff730e73c1b03c13b760bbf
  • BLAKE2b-256: 72efc4cdbcc93a801e8c4b60c35e2cdad20d0c70ce200a51bee39feab7ded804

