AgentQualify

Open-source Python toolkit for evaluating AI agents using AWS Bedrock AgentCore Evaluations. Run evaluations, compare models, gate CI/CD pipelines, and visualize results in CloudWatch dashboards — all from a single YAML config.

Why AgentQualify?

The problem without it

Teams building agents on AgentCore today face a painful integration gap. The Evaluate API exists, but turning it into a repeatable development workflow requires solving every piece yourself:

  • Manual glue code for every project. You need to write scripts that invoke your agent, wait for CloudWatch to ingest spans, query the right log groups, parse OTEL traces, construct the evaluationReferenceInputs payload, call the Evaluate API per evaluator, and aggregate the results. That's hundreds of lines of boilerplate before you've evaluated a single test case.
  • No standard way to define test suites. Without a shared format, every team invents their own way to store test prompts, expected responses, and assertions. These ad-hoc scripts are fragile, hard to review in PRs, and impossible to share across teams.
  • No CI/CD integration out of the box. The Evaluate API returns scores, but it doesn't tell your pipeline to pass or fail. You need to build threshold checking, regression detection against a baseline, and exit code logic yourself — and get it right for every agent.
  • Model comparison requires duplicated effort. Evaluating the same agent across Claude Sonnet, Haiku, and Nova Pro means tripling your invocation and evaluation code, then manually aligning results for comparison.
  • Ground truth is hard to wire up correctly. The evaluationReferenceInputs schema has different scoping rules for expectedResponse (trace-level), assertions (session-level), and expectedTrajectory (session-level). Getting the payload structure wrong means silent evaluation failures or misleading scores.
  • No regression tracking. Without a baseline stored somewhere and compared automatically, you can't answer "did this prompt change make the agent worse?" — the most common question in agent development.
  • Reporting is an afterthought. Scores come back as raw JSON. Building CloudWatch dashboards with model comparisons, per-test breakdowns, and trend lines is a separate project entirely.

The result: most teams either skip automated evaluation entirely or build fragile one-off scripts that break on the next agent update. Quality becomes a manual spot-check instead of a continuous signal.

What AgentQualify gives you

AgentQualify wraps that entire lifecycle into a single pip install and one YAML file:

  • Go from zero to CI/CD quality gate in minutes. Define your test cases, set score thresholds, and run agentqualify run. The CLI exits 0 on pass, 1 on failure — plug it into any pipeline.
  • Compare models without changing agent code. Evaluate the same agent backed by Claude Sonnet, Haiku, Nova Pro, or any Bedrock model side-by-side. Switch models via endpoint qualifiers, payload overrides, or separate runtime ARNs — all config-driven.
  • Catch regressions automatically. Store baselines in S3 and fail the build when scores drop beyond a configurable threshold. No more "the agent feels worse" — you'll have numbers.
  • Ground truth support built in. Define expected responses, assertions, and tool trajectories directly in your test suite. AgentQualify handles the complex scoping rules — expectedResponse is automatically scoped to the correct trace using trace IDs extracted from CloudWatch spans, while assertions and expectedTrajectory are scoped to the session level. No manual payload construction needed.
  • Resilient span handling. AgentQualify fetches spans and events from CloudWatch, deduplicates across log groups, and automatically filters out incomplete spans (missing log events) so a single delayed span doesn't block your entire evaluation.
  • Framework-agnostic. Works with any agent deployed on AgentCore Runtime regardless of framework — Strands, LangGraph, CrewAI, or custom. If it emits OpenTelemetry traces, AgentQualify can evaluate it.
  • Designed for open source. MIT-0 licensed, minimal dependencies, no vendor lock-in beyond the AWS APIs you're already using.

Features

  • YAML-driven — one config file controls everything: agent, models, evaluators, thresholds, baseline, reporting
  • Model comparison — evaluate the same agent backed by different Bedrock models side-by-side
  • 16 built-in evaluators — Helpfulness, Correctness, GoalSuccessRate, three Trajectory matchers, ToolSelectionAccuracy, and more
  • Ground truth — expected responses (trace-level), natural-language assertions (session-level), and expected tool trajectories (session-level) — scoping handled automatically
  • Resilient evaluation — automatic span deduplication, incomplete span filtering, and detailed error reporting
  • Custom evaluators — reference your own AgentCore custom evaluator ARNs
  • CI/CD gate — threshold checks + regression checks vs S3 baseline; exits 0/1
  • CloudWatch dashboard — auto-generated detailed dashboard with scores, trends, token usage, per-test breakdowns
  • CLI + Python SDK — use from the command line or import in your own code

Architecture

YAML Config + Test Suite
        │
        ▼
  AgentQualify Core
  ┌──────────────────────────────────────────────────────────┐
  │  Agent Invoker (SigV4 or OAuth)                          │
  │  └─► invoke_agent_runtime() per model variant            │
  │       │                                                  │
  │       ▼                                                  │
  │  Span Collector                                          │
  │  ├── Fetch from aws/spans + runtime log group            │
  │  ├── Deduplicate by (traceId, spanId)                    │
  │  ├── Filter incomplete spans (missing log events)        │
  │  └── Extract trace IDs for ground truth scoping          │
  │       │                                                  │
  │       ▼                                                  │
  │  Evaluation Runner ──► AgentCore Evaluate API (boto3)    │
  │  └── Per-evaluator ground truth scoping:                 │
  │      ├── expectedResponse → trace-level (with traceId)   │
  │      ├── assertions → session-level                      │
  │      └── expectedTrajectory → session-level              │
  │       │                                                  │
  │       ▼                                                  │
  │  Results Aggregator                                      │
  │  ├── CloudWatch Metrics Publisher (put_metric_data)      │
  │  ├── Dashboard Generator (put_dashboard)                 │
  │  ├── CI/CD Gate (threshold + regression)                 │
  │  └── Baseline Manager (S3)                               │
  └──────────────────────────────────────────────────────────┘
        │
        ▼
  Exit 0 (pass) / Exit 1 (fail)

Installation

pip install agentqualify

Quickstart

1. Create your config file (agentqualify.yaml):

agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  region: "us-east-1"

models:
  strategy: "qualifier"
  variants:
    - name: "claude-sonnet"
      qualifier: "sonnet-endpoint"
    - name: "claude-haiku"
      qualifier: "haiku-endpoint"

evaluators:
  - Builtin.Helpfulness
  - Builtin.Correctness
  - Builtin.GoalSuccessRate

input:
  mode: "test_suite"
  test_suite: "agent_tests.yaml"

thresholds:
  Builtin.Helpfulness: 0.70
  Builtin.Correctness: 0.75

baseline:
  enabled: true
  s3_uri: "s3://my-bucket/agentqualify/baseline.json"
  max_regression: 0.05

reporting:
  cloudwatch:
    enabled: true
    namespace: "AgentQualify"
    dashboard_name: "AgentQualify-MyAgent"

2. Create your test suite (agent_tests.yaml):

tests:
  - name: "weather_query"
    prompt: "What's the weather in Seattle?"

  - name: "factual_check"
    prompt: "What is the capital of France?"
    expected_response: "The capital of France is Paris."

  # Multi-turn with per-turn expected responses
  - name: "multi_turn"
    turns:
      - prompt: "What is 15 + 27?"
        expected_response: "15 + 27 = 42"
      - prompt: "What's the weather?"
        expected_response: "The weather is sunny"
    assertions:
      - "Agent used the calculator tool for the math question"
    expected_trajectory: ["calculator", "weather"]

Ground truth fields (expected_response, assertions, expected_trajectory) are all optional. Omit them to run evaluations without ground truth — the framework is fully backward compatible.

3. Run:

agentqualify run --config agentqualify.yaml

CLI Reference

agentqualify run --config agentqualify.yaml                        # Run evaluations + CI/CD gate
agentqualify run --config agentqualify.yaml -o results.json        # Run and save results to JSON
agentqualify run --config agentqualify.yaml --update-baseline      # Run, check gate, and save baseline to S3 on pass
agentqualify baseline update --config agentqualify.yaml            # Run and save results as baseline (unconditional)
agentqualify baseline show --config agentqualify.yaml              # Print current S3 baseline
agentqualify list-evaluators                                       # List all built-in evaluators

Python SDK

from agentqualify import AgentQualify
import sys

result = AgentQualify("agentqualify.yaml").run()
print(result.summary())
# Update the baseline programmatically (e.g., after a passing run on main)
if result.passed:
    AgentQualify("agentqualify.yaml").update_baseline()

sys.exit(0 if result.passed else 1)

Config Reference

agent

| Field | Type | Required | Description |
|---|---|---|---|
| runtime_arn | string | ✓ | AgentCore runtime ARN |
| agent_id | string | | Agent ID for span lookup — derived from runtime_arn if omitted |
| region | string | | AWS region (default: us-east-1) |
| span_wait_seconds | int | | Wait after invocation for spans (default: 180) |
| auth.type | sigv4 / oauth | | Authorization method (default: sigv4) |
| auth.oauth.token_url | string | when oauth | OAuth2 token endpoint URL |
| auth.oauth.client_id | string | when oauth | OAuth2 client ID (supports ${ENV_VAR}) |
| auth.oauth.client_secret | string | when oauth | OAuth2 client secret (supports ${ENV_VAR}) |
| auth.oauth.scopes | list of strings | | OAuth2 scopes to request (optional) |

Authentication

AgentQualify supports two inbound authorization methods for invoking your agent runtime.

IAM SigV4 (default)

The default. Uses standard AWS credentials (environment variables, IAM role, SSO profile) via the boto3 SDK. No extra config needed — just make sure your IAM role has bedrock-agentcore:InvokeAgentRuntime permission.

agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  agent_id: "my-agent-id"
  # auth.type defaults to "sigv4" — no auth block needed

OAuth (JWT Bearer Token)

When your agent runtime is configured with a JWT inbound authorizer, the boto3 SDK (SigV4) cannot be used. AgentQualify handles this by automatically fetching an access token using the OAuth2 client credentials grant and attaching it as a Bearer token on each HTTPS invocation.

Tokens are cached in memory and refreshed automatically before they expire, so long-running evaluation suites work without interruption.

agent:
  runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/my-agent"
  agent_id: "my-agent-id"
  auth:
    type: "oauth"
    oauth:
      token_url: "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_XXXXX/oauth2/token"
      client_id: "${OAUTH_CLIENT_ID}"
      client_secret: "${OAUTH_CLIENT_SECRET}"
      scopes:
        - "agentcore/invoke"

Set the environment variables before running:

export OAUTH_CLIENT_ID="your-client-id"
export OAUTH_CLIENT_SECRET="your-client-secret"
agentqualify run --config agentqualify.yaml

All auth.oauth string fields support ${ENV_VAR} syntax so secrets stay out of your YAML files. You can also inline values directly if preferred (e.g., in a CI/CD secret-injected config).

Note: An AgentCore Runtime supports either IAM SigV4 or JWT Bearer Token inbound auth, not both simultaneously. Make sure auth.type matches how your runtime is configured.
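
For reference, the token exchange and caching that OAuth mode performs amount to roughly the following (an illustrative sketch using only the Python standard library; AgentQualify's actual implementation may differ):

import json
import time
import urllib.parse
import urllib.request

_cache = {"token": None, "expires_at": 0.0}

def get_bearer_token(token_url, client_id, client_secret, scopes):
    """Client credentials grant with in-memory caching (illustrative only)."""
    # Reuse the cached token until shortly before it expires
    if _cache["token"] and time.time() < _cache["expires_at"] - 60:
        return _cache["token"]

    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": " ".join(scopes),
    }).encode()
    request = urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read())

    _cache["token"] = payload["access_token"]
    _cache["expires_at"] = time.time() + float(payload.get("expires_in", 3600))
    return _cache["token"]

# The returned token is then sent as "Authorization: Bearer <token>" on each
# HTTPS invocation of the agent runtime.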

models

The strategy field controls how AgentQualify switches between model variants during invocation. Choose based on how your agent is deployed.

Strategy: qualifier — one deployment, multiple endpoints

When to use: Your agent is deployed once on AgentCore Runtime with multiple endpoint qualifiers, each configured to use a different Bedrock model. This is the most common setup for model comparison — one codebase, one deployment, different model backends.

models:
  strategy: "qualifier"
  variants:
    - name: "claude-sonnet-4-5"
      qualifier: "sonnet-endpoint"    # endpoint configured with Claude Sonnet 4.5
    - name: "claude-haiku-3-5"
      qualifier: "haiku-endpoint"     # endpoint configured with Claude Haiku 3.5
    - name: "amazon-nova-pro"
      qualifier: "nova-pro-endpoint"  # endpoint configured with Amazon Nova Pro

Best for: "I want to find the best quality/cost tradeoff for my agent without changing any code."

See full example: examples/strategy_qualifier.yaml


Strategy: payload — model ID passed in the request

When to use: Your agent reads the model ID (or other model config like temperature) from the invocation payload and selects the Bedrock model dynamically at runtime. Use this when you've built a model-agnostic agent or want to test different inference parameters without redeploying.

models:
  strategy: "payload"
  variants:
    - name: "claude-sonnet-default-temp"
      payload_override:
        model_id: "anthropic.claude-sonnet-4-5-20250929-v1:0"
        temperature: 0.7
    - name: "claude-sonnet-low-temp"
      payload_override:
        model_id: "anthropic.claude-sonnet-4-5-20250929-v1:0"
        temperature: 0.1            # more deterministic
    - name: "nova-pro"
      payload_override:
        model_id: "amazon.nova-pro-v1:0"
        temperature: 0.7

The payload_override fields are merged into every test case's payload before invocation.
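
For example, a test case whose base payload contains only the prompt would be invoked with a merged payload like this (a sketch; the actual base payload keys depend on how your agent expects its input):

# Hypothetical base payload for one test case
base_payload = {"prompt": "What's the weather in Seattle?"}

payload_override = {"model_id": "amazon.nova-pro-v1:0", "temperature": 0.7}

# The variant's payload_override fields win on key collisions
merged = {**base_payload, **payload_override}
# merged == {"prompt": "What's the weather in Seattle?",
#            "model_id": "amazon.nova-pro-v1:0", "temperature": 0.7}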

Best for: "My agent accepts model_id in the payload" or "I want to compare different temperature settings on the same model."

See full example: examples/strategy_payload.yaml


Strategy: separate_runtimes — independently deployed agents

When to use: Each variant is a completely separate AgentCore Runtime with its own ARN. Use this when different variants have different agent code, tool sets, or system prompts that can't be toggled via a qualifier or payload — or when you're comparing independently built agents (e.g., v1 vs v2, LangGraph vs Strands).

models:
  strategy: "separate_runtimes"
  variants:
    - name: "agent-v1-production"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-v1"
    - name: "agent-v2-candidate"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-v2"
    - name: "agent-langgraph-experiment"
      runtime_arn: "arn:aws:bedrock-agentcore:us-east-1:123456789012:agent-runtime/agent-langgraph"

Best for: "I rewrote my agent with a new tool set and want to compare v1 vs v2 before promoting to production" or "Two teams built separate agents and I want to benchmark them on the same test suite."

See full example: examples/strategy_separate_runtimes.yaml


Strategy decision guide

| Scenario | Strategy |
|---|---|
| Same agent code, different Bedrock models | qualifier |
| Same agent code, different inference params (temperature, max_tokens) | payload |
| Agent reads model ID from request payload | payload |
| Different agent versions (v1 vs v2) | separate_runtimes |
| Different agent frameworks (Strands vs LangGraph) | separate_runtimes |
| Production vs canary agent comparison | separate_runtimes |
| A/B test between two live deployments | separate_runtimes |

| Field | Type | Description |
|---|---|---|
| strategy | qualifier / payload / separate_runtimes | How to switch models |
| variants[].name | string | Display name for this model variant |
| variants[].qualifier | string | Endpoint qualifier (strategy: qualifier) |
| variants[].payload_override | dict | Extra payload fields merged into every test (strategy: payload) |
| variants[].runtime_arn | string | Separate runtime ARN (strategy: separate_runtimes) |

evaluators

List of evaluator IDs. Use Builtin.<Name> for built-in evaluators or a full ARN for custom evaluators.

Built-in evaluators:

| ID | Level | Ground truth field | Description |
|---|---|---|---|
| Builtin.Correctness | trace | expectedResponse | Factual accuracy compared against expected answer (LLM-as-Judge). Works without ground truth too. |
| Builtin.GoalSuccessRate | session | assertions | Whether agent behavior satisfies natural-language assertions (LLM-as-Judge) |
| Builtin.TrajectoryExactOrderMatch | session | expectedTrajectory | Actual tool sequence must match expected exactly — same tools, same order, no extras |
| Builtin.TrajectoryInOrderMatch | session | expectedTrajectory | Expected tools must appear in order, extra tools allowed between them |
| Builtin.TrajectoryAnyOrderMatch | session | expectedTrajectory | All expected tools must be present, order doesn't matter, extras allowed |
| Builtin.Helpfulness | trace | | How effectively the response helps users progress toward their goals |
| Builtin.Coherence | trace | | Logical consistency of the response |
| Builtin.Conciseness | trace | | Efficiency of information delivery |
| Builtin.Faithfulness | trace | | Consistency with conversation history |
| Builtin.Harmfulness | trace | | Detection of harmful content |
| Builtin.InstructionFollowing | trace | | Adherence to explicit instructions |
| Builtin.Refusal | trace | | Detection of declined requests |
| Builtin.ResponseRelevance | trace | | How well the response addresses the request |
| Builtin.Stereotyping | trace | | Detection of bias and stereotypical content |
| Builtin.ToolSelectionAccuracy | tool | | Whether the appropriate tool was chosen |
| Builtin.ToolParameterAccuracy | tool | | Whether tool parameters are correct |

The first 5 evaluators accept ground truth. The remaining 11 evaluate based on conversation context alone — they never receive ground truth fields and are unaffected by their presence in your test suite.

input

| Field | Type | Description |
|---|---|---|
| mode | test_suite / sessions / both | Input source |
| test_suite | string | Path to test suite YAML file |
| sessions[].session_id | string | Existing session ID to evaluate |
| sessions[].name | string | Display name for this session |

Ground truth (test suite)

Ground truth fields are defined per test case in your test suite YAML. All fields are optional — omit them entirely to run evaluations without ground truth (backward compatible).

| Field | Type | Scope | Used by | Description |
|---|---|---|---|---|
| expected_response | string | trace | Builtin.Correctness | Reference answer to compare against the agent's response (single-turn or last turn) |
| turns[].expected_response | string | trace | Builtin.Correctness | Per-turn reference answer for multi-turn tests |
| assertions | list of strings | session | Builtin.GoalSuccessRate | Natural-language conditions the session must satisfy |
| expected_trajectory | list of strings | session | TrajectoryExactOrderMatch, TrajectoryInOrderMatch, TrajectoryAnyOrderMatch | Ordered list of expected tool names |

How ground truth scoping works

The AgentCore Evaluate API enforces strict scoping rules for ground truth fields:

  • Session-level (assertions, expectedTrajectory) — scoped by sessionId. These apply to the entire conversation and are sent to session-level evaluators like GoalSuccessRate and the trajectory matchers.
  • Trace-level (expectedResponse) — scoped by traceId. This applies to a specific turn in the conversation and is sent to Builtin.Correctness.

AgentQualify handles this automatically:

  1. Trace ID extraction — after fetching spans from CloudWatch, the framework extracts trace IDs from root spans (one per turn) in chronological order.
  2. Automatic scoping — expectedResponse is attached to the last trace ID (matching the convention that a single expected response applies to the final agent reply). assertions and expectedTrajectory are attached to the session ID.
  3. Per-evaluator routing — ground truth reference inputs are only sent to evaluators that support them. Evaluators like Builtin.Helpfulness never receive ground truth fields, so they're unaffected.
  4. Graceful fallback — if no trace IDs can be extracted (e.g., spans haven't propagated yet), expectedResponse is silently omitted and Builtin.Correctness runs in its ground-truth-free mode.

Evaluators that receive ground truth fields they don't use will report them in ignoredReferenceInputFields — this is informational, not an error.
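
Conceptually, the per-field scoping decision boils down to something like this (a purely illustrative sketch — the function and tuple shape are hypothetical stand-ins, not the Evaluate API's actual evaluationReferenceInputs schema):

def scope_ground_truth(test_case, session_id, trace_ids):
    """Illustrative only: decide which ID each ground truth field is scoped to."""
    scoped = []
    # Trace-level: attach the expected response to the last trace (final agent reply).
    # If no trace IDs could be extracted, it is simply omitted (graceful fallback).
    if test_case.get("expected_response") and trace_ids:
        scoped.append(("expectedResponse", "trace", trace_ids[-1],
                       test_case["expected_response"]))
    # Session-level: assertions and the expected trajectory apply to the whole session
    if test_case.get("assertions"):
        scoped.append(("assertions", "session", session_id, test_case["assertions"]))
    if test_case.get("expected_trajectory"):
        scoped.append(("expectedTrajectory", "session", session_id,
                       test_case["expected_trajectory"]))
    return scoped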

Single-turn with ground truth:

tests:
  - name: "balance_check"
    prompt: "What is my checking account balance?"
    expected_response: "Your checking account balance is $5,420.50 as of today."
    assertions:
      - "Agent called check_balance before get_transaction_history"
      - "Response did not expose full account numbers"
    expected_trajectory: ["check_balance", "get_transaction_history"]

Multi-turn with per-turn expected responses (using turns format):

tests:
  - name: "math_then_weather"
    turns:
      - prompt: "What is 15 + 27?"
        expected_response: "15 + 27 = 42"
      - prompt: "What's the weather?"
        expected_response: "The weather is sunny"
    expected_trajectory: ["calculator", "weather"]

Multi-turn with prompts list (legacy format):

tests:
  - name: "booking_flow"
    prompts:
      - "Find flights from SEA to NYC"
      - "Book the cheapest one"
    expected_response: "Booked flight DL420 for $290"  # applies to last turn
    assertions:
      - "Agent called search_flights before book_flight"
    expected_trajectory: ["search_flights", "book_flight"]

Evaluation pipeline

When you run agentqualify run, the evaluation phase works as follows:

1. Span collection — For each session, AgentQualify queries two CloudWatch log groups:

  • aws/spans — span metadata (trace structure, timing, attributes)
  • /aws/bedrock-agentcore/runtimes/{agent_id}-DEFAULT — span events (input/output payloads)

Both are required by the Evaluate API. The framework merges results from both log groups.

2. Deduplication — Spans that appear in both log groups are deduplicated by (traceId, spanId, record type) so the same record is never sent twice.

3. Incomplete span filtering — The Evaluate API rejects the entire evaluation if any span with a supported scope (strands.telemetry.tracer, opentelemetry.instrumentation.langchain, openinference.instrumentation.langchain) is missing its corresponding event record. AgentQualify detects these orphaned spans and removes them, so the remaining complete data can still be evaluated. You'll see a warning in the logs:

WARNING Filtered 2 incomplete span(s) missing log events (572 → 570)

4. Trace ID extraction — Root spans (no parentSpanId) are identified and their traceId values extracted in chronological order. These are used to scope expectedResponse ground truth to the correct trace.

5. Evaluate API calls — Each evaluator is called separately via boto3. Ground truth reference inputs are constructed per-evaluator with correct scoping, and only attached to evaluators that support them.
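
The deduplication and trace-ID extraction steps are conceptually simple; here is a minimal sketch (field names follow OpenTelemetry span conventions and are assumptions about the record shape, not AgentQualify's actual internals):

def dedupe_spans(records):
    """Keep the first occurrence of each (traceId, spanId, record type)."""
    seen, unique = set(), []
    for record in records:
        key = (record.get("traceId"), record.get("spanId"), record.get("type"))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def extract_trace_ids(spans):
    """Return trace IDs of root spans (no parentSpanId), oldest first."""
    roots = [s for s in spans if not s.get("parentSpanId")]
    roots.sort(key=lambda s: s.get("startTimeUnixNano", 0))
    return [s["traceId"] for s in roots]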

thresholds

Map of evaluator_id → minimum_score. Evaluation fails if any score falls below its threshold.

baseline

| Field | Type | Description |
|---|---|---|
| enabled | bool | Enable regression checks — load baseline from S3 and fail the gate if scores drop more than max_regression. Does not affect saving. |
| s3_uri | string | S3 URI for baseline JSON (s3://bucket/path.json). Required for both regression checks and saving. |
| max_regression | float | Max allowed score drop (default: 0.05) |

baseline.enabled only controls whether regression checks run during agentqualify run. Saving the baseline is controlled separately via --update-baseline or agentqualify baseline update — both only require s3_uri to be set.
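
Taken together, the thresholds and baseline settings drive a gate decision along these lines (a sketch of the logic, not the actual implementation):

def gate_passes(scores, thresholds, baseline=None, max_regression=0.05):
    """Return True (exit 0) only if every threshold and regression check passes."""
    for evaluator, minimum in thresholds.items():
        if scores.get(evaluator, 0.0) < minimum:
            return False  # below the configured threshold
    if baseline:
        for evaluator, previous_score in baseline.items():
            if previous_score - scores.get(evaluator, 0.0) > max_regression:
                return False  # score dropped more than max_regression vs the baseline
    return True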

reporting.cloudwatch

| Field | Type | Description |
|---|---|---|
| enabled | bool | Publish metrics and create dashboard |
| namespace | string | CloudWatch namespace (default: AgentQualify) |
| dashboard_name | string | Dashboard name |

CI/CD Integration

GitHub Actions

See examples/github-actions.yml for a complete workflow.

Key pattern:

- name: Run evaluations
  run: agentqualify run --config agentqualify.yaml
  # Exits 0 on pass, 1 on threshold/regression failure

- name: Update baseline (main branch only)
  if: github.ref == 'refs/heads/main' && success()
  run: agentqualify baseline update --config agentqualify.yaml

AWS CodePipeline

Add a CodeBuild step with:

pip install agentqualify[toolkit]
agentqualify run --config agentqualify.yaml

The non-zero exit code on failure will automatically fail the pipeline stage.

IAM Permissions

The IAM role running AgentQualify needs:

{
  "Effect": "Allow",
  "Action": [
    "bedrock-agentcore:InvokeAgentRuntime",
    "bedrock-agentcore:Evaluate",
    "logs:StartQuery",
    "logs:GetQueryResults",
    "cloudwatch:PutMetricData",
    "cloudwatch:PutDashboard",
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "*"
}

OAuth note: When using auth.type: "oauth", agent invocation bypasses IAM SigV4 and uses a Bearer token instead, so bedrock-agentcore:InvokeAgentRuntime is not needed in your IAM policy. The remaining permissions (Evaluate, CloudWatch, S3, Logs) are still required for evaluations, reporting, and baseline management.

Development

git clone https://github.com/agentqualify/agentqualify
cd agentqualify
pip install -e ".[dev]"
pytest

Lint and security checks (also run in CI):

ruff check src/ tests/        # lint
ruff format --check src/ tests/ # format check
bandit -r src/ -c pyproject.toml # static security analysis
pip-audit                       # dependency vulnerability scan

License

MIT-0 — see LICENSE.
