ElasticDash AI test framework for Python

elasticdash_test

An AI-native test runner for ElasticDash workflow testing. Built for async AI pipelines — not a general-purpose test runner.

Trace-first: every test receives a ctx.trace to record and assert on LLM calls and tool invocations
Automatic interception for OpenAI, Gemini, and Grok via httpx/requests — no manual instrumentation required
AI-specific matchers: to_have_llm_step, to_call_tool, to_match_semantic_output, to_have_custom_step, to_have_prompt_where, to_evaluate_output_metric
Sequential execution, no parallelism overhead
No pytest dependency

Installation

pip install elasticdash_test

Requires Python 3.10+.

Quick Start

1. Write a test file (my_flow.ai_test.py):

from elasticdash_test import ai_test, expect

@ai_test("checkout flow")
async def test_checkout(ctx):
    await run_checkout(ctx)

    expect(ctx.trace).to_have_llm_step(model="gpt-4o", contains="order confirmed")
    expect(ctx.trace).to_call_tool("chargeCard")

2. Run it:

elasticdash test              # discover all *.ai_test.py files
elasticdash test ./ai_tests   # discover in a specific directory
elasticdash run my_flow.ai_test.py  # run a single file
elasticdash dashboard         # open workflows dashboard

3. Read the output:

  ✓ checkout flow (1.2s)
  ✗ refund flow (0.8s)
    → Expected tool "chargeCard" to be called, but no tool calls were recorded

1 passed
1 failed
Total: 2
Duration: 3.4s

Writing Tests

See the full guide in docs/test-writing-guidelines.md.

Decorators

Import from elasticdash_test and apply to functions — no global injection needed:

Decorator	Description
`@ai_test(name)`	Register a test
`@before_all`	Run once before all tests in the file
`@before_each`	Run before every test in the file
`@after_each`	Run after every test in the file (runs even if the test fails)
`@after_all`	Run once after all tests in the file
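
The hook order in the table can be pictured with a small stand-alone sketch. It does not import elasticdash_test: `run_file` and the hook lists are hypothetical names, and the real runner may differ in its details. What it illustrates is that wrapping each test in try/finally is one way to make the after_each hooks run even when a test fails.

```python
import asyncio


# Hypothetical stand-in for a runner's per-file loop, not the elasticdash_test implementation.
async def run_file(tests, before_all, before_each, after_each, after_all):
    results = {}
    for hook in before_all:
        hook()
    for name, test in tests:
        for hook in before_each:
            hook()
        try:
            await test()
            results[name] = "passed"
        except AssertionError as exc:
            results[name] = f"failed: {exc}"
        finally:
            # Runs whether the test passed or raised.
            for hook in after_each:
                hook()
    for hook in after_all:
        hook()
    return results


events = []


async def passing_test():
    events.append("test:ok")


async def failing_test():
    events.append("test:bad")
    assert False, "boom"


results = asyncio.run(
    run_file(
        tests=[("ok", passing_test), ("bad", failing_test)],
        before_all=[lambda: events.append("before_all")],
        before_each=[lambda: events.append("before_each")],
        after_each=[lambda: events.append("after_each")],
        after_all=[lambda: events.append("after_all")],
    )
)
print(events)
print(results)
```

In this sketch the failing test still gets its after_each call, and after_all runs once at the end.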

Test context

Each test function receives a ctx: AITestContext argument:

@ai_test("my test")
async def test_my_flow(ctx):
    # ctx.trace: record and inspect LLM steps and tool calls
    ...

Recording trace data

Automatic interception (recommended): Call install_ai_interceptor() once in @before_all and the runner patches httpx/requests to record LLM steps for OpenAI, Gemini, and Grok calls automatically. See Automatic AI Interception below.

Manual recording: Use this for providers not covered by the interceptor, when testing against stubs/mocks, or to capture custom workflow steps:

ctx.trace.record_llm_step(
    model="gpt-4o",
    prompt="What is the order status?",
    completion="The order has been confirmed.",
)

ctx.trace.record_tool_call(
    name="chargeCard",
    args={"amount": 99.99},
)

# Record custom workflow steps (RAG fetches, code/fixed steps, etc.)
ctx.trace.record_custom_step(
    kind="rag",              # 'rag' | 'code' | 'fixed' | 'custom'
    name="pokemon-search",
    tags=["sort:asc", "source:db"],
    payload={"query": "pikachu attack"},
    result={"ids": [25]},
    metadata={"latency_ms": 120},
)

Matchers

`to_have_llm_step(config?)`

Assert the trace contains at least one LLM step matching the given config. All fields are optional and combined with AND logic.

expect(ctx.trace).to_have_llm_step(model="gpt-4o")
expect(ctx.trace).to_have_llm_step(contains="order confirmed")        # searches prompt + completion
expect(ctx.trace).to_have_llm_step(prompt_contains="order status")    # searches prompt only
expect(ctx.trace).to_have_llm_step(output_contains="order confirmed") # searches completion only
expect(ctx.trace).to_have_llm_step(provider="openai")
expect(ctx.trace).to_have_llm_step(provider="openai", prompt_contains="order status")
expect(ctx.trace).to_have_llm_step(prompt_contains="retry", times=3)      # exactly 3 matching steps
expect(ctx.trace).to_have_llm_step(provider="openai", min_times=2)        # at least 2 matching steps
expect(ctx.trace).to_have_llm_step(output_contains="error", max_times=1)  # at most 1 matching step

Field	Description
`model`	Exact model name match (e.g. `'gpt-4o'`)
`contains`	Substring match across prompt + completion (case-insensitive)
`prompt_contains`	Substring match in prompt only (case-insensitive)
`output_contains`	Substring match in completion only (case-insensitive)
`provider`	Provider name: `'openai'`, `'gemini'`, or `'grok'`
`times`	Exact match count (fails unless exactly this many steps match)
`min_times`	Minimum match count (steps matching must be ≥ this value)
`max_times`	Maximum match count (steps matching must be ≤ this value)
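
To make the AND logic and the count options concrete, here is a minimal stand-alone sketch of how such a matcher could be evaluated. `LLMStep`, `step_matches`, `count_ok`, and `has_llm_step` are hypothetical helpers written for illustration, not part of elasticdash_test, and the sample steps are made up. The sketch assumes that with no count option at least one matching step is required.

```python
from dataclasses import dataclass


@dataclass
class LLMStep:
    model: str
    provider: str
    prompt: str
    completion: str


def step_matches(step, model=None, provider=None, contains=None,
                 prompt_contains=None, output_contains=None):
    """Return True only if every supplied filter matches (AND logic)."""
    prompt = step.prompt.lower()
    completion = step.completion.lower()
    return all([
        model is None or step.model == model,  # exact match
        provider is None or step.provider == provider,
        contains is None or contains.lower() in prompt + "\n" + completion,
        prompt_contains is None or prompt_contains.lower() in prompt,
        output_contains is None or output_contains.lower() in completion,
    ])


def count_ok(count, times=None, min_times=None, max_times=None):
    """Apply the count options to the number of matching steps."""
    if times is None and min_times is None and max_times is None:
        return count >= 1  # assumed default: at least one match
    if times is not None and count != times:
        return False
    if min_times is not None and count < min_times:
        return False
    if max_times is not None and count > max_times:
        return False
    return True


def has_llm_step(steps, times=None, min_times=None, max_times=None, **filters):
    count = sum(step_matches(step, **filters) for step in steps)
    return count_ok(count, times, min_times, max_times)


# Made-up sample data for the sketch.
steps = [
    LLMStep("gpt-4o", "openai", "What is the order status?", "The order has been confirmed."),
    LLMStep("gemini-pro", "gemini", "Retry the lookup", "error: timeout"),
]

print(has_llm_step(steps, model="gpt-4o"))
print(has_llm_step(steps, provider="openai", prompt_contains="retry"))
```

The second call fails in this sketch because no single step satisfies both filters at once, which is the practical meaning of combining fields with AND.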

`to_call_tool(tool_name)`

Assert the trace contains a tool call with the given name.

expect(ctx.trace).to_call_tool("chargeCard")

`to_match_semantic_output(expected, **options)`

LLM-judged semantic match of combined LLM output vs. the expected string. Defaults to OpenAI GPT-4.1 with OPENAI_API_KEY.

import os

# Minimal, using default OpenAI model
await expect(ctx.trace).to_match_semantic_output("order confirmed")

# Use a different provider
await expect(ctx.trace).to_match_semantic_output(
    "attack stat",
    provider="claude",
    model="claude-3-opus-20240229",
)

# OpenAI-compatible endpoint (e.g., Moonshot/Kimi) via base_url + api_key
await expect(ctx.trace).to_match_semantic_output(
    "order confirmed",
    provider="openai",
    model="kimi-k2-turbo-preview",
    api_key=os.environ["KIMI_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)

Environment keys by provider: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY (or GOOGLE_API_KEY), GROK_API_KEY.

`to_evaluate_output_metric(config)`

Evaluate one LLM step's prompt or result using an LLM and assert a numeric metric condition in the range 0.0–1.0. Defaults: target=result, condition=at_least 0.7, provider=openai, model=gpt-4.1.

# Evaluate the last LLM result; default condition at_least 0.7
await expect(ctx.trace).to_evaluate_output_metric(
    evaluation_prompt="Rate how well this answers the user question.",
)

# Check a specific step (3rd LLM prompt), target the prompt text, require >= 0.8
await expect(ctx.trace).to_evaluate_output_metric(
    evaluation_prompt="Score coherence of this prompt between 0 and 1.",
    target="prompt",
    nth=3,
    condition={"at_least": 0.8},
    provider="claude",
    model="claude-3-opus-20240229",
)

# Custom comparator: score must be < 0.3
await expect(ctx.trace).to_evaluate_output_metric(
    evaluation_prompt="Rate hallucination risk (0=none, 1=high).",
    condition={"less_than": 0.3},
)

Options:

evaluation_prompt (required): your scoring instructions; model is asked to return only a number between 0 and 1.
target: 'result' (default) or 'prompt'. Evaluates that text only.
nth: pick which LLM step to score (1-based). Defaults to the last LLM step.
condition: one of greater_than, less_than, at_least, at_most, equals; default is {"at_least": 0.7}. Fails if the score is outside 0.0–1.0 or cannot be parsed.
provider / model / api_key / base_url: supports OpenAI, Claude, Gemini, Grok, and OpenAI-compatible endpoints via base_url.
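
The condition rules above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `check_metric` and `COMPARATORS` are hypothetical names, and the judge model's reply is assumed to arrive as a string.

```python
import operator

# Map each documented condition name to a comparison function.
COMPARATORS = {
    "greater_than": operator.gt,
    "less_than": operator.lt,
    "at_least": operator.ge,
    "at_most": operator.le,
    "equals": operator.eq,
}


def check_metric(raw_score, condition=None):
    """Return True if the judged score parses, is in range, and meets the condition."""
    condition = condition or {"at_least": 0.7}  # documented default
    try:
        score = float(raw_score.strip())
    except ValueError:
        return False  # the reply could not be parsed as a number
    if not 0.0 <= score <= 1.0:
        return False  # scores outside 0.0-1.0 fail
    return all(
        COMPARATORS[name](score, threshold)
        for name, threshold in condition.items()
    )


print(check_metric("0.85"))
print(check_metric("0.2", {"less_than": 0.3}))
print(check_metric("1.4"))
```

An unknown condition name raises KeyError in this sketch; how the real matcher reports that case is not documented here.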

`to_have_custom_step(config?)`

Assert a recorded custom step (RAG/code/fixed/custom) matches filters.

expect(ctx.trace).to_have_custom_step(kind="rag", name="pokemon-search")
expect(ctx.trace).to_have_custom_step(tag="sort:asc")
expect(ctx.trace).to_have_custom_step(contains="pikachu")
expect(ctx.trace).to_have_custom_step(result_contains="25")
expect(ctx.trace).to_have_custom_step(kind="rag", min_times=1, max_times=2)

`to_have_prompt_where(config)`

Filter prompts, then assert additional constraints. Example: "all prompts containing A must also contain B".

# Prompts that contain "order" must also contain "confirmed"
expect(ctx.trace).to_have_prompt_where(filter_contains="order", require_contains="confirmed")

# Prompts containing "retry" must NOT contain "cancel"
expect(ctx.trace).to_have_prompt_where(filter_contains="retry", require_not_contains="cancel")

# Control counts on the filtered subset
expect(ctx.trace).to_have_prompt_where(
    filter_contains="order",
    require_contains="confirmed",
    min_times=1,
    max_times=3,
)

# Check a specific prompt position (1-based nth)
expect(ctx.trace).to_have_prompt_where(
    filter_contains="order",
    require_contains="confirmed",
    nth=3,  # the 3rd prompt among those containing "order"
)
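
The filter-then-require idea can be illustrated with a stand-alone function over a plain list of prompt strings. `prompts_where` is a hypothetical helper, not the library's implementation. It assumes case-insensitive substring matching and assumes that an empty filtered subset passes unless min_times says otherwise; the real matcher may behave differently on both points.

```python
def prompts_where(prompts, filter_contains, require_contains=None,
                  require_not_contains=None, min_times=None,
                  max_times=None, nth=None):
    """Filter prompts by substring, then check constraints on the filtered subset."""
    needle = filter_contains.lower()
    subset = [p for p in prompts if needle in p.lower()]

    # Count constraints apply to the filtered subset, not to all prompts.
    if min_times is not None and len(subset) < min_times:
        return False
    if max_times is not None and len(subset) > max_times:
        return False

    # nth is 1-based and indexes into the filtered subset.
    if nth is not None:
        if nth < 1 or nth > len(subset):
            return False
        subset = [subset[nth - 1]]

    for prompt in subset:
        text = prompt.lower()
        if require_contains is not None and require_contains.lower() not in text:
            return False
        if require_not_contains is not None and require_not_contains.lower() in text:
            return False
    return True


# Made-up prompts for the sketch.
prompts = ["Is the order confirmed?", "Retry the order lookup", "Say hello"]

print(prompts_where(prompts, "order", require_contains="confirmed"))
print(prompts_where(prompts, "order", require_contains="confirmed", nth=1))
```

The first call is False in this sketch because one of the two prompts containing "order" lacks "confirmed"; restricting the check to the first filtered prompt with nth=1 makes it pass.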

Automatic AI Interception

Call install_ai_interceptor() in a @before_all hook and the runner patches httpx and requests before tests run, automatically recording LLM steps for:

Provider	Endpoints intercepted
OpenAI	`api.openai.com/v1/chat/completions`, `/v1/completions`
Gemini	`generativelanguage.googleapis.com/.../models/...:generateContent`
Grok (xAI)	`api.x.ai/v1/chat/completions`

Each intercepted call records model, provider, prompt, and completion into ctx.trace automatically. Your workflow code needs no changes.

from elasticdash_test import ai_test, before_all, after_all, install_ai_interceptor, uninstall_ai_interceptor, expect

@before_all
def setup():
    install_ai_interceptor()

@after_all
def teardown():
    uninstall_ai_interceptor()

@ai_test("user lookup flow")
async def test_user_lookup(ctx):
    # This makes a real OpenAI call — intercepted automatically
    await my_workflow.run("Find all active users")

    # Works without any ctx.trace.record_llm_step() in your workflow
    expect(ctx.trace).to_have_llm_step(prompt_contains="Find all active users")
    expect(ctx.trace).to_have_llm_step(provider="openai")

Streaming: When stream=True is set on a request, the completion is recorded as "(streamed)" — the prompt and model are still captured.
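
As a rough picture of how an interceptor might decide which requests to record, the sketch below maps a request URL to a provider name using the endpoints from the table above. `classify_llm_endpoint` is a hypothetical function; the actual interceptor's matching rules are not documented here and may be looser or stricter. The Gemini URL in the usage lines is only a sample.

```python
from urllib.parse import urlparse


def classify_llm_endpoint(url):
    """Return 'openai', 'gemini', 'grok', or None for an unrecognised URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path
    if host == "api.openai.com" and path in ("/v1/chat/completions", "/v1/completions"):
        return "openai"
    if (host == "generativelanguage.googleapis.com"
            and "/models/" in path
            and path.endswith(":generateContent")):
        return "gemini"
    if host == "api.x.ai" and path == "/v1/chat/completions":
        return "grok"
    return None


print(classify_llm_endpoint("https://api.openai.com/v1/chat/completions"))
print(classify_llm_endpoint(
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=abc"
))
print(classify_llm_endpoint("https://api.openai.com/v1/embeddings"))
```

Requests that do not map to a provider, such as the embeddings call in the last line, would pass through unrecorded in this sketch.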

Recording trace steps without passing `ctx.trace` (contextvars)

The runner sets a per-test current_trace using Python's contextvars, so your app code can record steps without threading ctx.trace through every function:

# In your test
from elasticdash_test import ai_test, set_current_trace, expect

@ai_test("flow test")
async def test_flow(ctx):
    set_current_trace(ctx.trace)        # bind the trace to the current async context
    await run_flow_without_trace_arg()  # your existing code
    expect(ctx.trace).to_have_custom_step(kind="rag", name="pokemon-search")

# In your app/flow code (called during the test)
from elasticdash_test import get_current_trace

async def run_flow_without_trace_arg():
    trace = get_current_trace()
    if trace:
        trace.record_custom_step(
            kind="rag",
            name="pokemon-search",
            payload={"query": "pikachu attack"},
            result={"ids": [25]},
            tags=["source:db", "sort:asc"],
        )
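
The reason contextvars suits this job can be shown with the standard library alone. In the sketch below, `set_current_trace` and `get_current_trace` are local stand-ins with the same names as the library helpers (they are not imported from elasticdash_test), and a plain list plays the role of the trace. Each asyncio task gets its own copy of the context, so two concurrently running flows do not see each other's trace.

```python
import asyncio
import contextvars

# Local stand-in for the library's per-test trace variable.
_current_trace = contextvars.ContextVar("current_trace", default=None)


def set_current_trace(trace):
    _current_trace.set(trace)


def get_current_trace():
    return _current_trace.get()


async def app_code(label):
    # App code looks the trace up instead of receiving it as an argument.
    await asyncio.sleep(0)
    trace = get_current_trace()
    if trace is not None:
        trace.append(label)


async def one_test(name):
    trace = []
    set_current_trace(trace)  # visible only inside this task's context
    await app_code(f"step from {name}")
    return trace


async def main():
    # gather() wraps each coroutine in its own task with a copied context.
    return await asyncio.gather(one_test("a"), one_test("b"))


trace_a, trace_b = asyncio.run(main())
print(trace_a, trace_b)
print(get_current_trace())
```

Outside the tasks the variable keeps its default of None, which is why the app code in the example above guards on the returned trace before recording.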

Configuration

Create an optional elasticdash.config.py at the project root:

config = {
    "test_match": ["**/*.ai_test.py"],
    "trace_mode": "local",
}

Option	Default	Description
`test_match`	`['**/*.ai_test.py']`	Glob patterns for test discovery
`trace_mode`	`'local'`	`'local'` (stub) or `'remote'` (future ElasticDash backend)
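
To see which files a pattern such as `**/*.ai_test.py` selects, the following sketch applies it with pathlib against a temporary directory. `discover_tests` is a hypothetical helper; the CLI's own discovery code may use a different globbing routine, so treat this as an approximation of the pattern semantics only.

```python
import tempfile
from pathlib import Path


def discover_tests(root, patterns=("**/*.ai_test.py",)):
    """Return the sorted set of files under root matching any glob pattern."""
    root = Path(root)
    found = set()
    for pattern in patterns:
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ai_tests").mkdir()
    (root / "my_flow.ai_test.py").write_text("")
    (root / "ai_tests" / "refund.ai_test.py").write_text("")
    (root / "helpers.py").write_text("")  # not a test file, should be skipped

    discovered = [p.relative_to(root).as_posix() for p in discover_tests(root)]

print(discovered)
```

With pathlib, the leading `**` matches the root directory as well as nested ones, so both the top-level file and the one under ai_tests are picked up while helpers.py is ignored.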

`ed_agents.py`, `ed_workflows.py`, `ed_tools.py`

These optional files are thin wrappers that bundle and re-export existing functions from your codebase. The runner loads them automatically during test runs to make agents, workflows, and tools available to your test environment.

`ed_agents.py`

Re-export agent functions or create a config dict for easy reference:

# ed_agents.py — import from your app
from my_app.agents import checkout_agent, payment_agent

config = {
    "checkout": checkout_agent,
    "payment": payment_agent,
}

Access in tests:

@ai_test("checkout flow")
async def test_checkout(ctx, config):
    agents = config.get("agents", {})
    result = await agents["checkout"]("order-123")

`ed_workflows.py`

Re-export workflow functions from your application:

# ed_workflows.py
from my_app.workflows import order_workflow, refund_workflow

# Re-export directly — the runner will import this module

Access in tests:

@ai_test("full order workflow")
async def test_workflow(ctx):
    from ed_workflows import order_workflow
    result = await order_workflow("order-123", "cust-456")
    expect(ctx.trace).to_call_tool("chargeCard")

`ed_tools.py`

Re-export tool functions that agents or workflows can invoke:

# ed_tools.py
from my_app.tools import charge_card, fetch_order_status, send_notification

Access in tests or workflows:

@ai_test("tool integration")
async def test_tools(ctx):
    from ed_tools import fetch_order_status
    status = await fetch_order_status("order-123")
    expect(ctx.trace).to_have_custom_step(kind="custom", name="fetch_order_status")

These files are loaded automatically if present in the project root.

Workflows Dashboard

Browse and search all available workflow functions in your project:

elasticdash dashboard         # open dashboard at http://localhost:4573
elasticdash dashboard --port 4572  # use custom port
elasticdash dashboard --no-open    # skip auto-opening browser

The dashboard scans ed_workflows.py and displays:

Function names — all callable functions in the module
Signatures — function parameters and return types
Async indicator — marks async vs sync functions
Source module — where the function is imported from (if not locally defined)
File path — location of ed_workflows.py

Use the search field to filter workflows by:

Name — find workflow by function name (e.g., checkout_flow)
Source module — find all workflows from a specific module (e.g., app_workflows)
File path — filter by location in your codebase

This is useful for discovering available workflows, understanding their signatures, and identifying where functions are defined before calling them in tests.
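
The kind of information the dashboard lists can be gathered with the standard inspect module. The sketch below is an assumption about how such a scan could work, not the dashboard's actual code: `describe_functions` is a hypothetical helper, and a throwaway module object stands in for ed_workflows.py.

```python
import inspect
import types


def describe_functions(module):
    """Collect name, signature, async flag, and defining module for each function."""
    rows = []
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        rows.append({
            "name": name,
            "signature": str(inspect.signature(fn)),
            "is_async": inspect.iscoroutinefunction(fn),
            "source_module": fn.__module__,  # where the function was defined
        })
    return rows


# Sample workflow functions for the sketch.
async def order_workflow(order_id: str, customer_id: str) -> dict:
    return {"order_id": order_id, "customer_id": customer_id}


def refund_workflow(order_id: str) -> bool:
    return True


# A throwaway module that "re-exports" the two functions above.
demo = types.ModuleType("ed_workflows_demo")
demo.order_workflow = order_workflow
demo.refund_workflow = refund_workflow

rows = describe_functions(demo)
for row in rows:
    print(row)
```

Because the functions are defined outside the demo module, their source_module differs from the module being scanned, which mirrors the "Source module" column for re-exported workflows.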

Project Structure

elasticdash_test/
  cli.py                CLI entry point (click + glob)
  runner.py             Sequential test runner engine
  reporter.py           Color-coded terminal output
  registry.py           ai_test / before_all / after_all registry
  trace.py              TraceHandle, AITestContext, contextvars support
  matchers.py           Custom expect matchers
  interceptors/
    ai_interceptor.py   Automatic httpx/requests interceptor for OpenAI / Gemini / Grok

Programmatic API

import asyncio

from elasticdash_test import install_ai_interceptor, uninstall_ai_interceptor
from elasticdash_test.runner import run_files
from elasticdash_test.reporter import print_results


async def main():
    install_ai_interceptor()  # patch httpx/requests for automatic LLM tracing
    try:
        # run_files is awaited, so it has to be called from a coroutine
        results = await run_files(["./tests/flow.ai_test.py"])
        print_results(results)
    finally:
        uninstall_ai_interceptor()  # restore original transports when done


asyncio.run(main())

Non-Goals

This runner intentionally does not support:

Parallel execution
Watch mode
Snapshot testing
Coverage reporting
pytest compatibility

License

MIT

Release history

Version	Released
0.1.3	Jun 8, 2026
0.1.2a8 (pre-release)	Jun 5, 2026
0.1.2a7 (pre-release)	Jun 4, 2026
0.1.2a6 (pre-release)	May 29, 2026
0.1.2a4 (pre-release)	May 29, 2026
0.1.2a3 (pre-release)	May 27, 2026
0.1.2a2 (pre-release, this version)	May 26, 2026

elasticdash-sdk 0.1.2a2
