AI-powered test data generator for QA engineers

testdata-ai

Most test data looks like this:

email="test@test.com"
name="John Doe"
age=30

This causes unrealistic tests and hides edge cases.

testdata-ai generates realistic, culturally diverse, and behaviorally coherent data using modern LLMs.

Quick start

pip install "testdata-ai[openai]"
testdata-ai generate --context ecommerce_customer --count 10

from testdata_ai import (
    generate, generate_from_model, generate_with_relationships,
    generate_parallel, async_generate, GenerateSpec,
)
from pydantic import BaseModel

# Built-in context
users = generate("ecommerce_customer", count=50)

# Your own Pydantic model — no ContextSchema needed
class Order(BaseModel):
    customer_name: str
    total: float
    status: str

orders = generate_from_model(Order, count=10)

# Multi-entity datasets with referential integrity
result = generate_with_relationships({
    "customers": {"context": "ecommerce_customer", "count": 5},
    "orders": {
        "context": "restaurant_order", "count": 20,
        "parent": "customers", "fk_field": "customer_email", "parent_pk": "email",
    },
})
# result["orders"][*]["customer_email"] is always a real customer email

# Async parallel: generate multiple contexts simultaneously
import asyncio

results = asyncio.run(generate_parallel([
    GenerateSpec("ecommerce_customer", count=500, label="customers"),
    GenerateSpec("banking_user",        count=500, label="accounts"),
    GenerateSpec("iot_device",          count=500, label="devices"),
]))
# all 3 AI calls run concurrently — much faster than sequential generate()

# Or generate one context in parallel batches
records = asyncio.run(async_generate("ecommerce_customer", count=3000, parallelism=5))

Why testdata-ai?

  • 13 built-in domains — e-commerce, banking, healthcare, HR, IoT, travel, and more
  • 6 AI providers — OpenAI, Anthropic, Google Gemini, Mistral, Cohere, or a local Ollama model (no API cost)
  • pytest plugin — session-scoped fixtures with caching, named seeds, and xdist support, auto-loaded
  • Pydantic / JSON Schema support — generate data directly from your existing models
  • Faker hybrid mode — mark fields as faker:email / faker:iban to get format-guaranteed values while AI handles the semantic context
  • Unique field constraints — add unique_fields=["email", "user_id"] to any context and those fields will never duplicate within a batch
  • Multi-entity datasets: generate_with_relationships() generates customers → orders → shipments with guaranteed FK integrity and semantic coherence (child records make sense given parent records)
  • Async / parallel generation: generate_parallel() and async_generate() run multiple AI calls concurrently via asyncio, reducing wall-clock time for large datasets; cross-call uniqueness is guaranteed via Faker dedup

| | Faker | testdata-ai |
| --- | --- | --- |
| Realistic emails | test123@example.com | aisha.patel.2024@gmail.com |
| Cultural diversity | Limited | Names from many cultures |
| Behavioral coherence | None | Age, location, and habits match |
| Edge-case variety | Manual | AI generates it automatically |
| Use your own Pydantic model | Not possible | generate_from_model(MyModel, count=10) |
| Format-safe critical fields | ✅ Faker's domain | field_providers={"email": "faker:email"} |
| Unique values across records | Requires manual set tracking | unique_fields=["email", "user_id"] |
| Multi-entity FK datasets | Sequential, no semantic link | generate_with_relationships(graph): child records contextually match parents |
| Large dataset throughput | Single-threaded | generate_parallel() / async_generate(): concurrent AI calls, N× speedup |

Why not just use Faker?

Faker is excellent for generating syntactically valid values (emails, UUIDs, phone numbers), but it lacks semantic coherence.

Example Faker output:

name="John Smith"
email="random42@example.com"
country="Japan"

testdata-ai generates consistent records:

name="Yuki Tanaka"
email="yuki.tanaka@gmail.com"
country="Japan"

Installation

pip install "testdata-ai[openai]"       # OpenAI only
pip install "testdata-ai[anthropic]"    # Anthropic only
pip install "testdata-ai[ollama]"       # Ollama only (no extra packages — uses stdlib)
pip install "testdata-ai[gemini]"       # Google Gemini only
pip install "testdata-ai[mistral]"      # Mistral only
pip install "testdata-ai[cohere]"       # Cohere only
pip install "testdata-ai[faker]"        # Faker hybrid mode (format-safe fields)
pip install "testdata-ai[all]"          # All providers + Faker

Development install (from source)

git clone https://github.com/testcraft-ai/testdata-ai.git
cd testdata-ai
python -m venv venv && source venv/bin/activate
pip install -e ".[all]"

Configuration

Create a .env file in the project root:

# Provider selection
AI_PROVIDER=openai          # openai | anthropic | ollama | gemini | mistral | cohere

# OpenAI
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4o-mini    # default; gpt-4o for higher quality
OPENAI_MAX_TOKENS=4096
OPENAI_TEMPERATURE=0.7

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-haiku-4-5-20251001   # default
ANTHROPIC_MAX_TOKENS=4096
ANTHROPIC_TEMPERATURE=0.7

# Ollama (local, no API key required)
OLLAMA_BASE_URL=http://localhost:11434  # default
OLLAMA_MODEL=qwen2.5:14b               # default
OLLAMA_MAX_TOKENS=4096
OLLAMA_TEMPERATURE=0.7

# Google Gemini
GEMINI_API_KEY=...
GEMINI_MODEL=gemini-2.0-flash          # default
GEMINI_MAX_TOKENS=4096
GEMINI_TEMPERATURE=0.7

# Mistral
MISTRAL_API_KEY=...
MISTRAL_MODEL=mistral-small-latest     # default
MISTRAL_MAX_TOKENS=4096
MISTRAL_TEMPERATURE=0.7

# Cohere
COHERE_API_KEY=...
COHERE_MODEL=command-r                 # default
COHERE_MAX_TOKENS=4096
COHERE_TEMPERATURE=0.7

# Locale (optional; applies to all providers)
AI_LOCALE=pl   # BCP 47 tag; overridden by --locale or locale= per call

All env vars are optional except *_API_KEY (Ollama requires no API key). Defaults: gpt-4o-mini / claude-haiku-4-5-20251001 / qwen2.5:14b / gemini-2.0-flash / mistral-small-latest / command-r, temperature 0.7, max_tokens 4096.


CLI

After installation, use the testdata-ai command (or python -m testdata_ai):

generate

Generate test data records and output as JSON, JSONL, CSV, YAML, or SQL.

testdata-ai generate --context <name> [OPTIONS]
testdata-ai generate --schema-file <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --context TEXT | | Context name (see Available Contexts). Mutually exclusive with --schema-file |
| --schema-file PATH | | JSON or YAML file containing a JSON Schema definition. Mutually exclusive with --context |
| --count INTEGER | 10 | Number of records to generate |
| --batch-size INTEGER | 10 | Records per AI call. For count > batch-size, records are output progressively |
| -o, --output [json\|jsonl\|csv\|yaml\|sql] | json | Output format. Write to file via shell redirection: -o csv > data.csv |
| --table TEXT | context name | Table name for SQL output |
| --provider TEXT | from env | AI provider override (openai / anthropic / ollama / gemini / mistral / cohere) |
| --model TEXT | from env | Model name override |
| --max-tokens INTEGER | from env | Max tokens per AI call (auto-adjusted to batch-size by default) |
| --temperature FLOAT | from env | Sampling temperature 0.0–1.0 |
| --locale TEXT | from env | Locale/language for generated values (e.g. pl, ja, de). Overrides AI_LOCALE env var |
| --no-validate | off | Skip schema validation |
| --context-file PATH | | YAML or JSON file with custom context definitions (repeatable) |
| -q, --quiet | off | Suppress status messages (data only to stdout) |

Examples:

# 10 e-commerce customers to stdout (JSON)
testdata-ai generate --context ecommerce_customer --count 10

# 50 SaaS trial users saved as CSV
testdata-ai generate --context saas_trial --count 50 -o csv > trials.csv

# SQL INSERT statements for direct database seeding
testdata-ai generate --context ecommerce_customer --count 100 -o sql > seed.sql

# SQL with a custom table name
testdata-ai generate --context banking_user --count 20 -o sql --table bank_accounts > accounts.sql

# 100 records in batches of 20 — JSONL lines appear after each batch
testdata-ai generate --context ecommerce_customer --count 100 --batch-size 20 -o jsonl

# Use Anthropic instead of the default provider
testdata-ai generate --context banking_user --count 5 --provider anthropic

# Use Google Gemini
testdata-ai generate --context ecommerce_customer --count 10 --provider gemini

# Use Mistral
testdata-ai generate --context saas_trial --count 10 --provider mistral

# Use a local Ollama model
testdata-ai generate --context ecommerce_customer --count 10 --provider ollama

# Generate data in Polish
testdata-ai generate --context ecommerce_customer --count 5 --locale pl

# Generate data in Japanese, save as CSV
testdata-ai generate --context banking_user --count 10 --locale ja -o csv > data.csv

# Use a specific model with higher token budget
testdata-ai generate --context hr_employee --count 30 --model gpt-4o --max-tokens 8192

# Machine-readable output (no status messages, plain JSON)
testdata-ai generate --context iot_device --count 20 -q | jq '.[0]'

# Use as Python module (same interface)
python -m testdata_ai generate --context ecommerce_customer --count 5

# Load a custom context from a YAML file and generate data for it
testdata-ai generate --context game_character --context-file my_contexts.yaml --count 5

# Quiet: suppress all status messages including the "Loaded context(s)..." line
testdata-ai generate --context game_character --context-file my_contexts.yaml -q

# Generate from a JSON Schema file (no built-in context needed)
testdata-ai generate --schema-file product_schema.json --count 10
testdata-ai generate --schema-file order_schema.yaml --count 5 -o csv > orders.csv
testdata-ai generate --schema-file ticket_schema.json --count 20 --locale pl

Batch generation / streaming: Large counts are split into multiple AI calls of --batch-size records each. Progress is reported per batch in stderr. With -o jsonl, records are written to stdout as each batch completes — output starts immediately rather than waiting for all records. With -o yaml, each batch is appended as it arrives. With -o json, -o csv, or -o sql, all records are accumulated and written at the end.

Token auto-adjustment: When --max-tokens is not set, the CLI estimates the required token budget per batch and automatically increases it if needed, printing a yellow notice to stderr.

CSV output: Nested dicts are flattened with dot notation (e.g., location.city); lists are serialized as JSON strings.
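
The CSV flattening rule can be illustrated with a short standalone sketch. The helper name flatten_record is hypothetical and not part of the testdata-ai API; it only assumes the behaviour described above (dot-joined keys for nested dicts, JSON strings for lists).

```python
import json
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts with dot notation; serialize lists as JSON strings."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that {"location": {"city": ...}} becomes "location.city"
            flat.update(flatten_record(value, prefix=name))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


row = flatten_record({
    "name": "Aisha Patel",
    "location": {"city": "Mumbai", "country": "India"},
    "tags": ["electronics", "books"],
})
print(row)
```

Each key of the flattened dict would then map to one CSV column.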

JSONL output: One JSON object per line — records appear progressively as batches complete.

YAML output: Records are appended batch-by-batch as generation progresses.

SQL output: Emits a CREATE TABLE IF NOT EXISTS DDL statement followed by INSERT INTO statements — compatible with SQLite and most major databases. Column types are inferred per field (INTEGER, REAL, TEXT). Nested dicts are flattened with underscore separators (e.g., address_city); lists are serialized as JSON strings. The table name defaults to the context name and can be overridden with --table.
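
As a rough illustration of the SQL rules above, the following sketch infers column types, flattens nested dicts with underscores, and emits statements that SQLite accepts. All function names are hypothetical, and inferring types from the first record only is an assumption of this sketch, not a statement about how the CLI does it.

```python
import json
import sqlite3
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts with underscores; serialize lists as JSON strings."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def sql_type(value: Any) -> str:
    """Map a Python value to INTEGER, REAL, or TEXT."""
    if isinstance(value, (bool, int)):
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"


def sql_literal(value: Any) -> str:
    """Render a value as a SQL literal, escaping single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_sql(table: str, records: List[Dict[str, Any]]) -> List[str]:
    rows = [flatten(r) for r in records]
    columns = list(rows[0])  # assumption: the first record defines the columns
    ddl = ", ".join(f'"{c}" {sql_type(rows[0][c])}' for c in columns)
    statements = [f'CREATE TABLE IF NOT EXISTS "{table}" ({ddl});']
    col_list = ", ".join(f'"{c}"' for c in columns)
    for row in rows:
        values = ", ".join(sql_literal(row.get(c)) for c in columns)
        statements.append(f'INSERT INTO "{table}" ({col_list}) VALUES ({values});')
    return statements


statements = to_sql("customers", [
    {"name": "O'Neil", "age": 41, "address": {"city": "Cork"}, "tags": ["a"]},
])

# Load the statements into an in-memory SQLite database to check them
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
fetched = conn.execute("SELECT name, age, address_city, tags FROM customers").fetchall()
print(fetched)
```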


list-contexts

List all available contexts.

testdata-ai list-contexts [--category CATEGORY] [--context-file PATH]...
# List all contexts
testdata-ai list-contexts

# Filter by category
testdata-ai list-contexts --category finance
testdata-ai list-contexts --category healthcare

# Include custom contexts from a file
testdata-ai list-contexts --context-file my_contexts.yaml

show-context

Show full details of a context: fields, sample record, and prompt hints.

testdata-ai show-context <context> [--context-file PATH]...
testdata-ai show-context ecommerce_customer
testdata-ai show-context logistics_shipment

# Show a custom context defined in a file
testdata-ai show-context game_character --context-file my_contexts.yaml

list-models (Ollama only)

List models available in the running Ollama instance.

testdata-ai list-models [--provider ollama]
# Requires AI_PROVIDER=ollama in .env, or pass --provider explicitly
testdata-ai list-models
testdata-ai list-models --provider ollama

If no models are found, the command prints a hint to run ollama pull <model>.


generate-related

Generate multiple related entity datasets with guaranteed referential integrity. Unlike running generate separately for each entity, child prompts include sample parent records so the AI produces semantically coherent data — order amounts match the parent customer's income tier, shipment addresses match the order destination, etc.

testdata-ai generate-related --graph-file <path> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --graph-file PATH | | YAML or JSON relationship graph file (required) |
| -o, --output [json\|jsonl-per-entity] | json | Output format |
| --batch-size INTEGER | 10 | Records per AI call (applied to all nodes unless overridden in graph) |
| --provider TEXT | from env | AI provider override |
| --model TEXT | from env | Model name override |
| --max-tokens INTEGER | from env | Max tokens per AI call (auto-adjusted per node by default) |
| --temperature FLOAT | from env | Sampling temperature |
| --locale TEXT | from env | Locale/language for all generated values |
| --no-validate | off | Skip schema validation |
| -q, --quiet | off | Suppress status messages (data only to stdout) |

Graph file format (relationships.yaml):

customers:
  context: ecommerce_customer
  count: 5

orders:
  context: restaurant_order
  count: 20
  parent: customers        # must be generated before orders
  fk_field: customer_email # field injected into each order record
  parent_pk: email         # field taken from parent customer records
  parent_sample_size: 3    # how many parent records shown in AI prompt (default 3)
  batch_size: 10           # records per AI call for this node (default 10)

Multi-level chains work too — just reference the right parent at each level:

customers:
  context: ecommerce_customer
  count: 3

orders:
  context: restaurant_order
  count: 9
  parent: customers
  fk_field: customer_email
  parent_pk: email

shipments:
  context: logistics_shipment
  count: 9
  parent: orders
  fk_field: reference_order_id
  parent_pk: order_id

Examples:

# Generate from a graph file, output JSON
testdata-ai generate-related --graph-file examples/ecommerce_graph.yaml

# JSONL format — one line per entity, useful for streaming / jq
testdata-ai generate-related --graph-file examples/ecommerce_graph.yaml -o jsonl-per-entity

# Pipe orders to jq
testdata-ai generate-related --graph-file examples/ecommerce_graph.yaml -q \
    | jq '.orders[] | {id: .order_id, email: .customer_email}'

# Polish locale, 5-record batches
testdata-ai generate-related --graph-file examples/ecommerce_graph.yaml \
    --locale pl --batch-size 5

Output formats:

  • json (default) — {"customers": [...], "orders": [...]}
  • jsonl-per-entity — one line per entity: {"entity": "customers", "records": [...]}
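
For consumers of the jsonl-per-entity format, a minimal sketch of converting between the two shapes might look like this. Both helper names are hypothetical; only the line layout ({"entity": ..., "records": [...]}) is taken from the description above.

```python
import json
from typing import Any, Dict, Iterable, List

Dataset = Dict[str, List[Dict[str, Any]]]


def to_jsonl_per_entity(result: Dataset) -> List[str]:
    """One JSON line per entity, in the shape shown above."""
    return [json.dumps({"entity": name, "records": records})
            for name, records in result.items()]


def from_jsonl_per_entity(lines: Iterable[str]) -> Dataset:
    """Rebuild the {"customers": [...], "orders": [...]} shape from JSONL lines."""
    parsed = (json.loads(line) for line in lines if line.strip())
    return {obj["entity"]: obj["records"] for obj in parsed}


dataset = {
    "customers": [{"email": "yuki.tanaka@gmail.com"}],
    "orders": [{"order_id": "ORD-1", "customer_email": "yuki.tanaka@gmail.com"}],
}
lines = to_jsonl_per_entity(dataset)
restored = from_jsonl_per_entity(lines)
```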

Python API

DataGenerator

from testdata_ai import DataGenerator

# Default provider from .env
gen = DataGenerator()

# Explicit provider
gen = DataGenerator(provider="anthropic")
gen = DataGenerator(provider="gemini")
gen = DataGenerator(provider="mistral")
gen = DataGenerator(provider="cohere")

# Local Ollama model (no API key needed)
gen = DataGenerator(provider="ollama")
gen = DataGenerator(provider="ollama", model="llama3.2:latest")

# Full control
gen = DataGenerator(
    provider="openai",
    model="gpt-4o",
    temperature=0.9,
    max_tokens=8192,
)

# Pass API key directly (provider required when using api_key)
gen = DataGenerator(provider="openai", api_key="sk-proj-...")

# Generate data in a specific locale
gen = DataGenerator(locale="pl")   # Polish names, addresses, etc.
gen = DataGenerator(locale="ja")   # Japanese

# Generate records
customers = gen.generate("ecommerce_customer", count=10)
patients  = gen.generate("healthcare_patient", count=5)

# Large counts are split automatically into batches of 20 records (5 AI calls here)
many = gen.generate("banking_user", count=100, batch_size=20)

# Skip schema validation
records = gen.generate("banking_user", count=20, validate=False)

DataGenerator.generate() returns List[Dict[str, Any]] — a list of plain Python dicts. For count > batch_size, it automatically splits the work into multiple AI calls and combines the results.
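
The batch arithmetic can be sketched as follows. plan_batches is a hypothetical helper, and placing the remainder in a final smaller batch is an assumption of this sketch rather than documented behaviour.

```python
from typing import List


def plan_batches(count: int, batch_size: int) -> List[int]:
    """Return the number of records requested in each AI call."""
    if count < 1 or batch_size < 1:
        raise ValueError("count and batch_size must be >= 1")
    full, remainder = divmod(count, batch_size)
    # Assumption: any remainder goes into one final, smaller batch
    return [batch_size] * full + ([remainder] if remainder else [])


print(plan_batches(100, 20))
print(plan_batches(25, 10))
```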

Raises:

  • ValueError — unknown context, invalid JSON from AI, or bad arguments
  • testdata_ai.contexts.ValidationError — one or more records missing required fields (when validate=True)

generate() convenience function

For one-off use without instantiating the class:

from testdata_ai import generate

customers = generate("ecommerce_customer", count=20)

# Generate in a specific locale
polish_customers = generate("ecommerce_customer", count=20, locale="pl")

# Large counts split automatically into 20-record batches
many = generate("ecommerce_customer", count=100, batch_size=20)

Configuration (provider, model, etc.) is read from environment variables. For explicit control use DataGenerator directly.


generate_batched() — streaming / incremental output

When you want to process or display records as they arrive rather than waiting for the full result:

from testdata_ai import DataGenerator
from testdata_ai.generator import generate_batched

# Process records in batches of 10 as each batch completes
for batch in generate_batched("ecommerce_customer", count=50, batch_size=10):
    print(f"Got {len(batch)} records")
    save_to_db(batch)       # commit each batch immediately
    send_to_pipeline(batch) # or stream to a downstream system

# Or use DataGenerator directly for repeated use
gen = DataGenerator(provider="anthropic")
for batch in gen.generate_batched("banking_user", count=100, batch_size=20):
    process(batch)

generate_batched() / DataGenerator.generate_batched() yield List[Dict[str, Any]] — one batch per iteration.


generate_with_relationships() — Multi-entity datasets

Generate multiple related entity datasets in a single call. The graph is executed in dependency order (topological sort), and child prompts include sample parent records so the AI produces contextually consistent data — not just FK injection after the fact.

from testdata_ai import DataGenerator

gen = DataGenerator()

result = gen.generate_with_relationships({
    "customers": {
        "context": "ecommerce_customer",
        "count": 5,
    },
    "orders": {
        "context": "restaurant_order",
        "count": 20,
        "parent": "customers",
        "fk_field": "customer_email",   # field to inject into each order
        "parent_pk": "email",           # field from parent used as FK value
        "parent_sample_size": 3,        # parent records shown in AI prompt
        "batch_size": 10,               # records per AI call (default 10)
    },
})

# result["customers"] → List[Dict] — 5 customers
# result["orders"]    → List[Dict] — 20 orders, each with customer_email set to a real customer email

# FK integrity is always guaranteed (safety-net injection after AI generation)
customer_emails = {c["email"] for c in result["customers"]}
assert all(o["customer_email"] in customer_emails for o in result["orders"])

Three-level chain — customers → orders → shipments:

result = gen.generate_with_relationships({
    "customers": {"context": "ecommerce_customer", "count": 3},
    "orders": {
        "context": "restaurant_order", "count": 9,
        "parent": "customers", "fk_field": "customer_email", "parent_pk": "email",
    },
    "shipments": {
        "context": "logistics_shipment", "count": 9,
        "parent": "orders", "fk_field": "reference_order_id", "parent_pk": "order_id",
    },
})

Locale support:

# All entities generated in Polish
gen = DataGenerator(locale="pl")
result = gen.generate_with_relationships({...})

Module-level convenience function:

from testdata_ai import generate_with_relationships

result = generate_with_relationships(
    {
        "customers": {"context": "ecommerce_customer", "count": 2},
        "orders": {
            "context": "restaurant_order", "count": 6,
            "parent": "customers", "fk_field": "customer_email", "parent_pk": "email",
        },
    },
    validate=True,
    locale="ja",
)

Graph node fields:

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| context | yes | | Registered context identifier |
| count | yes | | Number of records to generate |
| parent | no | | Parent node name (makes this a child node) |
| fk_field | when parent set | | Field to inject into each child record |
| parent_pk | when parent set | | Field from parent records used as FK value |
| parent_sample_size | no | 3 | Parent records embedded in child AI prompt |
| batch_size | no | 10 | Records per AI call for this node |

Raises:

  • ValueError — missing required fields, unknown parent reference, or cycle in graph
  • testdata_ai.contexts.ValidationError — records missing required fields (when validate=True)

Graph YAML files — save any graph dict as YAML and use it with the CLI:

testdata-ai generate-related --graph-file relationships.yaml

See examples/ecommerce_graph.yaml and examples/relationships.py for full examples.
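
To make the dependency ordering and the FK safety net described in this section more concrete, here is a standalone sketch using the standard-library graphlib module (Python 3.9+). The function names are hypothetical, and picking a random valid parent key as the replacement is an assumption; the library's own strategy is not documented here.

```python
import random
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List

Graph = Dict[str, Dict[str, Any]]
Records = List[Dict[str, Any]]


def generation_order(graph: Graph) -> List[str]:
    """Order nodes so that every parent comes before its children."""
    sorter = TopologicalSorter()
    for name, node in graph.items():
        parent = node.get("parent")
        if parent is None:
            sorter.add(name)
        elif parent not in graph:
            raise ValueError(f"unknown parent {parent!r} for node {name!r}")
        else:
            sorter.add(name, parent)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("cycle in relationship graph") from exc


def enforce_fk(children: Records, parents: Records, fk_field: str,
               parent_pk: str, rng: random.Random) -> Records:
    """Overwrite any child FK that does not match a real parent key."""
    valid = [p[parent_pk] for p in parents]
    for child in children:
        if child.get(fk_field) not in valid:
            child[fk_field] = rng.choice(valid)  # assumption: random valid parent
    return children


graph = {
    "shipments": {"parent": "orders"},
    "orders": {"parent": "customers"},
    "customers": {},
}
order = generation_order(graph)

customers = [{"email": "yuki.tanaka@gmail.com"}, {"email": "aisha.patel.2024@gmail.com"}]
orders = [
    {"order_id": "ORD-1", "customer_email": "yuki.tanaka@gmail.com"},
    {"order_id": "ORD-2", "customer_email": "unknown@example.com"},  # simulated bad FK
]
orders = enforce_fk(orders, customers, "customer_email", "email", random.Random(0))
```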


generate_from_model() — Schema from Pydantic / JSON Schema

If you already have Pydantic models, pass them directly — no need to write a ContextSchema by hand. The field names, types, descriptions, and constraints are extracted automatically and used to guide the AI.

from pydantic import BaseModel, Field
from testdata_ai import generate_from_model

class Customer(BaseModel):
    name: str
    email: str = Field(description="Valid email address")
    age: int = Field(ge=18, le=99, description="Age in years")
    is_active: bool

data = generate_from_model(Customer, count=10)
# [{"name": "Aisha Patel", "email": "aisha@...", "age": 34, "is_active": True}, ...]

Nested models work too:

class Address(BaseModel):
    street: str
    city: str
    country: str

class Order(BaseModel):
    order_id: str
    customer_name: str
    total: float
    shipping_address: Address

data = generate_from_model(Order, count=5)

Raw JSON Schema dict — no Pydantic needed:

schema = {
    "title": "Product",
    "properties": {
        "sku":      {"type": "string"},
        "name":     {"type": "string", "description": "Display name"},
        "price":    {"type": "number", "minimum": 0},
        "category": {"enum": ["electronics", "clothing", "food"]},
        "in_stock": {"type": "boolean"},
    },
}
data = generate_from_model(schema, count=5)

All the usual options apply:

# Locale
data = generate_from_model(Customer, count=5, locale="pl")

# Skip validation (useful for models with many optional fields)
data = generate_from_model(Customer, count=10, validate=False)

# Via DataGenerator (reuse across multiple models)
from testdata_ai import DataGenerator

gen = DataGenerator(provider="anthropic")
customers = gen.generate_from_model(Customer, count=5)
orders    = gen.generate_from_model(Order, count=3)

Inspect the derived schema without calling the AI:

from testdata_ai.schema_adapter import model_to_context_schema

cs = model_to_context_schema(Customer)
print(cs.description)    # "Auto-generated from Customer schema"
print(cs.fields)         # ['name', 'email', 'age', 'is_active']
print(cs.sample)         # {'name': 'example_name', 'email': 'user@example.com', ...}
print(cs.prompt_hints)   # ['email: Valid email address', 'age: Age in years', 'age: min=18, max=99']

Supported schema features: $ref / $defs, anyOf / oneOf (null-safe), enum, const, string format (email, date, date-time, uri), numeric minimum / maximum, minLength / maxLength, nested objects and arrays, Pydantic v1 (.schema()) and v2 (.model_json_schema()). No new dependencies — Pydantic is detected by duck-typing.
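
For intuition about how descriptions and numeric bounds could become prompt hints, here is a simplified sketch that works on a plain JSON Schema dict. It is not the library's implementation: schema_to_hints is a hypothetical name, only description, minimum, maximum, and enum are handled, and the enum hint wording is invented for illustration.

```python
from typing import Any, Dict, List


def schema_to_hints(schema: Dict[str, Any]) -> List[str]:
    """Derive human-readable generation hints from a JSON Schema dict."""
    hints: List[str] = []
    for name, spec in schema.get("properties", {}).items():
        if "description" in spec:
            hints.append(f"{name}: {spec['description']}")
        bounds = []
        if "minimum" in spec:
            bounds.append(f"min={spec['minimum']}")
        if "maximum" in spec:
            bounds.append(f"max={spec['maximum']}")
        if bounds:
            hints.append(f"{name}: {', '.join(bounds)}")
        if "enum" in spec:
            # Hypothetical wording for enumerated values
            hints.append(f"{name}: one of {', '.join(map(str, spec['enum']))}")
    return hints


customer_schema = {
    "title": "Customer",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string", "description": "Valid email address"},
        "age": {"type": "integer", "minimum": 18, "maximum": 99,
                "description": "Age in years"},
        "is_active": {"type": "boolean"},
    },
}
hints = schema_to_hints(customer_schema)
print(hints)
```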


Async / Parallel Generation

Run multiple AI calls concurrently using asyncio. Blocking provider calls are offloaded to a thread pool via asyncio.to_thread (Python 3.9+), so the standard synchronous providers work unchanged.

import asyncio
from testdata_ai import generate_parallel, async_generate, GenerateSpec

generate_parallel() — multiple contexts at once

async def main():
    results = await generate_parallel([
        GenerateSpec("ecommerce_customer", count=500, label="customers"),
        GenerateSpec("banking_user",        count=500, label="accounts"),
        GenerateSpec("iot_device",          count=500, label="devices"),
    ])
    # All 3 AI calls run concurrently
    # results["customers"] → List[Dict] (500 records)
    # results["accounts"]  → List[Dict] (500 records)
    # results["devices"]   → List[Dict] (500 records)

asyncio.run(main())  # or await generate_parallel(...) inside an existing async context

Result keying:

  • When label is set, results are stored under that label.
  • When label is None and multiple specs share the same context, their results are merged under the context name:
results = await generate_parallel([
    GenerateSpec("ecommerce_customer", 1000),
    GenerateSpec("ecommerce_customer", 1000),
    GenerateSpec("ecommerce_customer", 1000),
])
records = results["ecommerce_customer"]  # ~3000 merged records
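
The keying rule above amounts to a small merge step, sketched here with plain tuples standing in for GenerateSpec objects. key_results is a hypothetical helper, and merging specs that share the same explicit label is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Records = List[Dict[str, Any]]


def key_results(outputs: List[Tuple[str, Optional[str], Records]]) -> Dict[str, Records]:
    """Group (context, label, records) triples under label, else under context."""
    merged: Dict[str, Records] = defaultdict(list)
    for context, label, records in outputs:
        key = label if label is not None else context
        merged[key].extend(records)
    return dict(merged)


keyed = key_results([
    ("ecommerce_customer", None, [{"id": 1}]),
    ("ecommerce_customer", None, [{"id": 2}]),
    ("banking_user", "accounts", [{"id": 3}]),
])
print(sorted(keyed))
```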

Cross-call uniqueness — requires pip install testdata-ai[faker]:

results = await generate_parallel(
    [
        GenerateSpec("ecommerce_customer", count=500, label="segment_a"),
        GenerateSpec("ecommerce_customer", count=500, label="segment_b"),
    ],
    global_unique_fields=["email"],   # no duplicate emails across both segments
)

Two uniqueness layers:

  1. Prompt injection (statistical): each task gets a unique batch_id injected into its prompt
  2. Faker dedup (guaranteed): when global_unique_fields is set, confirmed duplicates are replaced with Faker-generated values
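
The second layer can be pictured as a pass over the merged records that replaces confirmed duplicates. The sketch below uses a simple counter-based generator as a stand-in for Faker so that it runs without extra packages; dedupe_field and make_email are hypothetical names.

```python
import itertools
from typing import Any, Callable, Dict, List


def dedupe_field(records: List[Dict[str, Any]], field: str,
                 make_value: Callable[[], Any]) -> List[Dict[str, Any]]:
    """Replace repeated values of `field` so every record ends up unique."""
    seen = set()
    for record in records:
        if field not in record:
            continue
        value = record[field]
        while value in seen:
            value = make_value()  # keep drawing until the value is unused
        record[field] = value
        seen.add(value)
    return records


counter = itertools.count(1)


def make_email() -> str:
    # Stand-in for a Faker-generated email
    return f"user{next(counter)}@example.com"


merged = [
    {"email": "a@example.com"},
    {"email": "b@example.com"},
    {"email": "a@example.com"},  # duplicate produced by a different batch
]
deduped = dedupe_field(merged, "email", make_email)
```

The first occurrence of each value is kept; only later duplicates are rewritten.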

GenerateSpec fields:

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| context | yes | | Context identifier |
| count | yes | | Number of records |
| locale | no | None | BCP 47 locale tag (overrides AI_LOCALE) |
| validate | no | False | Run schema validation on results |
| label | no | None | Custom key in the results dict; None → merge by context name |

async_generate() — single context, parallel batches

Convenience wrapper for generating many records from one context by splitting into parallel batches:

# 3000 records, 3 concurrent batches (default: ceil(count/parallelism) per batch)
records = await async_generate("ecommerce_customer", count=3000, parallelism=3)

# 9000 records: batches of 1000, max 3 concurrent (3 waves of 3)
records = await async_generate(
    "ecommerce_customer",
    count=9000,
    parallelism=3,
    batch_size=1000,
    global_unique_fields=["email"],   # unique emails across all batches
)

# Locale-aware
records = await async_generate("ecommerce_customer", count=500, parallelism=5, locale="pl")

async_generate() parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| context | | Context identifier |
| count | | Total records to generate |
| parallelism | 3 | Max concurrent AI calls (semaphore limit) |
| batch_size | ceil(count/parallelism) | Records per AI call |
| locale | None | BCP 47 locale tag |
| global_unique_fields | None | Fields to deduplicate across all batches (requires Faker) |
| provider | from env | AI provider name |

Full working example:

import asyncio
from testdata_ai import generate_parallel, async_generate, GenerateSpec

async def main():
    # Multi-context parallel
    results = await generate_parallel([
        GenerateSpec("ecommerce_customer", count=100, label="buyers"),
        GenerateSpec("banking_user",        count=50,  label="accounts"),
    ], global_unique_fields=["email"])

    print(f"buyers:   {len(results['buyers'])} records")
    print(f"accounts: {len(results['accounts'])} records")

    # Single-context high-throughput
    records = await async_generate("hr_employee", count=1000, parallelism=5)
    print(f"employees: {len(records)} records")

asyncio.run(main())

See examples/async_generation.py for more patterns including explicit labels, locale-aware parallel generation, and concurrency wave control.

Raises:

  • ValueError — empty specs list, count < 1, or parallelism < 1
  • ImportError: global_unique_fields is set but faker is not installed
  • RuntimeError / ValidationError — propagated from any failed task
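
The concurrency model described in this section (blocking calls offloaded with asyncio.to_thread, capped by a semaphore) can be sketched without any provider at all. blocking_provider_call is a hypothetical stand-in for a synchronous AI call, and generate_in_batches is an illustrative name, not the library's function.

```python
import asyncio
import math
import time
from typing import Any, Dict, List, Optional


def blocking_provider_call(n: int) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a synchronous AI provider call."""
    time.sleep(0.05)
    return [{"id": i} for i in range(n)]


async def generate_in_batches(count: int, parallelism: int = 3,
                              batch_size: Optional[int] = None) -> List[Dict[str, Any]]:
    if count < 1 or parallelism < 1:
        raise ValueError("count and parallelism must be >= 1")
    size = batch_size or math.ceil(count / parallelism)
    sizes = [min(size, count - start) for start in range(0, count, size)]
    semaphore = asyncio.Semaphore(parallelism)  # at most `parallelism` calls in flight

    async def run_batch(n: int) -> List[Dict[str, Any]]:
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free
            return await asyncio.to_thread(blocking_provider_call, n)

    batches = await asyncio.gather(*(run_batch(n) for n in sizes))
    return [record for batch in batches for record in batch]


records = asyncio.run(generate_in_batches(10, parallelism=3))
print(len(records))
```

With count=10 and parallelism=3 the sketch plans batches of 4, 4, and 2 records and runs them concurrently.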

list_contexts() / get_context_schema()

from testdata_ai import list_contexts, get_context_schema

# All context names
names = list_contexts()

# Filter by category
finance_contexts = list_contexts(category="finance")

# Inspect a schema
schema = get_context_schema("ecommerce_customer")
print(schema.fields)       # ['name', 'email', 'age', ...]
print(schema.description)  # 'e-commerce customer profiles'
print(schema.category)     # 'ecommerce'
print(schema.sample)       # full sample dict
print(schema.prompt_hints) # list of generation hints

Sample output

{
  "name": "Aisha Patel",
  "email": "aisha.patel.2024@gmail.com",
  "age": 28,
  "location": {
    "city": "Mumbai",
    "country": "India",
    "timezone": "Asia/Kolkata"
  },
  "shopping_behavior": {
    "frequency": "weekly",
    "avg_order_value": "$45-80",
    "preferred_categories": ["electronics", "books"],
    "device": "mobile",
    "payment_method": "upi"
  },
  "joined_date": "2023-04-15",
  "loyalty_tier": "silver"
}

Custom Contexts

The 13 built-in contexts cover common domains, but you can define your own for any data shape your project needs.

File-based (YAML or JSON)

Create a YAML file where each top-level key is a context name:

# my_contexts.yaml
game_character:
  description: "RPG game character profiles"
  category: "gaming"
  sample:
    character_id: "CHAR-0042"
    name: "Theron Blackwood"
    class: "Ranger"
    level: 15
    gold: 340
  prompt_hints:
    - "Fantasy names from diverse real-world cultures"
    - "Classes: Warrior, Mage, Ranger, Rogue, Cleric, Paladin, Druid, Bard"
    - "Level range 1-20; gold 10-5000 depending on level"

Load it with --context-file on any CLI command:

testdata-ai generate --context game_character --context-file my_contexts.yaml --count 5
testdata-ai list-contexts --context-file my_contexts.yaml
testdata-ai show-context game_character --context-file my_contexts.yaml

The flag is repeatable — pass multiple files to load several context collections at once.

JSON files are also supported (same structure, .json extension).

Programmatic (register_context)

Register contexts at runtime from Python — useful in conftest.py or application setup:

from testdata_ai import register_context, ContextSchema

# Using ContextSchema
register_context("game_npc", ContextSchema(
    description="RPG non-player character profiles",
    category="gaming",
    sample={
        "npc_id": "NPC-0011",
        "name": "Mira Dawnwhisper",
        "role": "innkeeper",
        "disposition": "friendly",
        "gold": 80,
    },
    prompt_hints=[
        "Fantasy names from diverse real-world cultures",
        "Roles: innkeeper, blacksmith, guard, merchant, quest-giver",
        "Gold: 10-500 depending on role",
    ],
))

# Using a plain dict (no import of ContextSchema needed)
register_context("game_item", {
    "description": "RPG inventory items",
    "category": "gaming",
    "sample": {"item_id": "ITM-099", "name": "Elven Cloak", "rarity": "rare", "value_gold": 250},
    "prompt_hints": ["Rarities: common, uncommon, rare, epic, legendary"],
})

Both approaches register the context globally for the current process — DataGenerator and the pytest plugin pick it up immediately.

Loading from Python

from testdata_ai import load_contexts_from_file

names = load_contexts_from_file("my_contexts.yaml")  # returns ['game_character']

Schema rules

| Field | Required | Notes |
| --- | --- | --- |
| description | yes | Non-empty string |
| sample | yes | Non-empty dict; keys become the required field names |
| prompt_hints | yes | List of strings (empty list is allowed but reduces output quality) |
| category | no | Defaults to "custom" |
| field_providers | no | Dict mapping field name → "faker:method_name". Requires pip install testdata-ai[faker] |
| unique_fields | no | List of field names (must be a subset of field_providers keys) that will be unique within a batch |

Name rules: context names must start with a letter or underscore and contain only letters, digits, and underscores (snake_case recommended).

Warnings: register_context and load_contexts_from_file emit a UserWarning when prompt_hints is empty or when the sample contains nested dicts/lists (nested types are not validated at runtime).

Overwriting: pass overwrite=True to replace an existing context (including built-ins). A warning is emitted when a built-in is shadowed.

Atomicity: if a file contains multiple contexts and one fails validation, none of them are registered.


Faker Hybrid Mode

AI excels at semantic coherence — names, locations, and behaviors that feel like real people. Faker excels at format correctness — emails that pass regex checks, IBANs with valid checksums, UUIDs that are actually valid.

Faker hybrid mode combines both: AI generates the full record, then Faker overwrites specific fields with guaranteed-valid values.

pip install "testdata-ai[faker]"

Add field_providers to any ContextSchema:

from testdata_ai import register_context, ContextSchema, DataGenerator

register_context("banking_pl", ContextSchema(
    description="Polish retail banking customer",
    sample={
        "name": "Jan Kowalski",
        "email": "jan.kowalski@bank.pl",
        "iban": "PL61109010140000071219812874",
        "phone": "+48 123 456 789",
        "balance": 4250.00,
    },
    prompt_hints=["Realistic Polish names", "Balance 500–50000 PLN"],
    field_providers={
        "email": "faker:email",
        "iban":  "faker:iban",
        "phone": "faker:phone_number",
    },
))

gen = DataGenerator(locale="pl_PL")
records = gen.generate("banking_pl", count=10)
# → AI generates name + balance (semantically coherent)
# → Faker generates email + iban + phone (format guaranteed)

Works with generate_from_model too:

from testdata_ai import generate_from_model

# Customer is your own Pydantic model (e.g. with email and phone fields), defined elsewhere
data = generate_from_model(
    Customer,
    count=10,
    field_providers={"email": "faker:email", "phone": "faker:phone_number"},
)

How it works:

  1. AI generates the complete record (all fields, semantically coherent)
  2. Faker overwrites only the listed fields with format-guaranteed values
  3. Schema validation runs on the final combined record
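
The three steps above can be sketched without any AI provider or Faker installed. In this illustration the AI record is a hard-coded dict and the provider is a plain callable standing in for a Faker method; apply_field_providers is a hypothetical helper, not a testdata-ai function.

```python
import uuid
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def apply_field_providers(
    record: Record,
    providers: Dict[str, Callable[[], Any]],
    required: List[str],
) -> Record:
    """Overwrite the listed fields, then check that every required field is present."""
    merged = dict(record)  # copy so the AI record is left untouched
    for field, make_value in providers.items():
        merged[field] = make_value()  # step 2: provider value replaces the AI value
    missing = [field for field in required if field not in merged]
    if missing:  # step 3: validate the final combined record
        raise ValueError(f"missing fields: {missing}")
    return merged


# Step 1 (stubbed): pretend this came back from the model.
ai_record = {"name": "Jan Kowalski", "email": "not-an-email", "balance": 4250.0}

# Stand-in provider; the real feature would call a Faker method here.
providers = {"email": lambda: f"user-{uuid.uuid4().hex[:8]}@example.com"}

final = apply_field_providers(ai_record, providers, required=["name", "email", "balance"])
print(final)
```

Because the overwrite happens after generation, fields that Faker does not touch (name and balance here) keep the values the model chose.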

Faker locale follows DataGenerator.locale: DataGenerator(locale="pl_PL") gives Polish phone numbers and emails automatically.

Common providers:

| Spec | Example output |
|---|---|
| faker:email | anna.kowalska@example.com |
| faker:phone_number | +48 123 456 789 |
| faker:iban | PL61 1090 1014 0000 0712 1981 2874 |
| faker:uuid4 | 550e8400-e29b-41d4-a716-446655440000 |
| faker:url | https://example.com/path |
| faker:ipv4 | 192.168.1.42 |
| faker:date | 2024-03-15 |
| faker:postcode | 00-001 |
| faker:company | Kowalski & Synowie Sp. z o.o. |

Full list: faker.readthedocs.io → Providers


Unique Field Constraints

unique_fields works with any field backed by a Faker method — emails, UUIDs, usernames, phone numbers, IBANs, IP addresses, and more. Add it to any ContextSchema to guarantee no duplicates within a generated batch:

from testdata_ai import register_context, ContextSchema, DataGenerator

register_context("saas_user", ContextSchema(
    description="SaaS trial user",
    sample={
        "name": "Alice Chen",
        "email": "alice@startup.io",
        "company": "Acme Inc",
        "plan": "trial",
    },
    prompt_hints=["Diverse professional names", "Plans: trial / starter / pro / enterprise"],
    field_providers={
        "email": "faker:email",
    },
    unique_fields=["email"],   # no duplicate emails in the batch
))

gen = DataGenerator()
records = gen.generate("saas_user", count=100)
emails = [r["email"] for r in records]
assert len(emails) == len(set(emails))  # always passes

Multiple unique fields at once:

register_context("order", ContextSchema(
    ...,
    field_providers={
        "order_id": "faker:uuid4",
        "customer_email": "faker:email",
    },
    unique_fields=["order_id", "customer_email"],
))

Works with generate_from_model too:

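from testdata_ai import generate_from_model

# UserSchema is your own Pydantic model with user_id and email fields, defined elsewhere
records = generate_from_model(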
    UserSchema,
    count=50,
    field_providers={"user_id": "faker:uuid4", "email": "faker:email"},
    unique_fields=["user_id", "email"],
)

And in YAML context files:

employee_unique:
  description: "HR employee with unique email"
  sample:
    name: "Fatima Al-Rashid"
    email: "f.alrashid@corp.com"
    department: "Engineering"
    salary: 125000
  prompt_hints:
    - "Diverse names from different cultures"
    - "Salary 50k–250k depending on seniority"
  field_providers:
    email: "faker:email"
  unique_fields:
    - email

Rules:

  • unique_fields must be a subset of field_providers keys — validated at schema construction time with a clear error
  • Works with any Faker method that has sufficient cardinality: faker:email, faker:uuid4, faker:user_name, faker:phone_number, faker:iban, faker:ipv4, faker:company, etc.
  • Avoid low-cardinality methods (e.g. faker:boolean has only 2 values) — Faker raises UniquenessException if it exhausts all possible distinct values
  • Uniqueness is guaranteed within a single generate() call (one batch). Across multiple generate_batched() iterations, each batch is internally unique but values can repeat between batches
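
The last two rules are easier to see in a small model of a uniqueness proxy: keep drawing until an unseen value appears, and give up after a fixed number of attempts. This sketch uses only the standard library; unique_values and UniquenessExhausted are hypothetical names, and the retry limit of 1000 is an assumption rather than Faker's documented behavior.

```python
import random
from typing import Callable, Hashable, List, Optional, Set


class UniquenessExhausted(Exception):
    """Raised when no unseen value turns up within max_attempts draws."""


def unique_values(
    make_value: Callable[[], Hashable],
    count: int,
    seen: Optional[Set[Hashable]] = None,
    max_attempts: int = 1000,
) -> List[Hashable]:
    """Draw `count` values that are distinct from each other and from `seen`."""
    seen = set() if seen is None else seen
    values: List[Hashable] = []
    for _ in range(count):
        for _attempt in range(max_attempts):
            candidate = make_value()
            if candidate not in seen:
                seen.add(candidate)
                values.append(candidate)
                break
        else:
            raise UniquenessExhausted(f"no new value after {max_attempts} attempts")
    return values


rng = random.Random(0)
draw_id = lambda: rng.randint(1, 50)

# Separate batches: each is internally unique, but values may repeat between them.
batch_a = unique_values(draw_id, 20)
batch_b = unique_values(draw_id, 20)

# Sharing one `seen` set across calls removes repeats between batches.
shared: Set[Hashable] = set()
batch_c = unique_values(draw_id, 20, seen=shared)
batch_d = unique_values(draw_id, 20, seen=shared)

# A generator with only two possible values cannot supply three distinct ones.
try:
    unique_values(lambda: rng.choice([True, False]), 3)
except UniquenessExhausted as exc:
    print("exhausted:", exc)
```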

Pytest Plugin

The plugin ships with the package and is auto-loaded via the pytest11 entry point — no import or conftest setup needed.

Marker fixture: testdata

Function-scoped. Use with @pytest.mark.testdata to generate any context at any count. count defaults to 1 if omitted.

import pytest

@pytest.mark.testdata(context="ecommerce_customer", count=5)
def test_checkout_flow(testdata):
    assert len(testdata) == 5
    assert all("email" in row for row in testdata)

@pytest.mark.testdata(context="banking_user", count=1)
def test_single_bank_user(testdata):
    user = testdata[0]
    assert 300 <= user["credit_score"] <= 850

# Generate data in a specific locale
@pytest.mark.testdata(context="ecommerce_customer", count=3, locale="pl")
def test_polish_customers(testdata):
    assert len(testdata) == 3

Auto-generated context fixtures

For every context, the plugin auto-generates two session-scoped fixtures:

| Fixture name | Returns | Example |
|---|---|---|
| <context> | Single dict (1 record) | ecommerce_customer |
| <context>s | List of 10 dicts | ecommerce_customers |

def test_single(ecommerce_customer):
    assert "email" in ecommerce_customer

def test_list(ecommerce_customers):
    assert len(ecommerce_customers) == 10

def test_patient(healthcare_patient):
    assert "blood_type" in healthcare_patient

def test_employees(hr_employees):
    assert all("salary" in e for e in hr_employees)

Caching and seeds

The plugin caches AI responses to avoid redundant API calls within and across test runs. Cache files live in .testdata_ai_cache/. Add .testdata_ai_cache/ and .testdata_ai.log to your .gitignore.

A seed is a named cache snapshot. Pass --testdata-seed to name a cache and reuse it on later runs:

# First run: generate data and save under "smoke-seed"
pytest --testdata-seed smoke-seed

# Subsequent runs: reuse the cached data (no AI calls)
pytest --testdata-seed smoke-seed

# Reuse the most recently used named seed
pytest --testdata-last-seed

Without --testdata-seed, a temporary seed is created per run and deleted automatically when the session ends.
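
Conceptually, a named seed behaves like a keyed snapshot on disk: the first request for a key calls the generator and stores the result, and later requests read it back. The sketch below illustrates that idea only. The one-JSON-file-per-seed layout, the key format, and the helper load_or_generate are assumptions, not the plugin's actual storage format.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_generate(cache_dir: Path, seed: str, key: str, generate: Callable[[], Any]) -> Any:
    """Return the cached value for `key` under `seed`, generating it on a miss."""
    path = cache_dir / f"{seed}.json"  # assumed layout: one JSON file per seed
    snapshot = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if key not in snapshot:
        snapshot[key] = generate()  # the only place an AI call would happen
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(snapshot), encoding="utf-8")
    return snapshot[key]


calls = []


def fake_ai_call() -> list:
    """Stand-in for a provider call; records how often it is invoked."""
    calls.append(1)
    return [{"email": "a@example.com"}]


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / ".testdata_ai_cache"
    first = load_or_generate(cache_dir, "smoke-seed", "ecommerce_customer:1", fake_ai_call)
    second = load_or_generate(cache_dir, "smoke-seed", "ecommerce_customer:1", fake_ai_call)

print(first == second, len(calls))
```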

Seed and cache management

These options perform an admin action and exit without running tests:

# List all available seeds
pytest --testdata-list-seeds

# Show what's cached in the current (or a specific) seed
pytest --testdata-show-cache
pytest --testdata-show-cache smoke-seed

# Delete a specific seed
pytest --testdata-delete-seed smoke-seed

# Delete the last used seed
pytest --testdata-delete-last

# Clear all seeds and reset the last-seeds queue
pytest --testdata-clear-cache

pytest-xdist support

When running with pytest-xdist, each worker will make its own AI calls unless you specify a shared named seed:

# Recommended: share one cache across all workers
pytest -n 4 --testdata-seed my-seed

Without --testdata-seed, a warning is printed per worker.

Manual fixture pattern

If you prefer explicit control in conftest.py:

# conftest.py
import pytest
from testdata_ai import DataGenerator

@pytest.fixture(scope="session")
def test_customers():
    gen = DataGenerator()
    return gen.generate("ecommerce_customer", count=10)

# test_checkout.py
def test_checkout_flow(test_customers):
    customer = test_customers[0]
    assert customer["email"]
    assert customer["age"] >= 18

Logging

The plugin writes structured logs to .testdata_ai.log (rotating, max 5 MB × 3 backups) and to stderr. Log entries include seed name and xdist worker ID.
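
A comparable setup can be reproduced with the standard library's RotatingFileHandler. This is a sketch under stated assumptions: "5 MB" is read as 5 × 1024 × 1024 bytes, and the logger name, message format, and build_logger helper are hypothetical rather than taken from the plugin.

```python
import logging
import sys
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB
BACKUPS = 3


def build_logger(log_path: Path, seed: str, worker: str) -> logging.LoggerAdapter:
    """Log to a rotating file and to stderr, tagging each entry with seed and worker."""
    logger = logging.getLogger(f"testdata_ai_sketch.{worker}")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s seed=%(seed)s worker=%(worker)s %(message)s"
    )
    handlers = [
        RotatingFileHandler(log_path, maxBytes=MAX_BYTES, backupCount=BACKUPS, encoding="utf-8"),
        logging.StreamHandler(sys.stderr),
    ]
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    # The adapter injects seed and worker into every record's extra fields.
    return logging.LoggerAdapter(logger, {"seed": seed, "worker": worker})


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / ".testdata_ai.log"
    log = build_logger(log_file, seed="smoke-seed", worker="gw0")
    log.info("cache hit for ecommerce_customer")
    file_handler = log.logger.handlers[0]
    # Close and detach handlers so the temporary directory can be removed cleanly.
    for handler in list(log.logger.handlers):
        handler.close()
        log.logger.removeHandler(handler)
    written = log_file.read_text(encoding="utf-8")
```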


Available Contexts

| Context | Category | Key Fields |
|---|---|---|
| ecommerce_customer | ecommerce | name, email, age, location, shopping_behavior, joined_date, loyalty_tier |
| banking_user | finance | name, email, age, account_type, balance, monthly_income, credit_score, branch, account_opened |
| saas_trial | saas | name, email, company, role, plan, signup_date, trial_expires, usage_stats |
| healthcare_patient | healthcare | patient_id, name, date_of_birth, gender, blood_type, primary_diagnosis, medications, allergies, insurance_provider, last_visit, attending_physician |
| education_student | education | student_id, name, email, age, major, minor, year, gpa, enrollment_status, courses, advisor |
| b2b_lead | b2b | lead_id, contact_name, email, phone, company, industry, company_size, job_title, lead_source, lead_score, deal_value, stage, notes |
| hr_employee | hr | employee_id, name, email, department, job_title, hire_date, salary, employment_type, manager, location, performance_rating |
| real_estate_listing | real_estate | listing_id, address, property_type, bedrooms, bathrooms, sqft, year_built, list_price, status, days_on_market, agent, features |
| iot_device | iot | device_id, device_type, manufacturer, firmware_version, location, status, battery_level, last_reading, alert_threshold, installed_date |
| social_media_profile | social_media | username, display_name, bio, followers, following, posts, verified, joined, category, engagement_rate, top_hashtags |
| travel_booking | travel | booking_id, passenger_name, email, trip_type, origin, destination, departure_date, return_date, cabin_class, total_price, currency, travelers, status, add_ons |
| restaurant_order | food | order_id, customer_name, restaurant, cuisine, items, subtotal, delivery_fee, tip, total, payment_method, order_type, status, ordered_at |
| logistics_shipment | logistics | tracking_number, carrier, origin, destination, ship_date, estimated_delivery, actual_delivery, weight_kg, dimensions_cm, contents, status, last_checkpoint |

Run testdata-ai list-contexts to see all contexts, or testdata-ai show-context <name> for full field details and a sample record.


Development Roadmap

Done:

  • OpenAI + Anthropic + Ollama + Gemini + Mistral + Cohere provider-agnostic architecture
  • 13 built-in contexts across 13 categories
  • Schema validation with missing-field reporting
  • CLI (generate, list-contexts, show-context, list-models) with JSON, JSONL, CSV, YAML, and SQL output
  • Auto token estimation and adjustment
  • Spinner with elapsed time (animated on TTY, static on non-TTY)
  • python -m testdata_ai support
  • Pytest plugin: marker fixture, auto-context fixtures, seed/cache system
  • Seed cache management CLI options (list, show, delete, clear)
  • TEMP seed auto-cleanup after session
  • pytest-xdist support with shared named seeds
  • Rotating log file (.testdata_ai.log)
  • Batch generation / streaming — generate_batched(), --batch-size, progressive JSONL/YAML output
  • Custom contexts — register_context(), load_contexts_from_file(), --context-file CLI option
  • PyPI publish — pip install testdata-ai · py.typed marker for fully typed public API
  • Locale / language support — --locale pl / DataGenerator(locale="ja") / AI_LOCALE env var; pytest plugin marker support
  • Schema-from-model — generate_from_model(MyPydanticModel) / generate_from_model(json_schema_dict) / --schema-file CLI option
  • Faker hybrid mode — field_providers={"email": "faker:email"} in ContextSchema; optional testdata-ai[faker] extra; locale-aware
  • Unique field constraints — unique_fields=["email", "user_id"] in ContextSchema; uses Faker's uniqueness proxy; per-batch guarantee
  • SQL output format — -o sql with CREATE TABLE IF NOT EXISTS + INSERT INTO; type inference; --table override
  • Relationship generation — generate_with_relationships() / generate-related CLI; graph YAML files; semantic coherence (parent records in child prompt); guaranteed FK integrity; topological sort; batch generation
  • Async / parallel generation — generate_parallel() / async_generate() / GenerateSpec; asyncio + thread pool; cross-call Faker dedup via global_unique_fields; semaphore concurrency cap
  • More providers — Google Gemini (gemini-2.0-flash), Mistral (mistral-small-latest), Cohere (command-r)

Next:

  • /docs folder — installation, quickstart, CLI reference, API reference, custom contexts, pytest integration
  • pandas output — DataGenerator.to_dataframe() convenience method

Contributing

Contributions welcome — see CONTRIBUTING.md for the full guide. See CHANGELOG.md for version history.


Related

  • qa-ai-prompts — 100+ battle-tested AI prompts for QA engineers. Copy, paste, customize — get results in seconds.

License

MIT License — see LICENSE


Built by TestCraft AI
