Generate realistic synthetic data from a simple JSON schema — library + CLI

Schematico

Describe the data you want. Get it back as clean JSON.
Find real public records on the live web, or synthesize realistic ones — from one tiny schema.


Why Schematico exists

Schematico started with a frustration: the public data I needed was out there, but never in a form I could use. Congressional race results, public filings, sports stats, niche reference tables — scattered across a dozen sites, trapped in HTML, or sitting behind a paywall. Getting a clean table meant hours of copy-paste, fragile scrapers, and manual cleanup.

So I built a tool where you describe the shape of the data you want — once, in a few lines — and let an AI agent go find it on the live web and hand it back as structured JSON.

One schema. Two ways to fill it.

| Mode | What it does | Needs |
| --- | --- | --- |
| discover | An AI agent searches the live web (via Tavily) and returns real records matching your schema. | LLM key + TAVILY_API_KEY |
| generate | An LLM synthesizes realistic, coherent records from your schema. No web access, fully made-up data. | LLM key |

Both modes are LLM-backed. Output is always validated against your schema and de-duplicated before you get it.
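
As a rough illustration of what "validated and de-duplicated" can mean for a list of records, the sketch below drops records that miss required keys and then removes exact duplicates. It is not Schematico's implementation: the helper name `clean_records` is hypothetical, and treating only exact matches as duplicates is an assumption made for the example.

```python
import json


def clean_records(records, required_keys):
    """Keep records that have every required key, then drop exact duplicates.

    Hypothetical helper for illustration only; not part of Schematico.
    """
    seen = set()
    cleaned = []
    for record in records:
        # "Validation" here is only a key-presence check.
        if not all(key in record for key in required_keys):
            continue
        # Serialize with sorted keys so key order does not affect equality.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


rows = [
    {"district": "CA-12", "is_open_seat": False},
    {"is_open_seat": False, "district": "CA-12"},  # same record, different key order
    {"district": "TX-07"},                         # missing a required key
]
print(clean_records(rows, required_keys=["district", "is_open_seat"]))
```

Only the first record survives: the second is a duplicate of it and the third fails the key check.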


Install

pipx install schematico   # as a CLI tool
uv add schematico         # as a library
pip install schematico    # the classic way

Library usage

Define a schema as a Pydantic model and call run_generation or run_discovery:

from pydantic import BaseModel, Field
from schematico import run_generation

class User(BaseModel):
    id: str = Field(description="UUID v4")
    full_name: str = Field(description="realistic full name")
    email: str = Field(description="work email matching the name")
    role: str = Field(description="one of: admin, editor, viewer")

records = run_generation(
    User,
    samples=10,
    instructions="EU-based users only. Emails must match the full_name.",
)
# -> list[dict], validated and de-duplicated

Prefer JSON schemas? Load one and run it:

from schematico import model_from_json, run_generation

model, rows, instructions = model_from_json("schema.json")
records = run_generation(model, samples=rows, instructions=instructions)

To find real data on the web instead, swap in run_discovery (needs TAVILY_API_KEY):

from schematico import run_discovery
records = run_discovery(User, samples=25, instructions="...")

Both functions accept an optional progress_cb(found, total, event) callback and a logfire_token for tracing.
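
A callback with that three-argument shape could look like the sketch below. The README only names the arguments, so this assumes `found` and `total` are integers and that `event` can be turned into text with `str()`; the factory name `make_progress_printer` is made up for the example.

```python
def make_progress_printer(lines):
    """Return a callback with the (found, total, event) shape named above.

    Assumptions for this sketch: `found` and `total` are integers and
    `event` can be converted to text with str().
    """
    def progress_cb(found, total, event):
        # Guard against total == 0 so the percentage never divides by zero.
        percent = 100 * found // total if total else 0
        line = f"[{percent:3d}%] {found}/{total} {event}"
        lines.append(line)
        print(line)
    return progress_cb


log = []
cb = make_progress_printer(log)
cb(5, 25, "batch accepted")  # simulate one call by hand
```

You would then pass `cb` as the `progress_cb` argument of `run_generation` or `run_discovery`.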


Bring your own models

Schematico runs on pydantic-ai, so you can point it at virtually any model — hosted, gateway-routed, or local — and even build a failover chain that tries each in order.

from schematico import SchematicoModel, get_llm_model, run_discovery

model = get_llm_model([
    # try the gateway first…
    SchematicoModel(model="gateway/anthropic:claude-sonnet-4-6"),
    # …fall back to a direct provider…
    SchematicoModel(model="openai:gpt-4.1", api_key="sk-..."),
    # …then a local, keyless model.
    SchematicoModel(model="ollama:llama3.2", base_url="http://localhost:11434/v1"),
])

records = run_discovery(MySchema, samples=50, model=model)

  • A bare model string ("anthropic:claude-sonnet-4-6") reads credentials from the provider's usual env var.
  • A SchematicoModel lets you pin api_key and base_url per model.
  • A list becomes an automatic failover chain.
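
To make the idea of a failover chain concrete, here is a conceptual sketch in plain Python: try each candidate in order and return the first result that does not raise. It only illustrates the pattern. The real chain is whatever `get_llm_model` builds, and its error handling may differ; `first_success`, `gateway`, and `local` are stand-ins invented for this example.

```python
def first_success(candidates, prompt):
    """Call each (name, fn) pair in order and return the first result.

    Conceptual sketch only; not how Schematico or pydantic-ai implement it.
    """
    errors = []
    for name, fn in candidates:
        try:
            return name, fn(prompt)
        except Exception as exc:  # broad on purpose for the sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def gateway(prompt):
    raise ConnectionError("gateway unreachable")  # stand-in for a failed call


def local(prompt):
    return f"echo: {prompt}"  # stand-in for a working local model


used, answer = first_success([("gateway", gateway), ("local", local)], "hello")
print(used, answer)
```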

From the CLI, set the model per project (with schematico new, or schematico <mode> use model <id>) and export the env var that holds its key.


Quick start (CLI)

# 1. Point Schematico at a model. The default routes Claude through the
#    Pydantic AI Gateway; set its key (or see "Bring your own models" above).
export PYDANTIC_AI_GATEWAY_API_KEY=...

# 2. For discover mode, add a Tavily key (free tier at https://tavily.com).
export TAVILY_API_KEY=...

# 3. Create a project interactively. You'll be prompted for mode, schema,
#    output dir, count, and model. State lives in ./.schematico/.
schematico new

# 4. Run it.
schematico discover     # find real records on the web
# or
schematico generate     # synthesize records

Output is written to ./.schematico/output/<project>_<timestamp>.json by default. Override per run with --output FILE_OR_DIR and --count N.
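
If you want to pick up the results from Python afterwards, something like the sketch below should work. It selects the newest file by modification time, because the exact timestamp format in the file name is not spelled out here. The helper name `load_latest_output` is hypothetical, and the assumption is simply that each output file is valid JSON.

```python
import json
from pathlib import Path


def load_latest_output(output_dir=".schematico/output"):
    """Return (path, data) for the most recently modified JSON file.

    Uses file modification time rather than parsing the file name.
    """
    files = sorted(Path(output_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"no JSON output found in {output_dir}")
    latest = files[-1]
    with latest.open(encoding="utf-8") as handle:
        return latest, json.load(handle)
```

Call it as `path, records = load_latest_output()` from the project directory, or pass the directory you gave to --output.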

Command reference

schematico new                     # interactive project wizard
schematico list                    # all saved project configs
schematico generate [--config N]   # synthesize records (uses default project)
schematico discover [--config N]   # find real records on the web
schematico delete NAME             # delete a config (-m to disambiguate mode)
schematico help                    # the full command tree, every flag

Common flags on generate / discover: --config/-c, --output/-o, --count/-n, --model/-m.


Schema format

A schema is a small JSON object describing the table you want:

{
  "table": "congressional_elections",
  "rows": 50,
  "instructions": "U.S. House races in the 2026 midterms.",
  "fields": [
    { "name": "district",        "type": "string", "description": "state and district, e.g. 'CA-12'" },
    { "name": "election_date",   "type": "string", "description": "ISO 8601 date" },
    { "name": "incumbent_party", "type": "enum",   "values": ["D", "R", "I"] },
    { "name": "is_open_seat",    "type": "bool" }
  ]
}

| Top-level key | Required | Meaning |
| --- | --- | --- |
| table | yes | Name of the table (also names the output model). |
| fields | yes | List of field definitions (see below). |
| rows | no | How many records to produce (default 25). |
| instructions | no | Free-text guidance passed to the agent. |
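
A small pre-check like the one below can catch a malformed schema file before you spend any model calls on it. This is a sketch, not Schematico's loader (that is `model_from_json`): the function name `read_schema` is hypothetical, and it treats `table` and `fields` as the required keys and fills in the other two when they are absent.

```python
import json

DEFAULT_ROWS = 25  # the default stated for "rows" above


def read_schema(text):
    """Parse schema JSON and fill in the optional top-level keys.

    Hypothetical pre-check; "table" and "fields" are assumed to be required.
    """
    schema = json.loads(text)
    for key in ("table", "fields"):
        if key not in schema:
            raise ValueError(f"schema is missing required key: {key!r}")
    schema.setdefault("rows", DEFAULT_ROWS)
    schema.setdefault("instructions", "")
    return schema


schema = read_schema('{"table": "users", "fields": [{"name": "id", "type": "string"}]}')
print(schema["rows"])
```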

Field types

Types are deliberately minimal — the description does the heavy lifting.

| Type | Python | Notes |
| --- | --- | --- |
| string | str | Any text. Shape it with description, e.g. "UUID v4", "ISO 8601 timestamp", "ISO 3166 country code". |
| int | int | Optional min / max. |
| float | float | Optional min / max. |
| bool | bool | true / false. |
| enum | one of values | Requires a non-empty values list. |

There's no dedicated uuid / email / timestamp type on purpose. Use string and say what you want in description — the model fills it in accordingly, and you're never boxed in by a fixed type list.
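
To show how the five types map onto Python values, here is a sketch that checks one value against one field definition. It is illustrative only: Schematico's own validation goes through Pydantic and may be stricter or looser. The function name `check_value` is hypothetical, and it assumes the bounds are stored under the literal keys `min` and `max`.

```python
PYTHON_TYPES = {"string": str, "int": int, "float": float, "bool": bool}


def check_value(field, value):
    """Return True if `value` fits one field definition from a schema.

    Hypothetical helper; assumes bounds live under the keys "min" and "max".
    """
    kind = field["type"]
    if kind == "enum":
        return value in field["values"]
    if kind == "float" and isinstance(value, int) and not isinstance(value, bool):
        value = float(value)  # accept whole numbers for float fields
    if isinstance(value, bool) and kind != "bool":
        return False  # bool is a subclass of int in Python; reject it here
    if not isinstance(value, PYTHON_TYPES[kind]):
        return False
    if kind in ("int", "float"):
        if "min" in field and value < field["min"]:
            return False
        if "max" in field and value > field["max"]:
            return False
    return True


party = {"name": "incumbent_party", "type": "enum", "values": ["D", "R", "I"]}
print(check_value(party, "D"), check_value(party, "X"))
```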


Configuration

| Env var | Purpose |
| --- | --- |
| PYDANTIC_AI_GATEWAY_API_KEY | Key for the default gateway-routed model. |
| PAI_MODEL | Override the default model id (gateway/anthropic:claude-sonnet-4-6). |
| TAVILY_API_KEY | Required for discover mode (live web search). |
| LOGFIRE_TOKEN | Optional. Send traces, tool calls, and token usage to Logfire. |
| LOG_LEVEL | WARNING (default), INFO, or DEBUG. |

Schematico auto-loads a .env file from the current directory — see .env.example. Project configs live in ./.schematico/ as <name>.<mode>.toml files.
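
Before a long run it can help to confirm the keys are actually set. The sketch below does that for the default gateway model. The variable names come from the table above, but the per-mode grouping is this example's reading of what each mode needs, and `missing_env` is a hypothetical helper. It reads the process environment only and does not parse a .env file.

```python
import os

# Assumed grouping for the default gateway-routed model; adjust it if you
# point Schematico at a different provider.
REQUIRED_ENV = {
    "generate": ["PYDANTIC_AI_GATEWAY_API_KEY"],
    "discover": ["PYDANTIC_AI_GATEWAY_API_KEY", "TAVILY_API_KEY"],
}


def missing_env(mode, env=None):
    """List the variables that `mode` needs but that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[mode] if not env.get(name)]


print(missing_env("discover", env={"PYDANTIC_AI_GATEWAY_API_KEY": "example"}))
```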


Coming soon

Schematico is just getting started. On the roadmap:

  • 📦 More output formats — CSV, Excel, SQL inserts, and Parquet, not just JSON.
  • 🔎 Smarter discovery — source citations per record, deeper crawling, and a second-pass agent that verifies and de-duplicates findings.
  • 🧩 Richer schemas — nested objects, relationships between tables, and reusable field presets.
  • 🗄️ Direct sinks — write straight into a database or a dataframe.
  • Offline generation — a fast, keyless synthesis mode for when you don't want to call a model at all.

Have an idea? Open an issue — this is the moment to shape where it goes.


Contributing

Contributions are very welcome — issues, docs, and PRs all help. See CONTRIBUTING.md to get from clone to PR in a couple of minutes. The test suite mocks the LLM, so uv run pytest needs no API keys.

License

MIT. See LICENSE.
