Generate realistic synthetic data from a simple JSON schema — library + CLI
Schematico
Describe the data you want. Get it back as clean JSON.
Find real public records on the live web, or synthesize realistic ones — from one tiny schema.
Why Schematico exists
Schematico started with a frustration: the public data I needed was out there, but never in a form I could use. Congressional race results, public filings, sports stats, niche reference tables — scattered across a dozen sites, trapped in HTML, or sitting behind a paywall. Getting a clean table meant hours of copy-paste, fragile scrapers, and manual cleanup.
So I built a tool where you describe the shape of the data you want — once, in a few lines — and let an AI agent go find it on the live web and hand it back as structured JSON.
One schema. Two ways to fill it.
| Mode | What it does | Needs |
|---|---|---|
| discover | An AI agent searches the live web (via Tavily) and returns real records matching your schema. | LLM key + TAVILY_API_KEY |
| generate | An LLM synthesizes realistic, coherent records from your schema, with no web access; the data is entirely made up. | LLM key |
Both modes are LLM-backed. Output is always validated against your schema and de-duplicated before you get it.
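To make "validated and de-duplicated" concrete, here is a minimal standard-library sketch of the idea. The helper name clean_records, the presence-of-keys check, and the exact-match notion of a duplicate are all assumptions made for illustration; Schematico's own validation may work differently.

```python
import json


def clean_records(records, required_keys):
    """Keep records that contain every required key, dropping exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Validation step (simplified): every required key must be present.
        if not all(key in record for key in required_keys):
            continue
        # Canonical JSON gives a stable fingerprint regardless of key order.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


raw = [
    {"id": "1", "email": "ana@example.com"},
    {"email": "ana@example.com", "id": "1"},  # same record, different key order
    {"id": "2"},  # missing "email", so it fails validation
]
print(clean_records(raw, ["id", "email"]))
```

Only the first record survives: the second is a duplicate of it and the third is incomplete.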
Install
pipx install schematico # as a CLI tool
uv add schematico # as a library
pip install schematico # the classic way
Library usage
Define a schema as a Pydantic model and call run_generation or run_discovery:
from pydantic import BaseModel, Field
from schematico import run_generation

class User(BaseModel):
    id: str = Field(description="UUID v4")
    full_name: str = Field(description="realistic full name")
    email: str = Field(description="work email matching the name")
    role: str = Field(description="one of: admin, editor, viewer")

records = run_generation(
    User,
    samples=10,
    instructions="EU-based users only. Emails must match the full_name.",
)
# -> list[dict], validated and de-duplicated
Prefer JSON schemas? Load one and run it:
from schematico import model_from_json, run_generation
model, rows, instructions = model_from_json("schema.json")
records = run_generation(model, samples=rows, instructions=instructions)
To find real data on the web instead, swap in run_discovery (needs
TAVILY_API_KEY):
from schematico import run_discovery
records = run_discovery(User, samples=25, instructions="...")
Both functions accept an optional progress_cb(found, total, event) callback and
a logfire_token for tracing.
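A callback with the (found, total, event) shape could look like the sketch below. The factory name make_progress_printer is hypothetical, the type of event is not documented here so it is treated as an opaque value, and the two calls at the bottom only simulate what the library might send.

```python
def make_progress_printer(log):
    """Return a callback with the (found, total, event) shape described above."""
    def progress_cb(found, total, event):
        # `event` is formatted as-is; its real type is an unknown here.
        percent = 0 if total == 0 else round(100 * found / total)
        log.append(f"{found}/{total} ({percent}%) {event}")
    return progress_cb


log = []
cb = make_progress_printer(log)
# Simulated calls standing in for whatever the library reports during a run.
cb(5, 25, "batch")
cb(25, 25, "done")
print("\n".join(log))
```

Presumably the callback is then passed as progress_cb=cb to run_generation or run_discovery.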
Bring your own models
Schematico runs on pydantic-ai, so you can point it at virtually any model — hosted, gateway-routed, or local — and even build a failover chain that tries each in order.
from schematico import SchematicoModel, get_llm_model, run_discovery

model = get_llm_model([
    # try the gateway first…
    SchematicoModel(model="gateway/anthropic:claude-sonnet-4-6"),
    # …fall back to a direct provider…
    SchematicoModel(model="openai:gpt-4.1", api_key="sk-..."),
    # …then a local, keyless model.
    SchematicoModel(model="ollama:llama3.2", base_url="http://localhost:11434/v1"),
])

# MySchema stands for your own Pydantic model, e.g. the User class above.
records = run_discovery(MySchema, samples=50, model=model)
- A bare model string ("anthropic:claude-sonnet-4-6") reads credentials from the provider's usual env var.
- A SchematicoModel lets you pin api_key and base_url per model.
- A list becomes an automatic failover chain.
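The failover idea itself is simple to picture: try each option in order and return the first one that works. The sketch below shows that pattern with plain functions. It is not how Schematico or pydantic-ai implement it, and first_success, gateway, and local_model are hypothetical names used only for illustration.

```python
def first_success(attempts):
    """Call each zero-argument callable in order; return the first result that succeeds."""
    errors = []
    for attempt in attempts:
        try:
            return attempt()
        except Exception as exc:  # a real chain would likely catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} attempts failed: {errors}")


def gateway():
    raise ConnectionError("gateway unreachable")  # simulated outage


def local_model():
    return "response from local model"


print(first_success([gateway, local_model]))
```

Here the first attempt fails, so the result comes from the second entry in the list.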
From the CLI, set the model per project (schematico new, or
schematico <mode> use model <id>) and the env var that holds its key.
Quick start (CLI)
# 1. Point Schematico at a model. The default routes Claude through the
# Pydantic AI Gateway; set its key (or see "Bring your own models" above).
export PYDANTIC_AI_GATEWAY_API_KEY=...
# 2. For discover mode, add a Tavily key (free tier at https://tavily.com).
export TAVILY_API_KEY=...
# 3. Create a project interactively. You'll be prompted for mode, schema,
# output dir, count, and model. State lives in ./.schematico/.
schematico new
# 4. Run it.
schematico discover # find real records on the web
# or
schematico generate # synthesize records
Output is written to ./.schematico/output/<project>_<timestamp>.json by default.
Override per run with --output FILE_OR_DIR and --count N.
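To pick up the newest result from a script, something like the sketch below could work. It assumes the default output directory and the <project>_<timestamp>.json naming described above, sorts by file modification time because the timestamp format is not specified here, and assumes the file holds a JSON array. The helper name latest_output is hypothetical.

```python
import json
from pathlib import Path


def latest_output(project, output_dir=".schematico/output"):
    """Return the most recently modified <project>_*.json file, or None if there is none."""
    candidates = sorted(
        Path(output_dir).glob(f"{project}_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


path = latest_output("congressional_elections")
if path is not None:
    records = json.loads(path.read_text(encoding="utf-8"))
    print(len(records), "records in", path.name)
```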
Command reference
schematico new # interactive project wizard
schematico list # all saved project configs
schematico generate [--config N] # synthesize records (uses default project)
schematico discover [--config N] # find real records on the web
schematico delete NAME # delete a config (-m to disambiguate mode)
schematico help # the full command tree, every flag
Common flags on generate / discover: --config/-c, --output/-o,
--count/-n, --model/-m.
Schema format
A schema is a small JSON object describing the table you want:
{
  "table": "congressional_elections",
  "rows": 50,
  "instructions": "U.S. House races in the 2026 midterms.",
  "fields": [
    { "name": "district", "type": "string", "description": "state and district, e.g. 'CA-12'" },
    { "name": "election_date", "type": "string", "description": "ISO 8601 date" },
    { "name": "incumbent_party", "type": "enum", "values": ["D", "R", "I"] },
    { "name": "is_open_seat", "type": "bool" }
  ]
}
| Top-level key | Required | Meaning |
|---|---|---|
| table | ✅ | Name of the table (also names the output model). |
| fields | ✅ | List of field definitions (see below). |
| rows | optional | How many records to produce (default 25). |
| instructions | optional | Free-text guidance passed to the agent. |
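If you build schemas programmatically, a quick check of the top-level keys can catch mistakes before a run. The sketch below uses only the standard library; check_top_level is a hypothetical helper, not part of Schematico, and the "min" / "max" key names on the int field are an assumption about how the optional bounds are spelled.

```python
import json

schema = {
    "table": "users",
    "rows": 10,
    "instructions": "EU-based users only.",
    "fields": [
        {"name": "id", "type": "string", "description": "UUID v4"},
        {"name": "age", "type": "int", "min": 18, "max": 90},  # key names assumed
        {"name": "role", "type": "enum", "values": ["admin", "editor", "viewer"]},
    ],
}


def check_top_level(schema):
    """Return a list of problems with the top-level keys (an empty list means OK)."""
    problems = []
    for key in ("table", "fields"):  # the two required keys
        if key not in schema:
            problems.append(f"missing required key: {key}")
    if not isinstance(schema.get("rows", 25), int):  # rows defaults to 25
        problems.append("rows must be an integer")
    return problems


print(check_top_level(schema))
print(json.dumps(schema, indent=2))
```

Saving the printed JSON as schema.json gives a file in the format that model_from_json is shown loading earlier.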
Field types
Types are deliberately minimal — the description does the heavy lifting.
| Type | Python | Notes |
|---|---|---|
| string | str | Any text. Shape it with description, e.g. "UUID v4", "ISO 8601 timestamp", "ISO 3166 country code". |
| int | int | Optional min / max. |
| float | float | Optional min / max. |
| bool | bool | true / false. |
| enum | one of values | Requires a non-empty values list. |
There's no dedicated uuid/timestamp type on purpose. Use string and say what you want in description; the model fills it in accordingly, and you're never boxed in by a fixed type list.
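As a rough picture of what these five types imply for a single value, the sketch below checks one value against one field definition. It is a standard-library illustration only: value_matches is a hypothetical helper, the "min" / "max" key names are assumed, and Schematico itself validates through Pydantic models rather than code like this.

```python
def value_matches(field, value):
    """Check one value against one field definition from the schema format."""
    kind = field["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "bool":
        return isinstance(value, bool)
    if kind == "enum":
        return value in field["values"]
    if kind in ("int", "float"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool):
            return False
        allowed = int if kind == "int" else (int, float)
        if not isinstance(value, allowed):
            return False
        low, high = field.get("min"), field.get("max")
        return (low is None or value >= low) and (high is None or value <= high)
    raise ValueError(f"unknown field type: {kind}")


age = {"name": "age", "type": "int", "min": 18, "max": 90}
party = {"name": "incumbent_party", "type": "enum", "values": ["D", "R", "I"]}
print(value_matches(age, 42), value_matches(age, 17), value_matches(party, "X"))
```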
Configuration
| Env var | Purpose |
|---|---|
| PYDANTIC_AI_GATEWAY_API_KEY | Key for the default gateway-routed model. |
| PAI_MODEL | Override the default model id (gateway/anthropic:claude-sonnet-4-6). |
| TAVILY_API_KEY | Required for discover mode (live web search). |
| LOGFIRE_TOKEN | Optional. Send traces, tool calls, and token usage to Logfire. |
| LOG_LEVEL | WARNING (default), INFO, or DEBUG. |
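A small preflight check can report which of these variables look unset before you start a run. This sketch assumes the default gateway-routed model (a different model would need a different key), and missing_env is a hypothetical helper, not a Schematico function.

```python
import os


def missing_env(mode, env=os.environ):
    """List variables that look unset for the given mode, assuming the default gateway model."""
    required = ["PYDANTIC_AI_GATEWAY_API_KEY"]
    if mode == "discover":
        required.append("TAVILY_API_KEY")  # discover mode needs live web search
    return [name for name in required if not env.get(name)]


# A fake environment is passed in so the example does not depend on your shell.
print(missing_env("discover", env={"PYDANTIC_AI_GATEWAY_API_KEY": "x"}))
print(missing_env("generate", env={"PYDANTIC_AI_GATEWAY_API_KEY": "x"}))
```

The check reads the process environment only, so values that live solely in a .env file are invisible to it until they have been loaded.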
Schematico auto-loads a .env file from the current directory — see
.env.example. Project configs live in ./.schematico/ as
<name>.<mode>.toml files.
Coming soon
Schematico is just getting started. On the roadmap:
- 📦 More output formats — CSV, Excel, SQL inserts, and Parquet, not just JSON.
- 🔎 Smarter discovery — source citations per record, deeper crawling, and a second-pass agent that verifies and de-duplicates findings.
- 🧩 Richer schemas — nested objects, relationships between tables, and reusable field presets.
- 🗄️ Direct sinks — write straight into a database or a dataframe.
- ⚡ Offline generation — a fast, keyless synthesis mode for when you don't want to call a model at all.
Have an idea? Open an issue — this is the moment to shape where it goes.
Contributing
Contributions are very welcome — issues, docs, and PRs all help. See
CONTRIBUTING.md to get from clone to PR in a couple of minutes.
The test suite mocks the LLM, so uv run pytest needs no API keys.
License
MIT. See LICENSE.