sf-synth
High-fidelity synthetic data generation for Snowflake.
A Snowpark-first Python library and CLI that generates realistic synthetic data inside Snowflake using auto-discovered schema, distribution statistics, Faker-based rules, and a DAG-driven referential-integrity engine. All generation runs server-side, so PII never leaves the account.
Features
- Snowpark-first execution: Data is generated entirely within Snowflake using Snowpark. No data egress required.
- Auto-discovery: Automatically detects tables, columns, types, and constraints (PK, FK, UNIQUE, NOT NULL) from INFORMATION_SCHEMA.
- Referential integrity: DAG-based generation ensures parent tables are populated before children. FK values are sampled from actual parent keys.
- Self-referential tables: Handles self-referential FKs (e.g., employees.manager_id → employees.id) via two-pass generation.
- Multi-schema support: Reference tables across different schemas within the same database (e.g., SALES.CUSTOMERS → CORE.COUNTRIES).
- Distribution-preserving: Samples from real column statistics (APPROX_TOP_K, APPROX_PERCENTILE, HLL) to preserve data distributions without exposing PII.
- Skewed FK distributions: Supports Zipf-weighted FK sampling (e.g., 80% of orders belong to 20% of customers).
- Semantic inference: Automatically infers generators from column names (e.g., email, phone, created_at).
- Deterministic output: Seed-based generation for reproducible results.
- YAML configuration: Simple, validated config with Pydantic.
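The referential-integrity engine builds a dependency graph from FK constraints and fills tables in topological order, so parents always exist before children reference them. A minimal sketch of that ordering using Python's standard-library graphlib (the table names and FK map here are illustrative, not sf-synth's internal API):

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: child table -> set of parent tables it references.
fk_deps = {
    "ORDERS": {"CUSTOMERS", "PRODUCTS"},
    "REVIEWS": {"ORDERS", "CUSTOMERS"},
    "CUSTOMERS": set(),
    "PRODUCTS": set(),
}

# TopologicalSorter yields a node only after all of its predecessors,
# so parent tables always come before the children that reference them.
order = list(TopologicalSorter(fk_deps).static_order())
print(order)  # e.g. ['CUSTOMERS', 'PRODUCTS', 'ORDERS', 'REVIEWS']
```

Note that a self-referential FK would make this graph cyclic (TopologicalSorter would raise CycleError), which is why sf-synth handles those tables with two-pass generation instead.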
Installation
pip install sf-synth
Or install from source:
git clone https://github.com/apareek/snowflake-synthesizer.git
cd snowflake-synthesizer
pip install -e ".[dev]"
Quick Start
1. Discover your schema
Generate a starter config by discovering your existing Snowflake schema:
sf-synth discover MY_DATABASE --output config.yaml
2. Edit the config
Customize row counts, add generators, and define relationships:
defaults:
  seed: 42
  database: MY_DATABASE
  schema: PUBLIC

tables:
  - name: CUSTOMERS
    rows: 10000
    columns:
      EMAIL:
        generator: faker
        provider: email
        unique: true
      MEMBERSHIP:
        generator: choice
        values: [Gold, Silver, Bronze]
        weights: [0.1, 0.3, 0.6]
  - name: ORDERS
    rows: 50000
    relationships:
      - column: CUSTOMER_ID
        references: CUSTOMERS.ID
        skew: zipf
3. Preview the plan
See the generation order and dependencies without executing:
sf-synth plan config.yaml
4. Generate data
sf-synth generate config.yaml
5. Clean up
Remove temporary tables created during generation:
sf-synth clean config.yaml
Configuration Reference
Defaults
defaults:
  seed: 42          # Random seed for reproducibility
  locale: en_US     # Faker locale
  database: MY_DB   # Default database
  schema: PUBLIC    # Default schema
  null_ratio: 0.0   # Default null ratio for all columns
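For intuition about what null_ratio means, here is a plain-Python sketch of applying a 5% null ratio to a generated column (illustrative only; sf-synth does the equivalent server-side in Snowflake, and the helper name is hypothetical):

```python
import random

def apply_null_ratio(values, null_ratio, seed=42):
    """Replace roughly `null_ratio` of the values with None, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> same nulls every run
    return [None if rng.random() < null_ratio else v for v in values]

col = apply_null_ratio(list(range(1000)), null_ratio=0.05)
nulls = sum(v is None for v in col)
print(nulls)  # roughly 50 of 1000 values are None
```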
Generator Types
| Generator | Description | Required Parameters |
|---|---|---|
| seq | Sequential integers | start, step |
| uniform | Uniform random numbers | min_value, max_value |
| choice | Random selection from list | values, weights (optional) |
| range | Values in numeric/date range | min_value, max_value |
| faker | Faker provider | provider, locale (optional) |
| distribution | Sample from source column stats | source (FQN: DB.SCHEMA.TABLE.COL) |
| regex | Pattern-based strings | pattern |
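For intuition, the choice generator's weighted selection behaves like Python's random.choices (sf-synth performs the equivalent in SQL inside Snowflake). A sketch using the MEMBERSHIP example from the Quick Start:

```python
import random

rng = random.Random(42)  # seeded for reproducible output
values = ["Gold", "Silver", "Bronze"]
weights = [0.1, 0.3, 0.6]

# Draw 10,000 weighted samples and tally them.
sample = rng.choices(values, weights=weights, k=10_000)
counts = {v: sample.count(v) for v in values}
print(counts)  # roughly Bronze ~6000, Silver ~3000, Gold ~1000
```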
Faker Providers
Common providers: email, name, first_name, last_name, phone_number, address, city, state, zipcode, country, company, job, date, date_time, uuid4, url, ipv4, ssn, credit_card_number.
Relationships
relationships:
  - column: CUSTOMER_ID      # FK column in this table
    references: CUSTOMERS.ID # Parent table.column
    null_ratio: 0.05         # 5% null FKs
    skew: zipf               # Distribution: uniform or zipf
    skew_param: 1.5          # Zipf exponent (higher = more skewed)
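To see what skew: zipf with skew_param: 1.5 does, here is a standard-library sketch of Zipf-weighted FK sampling (the helper below is illustrative, not sf-synth's API): parent key at rank i gets weight proportional to i**-1.5, so a small fraction of parents receives most child rows.

```python
import random

def zipf_fk_sample(parent_keys, n, s=1.5, seed=42):
    """Sample n FK values from parent_keys with Zipf(s) rank weights."""
    weights = [1 / (rank ** s) for rank in range(1, len(parent_keys) + 1)]
    rng = random.Random(seed)
    return rng.choices(parent_keys, weights=weights, k=n)

customers = list(range(1, 101))           # 100 parent keys
orders = zipf_fk_sample(customers, 50_000)

top20 = set(customers[:20])               # the 20 heaviest-weighted customers
share = sum(fk in top20 for fk in orders) / len(orders)
print(f"{share:.0%} of orders belong to the top 20% of customers")
```

With s = 1.5 the top 20% of customers end up with roughly 90% of the orders; lowering skew_param toward 0 flattens the distribution toward uniform.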
Python API
from sf_synth import SynthConfig, SynthEngine, discover_schema
from sf_synth.backend import SnowparkBackend
# Connect to Snowflake
backend = SnowparkBackend(connection_name="my_connection")
backend.connect()
# Discover schema
schema = backend.discover_schema("MY_DATABASE")
# Load config
from sf_synth.config import load_config
config = load_config("config.yaml")
# Generate
engine = SynthEngine(backend.session, config, schema_model=schema)
result = engine.generate()
print(f"Generated {result.total_rows} rows in {result.total_elapsed_seconds:.2f}s")
# Cleanup
engine.cleanup()
backend.disconnect()
Examples
The examples/ directory contains ready-to-use configurations:
| Example | Description |
|---|---|
| ecommerce.yaml | E-commerce schema with customers, products, orders, and reviews. Demonstrates FK relationships, Zipf-skewed distributions, and various generators. |
| selfref_employees.yaml | HR schema with a self-referential manager_id FK. Shows how sf-synth handles circular references via two-pass generation. |
| multi_schema.yaml | Enterprise schema spanning CORE, HR, SALES, and FINANCE schemas. Demonstrates cross-schema FK relationships within a single database. |
Architecture
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ CLI │────▶│ Config │────▶│ Discovery │
│ (Typer) │ │ (Pydantic) │ │ (INFO_SCH) │
└─────────────┘ └──────────────┘ └─────────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ DAG Builder│────▶│ Schema │
│ (networkx) │ │ Model │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Stats │────▶│ Engine │────▶│ RI Manager │
│ Sampler │ │ (Snowpark) │ │ (Parent Keys│
│ (APPROX_*) │ └─────────────┘ └─────────────┘
└─────────────┘ │
▼
┌─────────────┐
│ Snowflake │
│ Tables │
└─────────────┘
Connection Configuration
sf-synth uses standard Snowflake connection methods:
- Named connection (recommended): ~/.snowflake/connections.toml
- Environment variables: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, etc.
- CLI parameters: --connection, --account, etc.
Example ~/.snowflake/connections.toml:
[my_connection]
account = "myaccount"
user = "myuser"
authenticator = "externalbrowser"
database = "MY_DB"
schema = "PUBLIC"
warehouse = "COMPUTE_WH"
Performance Notes
- SQL-first generators (seq, uniform, choice, range) are fast and scale to billions of rows.
- Faker UDFs are slower due to Python UDF overhead. Use them only when SQL alternatives don't exist.
- Distribution sampling requires one-time stats queries per column but generates data efficiently.
- For very large tables (>100M rows), consider chunked generation or Snowflake-native GENERATOR() patterns.
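A minimal sketch of the chunked-generation idea (the helper below is hypothetical; sf-synth does not ship one): split a large target row count into fixed-size batches so each server-side insert stays a manageable size.

```python
def chunk_rows(total_rows, chunk_size=10_000_000):
    """Yield (offset, count) pairs that cover total_rows in fixed-size chunks."""
    offset = 0
    while offset < total_rows:
        count = min(chunk_size, total_rows - offset)
        yield offset, count
        offset += count

chunks = list(chunk_rows(250_000_000))
print(len(chunks), chunks[-1])  # 25 chunks; the last is (240000000, 10000000)
```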
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/unit/
# Run integration tests (requires Snowflake credentials)
SF_SYNTH_INTEGRATION_TESTS=1 pytest tests/integration/
# Lint
ruff check src/ tests/
# Type check
mypy src/
License
MIT License. See LICENSE for details.