sf-synth
High-fidelity synthetic data generation for Snowflake.
A Snowpark-first Python library and CLI that generates realistic synthetic data inside Snowflake using auto-discovered schema, distribution statistics, Faker-based rules, and a DAG-driven referential-integrity engine. All generation runs server-side, so PII never leaves the account.
Features
- Snowpark-first execution: Data is generated entirely within Snowflake using Snowpark. No data egress required.
- Auto-discovery: Automatically detects tables, columns, types, and constraints (PK, FK, UNIQUE, NOT NULL) from INFORMATION_SCHEMA.
- Referential integrity: DAG-based generation ensures parent tables are populated before children. FK values are sampled from actual parent keys.
- Self-referential tables: Handles self-referential FKs (e.g., employees.manager_id → employees.id) via two-pass generation.
- Multi-schema support: Reference tables across different schemas within the same database (e.g., SALES.CUSTOMERS → CORE.COUNTRIES).
- Distribution-preserving: Samples from real column statistics (APPROX_TOP_K, APPROX_PERCENTILE, HLL) to preserve data distributions without exposing PII.
- Skewed FK distributions: Supports Zipf-weighted FK sampling (e.g., 80% of orders belong to 20% of customers).
- Semantic inference: Automatically infers generators from column names (e.g., email, phone, created_at).
- Deterministic output: Seed-based generation for reproducible results.
- YAML configuration: Simple, validated config with Pydantic.
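The referential-integrity engine builds a dependency graph from FK constraints and fills tables in topological order, so parents always exist before children reference them. A minimal sketch of that ordering using Python's standard-library graphlib (the table names and FK map here are illustrative, not sf-synth's internal API):

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: child table -> set of parent tables it references.
fk_deps = {
    "ORDERS": {"CUSTOMERS", "PRODUCTS"},
    "REVIEWS": {"ORDERS", "CUSTOMERS"},
    "CUSTOMERS": set(),
    "PRODUCTS": set(),
}

# TopologicalSorter yields a node only after all of its predecessors,
# so parent tables always come before the children that reference them.
order = list(TopologicalSorter(fk_deps).static_order())
print(order)  # e.g. ['CUSTOMERS', 'PRODUCTS', 'ORDERS', 'REVIEWS']
```

Note that a self-referential FK would make this graph cyclic (TopologicalSorter would raise CycleError), which is why sf-synth handles those tables with two-pass generation instead.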
Installation
pip install sf-synth
Or install from source:
git clone https://github.com/apareek/snowflake-synthesizer.git
cd snowflake-synthesizer
pip install -e ".[dev]"
Quick Start
1. Discover your schema
Generate a starter config by discovering your existing Snowflake schema:
sf-synth discover MY_DATABASE --output config.yaml
2. Edit the config
Customize row counts, add generators, and define relationships:
defaults:
  seed: 42
  database: MY_DATABASE
  schema: PUBLIC

tables:
  - name: CUSTOMERS
    rows: 10000
    columns:
      EMAIL:
        generator: faker
        provider: email
        unique: true
      MEMBERSHIP:
        generator: choice
        values: [Gold, Silver, Bronze]
        weights: [0.1, 0.3, 0.6]
  - name: ORDERS
    rows: 50000
    relationships:
      - column: CUSTOMER_ID
        references: CUSTOMERS.ID
        skew: zipf
3. Preview the plan
See the generation order and dependencies without executing:
sf-synth plan config.yaml
4. Generate data
sf-synth generate config.yaml
5. Clean up
Remove temporary tables created during generation:
sf-synth clean config.yaml
Configuration Reference
Defaults
defaults:
  seed: 42          # Random seed for reproducibility
  locale: en_US     # Faker locale
  database: MY_DB   # Default database
  schema: PUBLIC    # Default schema
  null_ratio: 0.0   # Default null ratio for all columns
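For intuition about what null_ratio means, here is a plain-Python sketch of applying a 5% null ratio to a generated column (illustrative only; sf-synth does the equivalent server-side in Snowflake, and the helper name is hypothetical):

```python
import random

def apply_null_ratio(values, null_ratio, seed=42):
    """Replace roughly `null_ratio` of the values with None, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> same nulls every run
    return [None if rng.random() < null_ratio else v for v in values]

col = apply_null_ratio(list(range(1000)), null_ratio=0.05)
nulls = sum(v is None for v in col)
print(nulls)  # roughly 50 of 1000 values are None
```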
Generator Types
| Generator | Description | Required Parameters |
|---|---|---|
| seq | Sequential integers | start, step |
| uniform | Uniform random numbers | min_value, max_value |
| choice | Random selection from list | values, weights (optional) |
| range | Values in numeric/date range | min_value, max_value |
| faker | Faker provider | provider, locale (optional) |
| distribution | Sample from source column stats | source (FQN: DB.SCHEMA.TABLE.COL) |
| regex | Pattern-based strings | pattern |
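For intuition, the choice generator's weighted selection behaves like Python's random.choices (sf-synth performs the equivalent in SQL inside Snowflake). A sketch using the MEMBERSHIP example from the Quick Start:

```python
import random

rng = random.Random(42)  # seeded for reproducible output
values = ["Gold", "Silver", "Bronze"]
weights = [0.1, 0.3, 0.6]

# Draw 10,000 weighted samples and tally them.
sample = rng.choices(values, weights=weights, k=10_000)
counts = {v: sample.count(v) for v in values}
print(counts)  # roughly Bronze ~6000, Silver ~3000, Gold ~1000
```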
Faker Providers
Common providers: email, name, first_name, last_name, phone_number, address, city, state, zipcode, country, company, job, date, date_time, uuid4, url, ipv4, ssn, credit_card_number.
Relationships
relationships:
  - column: CUSTOMER_ID      # FK column in this table
    references: CUSTOMERS.ID # Parent table.column
    null_ratio: 0.05         # 5% null FKs
    skew: zipf               # Distribution: uniform or zipf
    skew_param: 1.5          # Zipf exponent (higher = more skewed)
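To see what skew: zipf with skew_param: 1.5 does, here is a standard-library sketch of Zipf-weighted FK sampling (the helper below is illustrative, not sf-synth's API): parent key at rank i gets weight proportional to i**-1.5, so a small fraction of parents receives most child rows.

```python
import random

def zipf_fk_sample(parent_keys, n, s=1.5, seed=42):
    """Sample n FK values from parent_keys with Zipf(s) rank weights."""
    weights = [1 / (rank ** s) for rank in range(1, len(parent_keys) + 1)]
    rng = random.Random(seed)
    return rng.choices(parent_keys, weights=weights, k=n)

customers = list(range(1, 101))           # 100 parent keys
orders = zipf_fk_sample(customers, 50_000)

top20 = set(customers[:20])               # the 20 heaviest-weighted customers
share = sum(fk in top20 for fk in orders) / len(orders)
print(f"{share:.0%} of orders belong to the top 20% of customers")
```

With s = 1.5 the top 20% of customers end up with roughly 90% of the orders; lowering skew_param toward 0 flattens the distribution toward uniform.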
Python API
from sf_synth import SynthConfig, SynthEngine, discover_schema
from sf_synth.backend import SnowparkBackend
# Connect to Snowflake
backend = SnowparkBackend(connection_name="my_connection")
backend.connect()
# Discover schema
schema = backend.discover_schema("MY_DATABASE")
# Load config
from sf_synth.config import load_config
config = load_config("config.yaml")
# Generate
engine = SynthEngine(backend.session, config, schema_model=schema)
result = engine.generate()
print(f"Generated {result.total_rows} rows in {result.total_elapsed_seconds:.2f}s")
# Cleanup
engine.cleanup()
backend.disconnect()
Examples
The examples/ directory contains ready-to-use configurations:
| Example | Description |
|---|---|
| ecommerce.yaml | E-commerce schema with customers, products, orders, and reviews. Demonstrates FK relationships, Zipf-skewed distributions, and various generators. |
| selfref_employees.yaml | HR schema with a self-referential manager_id FK. Shows how sf-synth handles circular references via two-pass generation. |
| multi_schema.yaml | Enterprise schema spanning CORE, HR, SALES, and FINANCE schemas. Demonstrates cross-schema FK relationships within a single database. |
Architecture
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ CLI │────▶│ Config │────▶│ Discovery │
│ (Typer) │ │ (Pydantic) │ │ (INFO_SCH) │
└─────────────┘ └──────────────┘ └─────────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ DAG Builder│────▶│ Schema │
│ (networkx) │ │ Model │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Stats │────▶│ Engine │────▶│ RI Manager │
│ Sampler │ │ (Snowpark) │ │ (Parent Keys│
│ (APPROX_*) │ └─────────────┘ └─────────────┘
└─────────────┘ │
▼
┌─────────────┐
│ Snowflake │
│ Tables │
└─────────────┘
Connection Configuration
sf-synth uses standard Snowflake connection methods:
- Named connection (recommended): ~/.snowflake/connections.toml
- Environment variables: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, etc.
- CLI parameters: --connection, --account, etc.
Example ~/.snowflake/connections.toml:
[my_connection]
account = "myaccount"
user = "myuser"
authenticator = "externalbrowser"
database = "MY_DB"
schema = "PUBLIC"
warehouse = "COMPUTE_WH"
Performance Notes
- SQL-first generators (seq, uniform, choice, range) are fast and scale to billions of rows.
- Faker UDFs are slower due to Python UDF overhead. Use them only when SQL alternatives don't exist.
- Distribution sampling requires one-time stats queries per column but generates data efficiently.
- For very large tables (>100M rows), consider chunked generation or Snowflake-native GENERATOR() patterns.
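A minimal sketch of the chunked-generation idea (the helper below is hypothetical; sf-synth does not ship one): split a large target row count into fixed-size batches so each server-side insert stays a manageable size.

```python
def chunk_rows(total_rows, chunk_size=10_000_000):
    """Yield (offset, count) pairs that cover total_rows in fixed-size chunks."""
    offset = 0
    while offset < total_rows:
        count = min(chunk_size, total_rows - offset)
        yield offset, count
        offset += count

chunks = list(chunk_rows(250_000_000))
print(len(chunks), chunks[-1])  # 25 chunks; the last is (240000000, 10000000)
```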
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/unit/
# Run integration tests (requires Snowflake credentials)
SF_SYNTH_INTEGRATION_TESTS=1 pytest tests/integration/
# Lint
ruff check src/ tests/
# Type check
mypy src/
License
MIT License. See LICENSE for details.