Skip to main content

Automatically infer star/snowflake schemas from DataFrames or CSVs and generate production-ready DDL for Snowflake, Redshift, BigQuery, and PostgreSQL — in seconds.

Project description

schema-genie

Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.


Why schema-genie?

Data engineers and analysts spend a disproportionate amount of time doing work that should be automated. Schema design is the single biggest pre-modelling bottleneck in any data warehouse project — and it is almost entirely mechanical.

schema-genie runs a 6-stage statistical inference pipeline that turns raw DataFrames or CSVs into deployment-ready DDL in seconds. No BI tools. No data modelling consultants. No manual ERD sessions.


Impact at a Glance

Metric Without schema-genie With schema-genie
Schema design time (5-table project) 4 – 8 hours < 5 seconds
DDL correctness iterations 3 – 6 review cycles 1 (auto-generated)
SCD Type 2 audit columns per dim 30 – 60 min manual coding Automatic
FK relationship discovery (10 tables) Full cross-table review Value-intersection scoring, instant
Warehouse-specific DDL dialects supported Requires per-dialect expertise 4 targets, zero config
Cost to onboard a net-new dataset Hours of eng time One function call

The Real Cost of Manual Schema Work

Engineer time

A senior data engineer in the US earns roughly $130,000 – $160,000/year (~$65–80/hr fully loaded).

Task Manual time schema-genie time Hours saved
Type detection across 20 columns 20 – 40 min Instant ~0.5 hr
Cardinality + normalization analysis 30 – 60 min Instant ~0.75 hr
FK/relationship mapping (5 tables) 1 – 3 hrs Instant ~2 hrs
Fact vs. dimension classification 30 – 90 min Instant ~1 hr
Star vs. snowflake recommendation 30 – 60 min Instant ~0.75 hr
DDL authoring (4 dialects) 3 – 6 hrs per dialect Instant (all 4) ~15 hrs
SCD Type 2 columns (3 dim tables) 2 – 4 hrs Automatic ~3 hrs
Total per project ~8 – 25 hrs < 10 seconds ~23 hrs

At $75/hr, one project saves ~$1,725 in engineering time that was previously spent on repetitive, error-prone schema work.


Cloud Warehouse Cost Savings

Poor schema design directly inflates your monthly cloud bill. schema-genie generates warehouse-optimised DDL from the start.

BigQuery — partition pruning

BigQuery charges $5 per TB scanned. schema-genie automatically detects the first date column on your fact table and emits a PARTITION BY clause.

-- Generated automatically for BigQuery
CREATE TABLE IF NOT EXISTS orders (
    orders_key    INT64,
    order_date    DATE,
    revenue       NUMERIC,
    ...
)
PARTITION BY DATE(order_date);

On a 1 TB/day table queried 50 times/day:

  • Without partitioning: 50 × 1 TB = 50 TB scanned = $250/day in query costs
  • With partitioning (typical 80% pruning): 50 × 0.2 TB = 10 TB = $50/day
  • Monthly saving: ~$6,000 on that single table alone

Redshift — DISTKEY + SORTKEY

Redshift performance degrades sharply on large fact tables without distribution and sort keys. schema-genie automatically adds DISTKEY and SORTKEY on the first ID column of every generated fact table.

-- Generated automatically for Redshift
CREATE TABLE orders (
    orders_key  INTEGER IDENTITY(1,1) PRIMARY KEY,
    customer_id VARCHAR(64) DISTKEY,
    order_date  TIMESTAMP,
    ...
)
SORTKEY (customer_id);

Properly keyed Redshift tables execute join queries 2×–5× faster, which translates directly to fewer node-hours consumed and lower DC2/RA3 costs.

Snowflake — column ordering and types

Snowflake charges per Credit (~$2–4/credit depending on tier). Queries over wide VARCHAR columns on fact tables consume far more credits than typed FLOAT/NUMBER columns. schema-genie maps every column to the tightest correct SQL type rather than defaulting everything to VARCHAR(MAX).

Column pattern Naive DDL schema-genie DDL
revenue (float) VARCHAR(255) NUMBER(18,2)
created_at (datetime) VARCHAR(255) TIMESTAMP_NTZ
status (low cardinality) VARCHAR(255) VARCHAR(128)
description (free text) VARCHAR(255) VARCHAR(4096)

Correct typing eliminates implicit casts at query time, reduces storage footprint, and enables micro-partition pruning.


Data Science Impact

Faster iteration on analytics models

Data scientists blocked on schema work cannot build models. Every hour a schema is wrong or missing is an hour of blocked ML/analytics work. schema-genie turns schema provisioning from a multi-day ticket into a same-session command.

Correct semantic types feed better models

The type detector (type_detector.py) distinguishes six semantic types: id, date, currency, measure, category, text. This prevents classic data leakage bugs:

  • Currency columns are never treated as free-form measures
  • ID columns are never factored into statistical aggregations
  • Low-cardinality strings are automatically flagged as categoricals — the correct dtype for tree models, embeddings, and one-hot encoders

SCD Type 2 — auditable datasets for time-series modelling

Every dimension table gets four audit columns generated automatically:

_valid_from   TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()
_valid_to     TIMESTAMP_NTZ
_is_current   BOOLEAN        DEFAULT TRUE
_loaded_at    TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()

Without these, historical point-in-time joins are impossible, which rules out large classes of churn, CLV, and forecasting models. Most teams add these columns only after realising they need them — months into a project.

Relationship graph as a feature engineering map

The FK relationships detected by the value-intersection algorithm (relationships.py) double as a feature engineering roadmap. Every score >= 0.8 link is a join path that can be exploited in feature stores, dbt models, or graph neural networks.


Pipeline Complexity — What Runs in Seconds

For a schema design task that takes a human engineer hours, schema-genie executes the following automatically:

Stage 1 — Type Detection
  Per column: name regex match → dtype check → cardinality ratio
  O(n_cols × n_rows) — negligible on pandas DataFrames

Stage 2 — Cardinality Analysis
  unique_values / n_rows per column
  Threshold: 0.05 (configurable via normalize_threshold)

Stage 3 — Relationship Detection
  For every pair of tables, for every pair of name-stem-matching columns:
    intersection_score = |A ∩ B| / min(|A|, |B|)
  O(n_tables² × n_cols²) — pre-filtered by name stem to stay practical

Stage 4 — Fact Table Selection
  fact_score = (measure_cols / total_cols)
             + (id_col_count × 0.2)
             + (log10(n_rows) / 10)

Stage 5 — Schema Type Recommendation
  Snowflake schema triggered when any dimension has > 40% category columns

Stage 6 — DDL Generation
  Target-specific CREATE TABLE with PKs, FKs, SCD columns, indexes, DISTKEY/SORTKEY/PARTITION BY

Total wall-clock time on a 5-table, 100k-row dataset: typically under 2 seconds.


Quantified Savings Summary

Estimates based on industry-standard data engineering hourly rates and major cloud pricing as of 2025/2026. Actual savings vary by team size, dataset scale, and query patterns.

Scenario Monthly saving
1 engineer, 2 new datasets/month ~$3,500 in eng time
BigQuery, 1 TB/day partitioned table ~$6,000 in query costs
Redshift, correct DISTKEY on 10B-row fact 2×–5× query speedup → fewer node-hours
Eliminating 4-dialect DDL maintenance ~15 hrs/project × $75/hr = $1,125
SCD audit columns auto-generated 2–4 hrs saved per dimension table

Installation

pip install schema-genie

With optional warehouse connectors:

pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]

Quick Start

import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")

Multi-table inference

genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires pip install schema-genie[diagram]

Load config from YAML

# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05
genie = SchemaGenie.from_config("genie_config.yaml")

How It Works

schema-genie runs a 6-stage statistical inference pipeline:

Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘

API Reference

SchemaGenie

SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)
Method Description
genie.infer(df, table_name) Infer schema from a single DataFrame
genie.infer_multi(dict_of_dfs) Infer schema across multiple related tables
genie.deploy(connection, schema) Execute DDL against a live warehouse
SchemaGenie.from_config(path) Load configuration from a YAML file

InferredSchema

schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary()            # Human-readable summary string

Supported Targets

Target Surrogate Key Currency Type Partitioning / Distribution Direct Deploy
snowflake AUTOINCREMENT NUMBER(18,2) Micro-partition (automatic) Yes
redshift IDENTITY(1,1) DECIMAL(18,2) DISTKEY + SORTKEY on fact PK Yes
bigquery INT64 NUMERIC PARTITION BY on first date col Yes
postgres SERIAL NUMERIC(18,2) CREATE INDEX on all ID columns Yes

All dimension tables automatically receive SCD Type 2 audit columns: _valid_from, _valid_to, _is_current, _loaded_at


Development

git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v

License

MIT — See LICENSE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_genie-1.1.0.tar.gz (28.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

schema_genie-1.1.0-py3-none-any.whl (25.6 kB view details)

Uploaded Python 3

File details

Details for the file schema_genie-1.1.0.tar.gz.

File metadata

  • Download URL: schema_genie-1.1.0.tar.gz
  • Upload date:
  • Size: 28.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.1.0.tar.gz
Algorithm Hash digest
SHA256 381b85bf9965246ec7782da9e174ee39608f9fbb10693fe6d253c840d512bd95
MD5 4e7db5b5c9bfa3f5d74f43e2d8b7e2eb
BLAKE2b-256 19e8c6adc3242e43dc0d57ae6cb06804edd9a69d51dd5d1c1939ea150cc7fe48

See more details on using hashes here.

File details

Details for the file schema_genie-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: schema_genie-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 25.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 bf45c5b4692dae95f35b7242c56b92f6b701c90548904c6a4b3e1a10281c70cc
MD5 c83d7186cea89fadf2f0c1a0074f4daa
BLAKE2b-256 2b5262e5e3b136ae1fe87b21bf1ba3ed0ff6a4eefe28e7cfd5d13bf0a7ce2b78

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page