Automatically infer star/snowflake schemas from DataFrames or CSVs and generate production-ready DDL for Snowflake, Redshift, BigQuery, and PostgreSQL — in seconds.
Project description
schema-genie
Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.
Why schema-genie?
Data engineers and analysts spend a disproportionate amount of time doing work that should be automated. Schema design is the single biggest pre-modelling bottleneck in any data warehouse project — and it is almost entirely mechanical.
schema-genie runs a 6-stage statistical inference pipeline that turns raw DataFrames or CSVs into deployment-ready DDL in seconds. No BI tools. No data modelling consultants. No manual ERD sessions.
Impact at a Glance
| Metric | Without schema-genie | With schema-genie |
|---|---|---|
| Schema design time (5-table project) | 4 – 8 hours | < 5 seconds |
| DDL correctness iterations | 3 – 6 review cycles | 1 (auto-generated) |
| SCD Type 2 audit columns per dim | 30 – 60 min manual coding | Automatic |
| FK relationship discovery (10 tables) | Full cross-table review | Value-intersection scoring, instant |
| Warehouse-specific DDL dialects supported | Requires per-dialect expertise | 4 targets, zero config |
| Cost to onboard a net-new dataset | Hours of eng time | One function call |
The Real Cost of Manual Schema Work
Engineer time
A senior data engineer in the US earns roughly $130,000 – $160,000/year (~$65–80/hr fully loaded).
| Task | Manual time | schema-genie time | Hours saved |
|---|---|---|---|
| Type detection across 20 columns | 20 – 40 min | Instant | ~0.5 hr |
| Cardinality + normalization analysis | 30 – 60 min | Instant | ~0.75 hr |
| FK/relationship mapping (5 tables) | 1 – 3 hrs | Instant | ~2 hrs |
| Fact vs. dimension classification | 30 – 90 min | Instant | ~1 hr |
| Star vs. snowflake recommendation | 30 – 60 min | Instant | ~0.75 hr |
| DDL authoring (4 dialects) | 3 – 6 hrs per dialect | Instant (all 4) | ~15 hrs |
| SCD Type 2 columns (3 dim tables) | 2 – 4 hrs | Automatic | ~3 hrs |
| Total per project | ~8 – 25 hrs | < 10 seconds | ~23 hrs |
At $75/hr, one project saves ~$1,725 in engineering time that was previously spent on repetitive, error-prone schema work.
Cloud Warehouse Cost Savings
Poor schema design directly inflates your monthly cloud bill. schema-genie generates warehouse-optimised DDL from the start.
BigQuery — partition pruning
BigQuery charges $5 per TB scanned. schema-genie automatically detects the first date column on your fact table and emits a PARTITION BY clause.
-- Generated automatically for BigQuery
CREATE TABLE IF NOT EXISTS orders (
orders_key INT64,
order_date DATE,
revenue NUMERIC,
...
)
PARTITION BY DATE(order_date);
On a 1 TB/day table queried 50 times/day:
- Without partitioning: 50 × 1 TB = 50 TB scanned = $250/day in query costs
- With partitioning (typical 80% pruning): 50 × 0.2 TB = 10 TB = $50/day
- Monthly saving: ~$6,000 on that single table alone
Redshift — DISTKEY + SORTKEY
Redshift performance degrades sharply on large fact tables without distribution and sort keys. schema-genie automatically adds DISTKEY and SORTKEY on the first ID column of every generated fact table.
-- Generated automatically for Redshift
CREATE TABLE orders (
orders_key INTEGER IDENTITY(1,1) PRIMARY KEY,
customer_id VARCHAR(64) DISTKEY,
order_date TIMESTAMP,
...
)
SORTKEY (customer_id);
Properly keyed Redshift tables execute join queries 2×–5× faster, which translates directly to fewer node-hours consumed and lower DC2/RA3 costs.
Snowflake — column ordering and types
Snowflake charges per Credit (~$2–4/credit depending on tier). Queries over wide VARCHAR columns on fact tables consume far more credits than typed FLOAT/NUMBER columns. schema-genie maps every column to the tightest correct SQL type rather than defaulting everything to VARCHAR(MAX).
| Column pattern | Naive DDL | schema-genie DDL |
|---|---|---|
revenue (float) |
VARCHAR(255) |
NUMBER(18,2) |
created_at (datetime) |
VARCHAR(255) |
TIMESTAMP_NTZ |
status (low cardinality) |
VARCHAR(255) |
VARCHAR(128) |
description (free text) |
VARCHAR(255) |
VARCHAR(4096) |
Correct typing eliminates implicit casts at query time, reduces storage footprint, and enables micro-partition pruning.
Data Science Impact
Faster iteration on analytics models
Data scientists blocked on schema work cannot build models. Every hour a schema is wrong or missing is an hour of blocked ML/analytics work. schema-genie turns schema provisioning from a multi-day ticket into a same-session command.
Correct semantic types feed better models
The type detector (type_detector.py) distinguishes six semantic types: id, date, currency, measure, category, text. This prevents classic data leakage bugs:
- Currency columns are never treated as free-form measures
- ID columns are never factored into statistical aggregations
- Low-cardinality strings are automatically flagged as categoricals — the correct dtype for tree models, embeddings, and one-hot encoders
SCD Type 2 — auditable datasets for time-series modelling
Every dimension table gets four audit columns generated automatically:
_valid_from TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
_valid_to TIMESTAMP_NTZ
_is_current BOOLEAN DEFAULT TRUE
_loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
Without these, historical point-in-time joins are impossible, which rules out large classes of churn, CLV, and forecasting models. Most teams add these columns only after realising they need them — months into a project.
Relationship graph as a feature engineering map
The FK relationships detected by the value-intersection algorithm (relationships.py) double as a feature engineering roadmap. Every score >= 0.8 link is a join path that can be exploited in feature stores, dbt models, or graph neural networks.
Pipeline Complexity — What Runs in Seconds
For a schema design task that takes a human engineer hours, schema-genie executes the following automatically:
Stage 1 — Type Detection
Per column: name regex match → dtype check → cardinality ratio
O(n_cols × n_rows) — negligible on pandas DataFrames
Stage 2 — Cardinality Analysis
unique_values / n_rows per column
Threshold: 0.05 (configurable via normalize_threshold)
Stage 3 — Relationship Detection
For every pair of tables, for every pair of name-stem-matching columns:
intersection_score = |A ∩ B| / min(|A|, |B|)
O(n_tables² × n_cols²) — pre-filtered by name stem to stay practical
Stage 4 — Fact Table Selection
fact_score = (measure_cols / total_cols)
+ (id_col_count × 0.2)
+ (log10(n_rows) / 10)
Stage 5 — Schema Type Recommendation
Snowflake schema triggered when any dimension has > 40% category columns
Stage 6 — DDL Generation
Target-specific CREATE TABLE with PKs, FKs, SCD columns, indexes, DISTKEY/SORTKEY/PARTITION BY
Total wall-clock time on a 5-table, 100k-row dataset: typically under 2 seconds.
Quantified Savings Summary
Estimates based on industry-standard data engineering hourly rates and major cloud pricing as of 2025/2026. Actual savings vary by team size, dataset scale, and query patterns.
| Scenario | Monthly saving |
|---|---|
| 1 engineer, 2 new datasets/month | ~$3,500 in eng time |
| BigQuery, 1 TB/day partitioned table | ~$6,000 in query costs |
| Redshift, correct DISTKEY on 10B-row fact | 2×–5× query speedup → fewer node-hours |
| Eliminating 4-dialect DDL maintenance | ~15 hrs/project × $75/hr = $1,125 |
| SCD audit columns auto-generated | 2–4 hrs saved per dimension table |
Installation
pip install schema-genie
With optional warehouse connectors:
pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]
Quick Start
import pandas as pd
from schema_genie import SchemaGenie
df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")
print(schema.recommended_type) # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")
Multi-table inference
genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
"orders": orders_df,
"customers": customers_df,
"products": products_df,
})
print(schema.fact_table.name) # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram") # requires pip install schema-genie[diagram]
Load config from YAML
# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05
genie = SchemaGenie.from_config("genie_config.yaml")
How It Works
schema-genie runs a 6-stage statistical inference pipeline:
Raw DataFrames / CSVs
│
▼
┌────────────────────────────┐
│ Type Detector │ Maps each column to a semantic type
│ (measure/date/id/ │ using dtype + cardinality + name heuristics
│ category/text/currency) │
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ Cardinality Analyzer │ Measures unique value ratio per column
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ Relationship Detector │ FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ Fact Table Selector │ Picks table with highest measure density
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ Schema Type Recommender │ Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ DDL Generator │ CREATE TABLE statements for target warehouse
└────────────────────────────┘
API Reference
SchemaGenie
SchemaGenie(
target: str = "snowflake", # "snowflake" | "redshift" | "bigquery" | "postgres"
schema_type: str = "auto", # "auto" | "star" | "snowflake"
primary_key_strategy: str = "surrogate",
detect_scd: bool = True,
normalize_threshold: float = 0.05
)
| Method | Description |
|---|---|
genie.infer(df, table_name) |
Infer schema from a single DataFrame |
genie.infer_multi(dict_of_dfs) |
Infer schema across multiple related tables |
genie.deploy(connection, schema) |
Execute DDL against a live warehouse |
SchemaGenie.from_config(path) |
Load configuration from a YAML file |
InferredSchema
schema.recommended_type # "star" | "snowflake"
schema.fact_table # TableDefinition
schema.dimension_tables # list[TableDefinition]
schema.relationships # list[dict] — detected FK relationships
schema.ddl # str — full DDL ready to execute
schema.scd_candidates # list[str] — SCD Type 2 flagged columns
schema.confidence_score # float — pipeline confidence [0, 1]
schema.export_ddl(path) # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary() # Human-readable summary string
Supported Targets
| Target | Surrogate Key | Currency Type | Partitioning / Distribution | Direct Deploy |
|---|---|---|---|---|
snowflake |
AUTOINCREMENT |
NUMBER(18,2) |
Micro-partition (automatic) | Yes |
redshift |
IDENTITY(1,1) |
DECIMAL(18,2) |
DISTKEY + SORTKEY on fact PK |
Yes |
bigquery |
INT64 |
NUMERIC |
PARTITION BY on first date col |
Yes |
postgres |
SERIAL |
NUMERIC(18,2) |
CREATE INDEX on all ID columns |
Yes |
All dimension tables automatically receive SCD Type 2 audit columns:
_valid_from, _valid_to, _is_current, _loaded_at
Development
git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v
License
MIT — See LICENSE
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file schema_genie-1.1.0.tar.gz.
File metadata
- Download URL: schema_genie-1.1.0.tar.gz
- Upload date:
- Size: 28.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
381b85bf9965246ec7782da9e174ee39608f9fbb10693fe6d253c840d512bd95
|
|
| MD5 |
4e7db5b5c9bfa3f5d74f43e2d8b7e2eb
|
|
| BLAKE2b-256 |
19e8c6adc3242e43dc0d57ae6cb06804edd9a69d51dd5d1c1939ea150cc7fe48
|
File details
Details for the file schema_genie-1.1.0-py3-none-any.whl.
File metadata
- Download URL: schema_genie-1.1.0-py3-none-any.whl
- Upload date:
- Size: 25.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bf45c5b4692dae95f35b7242c56b92f6b701c90548904c6a4b3e1a10281c70cc
|
|
| MD5 |
c83d7186cea89fadf2f0c1a0074f4daa
|
|
| BLAKE2b-256 |
2b5262e5e3b136ae1fe87b21bf1ba3ed0ff6a4eefe28e7cfd5d13bf0a7ce2b78
|