Automatically infer star/snowflake schemas from DataFrames or CSVs and generate production-ready DDL for Snowflake, Redshift, BigQuery, and PostgreSQL — in seconds.

These details have not been verified by PyPI

Project links

Project description

schema-genie

Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.

Why schema-genie?

Data engineers and analysts spend a disproportionate amount of time doing work that should be automated. Schema design is the single biggest pre-modelling bottleneck in any data warehouse project — and it is almost entirely mechanical.

schema-genie runs a 6-stage statistical inference pipeline that turns raw DataFrames or CSVs into deployment-ready DDL in seconds. No BI tools. No data modelling consultants. No manual ERD sessions.

Impact at a Glance

Metric	Without schema-genie	With schema-genie
Schema design time (5-table project)	4 – 8 hours	< 5 seconds
DDL correctness iterations	3 – 6 review cycles	1 (auto-generated)
SCD Type 2 audit columns per dim	30 – 60 min manual coding	Automatic
FK relationship discovery (10 tables)	Full cross-table review	Value-intersection scoring, instant
Warehouse-specific DDL dialects supported	Requires per-dialect expertise	4 targets, zero config
Cost to onboard a net-new dataset	Hours of eng time	One function call

The Real Cost of Manual Schema Work

Engineer time

A senior data engineer in the US earns roughly $130,000 – $160,000/year (~$65–80/hr fully loaded).

Task	Manual time	schema-genie time	Hours saved
Type detection across 20 columns	20 – 40 min	Instant	~0.5 hr
Cardinality + normalization analysis	30 – 60 min	Instant	~0.75 hr
FK/relationship mapping (5 tables)	1 – 3 hrs	Instant	~2 hrs
Fact vs. dimension classification	30 – 90 min	Instant	~1 hr
Star vs. snowflake recommendation	30 – 60 min	Instant	~0.75 hr
DDL authoring (4 dialects)	3 – 6 hrs per dialect	Instant (all 4)	~15 hrs
SCD Type 2 columns (3 dim tables)	2 – 4 hrs	Automatic	~3 hrs
Total per project	~8 – 25 hrs	< 10 seconds	~23 hrs

At $75/hr, one project saves ~$1,725 in engineering time that was previously spent on repetitive, error-prone schema work.

Cloud Warehouse Cost Savings

Poor schema design directly inflates your monthly cloud bill. schema-genie generates warehouse-optimised DDL from the start.

BigQuery — partition pruning

BigQuery charges $5 per TB scanned. schema-genie automatically detects the first date column on your fact table and emits a PARTITION BY clause.

-- Generated automatically for BigQuery
CREATE TABLE IF NOT EXISTS orders (
    orders_key    INT64,
    order_date    DATE,
    revenue       NUMERIC,
    ...
)
PARTITION BY DATE(order_date);

On a 1 TB/day table queried 50 times/day:

Without partitioning: 50 × 1 TB = 50 TB scanned = $250/day in query costs
With partitioning (typical 80% pruning): 50 × 0.2 TB = 10 TB = $50/day
Monthly saving: ~$6,000 on that single table alone

Redshift — DISTKEY + SORTKEY

Redshift performance degrades sharply on large fact tables without distribution and sort keys. schema-genie automatically adds DISTKEY and SORTKEY on the first ID column of every generated fact table.

-- Generated automatically for Redshift
CREATE TABLE orders (
    orders_key  INTEGER IDENTITY(1,1) PRIMARY KEY,
    customer_id VARCHAR(64) DISTKEY,
    order_date  TIMESTAMP,
    ...
)
SORTKEY (customer_id);

Properly keyed Redshift tables execute join queries 2×–5× faster, which translates directly to fewer node-hours consumed and lower DC2/RA3 costs.

Snowflake — column ordering and types

Snowflake charges per Credit (~$2–4/credit depending on tier). Queries over wide VARCHAR columns on fact tables consume far more credits than typed FLOAT/NUMBER columns. schema-genie maps every column to the tightest correct SQL type rather than defaulting everything to VARCHAR(MAX).

Column pattern	Naive DDL	schema-genie DDL
`revenue` (float)	`VARCHAR(255)`	`NUMBER(18,2)`
`created_at` (datetime)	`VARCHAR(255)`	`TIMESTAMP_NTZ`
`status` (low cardinality)	`VARCHAR(255)`	`VARCHAR(128)`
`description` (free text)	`VARCHAR(255)`	`VARCHAR(4096)`

Correct typing eliminates implicit casts at query time, reduces storage footprint, and enables micro-partition pruning.

Data Science Impact

Faster iteration on analytics models

Data scientists blocked on schema work cannot build models. Every hour a schema is wrong or missing is an hour of blocked ML/analytics work. schema-genie turns schema provisioning from a multi-day ticket into a same-session command.

Correct semantic types feed better models

The type detector (type_detector.py) distinguishes six semantic types: id, date, currency, measure, category, text. This prevents classic data leakage bugs:

Currency columns are never treated as free-form measures
ID columns are never factored into statistical aggregations
Low-cardinality strings are automatically flagged as categoricals — the correct dtype for tree models, embeddings, and one-hot encoders

SCD Type 2 — auditable datasets for time-series modelling

Every dimension table gets four audit columns generated automatically:

_valid_from   TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()
_valid_to     TIMESTAMP_NTZ
_is_current   BOOLEAN        DEFAULT TRUE
_loaded_at    TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()

Without these, historical point-in-time joins are impossible, which rules out large classes of churn, CLV, and forecasting models. Most teams add these columns only after realising they need them — months into a project.

Relationship graph as a feature engineering map

The FK relationships detected by the value-intersection algorithm (relationships.py) double as a feature engineering roadmap. Every score >= 0.8 link is a join path that can be exploited in feature stores, dbt models, or graph neural networks.

Pipeline Complexity — What Runs in Seconds

For a schema design task that takes a human engineer hours, schema-genie executes the following automatically:

Stage 1 — Type Detection
  Per column: name regex match → dtype check → cardinality ratio
  O(n_cols × n_rows) — negligible on pandas DataFrames

Stage 2 — Cardinality Analysis
  unique_values / n_rows per column
  Threshold: 0.05 (configurable via normalize_threshold)

Stage 3 — Relationship Detection
  For every pair of tables, for every pair of name-stem-matching columns:
    intersection_score = |A ∩ B| / min(|A|, |B|)
  O(n_tables² × n_cols²) — pre-filtered by name stem to stay practical

Stage 4 — Fact Table Selection
  fact_score = (measure_cols / total_cols)
             + (id_col_count × 0.2)
             + (log10(n_rows) / 10)

Stage 5 — Schema Type Recommendation
  Snowflake schema triggered when any dimension has > 40% category columns

Stage 6 — DDL Generation
  Target-specific CREATE TABLE with PKs, FKs, SCD columns, indexes, DISTKEY/SORTKEY/PARTITION BY

Total wall-clock time on a 5-table, 100k-row dataset: typically under 2 seconds.

Quantified Savings Summary

Estimates based on industry-standard data engineering hourly rates and major cloud pricing as of 2025/2026. Actual savings vary by team size, dataset scale, and query patterns.

Scenario	Monthly saving
1 engineer, 2 new datasets/month	~$3,500 in eng time
BigQuery, 1 TB/day partitioned table	~$6,000 in query costs
Redshift, correct DISTKEY on 10B-row fact	2×–5× query speedup → fewer node-hours
Eliminating 4-dialect DDL maintenance	~15 hrs/project × $75/hr = $1,125
SCD audit columns auto-generated	2–4 hrs saved per dimension table

Installation

pip install schema-genie

With optional warehouse connectors:

pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]

Quick Start

import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")

Multi-table inference

genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires pip install schema-genie[diagram]

Load config from YAML

# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05

genie = SchemaGenie.from_config("genie_config.yaml")

How It Works

schema-genie runs a 6-stage statistical inference pipeline:

Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘

API Reference

`SchemaGenie`

SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)

Method	Description
`genie.infer(df, table_name)`	Infer schema from a single DataFrame
`genie.infer_multi(dict_of_dfs)`	Infer schema across multiple related tables
`genie.deploy(connection, schema)`	Execute DDL against a live warehouse
`SchemaGenie.from_config(path)`	Load configuration from a YAML file

`InferredSchema`

schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary()            # Human-readable summary string

Supported Targets

Target	Surrogate Key	Currency Type	Partitioning / Distribution	Direct Deploy
`snowflake`	`AUTOINCREMENT`	`NUMBER(18,2)`	Micro-partition (automatic)	Yes
`redshift`	`IDENTITY(1,1)`	`DECIMAL(18,2)`	`DISTKEY` + `SORTKEY` on fact PK	Yes
`bigquery`	`INT64`	`NUMERIC`	`PARTITION BY` on first date col	Yes
`postgres`	`SERIAL`	`NUMERIC(18,2)`	`CREATE INDEX` on all ID columns	Yes

All dimension tables automatically receive SCD Type 2 audit columns: _valid_from, _valid_to, _is_current, _loaded_at

Development

git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v

License

MIT — See LICENSE

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.1.0

Mar 8, 2026

1.0.0

Mar 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_genie-1.1.0.tar.gz (28.8 kB view details)

Uploaded Mar 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

schema_genie-1.1.0-py3-none-any.whl (25.6 kB view details)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file schema_genie-1.1.0.tar.gz.

File metadata

Download URL: schema_genie-1.1.0.tar.gz
Upload date: Mar 8, 2026
Size: 28.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`381b85bf9965246ec7782da9e174ee39608f9fbb10693fe6d253c840d512bd95`
MD5	`4e7db5b5c9bfa3f5d74f43e2d8b7e2eb`
BLAKE2b-256	`19e8c6adc3242e43dc0d57ae6cb06804edd9a69d51dd5d1c1939ea150cc7fe48`

See more details on using hashes here.

File details

Details for the file schema_genie-1.1.0-py3-none-any.whl.

File metadata

Download URL: schema_genie-1.1.0-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 25.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf45c5b4692dae95f35b7242c56b92f6b701c90548904c6a4b3e1a10281c70cc`
MD5	`c83d7186cea89fadf2f0c1a0074f4daa`
BLAKE2b-256	`2b5262e5e3b136ae1fe87b21bf1ba3ed0ff6a4eefe28e7cfd5d13bf0a7ce2b78`

See more details on using hashes here.

schema-genie 1.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

schema-genie

Why schema-genie?

Impact at a Glance

The Real Cost of Manual Schema Work

Engineer time

Cloud Warehouse Cost Savings

BigQuery — partition pruning

Redshift — DISTKEY + SORTKEY

Snowflake — column ordering and types

Data Science Impact

Faster iteration on analytics models

Correct semantic types feed better models

SCD Type 2 — auditable datasets for time-series modelling

Relationship graph as a feature engineering map

Pipeline Complexity — What Runs in Seconds

Quantified Savings Summary

Installation

Quick Start

Multi-table inference

Load config from YAML

How It Works

API Reference

SchemaGenie

InferredSchema

Supported Targets

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`SchemaGenie`

`InferredSchema`