Infer star/snowflake schemas from DataFrames and generate DDL

These details have not been verified by PyPI

Project links

Project description

schema-genie

Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.

Installation

pip install schema-genie

With optional warehouse connectors:

pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]

Quick Start

import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")

Multi-table inference

genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires pip install schema-genie[diagram]

Load config from YAML

# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05

genie = SchemaGenie.from_config("genie_config.yaml")

How It Works

schema-genie runs a 6-stage statistical inference pipeline:

Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘

API Reference

`SchemaGenie`

SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)

Method	Description
`genie.infer(df, table_name)`	Infer schema from a single DataFrame
`genie.infer_multi(dict_of_dfs)`	Infer schema across multiple related tables
`genie.deploy(connection, schema)`	Execute DDL against a live warehouse
`SchemaGenie.from_config(path)`	Load configuration from a YAML file

`InferredSchema`

schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary()            # Human-readable summary string

Supported Targets

Target	Surrogate Key	Currency Type	Direct Deploy
`snowflake`	`AUTOINCREMENT`	`NUMBER(18,2)`	Yes
`redshift`	`IDENTITY(1,1)`	`DECIMAL(18,2)`	Yes
`bigquery`	`INT64`	`NUMERIC`	Yes
`postgres`	`SERIAL`	`NUMERIC(18,2)`	Yes

All dimension tables automatically receive SCD Type 2 audit columns: _valid_from, _valid_to, _is_current, _loaded_at

Development

git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v

License

MIT — See LICENSE

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.1.0

Mar 8, 2026

This version

1.0.0

Mar 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_genie-1.0.0.tar.gz (21.0 kB view details)

Uploaded Mar 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

schema_genie-1.0.0-py3-none-any.whl (22.0 kB view details)

Uploaded Mar 7, 2026 Python 3

File details

Details for the file schema_genie-1.0.0.tar.gz.

File metadata

Download URL: schema_genie-1.0.0.tar.gz
Upload date: Mar 7, 2026
Size: 21.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`aa7f038699b5eb158ccde66c07a6276b842b047505110be198244f8e11e82503`
MD5	`d411b33ffd8d43fa374e0191d59d7de8`
BLAKE2b-256	`4e00d8368452c0e1555212dcc07828988d7e2f95c170b95cbc9830a1220e8cb6`

See more details on using hashes here.

File details

Details for the file schema_genie-1.0.0-py3-none-any.whl.

File metadata

Download URL: schema_genie-1.0.0-py3-none-any.whl
Upload date: Mar 7, 2026
Size: 22.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d84854bf63a4c257cb371442a16a636e86ec4b2720c5a44d123db4eb6b249392`
MD5	`5a5a8a3a5dc06d677db48ccb03d70d54`
BLAKE2b-256	`1dcd6ddb2d67c879fc7886e99d7f402507e277131ee9a6ee87f4c493dad18896`

See more details on using hashes here.

schema-genie 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

schema-genie

Installation

Quick Start

Multi-table inference

Load config from YAML

How It Works

API Reference

`SchemaGenie`

`InferredSchema`

Supported Targets

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes