Skip to main content

Infer star/snowflake schemas from DataFrames and generate DDL

Project description

schema-genie

Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.


Installation

pip install schema-genie

With optional warehouse connectors:

pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]

Quick Start

import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")

Multi-table inference

genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires pip install schema-genie[diagram]

Load config from YAML

# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05
genie = SchemaGenie.from_config("genie_config.yaml")

How It Works

schema-genie runs a 6-stage statistical inference pipeline:

Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘

API Reference

SchemaGenie

SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)
Method Description
genie.infer(df, table_name) Infer schema from a single DataFrame
genie.infer_multi(dict_of_dfs) Infer schema across multiple related tables
genie.deploy(connection, schema) Execute DDL against a live warehouse
SchemaGenie.from_config(path) Load configuration from a YAML file

InferredSchema

schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary()            # Human-readable summary string

Supported Targets

Target Surrogate Key Currency Type Direct Deploy
snowflake AUTOINCREMENT NUMBER(18,2) Yes
redshift IDENTITY(1,1) DECIMAL(18,2) Yes
bigquery INT64 NUMERIC Yes
postgres SERIAL NUMERIC(18,2) Yes

All dimension tables automatically receive SCD Type 2 audit columns: _valid_from, _valid_to, _is_current, _loaded_at


Development

git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v

License

MIT — See LICENSE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_genie-1.0.0.tar.gz (21.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

schema_genie-1.0.0-py3-none-any.whl (22.0 kB view details)

Uploaded Python 3

File details

Details for the file schema_genie-1.0.0.tar.gz.

File metadata

  • Download URL: schema_genie-1.0.0.tar.gz
  • Upload date:
  • Size: 21.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.0.0.tar.gz
Algorithm Hash digest
SHA256 aa7f038699b5eb158ccde66c07a6276b842b047505110be198244f8e11e82503
MD5 d411b33ffd8d43fa374e0191d59d7de8
BLAKE2b-256 4e00d8368452c0e1555212dcc07828988d7e2f95c170b95cbc9830a1220e8cb6

See more details on using hashes here.

File details

Details for the file schema_genie-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: schema_genie-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 22.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for schema_genie-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d84854bf63a4c257cb371442a16a636e86ec4b2720c5a44d123db4eb6b249392
MD5 5a5a8a3a5dc06d677db48ccb03d70d54
BLAKE2b-256 1dcd6ddb2d67c879fc7886e99d7f402507e277131ee9a6ee87f4c493dad18896

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page