databricks-schema

A CLI tool and Python library that uses the Databricks SDK to extract and diff Unity Catalog schemas as YAML files. It can also generate Databricks Spark SQL to apply schema changes across catalogs.

Overview

Extract a catalog to YAML files, then diff those files against a catalog — the same one to detect drift or a different one to compare environments (e.g. prod vs test):

# 1. Find the catalog you want to snapshot
databricks-schema list-catalogs

# 2. Extract its schemas to YAML files (one file per schema)
databricks-schema extract prod_catalog --output-dir ./schemas/

# 3. Diff those files against a catalog (same or different)
databricks-schema diff test_catalog ./schemas/

# 4. Generate SQL to bring that catalog in line with the YAML files
databricks-schema generate-sql test_catalog ./schemas/ --output-dir ./migrations/

The YAML files act as a version-controllable snapshot of your schema. The diff command exits with code 1 when differences are found, making it suitable for CI pipelines.
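
A CI job can gate on that exit code directly. A minimal sketch using only the Python standard library (the catalog name and script layout are illustrative):

import subprocess
import sys

# Run the diff; databricks-schema exits 1 when the live catalog differs from ./schemas/
result = subprocess.run(["databricks-schema", "diff", "prod_catalog", "./schemas/"])
if result.returncode != 0:
    print("Schema drift detected; see diff output above.", file=sys.stderr)
sys.exit(result.returncode)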

Output Format

Each schema is written to {output-dir}/{schema-name}.yaml. Fields with no value (null comments, empty tag dicts, empty FK lists) are omitted. Use --format json to write .json files with the same structure.

name: main
comment: Main production schema
tags:
  env: prod
tables:
  - name: users
    table_type: MANAGED
    comment: User accounts
    tags:
      domain: identity
    columns:
      - name: id
        data_type: bigint
        nullable: false
        comment: Primary key
      - name: email
        data_type: string
      - name: org_id
        data_type: bigint
    primary_key:
      name: pk_users
      columns:
        - id
    foreign_keys:
      - name: fk_org
        columns:
          - org_id
        ref_schema: orgs
        ref_table: organizations
        ref_columns:
          - id
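
Because the output is plain YAML, the files are easy to inspect or post-process with standard tooling. A minimal sketch using PyYAML (the path is illustrative; since empty fields are omitted, optional keys are read with .get):

import yaml

with open("schemas/main.yaml") as f:
    schema = yaml.safe_load(f)

# Walk the structure shown above; optional fields may be missing entirely
for table in schema.get("tables", []):
    columns = ", ".join(col["name"] for col in table.get("columns", []))
    print(f"{schema['name']}.{table['name']}: {columns}")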

Installation

Requires Python 3.11+ and uv.

git clone <repo>
cd databricks-schema
uv sync

For development (includes pytest and ruff):

uv sync --all-groups

Authentication

The tool uses the Databricks SDK for auth. Configure it via environment variables:

export DATABRICKS_HOST=https://<workspace>.cloud.databricks.com
export DATABRICKS_TOKEN=<your-personal-access-token>

Or use a Databricks CLI profile (~/.databrickscfg) — the SDK will pick it up automatically.

You can also pass credentials directly as flags (see --host / --token below).
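
This is the SDK's standard unified authentication, so a quick way to verify your configuration is to build a client with the SDK directly (a sketch; assumes the databricks-sdk package, which this tool is built on):

from databricks.sdk import WorkspaceClient

# With no arguments, the SDK resolves DATABRICKS_HOST / DATABRICKS_TOKEN,
# then falls back to ~/.databrickscfg
w = WorkspaceClient()
print(w.config.host)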

CLI Usage

databricks-schema [OPTIONS] COMMAND [ARGS]...

extract

Extract all schemas from a catalog to YAML files:

databricks-schema extract <catalog> --output-dir ./schemas/

Use --format json to write .json files instead of .yaml.

Extract specific schemas only:

databricks-schema extract <catalog> --schema main --schema raw --output-dir ./schemas/

Print a single schema to stdout (no --output-dir):

databricks-schema extract <catalog> --schema main

Skip tag lookups for faster extraction (tags will be absent from output):

databricks-schema extract <catalog> --output-dir ./schemas/ --no-tags

Include additional metadata (owner, storage_location) in the output:

databricks-schema extract <catalog> --output-dir ./schemas/ --include-metadata

Control the number of parallel workers (default: 4):

databricks-schema extract <catalog> --output-dir ./schemas/ --workers 8
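
The flags above map onto the Python API described under "Python Library Usage" below. A rough library equivalent of the command above (a sketch; it writes the whole catalog to one file, since catalog_to_yaml is the serialiser the library documents, rather than the CLI's one file per schema):

from pathlib import Path
from databricks_schema import CatalogExtractor, catalog_to_yaml

extractor = CatalogExtractor(max_workers=8)
catalog = extractor.extract_catalog("prod_catalog")

out_dir = Path("./schemas/")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "prod_catalog.yaml").write_text(catalog_to_yaml(catalog))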

diff

Compare the live catalog against previously extracted schema files (format auto-detected from the directory — YAML or JSON, not mixed):

databricks-schema diff <catalog> ./schemas/

Compare specific schemas only:

databricks-schema diff <catalog> ./schemas/ --schema main --schema raw

Skip tag lookups during comparison:

databricks-schema diff <catalog> ./schemas/ --no-tags

Include additional metadata (owner, storage_location) in the comparison:

databricks-schema diff <catalog> ./schemas/ --include-metadata

Exits with code 0 if no differences are found, 1 if there are — making it suitable for CI pipelines. Output example:

~ Schema: main [MODIFIED]
  ~ Table: users [MODIFIED]
    ~ Column: score [MODIFIED]
        data_type: 'int' -> 'double'
    + Column: phone [ADDED]
  + Table: events [ADDED]
- Schema: legacy [REMOVED]

Markers: + added in catalog, - removed from catalog, ~ modified.

generate-sql

Generate Databricks Spark SQL statements to bring the live catalog in line with local schema files (format auto-detected, YAML or JSON, not mixed). Statements are printed to stdout by default:

databricks-schema generate-sql <catalog> ./schemas/

Write one .sql file per schema to a directory instead:

databricks-schema generate-sql <catalog> ./schemas/ --output-dir ./migrations/

Destructive statements (DROP SCHEMA, DROP TABLE, DROP COLUMN) are emitted as SQL comments by default. Pass --allow-drop to emit them as executable statements:

databricks-schema generate-sql <catalog> ./schemas/ --allow-drop

Filter to specific schemas:

databricks-schema generate-sql <catalog> ./schemas/ --schema main --schema raw

Skip tag lookups for faster comparison:

databricks-schema generate-sql <catalog> ./schemas/ --no-tags

Include additional metadata (owner, storage_location) in the comparison:

databricks-schema generate-sql <catalog> ./schemas/ --include-metadata
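
The same flow is available from Python. A sketch built from the functions documented under "Python Library Usage" below (it assumes Schema objects carry the name field shown in the YAML output above):

from pathlib import Path
from databricks_schema import (
    CatalogExtractor,
    diff_schemas,
    schema_from_yaml,
    schema_diff_to_sql,
)

extractor = CatalogExtractor()
catalog = extractor.extract_catalog("test_catalog")
live = {s.name: s for s in catalog.schemas}  # assumes Schema exposes .name

out_dir = Path("./migrations/")
out_dir.mkdir(parents=True, exist_ok=True)
for path in sorted(Path("./schemas/").glob("*.yaml")):
    stored = schema_from_yaml(path.read_text())
    if stored.name not in live:
        continue  # present only in the files; the CLI would emit CREATE statements here
    diff = diff_schemas(live=live[stored.name], stored=stored)
    sql = schema_diff_to_sql("test_catalog", diff, stored_schema=stored, allow_drop=False)
    (out_dir / f"{stored.name}.sql").write_text(sql)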

list-catalogs

List all accessible catalogs:

databricks-schema list-catalogs

list-schemas

List schemas in a catalog:

databricks-schema list-schemas <catalog>

Python Library Usage

from pathlib import Path
from databricks_schema import CatalogExtractor, catalog_to_yaml, schema_from_yaml
from databricks_schema import diff_catalog_with_dir, diff_schemas, schema_diff_to_sql

# Extract using configured auth (max_workers controls parallel table extraction)
extractor = CatalogExtractor(max_workers=4)
catalog = extractor.extract_catalog("my_catalog", schema_filter=["main", "raw"])

# Skip tag lookups for faster extraction
catalog = extractor.extract_catalog("my_catalog", include_tags=False)

# Include additional metadata (owner, storage_location)
catalog = extractor.extract_catalog("my_catalog", include_metadata=True)

# Serialise to YAML
yaml_text = catalog_to_yaml(catalog)

# Deserialise from YAML
schema = schema_from_yaml(Path("schemas/main.yaml").read_text())
print(schema.tables[0].columns)

# Compare live catalog against local YAML files
result = diff_catalog_with_dir(catalog, Path("./schemas/"))
if result.has_changes:
    for schema_diff in result.schemas:
        print(schema_diff.name, schema_diff.status)

# Compare two Schema objects directly
stored = schema_from_yaml(Path("schemas/main.yaml").read_text())
diff = diff_schemas(live=catalog.schemas[0], stored=stored)

# Generate SQL to bring live in line with stored
sql = schema_diff_to_sql("my_catalog", diff, stored_schema=stored, allow_drop=False)
print(sql)
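
Put together, a minimal drift gate for CI (a sketch; the catalog name and directory are illustrative):

import sys
from pathlib import Path
from databricks_schema import CatalogExtractor, diff_catalog_with_dir

catalog = CatalogExtractor().extract_catalog("prod_catalog")
result = diff_catalog_with_dir(catalog, Path("./schemas/"))
for schema_diff in result.schemas:
    print(schema_diff.name, schema_diff.status)
sys.exit(1 if result.has_changes else 0)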

Development

# Run tests
uv run pytest

# Lint
uv run ruff check databricks_schema/ tests/

# Format
uv run ruff format databricks_schema/ tests/
