Databricks Unity Catalog schema extractor
databricks-schema
A CLI tool and Python library that uses the Databricks SDK to extract and diff Unity Catalog schemas as YAML files. It can also generate Databricks Spark SQL to apply schema changes across catalogs.
Overview
Extract a catalog to YAML files, then diff those files against a catalog — the same one to detect drift or a different one to compare environments (e.g. prod vs test):
# 1. Find the catalog you want to snapshot
databricks-schema list-catalogs
# 2. Extract its schemas to YAML files (one file per schema)
databricks-schema extract prod_catalog --output-dir ./schemas/
# 3. Diff those files against a catalog (same or different)
databricks-schema diff test_catalog ./schemas/
# 4. Generate SQL to bring that catalog in line with the YAML files
databricks-schema generate-sql test_catalog ./schemas/ --output-dir ./migrations/
The YAML files act as a version-controllable snapshot of your schema. The diff command exits with code 1 when differences are found, making it suitable for CI pipelines.
Output Format
Each schema is written to {output-dir}/{schema-name}.yaml. Fields with no value (null comments, empty tag dicts, empty FK lists) are omitted. Use --format json to write .json files with the same structure.
name: main
comment: Main production schema
tags:
  env: prod
tables:
  - name: users
    table_type: MANAGED
    comment: User accounts
    tags:
      domain: identity
    columns:
      - name: id
        data_type: bigint
        nullable: false
        comment: Primary key
      - name: email
        data_type: string
      - name: org_id
        data_type: bigint
    primary_key:
      name: pk_users
      columns:
        - id
    foreign_keys:
      - name: fk_org
        columns:
          - org_id
        ref_schema: orgs
        ref_table: organizations
        ref_columns:
          - id
Installation
Requires Python 3.11+ and uv.
git clone <repo>
cd databricks-schema
uv sync
For development (includes pytest and ruff):
uv sync --all-groups
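After syncing, run the CLI inside the project environment. As a quick smoke test (assuming the entry point is installed under the package name, as the examples in this README suggest):
uv run databricks-schema --help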
Authentication
The tool uses the Databricks SDK for auth. Configure it via environment variables:
export DATABRICKS_HOST=https://<workspace>.cloud.databricks.com
export DATABRICKS_TOKEN=<your-personal-access-token>
Or use a Databricks CLI profile (~/.databrickscfg) — the SDK will pick it up automatically.
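For reference, a minimal profile in the standard Databricks CLI config format:
[DEFAULT]
host  = https://<workspace>.cloud.databricks.com
token = <your-personal-access-token>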
You can also pass credentials directly as flags (see --host / --token below).
CLI Usage
databricks-schema [OPTIONS] COMMAND [ARGS]...
extract
Extract all schemas from a catalog to YAML files:
databricks-schema extract <catalog> --output-dir ./schemas/
Use --format json to write .json files instead of .yaml.
Extract specific schemas only:
databricks-schema extract <catalog> --schema main --schema raw --output-dir ./schemas/
Print a single schema to stdout (no --output-dir):
databricks-schema extract <catalog> --schema main
Skip tag lookups for faster extraction (tags will be absent from output):
databricks-schema extract <catalog> --output-dir ./schemas/ --no-tags
Include additional metadata (owner, storage_location) in the output:
databricks-schema extract <catalog> --output-dir ./schemas/ --include-metadata
Control the number of parallel workers (default: 4):
databricks-schema extract <catalog> --output-dir ./schemas/ --workers 8
diff
Compare the live catalog against previously extracted schema files (format auto-detected from the directory — YAML or JSON, not mixed):
databricks-schema diff <catalog> ./schemas/
Compare specific schemas only:
databricks-schema diff <catalog> ./schemas/ --schema main --schema raw
Skip tag lookups during comparison:
databricks-schema diff <catalog> ./schemas/ --no-tags
Include additional metadata (owner, storage_location) in the comparison:
databricks-schema diff <catalog> ./schemas/ --include-metadata
Exits with code 0 if no differences are found, 1 if there are — making it suitable for CI pipelines. Output example:
~ Schema: main [MODIFIED]
  ~ Table: users [MODIFIED]
    ~ Column: score [MODIFIED]
        data_type: 'int' -> 'double'
    + Column: phone [ADDED]
  + Table: events [ADDED]
- Schema: legacy [REMOVED]
Markers: + added in catalog, - removed from catalog, ~ modified.
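As a sketch of a CI step (catalog name is illustrative): most CI systems fail the job on any non-zero exit, so the wrapper below is only there to print a message.
if ! databricks-schema diff prod_catalog ./schemas/; then
    echo "Schema drift detected; see diff above"
    exit 1
fi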
generate-sql
Generate Databricks Spark SQL statements to bring the live catalog in line with local schema files (format auto-detected, YAML or JSON, not mixed). Statements are printed to stdout by default:
databricks-schema generate-sql <catalog> ./schemas/
Write one .sql file per schema to a directory instead:
databricks-schema generate-sql <catalog> ./schemas/ --output-dir ./migrations/
Destructive statements (DROP SCHEMA, DROP TABLE, DROP COLUMN) are emitted as SQL comments by default. Pass --allow-drop to emit them as executable statements:
databricks-schema generate-sql <catalog> ./schemas/ --allow-drop
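For illustration only, default-mode output might look like the following; the table and column names here are hypothetical, and the exact statements depend on your diff:
ALTER TABLE test_catalog.main.users ADD COLUMN phone STRING;
-- DROP TABLE test_catalog.main.legacy_events;
With --allow-drop, the DROP statement is emitted without the leading comment marker.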
Filter to specific schemas:
databricks-schema generate-sql <catalog> ./schemas/ --schema main --schema raw
Skip tag lookups for faster comparison:
databricks-schema generate-sql <catalog> ./schemas/ --no-tags
Include additional metadata (owner, storage_location) in the comparison:
databricks-schema generate-sql <catalog> ./schemas/ --include-metadata
list-catalogs
List all accessible catalogs:
databricks-schema list-catalogs
list-schemas
List schemas in a catalog:
databricks-schema list-schemas <catalog>
Python Library Usage
from pathlib import Path
from databricks_schema import CatalogExtractor, catalog_to_yaml, schema_from_yaml
from databricks_schema import diff_catalog_with_dir, diff_schemas, schema_diff_to_sql
# Extract using configured auth (max_workers controls parallel table extraction)
extractor = CatalogExtractor(max_workers=4)
catalog = extractor.extract_catalog("my_catalog", schema_filter=["main", "raw"])
# Skip tag lookups for faster extraction
catalog = extractor.extract_catalog("my_catalog", include_tags=False)
# Include additional metadata (owner, storage_location)
catalog = extractor.extract_catalog("my_catalog", include_metadata=True)
# Serialise to YAML
yaml_text = catalog_to_yaml(catalog)
# Deserialise from YAML
schema = schema_from_yaml(Path("schemas/main.yaml").read_text())
print(schema.tables[0].columns)
# Compare live catalog against local YAML files
result = diff_catalog_with_dir(catalog, Path("./schemas/"))
if result.has_changes:
for schema_diff in result.schemas:
print(schema_diff.name, schema_diff.status)
# Compare two Schema objects directly
stored = schema_from_yaml(Path("schemas/main.yaml").read_text())
diff = diff_schemas(live=catalog.schemas[0], stored=stored)
# Generate SQL to bring live in line with stored
sql = schema_diff_to_sql("my_catalog", diff, stored_schema=stored, allow_drop=False)
print(sql)
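Putting these pieces together, a minimal drift-check script using only the calls shown above (catalog name is illustrative):
import sys
from pathlib import Path

from databricks_schema import CatalogExtractor, diff_catalog_with_dir

# Extract the live catalog and compare it with the committed YAML files,
# exiting non-zero when drift is detected (mirrors the CLI diff exit codes).
extractor = CatalogExtractor(max_workers=4)
catalog = extractor.extract_catalog("my_catalog")
result = diff_catalog_with_dir(catalog, Path("./schemas/"))
sys.exit(1 if result.has_changes else 0)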
Development
# Run tests
uv run pytest
# Lint
uv run ruff check databricks_schema/ tests/
# Format
uv run ruff format databricks_schema/ tests/
Download files
File details
Details for the file databricks_schema-0.5.0.tar.gz.
File metadata
- Download URL: databricks_schema-0.5.0.tar.gz
- Upload date:
- Size: 65.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 79cc576c8844973184e6cfc5f10a00806af13032f9d3e9f153c009ecc257bb48 |
| MD5 | 0023d68a010e3f6f8903d904bbb21b2a |
| BLAKE2b-256 | 52e0cf84b84d96da2bb7c21dd21d945a92c7d6105317f29e57456492ac3a9e73 |
File details
Details for the file databricks_schema-0.5.0-py3-none-any.whl.
File metadata
- Download URL: databricks_schema-0.5.0-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cdc409c3b223f75a73f3d289182398fee9b86c274ea6739c6e474cb7e95c9cc9 |
| MD5 | 98823522afed7ef0868e0929a50331d3 |
| BLAKE2b-256 | c10247505e6cd864dbe105f92355f2f4b3c4f5c1d2c8772c7e6ca032a065985b |