iceberg-evolve

Schema diffing and evolution tool for Apache Iceberg and beyond.

📣 New in 1.0.0

Initial release with core support for schema comparison and automated evolution against live Iceberg tables.

🔧 Features

  • Schema Loading

    • Store and load Iceberg schemas to/from standalone JSON files via IcebergSchemaJSONSerializer.
    • Fetch table schemas directly from Iceberg catalogs (Hive, Glue, REST) via PyIceberg configurations (pyiceberg.yaml).
  • Schema Diffing

    • Detect added, removed, renamed, and type-changed columns.
    • Match columns by id or by name (default: id).
  • Automated Evolution

    • Generate and apply Iceberg schema evolution operations (add/rename/update/drop).
    • Preview migrations with a --dry-run mode before applying changes.
  • Rich CLI

    • iceberg-evolve diff <old.json> <new.json> to view schema diffs in a colored, tree-style format.
    • iceberg-evolve evolve --catalog-url <URI> --table-ident <db.table> --schema-path <new.json> to apply migrations.
  • Python API

    • Programmatic access to Schema, SchemaDiff, and migration utilities for integration in CI/CD pipelines or custom scripts.
  • Utilities

    • Clean and normalize Iceberg type strings.
    • Render operation plans to console via Rich.
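
The difference between the two matching strategies is easiest to see on a small example. The following is a minimal, standard-library sketch of the idea only; `Column` and `diff_columns` are hypothetical names invented for illustration and are not part of the iceberg-evolve API. Matching by id can recognize a rename, while matching by name sees the same change as a removal plus an addition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a schema field (not an iceberg-evolve class)."""
    id: int
    name: str
    type: str


def diff_columns(old, new, match_by="id"):
    """Classify changes between two column lists.

    match_by="id" pairs columns by field id, so a changed name is a rename.
    match_by="name" pairs columns by name, so a rename looks like drop + add.
    """
    if match_by not in ("id", "name"):
        raise ValueError("match_by must be 'id' or 'name'")
    key = (lambda c: c.id) if match_by == "id" else (lambda c: c.name)
    old_by_key = {key(c): c for c in old}
    new_by_key = {key(c): c for c in new}

    result = {"added": [], "removed": [], "renamed": [], "type_changed": []}
    for k, col in new_by_key.items():
        before = old_by_key.get(k)
        if before is None:
            result["added"].append(col.name)
            continue
        if before.name != col.name:
            result["renamed"].append((before.name, col.name))
        if before.type != col.type:
            result["type_changed"].append((col.name, before.type, col.type))
    for k, col in old_by_key.items():
        if k not in new_by_key:
            result["removed"].append(col.name)
    return result


old = [Column(1, "id", "long"), Column(2, "email", "string"), Column(3, "age", "int")]
new = [
    Column(1, "id", "long"),
    Column(2, "email_address", "string"),  # same id, new name
    Column(3, "age", "long"),              # same id, wider type
    Column(4, "country", "string"),        # new column
]

by_id = diff_columns(old, new, match_by="id")
by_name = diff_columns(old, new, match_by="name")
print(by_id)
print(by_name)
```

With `match_by="id"` the `email` column is reported as renamed; with `match_by="name"` it is reported as removed and `email_address` as added. This is why id matching is usually the safer choice when field ids are stable.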

🚀 Use Cases

  • Automate schema migrations for data lakes built on Iceberg.
  • Integrate schema checks into CI/CD workflows to prevent accidental breaking changes.
  • Generate human-readable schema evolution plans for review and auditing.
  • Build Python tooling around Iceberg schemas, including advanced analyses and reporting.

🚚 Installation

Requires Python 3.10 or later.

pip install iceberg-evolve

Or, to install for development with Poetry:

git clone https://github.com/anatol-ju/iceberg-evolve.git
cd iceberg-evolve
poetry install --with dev
pre-commit install  # optional: enable linting and formatting hooks

🧱 Quick Examples

For a quick look at the output, install the project and run:

poetry run example

Python API

from iceberg_evolve.schema import Schema
from iceberg_evolve.diff import SchemaDiff
from iceberg_evolve.renderer import SchemaDiffRenderer

# Load schemas
old = Schema.from_json_file("schemas/users_current.json")
new = Schema.from_json_file("schemas/users_new.json")

# Compute diff and render to console
diff = SchemaDiff(old, new)
SchemaDiffRenderer(diff).display()

from iceberg_evolve.schema import Schema
from iceberg_evolve.serializer import IcebergSchemaJSONSerializer

# Load an Iceberg Schema from a local file (in the expected format)
old_schema = Schema.from_json_file("schemas/users_current.json")

# Write it out to a standalone JSON file...
IcebergSchemaJSONSerializer.to_json_file(old_schema, "schemas/users_exported.json")

# ...and read it back in later
reloaded_schema = IcebergSchemaJSONSerializer.from_json_file("schemas/users_exported.json")
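
The examples above read schemas from JSON files, but the file layout is not shown here. The sketch below writes two small files for experimentation, assuming the expected format follows the JSON schema serialization from the Iceberg table spec (a `struct` with `fields` carrying `id`, `name`, `required`, and `type`). That assumption, the helper names `struct` and `field`, and the column names are illustrative and should be checked against the examples directory in the repository.

```python
import json
import tempfile
from pathlib import Path


def field(field_id, name, type_, required=False):
    """Build one field entry in the Iceberg spec's JSON layout (assumed format)."""
    return {"id": field_id, "name": name, "required": required, "type": type_}


def struct(fields, schema_id=0):
    """Wrap fields in a top-level struct, as the Iceberg spec serializes schemas."""
    return {"type": "struct", "schema-id": schema_id, "fields": fields}


current = struct([
    field(1, "id", "long", required=True),
    field(2, "email", "string"),
])
new = struct(
    [
        field(1, "id", "long", required=True),
        field(2, "email_address", "string"),  # renamed, same field id
        field(3, "country", "string"),        # added column
    ],
    schema_id=1,
)

# Write into a temporary directory so the sketch does not touch a real project.
schema_dir = Path(tempfile.mkdtemp()) / "schemas"
schema_dir.mkdir()
for filename, schema in (("users_current.json", current), ("users_new.json", new)):
    (schema_dir / filename).write_text(json.dumps(schema, indent=2))

print(sorted(p.name for p in schema_dir.iterdir()))
```

If the assumption holds, the resulting paths can be passed to `Schema.from_json_file` or to the `diff` command.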

CLI

# View diff between two JSON schemas
iceberg-evolve diff users_current.json users_new.json \
  --match-by name

# Apply evolution to a live Iceberg table (dry run)
iceberg-evolve evolve \
  --catalog-url hive://localhost:9083 \
  --table-ident analytics.users \
  --schema-path users_new.json \
  --dry-run

# Serialize a table's schema
iceberg-evolve serialize \
  --catalog-url hive://localhost:9083 \
  --table-ident analytics.users \
  --output-path schemas/users_table_schema.json
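
For CI/CD pipelines it can be convenient to drive the `evolve` command from a script. The sketch below only uses the flags shown above; the helper names `build_evolve_command` and `run_evolve` are hypothetical, and the CLI's exit codes are not documented here, so the caller should inspect the return code and output rather than rely on a specific value.

```python
import subprocess


def build_evolve_command(catalog_url, table_ident, schema_path, dry_run=True):
    """Assemble the argument list for `iceberg-evolve evolve`.

    Defaults to a dry run so that nothing is applied by accident.
    """
    cmd = [
        "iceberg-evolve", "evolve",
        "--catalog-url", catalog_url,
        "--table-ident", table_ident,
        "--schema-path", schema_path,
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_evolve(catalog_url, table_ident, schema_path, dry_run=True):
    """Run the CLI and return the CompletedProcess for the caller to inspect.

    Requires iceberg-evolve to be installed and on PATH.
    """
    cmd = build_evolve_command(catalog_url, table_ident, schema_path, dry_run)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


cmd = build_evolve_command("hive://localhost:9083", "analytics.users", "users_new.json")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with table identifiers and paths.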

⚙️ Configuration

This package relies on PyIceberg and uses the same configuration mechanism; see the PyIceberg documentation for details. To configure catalogs, create a pyiceberg.yaml in your project root:

catalogs:
  default:
    type: hive
    uri: thrift://localhost:9083

  glue:
    type: glue
    region: eu-west-1

You can find an example configuration in the examples directory. Alternatively, you can use environment variables to set the catalog details.
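
As a sketch of the environment-variable route: PyIceberg reads catalog properties from variables named `PYICEBERG_CATALOG__<NAME>__<KEY>`. The helper `catalog_env_vars` below is a hypothetical convenience, not part of PyIceberg or iceberg-evolve, and it only handles simple keys such as `type` and `uri`. The values mirror the YAML example above.

```python
import os


def catalog_env_vars(catalog_name, properties):
    """Translate simple catalog properties into PyIceberg-style variable names."""
    prefix = f"PYICEBERG_CATALOG__{catalog_name.upper()}__"
    return {prefix + key.upper(): value for key, value in properties.items()}


env = {}
env.update(catalog_env_vars("default", {"type": "hive", "uri": "thrift://localhost:9083"}))
env.update(catalog_env_vars("glue", {"type": "glue", "region": "eu-west-1"}))

# Export into the current process. It is safest to do this before PyIceberg
# is imported, or to export the same variables in the shell instead.
os.environ.update(env)

for name in sorted(env):
    print(f"{name}={env[name]}")
```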

When using the CLI, pass the catalog name or full URI to the evolve command via --catalog-url (e.g., glue://default).

🧪 Testing

Run unit tests with pytest:

poetry run pytest

Coverage reports are generated automatically via the existing configuration.

This project contains a basic local setup for testing the functionality against a Hive metastore. Its purpose is to give you some insight before you apply the package in your own pipelines. Once the Docker containers are up, you can run the integration tests either from inside the container:

poetry run pytest tests/test_integration.py

Or from the host, without logging into the container:

docker compose exec runner poetry run pytest tests/test_integration.py

You don't have to exclude the integration test explicitly when running the unit tests; it is skipped automatically when the tests run outside of a container.
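
How this project implements the automatic skip is not described here. One common way to get this behavior is a container check combined with a skip decorator, sketched below with the standard library's `unittest` (which pytest also collects). The `/.dockerenv` heuristic and the `RUN_INTEGRATION_TESTS` variable are assumptions made for illustration; `pytest.mark.skipif` accepts the same kind of condition.

```python
import os
import unittest


def running_in_container():
    """Heuristic check: Docker creates /.dockerenv inside containers.

    RUN_INTEGRATION_TESTS=1 is a hypothetical manual override.
    """
    return os.path.exists("/.dockerenv") or os.environ.get("RUN_INTEGRATION_TESTS") == "1"


@unittest.skipUnless(running_in_container(), "integration tests need the Docker setup")
class IntegrationSmokeTest(unittest.TestCase):
    def test_environment_is_container(self):
        # Placeholder body; a real test would talk to the metastore here.
        self.assertTrue(running_in_container())
```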

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🧑‍💻 Author

Anatol Jurenkow
Cloud Data Engineer | AWS Enthusiast | Iceberg Fan
GitHub · LinkedIn

Feel free to open issues or contribute via pull requests—happy evolving!
