A DuckDB-based Iceberg catalog implementation

boringdata.io — Kickstart your Iceberg journey with our data stack templates.

Boring Catalog

A lightweight, file-based Iceberg catalog implementation using a single JSON file (e.g., on S3, local disk, or any fsspec-compatible storage).

Why Boring Catalog?

  • No need to host or maintain a dedicated catalog service
  • Easy to use and easy to understand: a good way to get started with Iceberg
  • DuckDB CLI interface to explore your Iceberg tables and metadata

How It Works

Boring Catalog stores all Iceberg catalog state in a single JSON file:

  • Namespaces and tables are tracked in this file
  • When the catalog is stored on S3, conditional writes prevent concurrent writers from overwriting each other's changes
  • The .ice/index file in your project directory stores the configuration for your catalog, including:
    • catalog_uri: the path to your catalog JSON file
    • catalog_name: the logical name of your catalog
    • properties: additional properties (e.g., warehouse location)
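The conditional-write protection mentioned above can be illustrated with a small read-modify-write loop. This is only a sketch under stated assumptions: InMemoryStore is a hypothetical stand-in for an object store, the "tables" key is an invented layout for the catalog JSON, and register_table is not part of Boring Catalog's API. The point is the pattern: a write succeeds only if the file still has the version tag that was read.

```python
import hashlib
import json


class PreconditionFailed(Exception):
    """Raised when the stored object changed since it was last read."""


def _etag(body: bytes) -> str:
    # Version tag derived from the content (assumption: any stable hash works here).
    return hashlib.sha256(body).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}

    def get(self, key):
        body = self._objects[key]
        return body, _etag(body)

    def put(self, key, body, if_match=None):
        current = self._objects.get(key)
        # With if_match set, refuse the write unless the stored version matches.
        if if_match is not None and (current is None or _etag(current) != if_match):
            raise PreconditionFailed(key)
        self._objects[key] = body
        return _etag(body)


def register_table(store, key, namespace, table, metadata_location):
    """Read the catalog JSON, add a table entry, and write it back conditionally."""
    body, etag = store.get(key)
    catalog = json.loads(body)
    # "tables" is an assumed layout, not the real catalog schema.
    catalog.setdefault("tables", {})[f"{namespace}.{table}"] = metadata_location
    store.put(key, json.dumps(catalog).encode(), if_match=etag)
```

A writer that loses the race gets PreconditionFailed and can re-read the catalog and retry, instead of silently discarding the other writer's change.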

Installation

pip install boringcatalog

Quickstart

Initialize a Catalog

ice init

That's it! Your catalog is now ready to use.

Two files are created:

  • warehouse/catalog/catalog_boring.json = catalog file
  • .ice/index = points to the catalog location (similar to a git index file, but for Iceberg)
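To make the role of the index file concrete, here is a minimal sketch that writes and reads a pointer file with the three fields described earlier (catalog_uri, catalog_name, properties). It assumes the index is stored as JSON, which this README does not state; write_index and read_index are hypothetical helpers, not Boring Catalog functions.

```python
import json
import tempfile
from pathlib import Path


def write_index(project_dir, catalog_uri, catalog_name, properties=None):
    """Write a .ice/index pointer file (JSON encoding is an assumption)."""
    index_path = Path(project_dir) / ".ice" / "index"
    index_path.parent.mkdir(parents=True, exist_ok=True)
    index = {
        "catalog_uri": catalog_uri,      # path to the catalog JSON file
        "catalog_name": catalog_name,    # logical name of the catalog
        "properties": properties or {},  # e.g. the warehouse location
    }
    index_path.write_text(json.dumps(index, indent=2))
    return index_path


def read_index(project_dir):
    """Load the pointer file back into a dict."""
    return json.loads((Path(project_dir) / ".ice" / "index").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as project:
        write_index(
            project,
            catalog_uri="warehouse/catalog/catalog_boring.json",
            catalog_name="boring",
            properties={"warehouse": "warehouse"},
        )
        print(read_index(project))
```

Like a git index, the file holds no table data itself; it only tells tools run from this directory which catalog to open.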

Note: You can also specify a remote location for your Iceberg data and catalog file:

ice init -p warehouse=s3://mybucket/mywarehouse

See the Custom Init and Catalog Location section for more details.

You can then start using the catalog:

Commit a table

# Get some data
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet

# Commit the table
ice commit my_table --source /tmp/yellow_tripdata_2023-01.parquet

Check the commit history:

ice log 

Explore your Iceberg (data and metadata) with DuckDB

ice duck

This opens an interactive DuckDB session with pointers to all your tables and namespaces.

Example DuckDB queries:

show;                               -- show all tables               
select * from catalog.namespaces;   -- list namespaces
select * from catalog.tables;       -- list tables
select * from <namespace>.<table>;  -- query iceberg table

Python Usage

from boringcatalog import BoringCatalog

# Auto-detects .ice/index in the current working directory
catalog = BoringCatalog()

# Or specify a catalog
catalog = BoringCatalog(name="mycat", uri="path/to/catalog.json")

# Interact with your iceberg catalog
catalog.create_namespace("my_namespace")
catalog.create_table("my_namespace", "my_table")
catalog.load_table("my_namespace.my_table")

# Append the contents of a Parquet file to an existing table
import pyarrow.parquet as pq

data = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")  # a pyarrow Table
table = catalog.load_table(("ice_default", "my_table"))
table.append(data)

Custom Init and Catalog Location

You can configure your Iceberg catalog in several ways, depending on where you want to store your catalog metadata (the JSON file) and your Iceberg data (the warehouse):

  • The warehouse property determines where your Iceberg tables' data will be stored.
  • The --catalog option lets you specify the exact path for your catalog JSON file.
  • If you use both, the catalog file will be created at the path you specify, and the warehouse will be used for table data.

Examples

| Command Example | Catalog File Location | Warehouse/Data Location | Use Case |
|---|---|---|---|
| ice init | warehouse/catalog/catalog_boring.json | warehouse/ | Local, simple |
| ice init -p warehouse=... | <warehouse>/catalog/catalog_boring.json | <warehouse>/ | Custom warehouse |
| ice init --catalog ... | <custom>.json | (to define when creating a table) | Custom catalog file |
| ice init --catalog ... -p warehouse=... | <custom>.json | <warehouse>/ | Full control |
| ice init --catalog ... --catalog-name ... | <custom>.json | (to define when creating a table) | Custom name & file |
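The rules in the table above can be summarized as a small function. This is a sketch of the documented behavior, not the actual implementation: resolve_layout and CatalogLayout are hypothetical names, and the catalog_<name>.json pattern is inferred from the default file name and the note that the catalog name is used for file naming.

```python
from typing import NamedTuple, Optional


class CatalogLayout(NamedTuple):
    catalog_file: str         # where the catalog JSON lives
    warehouse: Optional[str]  # where table data lives; None = decide per table


def resolve_layout(warehouse=None, catalog=None, catalog_name="boring"):
    """Mirror the table: an explicit catalog path wins, else derive it from the warehouse."""
    if catalog is not None:
        # Custom catalog file: the warehouse stays undefined unless given explicitly.
        return CatalogLayout(catalog, warehouse.rstrip("/") if warehouse else None)
    root = (warehouse or "warehouse").rstrip("/")
    return CatalogLayout(f"{root}/catalog/catalog_{catalog_name}.json", root)


if __name__ == "__main__":
    print(resolve_layout())
    print(resolve_layout(warehouse="s3://mybucket/mywarehouse"))
    print(resolve_layout(catalog="custom.json"))
```

A warehouse of None corresponds to the "(to define when creating a table)" rows: the data location has to be supplied later, when a table is created.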

Edge Cases & Manual Editing

  • Custom Catalog Name: By default, the catalog is named "boring", but you can set a custom name with --catalog-name. This name is used in the catalog JSON and for file naming if you don't specify a custom path.
  • Re-initialization: If you run ice init multiple times in the same directory, the .ice/index file will be overwritten with the new configuration. This is useful if you want to re-point your project to a different catalog, but be aware that it will not migrate or merge any existing data.
  • Manual Editing: Advanced users can manually edit .ice/index to point to a different catalog file or change the catalog name. If you do this, make sure the catalog_uri and catalog_name fields are consistent with your actual catalog JSON file. Boring Catalog always reads the catalog location from catalog_uri in the index file, so changing the warehouse property without updating catalog_uri does not change which catalog file is used.

Roadmap

  • Improve CLI to allow MERGE operation, partition spec, etc.
  • Expose REST API for integration with AWS, Snowflake, etc.
