A Python library for Ducklake, providing a pyiceberg-like API

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

jghoman

These details have not been verified by PyPI

Project description

pyducklake

A Python SDK for Ducklake, providing a pyiceberg-like API for the Ducklake lakehouse format.

What is pyducklake?

pyducklake is to Ducklake what pyiceberg is to Apache Iceberg: a native Python client for catalog management, table operations, schema evolution, and data I/O. It connects to a Ducklake metadata database (DuckDB, PostgreSQL, MySQL, or SQLite) and reads/writes Parquet data files via DuckDB's ducklake extension. The result is a zero-infrastructure lakehouse you can spin up with a single Python import.

Installation

# Core (DuckDB + PyArrow)
pip install pyducklake

# With optional output format support
pip install pyducklake[pandas]
pip install pyducklake[polars]
pip install pyducklake[all]

Development install:

git clone https://github.com/your-org/pyducklake.git
cd pyducklake
uv sync

Quick Start

import pyarrow as pa
from pyducklake import Catalog, Schema, required, optional, IntegerType, StringType

# 1. Create a catalog (DuckDB file as metadata store)
catalog = Catalog("my_lake", "meta.duckdb", data_path="./data")

# 2. Define a schema
schema = Schema.of(
    required("id", IntegerType()),
    optional("name", StringType()),
    optional("email", StringType()),
)

# 3. Create a table and write data
table = catalog.create_table("users", schema)
table.append(pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}))

# 4. Read data
table.scan().to_arrow()                  # full table
table.scan("id > 1").to_arrow()          # filtered
table.scan().select("name").to_arrow()   # projected
table.scan().count()                     # row count

See examples/quickstart/ for the full runnable version.

Comparison with pyiceberg

pyducklake follows pyiceberg's API patterns where they make sense, but takes advantage of Ducklake's architecture (SQL metadata database vs file-based manifests) to provide features that are difficult or impossible in pyiceberg.

Feature Comparison

Feature	pyducklake	pyiceberg
Metadata storage	SQL database (DuckDB, Postgres, MySQL, SQLite)	Files (JSON, Avro manifests)
Catalog backends	4 (DuckDB, PostgreSQL, MySQL, SQLite)	7 (REST, Hive, Glue, DynamoDB, SQL, BigQuery, In-memory)
Schema definition	`Schema.of()` with `required()`/`optional()` helpers	`Schema()` with `NestedField()`
Read formats	Arrow, pandas, Polars, DuckDB, RecordBatchReader, PyArrow Dataset	Arrow, pandas, DuckDB, Ray, Polars
Write inputs	Arrow, Polars, any PyCapsule (`__arrow_c_stream__`), RecordBatchReader	Arrow only
Append	Yes	Yes
Overwrite	Yes (full or filtered)	Yes (full or filtered)
Delete	Yes	Yes
Upsert / Merge	Yes, with `UpsertResult` counts	Yes (v0.7+)
Streaming writes	`append_batches()` from RecordBatchReader or iterator	No
Schema evolution	Add, drop, rename, widen type, set/drop NOT NULL	Add, drop, rename, widen, reorder, union-by-name
Column reordering	No (Ducklake limitation)	Yes
Identifier fields	No (not in Ducklake spec)	Yes
Partitioning	Identity, year, month, day, hour	Identity, bucket, truncate, year, month, day, hour
Bucket/truncate transforms	No (Ducklake limitation)	Yes
Sort orders	Yes (applied during writes and compaction)	Spec only (not applied during writes)
Time travel	By snapshot ID or timestamp	By snapshot ID, ref name, or timestamp
Snapshot branches/tags	No (not in Ducklake spec)	Yes
Snapshot rollback	`rollback_to_snapshot()`, `rollback_to_timestamp()`	Not implemented (long-standing request)
Change data capture	`ChangeSet` with filtering, column projection, timestamp bounds, update pre/post image correlation	Not implemented (long-standing request)
Transactions	Multi-table atomic commits via SQL transactions	Single-table only
Encryption	Per-file Parquet encryption (catalog-level)	Not implemented (long-standing request)
Table maintenance	Compact, expire snapshots, rewrite files, cleanup, checkpoint	Expire snapshots (limited)
Views	Full CRUD + scannable `View` class	Not implemented
Metadata compaction	Not needed (SQL database)	Not implemented (long-standing request)
Memory management	DuckDB handles spilling to disk	Can OOM on large scans
Concurrency	Serializable isolation via database transactions	Optimistic concurrency (no retry)
External file registration	`add_files()` with `allow_missing`/`ignore_extra_columns`	`add_files()`
Inspect API	Snapshots, files, history, partitions	Snapshots, files, manifests, entries, refs, partitions
PyArrow Dataset interface	`table.to_arrow_dataset()`	Not implemented (long-standing request)
CLI	13 commands with text/JSON output	Full CLI
Zero-infrastructure quickstart	Yes (DuckDB file)	Requires catalog service
Package size	~3 deps (duckdb, pyarrow, click)	~200MB with PyArrow + optional deps

API Comparison

Operation	pyducklake	pyiceberg
Load catalog	`Catalog("name", "uri")`	`load_catalog("name", **props)`
Create table	`catalog.create_table(id, schema)`	`catalog.create_table(id, schema)`
Load table	`catalog.load_table(id)`	`catalog.load_table(id)`
Scan	`table.scan().filter(...).select(...).to_arrow()`	`table.scan(row_filter=..., selected_fields=...).to_arrow()`
Append	`table.append(df)`	`table.append(df)`
Schema evolution	`with table.update_schema() as u: u.add_column(...)`	`with table.update_schema() as u: u.add_column(...)`
Partitioning	`with table.update_spec() as s: s.add_field(...)`	`with table.update_spec() as s: s.add_field(...)`
Transaction	`with catalog.begin_transaction() as txn: ...`	Not available
CDC	`table.table_changes(start, end).updates()`	Not available
Rollback	`table.rollback_to_snapshot(id)`	Not available

Metadata Backends

Backend	URI Example
DuckDB	`Catalog("lake", "meta.duckdb")`
PostgreSQL	`Catalog("lake", "postgres:dbname=mydb host=localhost")`
MySQL	`Catalog("lake", "mysql:host=localhost database=mydb")`
SQLite	`Catalog("lake", "sqlite:meta.sqlite")`

All backends expose the same API. Swap the URI and everything else stays the same.

Features

Catalog Management

Create, list, and drop namespaces, tables, and views.

catalog.create_namespace("analytics")
catalog.list_namespaces()            # ["main", "analytics"]
catalog.list_tables("analytics")     # [("analytics", "events"), ...]

table = catalog.create_table(("analytics", "events"), schema)
catalog.rename_table("events", "events_v2")
catalog.drop_table("events_v2")

Views

Full view lifecycle with scannable View objects.

view = catalog.create_view("active_users", "SELECT * FROM users WHERE active = true")
view = catalog.load_view("active_users")

# Views are scannable, just like tables
view.scan().to_arrow()
view.scan("id > 100").select("name").to_pandas()
view.to_arrow_dataset()

catalog.create_or_replace_view("active_users", "SELECT * FROM users WHERE status = 'active'")
catalog.rename_view("active_users", "current_users")
catalog.list_views()
catalog.drop_view("current_users")

Schema Definition

Use Schema.of() with required() / optional() helpers for concise schema definitions.

from pyducklake import (
    Schema, required, optional,
    IntegerType, StringType, DoubleType, TimestampType,
    ListType, StructType, MapType, NestedField,
)

schema = Schema.of(
    required("id", IntegerType()),
    optional("name", StringType()),
    optional("score", DoubleType()),
    optional("tags", ListType(element_type=StringType())),
)

# Dict shorthand (all fields optional)
schema = Schema.of({"id": IntegerType(), "name": StringType()})

Reading Data

DataScan is an immutable builder. Chain methods and materialize with a terminal call.

scan = table.scan()

# Output formats
scan.to_arrow()                # pyarrow.Table
scan.to_pandas()               # pandas.DataFrame
scan.to_polars()               # polars.DataFrame
scan.to_duckdb()               # duckdb.DuckDBPyRelation
scan.to_arrow_batch_reader()   # pa.RecordBatchReader (streaming)
scan.to_arrow_dataset()        # pyarrow.dataset.Dataset (engine interop)
scan.count()                   # int

# Filtering, projection, limit
table.scan("price > 10.0").select("name", "price").with_limit(100).to_arrow()

The to_arrow_dataset() method returns a standard PyArrow Dataset, enabling interop with DuckDB, Polars, DataFusion, Dask, and any other engine that consumes the PyArrow dataset API.

Writing Data

import pyarrow as pa

df = pa.table({"id": [1, 2], "value": [10.0, 20.0]})

table.append(df)                                    # append rows
table.overwrite(df)                                 # replace all data
table.overwrite(df, overwrite_filter="id < 10")     # replace matching rows
table.delete("id = 1")                              # delete matching rows
result = table.upsert(df, join_cols=["id"])          # merge
print(result.rows_updated, result.rows_inserted)

Streaming writes for memory-efficient ingestion:

reader = pa.RecordBatchReader.from_batches(schema, batch_iterator)
table.append_batches(reader)

Arrow PyCapsule interface — pass Polars DataFrames, nanoarrow arrays, or any object implementing __arrow_c_stream__ directly to append(), overwrite(), and upsert():

import polars as pl
df = pl.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
table.append(df)  # no conversion needed

External file registration:

table.add_files(
    ["s3://bucket/data.parquet"],
    allow_missing=True,          # fill missing columns with defaults
    ignore_extra_columns=True,   # drop columns not in schema
)

See examples/etl_pipeline/ for a full ETL workflow.

Schema Evolution

Use the update_schema() context manager to batch changes.

with table.update_schema() as update:
    update.add_column("email", StringType())
    update.rename_column("score", "points")
    update.update_column("points", BigIntType())  # type widening
    update.drop_column("old_field")
    update.set_nullability("email", required=True)

See examples/schema_evolution/ for all operations.

Partitioning

Hidden partitioning with identity and temporal transforms.

from pyducklake import DAY, IDENTITY

with table.update_spec() as spec:
    spec.add_field("event_date", DAY)
    spec.add_field("region", IDENTITY)

table.spec               # current partition spec
table.spec.is_unpartitioned  # False

Sort Orders

Configure sort orders applied during writes and compaction.

from pyducklake import SortDirection, NullOrder

with table.update_sort_order() as sort:
    sort.add_field("timestamp", SortDirection.ASC, NullOrder.NULLS_LAST)
    sort.add_field("id", SortDirection.ASC)

# Writes automatically respect the sort order
table.append(unsorted_data)  # data written in sorted order

Time Travel

Query historical table state by snapshot ID or timestamp.

from datetime import datetime

# By snapshot ID
table.scan().with_snapshot(snap_id).to_arrow()

# By timestamp
table.scan().with_timestamp(datetime(2024, 1, 15)).to_arrow()

# Rollback to a previous state
table.rollback_to_snapshot(snap_id)
table.rollback_to_timestamp(datetime(2024, 1, 15))

# List snapshots
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp)

See examples/time_travel/ for a full walkthrough.

Change Data Capture

Query row-level changes between snapshots or timestamps, with filtering, column projection, and update correlation.

from datetime import datetime, timedelta

# All changes between snapshots
changes = table.table_changes(start_snapshot=2, end_snapshot=5)

# Timestamp-based bounds
changes = table.table_changes(
    start_time=datetime.now() - timedelta(hours=1),
    end_time=datetime.now(),
)

# Column projection and predicate pushdown
changes = table.table_changes(
    start_snapshot=2,
    columns=["id", "status"],
    filter_expr="status = 'active'",
)

# ChangeSet provides structured access
changes.inserts()           # only inserted rows
changes.deletes()           # only deleted rows
changes.update_preimages()  # pre-update row state
changes.update_postimages() # post-update row state
changes.summary()           # {"insert": 5, "delete": 2, ...}

# Correlate update pre/post images by row ID
for old_row, new_row in changes.updates():
    print(f"Changed {old_row['name']} -> {new_row['name']}")

# Convenience methods for insert-only or delete-only queries
inserted = table.table_insertions(start_snapshot=2)
deleted = table.table_deletions(start_snapshot=2)

See examples/time_travel/ and examples/table_replication/ for CDC examples.

Transactions

Atomic multi-table writes with automatic rollback on error.

with catalog.begin_transaction() as txn:
    orders = txn.load_table("orders")
    items = txn.load_table("order_items")

    orders.append(order_data)
    items.append(items_data)
    # Commits on clean exit; rolls back on exception

See examples/multi_table_transaction/.

Table Maintenance

Compaction, snapshot expiration, and file cleanup.

maint = table.maintenance()

maint.compact()                    # merge small files
maint.expire_snapshots(versions=5) # keep last 5 snapshots
maint.rewrite_data_files()         # rewrite files with current sort order
maint.cleanup_files()              # remove unreferenced files
maint.delete_orphaned_files()      # remove orphaned files
maint.checkpoint()                 # full maintenance pass

See examples/maintenance/.

Encryption

Catalog-level Parquet encryption for data at rest. Keys are auto-generated per file and stored in the catalog metadata.

catalog = Catalog("secure", "meta.duckdb", data_path="./data", encrypted=True)

table = catalog.create_table("pii", schema)
table.append(sensitive_data)

# Reads through the catalog are transparently decrypted
table.scan().to_arrow()

# Raw Parquet files are unreadable without the catalog

See examples/encrypted_catalog/.

Inspect API

Query table metadata as Arrow tables.

inspect = table.inspect()

inspect.snapshots()                      # snapshot history
inspect.files()                          # data file listing with sizes
inspect.files(snapshot_time="2024-06-01") # files at a point in time
inspect.history()                        # commit history (newest first)
inspect.partitions()                     # partition info

CLI

The pyducklake command-line tool provides catalog inspection and maintenance.

# List tables
pyducklake --uri meta.duckdb list-tables

# Describe a table (schema, partition spec, sort order)
pyducklake --uri meta.duckdb describe users

# Show schema
pyducklake --uri meta.duckdb schema users

# List snapshots
pyducklake --uri meta.duckdb snapshots users

# List data files
pyducklake --uri meta.duckdb files users

# Compact small files
pyducklake --uri meta.duckdb compact users

# Expire old snapshots
pyducklake --uri meta.duckdb expire-snapshots users --versions 5

# JSON output
pyducklake --uri meta.duckdb --output json list-tables

# Show version
pyducklake version

Examples

Example	Description	Docker
`quickstart/`	Create catalog, table, write data, read it back	No
`etl_pipeline/`	Batch loading, upserts, schema evolution, partitioning	No
`time_travel/`	Historical queries and change data capture	No
`multi_table_transaction/`	Atomic multi-table writes with rollback	No
`schema_evolution/`	Add, rename, drop columns and widen types	No
`maintenance/`	Compaction, snapshot expiration, file cleanup	No
`encrypted_catalog/`	Parquet-level encryption for data at rest	No
`postgres_backend/`	PostgreSQL as metadata backend	Yes
`table_replication/`	CDC-based replication to downstream Ducklakes by team_id	Yes

Run all local examples:

just examples

Run individual examples:

just example-quickstart
just example-etl
just example-time-travel
just example-transactions
just example-schema-evolution
just example-maintenance
just example-encryption
just example-postgres          # requires Docker
just example-replication       # requires Docker

API Documentation

Full API documentation is available via pdoc:

just docs-serve

Development

Prerequisites: uv, just, and Docker (for integration tests).

just sync              # install dependencies
just test              # unit tests
just test-integration  # integration tests (Docker required)
just ci                # lint + typecheck + test
just fmt               # format code
just typecheck         # mypy + pyright
just typecheck-mypy    # mypy strict
just typecheck-pyright # pyright strict
just audit             # dependency vulnerability scan
just build             # build wheel + sdist
just docs              # generate API docs

Roadmap

Expand Hypothesis property-based test coverage

License

Apache License 2.0

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

jghoman

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

1.0.16

May 21, 2026

1.0.15

May 21, 2026

1.0.14

May 6, 2026

1.0.13

May 6, 2026

1.0.12

May 5, 2026

1.0.11

May 5, 2026

1.0.10

May 5, 2026

1.0.9

Apr 28, 2026

1.0.8

Apr 20, 2026

1.0.7

Apr 20, 2026

1.0.3

Apr 6, 2026

1.0.2

Apr 5, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyducklake-1.0.16.tar.gz (283.3 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyducklake-1.0.16-py3-none-any.whl (50.1 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file pyducklake-1.0.16.tar.gz.

File metadata

Download URL: pyducklake-1.0.16.tar.gz
Upload date: May 21, 2026
Size: 283.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pyducklake-1.0.16.tar.gz
Algorithm	Hash digest
SHA256	`566bd0e6e305d2f35b1e3241af98203c63b74d92c3f98ae4c8e6d5488d3ffc9a`
MD5	`4ff2904b08e0af9884fd89d49685ac91`
BLAKE2b-256	`dc72b947e9fdf653789dd9539495f49ebdd70d09719ce97ef5a375e3385ff16e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyducklake-1.0.16.tar.gz:

Publisher: release.yml on jghoman/pyducklake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyducklake-1.0.16.tar.gz
- Subject digest: 566bd0e6e305d2f35b1e3241af98203c63b74d92c3f98ae4c8e6d5488d3ffc9a
- Sigstore transparency entry: 1596381159
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: jghoman/pyducklake@728370dfbbbab50aa8a609af93c11935dea1a064
- Branch / Tag: refs/tags/v1.0.16
- Owner: https://github.com/jghoman
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@728370dfbbbab50aa8a609af93c11935dea1a064
- Trigger Event: push

File details

Details for the file pyducklake-1.0.16-py3-none-any.whl.

File metadata

Download URL: pyducklake-1.0.16-py3-none-any.whl
Upload date: May 21, 2026
Size: 50.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pyducklake-1.0.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ca589a7fa8b5689b882241905e061bd4b835d579340cb4b3009ec9c5b67b7797`
MD5	`03d3a552ef91f11db4f45c133e52f49c`
BLAKE2b-256	`bf7bb10e5f1e2b8da3bba4d0188e712dbdde48411076bff78bfd31b9d29a0ef2`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyducklake-1.0.16-py3-none-any.whl:

Publisher: release.yml on jghoman/pyducklake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyducklake-1.0.16-py3-none-any.whl
- Subject digest: ca589a7fa8b5689b882241905e061bd4b835d579340cb4b3009ec9c5b67b7797
- Sigstore transparency entry: 1596381230
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: jghoman/pyducklake@728370dfbbbab50aa8a609af93c11935dea1a064
- Branch / Tag: refs/tags/v1.0.16
- Owner: https://github.com/jghoman
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@728370dfbbbab50aa8a609af93c11935dea1a064
- Trigger Event: push

pyducklake 1.0.16

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

pyducklake

What is pyducklake?

Installation

Quick Start

Comparison with pyiceberg

Feature Comparison

API Comparison

Metadata Backends

Features

Catalog Management

Views

Schema Definition

Reading Data

Writing Data

Schema Evolution

Partitioning

Sort Orders

Time Travel

Change Data Capture

Transactions

Table Maintenance

Encryption

Inspect API

CLI

Examples

API Documentation

Development

Roadmap

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance