
ducklake-pandas

This project is a proof of concept. It was 100% written by Claude Code (Anthropic's AI coding agent). It is not intended for production use.

A pure-Python Pandas integration for DuckLake catalogs — both read and write.

Reads and writes DuckLake metadata directly in SQLite or PostgreSQL, and scans the underlying Parquet data files with Pandas' native Parquet reader. No DuckDB runtime dependency.

Installation

pip install ducklake-pandas

# With PostgreSQL catalog support
pip install ducklake-pandas[postgres]

Runtime dependencies: pandas >= 2.0 and pyarrow >= 12.0. SQLite catalogs use Python's built-in sqlite3. PostgreSQL catalogs require the postgres extra (adds psycopg2).

Quick start

Reading data

from ducklake_pandas import read_ducklake

# Read a table into a DataFrame
df = read_ducklake("catalog.ducklake", "my_table")

# Select specific columns
df = read_ducklake("catalog.ducklake", "my_table", columns=["x", "y"])

# Time travel
df = read_ducklake("catalog.ducklake", "my_table", snapshot_version=3)
df = read_ducklake("catalog.ducklake", "my_table", snapshot_time="2025-01-15T10:30:00")

# PostgreSQL-backed catalog
df = read_ducklake("postgresql://user:pass@localhost/mydb", "my_table")

Writing data

import pandas as pd
from ducklake_pandas import write_ducklake

df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})

# Create and populate a new table
write_ducklake(df, "catalog.ducklake", "users", mode="error")

# Append rows
new_rows = pd.DataFrame({"id": [4], "name": ["Dave"]})
write_ducklake(new_rows, "catalog.ducklake", "users", mode="append")

# Overwrite all data
write_ducklake(df, "catalog.ducklake", "users", mode="overwrite")

DDL operations

from ducklake_pandas import (
    create_ducklake_table,
    alter_ducklake_add_column,
    alter_ducklake_drop_column,
    alter_ducklake_rename_column,
    alter_ducklake_set_partitioned_by,
    drop_ducklake_table,
    rename_ducklake_table,
    create_ducklake_schema,
    drop_ducklake_schema,
    create_ducklake_view,
    drop_ducklake_view,
)

# Schema management
create_ducklake_schema("catalog.ducklake", "analytics")
drop_ducklake_schema("catalog.ducklake", "analytics", cascade=True)

# Table management — schema_dict uses DuckDB type strings
create_ducklake_table("catalog.ducklake", "events", {"ts": "timestamp", "value": "double"})
rename_ducklake_table("catalog.ducklake", "events", "event_log")
drop_ducklake_table("catalog.ducklake", "event_log")

# Column management — dtype is a DuckDB type string
alter_ducklake_add_column("catalog.ducklake", "users", "email", "varchar")
alter_ducklake_rename_column("catalog.ducklake", "users", "email", "contact_email")
alter_ducklake_drop_column("catalog.ducklake", "users", "contact_email")

# Partitioning
alter_ducklake_set_partitioned_by("catalog.ducklake", "events", ["region", "date"])

# Views
create_ducklake_view("catalog.ducklake", "active_users", "SELECT * FROM users WHERE active = true")
drop_ducklake_view("catalog.ducklake", "active_users")

DML operations

from ducklake_pandas import delete_ducklake, update_ducklake, merge_ducklake

# Delete rows matching a predicate (callable: DataFrame -> Series[bool])
deleted = delete_ducklake("catalog.ducklake", "users", lambda df: ~df["active"])

# Update rows
updated = update_ducklake(
    "catalog.ducklake", "users",
    updates={"status": "inactive"},
    predicate=lambda df: df["last_login"] < "2024-01-01",
)

# Merge (upsert)
rows_updated, rows_inserted = merge_ducklake(
    "catalog.ducklake", "users", source_df, on="id",
    when_matched_update=True,
    when_not_matched_insert=True,
)

Catalog inspection

from ducklake_pandas import DuckLakeCatalog

catalog = DuckLakeCatalog("catalog.ducklake")

catalog.snapshots()           # All snapshots
catalog.current_snapshot()    # Latest snapshot ID
catalog.list_schemas()        # All schemas
catalog.list_tables()         # Tables in a schema
catalog.table_info()          # Per-table storage metadata
catalog.list_files("users")   # Data files and delete files
catalog.options()             # Catalog key-value metadata
catalog.settings()            # Backend type and data path

# Change data feed
catalog.table_insertions("users", start_version=1, end_version=5)
catalog.table_deletions("users", start_version=1, end_version=5)
catalog.table_changes("users", start_version=1, end_version=5)

Maintenance

from ducklake_pandas import expire_snapshots, vacuum_ducklake

# Expire old snapshots (metadata cleanup)
expired = expire_snapshots("catalog.ducklake", keep_last_n=10)

# Delete orphaned Parquet files (disk cleanup)
deleted = vacuum_ducklake("catalog.ducklake")

Features

Read path

  • Eager reads via read_ducklake()
  • Column projection via the columns parameter
  • File pruning via column-level min/max statistics and partition values (see the sketch after this list)
  • Time travel by snapshot version or timestamp
  • Delete file handling via Iceberg-compatible positional deletes
  • Schema evolution — ADD COLUMN, DROP COLUMN, RENAME COLUMN all handled transparently
  • Inlined data — small tables stored directly in catalog metadata
  • Partition pruning for identity-transform partitions
  • Column renames — Parquet files written under earlier column names are reconciled transparently
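
As a rough illustration of the min/max pruning idea (this is not the library's internal API, just the principle): a file whose per-column value range cannot satisfy the filter is skipped without ever being opened.

# Hypothetical per-file statistics; the structure here is illustrative only.
files = [
    {"path": "part-0.parquet", "min": {"value": 0.0},  "max": {"value": 9.9}},
    {"path": "part-1.parquet", "min": {"value": 10.0}, "max": {"value": 19.9}},
]

# Filter: value > 12.0. part-0.parquet's max (9.9) can never satisfy it,
# so that file is pruned before any Parquet I/O happens.
candidates = [f for f in files if f["max"]["value"] > 12.0]
assert [f["path"] for f in candidates] == ["part-1.parquet"]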

Write path

  • INSERT — append, overwrite, or error-on-exists modes
  • DELETE — predicate-based row deletion with position-delete files
  • UPDATE — atomic delete + insert in a single snapshot
  • MERGE — upsert with configurable matched/unmatched behavior
  • CREATE TABLE AS — single-snapshot table creation with data
  • Data inlining — small inserts stored as rows in catalog metadata
  • Partitioned writes — Hive-style directory layout per partition key (sketched after this list)
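
A sketch of a partitioned write using the functions documented above, assuming an events table with region and value columns already exists in a catalog with a configured data path; the file names shown under each partition directory are illustrative:

import pandas as pd
from ducklake_pandas import alter_ducklake_set_partitioned_by, write_ducklake

# Partition by the identity of the region column, then append rows.
alter_ducklake_set_partitioned_by("catalog.ducklake", "events", ["region"])
write_ducklake(
    pd.DataFrame({"region": ["us", "eu"], "value": [1.0, 2.0]}),
    "catalog.ducklake", "events", mode="append",
)
# Expected Hive-style layout on disk (illustrative names):
#   data/region=us/<uuid>.parquet
#   data/region=eu/<uuid>.parquet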

DDL

  • CREATE/DROP TABLE with full snapshot versioning
  • ADD/DROP/RENAME COLUMN with schema evolution tracking
  • CREATE/DROP SCHEMA with cascade support
  • RENAME TABLE preserving table identity
  • SET PARTITIONED BY for identity-transform partitioning
  • CREATE/DROP VIEW with OR REPLACE support

Catalog inspection

  • Snapshot history and time travel metadata
  • Per-table storage statistics (file counts, sizes)
  • Data file and delete file listing
  • Schema and table enumeration
  • Key-value catalog options
  • Change data feed — insertions, deletions, and update detection

Maintenance

  • expire_snapshots — remove old snapshot metadata
  • vacuum — delete orphaned Parquet files

Backend support

  • SQLite — via Python stdlib sqlite3 (zero-dependency)
  • PostgreSQL — via psycopg2 (optional extra)
  • Full interoperability with DuckDB's DuckLake extension

DuckDB interoperability

ducklake-pandas produces catalogs that are fully interoperable with DuckDB's DuckLake extension. You can:

  • Create catalogs with DuckDB, read/write with ducklake-pandas
  • Create catalogs with ducklake-pandas, read/query with DuckDB
  • Mix operations freely — both tools read the same metadata format

# Create catalog with DuckDB
import duckdb
con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:catalog.ducklake' AS lake (DATA_PATH 'data/')")
con.execute("CREATE TABLE lake.users (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO lake.users VALUES (1, 'Alice'), (2, 'Bob')")
con.close()

# Read with ducklake-pandas
from ducklake_pandas import read_ducklake
df = read_ducklake("catalog.ducklake", "users")
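
The reverse round trip works the same way: write with ducklake-pandas, then attach and query from DuckDB. For an existing catalog the stored data path is reused, so DATA_PATH can normally be omitted on re-attach:

# Append with ducklake-pandas
import pandas as pd
from ducklake_pandas import write_ducklake
write_ducklake(pd.DataFrame({"id": [3], "name": ["Carol"]}),
               "catalog.ducklake", "users", mode="append")

# Query back with DuckDB
import duckdb
con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:catalog.ducklake' AS lake")
print(con.execute("SELECT * FROM lake.users ORDER BY id").fetchdf())
con.close()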

See the DuckDB Interop Guide for detailed interop patterns.

Supported data types

DuckLake / DuckDB type   Pandas type              Notes
TINYINT / int8           Int8
SMALLINT / int16         Int16
INTEGER / int32          Int32
BIGINT / int64           Int64
UTINYINT / uint8         UInt8
USMALLINT / uint16       UInt16
UINTEGER / uint32        UInt32
UBIGINT / uint64         UInt64
FLOAT / float32          Float32
DOUBLE / float64         Float64
BOOLEAN                  bool
VARCHAR                  object (str)
BLOB                     object (bytes)
DATE                     object (date)
TIME / timetz            object (time)
TIMESTAMP                datetime64[us]
TIMESTAMP_MS             datetime64[ms]
TIMESTAMP_NS             datetime64[ns]
TIMESTAMP_S              datetime64[s]
TIMESTAMPTZ              datetime64[us]
DECIMAL(p, s)            object (Decimal)
UUID                     object                   Binary in Parquet
JSON                     object                   Binary in Parquet
HUGEINT                  Int64                    Limited: DuckDB writes as Float64 in Parquet
INTERVAL                 object                   Limited: Pandas Parquet reader limitation
LIST(T)                  object (list)            Recursive nesting supported
STRUCT(...)              object (dict)            Recursive nesting supported
MAP(K, V)                object (list of dicts)   Limited: Pandas Parquet reader issue
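
The capitalized names above (Int8 through Float64) are pandas nullable extension dtypes: missing values are represented as pd.NA rather than forcing the column to upcast to float. A quick plain-pandas illustration:

import pandas as pd

# Nullable extension dtype: integers and missing values coexist.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)  # Int64
print(s[2])     # <NA>

# The default NumPy-backed path upcasts to float64 instead:
print(pd.Series([1, 2, None]).dtype)  # float64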

Architecture

src/ducklake_pandas/
    __init__.py       Public API (all functions and DuckLakeCatalog)
    _backend.py       Backend adapters (SQLite, PostgreSQL)
    _catalog.py       Metadata reader (snapshots, tables, columns, files, stats)
    _catalog_api.py   DuckLakeCatalog inspection class
    _dataset.py       Pandas dataset reader
    _schema.py        DuckLake type -> Pandas type mapping
    _stats.py         Column statistics for file pruning
    _writer.py        Catalog writer (tables, data, DDL, views, maintenance)

See the Architecture Overview for a deeper dive into the internals.

Development

git clone https://github.com/pdet/ducklake-pandas.git
cd ducklake-pandas
pip install -e ".[dev]"

Running tests

pytest                    # Full suite (SQLite backend)
pytest -n auto            # Parallel execution
pytest -k "test_views"    # Specific pattern

# With PostgreSQL backend
DUCKLAKE_PG_DSN="postgresql://user:pass@localhost/testdb" pytest

Test suite: 553 tests (4 xfailed for known DuckDB/Pandas limitations). Tests are parametrized over backends — SQLite always runs; PostgreSQL runs when DUCKLAKE_PG_DSN is set.

Documentation

  • DuckDB Interop Guide — detailed interop patterns
  • Architecture Overview — internals deep dive

License

MIT
