
ducklake-pandas

This project is a proof of concept. It was 100% written by Claude Code (Anthropic's AI coding agent). It is not intended for production use.

A pure-Python Pandas integration for DuckLake catalogs — both read and write.

Reads and writes DuckLake metadata directly in SQLite or PostgreSQL, and scans the underlying Parquet data files with Pandas' native Parquet reader. No DuckDB runtime dependency.

Installation

pip install ducklake-pandas

# With PostgreSQL catalog support
pip install ducklake-pandas[postgres]

Runtime dependencies: pandas >= 2.0 and pyarrow >= 12.0. SQLite catalogs use Python's built-in sqlite3. PostgreSQL catalogs require the postgres extra (adds psycopg2).

Quick start

Reading data

from ducklake_pandas import read_ducklake

# Read a table into a DataFrame
df = read_ducklake("catalog.ducklake", "my_table")

# Select specific columns
df = read_ducklake("catalog.ducklake", "my_table", columns=["x", "y"])

# Time travel
df = read_ducklake("catalog.ducklake", "my_table", snapshot_version=3)
df = read_ducklake("catalog.ducklake", "my_table", snapshot_time="2025-01-15T10:30:00")

# PostgreSQL-backed catalog
df = read_ducklake("postgresql://user:pass@localhost/mydb", "my_table")

Writing data

import pandas as pd
from ducklake_pandas import write_ducklake

df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})

# Create and populate a new table
write_ducklake(df, "catalog.ducklake", "users", mode="error")

# Append rows
new_rows = pd.DataFrame({"id": [4], "name": ["Dave"]})
write_ducklake(new_rows, "catalog.ducklake", "users", mode="append")

# Overwrite all data
write_ducklake(df, "catalog.ducklake", "users", mode="overwrite")

DDL operations

from ducklake_pandas import (
    create_ducklake_table,
    alter_ducklake_add_column,
    alter_ducklake_drop_column,
    alter_ducklake_rename_column,
    alter_ducklake_set_partitioned_by,
    drop_ducklake_table,
    rename_ducklake_table,
    create_ducklake_schema,
    drop_ducklake_schema,
    create_ducklake_view,
    drop_ducklake_view,
)

# Schema management
create_ducklake_schema("catalog.ducklake", "analytics")
drop_ducklake_schema("catalog.ducklake", "analytics", cascade=True)

# Table management — schema_dict uses DuckDB type strings
create_ducklake_table("catalog.ducklake", "events", {"ts": "timestamp", "value": "double"})
rename_ducklake_table("catalog.ducklake", "events", "event_log")
drop_ducklake_table("catalog.ducklake", "event_log")

# Column management — dtype is a DuckDB type string
alter_ducklake_add_column("catalog.ducklake", "users", "email", "varchar")
alter_ducklake_rename_column("catalog.ducklake", "users", "email", "contact_email")
alter_ducklake_drop_column("catalog.ducklake", "users", "contact_email")

# Partitioning
alter_ducklake_set_partitioned_by("catalog.ducklake", "events", ["region", "date"])

# Views
create_ducklake_view("catalog.ducklake", "active_users", "SELECT * FROM users WHERE active = true")
drop_ducklake_view("catalog.ducklake", "active_users")

DML operations

from ducklake_pandas import delete_ducklake, update_ducklake, merge_ducklake

# Delete rows matching a predicate (callable: DataFrame -> Series[bool])
deleted = delete_ducklake("catalog.ducklake", "users", lambda df: ~df["active"])

# Update rows
updated = update_ducklake(
    "catalog.ducklake", "users",
    updates={"status": "inactive"},
    predicate=lambda df: df["last_login"] < "2024-01-01",
)

# Merge (upsert)
rows_updated, rows_inserted = merge_ducklake(
    "catalog.ducklake", "users", source_df, on="id",
    when_matched_update=True,
    when_not_matched_insert=True,
)

Catalog inspection

from ducklake_pandas import DuckLakeCatalog

catalog = DuckLakeCatalog("catalog.ducklake")

catalog.snapshots()           # All snapshots
catalog.current_snapshot()    # Latest snapshot ID
catalog.list_schemas()        # All schemas
catalog.list_tables()         # Tables in a schema
catalog.table_info()          # Per-table storage metadata
catalog.list_files("users")   # Data files and delete files
catalog.options()             # Catalog key-value metadata
catalog.settings()            # Backend type and data path

# Change data feed
catalog.table_insertions("users", start_version=1, end_version=5)
catalog.table_deletions("users", start_version=1, end_version=5)
catalog.table_changes("users", start_version=1, end_version=5)

Maintenance

from ducklake_pandas import expire_snapshots, vacuum_ducklake

# Expire old snapshots (metadata cleanup)
expired = expire_snapshots("catalog.ducklake", keep_last_n=10)

# Delete orphaned Parquet files (disk cleanup)
deleted = vacuum_ducklake("catalog.ducklake")

Features

Read path

  • Eager reads via read_ducklake()
  • Column projection via the columns parameter
  • File pruning via column-level min/max statistics and partition values (see the sketch after this list)
  • Time travel by snapshot version or timestamp
  • Delete file handling via Iceberg-compatible positional deletes
  • Schema evolution — ADD COLUMN, DROP COLUMN, RENAME COLUMN all handled transparently
  • Inlined data — small tables stored directly in catalog metadata
  • Partition pruning for identity-transform partitions
  • Column renames — Parquet files written under earlier column names are reconciled transparently
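
As a rough illustration of the min/max pruning idea (this is not the library's internal API, just the principle): a file whose per-column value range cannot satisfy the filter is skipped without ever being opened.

# Hypothetical per-file statistics; the structure here is illustrative only.
files = [
    {"path": "part-0.parquet", "min": {"value": 0.0},  "max": {"value": 9.9}},
    {"path": "part-1.parquet", "min": {"value": 10.0}, "max": {"value": 19.9}},
]

# Filter: value > 12.0. part-0.parquet's max (9.9) can never satisfy it,
# so that file is pruned before any Parquet I/O happens.
candidates = [f for f in files if f["max"]["value"] > 12.0]
assert [f["path"] for f in candidates] == ["part-1.parquet"]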

Write path

  • INSERT — append, overwrite, or error-on-exists modes
  • DELETE — predicate-based row deletion with position-delete files
  • UPDATE — atomic delete + insert in a single snapshot
  • MERGE — upsert with configurable matched/unmatched behavior
  • CREATE TABLE AS — single-snapshot table creation with data
  • Data inlining — small inserts stored as rows in catalog metadata
  • Partitioned writes — Hive-style directory layout per partition key (sketched after this list)
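
A sketch of a partitioned write using the functions documented above, assuming an events table with region and value columns already exists in a catalog with a configured data path; the file names shown under each partition directory are illustrative:

import pandas as pd
from ducklake_pandas import alter_ducklake_set_partitioned_by, write_ducklake

# Partition by the identity of the region column, then append rows.
alter_ducklake_set_partitioned_by("catalog.ducklake", "events", ["region"])
write_ducklake(
    pd.DataFrame({"region": ["us", "eu"], "value": [1.0, 2.0]}),
    "catalog.ducklake", "events", mode="append",
)
# Expected Hive-style layout on disk (illustrative names):
#   data/region=us/<uuid>.parquet
#   data/region=eu/<uuid>.parquet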

DDL

  • CREATE/DROP TABLE with full snapshot versioning
  • ADD/DROP/RENAME COLUMN with schema evolution tracking
  • CREATE/DROP SCHEMA with cascade support
  • RENAME TABLE preserving table identity
  • SET PARTITIONED BY for identity-transform partitioning
  • CREATE/DROP VIEW with OR REPLACE support

Catalog inspection

  • Snapshot history and time travel metadata
  • Per-table storage statistics (file counts, sizes)
  • Data file and delete file listing
  • Schema and table enumeration
  • Key-value catalog options
  • Change data feed — insertions, deletions, and update detection

Maintenance

  • expire_snapshots — remove old snapshot metadata
  • vacuum — delete orphaned Parquet files

Backend support

  • SQLite — via Python stdlib sqlite3 (zero-dependency)
  • PostgreSQL — via psycopg2 (optional extra)
  • Full interoperability with DuckDB's DuckLake extension

DuckDB interoperability

ducklake-pandas produces catalogs that are fully interoperable with DuckDB's DuckLake extension. You can:

  • Create catalogs with DuckDB, read/write with ducklake-pandas
  • Create catalogs with ducklake-pandas, read/query with DuckDB
  • Mix operations freely — both tools read the same metadata format

# Create catalog with DuckDB
import duckdb
con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:catalog.ducklake' AS lake (DATA_PATH 'data/')")
con.execute("CREATE TABLE lake.users (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO lake.users VALUES (1, 'Alice'), (2, 'Bob')")
con.close()

# Read with ducklake-pandas
from ducklake_pandas import read_ducklake
df = read_ducklake("catalog.ducklake", "users")
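
The reverse round trip works the same way: write with ducklake-pandas, then attach and query from DuckDB. For an existing catalog the stored data path is reused, so DATA_PATH can normally be omitted on re-attach:

# Append with ducklake-pandas
import pandas as pd
from ducklake_pandas import write_ducklake
write_ducklake(pd.DataFrame({"id": [3], "name": ["Carol"]}),
               "catalog.ducklake", "users", mode="append")

# Query back with DuckDB
import duckdb
con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:sqlite:catalog.ducklake' AS lake")
print(con.execute("SELECT * FROM lake.users ORDER BY id").fetchdf())
con.close()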

See the DuckDB Interop Guide for detailed interop patterns.

Supported data types

DuckLake / DuckDB type   Pandas type              Notes
TINYINT / int8           Int8
SMALLINT / int16         Int16
INTEGER / int32          Int32
BIGINT / int64           Int64
UTINYINT / uint8         UInt8
USMALLINT / uint16       UInt16
UINTEGER / uint32        UInt32
UBIGINT / uint64         UInt64
FLOAT / float32          Float32
DOUBLE / float64         Float64
BOOLEAN                  bool
VARCHAR                  object (str)
BLOB                     object (bytes)
DATE                     object (date)
TIME / timetz            object (time)
TIMESTAMP                datetime64[us]
TIMESTAMP_MS             datetime64[ms]
TIMESTAMP_NS             datetime64[ns]
TIMESTAMP_S              datetime64[s]
TIMESTAMPTZ              datetime64[us]
DECIMAL(p, s)            object (Decimal)
UUID                     object                   Binary in Parquet
JSON                     object                   Binary in Parquet
HUGEINT                  Int64                    Limited: DuckDB writes as Float64 in Parquet
INTERVAL                 object                   Limited: Pandas Parquet reader limitation
LIST(T)                  object (list)            Recursive nesting supported
STRUCT(...)              object (dict)            Recursive nesting supported
MAP(K, V)                object (list of dicts)   Limited: Pandas Parquet reader issue
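
The capitalized names above (Int8 through Float64) are pandas nullable extension dtypes: missing values are represented as pd.NA rather than forcing the column to upcast to float. A quick plain-pandas illustration:

import pandas as pd

# Nullable extension dtype: integers and missing values coexist.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)  # Int64
print(s[2])     # <NA>

# The default NumPy-backed path upcasts to float64 instead:
print(pd.Series([1, 2, None]).dtype)  # float64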

Architecture

src/ducklake_pandas/
    __init__.py       Public API (all functions and DuckLakeCatalog)
    _backend.py       Backend adapters (SQLite, PostgreSQL)
    _catalog.py       Metadata reader (snapshots, tables, columns, files, stats)
    _catalog_api.py   DuckLakeCatalog inspection class
    _dataset.py       Pandas dataset reader
    _schema.py        DuckLake type -> Pandas type mapping
    _stats.py         Column statistics for file pruning
    _writer.py        Catalog writer (tables, data, DDL, views, maintenance)

See the Architecture Overview for a deeper dive into the internals.

Development

git clone https://github.com/pdet/ducklake-pandas.git
cd ducklake-pandas
pip install -e ".[dev]"

Running tests

pytest                    # Full suite (SQLite backend)
pytest -n auto            # Parallel execution
pytest -k "test_views"    # Specific pattern

# With PostgreSQL backend
DUCKLAKE_PG_DSN="postgresql://user:pass@localhost/testdb" pytest

Test suite: 553 tests (4 xfailed for known DuckDB/Pandas limitations). Tests are parametrized over backends — SQLite always runs; PostgreSQL runs when DUCKLAKE_PG_DSN is set.

Documentation

  • DuckDB Interop Guide — detailed interop patterns
  • Architecture Overview — internals deep dive

License

MIT
