A convenience wrapper around PyIceberg for simplified data loading into Apache Iceberg tables
iceberg-loader
A convenience wrapper around PyIceberg that simplifies data loading into Apache Iceberg tables. PyArrow-first, handles messy JSON, schema evolution, idempotent replace, upsert, batching, and streaming out of the box.
Status: Actively developed and under testing. PRs are welcome! Currently tested against Hive Metastore; REST Catalog support is planned.
Why iceberg-loader?
- Messy JSON friendly: auto-serializes dict/list/mixed fields to strings so writes don't fail.
- Schema evolution: add columns on the fly (opt-in), preserves field IDs.
- Safe writes: append/overwrite, idempotent replace via `replace_filter`, upsert.
- Stream friendly: commit intervals, batches, IPC streams.
- Single config: `LoaderConfig` sets defaults; override per-call if needed.
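To make the "messy JSON friendly" point concrete, here is a minimal standard-library sketch of the idea: nested dict/list values are serialized to JSON strings so a column ends up with one consistent type. The helper name `stringify_nested` is hypothetical and this is not the library's actual implementation.

```python
import json
from typing import Any


def stringify_nested(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of record with dict/list values serialized to JSON strings.

    Hypothetical helper for illustration only; iceberg-loader's own
    serialization logic may differ.
    """
    return {
        key: json.dumps(value, sort_keys=True) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }


rows = [
    {"id": 1, "complex_field": {"a": 1, "b": "nested"}},
    {"id": 2, "complex_field": [1, 2, 3]},
]

# After this step every complex_field value is a str, whatever shape it had before.
clean_rows = [stringify_nested(row) for row in rows]
```

Scalar values such as `id` pass through unchanged; only containers are turned into strings.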
Install
pip install "iceberg-loader[all]"
Or with uv:
uv add "iceberg-loader[all]"
Quickstart
from pyiceberg.catalog import load_catalog
from iceberg_loader import LoaderConfig, load_data_to_iceberg
from iceberg_loader.utils.arrow import create_arrow_table_from_data
catalog = load_catalog("default")
table_id = ("default", "comparison_complex_json")
data = [
{"id": 1, "complex_field": {"a": 1, "b": "nested"}, "signup_date": "2023-01-01"},
{"id": 2, "complex_field": {"a": 2, "b": "another", "c": [1, 2]}, "signup_date": "2023-01-02"},
{"id": 3, "complex_field": [1, 2, 3], "signup_date": "2023-01-02"},
]
arrow_table = create_arrow_table_from_data(data)
config = LoaderConfig(write_mode="append", partition_col="day(signup_date)", schema_evolution=True)
load_data_to_iceberg(arrow_table, table_id, catalog, config=config)
Which function to use?
| Function | Use when... | Input Format |
|---|---|---|
| `load_data_to_iceberg` | You have a single pa.Table in memory. | pyarrow.Table |
| `load_batches_to_iceberg` | You have a generator/iterator of batches (memory efficient). | Iterator of pyarrow.RecordBatch |
| `load_ipc_stream_to_iceberg` | You are reading from an Arrow IPC stream file/socket. | File-like object or path |
Preparing Data
Use helpers to convert Python dictionaries to Arrow format (handling messy types automatically):
from iceberg_loader.utils.arrow import create_arrow_table_from_data, create_record_batches_from_dicts
# 1. Convert list of dicts -> pa.Table
arrow_table = create_arrow_table_from_data(data_list)
# 2. Convert iterator of dicts -> Iterator[pa.RecordBatch]
batches = create_record_batches_from_dicts(data_generator(), batch_size=10000)
Alternatively, use standard PyArrow conversion: pa.Table.from_pylist(data).
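If it helps to see what a `batch_size` argument implies, the following standard-library sketch groups an iterator of dicts into fixed-size chunks without materializing the whole input. The `chunked` helper is hypothetical; it only illustrates the batching idea and does not produce Arrow record batches.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def chunked(records: Iterable[dict[str, Any]], batch_size: int) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of at most batch_size records, consuming the input lazily.

    Hypothetical helper; the library's own batching may work differently.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def data_generator() -> Iterator[dict[str, Any]]:
    """Stand-in data source producing seven small records."""
    for i in range(7):
        yield {"id": i}


# Only one chunk is held in memory at a time; the last chunk may be shorter.
batch_sizes = [len(chunk) for chunk in chunked(data_generator(), batch_size=3)]
```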
Public API & Stability
- Public surface: `LoaderConfig`, `load_data_to_iceberg`, `load_batches_to_iceberg`, `load_ipc_stream_to_iceberg`.
- Everything else is internal and may change without notice; always pass options via `LoaderConfig`.
- Avoid legacy positional arguments; use the `config` parameter only.
- LoaderConfig validates partition expressions and rejects unsafe combos (e.g., `replace_filter` with `upsert`, identity partition on `_load_dttm`).
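The kind of validation described in the last bullet can be sketched roughly as follows. This is an illustration under assumptions, not the real `LoaderConfig`: the class `SketchConfig`, the `write_mode="upsert"` spelling, and the accepted transform names are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar: either a bare column name or a time transform around one column.
_TRANSFORM_RE = re.compile(r"^(year|month|day|hour)\((\w+)\)$")
_IDENTIFIER_RE = re.compile(r"^\w+$")


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for LoaderConfig; field names are assumptions."""

    write_mode: str = "append"
    partition_col: Optional[str] = None
    replace_filter: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject the unsafe combination named in the text above.
        if self.write_mode == "upsert" and self.replace_filter is not None:
            raise ValueError("replace_filter cannot be combined with upsert")
        if self.partition_col is None:
            return
        # Identity partitioning on the load timestamp would create one partition per load.
        if self.partition_col == "_load_dttm":
            raise ValueError("use a transform such as day(_load_dttm) instead of identity")
        if not (_TRANSFORM_RE.match(self.partition_col) or _IDENTIFIER_RE.match(self.partition_col)):
            raise ValueError(f"invalid partition expression: {self.partition_col!r}")


def is_valid(**options: Optional[str]) -> bool:
    """Return True if the options pass the sketch validation, False otherwise."""
    try:
        SketchConfig(**options)
    except ValueError:
        return False
    return True
```

Validating in the constructor means a bad configuration fails before any data is written.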
How we version
- Semantic Versioning starting at `0.1.x`: MINOR for compatible features, PATCH for fixes, MAJOR for breaking API changes.
- Breaking changes only happen on the public surface noted above.
- Prefer partition transforms for timestamps (`day(ts)`, `hour(ts)`), especially when using `load_timestamp`.
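To show why transforms suit timestamps, here is a small sketch of what the day and hour transforms compute: whole days or hours since the Unix epoch, so many distinct timestamps collapse into one partition value. The function names are hypothetical; in practice the table format applies the transform for you.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Whole days between the Unix epoch and ts (ts must be timezone-aware)."""
    return (ts - EPOCH).days


def hour_transform(ts: datetime) -> int:
    """Whole hours between the Unix epoch and ts (ts must be timezone-aware)."""
    return int((ts - EPOCH).total_seconds() // 3600)


morning = datetime(2023, 1, 1, 5, 30, tzinfo=timezone.utc)
evening = datetime(2023, 1, 1, 21, 45, tzinfo=timezone.utc)

# Both timestamps fall on the same day, so they share one day partition,
# whereas identity partitioning would give each distinct timestamp its own.
same_day_partition = day_transform(morning) == day_transform(evening)
```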
Release checklist
- Bump version in `pyproject.toml` and `src/iceberg_loader/__about__.py` (they must match).
- Update `RELEASE.md` with highlights and breaking notes.
- Run `uv lock --locked` and commit `uv.lock` if it changes.
- Run `uv run ruff check .`, `uv run ty check`, and `uv run python -m pytest`.
- Tag and push (`git tag -a vX.Y.Z ...`), then let CI publish.
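The "they must match" step in the checklist above could be automated with a check along these lines. It is a sketch that assumes `pyproject.toml` carries a static `version = "..."` line and `__about__.py` defines `__version__ = "..."`; the helper names are hypothetical and not part of the project.

```python
import re
from typing import Optional

# Matches lines such as: version = "0.1.3"  or  __version__ = '0.1.3'
_VERSION_RE = re.compile(
    r'^\s*(?:__version__|version)\s*=\s*["\']([^"\']+)["\']',
    re.MULTILINE,
)


def extract_version(text: str) -> Optional[str]:
    """Return the first version string found in text, or None if absent."""
    match = _VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_match(pyproject_text: str, about_text: str) -> bool:
    """True only when both files declare a version and the two are identical."""
    pyproject_version = extract_version(pyproject_text)
    about_version = extract_version(about_text)
    return pyproject_version is not None and pyproject_version == about_version
```

In a release script, the two arguments would be the contents of the files named in the first checklist item, read with `pathlib.Path.read_text()`.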
Contributing
We welcome contributions! See CONTRIBUTING.md for setup, coding style, and PR guidelines.
uv run ruff check . && uv run ruff format --check . && uv run ty check
uv run pytest
Contributors
Thanks to all contributors who have helped make this project better!
Made with contrib.rocks.
License
iceberg-loader is distributed under the terms of the MIT license.