DuckBerg
query your iceberg data easily and efficiently
About
Duckberg is a Python package that combines PyIceberg and DuckDB. PyIceberg provides efficient access to Apache Iceberg, a table format for large datasets, while DuckDB offers fast in-memory data analysis. Together they form Duckberg, which gives you a simple, Pythonic way to query large Iceberg datasets stored on blob storage.
The underlying principle of the Duckberg Python package is to execute your SQL queries only on those data lake files that contain the necessary data for your results. To fully utilize the benefits of this package, it's assumed that your data is partitioned in a manner that suits your query and use case.
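The principle above can be illustrated with a minimal sketch. It does not use Duckberg's internals; `DataFile` and `prune_files` are hypothetical names introduced here, and the file paths are made up. It only shows why partitioning matters: files whose partition values cannot match the filter are never read.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one data lake file and its partition values."""
    path: str
    partition: dict


def prune_files(files, partition_filter):
    """Keep only the files whose partition values match every key in the filter."""
    return [
        f for f in files
        if all(f.partition.get(key) == value for key, value in partition_filter.items())
    ]


# Illustrative file listing (assumed layout, not real data).
files = [
    DataFile("s3://warehouse/nyc/taxis/payment_type=1/a.parquet", {"payment_type": 1}),
    DataFile("s3://warehouse/nyc/taxis/payment_type=2/b.parquet", {"payment_type": 2}),
    DataFile("s3://warehouse/nyc/taxis/payment_type=1/c.parquet", {"payment_type": 1}),
]

# Only the two files in the payment_type=1 partition would need to be scanned.
selected = prune_files(files, {"payment_type": 1})
print([f.path for f in selected])
```

If the table were not partitioned by `payment_type`, every file would have to be scanned, which is why the partitioning scheme should follow your query patterns.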
Iceberg catalog types
Duckberg supports the same Iceberg catalogs as PyIceberg, including REST, SQL, Hive, Glue, and DynamoDB. These catalogs are sources of information about Iceberg datasets, tables, partitions, etc. Before using Duckberg, ensure that you have access to an Iceberg catalog that can be utilized.
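As a small sketch of how a catalog configuration might be assembled and sanity-checked before it is handed to Duckberg: `make_catalog_config` and `SUPPORTED_CATALOG_TYPES` are hypothetical helpers, not part of Duckberg or PyIceberg, and the set of types simply mirrors the list in the paragraph above.

```python
from typing import Optional

# Catalog types named in this README; this set is an illustrative assumption.
SUPPORTED_CATALOG_TYPES = {"rest", "sql", "hive", "glue", "dynamodb"}


def make_catalog_config(catalog_type: str, properties: Optional[dict] = None) -> dict:
    """Build a config dict and fail early on an unknown catalog type."""
    normalized = catalog_type.strip().lower()
    if normalized not in SUPPORTED_CATALOG_TYPES:
        raise ValueError(
            f"Unsupported catalog type {catalog_type!r}; "
            f"expected one of {sorted(SUPPORTED_CATALOG_TYPES)}"
        )
    # The "type" key selects the catalog; remaining keys are passed through as-is.
    return {"type": normalized, **(properties or {})}


rest_config = make_catalog_config("REST", {"uri": "http://iceberg-rest:8181/"})
print(rest_config)
```

Validating the type up front gives a clear error message instead of a failure deep inside the catalog client.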
Installation
pip install duckberg
Features
Easy initialisation
The following initialisation uses the REST Iceberg catalog with Amazon S3 as the Iceberg data storage.
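from duckberg import DuckBerg

# S3_ENDPOINT, S3_ACCESS_KEY_ID and S3_SECRET_KEY are placeholders;
# define them for your own environment before running this snippet.
catalog_config: dict[str, str] = {
    "type": "rest",  # Iceberg catalog type
    "uri": "http://iceberg-rest:8181/",  # URL of the Iceberg catalog
    "credentials": "user:password",  # credentials for the Iceberg catalog
    "s3.endpoint": S3_ENDPOINT,  # S3 endpoint and access keys
    "s3.access-key-id": S3_ACCESS_KEY_ID,
    "s3.secret-access-key": S3_SECRET_KEY,
}

# Create the Duckberg entry point for the "warehouse" catalog.
db = DuckBerg(
    catalog_name="warehouse",
    catalog_config=catalog_config,
)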
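The S3 values in the configuration above are placeholders. One possible way to supply them, sketched here under the assumption that they live in environment variables named `S3_ENDPOINT`, `S3_ACCESS_KEY_ID` and `S3_SECRET_KEY`, is a small helper. `s3_settings_from_env` is a hypothetical function, not part of Duckberg.

```python
import os
from typing import Mapping


def s3_settings_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Map assumed environment variable names to the S3 keys used in catalog_config."""
    required = {
        "s3.endpoint": "S3_ENDPOINT",
        "s3.access-key-id": "S3_ACCESS_KEY_ID",
        "s3.secret-access-key": "S3_SECRET_KEY",
    }
    missing = [var for var in required.values() if var not in env]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


# Demo with an explicit mapping (made-up values) so the sketch runs anywhere.
demo_env = {
    "S3_ENDPOINT": "http://minio:9000",
    "S3_ACCESS_KEY_ID": "example-key",
    "S3_SECRET_KEY": "example-secret",
}
catalog_config = {
    "type": "rest",
    "uri": "http://iceberg-rest:8181/",
    **s3_settings_from_env(demo_env),
}
print(sorted(catalog_config))
```

In real use you would call `s3_settings_from_env()` without arguments so the values come from the process environment and stay out of source control.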
Listing tables
db.list_tables()
Listing partitions for a particular table
db.list_partitions(table="nyc.taxis")
Querying data into a Pandas DataFrame
The latest update adds a simple, deliberately crude SQL parser that extracts the necessary information from the SQL query, so you no longer need to specify table and partition_filter separately. This is the new and preferred way:
query = "SELECT * FROM nyc.taxis WHERE payment_type = 1 AND trip_distance > 40 ORDER BY tolls_amount DESC"
df = db.select(sql=query).read_pandas()
The old way of selecting data (to be deprecated in a future release):
query = "SELECT * FROM nyc.taxis WHERE trip_distance > 40 ORDER BY tolls_amount DESC"
df = db.select(sql=query, table="nyc.taxis", partition_filter="payment_type = 1").read_pandas()
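To make the difference between the two call styles concrete, here is a rough sketch of how a table name and candidate filters could be pulled out of a query string. This is not Duckberg's actual parser; `extract_table_and_filters` is a hypothetical function built on regular expressions, and it only handles simple queries whose WHERE conditions are joined by AND.

```python
import re


def extract_table_and_filters(sql: str) -> tuple:
    """Return (table_name, list_of_where_conditions) for a simple SELECT query."""
    table_match = re.search(r"\bFROM\s+([\w.]+)", sql, flags=re.IGNORECASE)
    if table_match is None:
        raise ValueError("No FROM clause found")

    # Capture everything after WHERE up to the next clause or the end of the query.
    where_match = re.search(
        r"\bWHERE\s+(.*?)(?=\bGROUP\s+BY\b|\bORDER\s+BY\b|\bLIMIT\b|;|$)",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    conditions = []
    if where_match:
        conditions = [
            part.strip()
            for part in re.split(r"\s+AND\s+", where_match.group(1), flags=re.IGNORECASE)
            if part.strip()
        ]
    return table_match.group(1), conditions


query = (
    "SELECT * FROM nyc.taxis WHERE payment_type = 1 "
    "AND trip_distance > 40 ORDER BY tolls_amount DESC"
)
print(extract_table_and_filters(query))
# ('nyc.taxis', ['payment_type = 1', 'trip_distance > 40'])
```

A condition on a partition column, such as `payment_type = 1`, is the kind of information that previously had to be passed by hand as `partition_filter`.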
Playground
You can start the playground environment by running Docker Compose in the playground directory:
cd playground
docker-compose up -d
The initial run may take extra time while the Jupyter Docker image is built.
Iceberg data init
Once all the containers are running, run the Spark Iceberg Jupyter notebook, which initialises the Iceberg data and catalog.
Duckberg playground
Navigate to localhost:8888, select the example Jupyter notebook you want to run, and enjoy Duckberg!
Development
Python 3.10 is recommended for development. If you manage your Python versions with pyenv, use
pyenv install 3.10.13
pyenv global 3.10.13
Then create and activate a virtual environment:
python -m venv venv
source venv/bin/activate
Upgrade pip and install the dependencies:
pip install --upgrade pip
pip install .
Then start the Docker containers that provide the Iceberg catalog and the file storage holding the Iceberg files:
cd playground
docker-compose up -d
Initialise the data by running the init Jupyter notebook, then run and test Duckberg with the file tests/duckberg-sample.py.
Style & Formatting
Use the following Hatch commands to format the code and check its style:
hatch run lint:fmt
hatch run lint:style
Building package
The Duckberg project is managed with Hatch. Follow the Hatch docs for installation, or install it directly with
brew install hatch
or
pip install hatch
Bump the package version with
hatch version "x.x.x"
Build
hatch build
and publish
hatch publish
License
duckberg is distributed under the terms of the Apache 2.0 license.