Duckberg
Query your data lakes easily and efficiently
About
Duckberg is a Python package that combines PyIceberg and DuckDB. PyIceberg provides efficient access to Apache Iceberg, a table format for large datasets, while DuckDB provides fast in-memory data analysis. Together they let Duckberg query large Iceberg datasets stored on blob storage. Duckberg aims for fast, memory-efficient processing and a Pythonic interface, so that big data can be queried without an external query engine.
The underlying principle of the Duckberg Python package is to execute your SQL queries only on those data lake files that contain the necessary data for your results. To fully utilize the benefits of this package, it's assumed that your data is partitioned in a manner that suits your query and use case.
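The principle can be illustrated with a toy model in plain Python. This is only a sketch of partition pruning in general, not Duckberg's implementation: the `DataFile` class, the file paths, and the `prune` and `query` helpers are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data lake file."""
    path: str
    payment_type: int  # partition value shared by every row in the file
    rows: tuple        # (trip_distance, tolls_amount) pairs


# Three made-up files, partitioned by payment_type.
FILES = [
    DataFile("taxis/payment_type=1/a.parquet", 1, ((45.0, 12.5), (3.2, 0.0))),
    DataFile("taxis/payment_type=2/b.parquet", 2, ((50.1, 6.0),)),
    DataFile("taxis/payment_type=1/c.parquet", 1, ((41.7, 20.0),)),
]


def prune(files, payment_type):
    """Keep only the files whose partition value matches the filter."""
    return [f for f in files if f.payment_type == payment_type]


def query(files, min_distance):
    """Scan the given files only and sort the matches by tolls, descending."""
    rows = [row for f in files for row in f.rows if row[0] > min_distance]
    return sorted(rows, key=lambda row: row[1], reverse=True)


selected = prune(FILES, payment_type=1)   # the payment_type=2 file is never read
result = query(selected, min_distance=40)
print([f.path for f in selected])
print(result)
```

The query touches two of the three files. With a partitioning scheme that does not match the filter, every file would have to be scanned, which is why the partition layout matters.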
Iceberg catalog types
Duckberg supports the same Iceberg catalogs as PyIceberg, including REST, SQL, Hive, Glue, and DynamoDB. These catalogs are sources of information about Iceberg datasets, tables, partitions, etc. Before using Duckberg, ensure that you have access to an Iceberg catalog that can be utilized.
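As a rough sketch of what configurations for different catalog types might look like, the helper below builds a config dictionary and rejects unknown types. The `make_catalog_config` helper and the Hive metastore address are hypothetical. The `type` values follow PyIceberg's naming, and it is an assumption that Duckberg passes such a dictionary through to PyIceberg unchanged.

```python
from typing import Dict, Optional

# Catalog types named in the text above, written as PyIceberg "type" values.
SUPPORTED_CATALOG_TYPES = {"rest", "sql", "hive", "glue", "dynamodb"}


def make_catalog_config(
    catalog_type: str, properties: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Hypothetical helper: build a catalog config dict and validate its type."""
    if catalog_type not in SUPPORTED_CATALOG_TYPES:
        raise ValueError(f"Unsupported catalog type: {catalog_type!r}")
    return {"type": catalog_type, **(properties or {})}


# The REST URI is the one used in the initialisation example of this README.
rest_config = make_catalog_config("rest", {"uri": "http://iceberg-rest:8181/"})
# The Hive metastore address is a made-up placeholder.
hive_config = make_catalog_config("hive", {"uri": "thrift://hive-metastore:9083"})
# Glue typically relies on ambient AWS credentials, so no extra keys are set here.
glue_config = make_catalog_config("glue")
```

Which additional properties each catalog type needs is defined by PyIceberg, so its configuration documentation is the place to check before relying on any of these keys.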
Installation
pip install duckberg
Examples
This repository contains a Docker Compose environment that uses:
- a REST Iceberg catalog
- MinIO as S3-compatible object storage
- Jupyter to run the examples and experiment with Duckberg easily
The first run can take a while, because it downloads sample data and builds the images.
cd examples
docker-compose up -d
Once all the containers are running, navigate to localhost:8888, select the example you want to run, and enjoy Duckberg!
Features
Easy initialisation
The following initialisation uses the REST Iceberg catalog with Amazon S3 as the Iceberg data storage.
from duckberg import DuckBerg

# S3_ENDPOINT, S3_ACCESS_KEY_ID and S3_SECRET_KEY must be defined before this point
catalog_config: dict[str, str] = {
    "type": "rest",  # Iceberg catalog type
    "uri": "http://iceberg-rest:8181/",  # URL of the Iceberg catalog
    "credentials": "user:password",  # credentials for the Iceberg catalog
    "s3.endpoint": S3_ENDPOINT,  # S3 endpoint URL
    "s3.access-key-id": S3_ACCESS_KEY_ID,  # S3 access key ID
    "s3.secret-access-key": S3_SECRET_KEY,  # S3 secret access key
}

db = DuckBerg(
    catalog_name="warehouse",
    catalog_config=catalog_config,
)
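The S3 endpoint and key placeholders in the snippet above are not defined there. One way to supply them, sketched here under assumptions, is to read them from environment variables. The variable names and the `load_s3_settings` helper are hypothetical and not part of Duckberg.

```python
import os
from typing import Dict, Mapping

# Config key -> environment variable name (the variable names are assumptions).
S3_ENV_VARS = {
    "s3.endpoint": "S3_ENDPOINT",
    "s3.access-key-id": "S3_ACCESS_KEY_ID",
    "s3.secret-access-key": "S3_SECRET_ACCESS_KEY",
}


def load_s3_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the S3 settings from the environment, failing early if any is missing."""
    missing = [var for var in S3_ENV_VARS.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in S3_ENV_VARS.items()}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "S3_ENDPOINT": "http://minio:9000",
    "S3_ACCESS_KEY_ID": "example-key",
    "S3_SECRET_ACCESS_KEY": "example-secret",
}
s3_settings = load_s3_settings(example_env)
print(sorted(s3_settings))
```

The resulting dictionary can be merged into the catalog configuration, which keeps credentials out of the source code.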
Listing tables
db.list_tables()
Listing partitions for a particular table
db.list_partitions(table="nyc.taxis")
Querying data into a pandas DataFrame
query = "SELECT * FROM nyc.taxis WHERE trip_distance > 40 ORDER BY tolls_amount DESC"
df = db.select(table="nyc.taxis", partition_filter="payment_type = 1", sql=query)
Development
TBD ...
License
duckberg is distributed under the terms of the MIT license.