Duckberg
Query your data lakes easily and efficiently

About

Duckberg is a Python package that combines PyIceberg and DuckDB. PyIceberg provides access to Apache Iceberg, a table format for large analytic datasets, while DuckDB provides fast in-process SQL analytics. Together they let you query large Iceberg datasets stored on blob storage directly from Python, without an external query engine. Duckberg aims for fast, memory-efficient processing and a user-friendly Pythonic interface.

The underlying principle of Duckberg is to execute your SQL queries only on those data lake files that contain the data needed for your results. To get the full benefit of this package, your data should be partitioned in a way that suits your queries and use case.
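The idea of running a query only on the relevant files can be illustrated with a small, purely conceptual sketch. It is not Duckberg's implementation: the `DataFile` class, the `prune` and `long_trips` helpers, the file paths, and the sample rows are all hypothetical, and plain Python stands in for the catalog lookup and the SQL engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """A data lake file plus the partition value the catalog records for it."""
    path: str
    payment_type: int  # hypothetical partition column
    rows: tuple        # (trip_distance, tolls_amount) pairs stored in the file


def prune(files, payment_type):
    """Keep only the files whose partition value matches the filter."""
    return [f for f in files if f.payment_type == payment_type]


def long_trips(files, min_distance):
    """Scan the given files; return long trips ordered by tolls, highest first."""
    rows = [row for f in files for row in f.rows if row[0] > min_distance]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample files, one partition directory per payment type.
files = [
    DataFile("s3://warehouse/nyc/taxis/payment_type=1/a.parquet", 1, ((45.0, 12.5), (3.2, 0.0))),
    DataFile("s3://warehouse/nyc/taxis/payment_type=1/b.parquet", 1, ((61.3, 20.0),)),
    DataFile("s3://warehouse/nyc/taxis/payment_type=2/c.parquet", 2, ((50.0, 6.0),)),
]

selected = prune(files, payment_type=1)        # partition metadata only, no file is read
result = long_trips(selected, min_distance=40)  # only the selected files are scanned

print(f"{len(selected)} of {len(files)} files scanned")
print(result)
```

The file for `payment_type=2` is never opened, which is why a partitioning scheme that matches the filters you use most often matters.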

Iceberg catalog types

Duckberg supports the same Iceberg catalogs as PyIceberg, including REST, SQL, Hive, Glue, and DynamoDB. These catalogs are sources of information about Iceberg datasets, tables, partitions, etc. Before using Duckberg, ensure that you have access to an Iceberg catalog that can be utilized.
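As a rough sketch, the catalog type is selected through the `type` key of the configuration dictionary, using PyIceberg's catalog property names. The URIs below are placeholders, the `config_for` helper is hypothetical, and real deployments usually need further properties (credentials, warehouse location, region) that depend on your environment.

```python
# Minimal PyIceberg-style catalog properties per catalog type.
# Every URI here is a placeholder, not a value from the Duckberg project.
CATALOG_CONFIGS = {
    "rest": {"type": "rest", "uri": "http://localhost:8181/"},
    "sql": {"type": "sql", "uri": "postgresql+psycopg2://user:password@localhost/iceberg"},
    "hive": {"type": "hive", "uri": "thrift://localhost:9083"},
    "glue": {"type": "glue"},          # AWS credentials come from the environment
    "dynamodb": {"type": "dynamodb"},  # AWS credentials come from the environment
}


def config_for(kind: str) -> dict:
    """Return a copy of the sample configuration for the given catalog type."""
    try:
        return dict(CATALOG_CONFIGS[kind])
    except KeyError:
        raise ValueError(f"unsupported catalog type: {kind!r}") from None


# With Duckberg installed, the result would be passed on like this:
# db = DuckBerg(catalog_name="warehouse", catalog_config=config_for("rest"))
print(config_for("hive"))
```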

Installation

pip install duckberg

Examples

This repository contains a Docker Compose environment that uses:

  • a REST Iceberg catalog
  • MinIO as S3-compatible object storage
  • Jupyter to run the examples and experiment with Duckberg easily

The first run may take a while because it downloads sample data and builds the images.

cd examples
docker-compose up -d

Once all the containers are running, navigate to localhost:8888, select the example you want to run, and enjoy Duckberg!
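If you would rather script the wait than refresh the browser, a small helper along these lines can poll until the Jupyter port accepts connections. The `wait_for_port` function is a hypothetical convenience and not part of Duckberg; it only checks that something is listening on the port, not that the notebooks have loaded.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call for the Jupyter container from the compose environment:
# wait_for_port("localhost", 8888)
```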

Features

Easy initialisation

The following initialisation uses the REST Iceberg catalog with Amazon S3 as the Iceberg data storage.

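import os

from duckberg import DuckBerg

# S3 settings are read from environment variables here; the variable names
# are placeholders, so substitute whatever your deployment uses.
S3_ENDPOINT = os.environ["S3_ENDPOINT"]
S3_ACCESS_KEY_ID = os.environ["S3_ACCESS_KEY_ID"]
S3_SECRET_KEY = os.environ["S3_SECRET_KEY"]

catalog_config: dict[str, str] = {
  "type": "rest", # Iceberg catalog type
  "uri": "http://iceberg-rest:8181/", # URL of the Iceberg catalog
  "credentials": "user:password", # credentials for the Iceberg catalog
  "s3.endpoint": S3_ENDPOINT, # S3 endpoint URL
  "s3.access-key-id": S3_ACCESS_KEY_ID, # S3 access key ID
  "s3.secret-access-key": S3_SECRET_KEY # S3 secret access key
}

db = DuckBerg(
     catalog_name="warehouse",
     catalog_config=catalog_config)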

Listing tables

db.list_tables()

Listing partitions for a particular table

db.list_partitions(table="nyc.taxis")

Querying data into a pandas DataFrame

query = "SELECT * FROM nyc.taxis WHERE trip_distance > 40 ORDER BY tolls_amount DESC"
df = db.select(table="nyc.taxis", partition_filter="payment_type = 1", sql=query)

Development

TBD ...

License

duckberg is distributed under the terms of the MIT license.
