
deltadask

A connector for reading Delta Lake tables into Dask DataFrames.

Install with pip install deltadask.

Read a Delta Lake table into a Dask DataFrame as follows:

import deltadask

ddf = deltadask.read_delta("path/to/delta/table")

It's a lot more efficient for Dask to read a Delta table than a plain Parquet data lake. Parquet tables are of course faster than CSV tables, as explained in this video, and Delta tables are the next big performance improvement for Dask users.

Basic usage

Suppose you have a Delta table with the following three versions.

[Diagram: a Delta table with three versions]

Here's how to read the latest version of the Delta table:

deltadask.read_delta("path/to/delta/table").compute()
   id
0   7
1   8
2   9

And here's how to read version 1 of the Delta table:

deltadask.read_delta("path/to/delta/table", version=1).compute()
   id
0   0
1   1
2   2
0   4
1   5

Delta Lake makes it easy to time travel between different versions of a Delta table with Dask.

See this notebook for a full working example, including an environment file, so you can replicate this on your machine.

Why Delta Lake is better than Parquet for Dask

A Delta table stores data in Parquet files and metadata in a transaction log. The metadata includes the schema and the location of the files.

[Diagram: Delta table architecture, Parquet data files plus a transaction log]

A Dask Parquet data lake can be stored in two different ways.

  1. Parquet files with a single metadata file
  2. Parquet files without a metadata file

Parquet files with a single metadata file are bad because the metadata file becomes a scaling bottleneck: it grows with the number of Parquet files and must be rewritten whenever the table changes.

Parquet files without a metadata file require expensive file listing / Parquet footer queries to build the overall metadata statistics for the data lake. It's a lot faster to fetch this information from the Delta transaction log.

Delta Lake is better because the transaction log is scalable and can be queried much faster than listing files in a plain Parquet data lake.
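For a concrete sense of how cheap this metadata fetch is, here's a minimal sketch using the deltalake Python bindings to delta-rs (the library this connector builds on); the table path is the same placeholder used above:

import deltalake

# Opening the table only reads the transaction log, not the Parquet files.
dt = deltalake.DeltaTable("path/to/delta/table")

print(dt.version())  # current table version
print(dt.schema())   # schema comes from the log; no Parquet footers are read
print(dt.files())    # exact list of Parquet files that make up this version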

How to make reads faster

You can make Delta Lake queries faster by using column projection and predicate pushdown filtering. These tactics allow you to send less data to the cluster.

Here's an example of how to query a Delta table with Dask and take advantage of column pruning and predicate pushdown filtering:

ddf = deltadask.read_delta(
    "path/to/delta/table", 
    columns=["col1"], # column pruning
    filters=[[('col1', '==', 0)]] # predicate pushdown
)

This query only sends col1 to the cluster and none of the other columns (column projection).

This query also uses the transaction log to identify files that may contain data with col1 equal to zero. If a file contains no matching data, then it won't be read. Depending on how the data is organized, a lot of files can be skipped. You can skip even more files by partitioning or Z Ordering the data.
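To see why file skipping works, you can inspect the per-file column statistics that Delta Lake records in the transaction log. Here's a minimal sketch with the deltalake bindings; note that get_add_actions requires a recent delta-rs release, and the min.col1 / max.col1 column names are illustrative:

from deltalake import DeltaTable

dt = DeltaTable("path/to/delta/table")

# Each "add" action in the log describes one Parquet file, including
# per-column min/max statistics. A reader can skip any file whose
# [min, max] range for col1 cannot contain the filter value.
actions = dt.get_add_actions(flatten=True).to_pandas()
print(actions[["path", "min.col1", "max.col1"]])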

How this library is built

The delta-rs library makes it really easy to build the deltadask connector.

All the transaction log parsing logic is handled by delta-rs. You can just plug into its APIs to build the Dask connector.
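In rough terms, the connector only needs to ask delta-rs which Parquet files make up the requested table version and hand that list to dask.dataframe.read_parquet. Here's a minimal sketch of the idea (a simplified illustration, not the library's actual source):

import dask.dataframe as dd
from deltalake import DeltaTable

def read_delta(path, version=None, columns=None, filters=None):
    """Simplified sketch of a Delta Lake reader for Dask."""
    dt = DeltaTable(path)
    if version is not None:
        dt.load_version(version)  # time travel via the transaction log
    # delta-rs returns the exact file list; no directory listing needed.
    return dd.read_parquet(
        dt.file_uris(), columns=columns, filters=filters
    )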

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deltadask-0.2.0.tar.gz (2.5 kB)

Uploaded Source

Built Distribution

deltadask-0.2.0-py3-none-any.whl (2.9 kB)

Uploaded Python 3

File details

Details for the file deltadask-0.2.0.tar.gz.

File metadata

  • Download URL: deltadask-0.2.0.tar.gz
  • Upload date:
  • Size: 2.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.9.5 Darwin/20.3.0

File hashes

Hashes for deltadask-0.2.0.tar.gz:

  • SHA256: 865114e451cfbc9d4bdefca4fff3175c896799552b1fc7b2f685247240058377
  • MD5: 6482258a89e21e194dba6b6dd3055ae7
  • BLAKE2b-256: dfb525020812028e65722cca75a974eb1b91f0ec2d927d235c4f42de2798814f

See more details on using hashes here.

File details

Details for the file deltadask-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: deltadask-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 2.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.2 CPython/3.9.5 Darwin/20.3.0

File hashes

Hashes for deltadask-0.2.0-py3-none-any.whl:

  • SHA256: 54947641faa542e8f2e2c7722729e1e975f33d71af8b467a627a74c61c3162ca
  • MD5: 1417be1601c0fbfeb0b5006b5a5c31d8
  • BLAKE2b-256: cf9e56468491d7b206af295eb55a72b420df801e46a26faf5c6d6e4a51018fcf

See more details on using hashes here.
