
Dask-DeltaTable

Reading and writing to Delta Lake using Dask engine.

Installation

dask-deltatable is available on PyPI:

pip install dask-deltatable

And on conda-forge:

conda install -c conda-forge dask-deltatable

Features:

  1. Read the parquet files from Delta Lake and parallelize with Dask
  2. Write Dask dataframes to Delta Lake (limited support)
  3. Supports multiple filesystems (s3, azurefs, gcsfs)
  4. Subset of Delta Lake features:
    • Time Travel
    • Schema evolution
    • Parquet filters
      • row filter
      • partition filter

Not supported

  1. Writing to Delta Lake is still in development.
  2. The optimize API for running a bin-packing operation on a Delta Table.

Reading from Delta Lake

import dask_deltatable as ddt

# read delta table
df = ddt.read_deltalake("delta_path")

# with specific version
df = ddt.read_deltalake("delta_path", version=3)

# with specific datetime
df = ddt.read_deltalake("delta_path", datetime="2018-12-19T16:39:57-08:00")

df is a Dask DataFrame that you can work with in the same way you normally would. See the Dask DataFrame documentation for available operations.

Accessing remote file systems

To read from S3, Azure, GCS, and other remote filesystems, ensure the credentials are properly configured in environment variables or config files. For AWS you may need ~/.aws/credentials; for gcsfs, set GOOGLE_APPLICATION_CREDENTIALS. Refer to your cloud provider's documentation for how to configure these.

ddt.read_deltalake("s3://bucket_name/delta_path", version=3)

Accessing AWS Glue catalog

dask-deltatable can connect to AWS Glue catalog to read the delta table. The method will look for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and if those are not available, fall back to ~/.aws/credentials.

Example:

ddt.read_deltalake(catalog="glue", database_name="science", table_name="physics")

Accessing Unity catalog

dask-deltatable can connect to the Unity catalog to read the delta table. The method will look for the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables and, if those are not set, fall back to keyword arguments of the same names in lowercase.

Example:

ddt.read_unity_catalog(
    catalog_name="projects",
    schema_name="science",
    table_name="physics"
)
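As an alternative to exporting the variables in the shell, they can be set from Python before the call, or passed as the lowercase kwargs mentioned above. The host and token below are hypothetical placeholders, not real credentials:

```python
import os

# The reader checks these environment variables first; the values here
# are placeholders standing in for a real workspace URL and token.
os.environ["DATABRICKS_HOST"] = "https://example.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "dapi-example-token"

# Equivalently, pass them as lowercase keyword arguments:
# ddt.read_unity_catalog(..., databricks_host=..., databricks_token=...)
```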

Writing to Delta Lake

To write a Dask dataframe to Delta Lake, use the to_deltalake method.

import dask.dataframe as dd
import dask_deltatable as ddt

df = dd.read_csv("s3://bucket_name/data.csv")
# do some processing on the dataframe...
ddt.to_deltalake("s3://bucket_name/delta_path", df)

Writing to Delta Lake is still in development, so be aware that some features may not work.
