
Lightweight wrapper for reading Delta tables without Spark


Delta Lake Reader

The Delta format, developed by Databricks, is often used to build data lakes.

While it tries to solve many issues with data lakes, one downside is that Delta tables rely on Spark to read the data. If you only need to read a small table, this can introduce a lot of unnecessary overhead.

This package tries to fix this by providing a lightweight Python wrapper around the Delta file format, without any Spark dependencies.

Installation

Install the package using pip:

pip install delta-lake-reader

This installs only the minimal dependencies needed to work with the local file system. To access Delta tables stored in popular cloud storage services, use one of the following commands instead, which also installs the cloud-specific dependencies.

Azure

pip install delta-lake-reader[azure]

Amazon Web Services (AWS)

pip install delta-lake-reader[aws]

Google Cloud Platform (GCP)

pip install delta-lake-reader[gcp]

Usage

The package is built on PyArrow and FSSpec.

This means that you get all the features of PyArrow, like predicate pushdown, partition pruning and easy interoperability with Pandas.

Meanwhile, FSSpec serves as a file-system-agnostic backend that lets you read files from many places, including popular cloud providers.

To read a Delta table, first create a DeltaTable object. Creating the object reads the Delta transaction log to find the current files and get the schema, but it does not read any data. To read the content of the table, call to_table() to get a pyarrow.Table object, or to_pandas() to get a pandas.DataFrame.
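As a rough illustration of what "reading the transaction log to find the current files" involves, the sketch below replays the JSON commit files in a table's _delta_log directory and keeps track of which data files are still live. It is a simplified sketch, not this package's actual implementation: the function name active_files and the helper _write_commit are hypothetical, it only handles a local directory, and it ignores checkpoints, schema metadata and protocol actions.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files."""
    log_dir = Path(table_path) / "_delta_log"
    files = set()
    # Commit files are named with zero-padded version numbers,
    # so sorting by name also sorts by version.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


def _write_commit(table_path, version, actions):
    """Hypothetical helper: write one commit file for the demo below."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(action) for action in actions)
    (log_dir / f"{version:020d}.json").write_text(body)


# Build a tiny two-commit log in a temporary directory and replay it.
demo_table = tempfile.mkdtemp()
_write_commit(demo_table, 0, [
    {"add": {"path": "part-0.parquet"}},
    {"add": {"path": "part-1.parquet"}},
])
_write_commit(demo_table, 1, [
    {"remove": {"path": "part-0.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])
current = active_files(demo_table)
print(current)
```

Only the log is touched here; no Parquet file is opened, which is why creating a DeltaTable object can be cheap even for a large table.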

Local file system

from deltalake import DeltaTable

# native file path. Can be relative or absolute
table_path = "somepath/mytable"

# Get table as pyarrow table
df = DeltaTable(table_path).to_table()

# Get table as pandas dataframe
df = DeltaTable(table_path).to_pandas()

Azure

The Azure integration is based on the adlfs package, developed by the Dask community.

The credential used to authenticate against the storage account can be a SAS token, an access key, or one of the azure.identity classes. See authentication using Azure SDK for more information.

from deltalake import DeltaTable
from adlfs import AzureBlobFileSystem

# example URL: 'abfss://mycontainer@myStorageAccount.dfs.core.windows.net/somepath/mytable'
fs = AzureBlobFileSystem(
    account_name="myStorageAccount",
    credential="...",
)
df = DeltaTable("mycontainer/somepath/mytable", file_system=fs).to_pandas()

Amazon Web Services (AWS)

The AWS integration is based on the s3fs package, developed by the Dask community.

To authenticate, you can either specify the access key and secret or, since s3fs is built on boto, use one of boto's authentication methods. See authentication using AWS SDK for more information.

from deltalake import DeltaTable
from s3fs import S3FileSystem

# example URL: 's3://myBucket/somepath/mytable'
fs = S3FileSystem()  # authenticates using environment variables in this example
df = DeltaTable("myBucket/somepath/mytable", file_system=fs).to_pandas()

Google Cloud Platform (GCP)

The GCP integration is based on the gcsfs package, developed by the Dask community.

For more information about authentication with GCP, see the gcsfs documentation or the GCP documentation.

from deltalake import DeltaTable
from gcsfs import GCSFileSystem

# example URL: 'gs://myBucket/somepath/mytable'
fs = GCSFileSystem()  # authenticates using environment variables in this example
df = DeltaTable("myBucket/somepath/mytable", file_system=fs).to_pandas()
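In all three cloud examples, the path passed to DeltaTable is not the full URL from the comment but the container or bucket name followed by the path inside it. The helper below is a small sketch of that conversion, based only on the examples above; to_fs_path is a hypothetical name and not part of this package, and the filesystem libraries may accept other path forms as well.

```python
from urllib.parse import urlsplit


def to_fs_path(url):
    """Turn a storage URL into the '<bucket or container>/<path>' form.

    Hypothetical helper that mirrors the examples above.
    """
    parts = urlsplit(url)
    # For abfss URLs the network location is 'container@account.dfs...';
    # for s3 and gs URLs it is just the bucket name.
    root = parts.netloc.split("@", 1)[0]
    return f"{root}/{parts.path.lstrip('/')}"


azure_path = to_fs_path(
    "abfss://mycontainer@myStorageAccount.dfs.core.windows.net/somepath/mytable"
)
s3_path = to_fs_path("s3://myBucket/somepath/mytable")
gcs_path = to_fs_path("gs://myBucket/somepath/mytable")
print(azure_path, s3_path, gcs_path)
```

The storage account, credentials and other connection details still go to the filesystem object, as shown in the examples.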

Time travel

One of the features of the Delta format is the ability to do time travel.

This can be done using the as_version method. Note that this currently only supports a specific version number, not a timestamp.

from deltalake import DeltaTable

df = DeltaTable("somepath/mytable").as_version(5).to_pandas()

Time traveling to a version that has been vacuumed currently results in undefined behavior.
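Since as_version takes a version number, it can help to know which versions still have a commit file in the transaction log. The sketch below lists them for a table on the local file system. The function name available_versions is hypothetical and not part of this package. A version listed here may still have had its data files vacuumed, so this narrows the choice but does not guarantee the version is readable.

```python
import re
import tempfile
from pathlib import Path

# Commit files are named as a 20-digit, zero-padded version plus '.json'.
_COMMIT_NAME = re.compile(r"^(\d{20})\.json$")


def available_versions(table_path):
    """Return the version numbers that have a commit file in _delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_NAME.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Demo: a fake log directory with three commits and two non-commit files.
demo_table = tempfile.mkdtemp()
demo_log = Path(demo_table) / "_delta_log"
demo_log.mkdir()
for name in (
    "00000000000000000000.json",
    "00000000000000000001.json",
    "00000000000000000002.json",
    "00000000000000000002.checkpoint.parquet",
    "_last_checkpoint",
):
    (demo_log / name).write_text("")

versions = available_versions(demo_table)
print(versions)
```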

Predicate Pushdown, Partition Pruning & Columnar file formats

Since the resulting DeltaTable is based on pyarrow.dataset.Dataset, you get many useful features for free.

DeltaTable.to_table is inherited from pyarrow.dataset.Dataset.to_table. This means that you can pass arguments like filter, which will do partition pruning and predicate pushdown. If you have a partitioned dataset, partition pruning can substantially reduce the amount of data that needs to be downloaded. Predicate pushdown will not affect the amount of data downloaded, but it will reduce the size of the dataset once loaded into memory.

Furthermore, since the underlying Parquet file format is columnar, you can select a subset of columns to be read from the files. This can be done by passing a list of column names to to_table.

See the documentation of to_pandas and to_table for all arguments.

import pyarrow.dataset as ds
from deltalake import DeltaTable

# Predicate pushdown.
# If the table is partitioned on age, this will also do partition pruning
df = DeltaTable("...").to_table(filter=ds.field("age") >= 18).to_pandas()

# Only load a subset of columns
df = DeltaTable("...").to_table(columns=["age", "name"]).to_pandas()

Read more about filtering data using PyArrow
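To make the idea of partition pruning concrete, the sketch below decides which files can be skipped by looking only at partition values, before any file is downloaded. It is a conceptual illustration, not how PyArrow or this package implements pruning: it assumes hive-style key=value directory names in the file paths, and the function names partition_values and prune are hypothetical.

```python
def partition_values(path):
    """Parse hive-style 'key=value' directory segments from a file path."""
    values = {}
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


def prune(paths, column, predicate):
    """Keep only files whose partition value for `column` passes `predicate`."""
    kept = []
    for path in paths:
        value = partition_values(path).get(column)
        # A file without that partition column cannot be skipped safely.
        if value is None or predicate(value):
            kept.append(path)
    return kept


files = [
    "age=12/part-0.parquet",
    "age=18/part-1.parquet",
    "age=40/part-2.parquet",
]
adults = prune(files, "age", lambda value: int(value) >= 18)
print(adults)
```

Here one of the three files is never touched. Rows inside the remaining files still have to be filtered after reading, which is the part handled by predicate pushdown.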

Bring Your Own Filesystem

Since the implementation uses FSSpec for filesystem abstraction, you can in principle use any FSSpec filesystem. See more about available FSSpec interfaces.

fs = SomeFSSpecFilesystem()
df = DeltaTable(path, file_system=fs).to_pandas()

Disclaimer

Databricks recently announced a standalone reader for Delta tables in a blog post. The standalone reader is JVM-based, but a Rust library with Python bindings is also mentioned. That library, however, cannot be installed with pip, which may discourage many Python developers. Although the idea for this library arose independently, some inspiration has been taken from the Rust library.

Read more

Delta transaction log

PyArrow Documentation

FSSpec Documentation
