
Lightweight wrapper for reading Delta tables without Spark

Delta Lake Reader

The Delta format, developed by Databricks, is often used to build data lakes.

While it tries to solve many issues with data lakes, one of its downsides is that reading a Delta table traditionally requires Spark. If you only need to read a small table, this introduces a lot of unnecessary overhead.

This package addresses that by providing a lightweight Python wrapper around the Delta file format.

Usage

The package currently supports only the local file system and Azure Blob Storage, but should be easy to extend to AWS and GCP in the future. The main entry point is the DeltaReader class, which derives the underlying file system from the input URL.

When the class is instantiated, it parses the transaction log files to find the data files that make up the newest table version. It does not, however, read any data until you call the to_pyarrow or to_pandas methods.
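To make the lazy behavior concrete, here is a rough, unofficial sketch of what that log replay amounts to. It assumes a local table directory, ignores checkpoint files, and active_files is a hypothetical helper, not part of the package:

import json
from pathlib import Path

def active_files(table_path):
    # Replay the JSON commits in _delta_log, in version order, to collect
    # the data files that make up the newest table version.
    files = set()
    for commit in sorted(Path(table_path, "_delta_log").glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:  # file added in this commit
                files.add(action["add"]["path"])
            elif "remove" in action:  # file removed again in a later commit
                files.discard(action["remove"]["path"])
    return files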

Local file system

from deltalake import DeltaReader

# native file path
table_path = "somepath/mytable"
# Get table as pyarrow table
df = DeltaReader(table_path).to_pyarrow()
# Get table as pandas dataframe
df = DeltaReader(table_path).to_pandas()


# file url
table_path = "file://somepath/mytable"
df = DeltaReader(table_path).to_pandas()

Azure

The Azure integration is based on the Azure Python SDK. The credential used to authenticate against the storage account can be a SAS token, an access key, or one of the azure.identity credential classes.
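For example, with the azure-identity package installed, a token credential can be created like this (a minimal sketch; any of the azure.identity classes should work the same way):

from azure.identity import DefaultAzureCredential

# Resolves credentials from the environment, a managed identity,
# the Azure CLI login, etc.
credential = DefaultAzureCredential()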

The input path can use either the https or the abfss protocol (abfss URLs are converted to https under the hood). Note that the current implementation doesn't support the dfs.core.windows.net API, but you should simply be able to replace dfs with blob in the URL (see the sketch after the examples below).

from deltalake import DeltaReader

credential = "..."  # SAS token, access key, or an azure.identity class

# abfss
table_url = "abfss://mycontainer@mystorage.blob.core.windows.net/mytable"
df = DeltaReader(table_url, credential).to_pandas()

# https
table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
df = DeltaReader(table_url, credential).to_pandas()
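If your table URL points at the dfs endpoint (the Data Lake Storage Gen2 API), the swap mentioned above is just a string replacement. A minimal sketch, with a hypothetical dfs-style URL:

from deltalake import DeltaReader

credential = "..."

# Hypothetical dfs-style URL, rewritten to the blob API endpoint.
dfs_url = "https://mystorage.dfs.core.windows.net/mycontainer/mytable"
table_url = dfs_url.replace(".dfs.core.windows.net", ".blob.core.windows.net")
df = DeltaReader(table_url, credential).to_pandas()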

Time travel

One of the features of the Delta format is the ability to do time travel.

This can be done using the as_version method. Note that this currently only supports a specific version number, not a timestamp.

from deltalake import DeltaReader

table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
credential = "..."
df = DeltaReader(table_url, credential).as_version(5).to_pandas()
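To see the effect, you can read two versions of the same table side by side. A minimal sketch, assuming the table actually has a version 0:

from deltalake import DeltaReader

table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
credential = "..."

# Compare the oldest version against the newest one.
old = DeltaReader(table_url, credential).as_version(0).to_pandas()
latest = DeltaReader(table_url, credential).to_pandas()
print(f"rows at version 0: {len(old)}, rows at latest: {len(latest)}")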

Disclaimer

Databricks recently announced a standalone reader for Delta tables in a blog post. The Python bindings mentioned there, however, require you to install the Rust library, which might sound scary to a Python developer.

Read more

Delta transaction log
