Lightweight wrapper for reading Delta tables without Spark

Delta Lake Reader

The Delta format, developed by Databricks, is often used to build data lakes.

While it tries to solve many issues with data lakes, it has one downside: reading a Delta table normally requires Spark. If you only need to read a small table, that adds a lot of unnecessary overhead.

This package tries to fix that by providing a lightweight Python wrapper around the Delta file format.

Usage

The package currently supports only the local file system and Azure Blob Storage, but it should be easy to extend to AWS and GCP in the future. The main entry point is the DeltaReader class, which tries to derive the underlying file system from the input URL.

When the class is instantiated, it parses the transaction log files to find the data files that make up the newest table version. It does not read any data, however, until you call the to_pyarrow or to_pandas method.
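To illustrate what parsing the transaction log involves, here is a minimal sketch of a log replay. It is not this package's implementation: the function name active_files and the demo file names are hypothetical, and the sketch assumes a table whose _delta_log folder contains only JSON commit files (checkpoints are ignored). Each commit file holds one JSON action per line, and the live files are those that were added and not later removed.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits in _delta_log and return the live data files.

    Hypothetical helper for illustration; checkpoints are not handled.
    """
    log_dir = Path(table_path) / "_delta_log"
    files = set()
    # Commit files are named with zero-padded version numbers, so sorting
    # by name replays them in commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


# Build a tiny two-commit log in a temporary directory to try it out.
demo_table = Path(tempfile.mkdtemp()) / "mytable"
(demo_table / "_delta_log").mkdir(parents=True)
(demo_table / "_delta_log" / f"{0:020d}.json").write_text(
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}\n'
)
(demo_table / "_delta_log" / f"{1:020d}.json").write_text(
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}\n'
)
print(active_files(demo_table))
```

Only the small log files are touched here, which is why resolving the file list is cheap compared with reading the Parquet data itself.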

Local file system

from deltalake import DeltaReader

# native file path
table_path = "somepath/mytable"
# Get the table as a pyarrow Table
table = DeltaReader(table_path).to_pyarrow()
# Get the table as a pandas DataFrame
df = DeltaReader(table_path).to_pandas()


# file url
table_path = "file://somepath/mytable"
df = DeltaReader(table_path).to_pandas()

Azure

The Azure integration is based on the Azure Python SDK. The credential used to authenticate against the storage account can be a SAS token, an access key, or one of the azure.identity classes.

The input path can use either the https or the abfss protocol (abfss is converted to https under the hood). Note that the current implementation doesn't support the dfs.core.windows.net API, but you should be able to simply replace dfs with blob in the URL.

from deltalake import DeltaReader

credential = "..."  # SAS token, access key, or an azure.identity class

# abfss
table_url = "abfss://mycontainer@mystorage.blob.core.windows.net/mytable"
df = DeltaReader(table_url, credential).to_pandas()

# https
table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
df = DeltaReader(table_url, credential).to_pandas()
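The abfss-to-https conversion and the dfs-to-blob replacement described above could look roughly like the sketch below. This is an assumption about the URL mapping, not the package's own code, and the function name to_blob_https is hypothetical. It uses only the standard library.

```python
from urllib.parse import urlsplit


def to_blob_https(url):
    """Normalise an abfss:// or https:// table URL to the https blob form.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme == "abfss":
        # abfss://<container>@<account host>/<path>
        container, _, host = parts.netloc.partition("@")
        path = f"/{container}{parts.path}"
    elif parts.scheme == "https":
        host, path = parts.netloc, parts.path
    else:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # Swap the unsupported dfs endpoint for the blob endpoint.
    host = host.replace(".dfs.core.windows.net", ".blob.core.windows.net")
    return f"https://{host}{path}"


print(to_blob_https("abfss://mycontainer@mystorage.dfs.core.windows.net/mytable"))
```

A helper like this lets you pass URLs copied from a Spark job, which typically use the dfs endpoint, without editing them by hand.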

Time travel

One of the features of the Delta format is the ability to time travel, that is, to read an earlier version of a table.

This can be done using the as_version method. Note that it currently supports only a specific version number, not a timestamp.

from deltalake import DeltaReader

table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
credential = "..."
df = DeltaReader(table_url, credential).as_version(5).to_pandas()
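Conceptually, reading a specific version means replaying the transaction log only up to that commit. The sketch below shows the idea under the assumption that the log contains only JSON commit files named by zero-padded version number. It is not how as_version is implemented in this package, and files_at_version and the demo file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def files_at_version(table_path, version):
    """Return the data files that were live at the given table version.

    Hypothetical helper for illustration; checkpoints are not handled.
    """
    log_dir = Path(table_path) / "_delta_log"
    files = set()
    for commit in sorted(log_dir.glob("*.json")):
        # The file stem is the zero-padded version number of the commit.
        if int(commit.stem) > version:
            break
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


# Three commits: add a file, add another, then remove the first one.
demo = Path(tempfile.mkdtemp()) / "mytable"
(demo / "_delta_log").mkdir(parents=True)
commits = [
    '{"add": {"path": "a.parquet"}}\n',
    '{"add": {"path": "b.parquet"}}\n',
    '{"remove": {"path": "a.parquet"}}\n',
]
for number, body in enumerate(commits):
    (demo / "_delta_log" / f"{number:020d}.json").write_text(body)

print(files_at_version(demo, 1))
print(files_at_version(demo, 2))
```

Timestamp-based time travel would additionally need the commit time of each version, which this sketch does not attempt.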

Disclaimer

Databricks recently announced a standalone reader for Delta tables in a blog post. The Python bindings mentioned there, however, require you to install the Rust library, which might sound scary to a Python developer.

Read more

Delta transaction log
