Delta Lake Reader

The Delta format, developed by Databricks, is often used to build data lakes.

While it tries to solve many issues with data lakes, one downside is that Delta tables rely on Spark to read the data. If you only need to read a small table, this can introduce a lot of unnecessary overhead.

This package tries to fix this by providing a lightweight Python wrapper around the Delta file format.

Usage

The package currently supports only the local file system and Azure Blob Storage, but it should be easy to extend to AWS and GCP in the future. The main entry point is the DeltaReader class, which tries to derive the underlying file system from the input URL.
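
As a rough illustration of deriving the file system from the input URL, the sketch below dispatches on the URL scheme. The helper name guess_file_system and the labels it returns are hypothetical and are not part of the package; the actual logic inside DeltaReader may differ.

```python
from urllib.parse import urlsplit


def guess_file_system(table_url: str) -> str:
    """Return a label for the storage backend implied by the URL.

    Hypothetical helper for illustration only; not the package's API.
    """
    scheme = urlsplit(table_url).scheme.lower()
    # A plain path has no scheme; file:// URLs are also local.
    if scheme in ("", "file"):
        return "local"
    # Both Azure forms shown in this README map to the same backend.
    if scheme in ("https", "abfss"):
        return "azure"
    raise ValueError(f"Unsupported URL scheme: {scheme!r}")


print(guess_file_system("somepath/mytable"))
print(guess_file_system("abfss://mycontainer@mystorage.blob.core.windows.net/mytable"))
```

Note that this simple check would misread a Windows path such as C:\data\mytable as having the scheme "c", so a real implementation would need an extra case for drive letters.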

When the class is instantiated, it parses the transaction log files to find the files that make up the newest table version. It does not, however, read any data until you call the to_pyarrow or to_pandas method.
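
To make the log-parsing step more concrete, here is a minimal sketch of replaying a local Delta transaction log with only the standard library. It assumes the usual layout of a _delta_log directory containing newline-delimited JSON commit files named after their zero-padded version number. The function name active_files is hypothetical, and the sketch deliberately ignores checkpoint files, so it is not how the package itself is implemented.

```python
import json
from pathlib import Path
from typing import List


def active_files(table_path: str) -> List[str]:
    """Replay add/remove actions to find the data files of the newest version.

    Hypothetical helper for illustration; checkpoints are not handled.
    """
    log_dir = Path(table_path) / "_delta_log"
    # Keep only commit files such as 00000000000000000001.json.
    commits = [p for p in log_dir.glob("*.json") if p.stem.isdigit()]
    files = set()
    # Zero-padded names mean that sorting by name sorts by version.
    for commit in sorted(commits):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
            # Other actions (commitInfo, metaData, protocol) are skipped.
    return sorted(files)
```

The returned paths are relative to the table root. Only at this point would a reader open the Parquet files, which matches the lazy behaviour described above.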

Local file system

from deltalake import DeltaReader

# native file path
table_path = "somepath/mytable"
# Get table as pyarrow table
df = DeltaReader(table_path).to_pyarrow()
# Get table as pandas dataframe
df = DeltaReader(table_path).to_pandas()


# file url
table_path = "file://somepath/mytable"
df = DeltaReader(table_path).to_pandas()

Azure

The Azure integration is based on the Azure Python SDK. The credential used to authenticate against the storage account can be a SAS token, an access key, or one of the azure.identity classes (read more).

The input path can use either the https or the abfss protocol (abfss is converted to https under the hood). Note that the current implementation doesn't support the dfs.core.windows.net API, but you should be able to simply replace dfs with blob in the URL.

from deltalake import DeltaReader

credential = "..."  # SAS token, access key, or an azure.identity class

# abfss
table_url = "abfss://mycontainer@mystorage.blob.core.windows.net/mytable"
df = DeltaReader(table_url, credential).to_pandas()

# https
table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
df = DeltaReader(table_url, credential).to_pandas()
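
The abfss-to-https conversion mentioned above could look roughly like the sketch below. It relies only on the URL shapes shown in this README (the container before the @ in abfss URLs, and as the first path segment in https URLs). The function name to_blob_https is hypothetical, and the dfs-to-blob swap simply applies the replacement suggested above; the package's own conversion may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def to_blob_https(table_url: str) -> str:
    """Normalise an abfss or https table URL to an https blob endpoint URL.

    Hypothetical helper for illustration only; not the package's API.
    """
    parts = urlsplit(table_url)
    # Swap the dfs endpoint for the blob endpoint, as the README suggests.
    host = (parts.hostname or "").replace(
        ".dfs.core.windows.net", ".blob.core.windows.net"
    )
    if parts.scheme == "https":
        return urlunsplit(("https", host, parts.path, parts.query, ""))
    if parts.scheme == "abfss":
        # In abfss URLs the container sits where a user name normally would.
        container = parts.username
        if not container:
            raise ValueError("abfss URL is missing the container name")
        path = f"/{container}{parts.path}"
        return urlunsplit(("https", host, path, parts.query, ""))
    raise ValueError(f"Unsupported URL scheme: {parts.scheme!r}")


print(to_blob_https("abfss://mycontainer@mystorage.blob.core.windows.net/mytable"))
```

With this shape, both example URLs above resolve to the same https address, so the rest of a reader only has to deal with one form.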

Time travel

One of the features of the Delta format is the ability to time travel.

This can be done using the as_version method. Note that this currently supports only a specific version number, not a timestamp.

from deltalake import DeltaReader

table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
credential = "..."
df = DeltaReader(table_url, credential).as_version(5).to_pandas()
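
Since time travel needs a version number, it can help to see which versions a table's log still contains. The sketch below does this for a table on the local file system by reading the commit file names in _delta_log, which are the version number zero-padded to 20 digits. The helper list_versions is hypothetical and is not provided by the package.

```python
import re
from pathlib import Path
from typing import List

# Commit files look like 00000000000000000005.json (20 digits).
_COMMIT_NAME = re.compile(r"^(\d{20})\.json$")


def list_versions(table_path: str) -> List[int]:
    """Return the table versions that have a commit file in the log.

    Hypothetical helper for illustration; works on local paths only.
    """
    log_dir = Path(table_path) / "_delta_log"
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_NAME.match(entry.name)
        # Checkpoints and other bookkeeping files do not match and are skipped.
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)
```

Any number in the returned list should be a reasonable candidate to pass to as_version.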

Disclaimer

Databricks recently announced a standalone reader for Delta tables in a blog post. The Python bindings mentioned there, however, require you to install the Rust library, which might sound scary to a Python developer.

Read more

Delta transaction log
