Lightweight wrapper for reading Delta tables without Spark
Delta Lake Reader
The Delta format, developed by Databricks, is often used to build data lakes.
While it solves many issues with data lakes, one downside is that reading Delta tables normally requires Spark. If you only need to read a small table, this introduces a lot of unnecessary overhead.
This package tries to fix this by providing a lightweight Python wrapper around the Delta file format.
Usage
The package currently only supports the local file system and Azure Blob Storage, but it should be easy to extend to AWS and GCP in the future.
The main entry point is the DeltaReader class. It tries to derive the underlying file system from the input URL.
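As a rough illustration of deriving a file system from a URL, here is a minimal sketch. The `detect_backend` function and the `SCHEME_TO_BACKEND` mapping are hypothetical names made up for this example; they are not part of the package, and the real detection logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to a backend name (an assumption,
# not taken from the package's source).
SCHEME_TO_BACKEND = {
    "": "local",       # plain paths such as "somepath/mytable"
    "file": "local",   # file:// URLs
    "https": "azure",  # https://<account>.blob.core.windows.net/...
    "abfss": "azure",  # abfss://<container>@<account>...
}


def detect_backend(path: str) -> str:
    """Return a backend name based on the scheme of the given path or URL."""
    scheme = urlparse(path).scheme.lower()
    if len(scheme) == 1:
        # A single letter is most likely a Windows drive, so treat it as local.
        return "local"
    try:
        return SCHEME_TO_BACKEND[scheme]
    except KeyError:
        raise ValueError(f"Unsupported scheme: {scheme!r}") from None
```

Under these assumptions, both `somepath/mytable` and `file://somepath/mytable` resolve to the local backend, while `abfss://` and `https://` URLs resolve to Azure.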
When the class is instantiated, it parses the transaction log files to find the data files that make up the newest table version. It will not, however, read any data until you call the to_pyarrow or to_pandas functions.
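To make the log parsing step more concrete, the sketch below replays the JSON commit files in a table's `_delta_log` directory and collects the data files that are still part of the table. It illustrates the general idea behind the Delta transaction log, not this package's actual implementation: the `active_files` function is a hypothetical name, it only works on a local path, and it ignores Parquet checkpoint files.

```python
import json
from pathlib import Path
from typing import List, Set


def active_files(table_path: str) -> List[str]:
    """Replay the JSON commits and return the data files in the latest version."""
    log_dir = Path(table_path) / "_delta_log"
    files: Set[str] = set()
    # Commit files are named by zero-padded version number, so sorting by
    # name replays them in version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
            # Other actions (metaData, protocol, commitInfo) are skipped here.
    return sorted(files)
```

Only file paths are collected at this stage; no Parquet data is read, which matches the lazy behaviour described above.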
Local file system
from deltalake import DeltaReader
# native file path
table_path = "somepath/mytable"
# Get table as pyarrow table
df = DeltaReader(table_path).to_pyarrow()
# Get table as pandas dataframe
df = DeltaReader(table_path).to_pandas()
# file url
table_path = "file://somepath/mytable"
df = DeltaReader(table_path).to_pandas()
Azure
The Azure integration is based on the Azure Python SDK. The credential used to authenticate against the storage account can be either a SAS token, an access key, or one of the azure.identity classes (see the azure.identity documentation).
The input path can use either the https or the abfss protocol (abfss URLs are converted to https under the hood). Note that the current implementation doesn't support the dfs.core.windows.net API, but you should be able to simply replace dfs with blob in the URL.
from deltalake import DeltaReader
credential = "..."  # SAS token, access key or an azure.identity class
# abfss
table_url = "abfss://mycontainer@mystorage.blob.core.windows.net/mytable"
df = DeltaReader(table_url, credential).to_pandas()
# https
table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
df = DeltaReader(table_url, credential).to_pandas()
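The two URL forms above point at the same table. The following sketch shows one way an abfss URL could be rewritten to its https equivalent, including the dfs-to-blob replacement mentioned earlier. The `to_blob_https` helper is a hypothetical name used for illustration only; the package's own conversion may work differently.

```python
from urllib.parse import urlparse


def to_blob_https(url: str) -> str:
    """Rewrite an abfss or https storage URL to an https blob endpoint URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Assumption: swapping the dfs endpoint for the blob endpoint is enough.
    host = host.replace(".dfs.core.windows.net", ".blob.core.windows.net")
    if parsed.scheme == "abfss":
        # abfss://<container>@<account-host>/<path>
        container = parsed.username
        if not container:
            raise ValueError("abfss URL is missing the container name")
        return f"https://{host}/{container}{parsed.path}"
    if parsed.scheme == "https":
        # https://<account-host>/<container>/<path>
        return f"https://{host}{parsed.path}"
    raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
```

With this helper, the abfss example URL above maps to the https example URL.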
Time travel
One of the features of the Delta format is the ability to do time travel.
This can be done using the as_version method. Note that this currently only supports a specific version number, not a timestamp.
from deltalake import DeltaReader
table_url = "https://mystorage.blob.core.windows.net/mycontainer/mytable"
credential = "..."
df = DeltaReader(table_url, credential).as_version(5).to_pandas()
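Conceptually, reading an older version means replaying the transaction log only up to that version. The sketch below selects the commit files needed for a given version, relying on the Delta convention that each commit file is named after its zero-padded version number. The `commits_for_version` function is a hypothetical helper, not the package's API, and it ignores checkpoint files.

```python
from pathlib import Path
from typing import List


def commits_for_version(log_dir: str, version: int) -> List[Path]:
    """Return the JSON commit files needed to rebuild the given table version."""
    commits = []
    for path in sorted(Path(log_dir).glob("*.json")):
        # File stems look like "00000000000000000005" for version 5.
        if path.stem.isdigit() and int(path.stem) <= version:
            commits.append(path)
    if not commits or int(commits[-1].stem) != version:
        raise ValueError(f"Version {version} not found in {log_dir}")
    return commits
```

Replaying only these files, in order, yields the set of data files that made up the table at that version.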
Disclaimer
Databricks recently announced a standalone reader for Delta tables in a blog post. The Python bindings mentioned there, however, require you to install the Rust library, which might sound scary for a Python developer.