Delta Sharing Python client
General Use Case for Spark/Databricks
Spark is used for big data workloads that need a cluster of compute nodes and cheap storage (delta lake). The results of processing this data need to end up in HANA tables that can be consumed by DataSphere. To share the results with other applications you need either
- a system-to-system integration, where the credentials of the target application are shared, or
- a "data product"-style integration, where you expose the data to external users through an external access-management layer such as "delta sharing".
The following client was developed to use the API of the Databricks delta-sharing-server of the Unity Catalog.
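For reference, a client authenticates against such a server with a profile file as defined by the Delta Sharing protocol. A minimal sketch (endpoint and token are placeholders, not values from this project):

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>"
}
```

The path to this file is what the client's `profile` argument expects.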
The data can be downloaded to a csv-file with the name <share><schema><table>.csv and uploaded to a HANA database. Because the standard Spark catalog is based on the delta lake format, there is no feature for defining primary keys. The reason might be that there are no inherent checks for a "primary key violation" when a new record is written; a "primary key" might therefore lead to false assumptions.
To support primary keys and DB-specific data types, a csn-file can be used when a table is created in HANA. The csn-file needs to have the same name as the table file but with the ".csn" extension.
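As an illustration only, a minimal csn-file in SAP's Core Schema Notation might look like the following; the entity name, columns, and types are hypothetical, not taken from this project:

```json
{
  "definitions": {
    "MYSCHEMA.MYTABLE": {
      "kind": "entity",
      "elements": {
        "ID": { "type": "cds.Integer", "key": true },
        "NAME": { "type": "cds.String", "length": 100 }
      }
    }
  }
}
```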
Used files:
- filename of table: <share><schema><table>.csv
- metadata of table: <share><schema><table>_meta.json
- delta records of tables when CDF-enabled: <share><schema><table>_delta.csv
- CSN table definition: <share><schema><table>_delta.csn
- last downloaded version: <share><schema><table>_version.csn
- last uploaded HANA version: <share><schema><table>_hana.csn

All files are stored in the CWD or in the path given in the command options.
CSN File Creation
To support the creation of a csn-file you can use:

pip install pycsn
pycsn -h
pycsn <csv-file> -p [<primary keys>] -n [<table names>]
usage: pycsn [-h] [-o OUTPUT] [-p PRIMARY_KEYS [PRIMARY_KEYS ...]] [-n NAMES [NAMES ...]] [-s] [-b BUFFER] filenames [filenames ...]

Creates csn-file from pandas DataFrame.

positional arguments:
  filenames             Data filenames (csv)

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Overwrite default filename
  -p PRIMARY_KEYS [PRIMARY_KEYS ...], --primary_keys PRIMARY_KEYS [PRIMARY_KEYS ...]
                        Add primary keys (attention: applies to all tables!)
  -n NAMES [NAMES ...], --names NAMES [NAMES ...]
                        Set table names
  -s, --sql             Create table sql
  -b BUFFER, --buffer BUFFER
                        Additional string buffer
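As an example, the following call would create a csn-file for one downloaded table with a composite primary key; the file, column, and table names are placeholders, not values from this project:

```shell
# Create a csn-file from the downloaded csv, marking ID and VALID_FROM
# as primary key columns and setting the target table name.
pycsn myshare_myschema_mytable.csv -p ID VALID_FROM -n MYSCHEMA.MYTABLE
```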
Installation
pip install dsclient
Delta Sharing Commandline Client
Lists and downloads files from Delta Sharing.
usage: dsclient.py [-h] [-p PATH] [-d] [-s STARTING] [-e ENDING] [-m] [-c CONFIG_FILE] [-H] [-S] profile [table]

positional arguments:
  profile               Profile of delta sharing
  table                 (optional) Table URL: <share>.<schema>.<table>. If not given, the table can be selected from the list of available tables.

options:
  -h, --help            show this help message and exit
  -p PATH, --path PATH  Directory to store data.
  -d, --delta           Capture delta feeds
  -s STARTING, --starting STARTING
                        Start version (int, min=1) or timestamp (str, "2019-09-26T07:58:30.996+0200")
  -e ENDING, --ending ENDING
                        End version (int, min=1) or timestamp (str, "2019-09-26T07:58:30.996+0200")
  -m, --meta            Show metadata
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        Config-file for HANA access (yaml with url, user, pwd, port)
  -H, --Hana            Upload to HANA
  -S, --Sync            Sync files with HANA
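Put together, a download-and-upload round trip might look like the following sketch; the profile, table URL, and HANA connection values are placeholders, not real credentials:

```shell
# HANA connection settings as expected by the -c option (yaml with url, user, pwd, port)
cat > hana.yaml <<EOF
url: myhana.example.com
user: DBUSER
pwd: secret
port: 443
EOF

# Download the table to ./data and upload it to HANA
dsclient -p ./data -c hana.yaml -H config.share myshare.myschema.mytable

# Capture change-data-feed records starting from version 2
dsclient -p ./data -d -s 2 config.share myshare.myschema.mytable
```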