Databricks client SDK with command line client for Databricks REST APIs
pyspark-me
Databricks client SDK for Python with command line interface for Databricks REST APIs.
Introduction
The pysparkme package provides a Python SDK for the following Databricks REST APIs:
- dbfs
- workspace
- jobs
- runs
The package also ships with a command line interface, dbr-me, which is useful for automation.
Python Client SDK for Databricks REST APIs
Create Databricks connection
import pysparkme.databricks

# Get a Databricks workspace connection
dbc = pysparkme.databricks.connect(
    bearer_token='dapixyzabcd09rasdf',
    url='https://westeurope.azuredatabricks.net')
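Hard-coding a bearer token in source files makes it easy to leak. The following is a minimal sketch, assuming you reuse the DATABRICKS_URL and DATABRICKS_BEARER_TOKEN variable names that the CLI section of this README documents. The helper `connection_kwargs_from_env` is hypothetical and not part of pysparkme, and this README does not say whether the SDK reads these variables on its own.

```python
import os


def connection_kwargs_from_env(env=None):
    """Build keyword arguments for connect() from environment variables.

    Hypothetical helper, not part of pysparkme.
    """
    env = os.environ if env is None else env
    required = ("DATABRICKS_URL", "DATABRICKS_BEARER_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "url": env["DATABRICKS_URL"],
        "bearer_token": env["DATABRICKS_BEARER_TOKEN"],
    }


# Usage (requires pysparkme to be installed and the variables to be set):
#   import pysparkme.databricks
#   dbc = pysparkme.databricks.connect(**connection_kwargs_from_env())
```

Failing early with the names of the missing variables tends to be easier to debug than an authentication error from the server.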
DBFS
# Get list of items at path /FileStore
dbc.dbfs.ls('/FileStore')
# Check if file or directory exists
dbc.dbfs.exists('/path/to/heaven')
# Make a directory and its parents
dbc.dbfs.mkdirs('/path/to/heaven')
# Delete a directory recursively
dbc.dbfs.rm('/path', recursive=True)
# Read a block of 2048 bytes starting at offset 1024
dbc.dbfs.read('/data/movies.csv', 1024, 2048)
# Download entire file
dbc.dbfs.read_all('/data/movies.csv')
Databricks workspace
# List root workspace directory
dbc.workspace.ls('/')
# Check if workspace item exists
dbc.workspace.exists('/explore')
# Check if workspace item is a directory
dbc.workspace.is_directory('/')
# Export notebook in default (SOURCE) format
dbc.workspace.export('/my_notebook')
# Export notebook in HTML format
dbc.workspace.export('/my_notebook', 'HTML')
Databricks CLI dbr-me
You can call the Databricks CLI using the convenient shell command dbr-me:
$ dbr-me --help
or as a Python module:
$ python -m pysparkme.databricks.cli --help
To connect to the Databricks cluster, you can supply arguments at the command line:
--bearer-token
--url
--cluster-id
Alternatively, you can define environment variables. Command line arguments take precedence.
export DATABRICKS_URL='https://westeurope.azuredatabricks.net/'
export DATABRICKS_BEARER_TOKEN='dapixyz89u9ufsdfd0'
export DATABRICKS_CLUSTER_ID='1234-456778-abc234'
export DATABRICKS_ORG_ID='87287878293983984'
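The precedence rule above (command line first, environment second) can be illustrated with argparse by using the environment value as each option's default. This is a sketch of the general pattern, not the actual dbr-me argument parser; `resolve_settings` and the program name are hypothetical.

```python
import argparse
import os


def resolve_settings(argv, env=None):
    """Resolve connection settings; explicit arguments override the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="dbr-me-sketch")
    # The environment value is only the default, so a flag on the command line wins.
    parser.add_argument("--bearer-token", default=env.get("DATABRICKS_BEARER_TOKEN"))
    parser.add_argument("--url", default=env.get("DATABRICKS_URL"))
    parser.add_argument("--cluster-id", default=env.get("DATABRICKS_CLUSTER_ID"))
    return vars(parser.parse_args(argv))


example_env = {
    "DATABRICKS_URL": "https://env.example.invalid",
    "DATABRICKS_CLUSTER_ID": "from-env",
}
settings = resolve_settings(["--url", "https://cli.example.invalid"], example_env)
```

Here `settings["url"]` comes from the command line, `settings["cluster_id"]` falls back to the environment, and `settings["bearer_token"]` is None because neither source provides it.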
Workspace
####################
# List workspace
# Default path is root - '/'
dbr-me workspace ls
# auto-add leading '/'
dbr-me workspace ls 'Users'
# Space-indented JSON output with the given number of spaces
dbr-me workspace --json-indent 4 ls
# Custom indent string
dbr-me workspace ls --json-indent='>'
#####################
# Export workspace items
# Export everything in source format using defaults: format=SOURCE, path=/
dbr-me workspace export -o ./.dev/export
# Export everything in DBC format
dbr-me workspace export -f DBC -o ./.dev/export.
# When path is folder, export is recursive
dbr-me workspace export -o ./.dev/export-utils 'Utils'
# Export single ITEM
dbr-me workspace export -o ./.dev/GetML 'Utils/Download MovieLens.py'
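Two behaviours in the commands above are easy to reproduce in plain Python: adding a missing leading '/' to a path, and accepting either a number of spaces or a custom string as the JSON indent. The sketch below shows one way to do both. `normalize_path` and `parse_indent` are hypothetical helpers, and how dbr-me implements these options internally is an assumption.

```python
import json


def normalize_path(path="/"):
    """Return the path with exactly one leading slash."""
    return "/" + path.lstrip("/")


def parse_indent(value):
    """Turn a --json-indent style value into what json.dumps expects.

    Digits become a number of spaces; anything else is used as the indent string.
    """
    if value is None:
        return None
    return int(value) if value.isdigit() else value


items = [{"path": normalize_path("Users")}]
print(json.dumps(items, indent=parse_indent("4")))
print(json.dumps(items, indent=parse_indent(">")))
```

Python's json.dumps accepts both an integer and a string for indent, so no extra formatting code is needed for the custom indent case.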
DBFS
List DBFS items
# List items on DBFS
dbr-me dbfs ls --json-indent 3 FileStore/movielens
[
{
"path": "/FileStore/movielens/ml-latest-small",
"is_dir": true,
"file_size": 0,
"is_file": false,
"human_size": "0 B"
}
]
# Download a file and print to STDOUT
dbr-me dbfs get ml-latest-small/movies.csv
# Download recursively entire directory and store locally
dbr-me dbfs get -o ml-local ml-latest-small
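Because the listing is plain JSON, it can be post-processed in a script. The following sketch splits a listing into directories and files using only the fields shown in the sample output above. `split_listing` is a hypothetical helper, and the listing is embedded as a string so the example runs without a workspace.

```python
import json

# Sample output of `dbr-me dbfs ls`, copied from the example above.
listing_json = """
[
  {
    "path": "/FileStore/movielens/ml-latest-small",
    "is_dir": true,
    "file_size": 0,
    "is_file": false,
    "human_size": "0 B"
  }
]
"""


def split_listing(items):
    """Return (directory paths, [(file path, size in bytes), ...])."""
    dirs = [item["path"] for item in items if item["is_dir"]]
    files = [(item["path"], item["file_size"]) for item in items if item["is_file"]]
    return dirs, files


dirs, files = split_listing(json.loads(listing_json))
```

In a shell pipeline, the same function could read the CLI output from standard input instead of an embedded string.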
Runs
Submit a notebook
Implements: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit
$ dbr-me runs submit "Utils/Download MovieLens"
{"run_id": 4}
You can retrieve the run information using runs get:
$ dbr-me runs get 4 -i 3
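A submitted run starts asynchronously, so automation usually has to poll until it finishes. The sketch below shows such a loop. It assumes a callable `get_run(run_id)` that returns the same dictionary structure as the runs get output in this README. That callable, `wait_for_run`, and the fake responses are hypothetical, and the final states listed are those of the Databricks Jobs API life_cycle_state field.

```python
import time

# life_cycle_state values after which a run no longer changes.
FINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(get_run, run_id, poll_seconds=10.0, timeout_seconds=3600.0,
                 sleep=time.sleep):
    """Poll get_run(run_id) until the run reaches a final life cycle state."""
    waited = 0.0
    while True:
        run = get_run(run_id)
        if run["state"]["life_cycle_state"] in FINAL_STATES:
            return run
        if waited >= timeout_seconds:
            raise TimeoutError("run %s did not finish in time" % run_id)
        sleep(poll_seconds)
        waited += poll_seconds


def make_fake_get_run(states):
    """Return a stand-in for a real lookup that replays the given states."""
    remaining = iter(states)

    def get_run(run_id):
        return {"run_id": run_id, "state": {"life_cycle_state": next(remaining)}}

    return get_run


fake_get_run = make_fake_get_run(["PENDING", "RUNNING", "TERMINATED"])
finished = wait_for_run(fake_get_run, 4, poll_seconds=0.0, sleep=lambda s: None)
```

A terminated run is not necessarily a successful one, so callers should still inspect result_state afterwards.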
Get run metadata
Implements: Databricks REST runs/get
$ dbr-me runs get -i 3 6
{
"job_id": 6,
"run_id": 6,
"creator_user_name": "your.name@gmail.com",
"number_in_job": 1,
"original_attempt_run_id": null,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"schedule": null,
"task": {
"notebook_task": {
"notebook_path": "/Utils/Download MovieLens"
}
},
"cluster_spec": {
"existing_cluster_id": "xxxx-yyyyy-zzzzzz"
},
"cluster_instance": {
"cluster_id": "xxxx-yyyyy-zzzzzz",
"spark_context_id": "783487348734873873"
},
"overriding_parameters": null,
"start_time": 1592062497162,
"setup_duration": 0,
"execution_duration": 11000,
"cleanup_duration": 0,
"trigger": null,
"run_name": "pyspark-me-1592062494",
"run_page_url": "https://westeurope.azuredatabricks.net/?o=398348734873487#job/6/run/1",
"run_type": "SUBMIT_RUN"
}
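The timestamps and durations in this output are in milliseconds, which is awkward to read directly. A minimal sketch for summarising a run, using only fields that appear in the output above; `summarize_run` is a hypothetical helper and the dictionary is a trimmed copy of the sample.

```python
from datetime import datetime, timezone


def summarize_run(run):
    """Extract a few readable facts from a runs/get style dictionary."""
    state = run["state"]
    return {
        "run_id": run["run_id"],
        "succeeded": state.get("result_state") == "SUCCESS",
        # start_time is milliseconds since the Unix epoch.
        "started": datetime.fromtimestamp(run["start_time"] / 1000, tz=timezone.utc),
        # Durations are reported in milliseconds.
        "execution_seconds": run["execution_duration"] / 1000,
    }


sample_run = {
    "run_id": 6,
    "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
    "start_time": 1592062497162,
    "execution_duration": 11000,
}
summary = summarize_run(sample_run)
print(summary["started"].isoformat(), summary["execution_seconds"])
```

Using `state.get("result_state")` keeps the helper usable for runs that are still pending, where that field is absent.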
List Runs
Implements: Databricks REST runs/list
$ dbr-me runs ls
To get only the runs for a particular job:
# List runs for the job with job-id=4
$ dbr-me runs ls 4 -i 3
{
"runs": [
{
"job_id": 4,
"run_id": 4,
"creator_user_name": "your.name@gmail.com",
"number_in_job": 1,
"original_attempt_run_id": null,
"state": {
"life_cycle_state": "PENDING",
"state_message": ""
},
"schedule": null,
"task": {
"notebook_task": {
"notebook_path": "/Utils/Download MovieLens"
}
},
"cluster_spec": {
"existing_cluster_id": "xxxxx-yyyy-zzzzzzz"
},
"cluster_instance": {
"cluster_id": "xxxxx-yyyy-zzzzzzz"
},
"overriding_parameters": null,
"start_time": 1592058826123,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": null,
"run_name": "pyspark-me-1592058823",
"run_page_url": "https://westeurope.azuredatabricks.net/?o=abcdefghasdf#job/4/run/1",
"run_type": "SUBMIT_RUN"
}
],
"has_more": false
}
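The has_more field signals that further pages of runs are available. The sketch below shows how a paginated listing could be consumed. It assumes a callable `list_runs(offset, limit)` that returns the structure shown above; the offset and limit parameters mirror the Databricks REST runs/list endpoint, and whether dbr-me runs ls exposes them is not stated in this README. `iter_runs` and `fake_list_runs` are hypothetical.

```python
def iter_runs(list_runs, page_size=25):
    """Yield every run by following has_more across pages."""
    offset = 0
    while True:
        page = list_runs(offset=offset, limit=page_size)
        runs = page.get("runs", [])
        yield from runs
        if not page.get("has_more") or not runs:
            break
        offset += len(runs)


# In-memory stand-in for the REST call so the sketch runs offline.
_ALL_RUNS = [{"run_id": n, "job_id": 4} for n in range(1, 6)]


def fake_list_runs(offset, limit):
    chunk = _ALL_RUNS[offset:offset + limit]
    return {"runs": chunk, "has_more": offset + limit < len(_ALL_RUNS)}


run_ids = [run["run_id"] for run in iter_runs(fake_list_runs, page_size=2)]
```

Stopping on an empty page as well as on has_more being false guards against an endless loop if the server reports has_more incorrectly.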
Export run
Implements: Databricks REST runs/export
$ dbr-me runs export --content-only 4 > .dev/run-view.html
Get run output
Implements: Databricks REST runs/get-output
$ dbr-me runs get-output -i 3 6
{
"notebook_output": {
"result": "Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
"truncated": false
},
"error": null,
"metadata": {
"job_id": 5,
"run_id": 5,
"creator_user_name": "your.name@gmail.com",
"number_in_job": 1,
"original_attempt_run_id": null,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"schedule": null,
"task": {
"notebook_task": {
"notebook_path": "/Utils/Download MovieLens"
}
},
"cluster_spec": {
"existing_cluster_id": "xxxx-yyyyy-zzzzzzz"
},
"cluster_instance": {
"cluster_id": "xxxx-yyyyy-zzzzzzz",
"spark_context_id": "8973498743973498"
},
"overriding_parameters": null,
"start_time": 1592062147101,
"setup_duration": 1000,
"execution_duration": 11000,
"cleanup_duration": 0,
"trigger": null,
"run_name": "pyspark-me-1592062135",
"run_page_url": "https://westeurope.azuredatabricks.net/?o=89798374987987#job/5/run/1",
"run_type": "SUBMIT_RUN"
}
}
To get only the exit output:
$ dbr-me runs get-output -r 6
Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv
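When the full JSON output is consumed by a script, the exit value sits under notebook_output.result, alongside the error and truncated fields. A minimal sketch of extracting it, based only on the fields shown in the sample above; `exit_output` is a hypothetical helper, and raising on a non-null error is a design choice of this sketch rather than documented dbr-me behaviour.

```python
def exit_output(payload):
    """Return (result, truncated) from a runs/get-output style dictionary."""
    if payload.get("error"):
        raise RuntimeError(payload["error"])
    notebook_output = payload.get("notebook_output") or {}
    return notebook_output.get("result"), bool(notebook_output.get("truncated"))


sample_output = {
    "notebook_output": {
        "result": "Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
        "truncated": False,
    },
    "error": None,
}
result, truncated = exit_output(sample_output)
```

Returning the truncated flag alongside the result lets the caller decide whether a partial exit value is acceptable.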
Build and publish
python setup.py sdist bdist_wheel
python -m twine upload dist/*
Hashes for pyspark_me-0.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 48fc48abede5548f7e98100a15d65d8210c9280e81c954e6c19efda4e35d9595
MD5 | 7d4559bcfb124d41f1e0a9955d5888e1
BLAKE2b-256 | e1fe5f4d14dc11b6915b19720ccaaadade4e18ced790491a2405f05a087e6e53