pydbr
Databricks client SDK for Python with command line interface for Databricks REST APIs.
Introduction
The pydbr (short for "Python-Databricks") package provides a Python SDK for the Databricks REST API:
- dbfs
- workspace
- jobs
- runs
The package also comes with a CLI, which is especially helpful in automation.
Installation
$ pip install pydbr
Databricks CLI
The Databricks command line client provides a convenient way to interact with a Databricks cluster from the command line. This approach is very popular in automation tasks, such as DevOps pipelines or third-party workflow managers.
You can call the Databricks CLI using the convenient shell command pydbr:
$ pydbr --help
or using the Python module:
$ python -m pydbr.cli --help
To connect to the Databricks cluster, you can supply arguments at the command line:
--bearer-token
--url
--cluster-id
Alternatively, you can define environment variables. Command line arguments take precedence.
export DATABRICKS_URL='https://westeurope.azuredatabricks.net/'
export DATABRICKS_BEARER_TOKEN='dapixyz89u9ufsdfd0'
export DATABRICKS_CLUSTER_ID='1234-456778-abc234'
export DATABRICKS_ORG_ID='87287878293983984'
DBFS
List DBFS items
# List items on DBFS
pydbr dbfs ls --json-indent 3 FileStore/movielens
[
   {
      "path": "/FileStore/movielens/ml-latest-small",
      "is_dir": true,
      "file_size": 0,
      "is_file": false,
      "human_size": "0 B"
   }
]
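Because the output is plain JSON, it is easy to consume from scripts. Here is a minimal sketch that invokes the CLI via the module form shown earlier and parses the result (assumes the credentials are set in the environment):

# Sketch: call the CLI from Python and parse its JSON output.
import json
import subprocess
import sys

out = subprocess.run(
    [sys.executable, '-m', 'pydbr.cli', 'dbfs', 'ls',
     '--json-indent', '3', 'FileStore/movielens'],
    capture_output=True, text=True, check=True)
items = json.loads(out.stdout)
print([item['path'] for item in items])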
Download file from DBFS
# Download a file and print to STDOUT
pydbr dbfs get ml-latest-small/movies.csv
Download directory from DBFS
# Recursively download an entire directory and store it locally
pydbr dbfs get -o ml-local ml-latest-small
Workspace
The Databricks workspace contains notebooks and other items.
List workspace
####################
# List workspace
# Default path is root - '/'
$ pydbr workspace ls
# auto-add leading '/'
$ pydbr workspace ls 'Users'
# Space-indented JSON output with the given number of spaces
$ pydbr workspace --json-indent 4 ls
# Custom indent string
$ pydbr workspace ls --json-indent='>'
Export items from Databricks workspace
#####################
# Export workspace items
# Export everything in source format using defaults: format=SOURCE, path=/
pydbr workspace export -o ./.dev/export
# Export everything in DBC format
pydbr workspace export -f DBC -o ./.dev/export
# When path is folder, export is recursive
pydbr workspace export -o ./.dev/export-utils 'Utils'
# Export a single item
pydbr workspace export -o ./.dev/GetML 'Utils/Download MovieLens.py'
Runs
This command group implements the jobs/runs Databricks REST API.
Submit a notebook
Implements: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit
$ pydbr runs submit "Utils/Download MovieLens"
{"run_id": 4}
You can retrieve the run information using runs get:
$ pydbr runs get 4 -i 3
If you need to pass parameters, use the --parameters or -p option and specify JSON text.
$ pydbr runs submit -p '{"run_tag":"20250103"}' "Utils/Download MovieLens"
You can also refer to parameters in a JSON file:
$ pydbr runs submit -p '@params.json' "Utils/Download MovieLens"
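Here, params.json holds the same JSON you would otherwise pass inline. For example, it could be produced from Python (a trivial sketch):

# Sketch: write the parameters file referenced by -p '@params.json'.
import json

with open("params.json", "w") as f:
    json.dump({"run_tag": "20250103"}, f)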
You can use the parameters in the notebook, and you will also see them in the run metadata:
pydbr runs get-output -i 3 8
{
   "notebook_output": {
      "result": "Downloaded files (tag: 20250103): README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
      "truncated": false
   },
   "error": null,
   "metadata": {
      "job_id": 8,
      "run_id": 8,
      "creator_user_name": "your.name@gmail.com",
      "number_in_job": 1,
      "original_attempt_run_id": null,
      "state": {
         "life_cycle_state": "TERMINATED",
         "result_state": "SUCCESS",
         "state_message": ""
      },
      "schedule": null,
      "task": {
         "notebook_task": {
            "notebook_path": "/Utils/Download MovieLens",
            "base_parameters": {
               "run_tag": "20250103"
            }
         }
      },
      "cluster_spec": {
         "existing_cluster_id": "xxxx-yyyyyy-zzzzzz"
      },
      "cluster_instance": {
         "cluster_id": "xxxx-yyyyyy-zzzzzzzz",
         "spark_context_id": "8734983498349834"
      },
      "overriding_parameters": null,
      "start_time": 1592067357734,
      "setup_duration": 0,
      "execution_duration": 11000,
      "cleanup_duration": 0,
      "trigger": null,
      "run_name": "pydbr-1592067355",
      "run_page_url": "https://westeurope.azuredatabricks.net/?o=89349849834#job/8/run/1",
      "run_type": "SUBMIT_RUN"
   }
}
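On the notebook side, Databricks exposes base parameters through widgets, and the notebook_output.result string above is whatever the notebook passes to dbutils.notebook.exit(). A sketch of notebook code that would produce this output (dbutils is provided by the Databricks runtime, not by pydbr):

# Inside the Databricks notebook (dbutils is injected by the runtime).
run_tag = dbutils.widgets.get("run_tag")  # base parameter from --parameters

files = ["README.txt", "links.csv", "movies.csv", "ratings.csv", "tags.csv"]
# ... download logic elided ...

# This string becomes notebook_output.result in `runs get-output`.
dbutils.notebook.exit(f"Downloaded files (tag: {run_tag}): {', '.join(files)}")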
Get run metadata
Implements: Databricks REST runs/get
$ pydbr runs get -i 3 6
{
   "job_id": 6,
   "run_id": 6,
   "creator_user_name": "your.name@gmail.com",
   "number_in_job": 1,
   "original_attempt_run_id": null,
   "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
   },
   "schedule": null,
   "task": {
      "notebook_task": {
         "notebook_path": "/Utils/Download MovieLens"
      }
   },
   "cluster_spec": {
      "existing_cluster_id": "xxxx-yyyyy-zzzzzz"
   },
   "cluster_instance": {
      "cluster_id": "xxxx-yyyyy-zzzzzz",
      "spark_context_id": "783487348734873873"
   },
   "overriding_parameters": null,
   "start_time": 1592062497162,
   "setup_duration": 0,
   "execution_duration": 11000,
   "cleanup_duration": 0,
   "trigger": null,
   "run_name": "pydbr-1592062494",
   "run_page_url": "https://westeurope.azuredatabricks.net/?o=398348734873487#job/6/run/1",
   "run_type": "SUBMIT_RUN"
}
List Runs
Implements: Databricks REST runs/list
$ pydbr runs ls
To get only the runs for a particular job:
# Get runs for the job with job-id=4
$ pydbr runs ls 4 -i 3
{
   "runs": [
      {
         "job_id": 4,
         "run_id": 4,
         "creator_user_name": "your.name@gmail.com",
         "number_in_job": 1,
         "original_attempt_run_id": null,
         "state": {
            "life_cycle_state": "PENDING",
            "state_message": ""
         },
         "schedule": null,
         "task": {
            "notebook_task": {
               "notebook_path": "/Utils/Download MovieLens"
            }
         },
         "cluster_spec": {
            "existing_cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "cluster_instance": {
            "cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "overriding_parameters": null,
         "start_time": 1592058826123,
         "setup_duration": 0,
         "execution_duration": 0,
         "cleanup_duration": 0,
         "trigger": null,
         "run_name": "pydbr-1592058823",
         "run_page_url": "https://westeurope.azuredatabricks.net/?o=abcdefghasdf#job/4/run/1",
         "run_type": "SUBMIT_RUN"
      }
   ],
   "has_more": false
}
Export run
Implements: Databricks REST runs/export
$ pydbr runs export --content-only 4 > .dev/run-view.html
Get run output
Implements: Databricks REST runs/get-output
$ pydbr runs get-output -i 3 6
{
   "notebook_output": {
      "result": "Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
      "truncated": false
   },
   "error": null,
   "metadata": {
      "job_id": 5,
      "run_id": 5,
      "creator_user_name": "your.name@gmail.com",
      "number_in_job": 1,
      "original_attempt_run_id": null,
      "state": {
         "life_cycle_state": "TERMINATED",
         "result_state": "SUCCESS",
         "state_message": ""
      },
      "schedule": null,
      "task": {
         "notebook_task": {
            "notebook_path": "/Utils/Download MovieLens"
         }
      },
      "cluster_spec": {
         "existing_cluster_id": "xxxx-yyyyy-zzzzzzz"
      },
      "cluster_instance": {
         "cluster_id": "xxxx-yyyyy-zzzzzzz",
         "spark_context_id": "8973498743973498"
      },
      "overriding_parameters": null,
      "start_time": 1592062147101,
      "setup_duration": 1000,
      "execution_duration": 11000,
      "cleanup_duration": 0,
      "trigger": null,
      "run_name": "pydbr-1592062135",
      "run_page_url": "https://westeurope.azuredatabricks.net/?o=89798374987987#job/5/run/1",
      "run_type": "SUBMIT_RUN"
   }
}
To get only the exit output:
$ pydbr runs get-output -r 6
Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv
Python Client SDK for Databricks REST APIs
To implement your own Databricks REST API client, you can use the Python SDK directly.
Create Databricks connection
# Get a Databricks workspace connection
import pydbr

dbc = pydbr.connect(
    bearer_token='dapixyzabcd09rasdf',
    url='https://westeurope.azuredatabricks.net')
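If you keep credentials in the DATABRICKS_* environment variables from the CLI section, you can reuse them for the SDK. This sketch passes them explicitly rather than assuming connect() reads the environment on its own:

# Sketch: build the connection from the DATABRICKS_* environment variables.
import os

dbc = pydbr.connect(
    bearer_token=os.environ['DATABRICKS_BEARER_TOKEN'],
    url=os.environ['DATABRICKS_URL'])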
DBFS
# Get list of items at path /FileStore
dbc.dbfs.ls('/FileStore')
# Check if file or directory exists
dbc.dbfs.exists('/path/to/heaven')
# Make a directory and its parents
dbc.dbfs.mkdirs('/path/to/heaven')
# Delete a directory recursively
dbc.dbfs.rm('/path', recursive=True)
# Download a file block starting at offset 1024 with size 2048
dbc.dbfs.read('/data/movies.csv', 1024, 2048)
# Download entire file
dbc.dbfs.read_all('/data/movies.csv')
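Putting a few of these calls together, here is a minimal sketch that copies a DBFS file to local disk; it assumes read_all() returns the file contents as bytes:

# Sketch: download a DBFS file to the local working directory.
# Assumption: read_all() returns the raw file contents as bytes.
data = dbc.dbfs.read_all('/FileStore/movielens/ml-latest-small/movies.csv')
with open('movies.csv', 'wb') as f:
    f.write(data)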
Databricks workspace
# List root workspace directory
dbc.workspace.ls('/')
# Check if workspace item exists
dbc.workspace.exists('/explore')
# Check if workspace item is a directory
dbc.workspace.is_directory('/')
# Export notebook in default (SOURCE) format
dbc.workspace.export('/my_notebook')
# Export notebook in HTML format
dbc.workspace.export('/my_notebook', 'HTML')
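Combined, these calls make a recursive export straightforward. The sketch below assumes ls() yields items carrying a path field (as in the REST workspace/list response) and that export() returns the notebook content as bytes:

import os

def export_all(dbc, path='/', out_dir='./export'):
    # Walk the workspace tree and export every notebook (sketch).
    for item in dbc.workspace.ls(path):
        # Assumption: each item exposes its workspace path.
        item_path = item['path'] if isinstance(item, dict) else item.path
        if dbc.workspace.is_directory(item_path):
            export_all(dbc, item_path, out_dir)
        else:
            content = dbc.workspace.export(item_path)  # default SOURCE format
            local = os.path.join(out_dir, item_path.lstrip('/'))
            os.makedirs(os.path.dirname(local), exist_ok=True)
            with open(local, 'wb') as f:
                f.write(content)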
Build and publish
pip install wheel twine
python setup.py sdist bdist_wheel
python -m twine upload dist/*