Databricks client SDK with a command-line interface for the Databricks REST APIs

Project description

pyspark-me

Databricks client SDK for Python with a command-line interface for the Databricks REST APIs.


Introduction

The pysparkme package provides a Python SDK for the Databricks REST API, covering:

  • dbfs
  • workspace
  • jobs
  • runs

The package also comes with a command-line interface (CLI) that is handy for automation.

Python Client SDK for Databricks REST APIs

Create Databricks connection

# Get a Databricks workspace connection
import pysparkme.databricks

dbc = pysparkme.databricks.connect(
        bearer_token='dapixyzabcd09rasdf',
        url='https://westeurope.azuredatabricks.net')
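
Instead of hard-coding the token, you can pass in the same environment variables that the CLI uses (described below). A minimal sketch, assuming DATABRICKS_BEARER_TOKEN and DATABRICKS_URL are already exported; the SDK is not documented here to read them automatically, so they are passed explicitly:

# Build the connection from environment variables
import os
import pysparkme.databricks

dbc = pysparkme.databricks.connect(
        bearer_token=os.environ['DATABRICKS_BEARER_TOKEN'],
        url=os.environ.get('DATABRICKS_URL', 'https://westeurope.azuredatabricks.net'))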

DBFS

# Get list of items at path /FileStore
dbc.dbfs.ls('/FileStore')

# Check if file or directory exists
dbc.dbfs.exists('/path/to/heaven')

# Make a directory and its parents
dbc.dbfs.mkdirs('/path/to/heaven')

# Delete a directory recursively
dbc.dbfs.rm('/path', recursive=True)

# Read a 2048-byte block starting at offset 1024
dbc.dbfs.read('/data/movies.csv', 1024, 2048)

# Download entire file
dbc.dbfs.read_all('/data/movies.csv')
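
read_all combines with ordinary file I/O to copy a DBFS file to the local machine. A minimal sketch, assuming read_all returns the raw file contents as bytes:

# Copy a DBFS file to the local filesystem
data = dbc.dbfs.read_all('/data/movies.csv')  # assumed to return bytes
with open('movies.csv', 'wb') as local_file:
    local_file.write(data)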

Databricks workspace

# List root workspace directory
dbc.workspace.ls('/')

# Check if workspace item exists
dbc.workspace.exists('/explore')

# Check if workspace item is a directory
dbc.workspace.is_directory('/')

# Export notebook in default (SOURCE) format
dbc.workspace.export('/my_notebook')

# Export notebook in HTML format
dbc.workspace.export('/my_notebook', 'HTML')
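
The exported content can be written straight to a local file. A minimal sketch, assuming export returns the notebook content (whether as str or bytes is not documented here, so both are handled):

# Save the SOURCE export of a notebook locally
content = dbc.workspace.export('/my_notebook')
with open('my_notebook.py', 'wb') as local_file:
    local_file.write(content if isinstance(content, bytes) else content.encode('utf-8'))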

Databricks CLI dbr-me

You can call the Databricks CLI using the convenient shell command dbr-me:

$ dbr-me --help

or by running the Python module directly:

$ python -m pysparkme.databricks.cli --help

To connect to the Databricks cluster, you can supply arguments at the command line:

  • --bearer-token
  • --url
  • --cluster-id

Alternatively, you can define environment variables. Command line arguments take precedence.

export DATABRICKS_URL='https://westeurope.azuredatabricks.net/'
export DATABRICKS_BEARER_TOKEN='dapixyz89u9ufsdfd0'
export DATABRICKS_CLUSTER_ID='1234-456778-abc234'
export DATABRICKS_ORG_ID='87287878293983984'

Workspace

####################
# List workspace
# Default path is root - '/'
dbr-me workspace ls
# A leading '/' is added automatically
dbr-me workspace ls 'Users'
# Space-indented JSON output with the given number of spaces
dbr-me workspace --json-indent 4 ls
# Custom indent string
dbr-me workspace ls --json-indent='>'

#####################
# Export workspace items
# Export everything in source format using defaults: format=SOURCE, path=/
dbr-me workspace export -o ./.dev/export
# Export everything in DBC format
dbr-me workspace export -f DBC -o ./.dev/export
# When the path is a folder, the export is recursive
dbr-me workspace export -o ./.dev/export-utils 'Utils'
# Export a single item
dbr-me workspace export -o ./.dev/GetML 'Utils/Download MovieLens.py'

DBFS

List DBFS items

# List items on DBFS
dbr-me dbfs ls --json-indent 3 FileStore/movielens
[
   {
      "path": "/FileStore/movielens/ml-latest-small",
      "is_dir": true,
      "file_size": 0,
      "is_file": false,
      "human_size": "0 B"
   }
]
# Download a file and print it to STDOUT
dbr-me dbfs get ml-latest-small/movies.csv
# Recursively download an entire directory and store it locally
dbr-me dbfs get -o ml-local ml-latest-small

Runs

Submit a notebook

Implements: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runs-submit

$ dbr-me runs submit "Utils/Download MovieLens"
{"run_id": 4}

You can retrieve the run information using runs get:

$ dbr-me runs get 4 -i 3

Get run metadata

Implements: Databricks REST runs/get

$ dbr-me runs get -i 3 6
{
   "job_id": 6,
   "run_id": 6,
   "creator_user_name": "your.name@gmail.com",
   "number_in_job": 1,
   "original_attempt_run_id": null,
   "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
   },
   "schedule": null,
   "task": {
      "notebook_task": {
         "notebook_path": "/Utils/Download MovieLens"
      }
   },
   "cluster_spec": {
      "existing_cluster_id": "xxxx-yyyyy-zzzzzz"
   },
   "cluster_instance": {
      "cluster_id": "xxxx-yyyyy-zzzzzz",
      "spark_context_id": "783487348734873873"
   },
   "overriding_parameters": null,
   "start_time": 1592062497162,
   "setup_duration": 0,
   "execution_duration": 11000,
   "cleanup_duration": 0,
   "trigger": null,
   "run_name": "pyspark-me-1592062494",
   "run_page_url": "https://westeurope.azuredatabricks.net/?o=398348734873487#job/6/run/1",
   "run_type": "SUBMIT_RUN"
}
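
The state object in this output is what you would poll to wait for a submitted run to finish. A minimal sketch that drives the dbr-me CLI from Python for that purpose, assuming dbr-me prints JSON to stdout as in the examples above (the notebook path is just an example):

# Submit a notebook run via the CLI and poll until it terminates
import json
import subprocess
import time

def dbr_me(*args):
    # Invoke dbr-me and parse its JSON output
    result = subprocess.run(['dbr-me', *args], check=True, capture_output=True, text=True)
    return json.loads(result.stdout)

run_id = dbr_me('runs', 'submit', 'Utils/Download MovieLens')['run_id']
while True:
    state = dbr_me('runs', 'get', str(run_id))['state']
    # TERMINATED is the normal terminal state; error states are not handled in this sketch
    if state['life_cycle_state'] == 'TERMINATED':
        print('Result:', state.get('result_state'))
        break
    time.sleep(10)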

List runs

Implements: Databricks REST runs/list

$ dbr-me runs ls

To get only the runs for a particular job:

# List runs for the job with job-id=4
$ dbr-me runs ls 4 -i 3
{
   "runs": [
      {
         "job_id": 4,
         "run_id": 4,
         "creator_user_name": "your.name@gmail.com",
         "number_in_job": 1,
         "original_attempt_run_id": null,
         "state": {
            "life_cycle_state": "PENDING",
            "state_message": ""
         },
         "schedule": null,
         "task": {
            "notebook_task": {
               "notebook_path": "/Utils/Download MovieLens"
            }
         },
         "cluster_spec": {
            "existing_cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "cluster_instance": {
            "cluster_id": "xxxxx-yyyy-zzzzzzz"
         },
         "overriding_parameters": null,
         "start_time": 1592058826123,
         "setup_duration": 0,
         "execution_duration": 0,
         "cleanup_duration": 0,
         "trigger": null,
         "run_name": "pyspark-me-1592058823",
         "run_page_url": "https://westeurope.azuredatabricks.net/?o=abcdefghasdf#job/4/run/1",
         "run_type": "SUBMIT_RUN"
      }
   ],
   "has_more": false
}

Export run

Implements: Databricks REST runs/export

$ dbr-me runs export --content-only 4 > .dev/run-view.html

Get run output

Implements: Databricks REST runs/get-output

$ dbr-me runs get-output -i 3 6
{
   "notebook_output": {
      "result": "Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv",
      "truncated": false
   },
   "error": null,
   "metadata": {
      "job_id": 5,
      "run_id": 5,
      "creator_user_name": "your.name@gmail.com",
      "number_in_job": 1,
      "original_attempt_run_id": null,
      "state": {
         "life_cycle_state": "TERMINATED",
         "result_state": "SUCCESS",
         "state_message": ""
      },
      "schedule": null,
      "task": {
         "notebook_task": {
            "notebook_path": "/Utils/Download MovieLens"
         }
      },
      "cluster_spec": {
         "existing_cluster_id": "xxxx-yyyyy-zzzzzzz"
      },
      "cluster_instance": {
         "cluster_id": "xxxx-yyyyy-zzzzzzz",
         "spark_context_id": "8973498743973498"
      },
      "overriding_parameters": null,
      "start_time": 1592062147101,
      "setup_duration": 1000,
      "execution_duration": 11000,
      "cleanup_duration": 0,
      "trigger": null,
      "run_name": "pyspark-me-1592062135",
      "run_page_url": "https://westeurope.azuredatabricks.net/?o=89798374987987#job/5/run/1",
      "run_type": "SUBMIT_RUN"
   }
}

To get only the exit output:

$ dbr-me runs get-output -r 6
Downloaded files: README.txt, links.csv, movies.csv, ratings.csv, tags.csv

Build and publish

python setup.py sdist bdist_wheel
python -m twine upload dist/*

