
databricks-job-runner


Reusable CLI for uploading, submitting, validating, fetching logs, and cleaning Databricks job runs.

Wraps the Databricks Python SDK into a small library that each project configures with a Runner instance. One Runner gives you nine CLI subcommands — upload, download, submit, validate, logs, clean, catalog, schema, and volume — without writing any Databricks API code in your project.

Installation

uv add databricks-job-runner

Or with pip:

pip install databricks-job-runner

For local development against a checkout:

# pyproject.toml
[tool.uv.sources]
databricks-job-runner = { path = "../databricks-job-runner", editable = true }

Quick start

Create a cli/ package in your project with two files:

cli/__init__.py

from databricks_job_runner import Runner

runner = Runner(
    run_name_prefix="my_project",
    wheel_package="my_package",  # optional
)

cli/__main__.py

from cli import runner
runner.main()

Then run from your project's root (not from the databricks-job-runner repo — this is a library, not a standalone CLI):

uv run python -m cli upload --all          # upload agent_modules/*.py
uv run python -m cli upload test_hello.py  # upload a single file
uv run python -m cli upload --wheel        # build and upload wheel
uv run python -m cli upload --data         # upload data/ to UC volume
uv run python -m cli upload --data exports # upload a specific data subdirectory
uv run python -m cli download results/out.csv        # download a file from the volume
uv run python -m cli download results/out.csv -o /tmp/out.csv  # specify local destination
uv run python -m cli download --list               # list files at the volume root
uv run python -m cli download --list results       # list a subdirectory
uv run python -m cli submit test_hello.py          # submit a job and wait
uv run python -m cli submit test_hello.py --no-wait
uv run python -m cli submit test_hello.py --upload  # upload all scripts then submit
uv run python -m cli submit test_hello.py --compute serverless  # override compute mode
uv run python -m cli validate              # list remote workspace contents
uv run python -m cli validate test_hello.py  # verify a specific file is uploaded
uv run python -m cli logs                  # stdout/stderr from the most recent run
uv run python -m cli logs 12345            # stdout/stderr from a specific run
uv run python -m cli clean --yes           # clean workspace + runs
uv run python -m cli clean --runs --yes    # clean only runs

# Unity Catalog management
uv run python -m cli catalog list
uv run python -m cli catalog get my_catalog              # show storage location
uv run python -m cli catalog create my_catalog --comment "Analytics"
uv run python -m cli catalog create my_catalog --storage-root "abfss://container@account.dfs.core.windows.net/path"
uv run python -m cli catalog delete my_catalog --force --yes

uv run python -m cli schema list my_catalog
uv run python -m cli schema create my_catalog.my_schema
uv run python -m cli schema delete my_catalog.my_schema --yes

uv run python -m cli volume list my_catalog.my_schema
uv run python -m cli volume create my_catalog.my_schema.my_vol
uv run python -m cli volume create my_catalog.my_schema.ext_vol --volume-type EXTERNAL --storage-location s3://bucket/path
uv run python -m cli volume delete my_catalog.my_schema.my_vol --yes

Configuration

The runner reads a .env file from the project root. Core keys (all prefixed with DATABRICKS_ for consistency):

  • DATABRICKS_PROFILE — optional. CLI profile in ~/.databrickscfg. When unset, the SDK's unified auth falls back to env vars (DATABRICKS_HOST/DATABRICKS_TOKEN), Azure CLI, service principals, etc.
  • DATABRICKS_COMPUTE_MODE — optional, default cluster. Either cluster or serverless; selects the compute backend for submitted jobs.
  • DATABRICKS_CLUSTER_ID — required when DATABRICKS_COMPUTE_MODE=cluster. All-purpose cluster to run jobs on. Started automatically if terminated.
  • DATABRICKS_SERVERLESS_ENV_VERSION — optional, default 3. Serverless environment version (e.g. 3 for Python 3.12).
  • DATABRICKS_WORKSPACE_DIR — required. Remote workspace path (e.g. /Users/you/my_project).
  • DATABRICKS_VOLUME_PATH — required when using upload --wheel, upload --data, or download. UC Volume path for wheel/data uploads and downloads.
  • DATABRICKS_SECRET_SCOPE — required when using secret_keys. Databricks secret scope name. Values for keys listed in secret_keys are fetched from this scope at runtime instead of being passed as plaintext parameters.

Precedence: pre-existing environment variables override .env values, matching 12-factor conventions (CI/CD and shell exports can override the file).

Additional non-core keys are captured in RunnerConfig.extras and automatically passed to submitted jobs as KEY=VALUE parameters. Scripts call inject_params() at startup to load them into os.environ, then use pydantic BaseSettings to read configuration.
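For example, a submitted script might consume those parameters like this. This is a minimal sketch: the script name and the NEO4J_URI field are illustrative (taken from the example .env below), pydantic-settings is assumed to be installed on the cluster, and the injection is inlined as described under inject_params further down.

# agent_modules/run_example.py (illustrative)
import os
import sys

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Field names match the KEY=VALUE parameters the runner injects from .env.
    neo4j_uri: str
    databricks_workspace_dir: str


# Inline equivalent of inject_params(): copy KEY=VALUE argv entries into os.environ.
for arg in sys.argv[1:]:
    if "=" in arg and not arg.startswith("-"):
        key, _, value = arg.partition("=")
        os.environ.setdefault(key, value)

settings = Settings()
print(f"Workspace dir: {settings.databricks_workspace_dir}")
print(f"Neo4j URI: {settings.neo4j_uri}")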

Compute modes

  • Classic cluster (DATABRICKS_COMPUTE_MODE=cluster, the default): jobs submit to an existing all-purpose cluster identified by DATABRICKS_CLUSTER_ID. The runner auto-starts the cluster if it is terminated, and attaches wheels via Library(whl=...).
  • Serverless (DATABRICKS_COMPUTE_MODE=serverless): jobs submit to Databricks serverless compute with a job-level environment spec. No cluster ID needed; wheels attach as Environment.dependencies entries (UC Volume paths are supported directly).

Example .env (classic cluster)

DATABRICKS_PROFILE=my-profile
DATABRICKS_CLUSTER_ID=0123-456789-abcdef
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume
NEO4J_URI=neo4j+s://abc123.databases.neo4j.io
NEO4J_PASSWORD=secret

Example .env (serverless)

DATABRICKS_PROFILE=my-profile
DATABRICKS_COMPUTE_MODE=serverless
DATABRICKS_SERVERLESS_ENV_VERSION=3
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume

All DATABRICKS_* keys listed above become typed fields on RunnerConfig; any other keys (like NEO4J_URI above) go into config.extras.

API

Runner

Runner(
    run_name_prefix: str,
    project_dir: Path | str | None = None,
    wheel_package: str | None = None,
    secret_keys: list[str] | None = None,
    scripts_dir: str = "agent_modules",
    extra_files: list[str] | None = None,
    cli_command: str = "uv run python -m cli",
)
  • run_name_prefix — Prefix for job run names and cleanup filtering.
  • project_dir — Project root (defaults to cwd()). Must contain .env and the scripts directory.
  • wheel_package — Package name for wheel builds. Enables upload --wheel. Wheels upload to <DATABRICKS_VOLUME_PATH>/wheels/.
  • secret_keys — .env key names whose values are stored in a Databricks secret scope instead of forwarded as plaintext parameters. Requires DATABRICKS_SECRET_SCOPE in .env.
  • scripts_dir — Name of the local subdirectory containing scripts to upload and submit (default: "agent_modules"). Change it to match your project layout, e.g. "jobs" or "scripts".
  • extra_files — Paths relative to project_dir that are uploaded into the remote scripts directory alongside Python scripts. Use for non-Python assets that job scripts read via Path(__file__).parent on the cluster (e.g. ["sql/gold_schema.sql"]); see the sketch after this list.
  • cli_command — Command string printed in "Next steps" hints after a job is submitted (default: "uv run python -m cli"). Override it when your project uses a custom entry point, e.g. "uv run myproject".
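A job script can then read a co-uploaded asset relative to its own location. A minimal sketch, assuming the gold_schema.sql file from the extra_files example ends up directly alongside the script in the remote scripts directory:

from pathlib import Path

# extra_files land next to the uploaded scripts, so resolve them relative to this file.
schema_sql = (Path(__file__).parent / "gold_schema.sql").read_text()
print(f"Loaded {len(schema_sql)} characters of SQL")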

RunnerConfig

Pydantic model holding parsed .env values. Frozen (immutable) after construction.

  • databricks_profile (str | None) — CLI profile name, or None for unified-auth fallback.
  • databricks_compute_mode (Literal["cluster", "serverless"]) — Compute backend ("cluster" by default).
  • databricks_cluster_id (str | None) — Cluster ID (required when databricks_compute_mode == "cluster").
  • databricks_serverless_env_version (str) — Serverless environment version (default "3").
  • databricks_workspace_dir (str) — Remote workspace root (required).
  • databricks_volume_path (str | None) — UC Volume path for wheel/data uploads and downloads.
  • secret_scope (str | None) — Databricks secret scope name (set via DATABRICKS_SECRET_SCOPE).
  • extras (dict[str, str]) — All non-core keys from .env.

The env_params() method returns DATABRICKS_WORKSPACE_DIR, DATABRICKS_VOLUME_PATH (when set), secret-scope metadata (when secret_keys is configured), and all extras as KEY=VALUE strings suitable for job parameter injection.
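With the classic-cluster example .env above and no secret_keys configured, env_params() would return roughly the following (illustrative output, not an exact reproduction):

[
    "DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project",
    "DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume",
    "NEO4J_URI=neo4j+s://abc123.databases.neo4j.io",
    "NEO4J_PASSWORD=secret",
]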

inject_params

from databricks_job_runner import inject_params
inject_params()

Call at the top of submitted scripts to parse KEY=VALUE parameters from sys.argv into os.environ. This lets scripts use pydantic BaseSettings or os.getenv() to read configuration that the runner injected from .env. Uses setdefault so pre-existing env vars take precedence.

Note: databricks_job_runner is not available on the Databricks cluster. For standalone scripts (not part of a wheel), inline the equivalent logic instead of importing:

import os, sys
for _arg in sys.argv[1:]:
    if "=" in _arg and not _arg.startswith("-"):
        _key, _, _value = _arg.partition("=")
        os.environ.setdefault(_key, _value)

For wheel-based scripts, the wheel's entry point can call inject_params() only if databricks_job_runner is listed as a wheel dependency and installed in the cluster environment. Since submitted scripts run on Databricks where the package is not available by default, the inline approach is preferred.

RunnerError

Raised when a runner operation cannot proceed (missing config, file not found, cluster stopped, job failed). The CLI formats and exits; library callers can catch and handle.
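A library caller might wrap a runner call like this (a sketch only: the submit method name is hypothetical, and the RunnerError import assumes it is re-exported from the package root):

from databricks_job_runner import Runner, RunnerError

runner = Runner(run_name_prefix="my_project")

try:
    runner.submit("test_hello.py")  # hypothetical method name; Runner exposes one method per subcommand
except RunnerError as exc:
    print(f"Runner operation failed: {exc}")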

Project layout

The runner expects this layout in your project:

my_project/
  .env
  agent_modules/
    test_hello.py
    run_lab2.py
    ...
  cli/
    __init__.py    # Runner config
    __main__.py    # entry point

Scripts in agent_modules/ are uploaded to {DATABRICKS_WORKSPACE_DIR}/agent_modules/ on Databricks and submitted as Spark Python tasks.

Subcommands

upload

  • upload <file> — Upload a single file from agent_modules/
  • upload --all — Upload all *.py files from agent_modules/
  • upload --wheel — Build a wheel with uv build and upload to the UC Volume (requires wheel_package and DATABRICKS_VOLUME_PATH)
  • upload --data [DIR] — Upload a local data directory to the UC Volume (default: data/). Requires DATABRICKS_VOLUME_PATH. Preserves subdirectory structure under the volume path.

download

  • download <path> — Download a file from the UC Volume. Path is relative to DATABRICKS_VOLUME_PATH, or absolute (starting with /Volumes).
  • download <path> --dest/-o <local> — Specify local destination path (default: current directory, using the remote filename).
  • download --list [SUBDIR] — List files at the volume root or in an optional subdirectory. Requires DATABRICKS_VOLUME_PATH.

submit

  • submit <script> — Submit a script as a one-time Databricks job and wait for completion. Default: test_hello.py
  • submit <script> --no-wait — Submit without waiting
  • submit <script> --upload — Upload all scripts from agent_modules/ before submitting
  • submit <script> --compute cluster|serverless — Override DATABRICKS_COMPUTE_MODE for this run only

On classic mode, if the target cluster is not already RUNNING, it is started automatically and the submit waits (up to 20 minutes, the SDK default) for it to reach RUNNING. On serverless, no warm-up step is required. When submitting a script whose name starts with run_{wheel_package} (e.g. run_{wheel_package}.py, run_{wheel_package}_schema.py, run_{wheel_package}_sample.py), the runner automatically attaches the wheel — as a Library(whl=...) on classic, or as an Environment.dependencies entry on serverless. Scripts that don't follow this prefix convention (e.g. test_hello.py) are submitted without the wheel.
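For illustration, attaching a wheel on serverless amounts to building a job-level environment spec like the following (a sketch using the Databricks SDK's field names; the wheel filename and volume path are placeholders):

from databricks.sdk.service.compute import Environment
from databricks.sdk.service.jobs import JobEnvironment

env = JobEnvironment(
    environment_key="default",
    spec=Environment(
        client="3",  # DATABRICKS_SERVERLESS_ENV_VERSION
        dependencies=[
            "/Volumes/catalog/schema/volume/wheels/my_package-0.1.0-py3-none-any.whl",
        ],
    ),
)
# The environments=[env] list is then passed to jobs.submit alongside the task.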

validate

  • validate — List the remote workspace directory and its agent_modules/ subdirectory. On classic, auto-starts the cluster if needed; on serverless, this is a no-op.
  • validate <file> — Also verify that {DATABRICKS_WORKSPACE_DIR}/agent_modules/<file> exists; exits non-zero if not.

logs

  • logs — Print stdout/stderr, error, and trace from the most recent run matching {run_name_prefix}:*
  • logs <run_id> — Print output for a specific parent run ID

Output is fetched via the Jobs API's get_run_output, which returns the tail 5 MB of stdout/stderr captured per task (the API caps output size; truncation is signaled in the output). The runner resolves the parent run to its task-level run IDs automatically, so pass the parent run_id shown at submit time. Databricks auto-expires runs after 60 days.
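Conceptually this maps to the following SDK calls (a rough sketch, assuming parent run ID 12345 and default authentication):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Resolve the parent run to its task-level runs, then fetch each task's output.
parent = w.jobs.get_run(run_id=12345)
for task in parent.tasks or []:
    out = w.jobs.get_run_output(run_id=task.run_id)
    if out.logs:
        print(out.logs)  # tail of stdout/stderr captured for the task
    if out.logs_truncated:
        print("[output truncated by the API]")
    if out.error:
        print(out.error, out.error_trace)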

clean

  • clean — Delete the remote workspace directory and all matching job runs
  • clean --workspace — Delete only the workspace directory
  • clean --runs — Delete only job runs
  • clean --yes — Skip confirmation prompt

catalog

Manage Unity Catalog catalogs.

  • catalog list — List all catalogs visible to the current user
  • catalog get <name> — Show details for a catalog, including the storage root and storage location (managed location). Use this to find where a catalog's managed tables are stored.
  • catalog create <name> [--storage-root URL] [--comment TEXT] — Create a new catalog. --storage-root sets the managed storage location (equivalent to MANAGED LOCATION in SQL). Required on metastores that use per-catalog storage roots instead of a metastore-level default.
  • catalog delete <name> [--yes] [--force] — Delete a catalog. --force cascades to all schemas, tables, and volumes inside it. Prompts for confirmation unless --yes is given.
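For reference, catalog create with a storage root maps to roughly this SDK call (a sketch; values are the placeholders from the examples above):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.catalogs.create(
    name="my_catalog",
    comment="Analytics",
    storage_root="abfss://container@account.dfs.core.windows.net/path",
)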

Finding the managed location

To find the managed storage location for an existing catalog:

uv run python -m cli catalog get my_catalog

This prints the storage root (the URL set at creation time) and the storage location (the full resolved path where managed tables are stored). Example output:

Catalog: my_catalog
  Owner:            ryan.knight@example.com
  Storage root:     abfss://container@account.dfs.core.windows.net/path
  Storage location: abfss://container@account.dfs.core.windows.net/path/__unitystorage/catalogs/abc123
  Comment:          Analytics catalog
  Created:          2025-01-15 10:30:00

schema

Manage Unity Catalog schemas. Schema names use dotted notation: catalog.schema.

  • schema list <catalog> — List schemas in a catalog
  • schema get <catalog.schema> — Show details for a schema
  • schema create <catalog.schema> [--comment TEXT] — Create a new schema
  • schema delete <catalog.schema> [--yes] — Delete a schema. Prompts for confirmation unless --yes is given.

volume

Manage Unity Catalog volumes. Volume names use dotted notation: catalog.schema.volume.

  • volume list <catalog.schema> — List volumes in a schema
  • volume get <catalog.schema.volume> — Show details (type, owner, storage location) for a volume
  • volume create <catalog.schema.volume> [--volume-type MANAGED|EXTERNAL] [--storage-location URL] [--comment TEXT] — Create a volume. Defaults to MANAGED. EXTERNAL volumes require --storage-location.
  • volume delete <catalog.schema.volume> [--yes] — Delete a volume. Prompts for confirmation unless --yes is given.
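The external-volume example above corresponds to roughly this SDK call (a sketch):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import VolumeType

w = WorkspaceClient()
w.volumes.create(
    catalog_name="my_catalog",
    schema_name="my_schema",
    name="ext_vol",
    volume_type=VolumeType.EXTERNAL,
    storage_location="s3://bucket/path",
)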

Releasing

Releases are published to PyPI automatically when you push a Git tag. The version in the tag becomes the package version.

git tag v0.4.8 && git push origin v0.4.8

The GitHub Actions workflow strips the v prefix, patches pyproject.toml with the version, builds the wheel and sdist, and publishes to PyPI via trusted publishing.

Requirements

  • Python 3.12+
  • Databricks authentication: either a Databricks CLI profile, or env vars (DATABRICKS_HOST/DATABRICKS_TOKEN), or any other unified-auth method
  • Either a Databricks all-purpose cluster (auto-started if terminated) or serverless compute enabled for the workspace
  • uv (for wheel building only)

Architecture

databricks-job-runner is layered into a thin CLI, an orchestrator, and a set of single-purpose action modules. Runner is the only class a consuming project needs to touch.

cli.py          argparse + dispatch (flags -> Runner method calls)
  |
runner.py       Runner: holds config, owns the WorkspaceClient,
  |             exposes one method per subcommand
  |
  |-- config.py     RunnerConfig (frozen pydantic) + .env parser
  |-- compute.py    ClassicCluster / Serverless strategies (Protocol)
  |-- inject.py     inject_params() for submitted scripts
  |-- upload.py     workspace file + wheel + data upload
  |-- download.py   UC Volume file download and directory listing
  |-- submit.py     compute-agnostic job submission
  |-- validate.py   workspace listing + file-existence checks
  |-- logs.py       per-task stdout/stderr retrieval
  |-- catalog.py    Unity Catalog catalog/schema/volume management
  |-- clean.py      workspace + run cleanup
  |-- errors.py     RunnerError

Layers

  • CLI (cli.py) owns all argparse setup and translates the parsed namespace into method calls on Runner. Formats RunnerError into friendly exit messages. No argparse knowledge lives outside this file.
  • Orchestration (runner.py) exposes the Runner class. RunnerConfig and the WorkspaceClient are built lazily on first access, so importing a project's cli/__init__.py doesn't touch Databricks. Each public method coordinates a single subcommand end-to-end.
  • Action modules (upload.py, download.py, submit.py, validate.py, logs.py, clean.py, catalog.py) are plain functions wrapping SDK calls. None know about argparse or Runner, keeping each unit composable and independently testable.
  • Compute strategies (compute.py) implement the Compute protocol. A strategy knows how to (1) validate that its backend is ready, (2) decorate a SubmitTask with backend-specific fields, and (3) produce the top-level environments[] list for jobs.submit. submit_job is compute-agnostic — swapping backends is a strategy change, not a conditional branch.
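As a rough illustration of that shape (method names and signatures here are invented for the sketch and do not mirror compute.py exactly):

from typing import Protocol

from databricks.sdk.service.jobs import JobEnvironment, SubmitTask


class Compute(Protocol):
    def validate(self) -> None:
        """Raise if the backend is not ready (e.g. the cluster is terminated)."""
        ...

    def decorate_task(self, task: SubmitTask, wheel: str | None) -> SubmitTask:
        """Return the task with backend-specific fields (cluster ID or environment key)."""
        ...

    def environments(self, wheel: str | None) -> list[JobEnvironment]:
        """Top-level environments[] for jobs.submit; empty for classic clusters."""
        ...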

Design choices

  • Strategy pattern for compute. Compute is a typing.Protocol, so adding a new backend is a new frozen dataclass that matches the shape — no changes to submit_job, Runner, or the CLI. ClassicCluster and Serverless are both frozen dataclasses for value-equality and immutability.
  • Single validation point. Required-key enforcement lives entirely in RunnerConfig.from_env_file, branching on DATABRICKS_COMPUTE_MODE (only DATABRICKS_CLUSTER_ID is required when mode is cluster). Downstream code trusts the config is valid.
  • Automatic parameter injection. All non-core .env keys are passed to submitted jobs as KEY=VALUE parameters via RunnerConfig.env_params(). Scripts call inject_params() at startup to load them into os.environ, then use pydantic BaseSettings to read configuration. No callback or per-project wiring needed.
  • Wheel convention. A submitted script whose name starts with run_{wheel_package} auto-attaches the latest wheel from dist/ — as Library(whl=...) on classic, or an Environment.dependencies entry on serverless. This covers both a single entry-point script (run_my_package.py) and per-phase scripts (run_my_package_schema.py, run_my_package_sample.py). Scripts that don't match the prefix (e.g. test_hello.py) are submitted without the wheel.
  • 12-factor .env. Pre-existing env vars override .env values, so CI/CD exports and shell overrides trump the file — matching standard .env semantics.
