databricks-job-runner
Reusable CLI for uploading, submitting, validating, fetching logs, and cleaning Databricks job runs.
Wraps the Databricks Python SDK into a small library that each project configures with a Runner instance. One Runner gives you eight CLI subcommands — upload, submit, validate, logs, clean, catalog, schema, and volume — without writing any Databricks API code in your project.
Installation
uv add databricks-job-runner
Or with pip:
pip install databricks-job-runner
For local development against a checkout:
# pyproject.toml
[tool.uv.sources]
databricks-job-runner = { path = "../databricks-job-runner", editable = true }
Warning — do not list databricks-job-runner as a core dependency.

databricks-job-runner is a local-only CLI tool; it is not published to PyPI. If you add it to your project's [project.dependencies] (core dependencies), any wheel you build from that project will declare it as a requirement. When Databricks serverless (or any remote environment) tries to install your wheel, pip will fail because it cannot resolve databricks-job-runner. Instead, put it in an optional extras group so it is only installed locally:

[project.optional-dependencies]
cli = ["databricks-job-runner"]

Then install locally with uv sync --extra cli (or pip install -e '.[cli]'). Your submitted scripts (e.g. run_my_package.py) should never import databricks_job_runner; they run on Databricks where it is not available.
Quick start
Create a cli/ package in your project with two files:
cli/__init__.py
from databricks_job_runner import Runner
runner = Runner(
run_name_prefix="my_project",
wheel_package="my_package", # optional
)
cli/__main__.py
from cli import runner
runner.main()
Then run from your project's root (not from the databricks-job-runner repo — this is a library, not a standalone CLI):
uv run python -m cli upload --all # upload agent_modules/*.py
uv run python -m cli upload test_hello.py # upload a single file
uv run python -m cli upload --wheel # build and upload wheel
uv run python -m cli submit test_hello.py # submit a job and wait
uv run python -m cli submit test_hello.py --no-wait
uv run python -m cli validate # list remote workspace contents
uv run python -m cli validate test_hello.py # verify a specific file is uploaded
uv run python -m cli logs # stdout/stderr from the most recent run
uv run python -m cli logs 12345 # stdout/stderr from a specific run
uv run python -m cli clean --yes # clean workspace + runs
uv run python -m cli clean --runs --yes # clean only runs
# Unity Catalog management
uv run python -m cli catalog list
uv run python -m cli catalog get my_catalog # show storage location
uv run python -m cli catalog create my_catalog --comment "Analytics"
uv run python -m cli catalog create my_catalog --storage-root "abfss://container@account.dfs.core.windows.net/path"
uv run python -m cli catalog delete my_catalog --force --yes
uv run python -m cli schema list my_catalog
uv run python -m cli schema create my_catalog.my_schema
uv run python -m cli schema delete my_catalog.my_schema --yes
uv run python -m cli volume list my_catalog.my_schema
uv run python -m cli volume create my_catalog.my_schema.my_vol
uv run python -m cli volume create my_catalog.my_schema.ext_vol --volume-type EXTERNAL --storage-location s3://bucket/path
uv run python -m cli volume delete my_catalog.my_schema.my_vol --yes
Configuration
The runner reads a .env file from the project root. Core keys (all prefixed with DATABRICKS_ for consistency):
| Key | Default | Required | Description |
|---|---|---|---|
| DATABRICKS_PROFILE | — | no | CLI profile in ~/.databrickscfg. When unset, the SDK's unified auth falls back to env vars (DATABRICKS_HOST/DATABRICKS_TOKEN), Azure CLI, service principals, etc. |
| DATABRICKS_COMPUTE_MODE | cluster | no | cluster or serverless. Selects the compute backend for submitted jobs. |
| DATABRICKS_CLUSTER_ID | — | when DATABRICKS_COMPUTE_MODE=cluster | All-purpose cluster to run jobs on. Started automatically if terminated. |
| DATABRICKS_SERVERLESS_ENV_VERSION | 3 | no | Serverless environment version (e.g. 3 for Python 3.12). |
| DATABRICKS_WORKSPACE_DIR | — | yes | Remote workspace path (e.g. /Users/you/my_project). |
| DATABRICKS_VOLUME_PATH | — | when using upload --wheel | UC Volume path for wheel uploads. |
Precedence: pre-existing environment variables override .env values, matching 12-factor conventions (CI/CD and shell exports can override the file).
Additional non-core keys are captured in RunnerConfig.extras and automatically passed to submitted jobs as KEY=VALUE parameters. Scripts call inject_params() at startup to load them into os.environ, then use pydantic BaseSettings to read configuration.
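For illustration, this round trip can be sketched like so (a simplified stand-in for the library's behavior, not its actual code):

```python
import os

def env_params(extras: dict[str, str]) -> list[str]:
    # Turn extras into KEY=VALUE job parameters, as described above.
    return [f"{key}={value}" for key, value in extras.items()]

def inject_params(argv: list[str]) -> None:
    # Parse KEY=VALUE parameters into os.environ; setdefault means
    # pre-existing environment variables keep precedence.
    for arg in argv:
        if "=" in arg and not arg.startswith("-"):
            key, _, value = arg.partition("=")
            os.environ.setdefault(key, value)

# The runner builds params from .env extras; the submitted script
# receives them in sys.argv and injects them into its environment.
params = env_params({"NEO4J_URI": "neo4j+s://abc123.databases.neo4j.io"})
inject_params(params)
```

After injection, the script can read `NEO4J_URI` via os.getenv() or a pydantic settings class.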
Compute modes
- Classic cluster (DATABRICKS_COMPUTE_MODE=cluster, the default): jobs submit to an existing all-purpose cluster identified by DATABRICKS_CLUSTER_ID. The runner auto-starts the cluster if it is terminated, and attaches wheels via Library(whl=...).
- Serverless (DATABRICKS_COMPUTE_MODE=serverless): jobs submit to Databricks serverless compute with a job-level environment spec. No cluster ID needed; wheels attach as Environment.dependencies entries (UC Volume paths are supported directly).
Example .env (classic cluster)
DATABRICKS_PROFILE=my-profile
DATABRICKS_CLUSTER_ID=0123-456789-abcdef
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume
NEO4J_URI=neo4j+s://abc123.databases.neo4j.io
NEO4J_PASSWORD=secret
Example .env (serverless)
DATABRICKS_PROFILE=my-profile
DATABRICKS_COMPUTE_MODE=serverless
DATABRICKS_SERVERLESS_ENV_VERSION=3
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume
All DATABRICKS_* keys listed above become typed fields on RunnerConfig; any other keys (like NEO4J_URI above) go into config.extras.
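The core-vs-extras split can be sketched as follows (the key set is taken from the table above; the function name is hypothetical):

```python
CORE_KEYS = {
    "DATABRICKS_PROFILE", "DATABRICKS_COMPUTE_MODE", "DATABRICKS_CLUSTER_ID",
    "DATABRICKS_SERVERLESS_ENV_VERSION", "DATABRICKS_WORKSPACE_DIR",
    "DATABRICKS_VOLUME_PATH",
}

def split_env(values: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    # Core keys become typed RunnerConfig fields; everything else
    # lands in config.extras and is forwarded to submitted jobs.
    core = {k: v for k, v in values.items() if k in CORE_KEYS}
    extras = {k: v for k, v in values.items() if k not in CORE_KEYS}
    return core, extras

core, extras = split_env({
    "DATABRICKS_WORKSPACE_DIR": "/Users/you/my_project",
    "NEO4J_URI": "neo4j+s://abc123.databases.neo4j.io",
})
```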
API
Runner
Runner(
run_name_prefix: str,
project_dir: Path | str | None = None,
wheel_package: str | None = None,
secret_keys: list[str] | None = None,
scripts_dir: str = "agent_modules",
extra_files: list[str] | None = None,
)
| Parameter | Description |
|---|---|
| run_name_prefix | Prefix for job run names and cleanup filtering |
| project_dir | Project root (defaults to cwd()). Must contain .env and the scripts directory |
| wheel_package | Package name for wheel builds. Enables upload --wheel. Wheels upload to <DATABRICKS_VOLUME_PATH>/wheels/ |
| secret_keys | .env key names whose values are stored in a Databricks secret scope instead of forwarded as plaintext parameters |
| scripts_dir | Name of the local subdirectory containing scripts to upload and submit (default: "agent_modules"). Change to match your project layout, e.g. "jobs" or "scripts" |
| extra_files | Paths relative to project_dir that are uploaded into the remote scripts directory alongside Python scripts. Use for non-Python assets that job scripts read via Path(__file__).parent on the cluster (e.g. ["sql/gold_schema.sql"]) |
RunnerConfig
Pydantic model holding parsed .env values. Frozen (immutable) after construction.
| Field | Type | Description |
|---|---|---|
| databricks_profile | str \| None | CLI profile name, or None for unified-auth fallback |
| databricks_compute_mode | Literal["cluster", "serverless"] | Compute backend ("cluster" by default) |
| databricks_cluster_id | str \| None | Cluster ID (required when databricks_compute_mode == "cluster") |
| databricks_serverless_env_version | str | Serverless environment version (default "3") |
| databricks_workspace_dir | str | Remote workspace root (required) |
| databricks_volume_path | str \| None | UC Volume path for wheel uploads |
| extras | dict[str, str] | All non-core keys from .env |
The env_params() method returns extras (plus DATABRICKS_VOLUME_PATH) as KEY=VALUE strings suitable for job parameter injection.
inject_params
from databricks_job_runner import inject_params
inject_params()
Call at the top of submitted scripts to parse KEY=VALUE parameters from sys.argv into os.environ. This lets scripts use pydantic BaseSettings or os.getenv() to read configuration that the runner injected from .env. Uses setdefault so pre-existing env vars take precedence.
Note: databricks_job_runner is not available on the Databricks cluster. For standalone scripts (not part of a wheel), inline the equivalent logic instead of importing:

import os, sys

for _arg in sys.argv[1:]:
    if "=" in _arg and not _arg.startswith("-"):
        _key, _, _value = _arg.partition("=")
        os.environ.setdefault(_key, _value)

For wheel-based scripts, the wheel's entry point can call inject_params() only if databricks_job_runner is listed as a wheel dependency — but since it is not published to PyPI, the inline approach is preferred.
RunnerError
Raised when a runner operation cannot proceed (missing config, file not found, cluster stopped, job failed). The CLI formats and exits; library callers can catch and handle.
Project layout
The runner expects this layout in your project:
my_project/
.env
agent_modules/
test_hello.py
run_lab2.py
...
cli/
__init__.py # Runner config
__main__.py # entry point
Scripts in agent_modules/ are uploaded to {DATABRICKS_WORKSPACE_DIR}/agent_modules/ on Databricks and submitted as Spark Python tasks.
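A minimal agent_modules/test_hello.py might look like this (illustrative; kept dependency-free because databricks_job_runner is not installed on the cluster):

```python
import os
import sys

def main() -> None:
    # Runs on Databricks as a Spark Python task. Injected KEY=VALUE
    # parameters arrive in sys.argv; workspace env vars via os.getenv.
    print(f"hello from {os.getenv('DATABRICKS_WORKSPACE_DIR', 'local')}")
    print(f"received {len(sys.argv) - 1} job parameters")

if __name__ == "__main__":
    main()
```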
Subcommands
upload
- upload <file> — Upload a single file from agent_modules/
- upload --all — Upload all *.py files from agent_modules/
- upload --wheel — Build a wheel with uv build and upload to the UC Volume (requires wheel_package and DATABRICKS_VOLUME_PATH)
submit
- submit <script> — Submit a script as a one-time Databricks job and wait for completion. Default: test_hello.py
- submit <script> --no-wait — Submit without waiting
On classic mode, if the target cluster is not already RUNNING, it is started automatically and the submit waits (up to 20 minutes, the SDK default) for it to reach RUNNING. On serverless, no warm-up step is required. When submitting a script whose name starts with run_{wheel_package} (e.g. run_{wheel_package}.py, run_{wheel_package}_schema.py, run_{wheel_package}_sample.py), the runner automatically attaches the wheel — as a Library(whl=...) on classic, or as an Environment.dependencies entry on serverless. Scripts that don't follow this prefix convention (e.g. test_hello.py) are submitted without the wheel.
validate
- validate — List the remote workspace directory and its agent_modules/ subdirectory. On classic, auto-starts the cluster if needed; on serverless, this is a no-op.
- validate <file> — Also verify that {DATABRICKS_WORKSPACE_DIR}/agent_modules/<file> exists; exits non-zero if not.
logs
- logs — Print stdout/stderr, error, and trace from the most recent run matching {run_name_prefix}:*
- logs <run_id> — Print output for a specific parent run ID
Output is fetched via the Jobs API's get_run_output, which returns the tail 5 MB of stdout/stderr captured per task (the API caps output size; truncation is signaled in the output). The runner resolves the parent run to its task-level run IDs automatically, so pass the parent run_id shown at submit time. Databricks auto-expires runs after 60 days.
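The parent-to-task resolution can be sketched against the Jobs API like this (a simplified stand-in, shown with a stub client rather than a real databricks.sdk.WorkspaceClient):

```python
from types import SimpleNamespace

def collect_task_logs(client, parent_run_id: int) -> dict[str, str]:
    # Resolve the parent run to its task runs, then fetch each task's
    # captured stdout/stderr tail via get_run_output.
    run = client.jobs.get_run(parent_run_id)
    return {
        task.task_key: (client.jobs.get_run_output(task.run_id).logs or "")
        for task in (run.tasks or [])
    }

# Stub standing in for a WorkspaceClient, for demonstration only:
stub = SimpleNamespace(jobs=SimpleNamespace(
    get_run=lambda run_id: SimpleNamespace(
        tasks=[SimpleNamespace(task_key="main", run_id=99)]),
    get_run_output=lambda run_id: SimpleNamespace(logs="hello\n"),
))
logs = collect_task_logs(stub, 12345)
```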
clean
- clean — Delete the remote workspace directory and all matching job runs
- clean --workspace — Delete only the workspace directory
- clean --runs — Delete only job runs
- clean --yes — Skip confirmation prompt
catalog
Manage Unity Catalog catalogs.
- catalog list — List all catalogs visible to the current user
- catalog get <name> — Show details for a catalog, including the storage root and storage location (managed location). Use this to find where a catalog's managed tables are stored.
- catalog create <name> [--storage-root URL] [--comment TEXT] — Create a new catalog. --storage-root sets the managed storage location (equivalent to MANAGED LOCATION in SQL). Required on metastores that use per-catalog storage roots instead of a metastore-level default.
- catalog delete <name> [--yes] [--force] — Delete a catalog. --force cascades to all schemas, tables, and volumes inside it. Prompts for confirmation unless --yes is given.
Finding the managed location
To find the managed storage location for an existing catalog:
uv run python -m cli catalog get my_catalog
This prints the storage root (the URL set at creation time) and the storage location (the full resolved path where managed tables are stored). Example output:
Catalog: my_catalog
Owner: ryan.knight@example.com
Storage root: abfss://container@account.dfs.core.windows.net/path
Storage location: abfss://container@account.dfs.core.windows.net/path/__unitystorage/catalogs/abc123
Comment: Analytics catalog
Created: 2025-01-15 10:30:00
schema
Manage Unity Catalog schemas. Schema names use dotted notation: catalog.schema.
- schema list <catalog> — List schemas in a catalog
- schema get <catalog.schema> — Show details for a schema
- schema create <catalog.schema> [--comment TEXT] — Create a new schema
- schema delete <catalog.schema> [--yes] — Delete a schema. Prompts for confirmation unless --yes is given.
volume
Manage Unity Catalog volumes. Volume names use dotted notation: catalog.schema.volume.
- volume list <catalog.schema> — List volumes in a schema
- volume get <catalog.schema.volume> — Show details (type, owner, storage location) for a volume
- volume create <catalog.schema.volume> [--volume-type MANAGED|EXTERNAL] [--storage-location URL] [--comment TEXT] — Create a volume. Defaults to MANAGED. EXTERNAL volumes require --storage-location.
- volume delete <catalog.schema.volume> [--yes] — Delete a volume. Prompts for confirmation unless --yes is given.
Releasing
Releases are published to PyPI automatically when you push a Git tag. The version in the tag becomes the package version.
git tag v0.4.7 && git push origin v0.4.7
The GitHub Actions workflow strips the v prefix, patches pyproject.toml with the version, builds the wheel and sdist, and publishes to PyPI via trusted publishing.
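The tag-to-version patching step might look like this (an assumed sketch; the actual workflow step may differ):

```python
import re

def patch_pyproject(pyproject_text: str, git_tag: str) -> str:
    # Strip the leading "v" (v0.4.7 -> 0.4.7) and rewrite the
    # version = "..." line under [project].
    version = git_tag.removeprefix("v")
    return re.sub(r'(?m)^version = ".*"$',
                  f'version = "{version}"', pyproject_text)

patched = patch_pyproject(
    '[project]\nname = "databricks-job-runner"\nversion = "0.0.0"\n',
    "v0.4.7",
)
```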
Requirements
- Python 3.12+
- Databricks authentication: either a Databricks CLI profile, env vars (DATABRICKS_HOST/DATABRICKS_TOKEN), or any other unified-auth method
- Either a Databricks all-purpose cluster (auto-started if terminated) or serverless compute enabled for the workspace
- uv (for wheel building only)
Architecture
databricks-job-runner is layered into a thin CLI, an orchestrator, and a set of single-purpose action modules. Runner is the only class that consuming projects need to touch.
cli.py argparse + dispatch (flags -> Runner method calls)
|
runner.py Runner: holds config, owns the WorkspaceClient,
| exposes one method per subcommand
|
|-- config.py RunnerConfig (frozen pydantic) + .env parser
|-- compute.py ClassicCluster / Serverless strategies (Protocol)
|-- inject.py inject_params() for submitted scripts
|-- upload.py workspace file + wheel upload
|-- submit.py compute-agnostic job submission
|-- validate.py workspace listing + file-existence checks
|-- logs.py per-task stdout/stderr retrieval
|-- catalog.py Unity Catalog catalog/schema/volume management
|-- clean.py workspace + run cleanup
|-- errors.py RunnerError
Layers
- CLI (cli.py) owns all argparse setup and translates the parsed namespace into method calls on Runner. Formats RunnerError into friendly exit messages. No argparse knowledge lives outside this file.
- Orchestration (runner.py) exposes the Runner class. RunnerConfig and the WorkspaceClient are built lazily on first access, so importing a project's cli/__init__.py doesn't touch Databricks. Each public method coordinates a single subcommand end-to-end.
- Action modules (upload.py, submit.py, validate.py, logs.py, clean.py, catalog.py) are plain functions wrapping SDK calls. None know about argparse or Runner, keeping each unit composable and independently testable.
- Compute strategies (compute.py) implement the Compute protocol. A strategy knows how to (1) validate that its backend is ready, (2) decorate a SubmitTask with backend-specific fields, and (3) produce the top-level environments[] list for jobs.submit. submit_job is compute-agnostic: swapping backends is a strategy change, not a conditional branch.
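A hedged sketch of that strategy shape (method names and fields assumed, with plain dicts standing in for the SDK's SubmitTask and Environment types):

```python
from dataclasses import dataclass
from typing import Protocol

class Compute(Protocol):
    # The three responsibilities a strategy carries.
    def validate_ready(self) -> None: ...
    def decorate_task(self, task: dict) -> dict: ...
    def environments(self) -> list[dict]: ...

@dataclass(frozen=True)
class ClassicCluster:
    cluster_id: str
    def validate_ready(self) -> None:
        pass  # real code would auto-start a terminated cluster
    def decorate_task(self, task: dict) -> dict:
        return {**task, "existing_cluster_id": self.cluster_id}
    def environments(self) -> list[dict]:
        return []  # classic mode attaches wheels as task libraries

@dataclass(frozen=True)
class Serverless:
    env_version: str = "3"
    def validate_ready(self) -> None:
        pass  # serverless needs no warm-up
    def decorate_task(self, task: dict) -> dict:
        return {**task, "environment_key": "default"}
    def environments(self) -> list[dict]:
        return [{"environment_key": "default",
                 "spec": {"client": self.env_version}}]

def submit_job(compute: Compute, task: dict) -> dict:
    # Compute-agnostic: the strategy supplies all backend-specific fields.
    compute.validate_ready()
    return {"tasks": [compute.decorate_task(task)],
            "environments": compute.environments()}
```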
Design choices
- Strategy pattern for compute. Compute is a typing.Protocol, so adding a new backend is a new frozen dataclass that matches the shape — no changes to submit_job, Runner, or the CLI. ClassicCluster and Serverless are both frozen dataclasses for value-equality and immutability.
- Single validation point. Required-key enforcement lives entirely in RunnerConfig.from_env_file, branching on DATABRICKS_COMPUTE_MODE (only DATABRICKS_CLUSTER_ID is required when mode is cluster). Downstream code trusts the config is valid.
- Automatic parameter injection. All non-core .env keys are passed to submitted jobs as KEY=VALUE parameters via RunnerConfig.env_params(). Scripts call inject_params() at startup to load them into os.environ, then use pydantic BaseSettings to read configuration. No callback or per-project wiring needed.
- Wheel convention. A submitted script whose name starts with run_{wheel_package} auto-attaches the latest wheel from dist/ — as Library(whl=...) on classic, or an Environment.dependencies entry on serverless. This covers both a single entry-point script (run_my_package.py) and per-phase scripts (run_my_package_schema.py, run_my_package_sample.py). Scripts that don't match the prefix (e.g. test_hello.py) are submitted without the wheel.
- 12-factor .env. Pre-existing env vars override .env values, so CI/CD exports and shell overrides trump the file, matching standard .env semantics.