databricks-app-utils

Utilities for building apps on Databricks: settings, authentication, SQL client, and query registry.

A lightweight Python library for building Streamlit apps on Databricks. It handles everything that sits below the business logic: reading configuration, authenticating with Databricks, executing SQL, and loading query files. Application code should depend on these abstractions rather than touching the Databricks connector directly.

License: GPL-3.0


Modules at a glance

Module                 Class / function               Responsibility
settings.py            AppSettings                    Reads all configuration from environment variables / .env
auth.py                DatabricksAuth, build_auth()   Translates settings into an auth value object
databricks_client.py   DatabricksClient               Executes SQL queries against a Databricks SQL Warehouse
query_registry.py      QueryRegistry, SqlQuery        Loads and caches .sql files from a Python package

Settings management

AppSettings is a Pydantic Settings model. It reads every value from environment variables and optionally from a .env file in the working directory. Unknown variables are silently ignored.
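
The shape of the model is roughly the following sketch (illustrative, assuming pydantic-settings v2; the real class covers every variable in the table below):

from enum import Enum

from pydantic_settings import BaseSettings, SettingsConfigDict

class AuthMethod(str, Enum):
    PAT = "pat"
    U2M = "u2m"
    OBO = "obo"

class AppSettingsSketch(BaseSettings):
    # extra="ignore" is what makes unknown environment variables harmless
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    # field names map to the upper-cased environment variables listed below
    databricks_server_hostname: str
    databricks_http_path: str
    databricks_auth_method: AuthMethod = AuthMethod.OBO
    databricks_pat: str | None = None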

Environment variables

Variable                       Required                Default         Description
DATABRICKS_SERVER_HOSTNAME     ✅                      –               Workspace hostname, e.g. adb-xxx.azuredatabricks.net (no https://)
DATABRICKS_HTTP_PATH           ✅                      –               Warehouse HTTP path, e.g. /sql/1.0/warehouses/…
DATABRICKS_AUTH_METHOD                                 obo             pat | u2m | obo
DATABRICKS_PAT                 ✅ if auth_method=pat   –               Personal access token
DATABRICKS_DEFAULT_CATALOG                             None            Applied as USE CATALOG before each query
DATABRICKS_DEFAULT_SCHEMA                              None            Applied as USE SCHEMA before each query
DATABRICKS_CONNECT_TIMEOUT_S                           30              Connection timeout in seconds
DATABRICKS_RETRY_ATTEMPTS                              1               Extra attempts on transient failures
DATABRICKS_RETRY_BACKOFF_S                             0.5             Initial backoff between retries (doubles each attempt)
QUERY_TAG                                              streamlit-app   Prepended as a SQL comment: /* streamlit-app */

.env file (recommended for local development)

Create a .env file in the project root (never commit it):

DATABRICKS_SERVER_HOSTNAME=adb-1234567890123456.7.azuredatabricks.net
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/abcdef1234567890
DATABRICKS_AUTH_METHOD=u2m
DATABRICKS_DEFAULT_CATALOG=my_catalog
DATABRICKS_DEFAULT_SCHEMA=my_schema

For PAT authentication, add:

DATABRICKS_AUTH_METHOD=pat
DATABRICKS_PAT=dapi0123456789abcdef

Usage

from databricks_app_utils.settings import AppSettings

settings = AppSettings()
print(settings.databricks_server_hostname)
print(settings.databricks_auth_method)   # AuthMethod.U2M

In a Streamlit app, wrap it with @st.cache_resource so settings are read only once per server process:

import streamlit as st

@st.cache_resource
def get_settings() -> AppSettings:
    return AppSettings()

Authentication

See docs/authentication.md for a full technical deep-dive. The summary is:

Method   DATABRICKS_AUTH_METHOD   Best for
PAT      pat                      CI/CD, service accounts
U2M      u2m                      Local development (browser OAuth, zero secrets)
OBO      obo                      Deployed Databricks Apps

Usage

build_auth() converts settings into a DatabricksAuth value object. You rarely need to call it directly — DatabricksClient takes one as a constructor argument.

from databricks_app_utils.settings import AppSettings
from databricks_app_utils.auth import build_auth

settings = AppSettings()
auth = build_auth(settings)

PAT

DATABRICKS_AUTH_METHOD=pat
DATABRICKS_PAT=dapi0123456789abcdef
auth = build_auth(settings)
# auth.method  == AuthMethod.PAT
# auth.access_token == "dapi…"

U2M (browser OAuth — recommended for local dev)

DATABRICKS_AUTH_METHOD=u2m

No secrets needed. On the first query, a browser window opens for the user to log in. Subsequent queries within the same server process reuse the cached token silently.

auth = build_auth(settings)
# auth.method         == AuthMethod.U2M
# auth.oauth_persistence  ← in-memory OAuthPersistenceCache, held for process lifetime

OBO (Databricks Apps)

DATABRICKS_AUTH_METHOD=obo

The token is read from the X-Forwarded-Access-Token request header on every query. The token provider must be injected at runtime from the Streamlit layer:

auth = DatabricksAuth(
    method=AuthMethod.OBO,
    token_provider=lambda: st.context.headers["X-Forwarded-Access-Token"],
)

Database interface

DatabricksClient is the single interface for all SQL execution. It opens a short-lived connection per query (robust against warehouse idle timeouts) and applies USE CATALOG / USE SCHEMA automatically when defaults are configured.
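
Conceptually, each call does something like the following sketch built directly on databricks-sql-connector (illustrative; the real client also handles the auth modes and retries described elsewhere in this README):

from databricks import sql

def run_query(settings, access_token, statement):
    # a fresh connection per statement cannot go stale while the
    # warehouse is auto-stopped between user interactions
    with sql.connect(
        server_hostname=settings.databricks_server_hostname,
        http_path=settings.databricks_http_path,
        access_token=access_token,
    ) as connection:
        with connection.cursor() as cursor:
            if settings.databricks_default_catalog:
                cursor.execute(f"USE CATALOG {settings.databricks_default_catalog}")
            if settings.databricks_default_schema:
                cursor.execute(f"USE SCHEMA {settings.databricks_default_schema}")
            cursor.execute(statement)
            return cursor.fetchall()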

Query methods

Method                                          Returns            Use when
query_polars(sql, params)                       polars.DataFrame   You need a DataFrame for display or transformation
query_pandas(sql, params)                       pandas.DataFrame   Interoperability with pandas-based libraries
query(sql, params)                              list[dict]         Lightweight lookups; no Arrow overhead
merge_dataframe(df, target_table, id_columns)   None               Upsert a DataFrame into a Delta table

Named parameters

Use :name syntax in SQL. Lists are automatically expanded for IN clauses:

db.query_polars(
    "SELECT * FROM orders WHERE status = :status AND region IN :regions",
    params={"status": "shipped", "regions": ["EU", "US"]},
)
# Executes: SELECT * FROM orders WHERE status = ? AND region IN (?, ?)
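
The rewrite itself is straightforward; a minimal sketch of the idea (not the library's actual code):

import re

def expand_params(sql, params):
    # Rewrite :name placeholders to qmarks, in order of appearance,
    # expanding lists and tuples into parenthesised (?, ?, ...) groups.
    args = []

    def replace(match):
        value = params[match.group(1)]
        if isinstance(value, (list, tuple)):
            args.extend(value)
            return "(" + ", ".join("?" for _ in value) + ")"
        args.append(value)
        return "?"

    return re.sub(r":(\w+)", replace, sql), args

Applied to the query above, this yields the qmark form shown in the comment together with the argument list ["shipped", "EU", "US"].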

Polars query

from databricks_app_utils.databricks_client import DatabricksClient

db = DatabricksClient(settings=settings, auth=auth)  # settings and auth from the sections above

df = db.query_polars("SELECT id, name FROM customers LIMIT :n", params={"n": 100})
# Returns a polars.DataFrame

Pandas query

df = db.query_pandas("SELECT id, name FROM customers LIMIT :n", params={"n": 100})
# Returns a pandas.DataFrame

Plain dict query

rows = db.query("SELECT state, COUNT(*) AS cnt FROM customers GROUP BY state")
# Returns [{"state": "CA", "cnt": 1234}, …]

Upsert (MERGE)

Merge a DataFrame into a Delta table using one or more identity columns:

import polars as pl

updates = pl.DataFrame({"id": [1, 2], "score": [9.5, 7.1]})

db.merge_dataframe(
    df=updates,
    target_table="customer_scores",
    id_columns=["id"],
)

Optionally, supply a version_column for optimistic locking — rows whose version has changed since the data was read are silently skipped:

db.merge_dataframe(
    df=updates,
    target_table="customer_scores",
    id_columns=["id"],
    version_column="updated_at",
)
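
Conceptually, the version column becomes an extra predicate on the MATCHED branch, so a row whose version changed since it was read simply fails to match. Illustrative SQL only, not the exact statement the client generates:

MERGE INTO customer_scores AS t
USING updates AS s
  ON t.id = s.id
WHEN MATCHED AND t.updated_at = s.updated_at
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *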

Retry behaviour

DatabricksClient retries failed queries with exponential backoff. Configure via settings:

DATABRICKS_RETRY_ATTEMPTS=2      # 2 extra attempts (3 total)
DATABRICKS_RETRY_BACKOFF_S=1.0   # 1 s, then 2 s
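
The loop itself is equivalent to this sketch (illustrative):

import time

def call_with_retries(run, attempts, backoff_s):
    # attempts = extra tries after the first; the delay doubles each time
    delay = backoff_s
    for tries_left in range(attempts, -1, -1):
        try:
            return run()
        except Exception:  # the real client retries transient errors only
            if tries_left == 0:
                raise
            time.sleep(delay)
            delay *= 2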

Wiring it up in Streamlit

@st.cache_resource
def get_db() -> DatabricksClient:
    settings = get_settings()
    auth = build_auth(settings)
    return DatabricksClient(settings=settings, auth=auth)

Query registry

QueryRegistry loads .sql files from a Python package directory at runtime and caches them in memory. This keeps SQL out of Python source files and makes queries easy to find, review, and test independently.

File layout

SQL files live under a queries package inside your app, organised into sub-packages:

src/<your_app>/queries/
├── __init__.py
└── customers/
    ├── list_customers.sql
    ├── list_customers_by_state.sql
    └── list_states.sql
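
For example, list_customers_by_state.sql might contain something like this (hypothetical contents, using the named-parameter syntax described above):

SELECT customerid, first_name, last_name, state
FROM customers
WHERE state IN :states
LIMIT :limit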

Loading a query

from databricks_app_utils.query_registry import QueryRegistry

registry = QueryRegistry(package="your_app.queries")
q = registry.get("customers/list_customers")

print(q.name)   # "customers/list_customers"
print(q.sql)    # "SELECT customerid, first_name …\n"

The registry is lazy — a file is read from disk only on first access, then cached for the lifetime of the instance.
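
The caching behaviour is roughly equivalent to this sketch built on importlib.resources (illustrative; it returns the raw SQL string rather than a SqlQuery):

from importlib import resources

class MiniRegistry:
    def __init__(self, package):
        self._package = package
        self._cache = {}

    def get(self, name):
        if name not in self._cache:  # hit the filesystem only on first access
            subpackage, _, stem = name.rpartition("/")
            package = f"{self._package}.{subpackage}" if subpackage else self._package
            sql = resources.files(package).joinpath(f"{stem}.sql").read_text()
            self._cache[name] = sql
        return self._cache[name]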

Passing a query to DatabricksClient

q = registry.get("customers/list_customers_by_state")
df = db.query_polars(q.sql, params={"states": ["CA", "NY"], "limit": 200})

Wiring it up in Streamlit

@st.cache_resource
def get_queries() -> QueryRegistry:
    return QueryRegistry(package="your_app.queries")
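
Putting the three cached resources together, a minimal page might look like this (assuming the query file from the layout above exists):

import streamlit as st

st.title("Customers")

db = get_db()
queries = get_queries()

q = queries.get("customers/list_customers")
st.dataframe(db.query_polars(q.sql))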

Why GPL-3.0?

We believe in open source software and want to ensure that improvements to this library remain open and available to everyone. The GPL-3.0 license guarantees that all derivatives and modifications stay free and open source.


Made with ❤️ by the contributors
