Knowit's DataOps library, simplifying the building of data pipelines in Databricks for both testing and production use cases. The package enables a workflow where users write their code in notebooks and then deploy it to a Databricks workspace without stepping on each other's toes.
Brickops
DataOps framework for Databricks
Table of contents:
- Getting started
- Purpose
- Naming functions
- Deployment functions
- Getting started
- How to get into devcontainer from the command line
- Configuration options for naming and mesh levels
- Underlying philosophy
Getting Started
The package can be installed with pip:
pip install brickops
Purpose
Brickops is a framework to automatically name Databricks assets, like Unity Catalog (UC) schemas, tables and jobs, according to environment (e.g. dev, staging, prod) and domain/project/flow names (where domain, project, flow are derived from the folder path in the repository).
This enables the users (data engineers, etc) to easily develop and deploy data sets, models and pipelines, and automatically comply with organizational principles.
Brickops contains naming functions for UC assets and an autojob() function for auto-deploying jobs. Auto-deployment of DLT pipelines will be added in the near future.
Naming functions
Brickops works in the context of a folder path representing a data pipeline or flow:
orgs/acme/domains/transport/projects/taxinyc/flows/revenue/
The structure here is:
- org: acme
- domain: transport
- project: taxinyc
- flow: revenue
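The mesh levels can be read directly off such a path. As an illustration only (this is not the library's actual parser, and the function name is hypothetical), a minimal sketch:

```python
# Hypothetical sketch: extract mesh levels (org, domain, project, flow)
# from a Brickops-style folder path. Not the library's real parser.
def parse_mesh_path(path: str) -> dict:
    parts = path.strip("/").split("/")
    # Each level appears as <plural keyword>/<name>, e.g. domains/transport
    keywords = {"orgs": "org", "domains": "domain",
                "projects": "project", "flows": "flow"}
    levels = {}
    for keyword, level in keywords.items():
        if keyword in parts:
            levels[level] = parts[parts.index(keyword) + 1]
    return levels

print(parse_mesh_path("orgs/acme/domains/transport/projects/taxinyc/flows/revenue/"))
# {'org': 'acme', 'domain': 'transport', 'project': 'taxinyc', 'flow': 'revenue'}
```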
Catalog name from path: catname_from_path()
Example output from naming functions from notebooks under that path:
# Name functions enable automatic env+user specific database naming
from libs.catname import catname_from_path
from libs.dbname import dbname
cat = catname_from_path()
print(f"Catalog name derived from path: {cat}")
Default output (uses the domain):
Catalog name derived from path: transport
Output with optional full mesh prefixing (org_domain_project):
Catalog name derived from path: acme_transport_taxinyc
Environment specific database name: dbname()
Database (schema) name with environment prefix:
db = dbname(db="revenue", cat=cat)
print(f"DB name: {db}")
Output in dev environment:
DB name: transport.dev_paldevibe_main_0e7768a7_revenue
Output in prod environment:
DB name: transport.revenue
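The prefixing behaviour above can be approximated with plain string formatting. A minimal sketch, assuming the env/username/branch/commit values are available (the real dbname() derives them from the Databricks and git context):

```python
# Hypothetical sketch of env-aware schema naming; not the library's
# actual implementation, which reads username/branch/commit itself.
def schema_name(db: str, env: str, username: str = "",
                gitbranch: str = "", gitshortref: str = "") -> str:
    if env == "prod":
        return db  # prod gets the clean name
    # all other envs are prefixed to avoid clobbering prod data
    return f"{env}_{username}_{gitbranch}_{gitshortref}_{db}"

print(schema_name("revenue", env="prod"))
# revenue
print(schema_name("revenue", "dev", "paldevibe", "main", "0e7768a7"))
# dev_paldevibe_main_0e7768a7_revenue
```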
Table name: tablename()
from brickops.datamesh.naming import build_table_name as tablename
revenue_by_borough_tbl = tablename(cat=cat, db="revenue", tbl="revenue_by_borough")
print(f"revenue_by_borough_tbl: {revenue_by_borough_tbl}")
Output in dev environment:
transport.dev_paldevibe_branchname_0e7768a7_revenue.revenue_by_borough
Output in prod environment:
transport.revenue.revenue_by_borough
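A fully qualified Unity Catalog name is simply catalog.schema.table. A trivial sketch of the assembly (the real tablename() additionally applies the env-specific schema prefix shown above):

```python
# Sketch: assemble a fully qualified UC table name (catalog.schema.table).
def full_table_name(cat: str, db: str, tbl: str) -> str:
    return f"{cat}.{db}.{tbl}"

tbl = full_table_name("transport", "revenue", "revenue_by_borough")
print(tbl)  # transport.revenue.revenue_by_borough
```

Such a name can then be used directly in SQL, e.g. spark.sql(f"SELECT * FROM {tbl}").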
In dev (and all environments except prod), the database name is prefixed with username, branch and commit ref. These automatic prefixes prevent notebooks running in development mode from overwriting production data.
Deployment functions
Auto-deploying a Spark pipeline
from brickops.dataops.deploy.autojob import autojob
response = autojob()
This call will automatically name and deploy a job based on a deployment.yml file in the same folder, e.g. orgs/acme/domains/transport/projects/taxinyc/flows/revenue/deployment.yml.
In development, the job name created will be:
acme_transport_taxinyc_dev_abirkhan_branchname_4c6799ab_revenue
In production, the job name created will be:
acme_transport_taxinyc_revenue
The automatic prefixes in dev prevent development jobs from overwriting production jobs.
Getting started
This project uses uv. It might be easiest to use the devcontainer,
defined in .devcontainer, which is supported by VSCode and other tools.
If you want a local install, follow the installation instructions for your platform on the project homepage.
Next, make sure you are in the project root and run the following command in the terminal:
uv sync
This will create a virtual environment and install the required packages in it.
The project configuration can be found in pyproject.toml.
You can now run the tests with
uv run pytest
How to get into devcontainer from command line
make start-devcontainer
make devcontainer-shell
Configuration options for naming and mesh levels
Naming of resources (catalogs, dbs/schemas, jobs, pipelines) can be configured in a file called .brickopscfg/config.yaml in the root of your repo. Example configurations can be found in tests/.brickopscfg/config.yml.
Mesh levels here refer to the granularity/depth of your organization as represented in the repo structure, e.g. organization, domain and project.
An example configuration could be:
naming:
  job:
    prod: "{domain}_{project}_{env}"
    other: "{domain}_{project}_{env}_{username}_{gitbranch}_{gitshortref}"
  pipeline:
    prod: "{domain}_{project}_{env}_dlt"
    other: "{domain}_{project}_{env}_{username}_{gitbranch}_{gitshortref}_dlt"
  catalog:
    prod: "{domain}"
    other: "{domain}"
  db:
    prod: "{db}"
    other: "{env}_{username}_{gitbranch}_{gitshortref}_{db}"
Let us now see what resource names would be produced from a notebook located at
something/domains/marketing/projects/projectfoo/flows/prep/foo_notebook.
For catalogs, the configuration above means the domain section of the path is used; for jobs, a combination of domain, project and env.
The resource names would become:
- job name:
  - prod: marketing_projectfoo_prod
  - dev: marketing_projectfoo_dev_paldevibe_branchname_82e5d310
- pipeline name:
  - prod: marketing_projectfoo_prod_dlt
  - dev: marketing_projectfoo_dev_paldevibe_branchname_82e5d310_dlt
- catalog name:
  - prod: marketing
  - dev: marketing
- db name for a database/schema called customers:
  - prod: customers
  - dev: dev_paldevibe_branchname_82e5d310_customers
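These names follow mechanically from the templates via Python string formatting. A hedged sketch (not the framework's actual rendering engine, which fills the context from the repo path, git state and environment):

```python
# Hypothetical sketch: applying the job templates from the config above
# with str.format(). The dict mirrors the "naming:" section of config.yaml.
NAMING = {
    "job": {
        "prod": "{domain}_{project}_{env}",
        "other": "{domain}_{project}_{env}_{username}_{gitbranch}_{gitshortref}",
    },
}

def resource_name(kind: str, env: str, **context) -> str:
    # prod uses the clean template; every other env uses the prefixed one
    key = "prod" if env == "prod" else "other"
    return NAMING[kind][key].format(env=env, **context)

print(resource_name("job", "prod", domain="marketing", project="projectfoo"))
# marketing_projectfoo_prod
print(resource_name("job", "dev", domain="marketing", project="projectfoo",
                    username="paldevibe", gitbranch="branchname",
                    gitshortref="82e5d310"))
# marketing_projectfoo_dev_paldevibe_branchname_82e5d310
```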
With org support, for a notebook at
/Repos/test@foobar.foo/dataplatform/something/org/acme/domains/sales/projects/projectfoo/flows/testflow/foo_notebook, a config of {org}_{domain}_{project}_{env} would result in acme_sales_projectfoo_prod in a production environment.
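Using the values from this example, the template expansion is plain str.format():

```python
# The org-level template from the text, applied with str.format():
template = "{org}_{domain}_{project}_{env}"
name = template.format(org="acme", domain="sales",
                       project="projectfoo", env="prod")
print(name)  # acme_sales_projectfoo_prod
```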
Development tools
Ruff
How to run ruff:
make ruff
Without make:
uv run ruff check --output-format=github .
Mypy
How to run mypy:
make mypy
Without make:
mypy .
Underlying philosophy
The framework is partly based on the thoughts presented in the article Data Platform Urbanism - Sustainable Plans for your Data Work.
It can be explored in the open source workshop Databricks DataOps course.