PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows
- PyJaws enables declaring Databricks Jobs and Workflows as Python code, allowing for:
- Code Linting
- Formatting
- Parameter Validation
- Modularity and reusability
- In addition to those, PyJaws also provides some nice features out of the box, such as cycle detection.
- Folks who have used Python-based orchestration tools such as Apache Airflow, Luigi and Mage will be familiar with the concepts and the API of PyJaws.
- PyJaws leverages some existing libraries in order to allow for modularisation, reusability and validation.
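The cycle detection mentioned above can be illustrated with a small, standalone sketch. This is not PyJaws code; it uses only the standard library's graphlib to show what happens conceptually when two tasks depend on each other (the task names are hypothetical):

```python
from graphlib import TopologicalSorter, CycleError

# Each task maps to the set of tasks it depends on. Here "ingest"
# depends on "transform" and "transform" on "ingest": a cycle.
dependencies = {
    "ingest": {"transform"},
    "transform": {"ingest"},
}

try:
    order = list(TopologicalSorter(dependencies).static_order())
    print("Execution order:", order)
except CycleError as err:
    # CycleError carries the offending node chain as its second argument.
    print("Cycle detected:", err.args[1])
```

A workflow graph with a cycle has no valid execution order, so catching this at definition time, rather than at deploy time, is the point of the feature.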
Documentation
- Work in progress. Stay tuned!
Development & Testing
- PyJaws can be tested locally for development purposes. To run unit tests, make sure tox, pytest, pytest-cov, and coverage are installed, then from a bash terminal simply run tox.
Getting Started
- First step is installing pyjaws:
pip install pyjaws
- Once it's installed, define your Databricks Workspace authentication variables:
export DATABRICKS_HOST=...
export DATABRICKS_TOKEN=...
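A deployment that reaches Databricks with missing credentials fails with an unhelpful authentication error, so it can be worth checking the variables up front. A minimal sketch (the check itself is illustrative, not part of PyJaws; only the variable names come from the docs above):

```python
import os

def check_credentials() -> list[str]:
    """Return the names of any missing Databricks auth variables."""
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    return [name for name in required if not os.environ.get(name)]

# Fail fast before attempting a deployment.
if missing := check_credentials():
    print("Missing environment variables:", ", ".join(missing))
```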
- Last, define your Workflow Tasks (see examples) and run:
pyjaws create path/to/your/workflow_definitions
Example
from pyjaws.api.base import (
    Cluster,
    Runtime,
    Task,
    Workflow
)
cluster = Cluster(
    job_cluster_key="ai_cluster",
    spark_version=Runtime.DBR_13_ML,
    num_workers=2,
    node_type_id="Standard_DS3_v2",
    cluster_log_conf={
        "dbfs": {
            "destination": "dbfs:/home/cluster_log"
        }
    }
)
# Create a Task object.
ingest_task = Task(
    key="ingest",
    cluster=cluster,
    entrypoint="iot",
    task_name="ingest",
    parameters=[
        "my_parameter_value",
        "--output-table", "my_table"
    ]
)

# Create a second Task that depends on the first.
transform_task = Task(
    key="transform",
    cluster=cluster,
    entrypoint="iot",
    task_name="transform",
    dependencies=[ingest_task],
    parameters=[
        "my_parameter_value2",
        "--input-table", "my_table",
        "--output-table", "output_table"
    ]
)
# Create a Workflow object to define dependencies
# between previously defined tasks.
workflow = Workflow(
    name="my_workflow",
    tasks=[ingest_task, transform_task]
)
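The dependencies declared above fully determine the execution order. The following standalone sketch (standard library only, not PyJaws internals) shows the order implied by the example, where transform depends on ingest:

```python
from graphlib import TopologicalSorter

# Mirror the example above: "transform" depends on "ingest".
task_dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
}

order = list(TopologicalSorter(task_dependencies).static_order())
print(order)  # ['ingest', 'transform']
```

Because the graph is acyclic, a single valid ordering exists and ingest always runs before transform.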
Sample Results
Running pyjaws create examples/simple_workflow will result in the corresponding Workflow being deployed to Databricks.
By default, pyjaws also adds some useful tags to the workflows, indicating which Git repo hosts the Python definition, the commit hash, and when the workflow was last updated.
Disclaimer
- PyJaws is not developed, endorsed, or supported by Databricks. It is provided as-is; no warranty is given for use of this package.