Get to da cluster! Remote function execution for Databricks.
Choppa
Get to da cluster
Run Python in Databricks straight from your laptop
Because Running Code Shouldn't Be Hard
So you want to run something in Databricks? Strap in, because they expect you to build jobs with their nifty homebrew orchestrator, deploy environments using better-than-Terraform bundles, develop in their hosted Monaco UI (which is waaay better than whatever VSCode has), and, oh. Remote development? Like from your laptop? Did we mention their hosted notebooks already? They come with AI and serverless.
You don't want to do any of that. You want to write some code and run it. Like a normal person.
Installation
pip install choppa
Configuration
Choppa needs to know what cluster to run stuff on. In order of precedence, Choppa will use the cluster:
- set via the cluster_id parameter when you instantiate Choppa
- whatever you put in the environment variable DATABRICKS_CLUSTER_ID
- the value of cluster_id in ~/.databrickscfg
  - if the environment variable DATABRICKS_CONFIG_PROFILE is set, using that profile
  - otherwise using the DEFAULT profile
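The precedence above can be sketched in a few lines of plain Python. This is not Choppa's actual code, just an illustration of the lookup order; the function name resolve_cluster_id and its parameters are made up for the example, and the config is passed in as text so the sketch runs without touching your real ~/.databrickscfg.

```python
import configparser
import os
from typing import Mapping, Optional


def resolve_cluster_id(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cfg_text: str = "",
) -> Optional[str]:
    """Hypothetical helper that mirrors the documented precedence."""
    env = os.environ if env is None else env

    # 1. An explicit cluster_id argument wins.
    if explicit:
        return explicit

    # 2. Then the DATABRICKS_CLUSTER_ID environment variable.
    if env.get("DATABRICKS_CLUSTER_ID"):
        return env["DATABRICKS_CLUSTER_ID"]

    # 3. Then cluster_id from the config file, using the profile named by
    #    DATABRICKS_CONFIG_PROFILE, or DEFAULT when that is not set.
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    profile = env.get("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
    if profile != "DEFAULT" and not parser.has_section(profile):
        return None
    # Note: configparser lets named profiles inherit keys from [DEFAULT].
    return parser.get(profile, "cluster_id", fallback=None)


EXAMPLE_CFG = """
[DEFAULT]
cluster_id = default-cluster

[dev]
cluster_id = dev-cluster
"""

print(resolve_cluster_id(None, {"DATABRICKS_CONFIG_PROFILE": "dev"}, EXAMPLE_CFG))
```

To try it against a real config, read the file yourself (for example with pathlib.Path.home() / ".databrickscfg") and pass its text as cfg_text.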
Usage
from choppa import Choppa

dutch = Choppa()

@dutch.remote
def add(a: int, b: int) -> int:
    return a + b

add(1, 2)  # 3
Donezo. You can probably stop reading now because that covers 99% of the frustration of Databricks development with just a freaking decorator.
Advanced Usage
Scope
Choppa only instantiates remote environments for contexts that are possible to scope without having to inspect frames or mess with function ASTs. Or, put another way: Only functions and arguments are in-scope.
from choppa import Choppa

EXPONENT = 10
dutch = Choppa()

# This version works but is pretty boring
@dutch.remote
def an_option(a: int, exponent: int) -> int:
    return a ** exponent

# This one uses ONE WEIRD TRICK to always produce the exact same result!
# (EXPONENT is a module-level global, not a function or an argument, so it is not in scope remotely)
@dutch.remote
def another_option(a: int) -> int:
    return a ** EXPONENT
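If the scoping rule feels abstract, here is a local stand-in for it. This is not how Choppa ships code to the cluster (the source doesn't say); it only assumes that the remote side sees the function and its arguments and nothing else, which is simulated here by running the function source in a fresh namespace. The helper run_in_fresh_namespace is invented for the illustration.

```python
import textwrap

EXPONENT = 10  # lives in *this* module only

BORING_SOURCE = textwrap.dedent("""
    def an_option(a, exponent):
        return a ** exponent
""")

WEIRD_TRICK_SOURCE = textwrap.dedent("""
    def another_option(a):
        return a ** EXPONENT
""")


def run_in_fresh_namespace(source: str, func_name: str, *args):
    """Define the function in an empty namespace, then call it.

    The fresh dict plays the role of the remote process: it has builtins,
    the function itself, and whatever arguments are passed in.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[func_name](*args)


# Everything the function needs arrives as an argument, so this works.
boring_result = run_in_fresh_namespace(BORING_SOURCE, "an_option", 2, EXPONENT)

# The global never made the trip, so this fails the same way every time.
try:
    run_in_fresh_namespace(WEIRD_TRICK_SOURCE, "another_option", 2)
    outcome = "ok"
except NameError as exc:
    outcome = f"NameError: {exc}"

print(boring_result, "|", outcome)
```

The takeaway is the same as above: pass what you need as arguments.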
Caching
Consider this straightforward workflow:
from databricks.connect import DatabricksSession
from pyspark.sql import Row

spark = DatabricksSession.builder.getOrCreate()

def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

def analyze_stuff(rows: list[Row]):
    return len(rows)

data = get_stuff()
result = analyze_stuff(data)
I bet you're having fun downloading those billion rows from Databricks! Haven't even gotten to your analysis yet and you're already wishing computers came with hardware...
You'd maybe get to your analyze_stuff() sooner if you could cache the result on Databricks and only pass a reference over the network:
from choppa import Choppa

choppa = Choppa(
    artifact_dir="/Workspace/Users/you@company/artifacts",
    result_size_max=2**10
)

@choppa.remote
def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

def analyze_stuff(rows: list[Row]):
    return len(rows)

ref = get_stuff()  # type: ArtifactRef
Since a literal billion rows will blow through 1,024 bytes (2**10), the result isn't returned. But then you have this ArtifactRef thing and need analyze_stuff() to use your actual data. You could always materialize the artifact and run your analysis locally:
data = ref.dereference() # hahaha, that's right- it's C all over again. sucker!
result = analyze_stuff(data)
Yeah, it's a cute trick but doesn't have a lot of value since you still need to download the data eventually. Hmm... I know! You could let Choppa automagically deal with ArtifactRefs behind the scenes (it does), run everything on Databricks (you should), and just run your code (the freakin' dream):
from choppa import Choppa

choppa = Choppa(
    artifact_dir="/Workspace/Users/you@company/artifacts",
    result_size_max=2**10
)

@choppa.remote
def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

@choppa.remote
def analyze_stuff(rows: list[Row]):
    return len(rows)

data = get_stuff()
result = analyze_stuff(data)
There are actually 2 decorators you can use if you want to be a bit more certain about what is returned as a reference:
- choppa.artifact will always cache results, returning an ArtifactRef object
- choppa.remote will opportunistically return your data but fall back to an ArtifactRef if the serialized value is larger than result_size_max. If you don't set result_size_max or set it to None then choppa.remote will always return your data
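The "return the data unless it's too big" rule is easy to picture with a toy version. The sketch below is an assumption about the shape of the decision, not Choppa's implementation: it uses pickle as the serializer (the source doesn't say which one Choppa uses), and FakeArtifactRef and value_or_ref are invented names. Nothing is written to disk; the path is only a label.

```python
import pickle
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FakeArtifactRef:
    """Hypothetical stand-in for ArtifactRef: where the data lives, not the data."""
    path: str
    size: int


def value_or_ref(value: Any, result_size_max: Optional[int], path: str = "artifact.pkl"):
    """Return the value itself unless its serialized form exceeds the limit."""
    payload = pickle.dumps(value)
    if result_size_max is not None and len(payload) > result_size_max:
        # Too big to send back: hand over a reference instead.
        return FakeArtifactRef(path=path, size=len(payload))
    # Small enough, or no limit set: return the data directly.
    return value


small = value_or_ref(3, result_size_max=2**10)
big = value_or_ref(list(range(10_000)), result_size_max=2**10)
unlimited = value_or_ref(list(range(10_000)), result_size_max=None)

print(type(small).__name__, type(big).__name__, type(unlimited).__name__)
```

With a limit of None the big list comes back as-is, which matches the "always return your data" behavior described for choppa.remote.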
Context Managers
There's not a ton of savings to be had but you can use a context manager to group remote calls together. This does not invalidate the stuff I said about variables not being in-scope. What you get is faster execution because the remote process is reused for multiple function calls. You could probably get cute and create globals inside remote functions and have them persist in memory without having to write to disk or be sent over the network. That's actually a pretty good idea. I'll think about it for version 2. Anyway, here's an example:
from choppa import Choppa

dutch = Choppa()

@dutch.remote
def some_math(a: int, b: int) -> int:
    return a + b

with dutch.session():
    x = [some_math(y, 1) for y in range(1_000)]
Async / Fire-and-Forget
And because my wife loves the idea of me turning off my laptop on occasion, maybe you just want to yeet a hard job at Databricks and walk away for a while. Easy peasy:
import time

from choppa import Choppa

dutch = Choppa()

@dutch.submit
def slow_job():
    # ... hours of processing ...
    return results

# Returns immediately
handle = slow_job()  # type: RemoteHandle

# Later...
ref = handle.wait()
data = dutch.dereference(ref)

# or another option
while not handle.done():
    time.sleep(5)  # poll every few seconds instead of spinning the CPU
ref = handle.get_pointer()  # type: ArtifactRef
data = dutch.dereference(ref)
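If you go the polling route, a small helper with a sleep and a timeout keeps your laptop fan quiet. The sketch below only assumes the handle has a done() method, as in the example above; wait_until_done and FakeHandle are invented for illustration and are not part of Choppa.

```python
import time
from typing import Optional


def wait_until_done(handle, poll_seconds: float = 5.0, timeout: Optional[float] = None) -> bool:
    """Poll handle.done() until it is True; return False if the timeout passes first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not handle.done():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


class FakeHandle:
    """Hypothetical stand-in for a remote handle: reports done after N polls."""

    def __init__(self, polls_until_done: int):
        self._remaining = polls_until_done

    def done(self) -> bool:
        self._remaining -= 1
        return self._remaining < 0


finished = wait_until_done(FakeHandle(2), poll_seconds=0.01)
gave_up = wait_until_done(FakeHandle(10**9), poll_seconds=0.01, timeout=0.05)
print(finished, gave_up)
```

Once it returns True you would fetch and dereference the result the same way as in the example above.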
Requirements
- Python 3.10+
- databricks-sdk >= 0.20.0
- Authenticated workspace (env vars, profile, or Azure CLI)
License
MIT
Hey, boss, I just made literally every researcher's job easier, made them more productive, made them happier. Every IC who works for you and a significant chunk of data science people across the BU. I'm just talking out loud here but maybe now I can get that promotion?
(huh? what are 'people skills'...)