Get to da cluster! Remote function execution for Databricks.

Choppa

Get to da cluster

Run Python in Databricks straight from your laptop

PyPI version Python 3.10+ License: MIT

Because Running Code Shouldn't Be Hard

So you want to run something in Databricks? Strap in, because they expect you to build jobs with their nifty homebrew orchestrator, deploy environments using better-than-Terraform bundles, develop in their hosted Monaco UI (which is waaay better than whatever VSCode has), and, oh. Remote development? Like from your laptop? Did we mention their hosted notebooks already? They come with AI and serverless.

You don't want to do any of that. You want to write some code and run it. Like a normal person.

Installation

pip install choppa

Configuration

Choppa needs to know what cluster to run stuff on. In order of precedence, Choppa will use the cluster:

  • set via the cluster_id parameter when you instantiate Choppa
  • whatever you put in the environment variable DATABRICKS_CLUSTER_ID
  • the value of cluster_id in ~/.databrickscfg
    • if the environment variable DATABRICKS_CONFIG_PROFILE is set, using that profile
    • otherwise using the DEFAULT profile
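If the precedence list reads better as code, here is a minimal sketch of that lookup order. It is not Choppa's actual implementation: resolve_cluster_id and its parameters are hypothetical names, and it assumes ~/.databrickscfg is an INI file that configparser can read.

```python
import configparser
import os
from pathlib import Path


def resolve_cluster_id(cluster_id=None, environ=None, config_path=None):
    """Return the first cluster ID found, checking sources in precedence order.

    Hypothetical helper for illustration only; not part of Choppa's API.
    """
    environ = os.environ if environ is None else environ
    config_path = Path(config_path or Path.home() / ".databrickscfg")

    # 1. An explicit argument always wins.
    if cluster_id:
        return cluster_id

    # 2. Then the environment variable.
    if environ.get("DATABRICKS_CLUSTER_ID"):
        return environ["DATABRICKS_CLUSTER_ID"]

    # 3. Finally the config file, using the selected profile or DEFAULT.
    profile = environ.get("DATABRICKS_CONFIG_PROFILE") or "DEFAULT"
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is silently skipped
    if profile == "DEFAULT":
        return parser.defaults().get("cluster_id")
    if parser.has_section(profile):
        # Note: configparser lets named sections inherit DEFAULT values.
        return parser.get(profile, "cluster_id", fallback=None)
    return None
```

Calling resolve_cluster_id() with no arguments reads the real environment and config file; passing a plain dict as environ makes the order easy to poke at in a test.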

Usage

from choppa import Choppa

dutch = Choppa()

@dutch.remote
def add(a: int, b: int) -> int:
    return a + b

add(1, 2)  # 3

Donezo. You can probably stop reading now because that covers 99% of the frustration of Databricks development with just a freaking decorator.

Advanced Usage

Scope

Choppa only instantiates remote environments for contexts that are possible to scope without having to inspect frames or mess with function ASTs. Or, put another way: Only functions and arguments are in-scope.

from choppa import Choppa

EXPONENT = 10

dutch = Choppa()

# This version works but is pretty boring
@dutch.remote
def an_option(a: int, exponent: int) -> int:
    return a ** exponent

# This one uses ONE WEIRD TRICK to always produce the exact same result!
# (EXPONENT is a module-level global, so it is not in scope on the remote side)
@dutch.remote
def another_option(a: int) -> int:
    return a ** EXPONENT
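To see why the second version is a trap, here is a local-only sketch that imitates a fresh remote process by rebuilding each function with an empty global namespace. This is an assumption about the failure mode, not a description of how Choppa ships code; run_isolated is a hypothetical helper.

```python
import builtins
import types

EXPONENT = 10


def with_argument(a: int, exponent: int) -> int:
    return a ** exponent


def with_global(a: int) -> int:
    return a ** EXPONENT  # depends on a module-level name


def run_isolated(func, *args):
    """Call func with an empty global namespace, like a fresh process would."""
    clean_globals = {"__builtins__": builtins}
    isolated = types.FunctionType(func.__code__, clean_globals, func.__name__)
    return isolated(*args)


print(run_isolated(with_argument, 2, 10))  # 1024

try:
    run_isolated(with_global, 2)
except NameError as err:
    # The global was never shipped along with the function body.
    print(f"remote-style failure: {err}")
```

Everything the function needs has to arrive as an argument (or be defined inside the function), which is exactly what the first version does.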

Caching

Consider this straightforward workflow:

from databricks.connect import DatabricksSession
from pyspark.sql import Row

spark = DatabricksSession.builder.getOrCreate()

def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

def analyze_stuff(rows: list[Row]):
    return len(rows)

data = get_stuff()
result = analyze_stuff(data)

I bet you're having fun downloading those billion rows from Databricks! Haven't even gotten to your analysis yet and you're already wishing computers came with hardware...

You'd maybe get to analyze_stuff() sooner if you could cache the result on Databricks and only pass a reference over the network:

from choppa import Choppa
from pyspark.sql import Row

choppa = Choppa(
    artifact_dir="/Workspace/Users/you@company/artifacts",
    result_size_max=2**10
)

@choppa.remote
def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

def analyze_stuff(rows: list[Row]):
    return len(rows)


ref = get_stuff() # type: ArtifactRef

Since a literal billion rows will blow through 1 KiB (2**10 bytes), the result isn't returned. But then you have this ArtifactRef thing and need analyze_stuff() to use your actual data. You could always materialize the artifact and run your analysis locally:

data = ref.dereference()  # hahaha, that's right: it's C all over again. sucker!
result = analyze_stuff(data)

Yeah, it's a cute trick but doesn't have a lot of value since you still need to download data eventually. Hmm... I know! You could let Choppa automagically deal with ArtifactRefs behind the scenes (it does), run everything on Databricks (you should), and just run your code (the freakin' dream):

from choppa import Choppa
from pyspark.sql import Row

choppa = Choppa(
    artifact_dir="/Workspace/Users/you@company/artifacts",
    result_size_max=2**10
)

@choppa.remote
def get_stuff() -> list[Row]:
    return spark.table("huge_table").limit(1_000_000_000).collect()

@choppa.remote
def analyze_stuff(rows: list[Row]):
    return len(rows)

data = get_stuff()
result = analyze_stuff(data)

There are actually two decorators you can use if you want to be a bit more certain about what is returned as a reference:

  • choppa.artifact will always cache results, returning an ArtifactRef object
  • choppa.remote will opportunistically return your data but fall back to an ArtifactRef if the serialized value is larger than result_size_max. If you don't set result_size_max, or set it to None, then choppa.remote will always return your data
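The difference between the two decorators boils down to one size check. Here is a rough sketch of that decision, assuming results are serialized with pickle and cached as files. LocalArtifactRef, package_result, and always_cache are hypothetical stand-ins, not Choppa's real classes or parameters.

```python
import pickle
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LocalArtifactRef:
    """Hypothetical stand-in for ArtifactRef: a pointer to a pickled value."""
    path: Path

    def dereference(self):
        return pickle.loads(self.path.read_bytes())


def package_result(value, artifact_dir, result_size_max=None, always_cache=False):
    """Return the value itself, or a reference to it if it has to be cached."""
    payload = pickle.dumps(value)
    too_big = result_size_max is not None and len(payload) > result_size_max
    if not (always_cache or too_big):
        return value  # small enough: send the data straight back
    path = Path(artifact_dir) / f"{uuid.uuid4().hex}.pkl"
    path.write_bytes(payload)
    return LocalArtifactRef(path)


with tempfile.TemporaryDirectory() as artifact_dir:
    small = package_result([1, 2, 3], artifact_dir, result_size_max=2**10)
    big = package_result(list(range(10_000)), artifact_dir, result_size_max=2**10)
    print(type(small).__name__, type(big).__name__)
```

In this sketch, always_cache=True plays the role of choppa.artifact, and leaving result_size_max as None mirrors choppa.remote always returning your data.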

Context Managers

There's not a ton of savings to be had but you can use a context manager to group remote calls together. This does not invalidate the stuff I said about variables not being in-scope. What you get is faster execution because the remote process is reused for multiple function calls. You could probably get cute and create globals inside remote functions and have them persist in memory without having to write to disk or be sent over the network. That's actually a pretty good idea. I'll think about it for version 2. Anyway, here's an example:

from choppa import Choppa

dutch = Choppa()

@dutch.remote
def some_math(a: int, b: int) -> int:
    return a + b

with dutch.session():
    x = [some_math(y, 1) for y in range(1_000)]
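Where the savings come from is easier to see with a toy model. The sketch below counts how many remote contexts get started with and without a session. FakeRemote is entirely hypothetical and says nothing about how Choppa manages execution contexts internally.

```python
from contextlib import contextmanager


class FakeRemote:
    """Hypothetical stand-in that counts how many remote contexts are started."""

    def __init__(self):
        self.contexts_created = 0
        self._active = False

    @contextmanager
    def session(self):
        # Pay the startup cost once and keep the context open for the block.
        self.contexts_created += 1
        self._active = True
        try:
            yield self
        finally:
            self._active = False

    def call(self, func, *args):
        if not self._active:
            # Outside a session, every call starts its own context.
            self.contexts_created += 1
        return func(*args)


one_by_one = FakeRemote()
for y in range(5):
    one_by_one.call(lambda a, b: a + b, y, 1)

grouped = FakeRemote()
with grouped.session():
    for y in range(5):
        grouped.call(lambda a, b: a + b, y, 1)

print(one_by_one.contexts_created, grouped.contexts_created)  # 5 1
```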

Async / Fire-and-Forget

And because my wife loves the idea of me turning off my laptop on occasion, maybe you just want to yeet a hard job at Databricks and walk away for a while. Easy peasy:

import time

from choppa import Choppa

dutch = Choppa()

@dutch.submit
def slow_job():
    results = ...  # hours of processing
    return results

# Returns immediately
handle = slow_job() # type: RemoteHandle

# Later...
ref = handle.wait()
data = dutch.dereference(ref)

# or another option: poll, but sleep between checks instead of spinning the CPU
while not handle.done():
    time.sleep(5)
ref = handle.get_pointer() # type: ArtifactRef
data = dutch.dereference(ref)
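If you go the polling route, you probably want a timeout and some backoff so your laptop isn't hammering the API for hours. A minimal sketch, assuming only that the handle exposes done() as shown above; wait_until_done and its parameters are hypothetical and not part of Choppa.

```python
import time


def wait_until_done(handle, timeout=3600.0, interval=1.0, max_interval=60.0):
    """Poll handle.done() with exponential backoff; raise TimeoutError on expiry.

    Hypothetical helper: works with any object that has a done() method.
    """
    deadline = time.monotonic() + timeout
    while not handle.done():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"job still running after {timeout} seconds")
        # Never sleep past the deadline, and double the wait up to a cap.
        time.sleep(min(interval, remaining))
        interval = min(interval * 2, max_interval)
    return handle
```

Once it returns, fetch the reference from the handle the same way as in the example above.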

Requirements

  • Python 3.10+
  • databricks-sdk >= 0.20.0
  • Authenticated workspace (env vars, profile, or Azure CLI)

License

MIT


Hey, boss, I just made literally every researcher's job easier, made them more productive, made them happier. Every IC who works for you and a significant chunk of data science people across the BU. I'm just talking out loud here but maybe now I can get that promotion?

(huh? what are 'people skills'...)
