
Basic CLI tool that can sync a local Python package directory with a folder on dbfs


dbfs-package-sync

Scenario: You are developing a Python package that you need in a couple of notebooks. You (of course) want to develop your package in an IDE. You can write unit tests and perhaps schedule some notebooks to run integration tests; however, during development you also want to prototype and run your in-development code on a cluster.

There is the extended CLI tool for Databricks (dbx) that has syncing functionality, but it requires your code to live as a Repo inside Databricks. There is also a plugin for Visual Studio Code, but maybe you don't want to use VS Code, and there too your repo needs to be linked to Databricks.

So what is the problem with linking your repo to Databricks? It's convenient, right? If you place a test notebook at the same level as your package folder, you can import the package without activating autoreload or setting the system path. That is a great feature that gives me hope for Databricks in the future, but sometimes your Python package is part of a larger repo that you do not necessarily want in Databricks.

Also, Databricks still does not support Poetry, so if you want to install the dependencies of your package on the cluster, you first need to install Poetry, create a requirements file, and install the dependencies using %pip.

I am basically listing a few excuses to justify making this package for myself. I just want to sync my package to DBFS, no more, no less. Why all the fuss?

So, that's what this CLI does. When I need to test the current state of my package in some Databricks notebook, I run dbfsps and I have my local state on DBFS.

Caveats:

  • Only works for poetry packages
  • At the top of the notebook you will of course need to install the dependencies once and enable autoreload. The dbfsps command uploads a requirements file and creates a helper notebook though, so you only need a single %run command at the top of your notebook (see the Example section below).
  • I opted for running a single command every time you need to sync your code, instead of a continuous syncing process running in the background. To keep track of which files need to be uploaded or removed, dbfsps creates a hidden text file, .dbfsps_file_status.

Example

If I want to sync the package of this repo, I would run the following from the root:

poetry run dbfsps \
  --profile some-profile-dev \
  --package-location dbfsps \
  --remote-path dbfs:/FileStore/jmeidam/packages \
  dbfs-package-sync

This would create a remote folder at dbfs:/FileStore/jmeidam/packages/dbfs_package_sync. That folder would contain a requirements.txt file and a folder named "dbfsps" with the contents of the corresponding directory in this repo.
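Roughly, the resulting layout on DBFS would look like this (listing abridged for illustration):

dbfs:/FileStore/jmeidam/packages/dbfs_package_sync/
    requirements.txt
    dbfsps/
        __init__.py
        ...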

I can then upload the init_dbfs_package_sync.py notebook and run it at the top of any other notebook that uses imports from the dbfsps package:

%run init_dbfs_package_sync
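For context, here is a rough sketch of the kind of setup such an init notebook takes care of, written as plain Python. The paths come from the example above; the actual contents of the generated notebook may differ.

# Hypothetical sketch of what an init notebook does; the generated file may differ.
# In a separate notebook cell, the synced requirements are installed once:
#   %pip install -r /dbfs/FileStore/jmeidam/packages/dbfs_package_sync/requirements.txt
import sys

# Make the synced package importable (the dbfsps folder lives inside this directory)
sys.path.insert(0, "/dbfs/FileStore/jmeidam/packages/dbfs_package_sync")

# Enable autoreload so a fresh dbfsps sync is picked up without restarting the kernel
from IPython import get_ipython

ipython = get_ipython()
ipython.run_line_magic("load_ext", "autoreload")
ipython.run_line_magic("autoreload", "2")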

Databricks CLI

dbfsps makes use of the Databricks Command Line Interface (databricks-cli). To be able to sync your package with DBFS, you will need to set up a databricks-cli profile.

If you haven't done so already, generate a token on Databricks (it can be found under User Settings). Store the token somewhere and run:

databricks configure --profile some-profile-dev --token

This will prompt for the host, e.g. https://adb-8765432123456789.12.azuredatabricks.net, and for that token.
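The profile ends up in ~/.databrickscfg and looks roughly like this (host and token shown here are placeholders):

[some-profile-dev]
host = https://adb-8765432123456789.12.azuredatabricks.net
token = <your-token>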

You will need to pass the profile name to the --profile option of dbfsps.

More detailed documentation on the CLI can be found here:

https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html#databricks-cli

Notes

The following three files are generated by the dbfsps command; you may want to add them to your .gitignore file:

  • .dbfsps_file_status
  • requirements.txt
  • init_<package_name>.py
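For this repository, for example, the corresponding .gitignore entries would be:

.dbfsps_file_status
requirements.txt
init_dbfs_package_sync.py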
