dbfs-package-sync

Basic CLI tool that syncs a local Python package directory with a folder on DBFS.
Scenario: You are developing a Python package that you need in a couple of notebooks. You (of course) want to develop your package in an IDE. You can write unit tests and perhaps schedule some notebooks to run integration tests; during development, however, you also want to prototype and run your in-development code on a cluster.
There is the extended CLI tool for Databricks (dbx) that has syncing functionality, but it requires your code to live as a Repo inside Databricks. There is also a plugin for Visual Studio Code, but maybe you don't want to use VS Code, and there too your repo needs to be linked to Databricks.
So what is the problem with linking your repo to Databricks? It's convenient, right? If you place a test notebook at the same level as your package folder, you can import the package without needing to activate `autoreload` or set the system path.
That is a great feature that gives me hope for Databricks in the future, but sometimes your Python package is part of a larger repo that you do not necessarily want in Databricks.
Also, Databricks still does not support Poetry, so if you want to install your package's dependencies on the cluster, you first need to install Poetry, create a requirements file, and install the dependencies using `%pip`.
I am basically naming a few excuses to justify making this package for myself. I just want to sync my package to DBFS, no more, no less. Why all the fuss?
So, that's what this CLI does. When I need to test the current state of my package in some Databricks notebook, I run `dbfsps` and I have my local state on DBFS.
Caveats:
- Only works for Poetry packages.
- At the top of the notebook you will of course need to install the dependencies once and set `autoreload`. The `dbfsps` command will send along a requirements file and create a helper notebook though, so you only need a single `%run` command at the top of your notebook.
- I opted for running a single command every time you need to sync your code instead of a continuous syncing process running in the background. To keep track of which files need to be uploaded/removed, `dbfsps` creates a hidden text file `.dbfsps_file_status`.
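The state-tracking caveat above can be illustrated with a short sketch: record a content hash per file, then diff the current snapshot against the previously stored one to decide what to upload or remove. This is only an illustrative sketch, not the actual dbfsps implementation; the function names `snapshot` and `diff` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(package_dir: str) -> dict[str, str]:
    """Map each file's path (relative to the package root) to an MD5 of its contents."""
    root = Path(package_dir)
    return {
        str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def diff(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two snapshots and return (files to upload, files to remove).

    A file is uploaded when it is new or its hash changed; it is removed
    when it existed in the old snapshot but no longer exists locally.
    """
    upload = [f for f, h in new.items() if old.get(f) != h]
    remove = [f for f in old if f not in new]
    return upload, remove
```

A syncing tool could persist the latest snapshot to a hidden status file after each run and load it as `old` on the next invocation.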
Example

If I want to sync the package of this repo, I would run the following from the repository root:

```shell
poetry run dbfsps \
  --profile some-profile-dev \
  --package-location dbfsps \
  --remote-path dbfs:/FileStore/jmeidam/packages
```
This would create a remote folder at `dbfs:/FileStore/jmeidam/packages/dbfs_package_sync`. This folder would contain a `requirements.txt` file and a folder named `dbfsps` with the contents of that directory in this repo. I can then upload the `init_dbfs_package_sync.py` notebook and run it at the top of any other notebook that imports from the `dbfsps` package:

```
%run init_dbfs_package_sync
```
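The contents of the generated helper notebook are not shown here, but a minimal sketch of what such a notebook plausibly does, assuming the remote path from the example above, could look like this (the actual generated notebook may differ; notebook magics are shown as comments):

```python
# Hypothetical sketch of a helper notebook like init_dbfs_package_sync.py.
# In a real Databricks notebook these magics would run as their own cells:
#
# %pip install -r /dbfs/FileStore/jmeidam/packages/dbfs_package_sync/requirements.txt
# %load_ext autoreload
# %autoreload 2

import sys

# Make the synced package importable from the notebook.
PACKAGE_ROOT = "/dbfs/FileStore/jmeidam/packages/dbfs_package_sync"
if PACKAGE_ROOT not in sys.path:
    sys.path.append(PACKAGE_ROOT)
```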
Databricks CLI

`dbfsps` makes use of the Databricks Command Line Interface. To be able to sync your package with DBFS, you will need to set up a databricks-cli profile. If you haven't done so already, generate a token on Databricks (it can be found under user settings), store it somewhere, and run:

```shell
databricks configure --profile some-profile-dev --token
```

This will prompt for that token and for the host, e.g. `https://adb-8765432123456789.12.azuredatabricks.net`. You will need the profile name in the `--profile` option of `dbfsps`.

More detailed documentation on the CLI can be found here:
https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html#databricks-cli
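For reference, the configure command stores the profile in `~/.databrickscfg`; a configured profile looks roughly like this (the host is the example value from above, and the token value is a placeholder):

```
[some-profile-dev]
host = https://adb-8765432123456789.12.azuredatabricks.net
token = <your-token>
```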
Notes

The following three files are generated by the `dbfsps` command; you may want to add them to your `.gitignore` file:

- `.dbfsps_file_status`
- `requirements.txt`
- `init_<package_name>.py`
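For example, with this repo's package name the corresponding `.gitignore` entries would be:

```
# Generated by dbfsps
.dbfsps_file_status
requirements.txt
init_dbfs_package_sync.py
```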