`fsspec`-based file system interface for Databricks file systems
fsspec-databricks provides an fsspec-compliant, unified file system interface for accessing the following file systems in Databricks:
- Unity Catalog Volumes
- Workspace files
- Legacy DBFS (Databricks File System)
- Note: DBFS is deprecated and no longer recommended by Databricks.
Features
- Provides seamless access to files in different Databricks file systems via DBFS URLs (`dbfs:/path/to/file`) or POSIX paths (`/path/to/file`).
- Automatically routes file operations to the appropriate file system based on file path patterns.
- Implements file operations across different file systems, for example copying a file from Workspace files to a Unity Catalog Volume or vice versa (see the sketch after this list).
- Uses local file system access when running within a Databricks workspace.
- Built on the Databricks Python SDK.
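For instance, a cross-file-system copy could look like the following minimal sketch (the paths are hypothetical; `copy` is the standard fsspec method, and `DatabricksFileSystem` is introduced in Getting started below):

```python
from fsspec_databricks import DatabricksFileSystem

fs = DatabricksFileSystem()

# Copy a workspace file into a Unity Catalog Volume; each side is routed
# to its file system by the path pattern. (Hypothetical paths.)
fs.copy(
    "dbfs:/Workspace/Users/user-a/report.csv",
    "dbfs:/Volumes/my_catalog/my_schema/my_volume/reports/report.csv",
)
```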
Getting started
You can install fsspec-databricks from PyPI:

```shell
pip install fsspec-databricks
```
Then you can directly instantiate `DatabricksFileSystem` from the `fsspec_databricks` module.

```python
from fsspec_databricks import DatabricksFileSystem

fs = DatabricksFileSystem()
```
Or, you can register `DatabricksFileSystem` as the default file system implementation for the `dbfs:/` URL scheme by calling `fsspec_databricks.use()`.

```python
import fsspec
import fsspec_databricks

fsspec_databricks.use()

fs = fsspec.filesystem("dbfs")  # DatabricksFileSystem
```
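After registration, generic fsspec entry points such as `fsspec.open()` also resolve `dbfs:/` URLs through `DatabricksFileSystem`. A minimal sketch with a hypothetical path:

```python
import fsspec
import fsspec_databricks

fsspec_databricks.use()

# fsspec.open() now routes the dbfs:/ scheme to DatabricksFileSystem.
with fsspec.open("dbfs:/Volumes/my_catalog/my_schema/my_volume/data.txt", "r") as f:
    print(f.read())
```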
Supported file paths
fsspec-databricks supports file paths with the `dbfs:/` scheme.
It uses the same path patterns as Databricks workspaces to map file paths to supported file systems.
| URL pattern | Mapped file system |
|---|---|
| `dbfs:/Volumes/(catalog)/(schema)/(volume)/path/to/file` | Unity Catalog Volume file system |
| `dbfs:/Workspace/path/to/file` | Databricks Workspace file system |
| `dbfs:/...` (other than above) | Legacy DBFS (deprecated) |
Examples:

```python
fs.ls("dbfs:/Volumes/my_catalog/my_schema/my_volume/path")  # Access Unity Catalog Volume files
fs.ls("dbfs:/Workspace/Users/user-a/path")  # Access workspace files
fs.ls("dbfs:/data/path")  # Access legacy DBFS files
```
fsspec-databricks also supports stripped, POSIX-style paths without the `dbfs:/` scheme.
| Path pattern | Mapped file system |
|---|---|
| `/Volumes/(catalog)/(schema)/(volume)/path/to/file` | Unity Catalog Volume file system |
| `/Workspace/path/to/file` | Databricks Workspace file system (only in DBFS-disabled workspaces) |
| `/...` (other than above) | Legacy DBFS (deprecated) |
Examples:

```python
fs.ls("/Volumes/my_catalog/my_schema/my_volume/path")  # Access Unity Catalog Volume files
fs.ls("/Workspace/Users/user-a/path")  # Access workspace files (only in DBFS-disabled workspaces)
fs.ls("/data/path")  # Access legacy DBFS files
```
For more details about `dbfs:/` and POSIX path support in Databricks, see the official documentation.
Authentication
fsspec-databricks uses Databricks Unified Authentication as provided by the Databricks Python SDK. You can find information about the supported authentication parameters and environment variables in the Databricks Python SDK documentation.
Default authentication
If Databricks Unified Authentication is configured in the runtime environment, fsspec-databricks automatically picks up credentials from the default profile and authenticates with Databricks.

```python
from fsspec_databricks.spec import DatabricksFileSystem

fs = DatabricksFileSystem()
with fs.open("dbfs:/Volumes/...") as f:
    ...
```
By constructor parameters
You can programmatically configure authentication by passing parameters to the `DatabricksFileSystem` constructor.

```python
# Authenticate with a personal access token (PAT)
fs = DatabricksFileSystem(host=host_url, token=access_token)

# Use a different profile
fs = DatabricksFileSystem(profile="production")
```
By environment variables
Or, you can configure authentication via environment variables.

```shell
export DATABRICKS_CONFIG_PROFILE=production
```

```python
fs = DatabricksFileSystem()  # will use the "production" profile
```
By fsspec configuration
You can use fsspec's configuration system to configure and persist authentication parameters.
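For example, fsspec applies per-protocol default constructor arguments from its global configuration. The sketch below sets them in process for the `dbfs` protocol (the profile name is hypothetical, and it assumes `fsspec_databricks.use()` has registered the scheme):

```python
import fsspec
import fsspec_databricks

fsspec_databricks.use()

# Default constructor kwargs for the "dbfs" protocol.
fsspec.config.conf["dbfs"] = {"profile": "production"}

fs = fsspec.filesystem("dbfs")  # constructed with profile="production"
```

To persist such settings, fsspec also loads `*.json` and `*.ini` files from the directory named by the `FSSPEC_CONFIG_DIR` environment variable (default `~/.config/fsspec`) into the same configuration.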
With WorkspaceClient
You can create a `DatabricksFileSystem` by explicitly passing a Databricks SDK `WorkspaceClient` object. The created `DatabricksFileSystem` instance will use the authentication configured in the provided `WorkspaceClient`.

```python
from databricks.sdk import WorkspaceClient

client = WorkspaceClient(...)
...
fs = DatabricksFileSystem(client=client)
```
Note that because a `WorkspaceClient` object is not serializable, a `DatabricksFileSystem` created this way is not serializable either. If you need to serialize file system objects (for example, with Dask), configure authentication through one of the other methods instead.
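As a sketch of the distinction (assuming a configured `production` profile exists), a profile-based instance round-trips through pickle because only its constructor parameters need to be serialized:

```python
import pickle

from fsspec_databricks import DatabricksFileSystem

# Serializable: authentication is re-resolved from the profile on load.
fs = DatabricksFileSystem(profile="production")
restored = pickle.loads(pickle.dumps(fs))
```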
Configuration options
In addition to the authentication parameters, fsspec-databricks supports the following configuration options.
Options for general file system behavior
| Parameter name | Description | Default |
|---|---|---|
| `config` | An optional pre-configured Databricks SDK `Config` object. If provided, it is used for authentication. | `None` |
| `client` | An optional pre-configured Databricks SDK `WorkspaceClient` object. If provided, it is used for accessing the Databricks Workspace API. | `None` |
| `use_local_fs_in_workspace` | Access files through the local file system rather than the remote Databricks API when running within a Databricks workspace. | `True` |
| `verbose_debug_log` | Whether to enable verbose debug logging for file system operations. | `False` |
Options for Unity Catalog Volume file system
| Parameter name | Description | Default |
|---|---|---|
| `volume_fs_max_read_concurrency` | The maximum number of concurrent read operations on a Unity Catalog Volume file. | 10 |
| `volume_fs_min_read_block_size` | The minimum data size to read per read operation on a Unity Catalog Volume file. | 512 * 1024 (512 KiB) |
| `volume_fs_max_read_block_size` | The maximum data size to read per read operation on a Unity Catalog Volume file. | 8 * 1024 * 1024 (8 MiB) |
| `volume_fs_max_write_concurrency` | The maximum number of concurrent write operations on a Unity Catalog Volume file. | 10 |
| `volume_fs_min_write_block_size` | The minimum data size to write per write operation on a Unity Catalog Volume file. | 5 * 1024 * 1024 (5 MiB) |
| `volume_fs_max_write_block_size` | The maximum data size to write per write operation on a Unity Catalog Volume file. | 32 * 1024 * 1024 (32 MiB) |
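These options are plain constructor keyword arguments, for example (the values here are illustrative, not recommendations):

```python
from fsspec_databricks import DatabricksFileSystem

# Illustrative tuning values only; the defaults above are usually fine.
fs = DatabricksFileSystem(
    volume_fs_max_read_concurrency=4,
    volume_fs_max_read_block_size=16 * 1024 * 1024,  # 16 MiB
    volume_fs_max_write_concurrency=4,
)
```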
Differences from the original DatabricksFileSystem in fsspec
fsspec provides its own implementation of `DatabricksFileSystem` (`fsspec.implementations.dbfs.DatabricksFileSystem`).
The main difference between the `DatabricksFileSystem` in fsspec-databricks and the original one in fsspec is that the original targets legacy DBFS (Databricks File System), which Databricks has already deprecated.
Databricks currently supports workspace files and Unity Catalog Volumes in addition to the legacy DBFS, and it continues to use the `dbfs:/` URL scheme for both legacy DBFS and the newer file systems (documentation).
fsspec-databricks primarily aims to support the newer file systems (workspace files and Unity Catalog Volumes) and to enable seamless access to them using the same `dbfs:/` URL scheme supported in Databricks workspaces.
Project status
The current status of this library is early beta. Its API and behavior are subject to change because the following underlying components have not yet reached a stable release:
- Databricks Python SDK (beta)
- Unity Catalog Files REST API (beta)
- Multipart upload API for Unity Catalog Volume file write (undocumented)
In addition, the following features are not yet implemented or not yet well tested:
- Resumable file upload for Unity Catalog Volume files (used in Databricks on GCP)
- Legacy DBFS support (deprecated by Databricks and not recommended for use)
We are actively developing and testing the library, and we welcome contributions and feedback from the community.
Development
Some tests in this library require access to an actual Databricks workspace to verify file system operations in a real Databricks environment. You need to configure access to a Databricks workspace and create work directories within it before running the tests.
Work directories in Databricks workspace
Create work directories in your Databricks workspace and Unity Catalog for the tests, and set the POSIX paths (not DBFS URLs) of those directories in the following environment variables.
| Location | Environment variable name | Default |
|---|---|---|
| Unity Catalog Volume | `FSSPEC_DATABRICKS_VOLUME_TEST_ROOT` | `/Volumes/fsspec_test_catalog/fsspec_test_schema/test` |
| Workspace files | `FSSPEC_DATABRICKS_WORKSPACE_TEST_ROOT` | `/fsspec-databricks-test` |
Local development
Configure Databricks Unified Authentication locally, and set the environment variables `FSSPEC_DATABRICKS_VOLUME_TEST_ROOT` and `FSSPEC_DATABRICKS_WORKSPACE_TEST_ROOT` to specify the locations of the work directories to use. You can put the authentication parameters and the environment variables above in a `.env` file in the project root directory.
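For example, a `.env` file could look like this (illustrative values; the paths must point to directories you created, shown here with the defaults from the table above):

```shell
DATABRICKS_CONFIG_PROFILE=production
FSSPEC_DATABRICKS_VOLUME_TEST_ROOT=/Volumes/fsspec_test_catalog/fsspec_test_schema/test
FSSPEC_DATABRICKS_WORKSPACE_TEST_ROOT=/fsspec-databricks-test
```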
GitHub Actions
You need a Databricks service principal that has read-write access to the work directories.
Set the following GitHub Actions secrets and variables in the repository settings.
| Secret name | Description |
|---|---|
| `DATABRICKS_HOST` | The URL of the Databricks workspace |
| `DATABRICKS_CLIENT_ID` | The client ID of the Databricks service principal to use for testing |
| `DATABRICKS_CLIENT_SECRET` | The client secret of the Databricks service principal to use for testing |
| `CODECOV_TOKEN` | The repository upload token for Codecov |
| Variable name | Description |
|---|---|
| `FSSPEC_DATABRICKS_VOLUME_TEST_ROOT` | The POSIX path of the work directory in a Unity Catalog Volume to use for testing |
| `FSSPEC_DATABRICKS_WORKSPACE_TEST_ROOT` | The POSIX path of the work directory in Databricks Workspace files to use for testing |
License
Apache License 2.0. See LICENSE for more details.