
`fsspec`-based file system interface for Databricks file systems

Project description

fsspec-databricks


File system interface for Databricks file systems.

fsspec-databricks provides an fsspec-compliant file system implementation that unifies access to Databricks file systems, including:

  • Unity Catalog Volumes
  • Workspace files
  • Legacy DBFS (deprecated)

The library routes dbfs:/ and POSIX-style paths to the appropriate Databricks file system implementation and supports copying and streaming between them.

Features

  • Provides seamless access to files in different Databricks file systems through DBFS URLs (dbfs:/path/to/file) or POSIX paths (/path/to/file).
    • Automatically routes file operations to the appropriate file system based on path patterns.
    • Supports file operations across different file systems, for example copying a file from a Workspace to a Unity Catalog Volume or vice versa.
  • Falls back to the local file system when running inside a Databricks workspace.
  • Built on the Databricks Python SDK.
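The local-file-system fallback above requires detecting whether the code is running inside a Databricks workspace. One common idiom is to check an environment variable that Databricks runtimes set; the sketch below uses `DATABRICKS_RUNTIME_VERSION` for illustration, which may differ from the library's actual detection logic.

```python
import os

def running_in_databricks() -> bool:
    """Heuristic check for a Databricks runtime.

    Databricks clusters set DATABRICKS_RUNTIME_VERSION; this is a common
    detection idiom, not necessarily the one fsspec-databricks uses.
    """
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
```

On a laptop this returns False and remote API access is used; on a Databricks cluster it returns True, so reads and writes can go through the locally mounted paths instead.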

Compatibility

  • Python 3.10 to 3.14
  • databricks-sdk: 0.99.0 or later
  • fsspec: 2024.6.0 or later
  • aiohttp: 3.12.0 or later
  • Databricks workspace: currently tested on the following environments:
    • Databricks Free Edition
    • Azure Databricks
    • Databricks on Google Cloud

Project status

This library is in early beta. Its API and behavior may change during further development and testing.

  • The current version relies on an undocumented multipart upload API for Unity Catalog Volume file writes, which Databricks does not officially support and may change without notice.
  • For more details about the current limitations, see the Limitations section below.

Getting started

Installation

You can install fsspec-databricks from PyPI.

# with pip
pip install fsspec-databricks
# with uv
uv add fsspec-databricks

Usage

You can directly instantiate DatabricksFileSystem from the fsspec_databricks module.

from fsspec_databricks import DatabricksFileSystem

fs = DatabricksFileSystem()

Alternatively, you can register DatabricksFileSystem as the default file system implementation for the dbfs:/ URL scheme by calling fsspec_databricks.use().

import fsspec
import fsspec_databricks

fsspec_databricks.use()

fs = fsspec.filesystem("dbfs")  # DatabricksFileSystem

For more details on how to use the fsspec file system objects, see fsspec's documentation.

Supported file paths

fsspec-databricks supports file paths with dbfs:/ scheme.

It uses the same path patterns as Databricks to map dbfs:/ and POSIX paths to the appropriate file system implementation.

| URL pattern | Mapped file system |
| --- | --- |
| dbfs:/Volumes/(catalog)/(schema)/(volume)/path/to/file | Unity Catalog Volume file system |
| dbfs:/Workspace/path/to/file | Databricks Workspace file system |
| dbfs:/... (other than above) | Legacy DBFS (deprecated) |

Examples:

fs.ls("dbfs:/Volumes/my_catalog/my_schema/my_volume/path")  # Access Unity Catalog Volume files
fs.ls("dbfs:/Workspace/Users/user-a/path")  # Access workspace files
fs.ls("dbfs:/data/path")  # Access legacy DBFS files

fsspec-databricks also supports stripped, POSIX-style paths without the dbfs:/ scheme.

| Path pattern | Mapped file system |
| --- | --- |
| /Volumes/(catalog)/(schema)/(volume)/path/to/file | Unity Catalog Volume file system |
| /Workspace/path/to/file | Databricks Workspace file system (only in DBFS-disabled workspaces) |
| /... (other than above) | Legacy DBFS (deprecated) |

Examples:

fs.ls("/Volumes/my_catalog/my_schema/my_volume/path")  # Access Unity Catalog Volume files
fs.ls("/Workspace/Users/user-a/path")  # Access workspace files (only in DBFS-disabled workspace)
fs.ls("/data/path")  # Access legacy DBFS files

For more details about dbfs:/ and POSIX path support in Databricks, see the official documentation.
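The routing rules in the two tables above can be sketched as a small dispatcher. The function below only illustrates the documented path patterns; it is not the library's actual implementation, the returned labels are invented for this example, and the DBFS-disabled-workspace nuance for POSIX /Workspace paths is ignored.

```python
def route_path(path: str) -> str:
    """Map a dbfs:/ URL or POSIX-style path to a file system label.

    Follows the documented patterns: /Volumes/... -> Unity Catalog Volume,
    /Workspace/... -> workspace files, anything else -> legacy DBFS.
    """
    # Strip the optional dbfs:/ scheme so both forms share one rule set.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    if path.startswith("/Volumes/"):
        return "unity-catalog-volume"
    if path.startswith("/Workspace/"):
        return "workspace"
    return "legacy-dbfs"
```

For example, `route_path("dbfs:/Volumes/my_catalog/my_schema/my_volume/f")` and `route_path("/Volumes/my_catalog/my_schema/my_volume/f")` land on the same file system label, which is the point of the unified path handling.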

Authentication

fsspec-databricks uses Databricks Unified Authentication, provided by the Databricks Python SDK.

You can find information about supported authentication parameters and environment variables in the Databricks Python SDK documentation.

Default authentication

If Databricks Unified Authentication is configured, fsspec-databricks picks up credentials from the default profile. For details, see the Databricks SDK documentation linked above.

from fsspec_databricks import DatabricksFileSystem

fs = DatabricksFileSystem()

with fs.open("dbfs:/Volumes/...") as f:
    ...

Via constructor parameters

You can programmatically configure authentication by passing parameters to the DatabricksFileSystem constructor.

# Authentication with PAT
fs = DatabricksFileSystem(host=host_url, token=access_token)

# Use different profile
fs = DatabricksFileSystem(profile="production")

Via environment variables

Or, you can configure authentication via environment variables.

# Shell
export DATABRICKS_CONFIG_PROFILE=production
# Then in Python
fs = DatabricksFileSystem()  # will use the "production" profile

By fsspec configuration

You can use fsspec's configuration system to configure and persist authentication parameters.
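fsspec reads per-protocol defaults from JSON (or ini) files in its configuration directory, `~/.config/fsspec/` by default (overridable via the `FSSPEC_CONFIG_DIR` environment variable). Assuming the `dbfs` protocol is registered to fsspec-databricks, a file such as `~/.config/fsspec/dbfs.json` could persist constructor parameters; treat the exact keys below as an illustration based on the options documented on this page.

```json
{
  "dbfs": {
    "profile": "production",
    "volume_fs_max_read_concurrency": 16
  }
}
```

Parameters stored this way are merged into the storage options whenever a `dbfs` file system is created through fsspec, e.g. via `fsspec.filesystem("dbfs")` or `fsspec.open("dbfs:/...")`.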

With WorkspaceClient

You can create DatabricksFileSystem by explicitly setting Databricks SDK's WorkspaceClient object. The created DatabricksFileSystem instance will use the authentication configured in the provided WorkspaceClient object.

from databricks.sdk import WorkspaceClient

client = WorkspaceClient(...)
...

fs = DatabricksFileSystem(client=client)

Note: a DatabricksFileSystem created with a WorkspaceClient will generally not be serializable, because WorkspaceClient instances are not serializable. Consider using other configuration methods if you need serializable filesystem objects.

Configuration options

In addition to the authentication parameters, fsspec-databricks supports the following configuration options.

Options for general file system behavior

| Parameter name | Description | Default |
| --- | --- | --- |
| config | An optional pre-configured Databricks SDK Config object. If provided, it is used for authentication. | None |
| client | An optional pre-configured Databricks SDK WorkspaceClient object. If provided, it is used for accessing the Databricks Workspace API. | None |
| use_local_fs_in_workspace | Whether to access files through the local file system rather than the remote Databricks API when running inside a Databricks workspace. | True |
| verbose_debug_log | Whether to enable verbose debug logging for file system operations. | False |

Options for Unity Catalog Volume file system

| Parameter name | Description | Default |
| --- | --- | --- |
| volume_fs_max_read_concurrency | The maximum number of concurrent read operations on a Unity Catalog Volume file. | 24 |
| volume_fs_min_read_block_size | The minimum data size per read operation on a Unity Catalog Volume file. | 1024 * 1024 (1 MiB) |
| volume_fs_max_read_block_size | The maximum data size per read operation on a Unity Catalog Volume file. | 4 * 1024 * 1024 (4 MiB) |
| volume_fs_max_write_concurrency | The maximum number of concurrent write operations on a Unity Catalog Volume file. | 24 |
| volume_fs_min_write_block_size | The minimum data size per write operation on a Unity Catalog Volume file. | 5 * 1024 * 1024 (5 MiB) |
| volume_fs_max_write_block_size | The maximum data size per write operation on a Unity Catalog Volume file. | 16 * 1024 * 1024 (16 MiB) |
| volume_min_multipart_upload_size | The minimum file size at which multipart upload is used for uploading files to a Unity Catalog Volume. | 5 * 1024 * 1024 (5 MiB) |
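To illustrate how the read-tuning options above typically interact, the sketch below splits a file into roughly one block per concurrent reader and clamps the result to the configured minimum and maximum block sizes. This is an interpretation of the documented parameters for illustration, not the library's actual algorithm.

```python
def pick_read_block_size(
    file_size: int,
    max_concurrency: int = 24,           # volume_fs_max_read_concurrency default
    min_block: int = 1024 * 1024,        # volume_fs_min_read_block_size default
    max_block: int = 4 * 1024 * 1024,    # volume_fs_max_read_block_size default
) -> int:
    """Choose a per-request read size: aim for one block per worker,
    then clamp it to the configured [min_block, max_block] range."""
    ideal = -(-file_size // max_concurrency)  # ceiling division
    return max(min_block, min(ideal, max_block))
```

Small files are read in a single 1 MiB-floor request, while very large files saturate all workers with 4 MiB requests; only mid-sized files fall between the two clamps.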

Differences from the original DatabricksFileSystem in fsspec

fsspec ships its own implementation of DatabricksFileSystem (fsspec.implementations.dbfs.DatabricksFileSystem).

The main difference between DatabricksFileSystem in fsspec-databricks and the original one in fsspec is that the original targets only legacy DBFS (Databricks File System), which Databricks has deprecated.

Databricks currently supports workspace files and Unity Catalog volumes in addition to the legacy DBFS, and it continues to use the dbfs:/ URL scheme for both legacy DBFS and the other file systems (documentation).

fsspec-databricks primarily aims to support the newer file systems (workspace files and Unity Catalog volumes) and to enable seamless access to them through the same dbfs:/ URL scheme used in Databricks workspaces.

Limitations

The following features are not yet implemented or tested.

  • Compatibility with Databricks on AWS (not tested)
  • Legacy DBFS support (not tested)
  • Use of the storage proxy when running inside a Databricks workspace or notebook (not implemented)

We are actively developing and testing the library, and we welcome contributions and feedback from the community.

License

Apache License 2.0. See LICENSE for more details.
