
Databricks Utils (dbutils) does not support a few crucial file system operations, such as recursive directory listing, pattern matching for files, and listing only directories or only files. This package provides these operations.

Project description

Introduction

Databricks is a leading service for big data processing, but it lacks several file system operations that developers commonly need. As a result, developers often have to write custom solutions to fill the gap. The missing operations include:

  • Recursively listing directory contents
  • Listing files matching specified start and end patterns
  • Performing case-sensitive or case-insensitive file pattern matches
  • Listing only directories or only files or both
  • Generating sorted output of listing

This package provides all of these operations through a single function call.
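For context, a custom solution of the kind mentioned above often looks something like the following minimal sketch. It assumes only that `dbutils.fs.ls(path)` returns entries exposing a `path` attribute and an `isDir()` method, as Databricks `FileInfo` objects do. The helper `list_recursively` and the in-memory stand-ins `FakeEntry`, `FAKE_TREE`, and `fake_ls` are hypothetical names introduced here so the example runs outside Databricks; this is not the package's implementation.

```python
from typing import Callable, Dict, Iterable, List


def list_recursively(ls: Callable[[str], Iterable], path: str) -> List[str]:
    """Depth-first walk using an `ls` callable such as dbutils.fs.ls."""
    found: List[str] = []
    for entry in ls(path):
        found.append(entry.path)
        if entry.isDir():
            # Descend into the subdirectory and collect its contents too.
            found.extend(list_recursively(ls, entry.path))
    return found


class FakeEntry:
    """Hypothetical stand-in for a dbutils FileInfo (only path and isDir)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def isDir(self) -> bool:
        # Assumption for this sketch: directory paths end with a slash.
        return self.path.endswith("/")


# A tiny in-memory tree keyed by directory path, used instead of real storage.
FAKE_TREE: Dict[str, List[str]] = {
    "dbfs:/demo/": ["dbfs:/demo/a.csv", "dbfs:/demo/sub/"],
    "dbfs:/demo/sub/": ["dbfs:/demo/sub/b.json"],
}


def fake_ls(path: str) -> List[FakeEntry]:
    """Return the fake entries stored under `path` (empty if unknown)."""
    return [FakeEntry(p) for p in FAKE_TREE.get(path, [])]


listing = list_recursively(fake_ls, "dbfs:/demo/")
print(listing)
```

In a notebook, the same helper would be called as `list_recursively(dbutils.fs.ls, "dbfs:/FileStore/")`. Filtering, case handling, and sorting would all need to be added on top of it by hand.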

Package installation and configuration

  1. Install the package using pip:
pip install databricks-filesystem
  2. Import the package:
from databricks_filesystem import DatabricksFilesystem
  3. Configure the package by passing the Databricks Utils object (dbutils) as a parameter:
adb_fs = DatabricksFilesystem(dbutils=dbutils)
  4. Call the filesystem_list function to recursively list files and directories. The examples below show it working with DBFS and with external storage systems such as Azure Data Lake Storage (ADLS), Azure Blob Storage, AWS S3, Google Cloud Storage, and more.
# List DBFS directory
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/")

# List Azure Data Lake Storage directory (ADLS)
adb_fs.filesystem_list(filesystem_path="abfss://<container>@<storage-account>.dfs.core.windows.net/<directory>/")

# List AWS S3
adb_fs.filesystem_list(filesystem_path="s3a://<aws-bucket-name>/<path>")

# List Google Cloud Storage
adb_fs.filesystem_list(filesystem_path="gs://<bucket-name>/<path>")
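When several storage locations need to be listed in one pass, the calls above can be wrapped in a small helper. The following is a minimal sketch, not part of the package: `list_roots` is a hypothetical helper that accepts any object with a `filesystem_list` method, and `StubFilesystem` with its made-up return value exists only so the example runs without Databricks. The bucket name is a placeholder.

```python
from typing import Dict, Iterable, List


def list_roots(fs, roots: Iterable[str], **options) -> Dict[str, List[str]]:
    """Call fs.filesystem_list once per root and key the results by root."""
    return {
        root: fs.filesystem_list(filesystem_path=root, **options)
        for root in roots
    }


class StubFilesystem:
    """Hypothetical stand-in for DatabricksFilesystem, for local runs only."""

    def filesystem_list(self, filesystem_path: str, **options) -> List[str]:
        # Made-up result: pretend each root contains a single CSV file.
        return [filesystem_path + "example.csv"]


results = list_roots(
    StubFilesystem(),
    ["dbfs:/FileStore/", "s3a://example-bucket/data/"],
    recursive_flag=False,
)
print(results)
```

In a notebook, `adb_fs` from step 3 would take the place of `StubFilesystem()`.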

filesystem_list function

filesystem_list(
    self,
    filesystem_path: str,
    recursive_flag: bool = True,
    list_directories: bool = True,
    list_files: bool = True,
    files_starts_with: Union[str, List[str]] = None,
    files_ends_with: Union[str, List[str]] = None,
    skip_files_starts_with: Union[str, List[str]] = None,
    skip_files_ends_with: Union[str, List[str]] = None,
    case_sensitive_comparison: bool = True,
    sorted_output: bool = True
) -> list

Below are the parameters accepted by the filesystem_list function:

  • filesystem_path (str - Mandatory): Specify the file system path for listing.

  • recursive_flag (bool - Optional (Default: True)): When set to True, this flag enables recursive listing of the file system path, including all subdirectories.

  • list_directories (bool - Optional (Default: True)): When set to True, directories are included in the output.

  • list_files (bool - Optional (Default: True)): When set to True, files are included in the output.

  • files_starts_with (str or List[str] - Optional (Default: None)): Only files that start with the given pattern, or with any pattern in the given list, are listed. This parameter takes effect only when "list_files" is True.

  • files_ends_with (str or List[str] - Optional (Default: None)): Only files that end with the given pattern, or with any pattern in the given list, are listed. This parameter takes effect only when "list_files" is True.

  • skip_files_starts_with (str or List[str] - Optional (Default: None)): Files that start with the given pattern, or with any pattern in the given list, are excluded from the output. This parameter takes effect only when "list_files" is True.

  • skip_files_ends_with (str or List[str] - Optional (Default: None)): Files that end with the given pattern, or with any pattern in the given list, are excluded from the output. This parameter takes effect only when "list_files" is True.

  • case_sensitive_comparison (bool - Optional (Default: True)): When set to True, file pattern matching is case-sensitive; when set to False, it is case-insensitive. This parameter takes effect only when "list_files" is True and a value is provided for at least one of "files_starts_with", "files_ends_with", "skip_files_starts_with", or "skip_files_ends_with".

  • sorted_output (bool - Optional (Default: True)): When set to True, the output is returned sorted, which makes the results easier to scan and compare.

The function returns a list of file paths and directory paths.
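To make the interaction between the four pattern parameters and case_sensitive_comparison concrete, here is a minimal pure-Python sketch of one plausible reading of the descriptions above. It is an assumption, not the package's actual code. It assumes that patterns are compared against the file name, that a list of patterns matches if any one of them matches, that files_starts_with and files_ends_with must both match when both are given, and that matching any skip pattern excludes the file. The names `file_is_listed`, `Patterns`, and `_as_tuple` are hypothetical.

```python
from typing import List, Optional, Tuple, Union

Patterns = Optional[Union[str, List[str]]]


def _as_tuple(patterns: Patterns, case_sensitive: bool) -> Tuple[str, ...]:
    """Normalize None, a single string, or a list of strings into a tuple."""
    if patterns is None:
        return ()
    items = [patterns] if isinstance(patterns, str) else list(patterns)
    return tuple(p if case_sensitive else p.lower() for p in items)


def file_is_listed(
    file_name: str,
    files_starts_with: Patterns = None,
    files_ends_with: Patterns = None,
    skip_files_starts_with: Patterns = None,
    skip_files_ends_with: Patterns = None,
    case_sensitive_comparison: bool = True,
) -> bool:
    """Return True if file_name passes the include and skip filters."""
    name = file_name if case_sensitive_comparison else file_name.lower()
    starts = _as_tuple(files_starts_with, case_sensitive_comparison)
    ends = _as_tuple(files_ends_with, case_sensitive_comparison)
    skip_starts = _as_tuple(skip_files_starts_with, case_sensitive_comparison)
    skip_ends = _as_tuple(skip_files_ends_with, case_sensitive_comparison)

    # Include filters: each filter that is set must match (any of its patterns).
    if starts and not name.startswith(starts):
        return False
    if ends and not name.endswith(ends):
        return False
    # Skip filters: matching any skip pattern excludes the file.
    if skip_starts and name.startswith(skip_starts):
        return False
    if skip_ends and name.endswith(skip_ends):
        return False
    return True


print(file_is_listed("part-0001.parquet", "part", ".parquet"))
print(file_is_listed("PART-0001.PARQUET", "part", ".parquet"))
```

Under these assumptions the first call passes both include filters, while the second fails them because the default comparison is case-sensitive; passing `case_sensitive_comparison=False` would accept it.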

Examples

  1. Recursive listing
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/")
  2. Non-recursive listing
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", recursive_flag=False)
  3. Recursively list only files
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False)
  4. Recursively list only directories
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_files=False)
  5. List all CSV files recursively
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_ends_with=".csv")
  6. List all CSV, Parquet, and JSON files recursively
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_ends_with=[".csv", ".parquet", ".json"])
  7. Recursively list all files that start with the word "test"
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_starts_with="test")
  8. Recursively list all files that start with the word "test" or "temp"
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_starts_with=["test", "temp"])
  9. Recursively list files that start with "part" and end with ".parquet"
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_starts_with="part", files_ends_with=".parquet")
  10. Recursively list files that start with "part" or "test" and end with ".parquet" or ".json"
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_starts_with=["part", "test"], files_ends_with=[".parquet", ".json"])
  11. Recursively list files, but skip those with a ".json" or ".parquet" extension
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, skip_files_ends_with=[".json", ".parquet"])
  12. Recursively list files, but skip those starting with "test" or "temp" and also skip ".crc" files
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, skip_files_starts_with=["test", "temp"], skip_files_ends_with=[".crc"])
  13. Recursively list files that start with "part" and end with ".parquet", matching case-insensitively
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False, files_starts_with="part", files_ends_with=".parquet", case_sensitive_comparison=False)
  14. Get the unsorted output of the listing
adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", sorted_output=False)
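Because the function returns a plain list, the result of any example above can be post-processed with ordinary Python. The sketch below counts paths per file extension. It assumes the returned items are path strings (as they would be for a files-only listing with list_directories=False); `count_by_extension` is a hypothetical helper, and `sample` is made-up data standing in for a real listing.

```python
import posixpath
from collections import Counter
from typing import Iterable


def count_by_extension(paths: Iterable[str]) -> Counter:
    """Count paths per lower-cased file extension ('' when there is none)."""
    counts: Counter = Counter()
    for path in paths:
        # splitext returns (root, extension); the extension keeps its dot.
        extension = posixpath.splitext(path)[1].lower()
        counts[extension] += 1
    return counts


# Hypothetical sample standing in for a filesystem_list(...) result.
sample = [
    "dbfs:/FileStore/a.csv",
    "dbfs:/FileStore/b.CSV",
    "dbfs:/FileStore/out/part-0000.parquet",
    "dbfs:/FileStore/out/_SUCCESS",
]
extension_counts = count_by_extension(sample)
print(extension_counts)
```

In a notebook, `sample` would be replaced by a call such as `adb_fs.filesystem_list(filesystem_path="dbfs:/FileStore/", list_directories=False)`.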

Additional Information

You can get more information about this package here

Download files

Source Distribution

databricks_filesystem-0.0.4.tar.gz (7.5 kB)

Built Distribution

databricks_filesystem-0.0.4-py3-none-any.whl (6.8 kB, Python 3)
