
DataSentics Lab - experimental open-source repo

Project description

datasentics-lab

dslab is a fully open-source package that simplifies everyday tasks for data scientists who rely on Databricks. It contains experimental code, primarily developed and maintained by DataSentics.

All contributions and contributors are very welcome!

Installation

PySpark is the only dependency that needs to be preinstalled.

The package is available on PyPI:

pip install datasentics-lab

Utilities

DBPath

DBPath is a quality-of-life utility that simplifies and unifies file handling in Databricks.

Its design and API are inspired by pathlib.Path.

Showcase

from dslab.dbpath import DBPath

DBPath.set_spark_session(spark)  # used to initialize dbutils instance

path = DBPath('dbfs:/FileStore/')

path.ls() # lists files in directory in human-readable format

path.tree(max_depth=2) # prints indented directory tree

file = path / 'tmp' / 'my_file'

with file.open('wt') as f:
    f.write('It really is this simple!')
    
print(file.read_text())

file.write_text('And this is even easier!')

print(file.read_text())

print(f'{file} exists: {file.exists()}, is dir: {file.is_dir()}, is in filestore: {file.in_filestore}')

And that is just a taste! See the full list of features below.

Features

from dslab.dbpath import DBPath
help(DBPath)
A utility class for working with Databricks API paths directly and in a unified manner.

The design is inspired by pathlib.Path.

>>> path = DBPath('abfss://...')
>>> path = DBPath('dbfs:/...')
>>> path = DBPath('file:/...')
>>> path = DBPath('s3:/...')
>>> path = DBPath('s3a:/...')
>>> path = DBPath('s3n:/...')


INITIALIZATION:

>>> from dslab import DBPath

Provide a Spark session for the dbutils instance:
>>> DBPath.set_spark_session(spark)

Set the FileStore base download URL for your Databricks workspace:
>>> DBPath.set_base_download_url('https://adb-1234.5.azuredatabricks.net/files/')


PROPERTIES:

path - the whole path
name - just the filename (last part of path)
parent - the parent (DBPath)
children - sorted list of children files (list(DBPath)), empty list for non-folders
in_local, in_dbfs, in_filestore, in_lake, in_bucket - predicates for location of file
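
The path-decomposition properties follow pathlib.Path semantics. A minimal illustration of the intended behaviour of `name` and `parent`, using plain string splitting on a made-up path (not the actual implementation):

```python
# Illustrative only: mimics how DBPath-style `name` and `parent`
# decompose a path string, using plain string operations.
path = "dbfs:/FileStore/tmp/my_file.csv"

name = path.rstrip("/").rsplit("/", 1)[-1]   # last path component
parent = path.rstrip("/").rsplit("/", 1)[0]  # everything before it

print(name)    # my_file.csv
print(parent)  # dbfs:/FileStore/tmp
```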


BASE METHODS:

exists() - returns True if file exists
is_dir() - returns True if file exists and is a directory
ls() - prints human readable list of contained files for folders, with file sizes
tree(max_depth=5, max_files_per_dir=50) - prints the directory structure, up to `max_depth` and 
        `max_files_per_dir` files in each directory
cp(destination, recurse=False) - same as dbutils.fs.cp(str(self), str(destination), recurse)
rm(recurse=False) - same as dbutils.fs.rm(str(self), recurse)
mkdirs() - same as dbutils.fs.mkdirs(str(self))
iterdir() - sorted generator over files (also DBPath instances) - similar to Path.iterdir()
reiterdir(regex) - sorted generator over files (DBPath) that match `bool(re.findall(regex, file))`
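
The `reiterdir` matching rule (`bool(re.findall(regex, file))`) can be sketched in plain Python; the filenames below are made up for illustration:

```python
import re

# Hypothetical directory listing; reiterdir keeps entries where
# bool(re.findall(regex, name)) is True, yielded in sorted order.
files = ["report_2023.csv", "report_2024.csv", "notes.txt", "archive.zip"]

matched = sorted(f for f in files if re.findall(r"report_\d{4}\.csv", f))
print(matched)  # ['report_2023.csv', 'report_2024.csv']
```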


IO METHODS:

open(method='rt', encoding='utf-8') - context manager for working with any DB API file locally
read_text(encoding='utf-8') - reads the file as text and returns contents
read_bytes() - reads the file as bytes and returns contents
write_text(text) - writes text to the file
write_bytes(bytedata) - writes bytes to the file
download_url() - for FileStore records returns a direct download URL
make_download_url() - copies a file to FileStore and returns a direct download URL
backup() - creates a backup copy in the same folder, named by the following convention:
    (filename)[.extension] -> (filename)_YYYYMMDD_HHMMSS[.extension]
restore(timestamp) - restore a previous backup of this file by passing backup timestamp string (`'YYYYMMDD_HHMMSS'`)
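
The backup naming convention can be sketched as a pure function (a hypothetical helper for illustration, not part of the package API):

```python
import os
from datetime import datetime

def backup_name(filename: str, ts: datetime) -> str:
    """Apply the (filename)[.ext] -> (filename)_YYYYMMDD_HHMMSS[.ext] convention."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_{ts:%Y%m%d_%H%M%S}{ext}"

print(backup_name("data.csv", datetime(2024, 1, 31, 12, 30, 5)))
# data_20240131_123005.csv
```

The `'YYYYMMDD_HHMMSS'` part of the resulting name is exactly the timestamp string you would later pass to `restore()`.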


CLASS METHODS:

set_spark_session(spark) - necessary to call upon initialization
clear_tmp_download_cache() - clear all files created using `make_download_url()`
temp_file - context manager that returns a temporary DBPath
set_base_download_url(url) - call once upon initialization, sets the base URL for FileStore direct downloads
    (e.g. 'https://adb-1234.5.azuredatabricks.net/files/')
set_protocol_temp_path - call once upon initialization for each filesystem you want to create temp files/dirs in
    ('dbfs' and 'file' are set by default)

Feedback

All feedback is extremely welcome; please raise an issue on GitHub or contact me at adam.volny@datasentics.com.

Contribution

Contributions and extensions are welcome; don't hesitate to open a PR and we will discuss adding the feature.

Local Environment Setup

The following software needs to be installed first:

  • Conda
  • Git

Clone the repo now and prepare the package environment:

  • On Windows, use Git Bash.
  • On Linux/Mac, use the standard console.
$ git clone git@github.com:DataSentics/datasentics-lab.git
$ cd datasentics-lab
$ ./env-init.sh

After the environment setup is complete, activate the Conda environment:

$ conda activate ./.venv

Semantic Commit Messages

We decided to use semantic commit messages for easier long-term maintenance.

We're looking forward to your contributions!

Format: <type>(<scope>): <subject>

<scope> is optional

Example

feat: add hat wobble
^--^  ^------------^
|     |
|     +-> Summary in present tense.
|
+-------> Type: chore, docs, feat, fix, refactor, style, or test.

More Examples:

  • feat: (new feature for the user, not a new feature for build script)
  • fix: (bug fix for the user, not a fix to a build script)
  • docs: (changes to the documentation)
  • style: (formatting, missing semicolons, etc.; no production code change)
  • refactor: (refactoring production code, e.g. renaming a variable)
  • test: (adding missing tests, refactoring tests; no production code change)
  • cicd: (updating workflows; no production code change)
  • release: (changing version in pyproject.toml and commit message: "release: vMAJOR.MINOR.PATCH")
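
The `<type>(<scope>): <subject>` format lends itself to a simple regex check. A sketch of such a validator (not an official tooling hook of this repo):

```python
import re

# Types taken from the list above; <scope> is optional.
COMMIT_RE = re.compile(
    r"^(chore|docs|feat|fix|refactor|style|test|cicd|release)"
    r"(\([^)]+\))?: .+"
)

print(bool(COMMIT_RE.match("feat: add hat wobble")))       # True
print(bool(COMMIT_RE.match("feat(parser): handle tabs")))  # True
print(bool(COMMIT_RE.match("added some stuff")))           # False
```

A check like this could be wired into a commit-msg git hook to enforce the convention locally.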

