Data Science Framework & Abstractions

DSLIBRARY

Installation

# normal install
pip install dslibrary

# to include a robust set of data connectors:
pip install dslibrary[all]

Data Science Framework and Abstraction of Data Details

Data science code is supposed to focus on the data, but it frequently gets bogged down in repetitive tasks like juggling parameters, working out file formats, and connecting to cloud data sources. This library proposes some ways to make those parts of life a little easier, and to make the resulting code a little shorter and more readable.

Some of this project's goals:

  • make it possible to create 'situation agnostic' code which runs unchanged across many platforms, against many data sources and in many data formats
  • remove the need to code some of the most often repeated mundane chores, such as parameter parsing, read/write in different file formats with different formatting options, cloud data access
  • enhance the ability to run and test code locally
  • support higher security and cross-cloud data access
  • compatibility with mlflow.tracking, with the option to delegate to mlflow or not

If you use dslibrary with no configuration, it defaults to the straightforward behaviors you would expect during local development, but it can be configured to operate in a wide range of environments.
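
For example, unconfigured calls like these presumably behave as plain local-file operations (a sketch using calls shown later in this document; the filenames are made up):

import dslibrary as dsl

# with no configuration, resource names are treated as ordinary local files
df = dsl.load_dataframe("measurements.csv")   # reads ./measurements.csv as a CSV
dsl.write_resource("cleaned.csv", df)         # writes ./cleaned.csv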

Data Cleaning Example

Here's a simple data cleaning example. You can run it from the command line, or call its clean() method, and it will clip the values in a column of the supplied data to an upper limit. But so far it only works on local files, it only supports one file format (CSV), and it uses read_csv()'s default formatting arguments, which will not always work.

# clean_it.py
import pandas

def clean(upper=100, inp="in.csv", out="out.csv"):
    df = pandas.read_csv(inp)
    df.loc[df.x > upper, 'x'] = upper
    df.to_csv(out)

if __name__ == "__main__":
    # INSERT ARGUMENT PARSING CODE HERE
    clean(...)
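
To make the chore concrete, the placeholder above would typically be filled in with something like this (a sketch using the standard library's argparse; the flag names are made up for illustration):

# clean_it.py, __main__ block filled in by hand
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clip values in column x to an upper limit")
    parser.add_argument("--upper", type=float, default=100)
    parser.add_argument("--inp", default="in.csv")
    parser.add_argument("--out", default="out.csv")
    args = parser.parse_args()
    clean(upper=args.upper, inp=args.inp, out=args.out)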

Here it is converted to use dslibrary:

import dslibrary as dsl

def clean(upper: float = 100, inp="in.csv", out="out.csv"):
    df = dsl.load_dataframe(inp)
    df.loc[df.x > upper, 'x'] = upper
    dsl.write_resource(out, df)

if __name__ == "__main__":
    clean(**dsl.get_parameters())

Now if we execute that code through dslibrary's ModelRunner class, we can point it to data in the cloud, and set different file formatting options:

from dslibrary import ModelRunner
import clean_it

runner = ModelRunner()
runner.set_parameter("upper", 50)
runner.set_input("in.csv", "s3://bucket/raw.csv", format_options={"delim_whitespace": True})
runner.set_output("out.csv", "s3://bucket/clipped.csv", format_options={"sep": "\t"})

# run the method directly
runner.run_method(clean_it.clean)

Or we can invoke it as a subprocess:

runner.run_local("path/to/clean_it.py")

This will also work with notebooks:

runner.run_local("path/to/clean_it.ipynb")

More examples

Report a metric about some data

Report the average temperature of some data:

import dslibrary as dsl
data = dsl.load_dataframe("input")
with dsl.start_run():
    dsl.log_metric("avg_temp", data.temperature.mean())

Call it with some SQL data:

from dslibrary import ModelRunner
runner = ModelRunner()
runner.set_input(
    "input",
    uri="mysql://username:password@mysql-server/climate",
    sql="select temperature from readings order by timestamp desc limit 1000"
)
runner.run_local("avg_temp.py")

Change format & filename for metrics output (format is implied by filename):

runner.set_output(dslibrary.METRICS_ALIAS, "metrics.csv", format_options={"sep": "\t"})

We could send the metrics to mlflow instead:

runner = ModelRunner(mlflow=True)

Reconfigure Everything

If all of your code's essential connections to the outside world are abstracted and can be repointed elsewhere, then your code can run anywhere.

The entire implementation of dslibrary can be changed through environment variables. In fact, all the ModelRunner class really does is set environment variables.

These are the main types of interface that data science code has to the outside world. Dslibrary offers methods to manage all of them, and each can be handled differently through configuration (a short sketch follows the list):

  • parameters - if you think of your unit of work as a function, it's going to have some arguments. Whether they are for configuration, feature values or hyperparameters, there are some values that need to get to your entry point.
  • resources - file-like data, which might be here, there or on the cloud, and in any format
  • connections - filesystems like S3, or databases like PostgreSQL
  • metrics & logging - all the usual tracking information
  • model data - pickled binaries and such
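
To make the list concrete, here is a hypothetical entry point that touches parameters, resources and metrics using only the dslibrary calls shown earlier (the function, filenames and column names are made up; connections and model data have their own helpers, not shown here):

import dslibrary as dsl

def summarize(threshold: float = 0.5, data="events.csv", out="flagged.csv"):
    # parameters arrive as ordinary function arguments via get_parameters()
    df = dsl.load_dataframe(data)            # resource: local file, cloud object or SQL result
    flagged = df[df.score > threshold]       # stand-in for the real work
    dsl.write_resource(out, flagged)         # resource: destination decided by configuration
    with dsl.start_run():
        dsl.log_metric("flagged_rows", len(flagged))  # metrics: local file or mlflow, per configuration

if __name__ == "__main__":
    summarize(**dsl.get_parameters())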

Data Security and Cross-Cloud Data

The usual way to access data in the cloud is to store CSP credentials in, say, "~/.aws/credentials", so that the intervening library can read and write S3 buckets. You have to make sure that setup is in place, that the right packages are in your environment, and that your code is written accordingly. Here are the main problems:

The setup is annoying

It can be time consuming to ensure that every system running the code has this credential configuration in place, and one system may need to access multiple accounts on the same CSP. And if you are on one CSP trying to access data in another, there is no automated setup you can count on.

The usual solution is to require that all the data science code add support for some particular cloud provider, and accept credentials as secrets. It's a lot of overhead.

The way dslibrary aims to help is by separating out all the information about a particular data source or target and providing ways to bundle and un-bundle it so that it can be sent where it is needed. The data science code itself should not have to worry about these settings or need to change just because the data moved or changed format.

Do you trust the code?

The code often has access to those credentials. Maybe you trust the code not to "lift" those credentials and use them elsewhere, maybe you don't. Maybe you can ensure the credentials are locked down to no more than S3 bucket read access, or maybe you can't. Even secret management systems still expose the credentials to the code.

The solution dslibrary facilitates is to have a different, trusted system perform the data access. In dslibrary there is an extensible/customizable way to "transport" data access to another system. By setting an environment variable or two (one for the remote URL, another for an access token), the data read and write operations can be managed by that other system. Before executing the code, one sends the URIs, credentials and file format information to the data access system.

The transport.to_rest class will send dslibrary calls to a REST service.

The transport.to_volume class will send dslibrary calls through a shared volume to a Kubernetes sidecar.

COPYRIGHT

(c) Accenture 2021
