
Tools for writing Dotscience workloads

Project description

Installation

You can get the Dotscience Python library in one of three ways.

Use the Dotscience JupyterLab environment

If you are using Dotscience in a Jupyter notebook via the Dotscience web interface, the Python library is already installed (it's installed in the container that you are executing in, on your runner). In this case, there is no need to install anything: just import dotscience as ds in your notebook.
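
For example, the first cell of a notebook in the Dotscience JupyterLab environment might start like this (a minimal sketch using only the calls described in the Quick Start below):

import dotscience as ds

ds.interactive()
ds.start()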

If you are using Dotscience to track a model whose source code is a script other than a Jupyter notebook, use one of the following installation methods:

Use the ready-made Docker image

We've made a Docker image by taking the stock python:3 image and pre-installing the Dotscience library. You can run it like so:

$ docker run -ti quay.io/dotmesh/dotscience-python3:latest
Python 3.7.0 (default, Aug  4 2018, 02:33:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dotscience as ds

Install it from PyPI

$ pip install dotscience
Collecting dotscience
  Downloading https://files.pythonhosted.org/packages/b2/e9/81db25b03e4c2b0115a7cd9078f0811b709a159493bb1b30e96f0906e1a1/dotscience-0.0.1-py3-none-any.whl
Installing collected packages: dotscience
Successfully installed dotscience-0.0.1
$ python
Python 3.7.0 (default, Sep  5 2018, 03:25:31) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dotscience as ds

Quick Start

The most basic usage is to record what data files you read and write, and maybe to declare some summary statistics about how well the job went.

import dotscience as ds
import pandas as pd

ds.interactive()

ds.start()

# Wrap the names of files you read with ds.input() - it just returns the filename:
df = pd.read_csv(ds.input('input_file.csv'))

# Likewise with files you write to:
df.to_csv(ds.output('output_file.csv'))

# Record a summary statistic about how well your job went
ds.add_summary('f-score', f_score)

ds.publish('Did some awesome data science!')

Don't forget to call ds.interactive() and ds.start() at the top if you're using Jupyter, and ds.publish() at the end, or your results won't get published! (The run description string passed in is optional, so leave it out if you can't think of anything nice to say).

Interactive vs. Script mode

The library has two modes - interactive and script. The call to ds.interactive() in the example above puts it in interactive mode, which tells the library that the code isn't coming from a script file. When you're writing code in a Python script file, call ds.script() instead.

This instructs the library to record the script filename (from sys.argv[0]) in the output runs, so they can be tracked back to the originating script. You don't need this in interactive mode, because Dotscience knows which Jupyter notebook you're using - and sys.argv[0] points to the Jupyter Python kernel in that case, which isn't useful to record in the run!

If sys.argv[0] isn't helpful in some other situation, you can call ds.script('FILENAME') to specify the script file, relative to the current working directory. In fact, in a Jupyter notebook, you could specify ds.script('my-notebook.ipynb') to manually specify the notebook file and override the automatic recording of the notebook file, but there wouldn't be any point!
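
To make the difference concrete, here's a minimal sketch of the same pattern in both modes (the filename train.py is just an illustrative assumption):

# In a Jupyter notebook cell:
import dotscience as ds

ds.interactive()   # no script file; Dotscience already knows which notebook this is
ds.start()
...
ds.publish('Ran interactively')

# In a standalone script such as train.py:
import dotscience as ds

ds.script()        # records sys.argv[0] ('train.py') with each run
ds.start()
...
ds.publish('Ran as a script')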

All the things you can record

There's a lot more than just data files and summary stats that Dotscience will keep track of for you - and there's a choice of convenient ways to specify each thing, so it can fit neatly into your code. Here's the full list:

Start and end time

The library will try to guess when your job starts and ends, from when you call start() until you call publish() (although it gets a bit more complex with multiple runs; see below).

If you're running in Jupyter, that means it'll include the time you spend working on your notebook, thinking, and so on, as well as the time actually spent running the steps, which probably isn't what you want. It's a good idea to explicitly tell Dotscience when your steps start and stop: you get more accurate run times, you can see which operations are slow, and you can cross-reference the time periods against other workloads running on the same computers to check whether they're interfering with each other.

Even when running Python scripts through ds run, it still helps to declare start and end times - if your script does a lot of tedious setup and teardown, you probably don't want those included in the run times.

Just call start() and end() before and after the real work and the library will record the current time when those functions are called. If you miss the end() it'll assume that your run has finished when you call publish(); this is often a good assumption, so you can often just call start() at the start and publish() at the end and be done with it.

import dotscience as ds

ds.script()

...setup code...

ds.start()

...the real work...

ds.end()

...cleanup code...

ds.publish('Did some awesome data science!')

or:

import dotscience as ds

ds.script()

...setup code...

ds.start()

...the real work...

ds.publish('Did some awesome data science!')

Dotscience will still record the start and end times of the actual execution of your workload (which is the entire script for a command workload, or the time between saves for a Jupyter workload) as well, but that's kept separately.

Errors

Sometimes, a run fails, but you still want to record that it happened (perhaps so you know not to do the same thing again...). You can declare a run as failed like so:

import dotscience as ds

ds.script()
ds.start()

...
ds.set_error("The data wasn't correctly formatted")
...

ds.publish('Tried, in vain, to do some awesome data science!')

If you're assembling an error message to use for some other reason, the dotscience library can just grab a copy of it before you use it, with this convenience function that returns its argument:

import dotscience as ds

ds.script()
ds.start()

...
raise DataFormatError(ds.error("The data wasn't correctly formatted"))
...

ds.publish('Tried, in vain, to do some awesome data science!')

Describing the run

It's good to record a quick human-readable description of what your run did, which helps people who are viewing the provenance graph. We've already seen how to pass a description into publish():

import dotscience as ds

ds.script()
ds.start()

ds.publish('Did some awesome data science!')

But you can set up a description before then and just call publish() with no arguments:

import dotscience as ds

ds.script()
ds.start()

...
ds.set_description('Did some awesome data science!')
...

ds.publish()

If you're already building a descriptive string to send somewhere else, you can also use this function, which returns its argument:

import dotscience as ds

ds.script()
ds.start()

...
log.write(ds.description('Did some awesome data science!'))
...

ds.publish()

And if you wish, you can also pass the description to start(), although it can feel weird using the past tense for something you're about to do:

import dotscience as ds

ds.script()
ds.start('Did some awesome data science!')

...

ds.publish()

Input and Output files

In order to correctly track the provenance of data files, Dotscience needs you to correctly declare what data files your jobs read and write.

The most convenient way to do this is with input() and output(), which accept the name of a file to be used for input or output respectively, and return it:

import dotscience as ds
import pandas as pd

ds.script()
ds.start()

df = pd.read_csv(ds.input('input_file.csv'))

df.to_csv(ds.output('output_file.csv'))

ds.publish('Did some awesome data science!')

But you can also declare them explicitly with add_input() and add_output():

import dotscience as ds

ds.script()
ds.start()

ds.add_input('input_file.csv')

ds.add_output('output_file.csv')

ds.publish('Did some awesome data science!')

Or declare several at once with add_inputs() and add_outputs():

import dotscience as ds

ds.script()
ds.start()

ds.add_inputs('input_file_1.csv', 'input_file_2.csv')

ds.add_outputs('output_file_1.csv', 'output_file_2.csv')

ds.publish('Did some awesome data science!')

Labels

You can attach arbitrary labels to your runs, which can be used to search for them in the Dotscience user interface. As usual, this can be done while returning the label value with label(), explicitly with add_label(), or en masse with add_labels():

import dotscience as ds

ds.script()
ds.start()

some_library.set_mode(ds.label("some_library_mode", foo))

ds.add_label("algorithm_version","1.3")

ds.add_labels(experimental=True, mode="test")

ds.publish('Did some awesome data science!')

Summary statistics

Often, your job will be able to measure its own performance in some way - perhaps testing how well a model trained on some training data works when tested on some test data. If you declare those summary statistics to Dotscience, it can help you keep track of which runs produced the best results.

As usual, this can be done while returning the summary value with summary(), explicitly with add_summary(), or en masse with add_summaries():

import dotscience as ds

ds.script()
ds.start()

print('Fit: %f%%' % (ds.summary('fit%', fit),))

ds.add_summary('fit%', fit)

ds.add_summaries(fit=computeFit(), error=computeMeanError())

ds.publish('Did some awesome data science!')

Parameters

Often, the work of a data scientist involves running the same algorithm while tweaking some input parameters to see what settings work best. If you declare your input parameters to Dotscience, it can keep track of them and help you find the best ones!

As usual, this can be done while returning the parameter value with parameter(), explicitly with add_parameter(), or en masse with add_parameters():

import dotscience as ds

ds.script()
ds.start()

some_library.set_smoothing(ds.parameter("smoothing", 2.0))

ds.add_parameter("outlier_threshold",1.3)

ds.add_parameters(prefilter=True, smooth=True, smoothing_factor=12)

ds.publish('Did some awesome data science!')

Multiple runs

There's nothing to stop you from doing more than one "run" in one go; just call start() at the beginning and publish() at the end of each.

This might look like this:

import dotscience as ds

ds.script()

ds.start()
data = load_csv(ds.input('training.csv'))
model = train_model(data)
model.save(ds.output('model.mdl'))
ds.publish('Trained the model')

ds.start()
test_data = load_csv(ds.input('test.csv'))
accuracy = test_model(model, test_data)
ds.add_summary('accuracy', accuracy)
ds.publish('Tested the model')

Or it might look like this:

import dotscience as ds

ds.script()

# Load the data, but don't report it to Dotscience (yet)
data = load_csv('training.csv')
test_data = load_csv('test.csv')

for smoothing in [1.0, 1.5, 2.0]:
    ds.start()
    # Report that we use the already-loaded data
    ds.add_input('training.csv')
    ds.add_input('test.csv')

    # Train model with the configured smoothing level (informing Dotscience of the input parameter)
    model = train_model(data, smoothing=ds.parameter('smoothing', smoothing))

    # Test model
    accuracy = test_model(model, test_data)

    # Inform Dotscience of the accuracy
    ds.add_summary('accuracy', accuracy)

    # Publish this run
    ds.publish('Tested the model with smoothing %s' % (smoothing,))

In that example, we've loaded the data into memory once and re-used it in each run, so we've done the loading before the call to start(), and the recorded start time reflects only the actual work of each run. We could have put a call to end() just before publish(), but publish() assumes the run has ended when you publish it anyway.
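
For completeness, here's what the end of each loop iteration would look like with that explicit call to end() (a minimal sketch; the rest of the loop is unchanged from the example above):

    ...
    ds.end()  # optional: publish() would assume the run ended here anyway
    ds.publish('Tested the model with smoothing %s' % (smoothing,))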

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dotscience-0.2.1.tar.gz (30.5 kB view details)

Uploaded Source

Built Distribution

dotscience-0.2.1-py3-none-any.whl (15.2 kB view details)

Uploaded Python 3

File details

Details for the file dotscience-0.2.1.tar.gz.

File metadata

  • Download URL: dotscience-0.2.1.tar.gz
  • Upload date:
  • Size: 30.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for dotscience-0.2.1.tar.gz

  • SHA256: f6056f7dd8b609e0f84497fe4c08f4a9b6439a9385711aff7f36c8b6c09e8818
  • MD5: 6405fc8a54613fccef2e83a9a6d8d96a
  • BLAKE2b-256: c1656fc52a86d2cf0e7c2e49bc1cbea964f8b7ceb8d7a321584d2a6ca53bf81a

See more details on using hashes here.
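
If you want to verify a downloaded file against these digests yourself, a minimal sketch using Python's standard hashlib module (the local filename is an assumption) is:

import hashlib

# Compare the downloaded sdist against the SHA256 digest listed above
expected = 'f6056f7dd8b609e0f84497fe4c08f4a9b6439a9385711aff7f36c8b6c09e8818'
with open('dotscience-0.2.1.tar.gz', 'rb') as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, 'hash mismatch - the download may be corrupted'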

File details

Details for the file dotscience-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: dotscience-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 15.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for dotscience-0.2.1-py3-none-any.whl

  • SHA256: 5f678108c79348ece1b960e531eea46180ef8de502107dc3eeab3d5ffb494a37
  • MD5: d2ea3aac77bf8310084033181166d16e
  • BLAKE2b-256: 7cf873a96ef7173ca0898c024056bedb593ca6077aecac8bbc1b438a6a5e7255

See more details on using hashes here.
