
loop like a pro, make parameter studies fun: set up and run a parameter study/sweep/scan, save a database

About

This package helps you to set up and run parameter studies.

Mostly, you’ll start with a script and a for-loop and ask: “why do I need a package for that?” Well, soon you’ll want housekeeping tools and a database for your runs and results. This package exists because sooner or later, everyone doing parameter scans arrives at roughly the same workflow and tools.

This package deals with commonly encountered boilerplate tasks:

  • write a database of parameters and results automatically

  • make a backup of the database and all results when you repeat or extend the study

  • append new rows to the database when extending the study

  • simulate a parameter scan

Apart from that, the main goal is to not constrain your flexibility by building a complicated framework – we provide only very basic building blocks. All data structures are really simple (dicts), as are the provided functions. The database is a normal pandas DataFrame.

Getting started

A trivial example: Loop over two parameters ‘a’ and ‘b’:

#!/usr/bin/env python3

import random
from itertools import product
from psweep import psweep as ps


def func(pset):
    return {'result': random.random() * pset['a'] * pset['b']}


if __name__ == '__main__':
    a = ps.seq2dicts('a', [1, 2, 3, 4])
    b = ps.seq2dicts('b', [8, 9])
    params = ps.loops2params(product(a, b))
    df = ps.run(func, params)
    print(df)

This produces a list params of parameter sets (dicts {'a': ..., 'b': ...}) to loop over:

[{'a': 1, 'b': 8},
 {'a': 1, 'b': 9},
 {'a': 2, 'b': 8},
 {'a': 2, 'b': 9},
 {'a': 3, 'b': 8},
 {'a': 3, 'b': 9},
 {'a': 4, 'b': 8},
 {'a': 4, 'b': 9}]

and a database of results (pandas DataFrame df, pickled file calc/results.pk by default):

                           _calc_dir                              _pset_id  \
2018-07-22 20:06:07.401398      calc  99a0f636-10b3-438c-ab43-c583fda806e8
2018-07-22 20:06:07.406902      calc  6ec59d2b-7562-4262-b8d6-8f898a95f521
2018-07-22 20:06:07.410227      calc  d3c22d7d-bc6d-4297-afc3-285482e624b5
2018-07-22 20:06:07.412210      calc  f2b2269b-86e3-4b15-aeb7-92848ae25f7b
2018-07-22 20:06:07.414637      calc  8e1db575-1be2-4561-a835-c88739dc0440
2018-07-22 20:06:07.416465      calc  674f8a2c-bc21-40f4-b01f-3702e0338ae8
2018-07-22 20:06:07.418866      calc  b4d3d11b-0f22-4c73-a895-7363c635c0c6
2018-07-22 20:06:07.420706      calc  a265ca2f-3a9f-4323-b494-4b6763c46929

                                                         _run_id  \
2018-07-22 20:06:07.401398  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.406902  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.410227  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.412210  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.414637  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.416465  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.418866  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f
2018-07-22 20:06:07.420706  3e09daf8-c3a7-49cb-8aa3-f2c040c70e8f

                                            _time_utc  a  b     result
2018-07-22 20:06:07.401398 2018-07-22 20:06:07.401398  1  8   2.288036
2018-07-22 20:06:07.406902 2018-07-22 20:06:07.406902  1  9   7.944922
2018-07-22 20:06:07.410227 2018-07-22 20:06:07.410227  2  8  14.480190
2018-07-22 20:06:07.412210 2018-07-22 20:06:07.412210  2  9   3.532110
2018-07-22 20:06:07.414637 2018-07-22 20:06:07.414637  3  8   9.019944
2018-07-22 20:06:07.416465 2018-07-22 20:06:07.416465  3  9   4.382123
2018-07-22 20:06:07.418866 2018-07-22 20:06:07.418866  4  8   2.713900
2018-07-22 20:06:07.420706 2018-07-22 20:06:07.420706  4  9  27.358240

You see the columns ‘a’ and ‘b’, the column ‘result’ (returned by func) and a number of reserved fields for book-keeping such as

_run_id
_pset_id
_calc_dir
_time_utc

and a timestamped index.

Observe that one call ps.run(func, params) creates one _run_id – a UUID identifying this run. Inside that, each call func(pset) creates a unique _pset_id, a timestamp and a new row in the DataFrame (the database).
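
You can verify this with plain pandas on the df from above:

>>> df._run_id.unique()    # one ps.run call -> one value
>>> df._pset_id.nunique()  # 8 psets -> 8 unique values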

Concepts

The basic data structure for a param study is a list params of dicts (called “parameter sets” or psets for short).

params = [{'a': 1, 'b': 'lala'},  # pset 1
          {'a': 2, 'b': 'zzz'},   # pset 2
          ...                     # ...
         ]

Each pset contains values of parameters (‘a’ and ‘b’) which are varied during the parameter study.

You need to define a callback function func, which takes exactly one pset such as:

{'a': 1, 'b': 'lala'}

and runs the workload for that pset. func must return a dict, for example:

{'result': 1.234}

or an updated pset:

{'a': 1, 'b': 'lala', 'result': 1.234}

We always merge (dict.update) the result of func with the pset, which gives you flexibility in what to return from func.
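
Conceptually, the merge is nothing more than this sketch (an illustration of the idea, not psweep’s literal internals):

row = pset.copy()       # keep the original pset intact
row.update(func(pset))  # func's returned keys win on collision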

The psets form the rows of a pandas DataFrame, which we use to store the pset and the result from each func(pset).

The idea is now to run func in a loop over all psets in params. You do this using the ps.run helper function. The function adds some special columns such as _run_id (once per ps.run call) or _pset_id (once per pset). Using ps.run(... poolsize=...) runs func in parallel on params using multiprocessing.Pool.
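
For example (note that for multiprocessing, func must be picklable, i.e. defined at module level):

df = ps.run(func, params, poolsize=4)  # run the psets on 4 worker processes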

This package offers some very simple helper functions which assist in creating params. Basically, we define the to-be-varied parameters (‘a’ and ‘b’) and then use something like itertools.product to loop over them to create params, which is passed to ps.run to actually perform the loop over all psets.

>>> from itertools import product
>>> from psweep import psweep as ps
>>> x=ps.seq2dicts('x', [1,2,3])
>>> y=ps.seq2dicts('y', ['xx','yy','zz'])
>>> x
[{'x': 1}, {'x': 2}, {'x': 3}]
>>> y
[{'y': 'xx'}, {'y': 'yy'}, {'y': 'zz'}]
>>> ps.loops2params(product(x,y))
[{'x': 1, 'y': 'xx'},
 {'x': 1, 'y': 'yy'},
 {'x': 1, 'y': 'zz'},
 {'x': 2, 'y': 'xx'},
 {'x': 2, 'y': 'yy'},
 {'x': 2, 'y': 'zz'},
 {'x': 3, 'y': 'xx'},
 {'x': 3, 'y': 'yy'},
 {'x': 3, 'y': 'zz'}]

The logic of the param study is entirely contained in the creation of params. E.g., if two parameters (say x and y) shall be varied together, then instead of

>>> product(x,y,z)

use

>>> product(zip(x,y), z)

The nesting from zip() is flattened in loops2params().

>>> z=ps.seq2dicts('z', [None, 1.2, 'X'])
>>> ps.loops2params(product(zip(x,y),z))
[{'x': 1, 'y': 'xx', 'z': None},
 {'x': 1, 'y': 'xx', 'z': 1.2},
 {'x': 1, 'y': 'xx', 'z': 'X'},
 {'x': 2, 'y': 'yy', 'z': None},
 {'x': 2, 'y': 'yy', 'z': 1.2},
 {'x': 2, 'y': 'yy', 'z': 'X'},
 {'x': 3, 'y': 'zz', 'z': None},
 {'x': 3, 'y': 'zz', 'z': 1.2},
 {'x': 3, 'y': 'zz', 'z': 'X'}]

If you want a parameter which is constant, use a list of length one:

>>> c=ps.seq2dicts('c', ['const'])
>>> ps.loops2params(product(zip(x,y),z,c))
[{'c': 'const', 'x': 1, 'y': 'xx', 'z': None},
 {'c': 'const', 'x': 1, 'y': 'xx', 'z': 1.2},
 {'c': 'const', 'x': 1, 'y': 'xx', 'z': 'X'},
 {'c': 'const', 'x': 2, 'y': 'yy', 'z': None},
 {'c': 'const', 'x': 2, 'y': 'yy', 'z': 1.2},
 {'c': 'const', 'x': 2, 'y': 'yy', 'z': 'X'},
 {'c': 'const', 'x': 3, 'y': 'zz', 'z': None},
 {'c': 'const', 'x': 3, 'y': 'zz', 'z': 1.2},
 {'c': 'const', 'x': 3, 'y': 'zz', 'z': 'X'}]

So, as you can see, the general idea is that we do all the loops before running any workload, i.e. we assemble the parameter grid to be sampled before the actual calculations. This has proven to be very practical, as it helps to detect errors early.
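
Since params is just a list of dicts, sanity-checking the grid before the run costs nothing, for example:

print(len(params))       # expected number of psets
for pset in params[:3]:  # eyeball the first few entries
    print(pset)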

You are, by the way, of course not restricted to using itertools.product. You can use any complicated manual loop you can come up with. The point is: you generate params, we run the study.
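
For instance, a hand-written comprehension that skips a combination which makes no sense for your workload (the filter here is hypothetical, just to illustrate):

params = [{'x': x, 'y': y}
          for x in [1, 2, 3]
          for y in ['xx', 'yy', 'zz']
          if not (x == 3 and y == 'zz')]  # drop one invalid combination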

_pset_id, _run_id and repeated runs

See examples/vary_2_params_repeat.py.

It is important to understand the difference between the two special fields _run_id and _pset_id, the most important one being _pset_id.

Both are random UUIDs. They are used to uniquely identify things.

By default, ps.run() writes a database calc/results.pk (a pickled DataFrame) with the default calc_dir='calc'. If you run ps.run() again

df = ps.run(func, params)
df = ps.run(func, other_params)

it will read and append to that file. The same happens in an interactive session when you pass in df again:

df = ps.run(func, params) # default is df=None -> create empty df
df = ps.run(func, other_params, df=df)

Once per ps.run call, a _run_id is created. This means that when you call ps.run multiple times using the same database, as just shown, you will see multiple (in this case two) _run_id values.

_run_id                               _pset_id
afa03dab-071e-472d-a396-37096580bfee  21d2185d-b900-44b3-a98d-4b8866776a77
afa03dab-071e-472d-a396-37096580bfee  3f63742b-6457-46c2-8ed3-9513fe166562
afa03dab-071e-472d-a396-37096580bfee  1a812d67-0ffc-4ab1-b4bb-ad9454f91050
afa03dab-071e-472d-a396-37096580bfee  995f5b0b-f9a6-45ee-b4d1-5784a25be4c6
e813db52-7fb9-4777-a4c8-2ce0dddc283c  7b5d8f76-926c-44e2-a0e3-2e68deb86abb
e813db52-7fb9-4777-a4c8-2ce0dddc283c  f46bb714-4677-4a11-b371-dd2d41a83d19
e813db52-7fb9-4777-a4c8-2ce0dddc283c  5fdcc88b-d467-4117-aa03-fd256656299b
e813db52-7fb9-4777-a4c8-2ce0dddc283c  8c5c07ca-3862-4726-a7d0-15d60e281407

Each ps.run call in turn calls func(pset) for each pset in params. Each func invocation creates a unique _pset_id. Thus, we have a very simple, yet powerful one-to-one mapping and a way to refer to a specific pset.
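
For instance, to pull the database row of one specific pset (UUID copied from the listing above):

>>> df[df._pset_id == '21d2185d-b900-44b3-a98d-4b8866776a77']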

Best practices

The following workflows and practices come from experience. They are, if you will, the “framework” for how to do things. However, we decided to not codify any of these ideas but to only provide tools to make them happen easily, because you will probably have quite different requirements and workflows.

Please also have a look at the examples/ dir where we document these and more common workflows.

Save data on disk, use UUIDs

See examples/save_data_on_disk.py.

Assume that you need to save results from a run not only in the returned dict from func (or even not at all!) but on disk, for instance when you call an external program which saves data on disk. Consider this example:

import os
import subprocess
from psweep import psweep as ps


def func(pset):
    fn = os.path.join(pset['_calc_dir'],
                      pset['_pset_id'],
                      'output.txt')
    cmd = "mkdir -p $(dirname {fn}); echo {a} > {fn}".format(a=pset['a'],
                                                             fn=fn)
    pset['cmd'] = cmd
    subprocess.run(cmd, shell=True)
    return pset

In this case, you call an external program (here a dummy shell command) which saves its output on disk. Note that we don’t return any output from the external command from func. We only update pset with the shell cmd we call to have that in the database.

Also note how we use the special fields _pset_id and _calc_dir, which are added in ps.run to pset before func is called.

After the run, we have four dirs, one per pset, each simply named after its _pset_id:

calc
├── 63b5daae-1b37-47e9-a11c-463fb4934d14
│   └── output.txt
├── 657cb9f9-8720-4d4c-8ff1-d7ddc7897700
│   └── output.txt
├── d7849792-622d-4479-aec6-329ed8bedd9b
│   └── output.txt
├── de8ac159-b5d0-4df6-9e4b-22ebf78bf9b0
│   └── output.txt
└── results.pk

This is a useful pattern. History has shown that in the end, most naming conventions start simple but turn out to be inflexible and hard to adapt later on. I have seen people write scripts which create things like:

calc/param_a=1.2_param_b=66.77
calc/param_a=3.4_param_b=88.99

i.e. encode the parameter values in path names, because they don’t have a database. Good luck parsing that. I don’t say this cannot be done – sure it can (in fact the example above is easy to parse). It is just not fun – and there is no need to. What if you need to add a “column” for parameter ‘c’ later? Impossible (well, painful at least). This approach makes sense for very quick throw-away test runs, but gets out of hand quickly.

Since we have a database, we can simply drop all data in calc/<_pset_id> and be done with it. Each parameter set is identified by a UUID that will never change. This is the only kind of naming convention which makes sense in the long run.
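
This also makes it trivial to walk from database rows to the data on disk, using only the reserved columns (plain pandas and os, with the output.txt name from the example above):

import os

for _, row in df.iterrows():
    path = os.path.join(row['_calc_dir'], row['_pset_id'], 'output.txt')
    print(path, os.path.exists(path))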

Iterative extension of a parameter study

See examples/{10,20}multiple_1d_scans_with_backup.py.

We recommend always using backup_calc_dir:

df = ps.run(func, params, backup_calc_dir=True)

backup_calc_dir will save a copy of the old calc_dir to calc_<last_date_in_old_database>, i.e. something like calc_2018-09-06T20:22:27.845008Z before doing anything else. That way, you can track old states of the overall study, and recover from mistakes.

For any non-trivial work, you won’t use an interactive session. Instead, you will have a driver script which defines params and starts ps.run(). Also, in a common workflow, you won’t define params and run a study only once. Instead, you will first have an idea about which parameter values to scan, start with a coarse grid of parameters, then inspect the results and identify regions where you need more data (e.g. denser sampling). Then you will modify params and run the study again. You will modify the driver script multiple times as you refine your study. To save the old states of that script, use backup_script:

df = ps.run(func, params, backup_calc_dir=True, backup_script=__file__)

backup_script will save a copy of the script which you use to drive your study to calc/backup_script/<_run_id>.py. Since each ps.run() will create a new _run_id, you will have a backup of the code which produced your results for this _run_id (without putting everything in a git repo, which may be unpleasant if your study produces large amounts of data).

Simulate / Dry-Run: look before you leap

See examples/vary_1_param_simulate.py.

While you fiddle with finding the next good params, and even when using backup_calc_dir, appending to the old database can be a hassle if you find that you made a mistake when setting up params. You then need to abort the current run, delete calc_dir and copy the last backup back:

$ rm -r calc
$ mv calc_2018-09-06T20:22:27.845008Z calc

Instead, while you tinker with params, use another calc_dir, e.g.

df = ps.run(func, params, calc_dir='calc_test')

But what’s even better: keep everything as it is and just set simulate=True:

df = ps.run(func, params, backup_calc_dir=True, backup_script=__file__,
            simulate=True)

This will copy only the database (not all the possibly large data in calc/) to calc.simulate/ and run the study there, but without actually calling func(). So you still append to your old database as in a real run, but in a safe separate dir which you can delete later.
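
Afterwards, you can inspect what a real run would have appended. The path below assumes the calc.simulate/ location described above, with the default database file name:

import pandas as pd

df_sim = pd.read_pickle('calc.simulate/results.pk')
print(df_sim.tail())  # the rows a real run would add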

Give runs names for easy post-processing

See examples/vary_1_param_study_column.py.

Post-processing is not in the scope of this package. The database is a DataFrame and that’s it. You can query it and use your full pandas ninja skills here (e.g. “give me all psets where parameter ‘a’ was between 10 and 100, while ‘b’ was constant, which were run last week and whose result was not < 0” … you get the idea).

To ease post-processing, it is a useful practice to add a constant parameter named e.g. “study” or “scan” to label a certain range of runs. If you, for instance, have 5 runs where you scan values of parameter ‘a’ while keeping parameters ‘b’ and ‘c’ constant, you’ll have 5 _run_id values. When querying the database later, you could filter by _run_id if you know the values:

>>> df = df[(df._run_id=='afa03dab-071e-472d-a396-37096580bfee') |
            (df._run_id=='e813db52-7fb9-4777-a4c8-2ce0dddc283c') |
            ...
            ]

This doesn’t look like fun. It shows that the UUIDs (_run_id and _pset_id) are rarely meant to be used directly. Instead, you should (in this example) filter by the constant values of the other parameters:

>>> df = df[(df.b==10) & (df.c=='foo')]

Much better! This is what most post-processing scripts will do.

But when you have a column “study” which holds the value 'a' for all these runs, it is just

>>> df = df[df.study=='a']

You can do more powerful things with this approach. For instance, say you vary parameters ‘a’ and ‘b’. Then you could name the “study” field ‘fine_scan=a:b’ and thereby encode which parameters (i.e. column names) you have varied. Later, in the post-processing:

>>> study = 'fine_scan=a:b'
>>> cols = study.split('=')[1].split(':')  # ['a', 'b']
>>> values = df[cols].values

So in this case, a naming convention is useful in order to bypass possibly complex database queries. But it is still flexible – you can change the “study” column at any time, or delete it again.

Pro tip: You can manipulate the database at any later point and add the “study” column after all runs have been done.

Super Pro tip: Make a backup of the database first!
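
Both tips combined in a minimal sketch, using plain pandas and assuming the default database path calc/results.pk:

import shutil
import pandas as pd

shutil.copy('calc/results.pk', 'calc/results.pk.bak')  # backup first!
df = pd.read_pickle('calc/results.pk')
df['study'] = 'a'  # label all existing rows
df.to_pickle('calc/results.pk')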

Install

$ pip3 install psweep

Dev install of this repo:

$ pip3 install -e .

See also https://github.com/elcorto/samplepkg.

Tests

# apt-get install python3-nose
$ nosetests3
