SEM: streamline your scientific experiment management
SEM: Scientific Experiment Manager
Streamline IO operations, storage, and retrieval of your scientific results
SEM helps streamline IO operations and organization of scientific results in Python.
At its core, SEM is based on regular expressions and simply creates, parses and manages intricate folder structures containing experimental results.
Minimal example
Consider the results organized in the example/example_results folder.
These are different directories containing the results of the same experiment, where two parameters are varied: the random seed and a threshold value eps. Each folder contains some output files:
example_results
│
└───seed=111
│   └───eps_1.3
│   │   └───...
│   └───eps_7.4
│       └───...
│
└───seed=222
│   └───...
│
└───seed=333
│   └───...
│
└───useless_files
SEM does not take care of loading and/or saving files.
Rather, it takes care of the folder structure, leaving the user free to choose the format of the results.
To retrieve the parameters relative to these results, ResultManager parses the folders' names and only returns the paths of those that match.
import re
from pathlib import Path
from sem.manager import ResultManager
example_res = Path("./example_results")
# The dot is escaped ("\.") so that it only matches a literal decimal point.
parsers = [re.compile(r"seed=(?P<seed_value>\d+)"), re.compile(r"eps_(?P<eps>\d+\.\d+)")]
manager = ResultManager(root_dir=example_res, parsers=parsers)
manager.parse_paths()
In the case above, the parser for seed_value expects a non-negative integer, specified by the regular expression "\d+", and the parser for eps expects a number in decimal format, such as 1.3.
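As a minimal sketch of the mechanism, the snippet below uses only the standard re module to show how a named group turns a folder name into a parameter dictionary. The helper parse_name is hypothetical and is not part of SEM; it also assumes that the whole folder name must match, which may differ from what ResultManager does internally.

```python
import re

seed_parser = re.compile(r"seed=(?P<seed_value>\d+)")
eps_parser = re.compile(r"eps_(?P<eps>\d+\.\d+)")


def parse_name(pattern, name):
    """Return the named groups captured from a folder name, or None if it does not match."""
    match = pattern.fullmatch(name)
    return match.groupdict() if match else None


seed_params = parse_name(seed_parser, "seed=111")    # {'seed_value': '111'}
eps_params = parse_name(eps_parser, "eps_1.3")       # {'eps': '1.3'}
ignored = parse_name(seed_parser, "useless_files")   # None

# An unescaped dot matches any character, so it would also accept "eps_1x3".
loose_match = re.fullmatch(r"eps_\d+.\d+", "eps_1x3")
strict_match = re.fullmatch(r"eps_\d+\.\d+", "eps_1x3")
```

The captured values are always strings, and a name that does not match simply yields no parameters.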
The results are stored in manager.df, a pandas DataFrame, which contains the parsed parameter values, as well as the path to the deepest sub-directories:
   __PATH__                          seed_value  eps
0  example_results/seed=333/eps_1.1  333         1.1
1  example_results/seed=333/eps_0.3  333         0.3
2  example_results/seed=222/eps_7.4  222         7.4
3  example_results/seed=222/eps_2.7  222         2.7
4  example_results/seed=111/eps_1.3  111         1.3
5  example_results/seed=111/eps_7.4  111         7.4
...
Directories whose names don't match the patterns are ignored, e.g. example_results/useless_files.
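The level-by-level matching described above can be sketched with the standard library alone. This is an illustration of the idea and not SEM's implementation: the function walk_matching is hypothetical, and it assumes one parser per directory level and a full match on each folder name.

```python
import re
import tempfile
from pathlib import Path


def walk_matching(root, parsers):
    """Yield (path, params) for every directory chain that matches one parser per level."""

    def recurse(current, level, params):
        if level == len(parsers):
            # All levels matched: this is one of the deepest sub-directories.
            yield current, params
            return
        for child in sorted(current.iterdir()):
            if not child.is_dir():
                continue
            match = parsers[level].fullmatch(child.name)
            if match:  # non-matching directories are silently skipped
                yield from recurse(child, level + 1, {**params, **match.groupdict()})

    yield from recurse(Path(root), 0, {})


parsers = [re.compile(r"seed=(?P<seed_value>\d+)"), re.compile(r"eps_(?P<eps>\d+\.\d+)")]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example_results"
    for sub in ["seed=111/eps_1.3", "seed=111/eps_7.4", "seed=222/eps_2.7", "useless_files"]:
        (root / sub).mkdir(parents=True)
    rows = [
        (path.relative_to(root).as_posix(), params)
        for path, params in walk_matching(root, parsers)
    ]
```

Here rows holds one entry per matching leaf directory, and useless_files never appears because its name matches no parser.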
Notice that, since they are the results of parsing, all the values in the data frame are
strings.
The conversion to a different data type can be performed after parsing:
manager.df["seed_value"] = manager.df["seed_value"].astype(int)
manager.df["eps"] = manager.df["eps"].astype(float)
Utilizing the parsed directories
Once the directory names have been parsed, the main utility of the manager is to have a
coupling between the parameters and the results.
For example, one can read and insert the computational time of every experiment in the
data frame:
def read_comp_time(res_dir):
    with open(res_dir / "computational_time.txt", "r") as file:
        time = float(file.read())
    return time
manager.df["time"] = manager.df["__PATH__"].map(read_comp_time)
From there, conventional pandas operations can be used. For example, the average computational time for seed 111 is given by
df = manager.df
times = df["time"].loc[df["seed_value"] == 111]
times.mean()
Loading more complex objects
Pandas data frames can contain arbitrary objects. For example, one can create a column of numpy arrays from a model:
import numpy as np
def load_mat(path):
    return np.load(path / "result_params.npy")
df["mat"] = df["__PATH__"].map(load_mat)
Creating default paths
Standardizing result structure reduces the amount of code needed for
simple IO operations, and eases compatibility across machines, e.g. local vs cloud or
cluster results.
To this end, SEM offers a way to create saving paths
which only depend on the parameters specified by the user.
For example, the paths of a repository with three levels and different parameters can be created as:
root_dir = Path(".") / "tmp"
for param1 in [True, False]:
    for param2 in ["a", "b"]:
        for param3 in [1, 2, 3]:
            values = [
                {"param1": param1, "param2": param2},
                "results_of_my_experiments",
                {"param3": param3},
            ]
            new_path = ResultManager.create_default_path(
                root_dir, values, auto_sort=True
            )
            new_path.mkdir(parents=True)
            print(new_path)
which produces
tmp/param1=True_param2=a/results_of_my_experiments/param3=1
tmp/param1=True_param2=a/results_of_my_experiments/param3=2
tmp/param1=True_param2=a/results_of_my_experiments/param3=3
tmp/param1=True_param2=b/results_of_my_experiments/param3=1
...
tmp/param1=False_param2=a/results_of_my_experiments/param3=1
...
If desired, the argument auto_sort imposes a uniform order at every directory level.
For example, using {"param2": param2, "param1": param1} would produce the same paths as above if auto_sort=True.
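The naming scheme visible in the output above can be reproduced with a short sketch. The function build_path is hypothetical and only mimics the observed folder names; in particular, it assumes that the uniform order imposed by auto_sort is alphabetical by parameter name, which the text does not state.

```python
from pathlib import Path


def build_path(root_dir, values, auto_sort=True):
    """Build a path with one directory level per entry of `values`."""
    path = Path(root_dir)
    for level in values:
        if isinstance(level, str):
            # Plain strings are used verbatim as directory names.
            path = path / level
        else:
            # Assumption: sorting by key makes the name independent of dict order.
            items = sorted(level.items()) if auto_sort else list(level.items())
            path = path / "_".join(f"{key}={value}" for key, value in items)
    return path


ordered = build_path(
    "tmp", [{"param1": True, "param2": "a"}, "results_of_my_experiments", {"param3": 1}]
)
swapped = build_path(
    "tmp", [{"param2": "a", "param1": True}, "results_of_my_experiments", {"param3": 1}]
)
unsorted = build_path("tmp", [{"param2": "a", "param1": True}], auto_sort=False)
```

With sorting enabled, ordered and swapped are the same path, whereas unsorted keeps the insertion order of the dictionary.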
Parsing directories with this structure is similarly easy:
manager = ResultManager.from_arguments(
    root_dir,
    arguments=[
        {"param1": "True|False", "param2": "a|b"},
        "results_of_my_experiments",
        {"param3": r"\d+"},
    ],
    auto_sort=True
)
manager.parse_paths()
which yields
   __PATH__                                           param1  param2  param3
0  tmp/param1=False_param2=b/results_of_my_experi...  False   b       1
1  tmp/param1=False_param2=b/results_of_my_experi...  False   b       3
2  tmp/param1=False_param2=b/results_of_my_experi...  False   b       2
3  tmp/param1=True_param2=b/results_of_my_experim...  True    b       1
...
Initialization
Notice that the advantage of using the default directory naming, as opposed to a custom one, is that the ResultManager can be initialized as above, by only specifying the arguments in ResultManager.from_arguments.
A more flexible initialization for custom paths can be performed by giving regular expression patterns as input. For example, an equivalent initialization to that above is given by:
parsers = [
    re.compile(r"param1=(?P<param1>True|False)_param2=(?P<param2>a|b)"),
    re.compile(r"results_of_my_experiments"),
    # Raw string, so that "\d" reaches the regex engine unchanged.
    re.compile(r"param3=(?P<param3>\d+)"),
]
manager = ResultManager(root_dir, parsers)
manager.parse_paths()
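To make the equivalence between the two initializations concrete, the following sketch shows one way a list of arguments could be translated into compiled patterns. It is an illustration, not SEM's actual code: the helper arguments_to_parsers, the alphabetical ordering used for auto_sort, and the escaping of plain strings are all assumptions.

```python
import re


def arguments_to_parsers(arguments, auto_sort=True):
    """Translate a list of dicts and strings into one compiled regex per directory level."""
    parsers = []
    for level in arguments:
        if isinstance(level, str):
            # A plain string is treated as a literal directory name.
            parsers.append(re.compile(re.escape(level)))
        else:
            items = sorted(level.items()) if auto_sort else list(level.items())
            # Each parameter becomes "name=(?P<name>regex)", joined by underscores.
            pattern = "_".join(f"{name}=(?P<{name}>{regex})" for name, regex in items)
            parsers.append(re.compile(pattern))
    return parsers


parsers = arguments_to_parsers(
    [
        {"param1": "True|False", "param2": "a|b"},
        "results_of_my_experiments",
        {"param3": r"\d+"},
    ]
)
first_level = parsers[0].fullmatch("param1=True_param2=b").groupdict()
```

Because each value regex is wrapped in its own named group, an alternation such as True|False stays confined to its parameter.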
Other utilities and tricks
Filtering results
Another useful ResultManager method is ResultManager.filter_results. This method filters the rows of the results' data frame. Results can be selected by specifying exact column values or a list of possible values. For example, for a manager whose data frame has columns
   __PATH__                                           param1  param2  param3
0  tmp/param1=False_param2=b/results_of_my_experi...  False   b       1
1  tmp/param1=False_param2=b/results_of_my_experi...  False   b       3
2  tmp/param1=False_param2=b/results_of_my_experi...  False   b       2
3  tmp/param1=True_param2=b/results_of_my_experim...  True    b       1
...
the query
manager.filter_results(
    equal={"param1": True},
    contained={"param3": [1, 3]}
)
yields a filtered data frame
    __PATH__                                           param1  param2  param3
3   tmp/param1=True_param2=b/results_of_my_experim...  True    b       1
4   tmp/param1=True_param2=b/results_of_my_experim...  True    b       3
9   tmp/param1=True_param2=a/results_of_my_experim...  True    a       1
10  tmp/param1=True_param2=a/results_of_my_experim...  True    a       3
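The semantics of the two keyword arguments can be sketched on plain dictionaries, without pandas. The function filter_rows is hypothetical and only mirrors the behavior described above; it also assumes that the parameter values have already been converted from strings to bool and int.

```python
def filter_rows(rows, equal=None, contained=None):
    """Keep rows whose columns equal the given values and lie in the given lists."""
    equal = equal or {}
    contained = contained or {}
    return [
        row
        for row in rows
        if all(row[column] == value for column, value in equal.items())
        and all(row[column] in allowed for column, allowed in contained.items())
    ]


rows = [
    {"param1": False, "param2": "b", "param3": 1},
    {"param1": True, "param2": "b", "param3": 1},
    {"param1": True, "param2": "b", "param3": 2},
    {"param1": True, "param2": "b", "param3": 3},
    {"param1": True, "param2": "a", "param3": 1},
    {"param1": True, "param2": "a", "param3": 2},
]
selected = filter_rows(rows, equal={"param1": True}, contained={"param3": [1, 3]})
```

All conditions are combined with a logical AND, so a row is kept only if it satisfies every entry of both dictionaries.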
Loading fewer results
While results can be filtered a posteriori as just explained, one can also load fewer results in the first place.
This is done by specifying a more restrictive regular expression parser.
For example, to select only configurations where param1 is equal to True, one can write
parsers = [
    # Only "True" is accepted for param1, so "param1=False_..." folders are skipped.
    re.compile(r"param1=(?P<param1>True)_param2=(?P<param2>a|b)"),
    re.compile(r"results_of_my_experiments"),
    re.compile(r"param3=(?P<param3>\d+)"),
]
manager = ResultManager(root_dir, parsers)
In general, any regular expression with named groups is considered valid; check the docs for further details.
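The effect of a restrictive pattern can be checked directly with the re module. This sketch assumes that folder names must match the pattern in full; the list of names is made up for illustration.

```python
import re

restricted = re.compile(r"param1=(?P<param1>True)_param2=(?P<param2>a|b)")

# Hypothetical first-level folder names.
names = [
    "param1=True_param2=a",
    "param1=True_param2=b",
    "param1=False_param2=a",
    "param1=False_param2=b",
]
kept = [name for name in names if restricted.fullmatch(name)]
```

Only the two folders with param1=True remain in kept, so the other results are never visited.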
Common parsing patterns
Some common regular expression patterns are available at sem.re_patterns.
These are strings that can be utilized for initializing parsers:
from sem.re_patterns import INT_PATTERN
parsers = [
    re.compile("param1=(?P<param1>True|False)_param2=(?P<param2>a|b)"),
    re.compile("results_of_my_experiments"),
    re.compile(f"param3=(?P<param3>{INT_PATTERN})"),
]
manager = ResultManager(root_dir, parsers)
or ResultManager arguments:
manager = ResultManager.from_arguments(
    root_dir,
    arguments=[
        {"param1": "True|False", "param2": "a|b"},
        "results_of_my_experiments",
        {"param3": INT_PATTERN},
    ],
)
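Reusable pattern strings of this kind are easy to define for a custom project. The constants below are hypothetical examples and are not the values shipped in sem.re_patterns, whose exact contents are not shown here.

```python
import re

# Hypothetical pattern strings, defined here only for illustration.
EXAMPLE_INT_PATTERN = r"[+-]?\d+"
EXAMPLE_FLOAT_PATTERN = r"[+-]?\d+\.\d+"

# Pattern strings are interpolated inside a named group with an f-string.
int_parser = re.compile(f"param3=(?P<param3>{EXAMPLE_INT_PATTERN})")
float_parser = re.compile(f"eps_(?P<eps>{EXAMPLE_FLOAT_PATTERN})")

int_value = int_parser.fullmatch("param3=12").group("param3")
float_value = float_parser.fullmatch("eps_7.4").group("eps")
rejected = int_parser.fullmatch("param3=1.5")
```

A pattern that contains curly-brace quantifiers such as {2,3} would be safer to join by string concatenation, since braces have a special meaning inside f-strings.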
Common type conversion from string
Some common type conversion functions from string are available at sem.str_to_type.
These are useful in combination with the argparse package, for command line inputs:
from argparse import ArgumentParser
from sem.str_to_type import bool_type, unit_float_or_positive_integer, none_or_type
parser = ArgumentParser()
parser.add_argument("--flag", type=bool_type)
parser.add_argument("--train_size", type=unit_float_or_positive_integer)
parser.add_argument("--K", type=none_or_type(int))
Importantly, bool_type correctly converts the string inputs "0" and "1", as well as the case-insensitive strings "true", "True", "False", etc.
Alternatively, these functions can also be used for type conversion inside pandas data frames:
manager = ResultManager(root_dir, parsers)
manager.parse_paths()
manager.df["flag"] = manager.df["flag"].map(bool_type)
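Converters like these matter because the built-in bool treats every non-empty string as True, so type=bool in argparse would turn "False" into True. The sketch below shows how such converters could be written; my_bool_type and my_none_or_type are hypothetical stand-ins and not the functions from sem.str_to_type, and the accepted spellings are assumptions based on the description above.

```python
from argparse import ArgumentParser, ArgumentTypeError


def my_bool_type(text):
    """Convert "0"/"1" and case-insensitive "true"/"false" to a bool."""
    lowered = text.strip().lower()
    if lowered in {"1", "true"}:
        return True
    if lowered in {"0", "false"}:
        return False
    raise ArgumentTypeError(f"cannot interpret {text!r} as a boolean")


def my_none_or_type(convert):
    """Return a converter that maps "none" to None and defers to `convert` otherwise."""

    def parse(text):
        return None if text.strip().lower() == "none" else convert(text)

    return parse


parser = ArgumentParser()
parser.add_argument("--flag", type=my_bool_type)
parser.add_argument("--K", type=my_none_or_type(int))
args = parser.parse_args(["--flag", "False", "--K", "none"])

naive = bool("False")  # True: any non-empty string is truthy
```

Raising ArgumentTypeError lets argparse report an invalid value as a regular usage error instead of a traceback.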
Installation
You can install this package by downloading the GitHub repository and, from inside the downloaded folder, running
pip install .