Save everything in a filterable way
What is this?
I needed an efficient data logger for my machine learning experiments. Specifically one that
- could log in a hierarchical way (not one big global logging variable)
- while still having a flat table-like structure for performing queries/summaries
- without having tons of duplicated data
This library would likely work well with PySpark
What is a Use-case Example?
Let's say you're going to perform
- 3 experiments
- each experiment has 10 episodes
- each episode has 100,000 timesteps
- there is an x and a y value at each timestep
Example goal:
- We want to get the average x value across all timesteps in episode 2 (we don't care what experiment they're from)
Our timestep data could look like:
record1 = { "x":1, "y":1 } # first timestep
record2 = { "x":2, "y":2 } # second timestep
record3 = { "x":3, "y":3 } # third timestep
Problem
Those records don't contain the experiment number or the episode number (and we need those for our goal)
Bad Solution
Duplicating the parent data in every record would provide a flat structure, but (for 100,000 timesteps) that's a huge memory cost
record1 = { "x":1, "y":1, "episode":1, "experiment": 1, } # first timestep
record2 = { "x":2, "y":2, "episode":1, "experiment": 1, } # second timestep
record3 = { "x":3, "y":3, "episode":1, "experiment": 1, } # third timestep
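To put a rough number on that cost, here is a small sketch using the sizes from the use-case above. It only counts stored values; it is not a real memory measurement, and the variable names are made up for illustration.

```python
# Counts stored values only; actual memory use depends on the Python runtime.
experiments = 3
episodes_per_experiment = 10
timesteps_per_episode = 100_000

total_records = experiments * episodes_per_experiment * timesteps_per_episode

# The flat layout repeats "episode" and "experiment" in every single record.
duplicated_values = total_records * 2

# Stored once per parent instead: one value per experiment, one per episode.
shared_values = experiments + experiments * episodes_per_experiment

print(total_records)      # 3000000
print(duplicated_values)  # 6000000
print(shared_values)      # 33
```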
Good-ish Solution
We could use references to be both more efficient and allow adding parent data after the fact
# parent data
experiment_data = { "experiment": 1 }
episode_data = { "episode":1, "parent": experiment_data }
record1 = { "x":1, "y":1, "parent": episode_data } # first timestep
record2 = { "x":2, "y":2, "parent": episode_data } # second timestep
record3 = { "x":3, "y":3, "parent": episode_data } # third timestep
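With references, a query has to walk up the "parent" chain to find inherited values. A minimal sketch of that lookup with plain dictionaries follows; the lookup helper and the "seed" key are hypothetical, they are not part of this library.

```python
def lookup(record, key):
    """Return the value for key from the record or its nearest ancestor."""
    current = record
    while current is not None:
        if key in current:
            return current[key]
        current = current.get("parent")
    raise KeyError(key)

experiment_data = {"experiment": 1}
episode_data = {"episode": 1, "parent": experiment_data}
record1 = {"x": 1, "y": 1, "parent": episode_data}

print(lookup(record1, "x"))           # 1
print(lookup(record1, "experiment"))  # 1

# Parent data added after the fact is visible through existing records.
experiment_data["seed"] = 42          # "seed" is a made-up key
print(lookup(record1, "seed"))        # 42
```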
We could reduce the cost of key duplication by having shared keys
# parent data
experiment_data = { "experiment": 1 }
episode_data = { "episode":1, "parent": experiment_data }
episode_keeper = {"parent": episode_data} # timestep 0
episode_keeper = { "x":[1], "y":[1], "parent": episode_data} # first timestep (keys added on-demand)
episode_keeper = { "x":[1,2], "y":[1,2], "parent": episode_data} # second timestep
episode_keeper = { "x":[1,2,3], "y":[1,2,3], "parent": episode_data} # third timestep
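The shared-key layout can still be read back as flat rows. The sketch below shows one way to do that, assuming every column has the same length; iter_rows is a hypothetical helper, not something this library provides.

```python
def iter_rows(keeper):
    """Yield one flat dict per timestep from a column-style keeper."""
    # Collect inherited values by walking up the parent chain (nearest wins).
    inherited = {}
    parent = keeper.get("parent")
    while parent is not None:
        for key, value in parent.items():
            if key != "parent":
                inherited.setdefault(key, value)
        parent = parent.get("parent")

    # Assumes all columns have equal length, so zip lines them up by timestep.
    columns = {key: values for key, values in keeper.items() if key != "parent"}
    names = list(columns)
    for values in zip(*columns.values()):
        yield {**inherited, **dict(zip(names, values))}

experiment_data = {"experiment": 1}
episode_data = {"episode": 1, "parent": experiment_data}
episode_keeper = {"x": [1, 2, 3], "y": [1, 2, 3], "parent": episode_data}

rows = list(iter_rows(episode_keeper))
print(len(rows))  # 3
print(rows[0])    # {'episode': 1, 'experiment': 1, 'x': 1, 'y': 1}
```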
How does Rigorous Recorder Fix This?
The "Good-ish Solution" above is still crude; this library cleans it up
- The Recorder class in this library is the core/pure data structure
- The ExperimentCollection class automates common boilerplate for saving (python pickle), catching errors, managing experiments, etc
from rigorous_recorder import Recorder
recorder = Recorder()
# parent data
experiment_recorder = Recorder(experiment=1).set_parent(recorder)
episode_recorder = Recorder(episode=1).set_parent(experiment_recorder)
episode_recorder.push(x=1, y=1) # timestep1
episode_recorder.push(x=2, y=2) # timestep2
episode_recorder.push(x=3, y=3) # timestep3
recorder.save_to("where/ever/you_want.pickle")
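To show the idea without depending on the library, here is a toy stand-in that logs hierarchically and then reads everything back as flat records, answering the example goal from earlier (average x across episode 2, regardless of experiment). ToyRecorder and its records method are hypothetical; this is not how the library is implemented, and the sizes are shrunk to keep the output small.

```python
from statistics import mean

class ToyRecorder:
    """Hypothetical stand-in, NOT the library's Recorder implementation."""

    def __init__(self, **shared):
        self.shared = shared   # values stored once for every row below
        self.children = []
        self.rows = []         # per-timestep data only

    def set_parent(self, parent):
        parent.children.append(self)
        return self            # returning self allows chaining

    def push(self, **row):
        self.rows.append(row)

    def records(self, inherited=None):
        """Yield flat dicts: ancestors' shared values merged into each row."""
        context = {**(inherited or {}), **self.shared}
        for row in self.rows:
            yield {**context, **row}
        for child in self.children:
            yield from child.records(context)

root = ToyRecorder()
for experiment in (1, 2, 3):
    experiment_recorder = ToyRecorder(experiment=experiment).set_parent(root)
    for episode in (1, 2):
        episode_recorder = ToyRecorder(episode=episode).set_parent(experiment_recorder)
        for timestep in range(3):
            episode_recorder.push(x=experiment * 10 + timestep, y=timestep)

episode_2_x = [each["x"] for each in root.records() if each["episode"] == 2]
print(len(episode_2_x))   # 9
print(mean(episode_2_x))  # 21
```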
How do I use this?
pip install rigorous-recorder
Super simple usage:
from rigorous_recorder import RecordKeeper
record_keeper = RecordKeeper().live_write_to("where/ever/you_want.yaml", as_yaml=True)
record_keeper.push(x=1, y=1)
Project/Experiment collection usage:
from rigorous_recorder import RecordKeeper, ExperimentCollection
from statistics import mean as average
from random import random, sample, choices
collection = ExperimentCollection("data/my_study") # <- filepath
number_of_new_experiments = 1
for _ in range(number_of_new_experiments):
    # at the end (even when an error is thrown), all data is saved to disk automatically
    # experiment number increments based on the last saved-to-disk experiment number
    # running again (after error) won't double-increment the experiment number (same number until non-error run is achieved)
    with collection.new_experiment() as experiment_recorder:
        # we can create a hierarchy like this:
        #
        #                          experiment_recorder
        #                           /              \
        #               model1_recorder           model2_recorder
        #                /        |                 |           \
        # m1_train_recorder m1_test_recorder   m2_test_recorder m2_train_recorder
        #
        model1_recorder = RecordKeeper(model="model1").set_parent(experiment_recorder)
        model2_recorder = RecordKeeper(model="model2").set_parent(experiment_recorder)
        #
        # training
        #
        model1_train_recorder = RecordKeeper(training=True).set_parent(model1_recorder)
        model2_train_recorder = RecordKeeper(training=True).set_parent(model2_recorder)
        for each_index in range(100_000):
            # one approach
            model1_train_recorder.push(index=each_index, loss=random())
            # alternative approach (same outcome)
            model2_train_recorder.add(index=each_index)
            # - this way is very handy for adding data in one method (like a loss func)
            #   while calling .commit() in a different method (like update weights)
            model2_train_recorder.add({ "loss": random() })
            model2_train_recorder.commit()
        #
        # testing
        #
        model1_test_recorder = RecordKeeper(testing=True).set_parent(model1_recorder)
        model2_test_recorder = RecordKeeper(testing=True).set_parent(model2_recorder)
        for each_index in range(500):
            # one method
            model1_test_recorder.push(
                index=each_index,
                accuracy=random(),
            )
            # alternative way (same outcome)
            model2_test_recorder.add(index=each_index, accuracy=random())
            model2_test_recorder.commit()
#
#
# Analysis
#
#
all_records = collection.records
print("first record", all_records[0]) # behaves just like a regular dictionary
# slice across both models (first 500 training records from both models)
records_first_half_of_time = tuple(each for each in all_records if each["training"] and each["index"] < 500)
# average loss across both models
first_half_average_loss = average(tuple(each["loss"] for each in records_first_half_of_time))
# average only for model 1
model1_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model1"))
# average only for model 2
model2_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model2"))
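The example above says that push and add followed by commit give the same outcome. A rough way to picture that is a staging dictionary that collects values until commit turns it into a row. StagedLog below is a hypothetical sketch of that idea, not the library's code.

```python
class StagedLog:
    """Hypothetical sketch of add/commit staging, not the library's code."""

    def __init__(self):
        self.pending = {}   # values collected for the row in progress
        self.rows = []      # committed rows

    def add(self, mapping=None, **values):
        # Accept either a dict or keyword arguments.
        self.pending.update(mapping or {})
        self.pending.update(values)
        return self

    def commit(self):
        self.rows.append(self.pending)
        self.pending = {}
        return self

    def push(self, mapping=None, **values):
        # push is add followed immediately by commit
        return self.add(mapping, **values).commit()

log_a = StagedLog()
log_b = StagedLog()
for index in range(3):
    loss = 1 / (index + 1)
    log_a.push(index=index, loss=loss)
    log_b.add(index=index)       # e.g. inside a loss function
    log_b.add({"loss": loss})    # e.g. somewhere else entirely
    log_b.commit()               # e.g. after the weight update

print(log_a.rows == log_b.rows)  # True
```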
What are some other details?
The ExperimentCollection adds 6 keys as a parent to every record:
experiment_number # int
error_number # int, is only incremented for back-to-back error runs
had_error # boolean for easy filtering
experiment_start_time # the output of time.time() from python's time module
experiment_end_time # the output of time.time() from python's time module
experiment_duration # the difference between start and end (for easy graphing/filtering)
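These keys make it easy to drop failed runs before summarizing. The sketch below uses hand-written dictionaries that mimic the six keys; all values (including the error_number values and the extra "loss" key) are invented for illustration, real ones come from the library.

```python
from statistics import mean

# Invented sample rows; only the six key names come from the list above.
records = [
    {"experiment_number": 1, "error_number": 0, "had_error": False,
     "experiment_start_time": 100.0, "experiment_end_time": 160.0,
     "experiment_duration": 60.0, "loss": 0.5},
    {"experiment_number": 2, "error_number": 1, "had_error": True,
     "experiment_start_time": 200.0, "experiment_end_time": 205.0,
     "experiment_duration": 5.0, "loss": 0.75},
    {"experiment_number": 2, "error_number": 0, "had_error": False,
     "experiment_start_time": 300.0, "experiment_end_time": 340.0,
     "experiment_duration": 40.0, "loss": 0.25},
]

# had_error is a boolean, so filtering out failed runs is a one-liner.
clean = [each for each in records if not each["had_error"]]

print(len(clean))                                        # 2
print(mean(each["experiment_duration"] for each in clean))  # 50.0
print(mean(each["loss"] for each in clean))              # 0.375
```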