Save everything in a filterable way


Project description

What is this?

I needed an efficient data logger for my machine learning experiments. Specifically, one that:

- could log in a hierarchical way (not one big global logging variable)
- while still having a flat table-like structure for performing queries/summaries
- without having tons of duplicated data

Because the records end up in a flat, table-like structure, this library should work well with PySpark

What is a Use-case Example?

Let's say you're going to perform:

- 3 experiments
- each experiment has 10 episodes
- each episode has 100,000 timesteps
- there is an x and a y value at each timestep

Example goal:

We want to get the average x value across all timesteps in episode 2 (I don't care what experiment they're from)

Our timestep data could look like:

record1 = { "x":1, "y":1 } # first timestep
record2 = { "x":2, "y":2 } # second timestep
record3 = { "x":3, "y":3 } # third timestep

Problem

Those records don't contain the experiment number or the episode number (which we need for our goal)

Bad Solution

Duplicating the data would provide a flat structure, but (for 100,000 timesteps) that's a huge memory cost

record1 = { "x":1, "y":1, "episode":1, "experiment": 1, } # first timestep
record2 = { "x":2, "y":2, "episode":1, "experiment": 1, } # second timestep
record3 = { "x":3, "y":3, "episode":1, "experiment": 1, } # third timestep
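
For context, the flat structure is attractive because the example goal becomes a single filter. The sketch below uses plain dictionaries only; the `records` list and its values are made up for illustration and are much smaller than the real 100,000 timesteps.

```python
from statistics import mean

# hypothetical flat records: every timestep repeats its episode and experiment
records = [
    {"x": x, "y": x, "episode": episode, "experiment": experiment}
    for experiment in (1, 2, 3)
    for episode in (1, 2)
    for x in (1, 2, 3)
]

# average x across all timesteps in episode 2, regardless of experiment
episode_2_xs = [each["x"] for each in records if each["episode"] == 2]
average_x = mean(episode_2_xs)
```

The query is simple, but every one of the records carries its own copy of the `episode` and `experiment` values, which is the cost described above.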

Good-ish Solution

We can use references, which is both more efficient and allows editing the parent data after the fact

# parent data
experiment_data = { "experiment": 1 }
episode_data    = { "episode":1, }

record1 = { "x":1, "y":1, "parents": [experiment_data, episode_data] } # first timestep
record2 = { "x":2, "y":2, "parents": [experiment_data, episode_data] } # second timestep
record3 = { "x":3, "y":3, "parents": [experiment_data, episode_data] } # third timestep
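
To query records stored this way, the parents have to be merged back in at read time. The following is a minimal sketch of that idea with plain dictionaries; `flatten` is a hypothetical helper written for this example, not something provided by the library.

```python
from statistics import mean

def flatten(record):
    """Merge a record's parent dictionaries into one flat dictionary (hypothetical helper)."""
    flat = {}
    for parent in record.get("parents", []):
        flat.update(parent)
    # the record's own values win over anything inherited from a parent
    flat.update({key: value for key, value in record.items() if key != "parents"})
    return flat

# parent data, stored once and shared by reference
experiment_data = {"experiment": 1}
episode_data = {"episode": 2}
records = [
    {"x": x, "y": x, "parents": [experiment_data, episode_data]}
    for x in (1, 2, 3)
]

# editing a parent after the fact is visible from every record that references it
experiment_data["learning_rate"] = 0.01

flat_records = [flatten(each) for each in records]
average_x = mean(each["x"] for each in flat_records if each["episode"] == 2)
```

Each record holds only two references instead of copies of the parent values, and the flat view is rebuilt only when a query needs it.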

How does Rigorous Recorder help?

The "Good-ish Solution" above is still very crude

The RecordKeeper class in this library provides a much cleaner implementation.
The ExperimentCollection class helps a lot with saving, handling errors, managing experiments, etc.

from rigorous_recorder import RecordKeeper
keeper = RecordKeeper()

# parent data
experiment_keeper = keeper.sub_record_keeper(experiment=1)
episode_keeper    = experiment_keeper.sub_record_keeper(episode=1)

episode_keeper.add_record({ "x":1, "y":1, }) # timestep1
episode_keeper.add_record({ "x":2, "y":2, }) # timestep2
episode_keeper.add_record({ "x":3, "y":3, }) # timestep3
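
To make the idea of nested keepers concrete, here is a toy model of how such a hierarchy could work. This is an assumption-laden sketch, not the library's implementation: `ToyKeeper`, `sub_keeper`, `inherited`, and `flat_records` are hypothetical names invented for this example.

```python
class ToyKeeper:
    """Illustrative stand-in for a hierarchical record keeper (not the library's code)."""

    def __init__(self, parent=None, **shared):
        self.parent = parent
        self.shared = shared  # stored once, however many records are added
        # every keeper in one tree appends to the same list, so queries stay flat
        self.entries = parent.entries if parent is not None else []

    def sub_keeper(self, **shared):
        return ToyKeeper(parent=self, **shared)

    def add_record(self, data):
        # store only the new values plus a reference to the keeper that owns them
        self.entries.append((self, data))

    def inherited(self):
        # walk up the tree; values closer to the record override ancestors
        values = self.parent.inherited() if self.parent is not None else {}
        values.update(self.shared)
        return values

    def flat_records(self):
        return [{**keeper.inherited(), **data} for keeper, data in self.entries]


root = ToyKeeper()
episode = root.sub_keeper(experiment=1).sub_keeper(episode=1)
episode.add_record({"x": 1, "y": 1})
episode.add_record({"x": 2, "y": 2})
flat = root.flat_records()
```

The shared values (`experiment`, `episode`) live on the keepers rather than on each record, yet `flat` still looks like a table with one row per timestep.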

How do I use this?

pip install rigorous-recorder

from rigorous_recorder import RecordKeeper, ExperimentCollection

from statistics import mean as average
from random import random, sample, choices

collection = ExperimentCollection("records/my_study") # <- this string is a filepath 

# automatically increments from the previous experiment number
# data is saved to disk automatically, even when an error is thrown
# running again (after error) won't double-increment the experiment number (same number until non-error run is achieved)
with collection.new_experiment() as record_keeper:
    model1 = record_keeper.sub_record_keeper(model="model1")
    model2 = record_keeper.sub_record_keeper(model="model2")
    # splits^ in two different ways (like siblings in a family tree)

    # 
    # training
    # 
    model_1_losses = model1.sub_record_keeper(training=True)
    model_2_losses = model2.sub_record_keeper(training=True)
    for each_index in range(1000):
        # one approach
        model_2_losses.add_record({
            "index": each_index,
            "loss": random(),
        })

        # alternative approach (same outcome)
        # - this way is very handy for adding data in one class method (loss func)
        #   while calling commit_record in a different class method (update weights)
        model_1_losses.pending_record["index"] = each_index
        model_1_losses.pending_record["loss"] = random()
        model_1_losses.commit_record()
    # 
    # testing
    # 
    model_1_evaluation = model1.sub_record_keeper(testing=True)
    model_2_evaluation = model2.sub_record_keeper(testing=True)
    for each_index in range(50):
        # one method
        model_2_evaluation.add_record({
            "index": each_index,
            "accuracy": random(),
        })

        # alternative way (same outcome)
        model_1_evaluation.pending_record["index"] = each_index
        model_1_evaluation.pending_record["accuracy"] = random()
        model_1_evaluation.commit_record()


# 
# 
# Analysis
# 
# 

all_records = collection.records
print(all_records[0]) # prints first record, which behaves just like a regular dictionary

# first 500 training records (from both models)
records_first_half_of_time = tuple(each for each in all_records if each["training"] and each["index"] < 500)
# not a great example, but this wouldn't care if the loss was from model1 or model 2
first_half_average_loss = average(tuple(each["loss"] for each in records_first_half_of_time))
# only for model 1
model_1_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model1"))
# only for model 2
model_2_first_half_loss = average(tuple(each["loss"] for each in records_first_half_of_time if each["model"] == "model2"))
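
The per-model averages above repeat the same filter for each model. A small grouping helper can compute them in one pass. This sketch runs on plain dictionaries with made-up values; `average_by` is a hypothetical helper, and it assumes the records support `in` and `[]` the way regular dictionaries do.

```python
from collections import defaultdict
from statistics import mean

def average_by(records, group_key, value_key):
    """Average value_key for each distinct value of group_key (hypothetical helper)."""
    groups = defaultdict(list)
    for each in records:
        # skip records that lack either key (e.g. testing records have no "loss")
        if group_key in each and value_key in each:
            groups[each[group_key]].append(each[value_key])
    return {group: mean(values) for group, values in groups.items()}

# made-up records shaped like the ones logged above
sample_records = [
    {"model": "model1", "training": True, "index": 0, "loss": 0.5},
    {"model": "model1", "training": True, "index": 1, "loss": 0.25},
    {"model": "model2", "training": True, "index": 0, "loss": 1.0},
    {"model": "model2", "testing": True, "index": 0, "accuracy": 0.75},
]
loss_by_model = average_by(sample_records, "model", "loss")
```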

What are some other details?

The ExperimentCollection adds 6 keys as a parent to every record:

experiment_number     # int
error_number          # int, is only incremented for back-to-back error runs
had_error             # boolean for easy filtering
experiment_start_time # the output of time.time() from python's time module
experiment_end_time   # the output of time.time() from python's time module
experiment_duration   # the difference between start and end (for easy graphing/filtering)
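
These keys make it easy to drop failed runs before analysis. The sketch below uses plain dictionaries with made-up values that carry the key names listed above; it does not call the library.

```python
# made-up records carrying some of the parent keys listed above
sample_records = [
    {"experiment_number": 1, "had_error": False, "experiment_duration": 12.5, "loss": 0.4},
    {"experiment_number": 2, "had_error": True, "experiment_duration": 3.0, "loss": 0.9},
    {"experiment_number": 3, "had_error": False, "experiment_duration": 20.0, "loss": 0.2},
]

# keep only records from runs that finished without an error
clean_records = [each for each in sample_records if not each["had_error"]]

# duration of each clean experiment, e.g. for graphing
duration_by_experiment = {
    each["experiment_number"]: each["experiment_duration"] for each in clean_records
}
```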

Release history

Version	Release date
1.4.4	Apr 12, 2024
1.4.3	Apr 12, 2024
1.4.2	May 28, 2023
1.4.1	May 25, 2023
1.4.0	Apr 24, 2023
1.3.2	Apr 12, 2023
1.3.1	Apr 12, 2023
1.3.0	Mar 19, 2023
1.2.1	Mar 19, 2023
1.2.0	May 10, 2022
1.1.1	Apr 15, 2022
1.1.0	Apr 11, 2022
1.0.2	Apr 11, 2022
1.0.1	Apr 11, 2022
1.0.0	Apr 11, 2022
0.0.2	Jan 7, 2022 (this version)
0.0.1	Jan 7, 2022

Download files

Source Distribution

rigorous_recorder-0.0.2.tar.gz (10.1 kB)

Uploaded Jan 7, 2022 Source

Built Distribution

rigorous_recorder-0.0.2-py3-none-any.whl (8.4 kB)

Uploaded Jan 7, 2022 Python 3

Hashes for rigorous_recorder-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`ba02649f8d92b7c617fd73c6be846a81d8d536aff44b53fae9149707d058e055`
MD5	`fec0cad15694132b9f3dcb6965ce34a4`
BLAKE2b-256	`ee282b6bda602feb95a06669d475ff14c0d8aeab3785d58d74a540e71814e4f4`

Hashes for rigorous_recorder-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16abaee072fefaad9741740d9b379e82ac69c7f6de4b27f86b651be48f46376c`
MD5	`353288444256bdee2ba4d765757cf12a`
BLAKE2b-256	`5b0180829aecbd3a36e980ef1892e6365f03e36dc2b0a0fcfdbe5820833e2556`