Persist expensive operations on disk.
Installation
`pip install .` (from a local checkout) or `pip install persist-to-disk` (from PyPI)
By default, a folder called .cache/persist_to_disk is created under your home directory, and will be used to store cache files.
If you want to change it, see "Global Settings" below.
Global Settings
To set global settings (for example, where the cache should go by default), please do the following:
```python
import persist_to_disk as ptd
ptd.config.generate_config()
```
Then, you could (optionally) change the settings in the generated config.ini:
- `persist_path`: where to store the cache. All projects you have on this machine will have a folder under `persist_path` by default, unless you specify it within the project (see examples below).
- `hashsize`: how many hash buckets to use to store each function's outputs. Default=500.
- `lock_granularity`: how granular the lock is. This could be `call`, `func` or `global`. `call` means each hash bucket will have one lock, so only processes trying to write/read to/from the same hash bucket will share the same lock. `func` means each function will have one lock, so if you have many processes calling the same function they will all use the same lock. `global` means all processes share the same lock. (Nesting these mechanisms has been tested to work on Unix.)
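To make the `hashsize` setting concrete, here is a minimal sketch of how outputs could be grouped into a fixed number of hash buckets. The function name `bucket_for` is hypothetical and this is not `persist_to_disk`'s actual hashing code, just the idea:

```python
import hashlib


def bucket_for(key: str, hashsize: int = 500) -> int:
    """Map a cache key to one of `hashsize` buckets (illustrative sketch)."""
    # A stable hash (unlike Python's builtin hash()) keeps bucket
    # assignments consistent across interpreter runs.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hashsize
```

Under this scheme, with `lock_granularity='call'`, only processes whose calls land in the same bucket would contend for the same lock.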
Quick Start
Basic Example
Using persist_to_disk is very easy.
For example, if you want to write a general training function:
```python
import torch

import persist_to_disk as ptd


@ptd.persistf()
def train_a_model(dataset, model_cls, lr, epochs, device='cpu'):
    ...
    return trained_model_or_key


if __name__ == '__main__':
    train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)
```
Suppose the above is in a file with path ~/project_name/pipeline/train.py.
If we are in ~/project_name and run python -m pipeline.train, a cache folder will be created under PERSIST_PATH, like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│   ├── pipeline
│   │   ├── train
│   │   │   ├── train_a_model
│   │   │   │   ├── [hashed_bucket].pkl
```
Note that in the above, `[autoid]` is an auto-generated id.
`[hashed_bucket]` will be an integer in `[0, hashsize)`.
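The mechanism behind this is plain disk memoization: hash the call, look the result up in a bucket file, and compute only on a miss. A self-contained toy sketch of the idea (the decorator `persist_sketch` is hypothetical, not `persist_to_disk`'s implementation, which also handles locking and the project tree):

```python
import functools
import hashlib
import os
import pickle
import tempfile


def persist_sketch(cache_dir=None, hashsize=500):
    """Toy persist-to-disk decorator (illustrative only)."""
    cache_dir = cache_dir or tempfile.mkdtemp()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Build a stable key from the function name and its arguments.
            key = repr((fn.__name__, args, sorted(kwargs.items())))
            bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % hashsize
            path = os.path.join(cache_dir, fn.__name__, f"{bucket}.pkl")
            os.makedirs(os.path.dirname(path), exist_ok=True)
            store = {}
            if os.path.exists(path):
                with open(path, "rb") as f:
                    store = pickle.load(f)
            if key not in store:
                store[key] = fn(*args, **kwargs)  # cache miss: compute and save
                with open(path, "wb") as f:
                    pickle.dump(store, f)
            return store[key]
        return wrapper
    return decorator
```

Calling a decorated function twice with the same arguments runs the body only once; the second call is served from the `.pkl` bucket on disk.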
Multiprocessing
Note that ptd.persistf can be used with multiprocessing directly.
Advanced Settings
`config.set_project_path` and `config.set_persist_path`
There are two important paths for each workspace/project: project_path and persist_path.
You could set them by calling ptd.config.set_project_path and ptd.config.set_persist_path.
On a high level, persist_path determines where the results are cached/persisted, and project_path determines the structure of the cache file tree.
Following the basic example, calling `ptd.config.set_persist_path(PERSIST_PATH)` will only change the root directory of the cache tree.
On the other hand, suppose we add a line `ptd.config.set_project_path("./pipeline")` to `train.py` and run it again; a new file structure will be created under `PERSIST_PATH`, like the following:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── pipeline-[autoid]
│   ├── train
│   │   ├── train_a_model
│   │   │   ├── [hashed_bucket].pkl
```
Alternatively, it is also possible that we store some notebooks under ~/project_name/notebook/.
In this case, we could set the project_path back to ~/project_name.
You could check the mapping from projects to autoids in `~/.persist_to_disk/project_to_pids.txt`.
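The interplay of the two paths can be sketched as a path computation: `persist_path` fixes only the root, while the function file's location relative to `project_path` fixes the subtree. The helper `cache_dir_for` and the explicit `autoid` parameter below are hypothetical (`ptd` infers these internally):

```python
import os


def cache_dir_for(func_file, func_name, project_path, persist_path, autoid):
    """Sketch: where a function's cache subtree would live."""
    # The module's path relative to project_path decides the subtree shape...
    rel = os.path.relpath(os.path.splitext(func_file)[0], project_path)
    # ...while persist_path only decides the root.
    project_name = os.path.basename(os.path.abspath(project_path))
    return os.path.join(persist_path, f"{project_name}-{autoid}", rel, func_name)
```

With `project_path` set to `~/project_name`, this reproduces the `project_name-[autoid]/pipeline/train/train_a_model` layout shown earlier; setting it to `./pipeline` instead shortens the subtree accordingly.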
Additional Parameters
`persistf` takes additional arguments.
For example, consider the new function below:
```python
@ptd.persistf(groupby=['dataset', 'epochs'], expand_dict_kwargs=['model_kwargs'], skip_kwargs=['device'])
def train_a_model(dataset, model_cls, model_kwargs, lr, epochs, device='cpu'):
    model = model_cls(**model_kwargs)
    model.to(device)
    ...  # train the model
    model.save(path)
    return path
```
The kwargs we passed to `persistf` have the following effects:
- `groupby`: additional intermediate directories are created based on the arguments listed in `groupby`. In the example above, the new cache structure will look like:
```
PERSIST_PATH(=ptd.config.get_persist_path())
├── project_name-[autoid]
│   ├── pipeline
│   │   ├── train
│   │   │   ├── train_a_model
│   │   │   │   ├── MNIST
│   │   │   │   │   ├── 20
│   │   │   │   │   │   ├── [hashed_bucket].pkl
│   │   │   │   │   ├── 10
│   │   │   │   │   │   ├── [hashed_bucket].pkl
│   │   │   │   ├── CIFAR10
│   │   │   │   │   ├── 30
│   │   │   │   │   │   ├── [hashed_bucket].pkl
```
- `expand_dict_kwargs`: allows dictionaries to be passed in. Because a dictionary cannot be hashed directly, `ptd` applies additional preprocessing steps to these arguments. Note that you can also set `expand_dict_kwargs='all'` to avoid specifying individual dictionary arguments. However, please only do so IF YOU KNOW what you are passing in: a very big nested dictionary can make cache retrieval very slow and use a lot of disk space unnecessarily.
- `skip_kwargs`: specifies arguments that will be ignored. For example, if we call `train_a_model(..., device='cpu')` and then `train_a_model(..., device='cuda:0')`, the second run will simply read the cache, as `device` is ignored.
Other useful parameters:
- `hash_size`: Defaults to 500. If a function has a lot of cached calls, you can increase this so the entries are spread over more `.pkl` buckets on disk, keeping each bucket file small.
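The effect of `groupby` and `skip_kwargs` on the cache lookup can be sketched as follows. The helper `cache_key_parts` is illustrative logic only, not `ptd`'s internals:

```python
import inspect


def cache_key_parts(fn, args, kwargs, groupby=(), skip_kwargs=()):
    """Split a call into directory components (groupby) and a hashable key."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    # Arguments in skip_kwargs never reach the key, so changing them
    # still hits the same cache entry.
    params = {k: v for k, v in bound.arguments.items() if k not in skip_kwargs}
    # groupby values become intermediate directories instead of key material.
    dirs = [str(params.pop(k)) for k in groupby]
    return dirs, sorted(params.items())
```

For the `train_a_model` example above, a call with `dataset='MNIST'` and `epochs=30` would yield the directory components `['MNIST', '30']`, while `device` would be absent from the key entirely, so `cpu` and `cuda:0` runs share one cache entry.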
0.0.7
==================
- Shared cache vs. local cache (the latter specified by `persist_path_local` in the config). This assumes local reads are faster. Can be skipped.
- Add support for `argparse.Namespace` to support a common practice.
- Add support for the argument `alt_dirs` for `persistf`. For example, suppose the function is called `func1` and its default cache path is `/path/repo-2/module/func1`, and we have cache from a similar code base at a different location, whose cache looks like `/path/repo-1/module/func1`. Then, we could do:

  ```python
  @ptd.persistf(alt_dirs=["/path/repo-1/module/func1"])
  def func1(a=1):
      print(1)
  ```

  A call to `func1` will read cache from `repo-1` and write it to `repo-2`.
- Add support for the argument `alt_root` for `manual_cache`. It could be a function that modifies the default path.
0.0.6
==================
- Added the json serialization mode. This could be specified by `hash_method` when calling `persistf`.
- If a function is specified to be `cache=ptd.READONLY`, no file lock will be used (to avoid unnecessary conflict).
0.0.5
==================
- `lock_granularity` can be set differently for each function.
- Changed the default cache folder to `.cache/persist_to_disk`.
0.0.4
==================
- Changed the behavior of `switch_kwarg`. Now, this is not considered an input to the wrapped function. For example, the correct usage is:

  ```python
  @ptd.persistf(switch_kwarg='switch')
  def func1(a=1):
      print(1)

  func1(a=1, switch=ptd.NOCACHE)
  ```

  Note how `switch` is not an argument of `func1`.
- Fixed the path inference step, which now finds the absolute paths for `project_path` or `file_path` (the path to the file containing the function) before inferring the structure.
0.0.3
==================
- Added `set_project_path` to config.
Download files
Source Distribution
Built Distribution
File details
Details for the file persist_to_disk-0.0.7.tar.gz.
File metadata
- Download URL: persist_to_disk-0.0.7.tar.gz
- Upload date:
- Size: 15.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.8.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `20f87ca913a66b4460b675507a86898111fd2014617def6ec84dc3961b91d3c1` |
| MD5 | `802f9190db3233c5e75bfbdcdcbb45b4` |
| BLAKE2b-256 | `fa5df2c92a996833fd9b9d257eb51c239a41ca5793da1185f1cbe1abfd0c4466` |
File details
Details for the file persist_to_disk-0.0.7-py3-none-any.whl.
File metadata
- Download URL: persist_to_disk-0.0.7-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.8.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4cbe320fff6690dc25e26eb76d75174a10932129afd826ec675abd9740178409` |
| MD5 | `1d1270c7d5f06344a681b8871a1df19f` |
| BLAKE2b-256 | `0a6398b8bdedcca653d0efcef18a5afdf856002b2a52d0c022f15ebf11d43955` |