Persist expensive operations on disk.
Installation
    pip install .

or

    pip install persist-to-disk
By default, a folder called `.cache/persist_to_disk` is created under your home directory and will be used to store cache files. If you want to change it, see "Global Settings" below.
Global Settings
To set global settings (for example, where the cache should go by default), please do the following:
    import persist_to_disk as ptd
    ptd.config.generate_config()
Then, you could (optionally) change the settings in the generated `config.ini`:

- `persist_path`: where to store the cache. All projects you have on this machine will have a folder under `persist_path` by default, unless you specify it within the project (see examples below).
- `hashsize`: how many hash buckets to use to store each function's outputs. Default=500.
- `lock_granularity`: how granular the lock is. This could be `call`, `func` or `global`:
  - `call` means each hash bucket will have one lock, so only processes trying to write/read to/from the same hash bucket will share the same lock.
  - `func` means each function will have one lock, so if you have many processes calling the same function they will all use the same lock.
  - `global` means all processes share the same lock (I tested that it's OK to have nested mechanisms on Unix).
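For orientation, here is what a generated `config.ini` could contain. This is only a hedged sketch: the three keys come from the list above, while the section header and the concrete values are assumptions.

    [DEFAULT]
    persist_path = /home/<user>/.cache/persist_to_disk
    hashsize = 500
    lock_granularity = call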
Quick Start
Basic Example
Using `persist_to_disk` is very easy. For example, if you want to write a general training function:
    import persist_to_disk as ptd
    import torch

    @ptd.persistf()
    def train_a_model(dataset, model_cls, lr, epochs, device='cpu'):
        ...
        return trained_model_or_key

    if __name__ == '__main__':
        train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)
Suppose the above is in a file with path `~/project_name/pipeline/train.py`. If we are in `~/project_name` and run `python -m pipeline.train`, a cache folder will be created under `PERSIST_PATH`, like the following:
    PERSIST_PATH (= ptd.config.get_persist_path())
    ├── project_name-[autoid]
    │   ├── pipeline
    │   │   ├── train
    │   │   │   ├── train_a_model
    │   │   │   │   ├── [hashed_bucket].pkl
Note that in the above, `[autoid]` is an auto-generated id, and `[hashed_bucket]` will be an int in [0, `hashsize`).
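For instance (a sketch reusing the example above), a repeated call with identical arguments is served from the cache instead of re-running the function:

    if __name__ == '__main__':
        train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)  # first call: runs and writes the cache
        train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)  # identical call: read back from [hashed_bucket].pkl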
Multiprocessing
Note that `ptd.persistf` can be used with multiprocessing directly.
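For example, a minimal sketch, assuming the `train_a_model` function from the basic example is defined at module level; the locks described under `lock_granularity` arbitrate concurrent access to the cache files:

    import multiprocessing as mp

    import torch

    if __name__ == '__main__':
        jobs = [('MNIST', torch.nn.Linear, lr, 30) for lr in (1e-2, 1e-3, 1e-4)]
        with mp.Pool(processes=3) as pool:
            pool.starmap(train_a_model, jobs)  # each distinct call is computed and cached once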
Advanced Settings
`config.set_project_path` and `config.set_persist_path`

There are two important paths for each workspace/project: `project_path` and `persist_path`. You could set them by calling `ptd.config.set_project_path` and `ptd.config.set_persist_path`. On a high level, `persist_path` determines where the results are cached/persisted, and `project_path` determines the structure of the cache file tree.
Following the basic example, calling `ptd.config.set_persist_path(PERSIST_PATH)` will only change the root directory. On the other hand, suppose we add a line of `ptd.config.set_project_path("./pipeline")` to `train.py` and run it again; the new file structure will be created under `PERSIST_PATH`, like the following:
    PERSIST_PATH (= ptd.config.get_persist_path())
    ├── pipeline-[autoid]
    │   ├── train
    │   │   ├── train_a_model
    │   │   │   ├── [hashed_bucket].pkl
Alternatively, it is also possible that we store some notebooks under `~/project_name/notebook/`. In this case, we could set the `project_path` back to `~/project_name`. You could check the mapping from projects to autoids in `~/.persist_to_disk/project_to_pids.txt`.
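Putting the two setters together, a minimal sketch (both paths below are placeholders):

    import persist_to_disk as ptd

    ptd.config.set_persist_path('/data/ptd_cache')  # root under which all cached results live
    ptd.config.set_project_path('~/project_name')   # anchor that shapes the cache tree under that root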
Additional Parameters
`persistf` takes additional arguments. For example, consider the new function below:
    @ptd.persistf(groupby=['dataset', 'epochs'], expand_dict_kwargs=['model_kwargs'], skip_kwargs=['device'])
    def train_a_model(dataset, model_cls, model_kwargs, lr, epochs, device='cpu'):
        model = model_cls(**model_kwargs)
        model.to(device)
        ...  # train the model
        model.save(path)
        return path
The kwargs we passed to `persistf` have the following effects:

- `groupby`: We will create more intermediate directories based on what's in `groupby`. In the example above, the new cache structure will look like:
    PERSIST_PATH (= ptd.config.get_persist_path())
    ├── project_name-[autoid]
    │   ├── pipeline
    │   │   ├── train
    │   │   │   ├── train_a_model
    │   │   │   │   ├── MNIST
    │   │   │   │   │   ├── 20
    │   │   │   │   │   │   ├── [hashed_bucket].pkl
    │   │   │   │   │   ├── 10
    │   │   │   │   │   │   ├── [hashed_bucket].pkl
    │   │   │   │   ├── CIFAR10
    │   │   │   │   │   ├── 30
    │   │   │   │   │   │   ├── [hashed_bucket].pkl
- `expand_dict_kwargs`: This simply allows dictionaries to be passed in. Because we cannot hash a dictionary directly, there are additional preprocessing steps for these arguments within `ptd`. Note that you can also set `expand_dict_kwargs='all'` to avoid specifying individual dictionary arguments. However, please only do so IF YOU KNOW what you are passing in: a very big nested dictionary can make cache retrieval very slow and use a lot of disk space unnecessarily.
- `skip_kwargs`: This specifies arguments that will be ignored. For example, if we call `train_a_model(..., device='cpu')` and `train_a_model(..., device='cuda:0')`, the second run will simply read the cache, as `device` is ignored.
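To make the effect of these parameters concrete, here is a hedged usage sketch (the model class and its kwargs are illustrative placeholders, paired with the decorated function above):

    model_kwargs = {'in_features': 784, 'out_features': 10}

    # Trains and caches under .../train_a_model/MNIST/20/[hashed_bucket].pkl
    train_a_model('MNIST', torch.nn.Linear, model_kwargs, 1e-3, 20, device='cuda:0')

    # Only `device` differs, and it is in skip_kwargs, so this call is a cache hit.
    train_a_model('MNIST', torch.nn.Linear, model_kwargs, 1e-3, 20, device='cpu')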
Other useful parameters:

- `hash_size`: Defaults to 500. If a function has a lot of cached results, you can increase this if necessary, so the results are spread across more hash buckets (each bucket is one `.pkl` file on disk).
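For instance (a sketch; the value 2000 and the function itself are only illustrative):

    @ptd.persistf(hash_size=2000)  # more buckets, so fewer entries per .pkl file
    def featurize(text):
        ...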
0.0.7
==================

- Shared cache vs local cache (the latter specified by `persist_path_local` in the config). This assumes local reads are faster. Can be skipped.
- Add support for `argparse.Namespace` to support a common practice.
- Add support for the argument `alt_dirs` for `persistf`. For example, suppose the function is called `func1` and its default cache path is `/path/repo-2/module/func1`, and we have cache from a similar code base at a different location, whose cache lives at `/path/repo-1/module/func1`. Then, we could do:

      @ptd.persistf(alt_dirs=["/path/repo-1/module/func1"])
      def func1(a=1):
          print(1)

  A call to `func1` will read cache from `repo-1` and write it to `repo-2`.
- Add support for the argument `alt_root` for `manual_cache`. It could be a function that modifies the default path.
0.0.6
==================

- Added the JSON serialization mode. This could be specified by `hash_method` when calling `persistf`.
- If a function is specified with `cache=ptd.READONLY`, no file lock will be used (to avoid unnecessary conflicts).
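A sketch of the read-only mode, assuming `cache` is passed as a `persistf` keyword as the note above suggests (the function body is a placeholder):

    @ptd.persistf(cache=ptd.READONLY)  # read-only: existing cache is used, no file lock is taken
    def evaluate_model(key):
        ...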
0.0.5
==================

- `lock_granularity` can now be set differently for each function.
- Changed the default cache folder to `.cache/persist_to_disk`.
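A hedged sketch of the per-function setting: the changelog says it can differ per function, but passing it as a `persistf` keyword is an assumption here.

    @ptd.persistf(lock_granularity='func')  # one lock shared by all of this function's buckets
    def preprocess(split):
        ...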
0.0.4
==================

- Changed the behavior of `switch_kwarg`. Now, this is not considered an input to the wrapped function. For example, the correct usage is:

      @ptd.persistf(switch_kwarg='switch')
      def func1(a=1):
          print(1)

      func1(a=1, switch=ptd.NOCACHE)

  Note how `switch` is not an argument of `func1`.
- Fix the path inference step, which now finds the absolute paths for `project_path` or `file_path` (the path to the file containing the function) before inferring the structure.
0.0.3
==================
- Added `set_project_path` to config.
File details
Details for the file `persist_to_disk-0.0.7.tar.gz`.
File metadata
- Download URL: persist_to_disk-0.0.7.tar.gz
- Upload date:
- Size: 15.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 20f87ca913a66b4460b675507a86898111fd2014617def6ec84dc3961b91d3c1
MD5 | 802f9190db3233c5e75bfbdcdcbb45b4
BLAKE2b-256 | fa5df2c92a996833fd9b9d257eb51c239a41ca5793da1185f1cbe1abfd0c4466
File details
Details for the file `persist_to_disk-0.0.7-py3-none-any.whl`.
File metadata
- Download URL: persist_to_disk-0.0.7-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4cbe320fff6690dc25e26eb76d75174a10932129afd826ec675abd9740178409
MD5 | 1d1270c7d5f06344a681b8871a1df19f
BLAKE2b-256 | 0a6398b8bdedcca653d0efcef18a5afdf856002b2a52d0c022f15ebf11d43955