Memoize package supporting data dependencies

doorget

doorget is a Python package that memoizes functions and tracks data dependencies between them. With the cache decorator, memoizing your code requires minimal changes.

Memoization

Definition

In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur again. Memoization has also been used in other contexts, such as in simple mutually recursive descent parsing. (Source: Wikipedia)

Example

from doorget import cache
import pandas as pd

@cache
def fetch(name: str) -> pd.DataFrame:
    pass

# do it
foo = fetch('foo') # The function is called
# get it
foo_again = fetch('foo') # The previous returned data is read from the cache and the function is not called.
# do it
bar = fetch('bar') # The function is called because the input is not known yet
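
The call pattern above can be reproduced with the standard library alone. The sketch below is not doorget's implementation: it uses functools.lru_cache as a stand-in decorator, and the calls list is a hypothetical helper that only makes the "called / not called" behavior observable.

```python
from functools import lru_cache

calls = []  # hypothetical log of the inputs that actually reached the function body


@lru_cache(maxsize=None)  # stand-in for a memoizing decorator, not doorget's @cache
def fetch(name: str) -> str:
    calls.append(name)
    return f"data for {name}"


foo = fetch("foo")        # the function body runs
foo_again = fetch("foo")  # served from the cache, the body does not run
bar = fetch("bar")        # new input, the body runs again
```

After these three calls, calls holds only 'foo' and 'bar', and foo_again is the very same object as foo.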

Storage modes

The package offers three built-in storage modes, plus the ability to provide your own custom storage.

  • Memory: The fastest mode.
  • Disk: The unlimited mode.
  • Identity: The tracker mode for data dependencies and cascades.
  • Custom: Your best mode.

Memory

The memory storage uses RAM to memoize data.

This is the default storage mode. It is the fastest way to retrieve data for an input that has already been seen, but it has two caveats. First, the data is cached within the current process only: once the process ends, all cached data is lost, and the cache cannot be shared between processes. Second, the cache is limited by the available memory, which your hardware typically constrains to a few gigabytes.

from doorget import cache, StorageMode
import pandas as pd


@cache # Memory by default
def foo(name: str) -> pd.DataFrame:
    pass

@cache(mode=StorageMode.Memory)
def bar(name: str) -> pd.DataFrame:
    pass

Disk

The disk storage uses the disk to memoize data.

This is the most common use of memoization. It is slower than the Memory mode because the data is read from disk, but it addresses both caveats: the cache outlives the process, can be shared between processes, and is not limited by RAM. The built-in Disk mode uses the parquet format if a pandas DataFrame is returned, and pickle otherwise.

If no cache folder is specified, a global folder is used instead. In that case, a sub folder is created for each function, named after the function and its module, to guarantee a unique storage location.

The global folder can be changed at any time with the function setup_disk_storage.

from doorget import cache, StorageMode, setup_disk_storage
import pandas as pd


@cache(mode=StorageMode.Disk) # Use the global folder
def fetch_default(name: str) -> pd.DataFrame:
    pass

@cache(cache_folder='./my_custom/folder/bar') # Overrides the global folder
def fetch_custom(name: str) -> pd.DataFrame:
    pass

# do it
fetch_default('foo')
# do it
fetch_custom('bar')

setup_disk_storage('./my_default/folder')

# do it
fetch_default('foo') # Function is called because the cache folder changed
# get it
fetch_custom('bar') # The function is not called because the custom folder is unchanged
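
To make the folder layout described above more concrete, here is a minimal sketch of a disk-backed memoizer built on the standard library only. It is an illustration under assumptions, not doorget's code: disk_cache, GLOBAL_FOLDER and the hashed file names are hypothetical, and everything is pickled (no parquet branch).

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical global folder; a temporary directory keeps the sketch self-contained.
GLOBAL_FOLDER = Path(tempfile.mkdtemp())


def disk_cache(func):
    """Memoize `func` on disk with pickle, one sub folder per function."""
    # Sub folder built from the module and function name, so two functions never collide.
    folder = GLOBAL_FOLDER / f"{func.__module__}.{func.__name__}"

    @wraps(func)
    def wrapper(*args):
        folder.mkdir(parents=True, exist_ok=True)
        # File name derived from the arguments; repr() is enough for simple inputs.
        digest = hashlib.sha256(repr(args).encode()).hexdigest()
        path = folder / f"{digest}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # cache hit: read from disk
        result = func(*args)                        # cache miss: call the function
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper


calls = []  # hypothetical log of real function calls


@disk_cache
def fetch(name):
    calls.append(name)
    return {"name": name}


first = fetch("foo")   # the function is called and the result is written to disk
second = fetch("foo")  # the result is read back from disk
```

Because the second result is deserialized from a file, it is equal to the first one but is a different object, which is one practical difference with an in-memory cache.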

Custom

The custom storage allows you to provide your own cache where the data is memoized.

When you select this mode, you must also pass the storage argument. It accepts any storage implementation that better fits your use case, for example for performance, infrastructure, or sharing reasons.

Your custom storage implementation must inherit from the Storage class.

from typing import Any, List
from dataclasses import dataclass
from doorget import CacheKey

@dataclass
class Storage:
    name: str

    # Core functions
    def contains(self, key: CacheKey) -> bool:
        pass
    def fetch(self, key: CacheKey) -> Any:
        pass
    def store(self, key: CacheKey, data: Any) -> None:
        pass

    # Cache management helpers
    def clear(self) -> None:
        pass
    def remove(self, key: CacheKey) -> bool:
        pass
    def keys(self) -> List[CacheKey]:
        pass
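
As an illustration of the interface above, the following sketch implements the six methods on top of a plain dictionary. Several points are assumptions made so the block runs without the package installed: CacheKey is replaced by a Hashable alias, DictStorage does not inherit from the real Storage class (it would in actual use), and remove is assumed to return whether an entry was deleted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, List

# Assumption: keys are hashable. With doorget installed, CacheKey would be imported
# from doorget and DictStorage would inherit from its Storage class.
CacheKey = Hashable


@dataclass
class DictStorage:
    name: str
    _data: Dict[CacheKey, Any] = field(default_factory=dict)

    # Core functions
    def contains(self, key: CacheKey) -> bool:
        return key in self._data

    def fetch(self, key: CacheKey) -> Any:
        return self._data[key]

    def store(self, key: CacheKey, data: Any) -> None:
        self._data[key] = data

    # Cache management helpers
    def clear(self) -> None:
        self._data.clear()

    def remove(self, key: CacheKey) -> bool:
        # Assumed semantics: report whether an entry was actually removed.
        if key in self._data:
            del self._data[key]
            return True
        return False

    def keys(self) -> List[CacheKey]:
        return list(self._data)


storage = DictStorage(name="demo")
demo_key = ("fetch", ("foo",))  # plain tuple standing in for a CacheKey
storage.store(demo_key, [1, 2, 3])
```

The same shape could wrap a database or a remote object store: only the bodies of the six methods would change.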

Data dependency

What doorget adds over a simple memoization package is the tracking of data dependencies.

When a complex object, like a pandas DataFrame, is passed as an argument, simple memoization does not work, because such an object is not hashable and cannot be part of a cache key. With doorget, a complex argument is not used directly to build the key; it is substituted with its own memoized key.

from doorget import cache
import pandas as pd

@cache
def fetch(name: str) -> pd.DataFrame:
    pass

@cache
def summarize(by: str, df: pd.DataFrame) -> pd.DataFrame:
    pass

df = fetch('foo') # key for df: <fetch('foo')>
summary = summarize('date', df) # key for summary: <summarize('date', <fetch('foo')>)>

Under the hood, a memoized function keeps track of every object it returns by attaching the memoized key to the object's reference (its identity in Python). If the function returns a tuple, the contained items are tracked as well.

Cascade data identity

Sometimes it is faster to transform data than to memoize it. But if the transformed data is then passed as an argument to another memoized function, the data dependency is lost. To avoid breaking the dependency, you can isolate the transformation in a memoized function that uses the Identity storage mode. This mode requires no storage beyond what your program is already using. Under the hood, the data graph is tracked through object references (identity in Python).

from doorget import cache, StorageMode
import pandas as pd

@cache(mode=StorageMode.Identity)
def transform(x: pd.DataFrame) -> pd.DataFrame:
    return x.copy()

# fetch and summarize are the memoized functions from the previous example
df = fetch('foo') # key for df: <fetch('foo')>
df_copy = transform(df) # key for df_copy: <transform(<fetch('foo')>)>

summary = summarize('date', df_copy) # key for summary: <summarize('date', <transform(<fetch('foo')>)>)>
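
One way to picture an identity-only mode is a registry that records which key produced an object without storing the object itself. The sketch below is an assumption about how such tracking could work, not a description of doorget's internals: track, key_of, Table and the tuple-shaped keys are hypothetical names introduced for the illustration.

```python
import weakref

_identities = {}  # id(obj) -> key, only for objects that are still alive


class Table(list):
    """Stand-in for a DataFrame; subclassing list makes it weak-referenceable."""


def track(key, obj):
    """Attach `key` to `obj` by identity without keeping `obj` alive."""
    _identities[id(obj)] = key
    # Forget the entry once the object is garbage collected, since ids can be reused.
    weakref.finalize(obj, _identities.pop, id(obj), None)
    return obj


def key_of(obj):
    """Return the tracked key of `obj`, or the object itself when it is untracked."""
    return _identities.get(id(obj), obj)


def transform(table):
    # Recomputed on every call; only the identity of the result is recorded.
    return track(("transform", (key_of(table),)), Table(table))


source = track(("fetch", ("foo",)), Table([1, 2, 3]))
derived = transform(source)
derived_key = key_of(derived)  # ("transform", (("fetch", ("foo",)),))
```

Nothing is cached here: the registry holds only integers and keys, so the extra memory does not grow with the size of the data, while the key of the derived object still nests the key of its source.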

Administer your caches

The package provides a few backdoors to administer your caches, and more precisely your storages.

  • list_memory_storages()
  • list_disk_storages()
  • list_custom_storages()

Each memoized function has its own storage in these lists. From a storage you can list all the contained keys, remove some items, or clear them all. A function call is stored as a CacheKey object, which contains the function name and its combination of arguments as a tuple. If the key has data dependencies, they are nested recursively as tuples.
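
The recursive nesting can be illustrated with plain tuples. In this sketch a key is modeled as a hypothetical (name, args) pair, since the attributes of the real CacheKey class are not shown here; render and is_key are helpers invented for the example, and they print a key in the angle-bracket notation used in the code comments of the earlier examples.

```python
def is_key(value):
    """Heuristic for this sketch: a key is a (name, args) pair."""
    return (
        isinstance(value, tuple)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], tuple)
    )


def render(key):
    """Render a nested key, recursing into arguments that are keys themselves."""
    name, args = key
    parts = [render(arg) if is_key(arg) else repr(arg) for arg in args]
    return f"<{name}({', '.join(parts)})>"


fetch_key = ("fetch", ("foo",))
summary_key = ("summarize", ("date", fetch_key))  # the dependency is nested as a tuple
rendered = render(summary_key)
```

Here rendered is the string <summarize('date', <fetch('foo')>)>, which shows at a glance which upstream call a cached entry depends on.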
