Skip to main content

Easy sqlite-backed persistent cache for dataclasses

Project description

CircleCI

Cachew: quick NamedTuple/dataclass cache

TLDR: cachew can persistently cache any sequence (an Iterator) over NamedTuples or dataclasses into an sqlite database on your disk. Database schema is automatically inferred from type annotations (PEP 526).

It works in a similar manner to functools.lru_cache: caching your data is just a matter of decorating it.

The difference from functools.lru_cache is that data is preserved between program runs.

Motivation

I often find myself processing big chunks of data, computing some aggregates on it or extracting only bits I'm interested at. While I'm trying to utilize REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.

Conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of input files, comparing on the next run and returning cached data if nothing changed.

Simple as it sounds, it is pretty tedious to do every time you need to memorize some data, contaminates your code with routine and distracts you from your main task.

Example

Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from Wikipedia archive. Parsing it (extract_links function) takes hours, however, the archive is presumably updated not very frequently.

With this library your can achieve it through single @cachew decorator.

>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items

How it works

Basically, your data objects get flattened out and python types are mapped onto sqlite types and back

When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value.

If they match, it would deserialize and yield whatever is stored in the cache database, if the hash mismatches, the original data provider is called and new data is stored along with the new hash.

Features

Using

See docstring for up-to-date documentation on parameters and return types. You can also use extensive unit tests as a reference.

Some highlights:

  • cache_path can be a filename, or you can specify a callable returning path and depending on function's arguments.

    It's not required to specify the path (it will be created in /tmp) but recommended.

  • hashf by default just hashes all the arguments, you can also specify a custom callable.

    For instance, it can be used to discard cache the input file was modified.

  • cls is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache.

Installing

Package is available on pypi.

pip install cachew

Developing

I'm using tox to run tests, and circleci.

Implementation

  • why tuples and dataclasses?

    Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straighforward and self documenting way way to represent a bit of data in Python. Very compact syntax makes it extremely convenitent even for one-off means of communicating between couple of functions.

    If you want to find out more why you should use more dataclasses in your code I suggest these links: What are data classes?, basic data classes.

  • why not pickle?

    Pickling is a bit heavyweight for plain data class. There are many reports of pickle being slower than even JSON and it's also security risk. Lastly, it can only be loaded via Python.

  • why sqlite database for storage?

    It's pretty effecient and sequence of namedtuples maps onto database rows in a very straighforward manner.

  • why not pandas.DataFrame?

    DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data, however hardly convenitent to think about it abstractly due to their dynamic nature. They also can't be nested.

  • why not ORM?

    ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. It's also somewhat an overkill for such a specific purpose.

    • E.g. SQLAlchemy requires you using custom sqlalchemy specific types and inheriting a base class. Also it doesn't support nested types.
  • why not marshmallow?

    Marshmallow is a common way to map data into db-friendly format, but it requires explicit schema which is an overhead when you have it already in the form of type annotations. I've looked at existing projects to utilise type annotations, but didn't find them covering all I wanted:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

cachew-0.5-py2.py3-none-any.whl (22.7 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page