unique serialization of python objetcs
Project description
binfootprint - unique serialization of python objects
Why unique serialization
When caching computationally expansive function calls, the input arguments (*args, **kwargs) serve as key to look up the result of the function. To perform efficient lookups these keys (often a large number of nested python objects) needs to be hashable. Since python's build-in hash function is randomly seeded (and applies to a few data types only) it is not suited for persistent caching. Alternatively, standard hash functions, as provided by the hashlib library, can be used. As they relay on byte sequences as input, python objects need to be converted to such a sequence first. Surely, python's pickle module provides such a serialization which, for our purpose, has the drawback that the byte sequence is not guaranteed to be unique (e.g., a dictionary can be stored as different byte sequences, as the order of the (key, value) pairs is irrelevant).
The binfootprint module fills that gap. It guarantees that a particular python object will have a unique binary representation which can serve as input for any hash function.
Quick start
binfootprint.dump(data)
generate a unique binary representation
(binary footprint) of data
.
b = binfootprint.dump(['hallo', 42])
Its output can serve as suitable input for a hash function.
hashlib.sha256(b).hexdigest()
binfootprint.load(data)
reconstructs the original python object.
ob = binfootprint.load(b)
Numpy array can be serialized.
a = numpy.asarray([0, 2.3, 4])
b = binfootprint.dump(a)
Classes which implement __getstate__
(pickle interface) or __bfkey__
can be
serialized too.
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __getstate__(self):
return [self.x, self.y]
ob = Point(4, -2)
b = binfootprint.dump(ob)
If __bfkey__
is implemented, it is used over __getstate__
.
New since version 1.2.0:
functools.partial
objects can now be serialized too. This allows to cache a function which takes a functools.partial
as argument.
def gaussian(x, a, sigma, x0):
return a * math.exp(-(x-x0)**2 / 2 / sigma**2)
@binfootprint.util.ShelveCacheDec()
def quad(f, x_min, x_max, dx):
r = 0
x = x_min
while x < x_max:
r += f(x)
x += dx
return dx*r
g = functools.partial(gaussian, a=1, sigma=1, x0=-2.34)
quad(g, x_min=-10, x_max=10, dx=0.001)
cache decorator
Utilizing the unique binary representation of python objects, a persistent
cache for quite general functions is provided by the ShelveCache
class.
The decorator ShelveCacheDec
makes it really easy to use:
@binfootprint.ShelveCacheDec(path='.cache')
def area(p):
return p.x * p.y
parameter base class
To conveniently organize a set of parameters suitable as key for caching
you can subclass ABCParameter
. Why should you do that?
- The
__bfkey__
method ofABCParameter
ignores parameters that areNone
. This allows to extend your function interface without loosing access to cached results from earlier stages. - You can add informative information to the
__non_key__
member which are not included in the binary representation of the parameter class.
class Param(bf.ABCParameter):
__slots__ = ["x", "y", "__non_key__"]
def __init__(self, x, y, msg=""):
super().__init__()
self.x = x
self.y = y
self.__non_key__ = dict()
self.__non_key__["msg"] = msg
Which data types can be serialized
Python's fundamental data types are supported
- integer
- float (64bit)
- complex (128bit)
- strings
- byte arrays
- special build-in constants:
True
,False
,None
as well as their nested combination by means of the native data structures
- tuple
- list
- dictionary
- namedtuple.
In addition, the following types are supported:
- numpy
ndarray
: The serialization makes use of numpy's format.write_array() function using version 1.0. functools.partial
objects (new since version 1.2.0)
Furthermore, any class that implements
__getstate__
(python's pickle interface)
can be serialized as well, given that the returned data from __getstate__
can be serialized
and the returned data is not None
Distinction between objects is realized by adding the class name and the name of the module which defines
the class to the binary data.
This in turn allows to also reconstruct the original object by means of the __setstate__
method.
In case the __getstate__
method is not suitable, you can implement
__bfkey__
which should return the necessary data to distinguish different objects.
The spirit of __bfkey__
is very similar to that of __getstate__
, although it is meant
for serialization only, and to for reconstruction the original object.
Note that, if __bfkey__
is implemented it will be used, regardless of __getstate__
.
Note: dumping older version is not supported anymore. If backwards compatibility is needed check out older code from git. If needed converters should/will be written.
be carefull with functions
Since a function objects seem to implement __getstate__
which, however, returns None
,
dumping a function will fail.
Whether this makes sense, can be discussed.
Implementing your own callable ore using partioal
objects can circumvent this.
Installation
pip
install the latest version using pip
pip install binfootprint
poetry
Using poetry allows you to include this package in your project as a dependency.
git
check out the code from github
git clone https://github.com/richard-hartmann/binfootprint.git
dependencies
- python3
- numpy
How to use the binfootprint module
data serialization
Generating the binary footprint is done using the dump(obj)
method.
very simple
import binfootprint as bf
bf.dump(['hallo', 42])
more complex
import hashlib
import binfootprint as bf
SIGMA_Z = 0x34
data = {
'Færøerne': {
'area': (1399, 'km^2'),
'population': 54000
},
SIGMA_Z: [[-1, 0],
[0, 1]],
'usefulness': None
}
b = bf.dump(data)
print("MD5 check sum:", hashlib.md5(b).hexdigest())
reconstruct serialized data
Although the primary focus of this module is the binary representation,
for reasons of convenience or debugging it might be useful restore the original
python object from the binary data.
Calling the load(bin_data)
function achieves that task.
import binfootprint as bf
data = ['hallo', 42]
b = bf.dump(data)
data_prime = bf.load(b)
print(data_prime)
python objects - __getstate__
Since __getstate__
is assumed to uniquely represent the state of an
object by means of the returned data, it can be used to generate a unique binary
representation.
import binfootprint as bf
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __getstate__(self):
return [self.x, self.y]
def __setstate__(self, state):
self.x = state[0]
self.y = state[1]
ob = Point(4, -2)
b = bf.dump(ob)
Since __setstate__
is implemented as well, the original object can be
reconstructed.
ob_prime = bf.load(b)
print("type:", type(ob_prime))
# type: <class '__main__.Point'>
print("member x:", ob_prime.x)
# member x: 4
print("member y:", ob_prime.y)
# member y: -2
implement __bfkey__
if __getstate__
is not suited
In case __getstate__
returns data which is not sufficient to uniquely label
an object or if the data cannot be serialized by the binaryfootprint module,
the method __bfkey__
should be implemented.
It is expected to return serializable data which uniquely identifies the state
of the object.
Note that, if __bfkey__
is present, __getstate__
is ignored.
Importantly, when deserializing the binary data from an object implementing
__bfkey__
, the python object is not returned, since there is no
__setstate__
equivalent. Instead, the class name, the name of the module defining
the class and the data returned by __bfkey__
is recovered.
This should not pose a problem, since the main focus of the binfootprint module is
the unique binary serialization of an object.
To ensure deserialization use python's pickle
module.
class Point2(Point):
def __bfkey__(self):
return {'x': self.x, 'y': self.y}
ob = Point2(5, 3)
b = bf.dump(ob)
ob_prime = bf.load(b)
print("load on bfkey:", ob_prime)
# load on bfkey: ('Point2', '__main__', {'x': 5, 'y': 3})
numpy ndarrays
Numpy's ndarray
are supported by relaying on numpy's binary serialization
using format.write_array().
import binfootprint as bf
import numpy as np
a = np.asarray([0, 1, 1, 0])
b1 = bf.dump(a)
As expected, changing the shape or data type yield a different binary representation
a2 = a1.reshape(2,2)
b2 = bf.dump(a2)
a3 = np.asarray(a1, dtype=np.complex128)
b3 = bf.dump(a3)
print(" MD5 of int array :", hashlib.md5(b1).hexdigest())
# 949bfba1237c48007a066398f744a161
print("MD5 of int array shape (2,2) :", hashlib.md5(b2).hexdigest())
# e9049a19f82c6f282d65466a72360cd8
print(" MD5 of complex array :", hashlib.md5(b3).hexdigest())
# 2274ea54925d88ec4d53853050e55a82
caching
With the binaryfootprint module, caching function calls is straight forward.
An implementation of such a cache using python's shelve
for persistent storage
is provided by the ShelveCacheDec
class.
@binfootprint.ShelveCacheDec()
def area(p):
print(" * f(p(x={},y={})) called".format(p.x, p.y))
return p.x * p.y
It is safe to use the ShelveCacheDec
with the same data location (path
)
on different functions, since the name of the function and the name of the
module defining the function determined the name of the underlying database.
In addition to caching the decorator extends the function signature by the
kwarg _cache_flag
which modifies the caching behavior as follows:
_cache_flag = 'no_cache'
: Simple call offnc
with no caching._cache_flag = 'update'
: Callfnc
and update the cache with recent return value._cache_flag = 'has_key'
: ReturnTrue
if the call has already been cached, otherwiseFalse
._cache_flag = 'cache_only'
: Raises aKeyError
if the result has not been cached yet.
p = Point(10, 10)
print("first call results in")
print(area(p))
# * f(p(x=10,y=10)) called
# 100
print("second call results in")
print(area(p))
# 100
p = Point(10, 11)
print("f(p(10, 11)) is in cache?")
print(area(p, _cache_flag='has_key'))
# False
pitfalls
ints and floats
Since the binary representation between ints and floats is different, 1
and 1.0
will be treated as different things.
This means that the cached value of a function call with an argument being 1
is
not found when passing 1.0
as argument.
Although the result of the function will most likely be the same.
Obviously, the same holds true for numpy array of different dtype
.
Parameter class
Tha abstract base class ABCParameter
allows to conveniently manage a set
of parameters.
Relevant parameters, explicitly specified as data member via __slots__
mechanism, are returned by __bfkey__
method (see above).
Their order in the __slots__
definition is irrelevant.
Importantly, class members are included only if they are not None
.
In this way a parameter class definition can be extended while still being
able to reproduce the binary footprint of an older class definition.
If present, the class member __non_key__
has a special meaning.
It is not included in the parameter-values list returned by __bfkey__
.
It is expected to be dictionary-like and allows storing
additional / informative information.
This is also reflected by the string representation of the class.
class Param(binfootprint.ABCParameter):
__slots__ = ["x", "y", "__non_key__"]
def __init__(self, x, y, msg=""):
super().__init__()
self.x = x
self.y = y
self.__non_key__ = dict()
self.__non_key__['msg'] = msg
p = Param(3, 4.5)
bfp = binfootprint.dump(p)
print("{}\n has hex hash value {}...".format(
p, binfootprint.hash_hex_from_bin_data(bfp)[:6])
)
# x : 3
# y : 4.5
# --- extra info ---
# msg :
# has hex hash value 38dbe8...
p = Param(3, 4.5, msg="I told you, don't use x=3!")
bfp = binfootprint.dump(p)
print("{}\n has hex hash value {}...".format(
p, binfootprint.hash_hex_from_bin_data(bfp)[:6])
)
# x : 3
# y : 4.5
# --- extra info ---
# msg : I told you, don't use x=3!
# has hex hash value 38dbe8...
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file binfootprint-1.2.2.tar.gz
.
File metadata
- Download URL: binfootprint-1.2.2.tar.gz
- Upload date:
- Size: 15.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.11.3 Linux/6.1.0-13-amd64
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1c4b7da6f5cde19aa48e002430d6629125cd1bacc364940c001fd53cd4b953b7 |
|
MD5 | 43b99b3dbb6899dcd016d79c3a4afdb6 |
|
BLAKE2b-256 | 52456daa67b0eb61a9572fd9f0aa92c1ab8ca29743b64dc2f4aac41007edf118 |
File details
Details for the file binfootprint-1.2.2-py3-none-any.whl
.
File metadata
- Download URL: binfootprint-1.2.2-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.11.3 Linux/6.1.0-13-amd64
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 52e4234ad1732ab51f739dac39ddb271640c578b1b92d5b1fec138313601a778 |
|
MD5 | 3661d4341d1e609a091900337e4783ed |
|
BLAKE2b-256 | e5110aec70173ab93c29227497b962d05fac06cb9bc5eb01419d2b4f60711372 |