JSON (de)serialization extensions
Turbo Broccoli 🥦
JSON (de)serialization extensions, originally aimed at numpy and tensorflow objects.
Installation
pip install turbo-broccoli
Usage
import json
import numpy as np
import turbo_broccoli as tb
obj = {
    "an_array": np.array([[1, 2], [3, 4]], dtype="float32")
}
json.dumps(obj, cls=tb.TurboBroccoliEncoder)
# or even simpler:
tb.to_json(obj)
produces the following string (modulo indentation):
{
  "an_array": {
    "__numpy__": {
      "__type__": "ndarray",
      "__version__": 3,
      "data": {
        "__bytes__": {
          "__version__": 1,
          "data": "PAAAAA..."
        }
      }
    }
  }
}
For deserialization, simply use
json.loads(json_string, cls=tb.TurboBroccoliDecoder)
# or even simpler:
tb.from_json(json_string)
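The cls= hooks used above are part of the standard json module. The following is a minimal sketch of the general technique (a custom encoder that emits tagged, versioned dicts, and a decoding hook that reverses them), not Turbo Broccoli's actual implementation. The names TaggedEncoder and tagged_object_hook are hypothetical, and the use of base64 for the payload is an assumption.

```python
import base64
import json


class TaggedEncoder(json.JSONEncoder):
    """Hypothetical encoder: wraps bytes in a tagged, versioned dict."""

    def default(self, o):
        if isinstance(o, bytes):
            return {
                "__bytes__": {
                    "__version__": 1,
                    # Assumption: the payload is stored as base64 text
                    "data": base64.b64encode(o).decode("ascii"),
                }
            }
        # Anything else is left to the base class, which raises TypeError
        return super().default(o)


def tagged_object_hook(dct):
    """Turn a tagged dict back into bytes; return other dicts unchanged."""
    if set(dct) == {"__bytes__"}:
        return base64.b64decode(dct["__bytes__"]["data"])
    return dct


doc = json.dumps({"payload": b"\x00\x01\x02"}, cls=TaggedEncoder)
restored = json.loads(doc, object_hook=tagged_object_hook)
```

Because json calls the hook on the innermost dicts first, the tagged wrapper is replaced by its bytes value before the enclosing document is assembled.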
Guarded calls
Consider an expensive function f that returns a TurboBroccoli/JSON-izable dict. Wrapping/decorating f using produces_document saves the result at a specified path and, when possible, loads it instead of calling f.
For example:
_f = produces_document(f, "out/result.json")
_f(*args, **kwargs)
will only call f if out/result.json does not exist; otherwise, it loads and returns the content of out/result.json. However, if out/result.json exists and was produced by calling _f(*args, **kwargs), then _f(*args2, **kwargs2) will still return the same result. If you want to keep a different file for each args/kwargs, set check_args to True as in
_f = produces_document(f, "out/result.json", check_args=True)
_f(*args, **kwargs)
In this case, the arguments must be TurboBroccoli/JSON-izable, i.e. the document
{
    "args": args,
    "kwargs": kwargs,
}
must be TurboBroccoli/JSON-izable. The resulting file is no longer out/result.json but rather out/result.json/<hash>, where hash is the MD5 hash of the serialization of the args/kwargs document above.
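The caching behaviour described in this section can be sketched with the standard library alone. This is only an illustration of the idea, not the code behind produces_document: the function name guarded is hypothetical, plain json stands in for Turbo Broccoli's serialization, and the exact byte string that gets hashed in the real library is not specified here.

```python
import hashlib
import json
from pathlib import Path


def guarded(f, path, check_args=False):
    """Hypothetical stand-in: cache the dict returned by f on disk."""
    path = Path(path)

    def wrapper(*args, **kwargs):
        target = path
        if check_args:
            # One file per distinct arguments document, named by its MD5 hash
            arg_doc = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True)
            target = path / hashlib.md5(arg_doc.encode("utf-8")).hexdigest()
        if target.exists():
            # A cached result is available: load it instead of calling f
            return json.loads(target.read_text())
        result = f(*args, **kwargs)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(result))
        return result

    return wrapper
```

Without check_args, the first result is returned for every later call regardless of the arguments. With check_args=True, the given path becomes a directory holding one file per arguments document.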
Supported types
- bytes
- collections.deque, collections.namedtuple
- Dataclasses. Serialization is straightforward:
    from dataclasses import dataclass

    @dataclass
    class C:
        a: int
        b: str

    doc = json.dumps({"c": C(a=1, b="Hello")}, cls=tb.TurboBroccoliEncoder)
  For deserialization, first register the class:
    tb.register_dataclass_type(C)
    json.loads(doc, cls=tb.TurboBroccoliDecoder)
- Generic object, serialization only. A generic object is an object that has the __turbo_broccoli__ attribute. This attribute is expected to be a list of attributes whose values will be serialized. For example,
    class C:
        __turbo_broccoli__ = ["a"]
        a: int
        b: int

    x = C()
    x.a, x.b = 42, 43
    json.dumps(x, cls=tb.TurboBroccoliEncoder)
  produces the following string (modulo indentation):
    { "__generic__": { "__version__": 1, "data": { "a": 42 } } }
  Registered attributes can of course have any type supported by Turbo Broccoli, such as numpy arrays. Registered attributes can be @property methods.
- keras.Model; standard subclasses of keras.layers.Layer, keras.losses.Loss, keras.metrics.Metric, and keras.optimizers.Optimizer
- numpy.number, numpy.ndarray with numerical dtype
- pandas.DataFrame and pandas.Series, but with the following limitations:
  - the following dtypes are not supported: complex, object, timedelta
  - the column / series names must be strings and not numbers. The following is not acceptable:
        df = pd.DataFrame([[1, 2], [3, 4]])
    because
        print([c for c in df.columns])        # [0, 1]
        print([type(c) for c in df.columns])  # [int, int]
- tensorflow.Tensor with numerical dtype, but not tensorflow.RaggedTensor
- torch.Tensor (WARNING: loaded tensors are automatically placed on the CPU and gradients are lost); torch.nn.Module (don't forget to register your module type using turbo_broccoli.register_pytorch_module_type):
    # Serialization
    class MyModule(torch.nn.Module):
        ...

    module = MyModule()  # Must be instantiable without arguments
    doc = json.dumps(module, cls=tb.TurboBroccoliEncoder)

    # Deserialization
    tb.register_pytorch_module_type(MyModule)
    module = json.loads(doc, cls=tb.TurboBroccoliDecoder)
  WARNING: It is not possible to register and deserialize standard pytorch module containers directly. Wrap them in your own custom module class.
- scipy.sparse.csr_matrix
- EXPERIMENTAL sklearn estimators (i.e. classes that descend from sklearn.base.BaseEstimator). To check which classes are supported, take a look at the unit tests. Doesn't work with:
  - All CV classes because the score_ attribute is a dict indexed with np.int64, which json.JSONEncoder._iterencode_dict rejects.
  - All estimator classes that have mandatory arguments: ClassifierChain, ColumnTransformer, FeatureUnion, GridSearchCV, MultiOutputClassifier, MultiOutputRegressor, OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier, Pipeline, RandomizedSearchCV, RegressorChain, RFE, RFECV, SelectFromModel, SelfTrainingClassifier, SequentialFeatureSelector, SparseCoder, StackingClassifier, StackingRegressor, VotingClassifier, VotingRegressor.
  - Everything that is parametrized by an arbitrary object/callable/estimator: FunctionTransformer, TransformedTargetRegressor.
  - Everything that stores a random state (in the form of a RandomState object): BisectingKMeans, MiniBatchDictionaryLearning, LatentDirichletAllocation, NeighborhoodComponentsAnalysis, MLPClassifier, MLPRegressor, SparseRandomProjection, GaussianRandomProjection.
  - Everything with trees and forests since Tree objects are not JSON serializable: ExtraTreesClassifier, ExtraTreesRegressor, RandomForestClassifier, RandomForestRegressor, RandomTreesEmbedding, IsolationForest, AdaBoostClassifier, AdaBoostRegressor, DecisionTreeClassifier, DecisionTreeRegressor.
  - Other classes that have non JSON-serializable attributes:
    Class | Non-serializable attr. |
    ---|---|
    Birch | _CFNode |
    GaussianProcessRegressor | Sum |
    GaussianProcessClassifier | Product |
    Perceptron | Hinge |
    SGDClassifier | Hinge |
    SGDOneClassSVM | Hinge |
    PoissonRegressor | HalfPoissonLoss |
    GammaRegressor | HalfGammaLoss |
    TweedieRegressor | HalfTweedieLossIdentity |
    KernelDensity | KDTree |
    SplineTransformer | BSpline |
  - Some classes have AttributeErrors?
    Class | Attribute |
    ---|---|
    IsotonicRegression | f_ |
    KernelPCA | _centerer |
    KNeighborsClassifier | _y |
    KNeighborsRegressor | _y |
    KNeighborsTransformer | _tree |
    LabelPropagation | X_ |
    LabelSpreading | X_ |
    LocalOutlierFactor | _lrd |
    MissingIndicator | _precomputed |
    NuSVC | _sparse |
    NuSVR | _sparse |
    OneClassSVM | _sparse |
    PowerTransformer | _scaler |
    RadiusNeighborsClassifier | _tree |
    RadiusNeighborsRegressor | _tree |
    RadiusNeighborsTransformer | _tree |
    SVC | _sparse |
    SVR | _sparse |
  - Other errors:
    - FastICA: I'm not sure why...
    - BaggingClassifier: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.
    - GradientBoostingClassifier: Exception: dtype object is not covered.
    - GradientBoostingRegressor: Exception: dtype object is not covered.
    - HistGradientBoostingClassifier: Problems with deserialization of _BinMapper object?
    - PassiveAggressiveClassifier: some unknown label type error...
    - KBinsDiscretizer: Exception: dtype object is not covered.
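The generic-object convention from the list above (a __turbo_broccoli__ attribute naming what to serialize, possibly including @property methods) can be illustrated with a standard-library encoder. This is a sketch of the convention only, not the library's own code. GenericEncoder and Circle are hypothetical names, and the output layout simply mirrors the __generic__ document shown earlier.

```python
import json


class GenericEncoder(json.JSONEncoder):
    """Hypothetical encoder for objects that declare __turbo_broccoli__."""

    def default(self, o):
        names = getattr(o, "__turbo_broccoli__", None)
        if names is not None:
            # getattr resolves plain attributes and @property methods alike
            data = {name: getattr(o, name) for name in names}
            return {"__generic__": {"__version__": 1, "data": data}}
        # Objects without the attribute are left to the base class (TypeError)
        return super().default(o)


class Circle:
    __turbo_broccoli__ = ["radius", "diameter"]

    def __init__(self, radius):
        self.radius = radius
        self.color = "red"  # not listed, so it is not serialized

    @property
    def diameter(self):
        return 2 * self.radius


doc = json.dumps(Circle(3), cls=GenericEncoder)
```

Only the listed names end up in the document, which is also why this form is serialization only: nothing in the output says which class to rebuild.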
Secrets
Basic Python types can be wrapped in their corresponding secret type according to the following table
Python type | Secret type |
---|---|
dict | turbo_broccoli.secret.SecretDict |
float | turbo_broccoli.secret.SecretFloat |
int | turbo_broccoli.secret.SecretInt |
list | turbo_broccoli.secret.SecretList |
str | turbo_broccoli.secret.SecretStr |
The secret value can be recovered with the get_secret_value method. At serialization, this value will be encrypted. For example,
# See https://pynacl.readthedocs.io/en/latest/secret/#key
import nacl.secret
import nacl.utils
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
from turbo_broccoli.secret import SecretStr
from turbo_broccoli.environment import set_shared_key
set_shared_key(key)
x = {
"user": "alice",
"password": SecretStr("dolphin")
}
json.dumps(x, cls=tb.TurboBroccoliEncoder)
produces the following string (modulo indentation and modulo the encrypted content):
{
  "user": "alice",
  "password": {
    "__secret__": {
      "__version__": 1,
      "data": {
        "__bytes__": {
          "__version__": 1,
          "data": "qPSsruu..."
        }
      }
    }
  }
}
Deserialization decrypts the secrets, but they stay wrapped inside the secret types above. If the wrong key is provided, an exception is raised. If no key is provided, the secret values are replaced by a turbo_broccoli.secret.LockedSecret. Internally, Turbo Broccoli uses pynacl's SecretBox.
WARNING: In the case of SecretDict and SecretList, the values contained within must be JSON-serializable without Turbo Broccoli. See also the TB_SHARED_KEY environment variable below.
Environment variables
Some behaviors of Turbo Broccoli can be tweaked by setting specific environment variables. If you want to modify these parameters programmatically, do not do so by modifying os.environ. Rather, use the methods of turbo_broccoli.environment.
- TB_ARTIFACT_PATH (default: ./; see also turbo_broccoli.set_artifact_path, turbo_broccoli.environment.get_artifact_path): During serialization, Turbo Broccoli may create artifacts to which the JSON object will point. The artifacts will be stored in TB_ARTIFACT_PATH. For example, if arr is a big numpy array,
    obj = {"an_array": arr}
    json.dumps(obj, cls=tb.TurboBroccoliEncoder)
  will generate the following string (modulo indentation and id)
    { "an_array": { "__numpy__": { "__type__": "ndarray", "__version__": 3, "id": "70692d08-c4cf-4231-b3f0-0969ea552d5a" } } }
  and a file named 70692d08-c4cf-4231-b3f0-0969ea552d5a is created in TB_ARTIFACT_PATH.
- TB_KERAS_FORMAT (default: tf, valid values are json, h5, and tf; see also turbo_broccoli.set_keras_format, turbo_broccoli.environment.get_keras_format): The serialization format for keras models. If h5 or tf is used, an artifact following said format will be created in TB_ARTIFACT_PATH. If json is used, the model will be contained in the JSON document (although the weights may be in artifacts if they are too large).
- TB_MAX_NBYTES (default: 8000, see also turbo_broccoli.set_max_nbytes, turbo_broccoli.environment.get_max_nbytes): The maximum byte size of a numpy array or pandas object beyond which serialization will produce an artifact instead of storing it in the JSON document. This does not limit the size of the overall JSON document though. 8000 bytes should be enough for a numpy array of 1000 float64s to be stored in-document.
- TB_NODECODE (default: empty; see also turbo_broccoli.set_nodecode, turbo_broccoli.environment.is_nodecode): Comma-separated list of types to not deserialize, for example bytes,numpy.ndarray. Excludable types are: bytes, dataclass.<dataclass_name> (case sensitive), collections.deque, collections.namedtuple, keras.model, keras.layer, keras.loss, keras.metric, keras.optimizer, numpy.ndarray, numpy.number, pandas.dataframe, pandas.series, tensorflow.sparse_tensor, tensorflow.tensor, tensorflow.variable.
  WARNING: excluding pandas.dataframe will crash any deserialization of pandas.series.
  WARNING: excluding numpy.ndarray may crash deserialization of Tensorflow and Pandas types.
- TB_SHARED_KEY (default: empty; see also turbo_broccoli.set_shared_key, turbo_broccoli.environment.get_shared_key): Secret key used to encrypt secrets. The encryption uses pynacl's SecretBox. An exception is raised when attempting to serialize a secret type while no key is set.
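The interplay between TB_MAX_NBYTES and TB_ARTIFACT_PATH described above amounts to a size check at serialization time. The sketch below illustrates that decision with the standard library only. It is not Turbo Broccoli's code: store_bytes is a hypothetical helper, the inline base64 encoding is an assumption, and only the "id" pointer layout is taken from the example document above.

```python
import base64
import uuid
from pathlib import Path


def store_bytes(raw, artifact_path, max_nbytes=8000):
    """Return a JSON-ready dict: an inline payload or a pointer to an artifact."""
    if len(raw) <= max_nbytes:
        # Small enough: keep the payload inside the JSON document
        return {"data": base64.b64encode(raw).decode("ascii")}
    # Too large: write the payload to its own file and store only its id
    artifact_id = str(uuid.uuid4())
    (Path(artifact_path) / artifact_id).write_bytes(raw)
    return {"id": artifact_id}
```

With the default of 8000, a buffer of 1000 float64 values (8 bytes each, so exactly 8000 bytes) stays in the document, while anything larger is written to the artifact directory.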
Contributing
Dependencies
- python3.9 or newer;
- requirements.txt for runtime dependencies;
- requirements.dev.txt for development dependencies;
- make (optional).
Simply run
virtualenv venv -p python3.9
. ./venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements.dev.txt
Documentation
Simply run
make docs
This will generate the HTML doc of the project, and the index file should be at docs/index.html. To have it directly in your browser, run
make docs-browser
Code quality
Don't forget to run
make
to format the code following black, typecheck it using mypy, and check it against coding standards using pylint.
Unit tests
Run
make test
to have pytest run the unit tests in tests/.
Credits
This project takes inspiration from Crimson-Crow/json-numpy.