Turbo Broccoli 🥦
JSON (de)serialization extensions, originally aimed at numpy and tensorflow
objects, but now supporting a wide range of objects.
Installation
pip install turbo-broccoli
Usage
To/from string
import numpy as np
import turbo_broccoli as tb
obj = {"an_array": np.array([[1, 2], [3, 4]], dtype="float32")}
tb.to_json(obj)
produces the following string (modulo indentation and the value of
$.an_array.data.data):
{
  "an_array": {
    "__type__": "numpy.ndarray",
    "__version__": 5,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "data": "QAAAAAAAAAB7ImRhd..."
    }
  }
}
For deserialization, simply use
tb.from_json(json_string)
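Since every extended value is wrapped in an object carrying __type__ and __version__ keys, a document can be inspected with the standard json module alone. The sketch below is not part of Turbo Broccoli: find_typed_nodes is a hypothetical helper, and the short base64 payload is made up for illustration.

```python
import json


def find_typed_nodes(node, path="$"):
    """Yield (path, type name, version) for every object tagged with __type__.

    Hypothetical helper, not a Turbo Broccoli function.
    """
    if isinstance(node, dict):
        if "__type__" in node:
            yield path, node["__type__"], node.get("__version__")
        for key, value in node.items():
            # Skip the metadata keys themselves, descend into everything else
            if not key.startswith("__"):
                yield from find_typed_nodes(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_typed_nodes(value, f"{path}[{index}]")


# Same layout as the document above, with a made-up payload
doc = """
{
  "an_array": {
    "__type__": "numpy.ndarray",
    "__version__": 5,
    "data": {"__type__": "bytes", "__version__": 3, "data": "QUJD"}
  }
}
"""
nodes = list(find_typed_nodes(json.loads(doc)))
for entry in nodes:
    print(entry)
```

This only reads the metadata. Turning the payload back into an array still requires tb.from_json.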
To/from file
Simply replace turbo_broccoli.to_json and turbo_broccoli.from_json with
turbo_broccoli.save_json and turbo_broccoli.load_json:
import numpy as np
import turbo_broccoli as tb
obj = {"an_array": np.array([[1, 2], [3, 4]], dtype="float32")}
tb.save_json(obj, "foo/bar/foobar.json")
...
obj = tb.load_json("foo/bar/foobar.json")
Contexts
The behaviour of turbo_broccoli.to_json and turbo_broccoli.from_json can be
tweaked by using contexts. For example, to set an encryption/decryption key
for secret types:
import nacl.secret
import nacl.utils
import turbo_broccoli as tb
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
doc = tb.to_json(obj, ctx)
...
obj = tb.from_json(doc, ctx)
The behaviour of turbo_broccoli.save_json and turbo_broccoli.load_json can be
tweaked in a similar manner, but for convenience, the arguments of the context
are passed directly to the function:
import nacl.secret
import nacl.utils
import turbo_broccoli as tb
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
tb.save_json(obj, "foo/bar/foobar.json", nacl_shared_key=key)
See the documentation.
Artifacts
If an object inside obj is too large to be embedded inside the JSON file
(e.g. a large numpy array), then an artifact file is created:
import numpy as np
import turbo_broccoli as tb
obj = {"an_array": np.random.rand(1000, 1000)}
tb.save_json(obj, "foo/bar/foobar.json")
produces the JSON file
{
  "an_array": {
    "__type__": "numpy.ndarray",
    "__version__": 5,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "id": "1e6dff28-5e26-44df-9e7a-75bc726ce9aa"
    }
  }
}
and a file foo/bar/foobar.1e6dff28-5e26-44df-9e7a-75bc726ce9aa.tb containing
the array data. The artifact directory can be explicitly specified by setting
it in the serialization context or by setting the TB_ARTIFACT_PATH environment
variable (see below). The code for loading the JSON file does not change:
obj = tb.load_json("foo/bar/foobar.json")
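The artifact name in the example above appears to combine the JSON file's stem, the id stored in the document, and a .tb suffix. A minimal sketch of that naming pattern follows; artifact_path_for is a hypothetical helper that only mirrors this one example, and the real naming may differ, in particular when an artifact directory is set explicitly.

```python
from pathlib import Path


def artifact_path_for(json_path, artifact_id):
    """Guess the artifact path next to a JSON file (hypothetical helper).

    Mirrors the pattern <stem>.<id>.tb observed in the example above.
    """
    json_path = Path(json_path)
    return json_path.with_name(f"{json_path.stem}.{artifact_id}.tb")


path = artifact_path_for(
    "foo/bar/foobar.json", "1e6dff28-5e26-44df-9e7a-75bc726ce9aa"
)
print(path.as_posix())
```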
If using turbo_broccoli.to_json, since there is no output file path specified,
the artifacts are stored in a temporary directory instead:
import numpy as np
import turbo_broccoli as tb
obj = {"an_array": np.random.rand(1000, 1000)}
doc = tb.to_json(obj)
# An artifact has been created somewhere in e.g. /tmp
Since no information about this directory is stored in the output JSON string,
it is not possible to load doc using turbo_broccoli.from_json.
If deserialization is necessary, instantiate a context:
import numpy as np
import turbo_broccoli as tb
ctx = tb.Context()
obj = {"an_array": np.random.rand(1000, 1000)}
doc = tb.to_json(obj, ctx)
# An artifact has been created in ctx.artifact_path
...
obj = tb.from_json(doc, ctx)
Environment variables
Some behaviors of Turbo Broccoli can be tweaked by setting specific environment
variables. If you want to modify these parameters programmatically, do not do
so by modifying os.environ. Rather, use a turbo_broccoli.Context.
- TB_ARTIFACT_PATH (default: output JSON file's parent directory): During serialization, Turbo Broccoli may create artifacts to which the JSON object will point. The artifacts will be stored in TB_ARTIFACT_PATH if specified.
- TB_KERAS_FORMAT (default: tf, valid values are keras, tf, and h5): The serialization format for keras models. If h5 or tf is used, an artifact following said format will be created in TB_ARTIFACT_PATH. If json is used, the model will be contained in the JSON document (although the weights may be in artifacts if they are too large).
- TB_MAX_NBYTES (default: 8000): The maximum byte size of a python object beyond which serialization will produce an artifact instead of storing it in the JSON document. This does not limit the size of the overall JSON document though. 8000 bytes should be enough for a numpy array of 1000 float64s to be stored in-document.
- TB_NODECODE (default: empty): Comma-separated list of types to not deserialize, for example bytes,numpy.ndarray. Excludable types are:
  - bokeh, bokeh.buffer, bokeh.generic,
  - bytes,
  - collections, collections.deque, collections.namedtuple, collections.set,
  - dataclass, dataclass.<dataclass_name> (case sensitive),
  - datetime, datetime.datetime, datetime.time, datetime.timedelta,
  - generic,
  - keras, keras.model, keras.layer, keras.loss, keras.metric, keras.optimizer,
  - numpy, numpy.ndarray, numpy.number, numpy.dtype, numpy.random_state,
  - pandas, pandas.dataframe, pandas.series. Warning: excluding pandas.dataframe will also exclude pandas.series,
  - pytorch, pytorch.tensor, pytorch.module,
  - scipy, scipy.csr_matrix,
  - secret,
  - sklearn, sklearn.estimator, sklearn.estimator.<estimator name> (case sensitive, see the list of supported sklearn estimators below),
  - tensorflow, tensorflow.sparse_tensor, tensorflow.tensor, tensorflow.variable.
- TB_SHARED_KEY (default: empty): Secret key used to encrypt/decrypt secrets. The encryption uses pynacl's SecretBox. An exception is raised when attempting to serialize a secret type while no key is set.
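The TB_MAX_NBYTES arithmetic can be checked by hand: a float64 takes 8 bytes, so 1000 of them fit exactly in the default limit, while the 1000 x 1000 array from the Artifacts section does not. In the sketch below, exceeds_limit is a hypothetical helper, and the strict comparison is an assumption based on the wording "beyond which".

```python
import struct

DEFAULT_MAX_NBYTES = 8000  # default value of TB_MAX_NBYTES listed above
FLOAT64_SIZE = struct.calcsize("<d")  # size of one float64, in bytes


def exceeds_limit(nbytes, max_nbytes=DEFAULT_MAX_NBYTES):
    """True if a payload of nbytes would go to an artifact.

    Hypothetical helper; the strict '>' is an assumption.
    """
    return nbytes > max_nbytes


small = 1000 * FLOAT64_SIZE         # array of 1000 float64s
large = 1000 * 1000 * FLOAT64_SIZE  # 1000 x 1000 array of float64s

print(small, exceeds_limit(small))
print(large, exceeds_limit(large))
```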
Guarded blocks
This is so cool. Check out turbo_broccoli.GuardedBlockHandler.
Supported types
Basic types
- Collections: collections.deque, collections.namedtuple
- Dataclasses: serialization is straightforward:
  @dataclass
  class C:
      a: int
      b: str
  doc = tb.to_json({"c": C(a=1, b="Hello")})
  For deserialization, first register the class:
  ctx = tb.Context(dataclass_types=[C])
  tb.from_json(doc, ctx)
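Registration is needed because a JSON document can only carry the name of a class, not the class itself, so the decoder needs a way to map that name back to a type. The following stdlib-only sketch illustrates the idea. The envelope layout, encode and decode are all assumptions made for this example and are not Turbo Broccoli's actual format or API. Only the dataclass.<dataclass_name> tag is borrowed from the TB_NODECODE list.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class C:
    a: int
    b: str


def encode(obj):
    """Wrap a dataclass instance in a hypothetical tagged envelope."""
    return {"__type__": "dataclass." + type(obj).__name__, "data": asdict(obj)}


def decode(node, registered_types):
    """Rebuild an instance, looking the class up by name among registered types."""
    by_name = {cls.__name__: cls for cls in registered_types}
    name = node["__type__"].split(".", 1)[1]
    if name not in by_name:
        raise KeyError(f"dataclass {name!r} is not registered")
    return by_name[name](**node["data"])


text = json.dumps(encode(C(a=1, b="Hello")))
restored = decode(json.loads(text), registered_types=[C])
print(restored)
```

Calling decode with an empty registered_types list raises KeyError, which is the situation that tb.Context(dataclass_types=[C]) avoids.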
Generic objects
Serialization only. A generic object is an object that has the
__turbo_broccoli__ attribute. This attribute is expected to be a list of
attribute names whose values will be serialized. For example,
class C:
    __turbo_broccoli__ = ["a", "b"]
    a: int
    b: int
    c: int
x = C()
x.a, x.b, x.c = 42, 43, 44
tb.to_json(x)
produces the following string:
{"a": 42, "b": 43}
Registered attributes can of course have any type supported by Turbo Broccoli,
such as numpy arrays. Registered attributes can be @property methods.
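The selection rule can be pictured with plain Python: only the names listed in __turbo_broccoli__ are read, and a property is read like any other attribute. This is an illustration of the described behaviour, not the library's implementation. Rectangle and registered_attributes are hypothetical names.

```python
import json


class Rectangle:
    # Only these names are picked up; "label" is deliberately left out
    __turbo_broccoli__ = ["width", "height", "area"]

    def __init__(self, width, height, label):
        self.width = width
        self.height = height
        self.label = label

    @property
    def area(self):
        return self.width * self.height


def registered_attributes(obj):
    """Collect the values of the attributes named in __turbo_broccoli__."""
    return {name: getattr(obj, name) for name in obj.__turbo_broccoli__}


selected = registered_attributes(Rectangle(2, 3, "demo"))
text = json.dumps(selected)
print(text)
```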
Keras
- standard subclasses of keras.layers.Layer, keras.losses.Loss, keras.metrics.Metric, and keras.optimizers.Optimizer.
Numpy
numpy.number, numpy.ndarray with numerical dtype, and numpy.dtype.
Pandas
pandas.DataFrame and pandas.Series, but with the following limitations:
- the following dtypes are not supported: complex, object, timedelta
- the column / series names cannot be ints or int-strings. The following are not acceptable:
  df = pd.DataFrame([[1, 2], [3, 4]])
  df = pd.DataFrame([[1, 2], [3, 4]], columns=["0", "1"])
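One way to stay within the naming limitation is to rewrite offending names before building the DataFrame. The sketch below uses only the standard library. is_int_like and safe_names are hypothetical helpers, and treating "int-string" as "any string that int() can parse" is an interpretation, not something the documentation states.

```python
def is_int_like(name):
    """True for ints and for strings that int() can parse (assumed meaning)."""
    if isinstance(name, int):
        return True
    if isinstance(name, str):
        try:
            int(name)
        except ValueError:
            return False
        return True
    return False


def safe_names(names, prefix="col_"):
    """Prefix int-like names so they no longer look like integers."""
    return [f"{prefix}{name}" if is_int_like(name) else name for name in names]


print(safe_names([0, 1]))
print(safe_names(["0", "price"]))
```

The resulting list can then be passed as the columns argument of pd.DataFrame.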
Tensorflow
tensorflow.Tensor with numerical dtype, but not tensorflow.RaggedTensor.
Pytorch
- torch.Tensor. Warning: loaded tensors are automatically placed on the CPU and gradients are lost;
- torch.nn.Module, don't forget to register your module type using a turbo_broccoli.Context:
  # Serialization
  class MyModule(torch.nn.Module):
      ...
  module = MyModule()  # Must be instantiable without arguments
  doc = tb.to_json({"module": module})
  # Deserialization
  ctx = tb.Context(pytorch_module_types=[MyModule])
  module = tb.from_json(doc, ctx)
  Warning: It is not possible to register and deserialize standard pytorch module containers directly. Wrap them in your own custom module class. The following is not acceptable:
  import turbo_broccoli as tb
  import torch
  module = torch.nn.Sequential(
      torch.nn.Linear(4, 2),
      torch.nn.ReLU(),
      torch.nn.Linear(2, 1),
      torch.nn.ReLU(),
  )
  obj = {"module": module}
  doc = tb.to_json(obj)  # works, but...
  tb.from_json(doc, ctx)  # doesn't work
  but the following works:
  class MyModule(torch.nn.Module):
      module: torch.nn.Sequential  # Wrapped sequential
      def __init__(self):
          super().__init__()
          self.module = torch.nn.Sequential(
              torch.nn.Linear(4, 2),
              torch.nn.ReLU(),
              torch.nn.Linear(2, 1),
              torch.nn.ReLU(),
          )
      ...
  module = MyModule()  # Must be instantiable without arguments
  doc = tb.to_json({"module": module})
  ctx = tb.Context(pytorch_module_types=[MyModule])
  module = tb.from_json(doc, ctx)
Scipy
Just scipy.sparse.csr_matrix. ^^"
Scikit-learn
sklearn estimators (i.e. that inherit from sklearn.base.BaseEstimator).
Supported estimators are: AdaBoostClassifier, AdaBoostRegressor,
AdditiveChi2Sampler, AffinityPropagation, AgglomerativeClustering,
ARDRegression, BayesianGaussianMixture, BayesianRidge, BernoulliNB,
BernoulliRBM, Binarizer, CategoricalNB, CCA, ClassifierChain,
ComplementNB, DBSCAN, DecisionTreeClassifier, DecisionTreeRegressor,
DictionaryLearning, ElasticNet, EllipticEnvelope, EmpiricalCovariance,
ExtraTreeClassifier, ExtraTreeRegressor, ExtraTreesClassifier,
ExtraTreesRegressor, FactorAnalysis, FeatureUnion, GaussianMixture,
GaussianNB, GaussianRandomProjection, GraphicalLasso, HuberRegressor,
IncrementalPCA, IsolationForest, Isomap, KernelCenterer,
KernelDensity, KernelPCA, KernelRidge, KMeans, KNeighborsClassifier,
KNeighborsRegressor, KNNImputer, LabelBinarizer, LabelEncoder,
LabelPropagation, LabelSpreading, Lars, Lasso, LassoLars,
LassoLarsIC, LatentDirichletAllocation, LedoitWolf,
LinearDiscriminantAnalysis, LinearRegression, LinearSVC, LinearSVR,
LocallyLinearEmbedding, LocalOutlierFactor, LogisticRegression,
MaxAbsScaler, MDS, MeanShift, MinCovDet, MiniBatchDictionaryLearning,
MiniBatchKMeans, MiniBatchSparsePCA, MinMaxScaler, MissingIndicator,
MLPClassifier, MLPRegressor, MultiLabelBinarizer, MultinomialNB,
MultiOutputClassifier, MultiOutputRegressor, MultiTaskElasticNet,
MultiTaskLasso, NearestCentroid, NearestNeighbors,
NeighborhoodComponentsAnalysis, NMF, Normalizer, NuSVC, NuSVR,
Nystroem, OAS, OneClassSVM, OneVsOneClassifier, OneVsRestClassifier,
OPTICS, OrthogonalMatchingPursuit, PassiveAggressiveRegressor, PCA,
Pipeline, PLSCanonical, PLSRegression, PLSSVD, PolynomialCountSketch,
PolynomialFeatures, PowerTransformer, QuadraticDiscriminantAnalysis,
QuantileRegressor, QuantileTransformer, RadiusNeighborsClassifier,
RadiusNeighborsRegressor, RandomForestClassifier, RandomForestRegressor,
RANSACRegressor, RBFSampler, RegressorChain, RFE, RFECV, Ridge,
RidgeClassifier, RobustScaler, SelectFromModel, SelfTrainingClassifier,
SGDRegressor, ShrunkCovariance, SimpleImputer, SkewedChi2Sampler,
SparsePCA, SparseRandomProjection, SpectralBiclustering,
SpectralClustering, SpectralCoclustering, SpectralEmbedding,
StackingClassifier, StackingRegressor, StandardScaler, SVC, SVR,
TheilSenRegressor, TruncatedSVD, TSNE, VarianceThreshold,
VotingClassifier, VotingRegressor.
Doesn't work with:
- All CV classes because the score_ attribute is a dict indexed with np.int64, which json.JSONEncoder._iterencode_dict rejects.
- Everything that is parametrized by an arbitrary object/callable/estimator: FunctionTransformer, TransformedTargetRegressor.
- Other classes that have non JSON-serializable attributes:
  Class | Non-serializable attr. |
  ---|---|
  Birch | _CFNode |
  BisectingKMeans | function |
  ColumnTransformer | slice |
  GammaRegressor | HalfGammaLoss |
  GaussianProcessClassifier | Product |
  GaussianProcessRegressor | Sum |
  IsotonicRegression | interp1d |
  OutputCodeClassifier | _ConstantPredictor |
  Perceptron | Hinge |
  PoissonRegressor | HalfPoissonLoss |
  SGDClassifier | Hinge |
  SGDOneClassSVM | Hinge |
  SplineTransformer | BSpline |
  TweedieRegressor | HalfTweedieLossIdentity |
- Other errors:
  - FastICA: I'm not sure why...
  - BaggingClassifier: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.
  - GradientBoostingClassifier, GradientBoostingRegressor, RandomTreesEmbedding, KBinsDiscretizer: Exception: dtype object is not covered.
  - HistGradientBoostingClassifier: Problems with deserialization of _BinMapper object?
  - PassiveAggressiveClassifier: some unknown label type error...
  - SequentialFeatureSelector: Problem with the unit test itself ^^"
  - KNeighborsTransformer: A serialized-deserialized instance seems to fit_transform an array to a sparse matrix whereas the original object returns an array?
  - RadiusNeighborsTransformer: Inverse problem from KNeighborsTransformer.
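The np.int64 key problem mentioned for the CV classes can be reproduced without numpy. The standard JSON encoder only accepts str, int, float, bool or None as dictionary keys, and np.int64 is not a subclass of int. In the sketch below, Int64Like is a made-up stand-in for np.int64, and converting the keys with int() is shown as one possible workaround, not as something Turbo Broccoli does.

```python
import json


class Int64Like:
    """Stand-in for numpy.int64: convertible to int, but not an int subclass."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value


scores = {Int64Like(0): 0.91, Int64Like(1): 0.87}

try:
    json.dumps(scores)
    rejected = False
except TypeError:
    # The encoder refuses keys that are not str, int, float, bool or None
    rejected = True

# Converting the keys to built-in ints makes the dict encodable
plain = {int(key): value for key, value in scores.items()}
text = json.dumps(plain)
print(rejected, text)
```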
Bokeh
Secrets
Basic Python types can be wrapped in their corresponding secret type according to the following table:
Python type | Secret type |
---|---|
dict | turbo_broccoli.SecretDict |
float | turbo_broccoli.SecretFloat |
int | turbo_broccoli.SecretInt |
list | turbo_broccoli.SecretList |
str | turbo_broccoli.SecretStr |
The secret value can be recovered with the get_secret_value method. At
serialization, this value will be encrypted. For example,
# See https://pynacl.readthedocs.io/en/latest/secret/#key
import nacl.secret
import nacl.utils
import turbo_broccoli as tb
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"user": "alice", "password": tb.SecretStr("dolphin")}
tb.to_json(obj, ctx)
produces the following string (modulo indentation and modulo the encrypted content):
{
  "user": "alice",
  "password": {
    "__type__": "secret",
    "__version__": 2,
    "data": {
      "__type__": "bytes",
      "__version__": 3,
      "data": "gbRXF3hq9Q9hIQ9Xz+WdGKYP5meJ4eTmlFt0r0Ov3PV64065plk6RqsFUcynSOqHzA=="
    }
  }
}
Deserialization decrypts the secrets, but they stay wrapped inside the secret
types above. If the wrong key is provided, an exception is raised. If no key is
provided, the secret values are replaced by a turbo_broccoli.LockedSecret.
Internally, Turbo Broccoli uses pynacl's SecretBox.
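The size of the encrypted payload in the example above can be sanity-checked with the standard library. A message produced by PyNaCl's SecretBox.encrypt consists of a 24-byte nonce followed by the ciphertext, which includes a 16-byte authenticator. The sketch assumes that the data field stores exactly such a message, base64-encoded. This is an inference from the example, not a documented guarantee.

```python
import base64

NONCE_SIZE = 24  # same value as nacl.secret.SecretBox.NONCE_SIZE
MAC_SIZE = 16    # same value as nacl.secret.SecretBox.MACBYTES

# Encrypted payload copied from the example document above
payload = base64.b64decode(
    "gbRXF3hq9Q9hIQ9Xz+WdGKYP5meJ4eTmlFt0r0Ov3PV64065plk6RqsFUcynSOqHzA=="
)

# Assumed layout: nonce first, then authenticator and encrypted bytes
nonce, boxed = payload[:NONCE_SIZE], payload[NONCE_SIZE:]
plaintext_size = len(boxed) - MAC_SIZE

print(len(payload), len(nonce), plaintext_size)
```

Under that assumption the plaintext is 9 bytes long, which would be consistent with the 7-character password stored together with its JSON quotes, although the example alone cannot confirm this.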
Warning: In the case of SecretDict and SecretList, the values contained
within must be JSON-serializable without Turbo Broccoli. The following is
not acceptable:
import nacl.secret
import nacl.utils
import numpy as np
import turbo_broccoli as tb
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
ctx = tb.Context(nacl_shared_key=key)
obj = {"data": tb.SecretList([np.array([1, 2, 3])])}
tb.to_json(obj, ctx)
See also the TB_SHARED_KEY environment variable above.
Contributing
Dependencies
- python3.10 or newer;
- requirements.txt for runtime dependencies;
- requirements.dev.txt for development dependencies;
- make (optional).
Simply run
virtualenv venv -p python3.10
. ./venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements.dev.txt
Documentation
Simply run
make docs
This will generate the HTML doc of the project, and the index file should be at
docs/index.html. To have it directly in your browser, run
make docs-browser
Code quality
Don't forget to run
make
to format the code following black, typecheck it using mypy, and check it against coding standards using pylint.
Unit tests
Run
make test
to have pytest run the unit tests in tests/.
Credits
This project takes inspiration from Crimson-Crow/json-numpy.