A library for deserializing various formats directly into numpy arrays
serde-numpy
serde-numpy is a library for efficient deserializing of various file formats directly into numpy arrays.
See how it works for image formats (JPEG, PNG) and JSON formats (JSON, MessagePack) in the sections that follow.
Installation
Currently available only for Linux with Python >= 3.7.
pip install --upgrade pip
pip install serde-numpy
Image formats
Example usage
>>> from serde_numpy import decode_jpeg, read_jpeg, decode_png, read_png
>>>
>>> img = read_jpeg("test.jpg")
>>> img
array([[[ 75,  29,  82],
        [ 96,  56, 133],
        [ 72,  47, 168],
        [ 63,  56, 179]],

       [[216, 176, 203],
        [173, 139, 190],
        [111,  93, 188],
        [129, 128, 225]],

       [[ 75,  46,  21],
        [ 73,  51,  48],
        [ 81,  73, 115],
        [157, 167, 209]],

       [[165, 142,  99],
        [181, 165, 144],
        [169, 169, 188],
        [185, 203, 222]]], dtype=uint8)
>>>
>>> byte_array = open("test.png", "rb").read()
>>> img = decode_png(byte_array)
>>> img
array([[[ 33,  47, 146],
        [206,  19, 120],
        [185,   8,  55],
        [ 33,  54, 176]],

       [[252, 156, 169],
        [169, 139, 100],
        [ 24, 128, 222],
        [136, 146, 213]],

       [[ 28,  24, 192],
        [184,  51,  58],
        [ 39,  61, 252],
        [237, 165, 113]],

       [[239, 111,  72],
        [ 30, 242,  38],
        [165, 161, 223],
        [ 91, 246, 217]]], dtype=uint8)
Benchmarks
All benchmarks were performed on an AMD Ryzen 9 3950X (Python 3.8.12, numpy 1.23.2, orjson 3.6.4). We compare serde_numpy's decode_png and decode_jpeg against Pillow's Image.open + np.asarray, which is the de facto standard for libraries that do a lot of image loading (e.g. PyTorch's torchvision).
JPEG
JPEG decoding for square images:
PNG
PNG decoding for square images:
JSON Formats
Motivation
If you've ever done something like this in your code:
import json
import numpy as np
data = json.load(open("data.json"))
arr = np.array(data["x"])
then this library does it faster by using minimal array allocations and less Python.
Speedups range from 1.5x to 8x compared to orjson + numpy, depending on array size and CPU.
Usage
The user specifies the numpy dtypes within a structure corresponding to the data that they want to deserialize.
N-dimensional array
The structure names a subset of the JSON (or MessagePack) keys and is used to initialize the NumpyDeserializer; only that subset of keys is then deserialized:
>>> import numpy as np
>>> from serde_numpy import NumpyDeserializer
>>>
>>> json_str = b"""
... {
... "name": "coordinates",
... "version": "0.1.0",
... "arr": [[1.254439975231648, -0.6893827594332794],
... [-0.2922560025562806, 0.5204819306523419]]
... }
... """
>>>
>>> structure = {
... 'name': str,
... 'arr': np.float32
... }
>>>
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>>
>>> deserializer.deserialize_json(json_str)
{'arr': array([[ 1.25444 , -0.68938273],
[-0.292256 , 0.52048194]], dtype=float32),
'name': 'coordinates'}
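For comparison, the same result can be produced with only the standard library and numpy. The sketch below is not serde-numpy's implementation: `reference_deserialize` is a hypothetical helper, written here purely for illustration, that parses the whole document with `json.loads` and then converts the requested keys. This is the slower path through intermediate Python objects that the library is meant to avoid.

```python
import json

import numpy as np


def reference_deserialize(json_bytes, structure):
    """Hypothetical pure-Python stand-in for deserialize_json (illustration only)."""
    parsed = json.loads(json_bytes)  # every value becomes a Python object first
    result = {}
    for key, spec in structure.items():
        value = parsed[key]
        if isinstance(spec, type) and issubclass(spec, np.generic):
            # numpy scalar type such as np.float32: copy the nested lists into an array
            result[key] = np.array(value, dtype=spec)
        else:
            # plain Python type such as str: keep the parsed value after a type check
            if not isinstance(value, spec):
                raise TypeError(
                    f"{key!r}: expected {spec.__name__}, got {type(value).__name__}"
                )
            result[key] = value
    return result


json_str = b"""
{
    "name": "coordinates",
    "version": "0.1.0",
    "arr": [[1.254439975231648, -0.6893827594332794],
            [-0.2922560025562806, 0.5204819306523419]]
}
"""

structure = {"name": str, "arr": np.float32}
out = reference_deserialize(json_str, structure)
print(out["name"], out["arr"].dtype, out["arr"].shape)  # coordinates float32 (2, 2)
```

Keys that are absent from the structure, such as "version", are simply not returned, which mirrors the subset behaviour described above.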
Transposed arrays
JSON data is sometimes stored row-wise rather than column-wise, so each row can contain multiple dtypes. serde-numpy lets you specify the type of each field in a row and then deserializes the rows into columns. To tell the deserializer to transpose the rows into columns, put square brackets outside either a dictionary, [{key: Type, ...}], as in this example:
>>> json_str = b"""
... {
... "df": [{"a": 3, "b": 4.23},
... {"a": 4, "b": 5.12}]
... }
... """
>>>
>>> structure = {"df": [{"a": np.uint16, "b": np.float64}]}
>>>
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>>
>>> deserializer.deserialize_json(json_str)
{'df': {'b': array([4.23, 5.12]), 'a': array([3, 4], dtype=uint16)}}
or put square brackets outside a list [[Type, ...]] of types:
>>> json_str = b"""
... {
... "df": [["i", true],
... ["j", false],
... ["k", true]]
... }
... """
>>>
>>> structure = {"df": [[str, np.bool_]]}
>>>
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>>
>>> deserializer.deserialize_json(json_str)
{'df': [['i', 'j', 'k'], array([ True, False, True])]}
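The two transposed forms can be pictured with a plain numpy sketch. The helpers below (`transpose_dict_rows`, `transpose_list_rows`, `_convert`) are hypothetical names introduced only for this illustration and operate on already-parsed Python rows; serde-numpy itself builds the columns during parsing rather than afterwards.

```python
import numpy as np


def _convert(values, spec):
    """Turn one column into an array for numpy dtypes, or keep it as a list."""
    if isinstance(spec, type) and issubclass(spec, np.generic):
        return np.array(values, dtype=spec)
    return list(values)


def transpose_dict_rows(rows, spec):
    """[{key: Type, ...}] form: list of row dicts -> dict of columns."""
    return {key: _convert([row[key] for row in rows], typ) for key, typ in spec.items()}


def transpose_list_rows(rows, types):
    """[[Type, ...]] form: list of row lists -> list of columns."""
    columns = zip(*rows)  # regroup the i-th entry of every row into one column
    return [_convert(col, typ) for col, typ in zip(columns, types)]


records = [{"a": 3, "b": 4.23}, {"a": 4, "b": 5.12}]
by_key = transpose_dict_rows(records, {"a": np.uint16, "b": np.float64})

tuples = [["i", True], ["j", False], ["k", True]]
by_position = transpose_list_rows(tuples, [str, np.bool_])
```

Here `by_key` holds one array per dictionary key and `by_position` holds one column per position in the row, matching the shapes of the two outputs shown above.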
Currently supported data formats:
- JSON: NumpyDeserializer.deserialize_json
- MessagePack: NumpyDeserializer.deserialize_msgpack
Currently supported types:
Numpy types:
- np.int8
- np.int16
- np.int32
- np.int64
- np.uint8
- np.uint16
- np.uint32
- np.uint64
- np.float32
- np.float64
- np.bool_
Python types:
- int
- float
- str
- dict
- list
Benchmarks
All benchmarks were performed on an AMD Ryzen 9 3950X (Python 3.8.12, numpy 1.23.2, orjson 3.6.4). orjson was chosen for comparison because it is the fastest library in Python JSON benchmarks, and we have also found it to be the fastest in practice.
2D Array deserialization
Two tests are performed: the number of rows is kept constant at 10 while the number of columns varies, and the number of columns is kept constant at 10 while the number of rows varies. We compare against orjson.loads + np.array with the desired data type. Results are presented below for deserializing arrays of various data types:
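A rough way to reproduce the baseline side of this setup is sketched below. It makes several assumptions: the standard-library `json` module stands in for orjson so that the snippet has no extra dependencies, `time_baseline` is a hypothetical helper, and the sizes 10, 100 and 1000 are illustrative because the sizes used in the original benchmarks are not stated here. Absolute timings will differ from the published figures.

```python
import json
import timeit

import numpy as np


def time_baseline(n_rows, n_cols, dtype=np.float64, repeats=20):
    """Best-of-N wall time in seconds for json.loads + np.array on one payload."""
    rng = np.random.default_rng(0)
    payload = json.dumps(rng.random((n_rows, n_cols)).tolist()).encode()

    def run():
        # Baseline path: parse to nested Python lists, then copy into an array
        return np.array(json.loads(payload), dtype=dtype)

    return min(timeit.repeat(run, number=1, repeat=repeats))


sizes = [10, 100, 1000]  # illustrative sizes, not those of the original benchmark
wide = {n: time_baseline(10, n) for n in sizes}  # 10 rows, varying columns
tall = {n: time_baseline(n, 10) for n in sizes}  # varying rows, 10 columns

for n in sizes:
    print(f"10 x {n}: {wide[n] * 1e6:8.1f} us    {n} x 10: {tall[n] * 1e6:8.1f} us")
```

Taking the minimum over repeats reduces noise from other processes; timing a serde-numpy deserializer on the same payloads would give the other side of the comparison.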
Transposed arrays deserialization
This test measures the speed of deserializing multiple data types that were serialized row-wise and converting them to column-wise arrays during deserialization.
File details
Details for the file serde_numpy-0.3.0.tar.gz.
File metadata
- Download URL: serde_numpy-0.3.0.tar.gz
- Upload date:
- Size: 215.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b13259b89aee493fbdbca83d965cd8969d98035f56a9d80c1d4073b5dfb9199a |
| MD5 | d748572ca459994a5822366e012bea2a |
| BLAKE2b-256 | 97887f2d47552027b03b12b450ff3e72fcfb2186fb0198333bb1e2b5ada10741 |
File details
Details for the file serde_numpy-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: serde_numpy-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 470.2 kB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0292cd15d23d9b122996e5fb3a60c46651f4cc3717a5a2d65c530d406c1684a4 |
| MD5 | d23f4302db346250a576e32a96342d71 |
| BLAKE2b-256 | d61c2674ba1ccf42f7f466a7eb129ae432ed9bf38499b7f047abcfc2d1dc75df |
File details
Details for the file serde_numpy-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: serde_numpy-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 470.4 kB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b5fffe6224230cae11784ed53e30a814029d8a3036b435cd40fa38ed5353a7c |
| MD5 | 0df924a47cd224e6493e07248f698135 |
| BLAKE2b-256 | 371d007d838c436fcb2922f9086fcd72d71c748b5d60b11b760baab834c13143 |
File details
Details for the file serde_numpy-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: serde_numpy-0.3.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 470.9 kB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f3d1e44ded070e2c6144e9cfd30afb9904ae9b2fe04e7355cf3d5f9d8245d498 |
| MD5 | 5891478df1bc9bafde768c08501656b5 |
| BLAKE2b-256 | aed8a930c7f3e21c0d44431edbb8f145941a6860a420c4fe2eeb4606b949a9d7 |
File details
Details for the file serde_numpy-0.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: serde_numpy-0.3.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 471.0 kB
- Tags: CPython 3.7m, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ffc8f0a6256e085b7957a075ae28a56ceb8428569b913e54092ae3d86a9ef814 |
| MD5 | a413af579658b370796fc6bd74f60acc |
| BLAKE2b-256 | f6edc14b901f3ab1ce92e7aa1dde60b83b738e04209398ae63f6664a81f0a1fd |