A library for deserializing various formats directly into numpy arrays

serde-numpy

serde-numpy is a library for efficiently deserializing various file formats directly into numpy arrays.

See how it works for image formats and JSON formats in the sections below.

Installation

Currently only available for Linux with Python >= 3.7.

pip install --upgrade pip
pip install serde-numpy

Image formats

Example usage

>>> from serde_numpy import decode_jpeg, read_jpeg, decode_png, read_png
>>> 
>>> img = read_jpeg("test.jpg")
>>> img
array([[[ 75,  29,  82],
        [ 96,  56, 133],
        [ 72,  47, 168],
        [ 63,  56, 179]],

       [[216, 176, 203],
        [173, 139, 190],
        [111,  93, 188],
        [129, 128, 225]],

       [[ 75,  46,  21],
        [ 73,  51,  48],
        [ 81,  73, 115],
        [157, 167, 209]],

       [[165, 142,  99],
        [181, 165, 144],
        [169, 169, 188],
        [185, 203, 222]]], dtype=uint8)
>>> 
>>> byte_array = open("test.png", "rb").read()
>>> img = decode_png(byte_array)
>>> img
array([[[ 33,  47, 146],
        [206,  19, 120],
        [185,   8,  55],
        [ 33,  54, 176]],

       [[252, 156, 169],
        [169, 139, 100],
        [ 24, 128, 222],
        [136, 146, 213]],

       [[ 28,  24, 192],
        [184,  51,  58],
        [ 39,  61, 252],
        [237, 165, 113]],

       [[239, 111,  72],
        [ 30, 242,  38],
        [165, 161, 223],
        [ 91, 246, 217]]], dtype=uint8)

Benchmarks

All benchmarks were performed on an AMD Ryzen 9 3950X (Python 3.8.12, numpy 1.23.2, orjson 3.6.4). We compare serde_numpy's decode_png and decode_jpeg against Pillow's Image.open + np.asarray, which is the de facto standard for libraries that do a lot of image loading (e.g. PyTorch's torchvision).
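For reference, the Pillow baseline named above can be sketched as follows. This is only an illustration of the Image.open + np.asarray pattern, not the benchmark script used for the results; the helper name pillow_decode and the in-memory 4x4 test image are assumptions made for this example.

```python
import io

import numpy as np
from PIL import Image


def pillow_decode(data: bytes) -> np.ndarray:
    """Decode encoded image bytes with Pillow, then convert to a numpy array."""
    with Image.open(io.BytesIO(data)) as img:
        return np.asarray(img)


# Build a small 4x4 RGB PNG in memory so the example needs no file on disk.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
buffer = io.BytesIO()
Image.fromarray(pixels).save(buffer, format="PNG")
png_bytes = buffer.getvalue()

decoded = pillow_decode(png_bytes)
print(decoded.shape, decoded.dtype)
```

The same byte string could be passed to decode_png to compare the two paths on identical input.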

JPEG

JPEG decoding for square images:

[Figure: JPEG decoding benchmark for square images]

PNG

PNG decoding for square images:

[Figure: PNG decoding benchmark for square images]

JSON Formats

Motivation

If you've ever done something like this in your code:

data = json.load(open("data.json"))

arr = np.array(data["x"])

then this library does the same thing faster, with fewer array allocations and less time spent in Python.

Speedups range from 1.5x to 8x compared to orjson + numpy, depending on array size and CPU.

Usage

The user specifies the numpy dtypes within a structure corresponding to the data that they want to deserialize.

N-dimensional array

The structure names a subset of the JSON (or MessagePack) keys and is used to initialize the NumpyDeserializer. Only that subset of keys is then deserialized, each with the type given in the structure:

>>> import numpy as np
>>> from serde_numpy import NumpyDeserializer
>>> 
>>> json_str = b"""
... {
...     "name": "coordinates",
...     "version": "0.1.0",
...     "arr": [[1.254439975231648, -0.6893827594332794],
...             [-0.2922560025562806, 0.5204819306523419]]
... }
... """
>>> 
>>> structure = {
...     'name': str,
...     'arr': np.float32
... }
>>> 
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>> 
>>> deserializer.deserialize_json(json_str)
{'arr': array([[ 1.25444   , -0.68938273],
               [-0.292256  ,  0.52048194]], dtype=float32), 
 'name': 'coordinates'}
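To make the semantics of the structure concrete, here is a rough pure-Python equivalent built on json + numpy. It is a sketch of what the result looks like, not how serde-numpy is implemented, and it only handles a flat structure. The function name reference_deserialize is hypothetical.

```python
import json

import numpy as np


def reference_deserialize(structure: dict, payload: bytes) -> dict:
    """Slow reference: keep only the keys in `structure`, cast each as requested."""
    data = json.loads(payload)
    out = {}
    for key, typ in structure.items():
        value = data[key]
        if issubclass(typ, np.generic):
            # numpy scalar type: build an array of that dtype from nested lists
            out[key] = np.array(value, dtype=typ)
        else:
            # plain Python type such as str, int or float
            out[key] = typ(value)
    return out


payload = b"""
{
    "name": "coordinates",
    "version": "0.1.0",
    "arr": [[1.254439975231648, -0.6893827594332794],
            [-0.2922560025562806, 0.5204819306523419]]
}
"""
result = reference_deserialize({"name": str, "arr": np.float32}, payload)
print(result)
```

Note that "version" is absent from the result because it does not appear in the structure.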

Transposed arrays

Sometimes data is stored in JSON row-wise rather than column-wise, so each row can contain multiple dtypes. serde-numpy lets you specify the type of each field in a row and then deserializes the rows into columns. To tell the deserializer that you want the data transposed into columns, put square brackets around a dictionary of types, [{key: Type, ...}], as in this example:

>>> json_str = b"""
... {
...     "df": [{"a": 3, "b": 4.23},
...            {"a": 4, "b": 5.12}]
... }
... """
>>> 
>>> structure = {"df": [{"a": np.uint16, "b": np.float64}]}
>>> 
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>> 
>>> deserializer.deserialize_json(json_str)
{'df': {'b': array([4.23, 5.12]), 'a': array([3, 4], dtype=uint16)}}

Alternatively, when each row is itself a list, put square brackets around a list of types, [[Type, ...]]:

>>> json_str = b"""
... {
...     "df": [["i", true],
...            ["j", false],
...            ["k", true]]
... }
... """
>>> 
>>> structure = {"df": [[str, np.bool_]]}
>>> 
>>> deserializer = NumpyDeserializer.from_dict(structure)
>>> 
>>> deserializer.deserialize_json(json_str)
{'df': [['i', 'j', 'k'], array([ True, False,  True])]}
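The row-to-column transposition in the two examples above can be mimicked with plain json + numpy. The sketch below illustrates only the output layout, not the library's implementation. The helper names to_column, transpose_records and transpose_tuples are hypothetical.

```python
import json

import numpy as np


def to_column(values: list, typ: type):
    """Return a numpy array for numpy scalar types, else a plain Python list."""
    if issubclass(typ, np.generic):
        return np.array(values, dtype=typ)
    return [typ(v) for v in values]


def transpose_records(rows: list, types: dict) -> dict:
    """[{key: value, ...}, ...] -> {key: column, ...}"""
    return {key: to_column([row[key] for row in rows], typ)
            for key, typ in types.items()}


def transpose_tuples(rows: list, types: list) -> list:
    """[[v0, v1, ...], ...] -> [column0, column1, ...]"""
    return [to_column([row[i] for row in rows], typ)
            for i, typ in enumerate(types)]


records = json.loads(b'{"df": [{"a": 3, "b": 4.23}, {"a": 4, "b": 5.12}]}')
by_key = transpose_records(records["df"], {"a": np.uint16, "b": np.float64})
print(by_key)

tuples = json.loads(b'{"df": [["i", true], ["j", false], ["k", true]]}')
by_index = transpose_tuples(tuples["df"], [str, np.bool_])
print(by_index)
```

This version materializes every row as Python objects before building the columns, which is the intermediate work that deserializing directly into arrays is meant to avoid.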

Currently supported data formats:

  • JSON :: NumpyDeserializer.deserialize_json
  • MessagePack :: NumpyDeserializer.deserialize_msgpack

Currently supported types:

Numpy types:

  • np.int8
  • np.int16
  • np.int32
  • np.int64
  • np.uint8
  • np.uint16
  • np.uint32
  • np.uint64
  • np.float32
  • np.float64
  • np.bool_

Python types:

  • int
  • float
  • str
  • dict
  • list

Benchmarks

All benchmarks were performed on an AMD Ryzen 9 3950X (Python 3.8.12, numpy 1.23.2, orjson 3.6.4). orjson was selected as the comparison because it is the fastest on Python JSON benchmarks and we have also found it to be the fastest in practice.

2D Array deserialization

Two tests are performed. In the first, the number of rows is held constant at 10 while the number of columns varies. In the second, the number of columns is held constant at 10 while the number of rows varies. We compare against orjson.loads + np.array with the desired data type. Results are presented below for deserializing arrays of various data types:

[Figure: 2D array deserialization benchmark]
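A minimal sketch of how the baseline side of this test could be timed is shown below. It is not the benchmark script behind the results. It uses the standard-library json module as a stand-in for orjson so that it runs without extra dependencies, and the helper names make_payload, baseline and time_baseline are hypothetical.

```python
import json
import timeit

import numpy as np


def make_payload(rows: int, cols: int, dtype=np.float64) -> bytes:
    """Serialize a random (rows, cols) array as JSON under the key "arr"."""
    rng = np.random.default_rng(0)
    arr = (rng.random((rows, cols)) * 100).astype(dtype)
    return json.dumps({"arr": arr.tolist()}).encode()


def baseline(payload: bytes, dtype=np.float64) -> np.ndarray:
    """Parse to Python lists first, then copy into an array of the desired dtype."""
    return np.array(json.loads(payload)["arr"], dtype=dtype)


def time_baseline(rows: int, cols: int, dtype=np.float64, repeats: int = 20) -> float:
    """Average seconds per call of the baseline on one payload."""
    payload = make_payload(rows, cols, dtype)
    return timeit.timeit(lambda: baseline(payload, dtype), number=repeats) / repeats


for n in (10, 100, 1000):
    wide = time_baseline(10, n)   # 10 rows, n columns
    tall = time_baseline(n, 10)   # n rows, 10 columns
    print(f"n={n:>5}  10 x n: {wide * 1e6:8.1f} us   n x 10: {tall * 1e6:8.1f} us")
```

To compare against serde-numpy, the same payloads could be passed to a NumpyDeserializer built from {"arr": dtype} and timed in the same way.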

Transposed arrays deserialization

This test measures the speed of deserializing multiple data types that have been serialized row-wise and converting them to column-wise arrays during deserialization.

[Figure: transposed array deserialization benchmark]
