pysimdjson-schemaful

Schema-aware pysimdjson loader for efficiently parsing large JSON inputs that contain far more data than you need.

When working with external APIs that you have no control over, you may run into the following unfortunate edge case (as we did):

  • A particular endpoint responds with a relatively massive JSON body, say, ≥ 1 MB.
  • The amount of data you actually need is several orders of magnitude smaller, e.g., 1 KB.
  • There is no server-side filtering available.

In such a case, deserializing and subsequently validating the whole response can be wasteful in terms of memory, CPU time and latency, even with fast JSON deserialization libraries such as orjson.

In our particular case we needed less than 0.1% of ~5 MB responses, which we validated with pydantic. First, we compared several combinations of deserializers and validators:

  • json + pydantic v1 (Model.parse_obj(json.loads(data)))
  • orjson + pydantic v1 (Model.parse_obj(orjson.loads(data)))
  • pysimdjson + pydantic v1 (Model.parse_obj(simdjson.loads(data)))
  • pydantic v2 (Model.model_validate_json(data))

To our surprise, the internal pydantic v2 parser turned out to be ~2-3 times slower than json + pydantic v1. The fastest was orjson + pydantic v1 (~2-3 times faster than json and a bit faster than full simdjson parsing). Such a speed-up, however, still comes at a high memory cost, since a complete Python dict object is created and populated on deserialization.

Thus, we ended up using pysimdjson with its fast lazy parsing: we manually iterated over nested JSON objects/arrays and extracted only the required keys. This is, of course, ugly, tedious and hard to maintain. However, it proved to be several times faster than orjson and reduced memory consumption.

The crux

This package aims to automate the manual labour of lazy loading with pysimdjson.

Simply pass in a JSON schema, and the input data will be traversed and loaded with pysimdjson accordingly.

Supports

  • pydantic>=1,<3
  • python>=3.8,<3.12
  • simdjson>=2,<6 (with caveats)

Complex schemas are not supported (supporting them may not be very reasonable from a practical standpoint anyway), e.g.,

  • anyOf (Union[Model1, Model2])
  • ...

In such cases the underlying objects are loaded fully (not lazily).
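
To make the selection semantics concrete, here is a minimal pure-Python sketch of schema-driven pruning. It is not the package's implementation: it works on already-parsed Python objects (so nothing is lazy here), and the names resolve and project are hypothetical helpers introduced only for illustration. It assumes only local "#/..." references and returns values untouched for keywords it does not understand, such as anyOf, which mirrors the full-load fallback described above.

```python
from typing import Any, Dict, Optional


def resolve(schema: Dict[str, Any], root: Dict[str, Any]) -> Dict[str, Any]:
    """Follow local "#/a/b" references until a concrete subschema is reached."""
    while "$ref" in schema:
        ref = schema["$ref"]
        if not ref.startswith("#/"):
            raise ValueError(f"only local references are handled in this sketch: {ref}")
        node: Any = root
        for part in ref[2:].split("/"):
            node = node[part]
        schema = node
    return schema


def project(value: Any, schema: Dict[str, Any], root: Optional[Dict[str, Any]] = None) -> Any:
    """Keep only the parts of `value` that the schema describes."""
    root = schema if root is None else root
    schema = resolve(schema, root)

    if isinstance(value, dict) and schema.get("type") == "object":
        props = schema.get("properties")
        extra = schema.get("additionalProperties")
        if props is None and not isinstance(extra, dict):
            return value  # the schema gives nothing to narrow by
        out = {}
        for key, item in value.items():
            if props and key in props:
                out[key] = project(item, props[key], root)
            elif isinstance(extra, dict):
                out[key] = project(item, extra, root)
            # any other key is not described by the schema and is dropped
        return out

    if isinstance(value, list) and schema.get("type") == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            return [project(item, items, root) for item in value]

    # Scalars and unsupported keywords (e.g. anyOf): keep the value as-is.
    return value


schema = {
    "type": "array",
    "items": {"$ref": "#/definitions/Model"},
    "definitions": {
        "Model": {"type": "object", "properties": {"key": {"type": "integer"}}}
    },
}
pruned = project([{"key": 0, "other": 1}, {"missing": 2}], schema)
```

The real loader applies the same kind of selection while walking pysimdjson's lazy proxies, which is where the time and memory savings come from; the sketch only shows which keys survive.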

When to use?

  • The input JSON data is large relative to what you need from it, i.e., selectivity is low.
  • Other deserialization methods turn out to be slower and/or more memory-consuming.

If you can check all the boxes, this package may prove useful to you. Never use it as a default deserialization method: run some benchmarks for your particular case first, otherwise it may, and likely will, disappoint you.
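
A rough way to run such a benchmark with the standard library alone is sketched below. The helper name measure and the synthetic payload are assumptions made for illustration; replace the payload with a real response and pass in each loader you want to compare. Note that tracemalloc only sees allocations made through Python's allocator, so memory held in native buffers of C extensions may be undercounted.

```python
import json
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(load: Callable[[bytes], Any], payload: bytes, repeat: int = 5) -> Tuple[float, int]:
    """Return (best wall-clock seconds, peak traced bytes) for load(payload)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        load(payload)
        best = min(best, time.perf_counter() - start)

    # Measure memory in a separate run so tracing overhead does not skew timings.
    tracemalloc.start()
    try:
        load(payload)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return best, peak


# Synthetic payload: many records, of which only "id" would be needed.
payload = json.dumps([{"id": i, "blob": "x" * 200} for i in range(2000)]).encode()

baseline_seconds, baseline_peak = measure(json.loads, payload)
print(f"json.loads: {baseline_seconds * 1e3:.2f} ms, peak {baseline_peak / 1024:.0f} KiB")

# Other candidates can be measured the same way, for example:
#   measure(lambda raw: loads(raw, schema=schema), payload)
```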

Installation

pip install pysimdjson-schemaful

If you need pydantic support:

pip install "pysimdjson-schemaful[pydantic]"

Usage

Basic

import json
from simdjson_schemaful import loads

schema = {
  "type": "array",
  "items": {
    "$ref": "#/definitions/Model"
  },
  "definitions": {
    "Model": {
      "type": "object",
      "properties": {
        "key": {"type": "integer"},
      }
    }
  }
}

data = json.dumps([
    {"key": 0, "other": 1},
    {"missing": 2},
])

parsed = loads(data, schema=schema)

assert parsed == [
    {"key": 0},
    {},
]

Example with additionalProperties:

schema = {
  "type": "object",
  "additionalProperties": {
    "$ref": "#/definitions/Model",
  },
  "definitions": {
    "Model": {
      "type": "object",
      "properties": {
        "key": {"type": "integer"},
      }
    }
  }
}

data = json.dumps({
    "some": {"key": 0, "other": 1},
    "other": {"missing": 2},
})

parsed = loads(data, schema=schema)

assert parsed == {
    "some": {"key": 0},
    "other": {},
}

Reusing parser

With a re-used simdjson parser (recommended when used in a single thread; otherwise, consult the pysimdjson project on thread-safety):

from simdjson import Parser
from simdjson_schemaful import loads

# data and schema are the ones from the additionalProperties example above
parser = Parser()
parsed = loads(data, schema=schema, parser=parser)

assert parsed == {
    "some": {"key": 0},
    "other": {},
}
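
If the loader is called from several threads, one conservative option is to keep a separate parser per thread instead of sharing a single instance. The sketch below shows that pattern with threading.local. The helper name thread_local_parser is hypothetical, and whether this is sufficient for your pysimdjson version is an assumption you should check against the pysimdjson documentation.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# Each thread sees its own attributes on this object.
_local = threading.local()


def thread_local_parser(factory: Callable[[], T]) -> T:
    """Return this thread's parser, creating it with `factory` on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = factory()
        _local.parser = parser
    return parser


# Intended usage (requires pysimdjson and this package to be installed):
#   from simdjson import Parser
#   parsed = loads(data, schema=schema, parser=thread_local_parser(Parser))
```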

Pydantic v1

With model (call BaseModel.parse_raw_simdjson):

import json
from simdjson_schemaful.pydantic.v1 import BaseModel

class Model(BaseModel):
  key: int

data = json.dumps({"key": 0, "other": 1})

obj = Model.parse_raw_simdjson(data)

With type (call parse_raw_simdjson_as):

import json
from typing import List
from simdjson_schemaful.pydantic.v1 import BaseModel, parse_raw_simdjson_as

class Model(BaseModel):
  key: int

Type = List[Model]

data = json.dumps([
  {"key": 0, "other": 1},
  {"key": 1, "another": 2},
])

obj1, obj2 = parse_raw_simdjson_as(Type, data)

Pydantic v2

With model (call BaseModel.model_validate_simdjson):

import json
from simdjson_schemaful.pydantic.v2 import BaseModel

class Model(BaseModel):
  key: int

data = json.dumps({"key": 0, "other": 1})

obj = Model.model_validate_simdjson(data)

With type adapter (call TypeAdapter.validate_simdjson):

import json
from typing import List
from simdjson_schemaful.pydantic.v2 import BaseModel, TypeAdapter

class Model(BaseModel):
  key: int

adapter = TypeAdapter(List[Model])

data = json.dumps([
  {"key": 0, "other": 1},
  {"key": 1, "another": 2},
])

obj1, obj2 = adapter.validate_simdjson(data)

Benchmarks

TBD
