simdjson bindings for python
pysimdjson
Quick-n'dirty Python bindings for simdjson, written to see whether going down this path yields parse-time improvements in real-world applications. So far the results are promising, especially when only part of a document is of interest.
These bindings are currently tested only on OS X and Windows. They should work everywhere simdjson does, although you'll probably have to tweak your build flags.
See the latest documentation at http://pysimdjson.tkte.ch.
Installation
There are binary wheels available for py3.6/py3.7 on OS X 10.12 & Windows. On other platforms you'll need a C++17-capable compiler.
pip install pysimdjson
or from source:
git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install
Example
import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())
However, this doesn't really gain you that much over, say, ujson. You're still loading the entire document and converting the entire thing into a series of Python objects which is very expensive. You can instead use items() to pull only part of a document into Python.
Example document:
{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}
And now let's try some queries:
import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

    pj.items('.type')  #> "search_results"
    pj.items('.count')  #> 2
    pj.items('.results[].username')  #> ["bob", "tod"]
    pj.items('.error.message')  #> "All good captain"
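For comparison, here is a minimal sketch of the same four lookups using only the built-in json module. It is not pysimdjson's API, and it reads the example document from an inline string (an assumption made so the snippet is self-contained) instead of sample.json. The whole document is converted to Python objects before anything is selected, which is the cost items() is meant to avoid.

```python
import json

SAMPLE = """
{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}
"""

# json.loads() converts every value in the document to a Python object.
doc = json.loads(SAMPLE)

doc_type = doc["type"]                               # same value as '.type'
count = doc["count"]                                 # same value as '.count'
usernames = [r["username"] for r in doc["results"]]  # '.results[].username'
message = doc["error"]["message"]                    # '.error.message'

print(doc_type, count, usernames, message)
```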
AVX2
simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:
- OS X:
sysctl -a | grep machdep.cpu.leaf7_features
- Linux:
grep avx2 /proc/cpuinfo
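The Linux check can also be done from Python before importing the bindings. This is a rough sketch using only the standard library: the helper names cpuinfo_has_avx2 and linux_has_avx2 are made up for illustration, and it assumes the usual x86 /proc/cpuinfo layout with a "flags" line per processor.

```python
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def linux_has_avx2(path: str = "/proc/cpuinfo") -> bool:
    """Read cpuinfo from disk; returns False if the file cannot be read."""
    try:
        return cpuinfo_has_avx2(Path(path).read_text())
    except OSError:
        return False


if __name__ == "__main__":
    print("AVX2 reported:", linux_has_avx2())
```

A False result on a non-Linux system only means the file was not readable, not that the processor lacks AVX2.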
Low-level interface
You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, you should pin to a specific version of simdjson. You may need this interface if you're dealing with odd JSON, such as a document with repeated, non-unique keys.
import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    iter = simdjson.Iterator(pj)
    # Descend into the root object, then print the string found there.
    if iter.is_object():
        if iter.down():
            print(iter.get_string())
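To see why repeated keys count as "odd JSON", here is a small sketch using the built-in json module instead of the simdjson Iterator. The ODD document is invented for illustration. Loading into a dict silently keeps only the last value for a repeated key, while walking the pairs in document order keeps all of them.

```python
import json

ODD = '{"id": 1, "tag": "a", "tag": "b"}'

# A dict can hold each key once, so the later "tag" overwrites the earlier one.
as_dict = json.loads(ODD)

# object_pairs_hook receives the (key, value) pairs in document order,
# so returning them as a list preserves the repeated key.
as_pairs = json.loads(ODD, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```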
Early Benchmark
Comparing the built-in json module loads on py3.7 to simdjson loads.
File | json time | pysimdjson time |
---|---|---|
jsonexamples/apache_builds.json | 0.09916733999999999 | 0.074089268 |
jsonexamples/canada.json | 5.305393378 | 1.6547515810000002 |
jsonexamples/citm_catalog.json | 1.3718639709999998 | 1.0438697340000003 |
jsonexamples/github_events.json | 0.04840242700000097 | 0.034239397999998644 |
jsonexamples/gsoc-2018.json | 1.5382746889999996 | 0.9597240750000005 |
jsonexamples/instruments.json | 0.24350973299999978 | 0.13639699600000021 |
jsonexamples/marine_ik.json | 4.505123285000002 | 2.8965093270000004 |
jsonexamples/mesh.json | 1.0325923849999974 | 0.38916503499999777 |
jsonexamples/mesh.pretty.json | 1.7129034710000006 | 0.46509220500000126 |
jsonexamples/numbers.json | 0.16577519699999854 | 0.04843887400000213 |
jsonexamples/random.json | 0.6930746310000018 | 0.6175370539999996 |
jsonexamples/twitter.json | 0.6069602610000011 | 0.41049074900000093 |
jsonexamples/twitterescaped.json | 0.7587005720000022 | 0.41576198399999953 |
jsonexamples/update-center.json | 0.5577604210000011 | 0.4961777420000004 |
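The numbers above do not say how they were measured. A comparison of this kind could be sketched as follows. The use of timeit, the repeat count, and the helper names time_loads and compare are all assumptions for illustration, not the script that produced the table. pysimdjson is treated as optional, so the sketch still runs where the bindings are not installed.

```python
import json
import timeit

try:
    import simdjson  # pysimdjson; optional for this sketch
except ImportError:
    simdjson = None


def time_loads(loads, data: bytes, number: int = 10) -> float:
    """Total seconds taken to call loads(data) `number` times."""
    return timeit.timeit(lambda: loads(data), number=number)


def compare(data: bytes, number: int = 10) -> dict:
    """Time json.loads and, if installed, simdjson.loads on the same bytes."""
    results = {"json": time_loads(json.loads, data, number)}
    if simdjson is not None:
        results["pysimdjson"] = time_loads(simdjson.loads, data, number)
    return results


if __name__ == "__main__":
    # Hypothetical input; substitute the bytes of a real file to compare.
    sample = json.dumps({"results": [{"username": "bob"}] * 1000}).encode()
    for name, seconds in compare(sample).items():
        print(f"{name}: {seconds:.6f}")
```

Timings from a sketch like this depend heavily on the machine, the document, and the repeat count, so they are not directly comparable to the table.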