simdjson bindings for python
pysimdjson
Quick-n'dirty Python bindings for simdjson, written to see whether going down this path yields parse-time improvements in real-world applications. So far the results are promising, especially when only part of a document is of interest.
Bindings are currently tested on OS X, Linux, and Windows.
See the latest documentation at http://pysimdjson.tkte.ch.
Installation
There are binary wheels available for some platforms. On other platforms you'll need a C++17-capable compiler.
pip install pysimdjson
Binary wheels are available for the following platforms (y = wheel available, x = no wheel):
Platform | py3.4 | py3.5 | py3.6 | py3.7 |
---|---|---|---|---|
OS X 10.12 | x | x | x | y |
Windows | x | x | y | y |
Linux | y | y | y | y |
or build from git:
git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install
Example
import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())
However, this doesn't gain you much over, say, ujson. You're still loading the entire document and converting all of it into Python objects, which is very expensive. You can instead use items() to pull only part of a document into Python.
Example document:
{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}
And now let's try some queries...
import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

pj.items('.type')                #> "search_results"
pj.items('.count')               #> 2
pj.items('.results[].username')  #> ["bob", "tod"]
pj.items('.error.message')       #> "All good captain"
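To make the query strings above concrete, here is a minimal pure-Python sketch of what each path selects. It is not how pysimdjson implements items(): the helper names `query` and `_walk` are hypothetical, and the sketch parses the whole document with the standard json module first, so it has none of the speed benefit. It only covers the two path forms used above, `.key` and `[]`.

```python
import json
import re

# A path token is either ".name" (an object key) or "[]" (every array element).
_TOKEN = re.compile(r'\.([^.\[\]]+)|(\[\])')


def query(value, path):
    """Resolve a path such as '.results[].username' against parsed JSON."""
    tokens = []
    pos = 0
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError('unsupported path syntax at offset %d' % pos)
        # group(1) is the key name, or None when the token is "[]".
        tokens.append(match.group(1))
        pos = match.end()
    return _walk(value, tokens)


def _walk(value, tokens):
    if not tokens:
        return value
    head, rest = tokens[0], tokens[1:]
    if head is None:
        # "[]": apply the remaining path to every element of the array.
        return [_walk(item, rest) for item in value]
    return _walk(value[head], rest)


SAMPLE = '''
{
    "type": "search_results",
    "count": 2,
    "results": [{"username": "bob"}, {"username": "tod"}],
    "error": {"message": "All good captain"}
}
'''

doc = json.loads(SAMPLE)
usernames = query(doc, '.results[].username')
```

Each `.key` step indexes into an object and each `[]` step fans out over an array, which is why `.results[].username` yields a list while `.type` yields a single value.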
AVX2
simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:
- OS X:
sysctl -a | grep machdep.cpu.leaf7_features
- Linux:
grep avx2 /proc/cpuinfo
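The same checks can be scripted. The sketch below is an illustration using only the standard library; the function names `cpuinfo_has_avx2` and `has_avx2` are hypothetical and are not part of pysimdjson. It assumes the Linux `flags` line format and the OS X `machdep.cpu.leaf7_features` key shown above, and returns None on platforms it does not know how to inspect.

```python
import platform
import subprocess


def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(':')
        if key.strip() == 'flags' and 'avx2' in value.split():
            return True
    return False


def has_avx2():
    """Best-effort AVX2 check: True, False, or None if the platform is unknown."""
    system = platform.system()
    if system == 'Linux':
        with open('/proc/cpuinfo') as fin:
            return cpuinfo_has_avx2(fin.read())
    if system == 'Darwin':
        # Same key as the sysctl command above; -n prints only the value.
        result = subprocess.run(
            ['sysctl', '-n', 'machdep.cpu.leaf7_features'],
            capture_output=True,
            text=True,
        )
        return 'AVX2' in result.stdout.split()
    return None
```

Calling `has_avx2()` before importing the bindings lets an application fall back to another JSON parser when the check returns False.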
Low-level interface
You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, you should pin to a specific version of simdjson. You may need this interface if you're dealing with odd JSON, such as a document with repeated, non-unique keys.
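import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    iterator = simdjson.Iterator(pj)
    # Only descend if the document root is a JSON object.
    if iterator.is_object():
        # down() moves into the object and returns a false value on failure.
        if iterator.down():
            print(iterator.get_string())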
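As a small illustration of why repeated keys call for something lower-level than a dict, the sketch below uses only the standard json module, not the simdjson Iterator. The sample document is made up for this example. Converting to a dict keeps one value per key, while `object_pairs_hook` exposes every pair in document order.

```python
import json

# A made-up document with a repeated key.
raw = '{"id": 1, "id": 2, "name": "bob"}'

# A dict can hold only one value per key, so the last "id" wins.
as_dict = json.loads(raw)

# object_pairs_hook receives every (key, value) pair in document order,
# so both "id" entries survive.
as_pairs = json.loads(raw, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```

Any interface that returns plain Python dicts has the same limitation, which is where walking the document element by element becomes useful.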
Early Benchmark
Comparing the built-in json module's loads on py3.7 to simdjson's loads.
File | json time | pysimdjson time |
---|---|---|
jsonexamples/apache_builds.json | 0.09916733999999999 | 0.074089268 |
jsonexamples/canada.json | 5.305393378 | 1.6547515810000002 |
jsonexamples/citm_catalog.json | 1.3718639709999998 | 1.0438697340000003 |
jsonexamples/github_events.json | 0.04840242700000097 | 0.034239397999998644 |
jsonexamples/gsoc-2018.json | 1.5382746889999996 | 0.9597240750000005 |
jsonexamples/instruments.json | 0.24350973299999978 | 0.13639699600000021 |
jsonexamples/marine_ik.json | 4.505123285000002 | 2.8965093270000004 |
jsonexamples/mesh.json | 1.0325923849999974 | 0.38916503499999777 |
jsonexamples/mesh.pretty.json | 1.7129034710000006 | 0.46509220500000126 |
jsonexamples/numbers.json | 0.16577519699999854 | 0.04843887400000213 |
jsonexamples/random.json | 0.6930746310000018 | 0.6175370539999996 |
jsonexamples/twitter.json | 0.6069602610000011 | 0.41049074900000093 |
jsonexamples/twitterescaped.json | 0.7587005720000022 | 0.41576198399999953 |
jsonexamples/update-center.json | 0.5577604210000011 | 0.4961777420000004 |
Getting subsets of the document is significantly faster. For canada.json, getting .type using the naive approach and the items() approach, averaged over N=100:
Python | Time |
---|---|
json.loads(canada_json)['type'] | 5.76244878 |
simdjson.loads(canada_json)['type'] | 1.5984486990000004 |
simdjson.ParsedJson(canada_json).items('.type') | 0.3949587819999998 |
This approach avoids creating Python objects for fields that aren't of interest. When you only care about a small part of the document, it should be considerably faster.
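A comparison like the one above can be reproduced with the standard timeit module. This is a minimal sketch, not the script used to produce the numbers in this document: the helper name `average_seconds` is hypothetical, and the payload is a small synthetic stand-in rather than canada.json. Only the built-in json module is timed here.

```python
import json
import timeit


def average_seconds(func, payload, repeat=100):
    """Average wall-clock seconds per call of func(payload) over `repeat` calls."""
    total = timeit.timeit(lambda: func(payload), number=repeat)
    return total / repeat


# Synthetic stand-in document: one small field next to a large array.
payload = json.dumps({
    'type': 'example',
    'features': [[i, i * 0.5] for i in range(1000)],
}).encode('utf-8')

# The "naive" approach: parse everything, then pick out a single field.
baseline = average_seconds(lambda data: json.loads(data)['type'], payload)
print('json.loads(...)["type"]: %.6f s per call' % baseline)
```

To time the other rows, pass a callable built on simdjson.loads or on simdjson.ParsedJson(...).items('.type') in place of the lambda, and read the real file in binary mode for the payload.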