json-stream-rs-tokenizer
A faster tokenizer for the json-stream Python library.
It's actually just json-stream's own tokenizer (itself adapted from the
NAYA project) ported to Rust almost verbatim and made available as a Python
module using PyO3.
On my machine, it speeds up parsing by a factor of 4–10, depending on the nature of the data.
Installation
pip install json-stream-rs-tokenizer
Note that in editable/develop installs, it will sometimes (?) compile the
Rust library in debug mode, which makes it run slower than the pure-Python
tokenizer. When in doubt, run installation commands with --verbose to see the
Rust compilation commands and verify that they used --release.
Usage
To use this package's RustTokenizer, simply pass it as the tokenizer argument
to json-stream's load or visit:
from io import StringIO
from json_stream import load
from json_stream_rs_tokenizer import RustTokenizer
json_buf = StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }')
# uses the Rust tokenizer to load JSON:
d = load(json_buf, tokenizer=RustTokenizer)
for k, l in d.items():
    print(f"{k}: {' '.join(str(n) for n in l)}")
As a perhaps slightly more convenient alternative, the package also provides
wrappers around json_stream's load and visit functions which do this for you:
from io import StringIO
from json_stream_rs_tokenizer import load
d = load(StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }'))
# ...
Limitations
- Arbitrary-size integers are not currently supported on PyPy or when the
  extension is built against Python's limited C API (Py_LIMITED_API). This is
  due to a limitation of PyO3's num-bigint extension. However, PyO3 PR #2626,
  which lifts the restriction for PyPy, has been merged into PyO3 main and is
  expected to make it into a release sooner or later.
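To judge whether this limitation matters for your data, it can help to check for integers that do not fit a fixed-size type. The sketch below uses only the standard library's json module and assumes a signed 64-bit range as the cutoff. That cutoff is an assumption for illustration, since the text above does not state the exact limit, and the helper name `fits_in_i64` is hypothetical.

```python
import json

# Assumed cutoff for illustration: the signed 64-bit integer range.
I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def fits_in_i64(n: int) -> bool:
    """Return True if n can be stored in a signed 64-bit integer."""
    return I64_MIN <= n <= I64_MAX


doc = '{"small": 9223372036854775807, "big": 9223372036854775808}'
values = json.loads(doc)  # the stdlib parser handles arbitrary-size ints

# Map each key to whether its integer value fits the assumed range.
report = {key: fits_in_i64(value) for key, value in values.items()}
print(report)
```

Values for which the check returns False are the ones that need arbitrary-size integer support.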
Benchmarks
The package comes with a script for rudimentary benchmarks on randomly
generated JSON data. To run it, you'll need to install the optional benchmark
dependencies and a version of json-stream with this patch applied:
pip install json_stream_rs_tokenizer[benchmark]
pip install --ignore-installed \
git+https://github.com/smheidrich/json-stream.git@util-to-convert-to-py-std-types
You can then run the benchmark as follows:
python -m json_stream_rs_tokenizer.benchmark
Run it with --help to see more information.
License
MIT license. Refer to the LICENSE file for details.