json-stream-rs-tokenizer
A faster tokenizer for the json-stream Python library.
It's actually just json-stream's own tokenizer (itself adapted from the NAYA project) ported to Rust almost verbatim and made available as a Python module using PyO3.
On my machine, it speeds up parsing by a factor of 4–10, depending on the nature of the data.
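Since the speedup depends on the data, it may be worth measuring on your own input. The following is only a rough sketch using the standard library: `drain` and `best_time` are hypothetical helper names, not part of this package. Because json-stream loads lazily, the stream has to be fully consumed for the timing to mean anything, which is what `drain` is for.

```python
import json
import timeit
from io import StringIO


def drain(node):
    """Walk a loaded document completely so that lazy streams are fully read."""
    if hasattr(node, "items"):
        # dict-like: visit every value in order
        for _, value in node.items():
            drain(value)
    elif not isinstance(node, (str, bytes)) and hasattr(node, "__iter__"):
        # list-like: visit every element in order
        for value in node:
            drain(value)


def best_time(load_fn, text, repeat=5):
    """Return the fastest of `repeat` runs (in seconds) of load_fn on text."""
    def run():
        drain(load_fn(StringIO(text)))
    return min(timeit.repeat(run, number=1, repeat=repeat))


# Stand-in baseline using the standard library's json.load:
sample = '{ "a": [1,2,3,4], "b": [5,6,7] }'
print(f"json.load: {best_time(json.load, sample):.6f}s")
```

To compare tokenizers, pass a function that calls json-stream's `load` without a `tokenizer` argument, then one that passes `tokenizer=RustTokenizer`, and divide the two results.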
Installation
pip install json-stream-rs-tokenizer
Note that in editable/develop installs, it will sometimes (?) compile the Rust library in debug mode, which makes it run slower than the pure-Python tokenizer. When in doubt, run installation commands with --verbose to see the Rust compilation commands and verify that they used --release.
Usage
To use this package's RustTokenizer, simply pass it as the tokenizer argument to json-stream's load or visit:
from io import StringIO
from json_stream import load
from json_stream_rs_tokenizer import RustTokenizer
json_buf = StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }')
# uses the Rust tokenizer to load JSON:
d = load(json_buf, tokenizer=RustTokenizer)
for k, l in d.items():
    print(f"{k}: {' '.join(str(n) for n in l)}")
As a perhaps slightly more convenient alternative, the package also provides wrappers around json_stream's load and visit functions which do this for you:
from io import StringIO
from json_stream_rs_tokenizer import load
d = load(StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }'))
# ...
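The examples above only show load. For visit, a sketch along the following lines should work, assuming json-stream calls the visitor with each value and its path as `(item, path)`. `make_recorder` and `collect_values` are hypothetical names used here for illustration. The third-party imports are deferred into the function so that the helper can be defined without either package installed.

```python
from io import StringIO


def make_recorder(store):
    """Build a visitor callback that appends (path, item) pairs to store."""
    def record(item, path):
        store.append((path, item))
    return record


def collect_values(text):
    """Visit a JSON document with the Rust tokenizer and return what was seen."""
    # Deferred imports: both packages are third-party and must be installed.
    from json_stream import visit
    from json_stream_rs_tokenizer import RustTokenizer

    seen = []
    visit(StringIO(text), make_recorder(seen), tokenizer=RustTokenizer)
    return seen
```

Calling `collect_values('{ "a": [1,2,3,4], "b": [5,6,7] }')` would then return the visited values together with their paths.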
Limitations
- Arbitrary-size integers are currently not supported on PyPy or when the extension is built against Python's limited C API (Py_LIMITED_API). This is due to a limitation of PyO3's num-bigint extension. However, PyO3 PR #2626, which lifts the restriction for PyPy, has been merged into PyO3 main and is expected to make it into a release sooner or later.
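To find out whether this limitation could affect your data, you can scan a document for very large integers with the standard library first. This is a sketch only: `has_big_int` is a hypothetical helper, and the signed 64-bit range used as the cutoff is an assumption made for illustration, not a limit documented by this package.

```python
import json

# Assumed cutoff (not documented by the package): signed 64-bit range.
I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def has_big_int(node):
    """Return True if any integer in the parsed JSON falls outside the assumed range."""
    if isinstance(node, bool):
        # bool is a subclass of int in Python; it is never a big integer
        return False
    if isinstance(node, int):
        return not (I64_MIN <= node <= I64_MAX)
    if isinstance(node, dict):
        return any(has_big_int(value) for value in node.values())
    if isinstance(node, list):
        return any(has_big_int(value) for value in node)
    return False


small = json.loads('{ "a": [1,2,3,4], "b": [5,6,7] }')
large = json.loads('{ "n": 123456789012345678901234567890 }')
print(has_big_int(small), has_big_int(large))
```

Note that this check loads the whole document into memory, so it is only suited to samples, not to the large inputs that streaming is meant for.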
Benchmarks
The package comes with a script for rudimentary benchmarks on randomly generated JSON data. To run it, you'll need to install the optional benchmark dependencies and a version of json-stream with this patch applied:
pip install json_stream_rs_tokenizer[benchmark]
pip install --ignore-installed \
git+https://github.com/smheidrich/json-stream.git@util-to-convert-to-py-std-types
You can then run the benchmark as follows:
python -m json_stream_rs_tokenizer.benchmark
Run it with --help to see more information.
License
MIT license. Refer to the LICENSE file for details.