
json-stream-rs-tokenizer


A faster tokenizer for the json-stream Python library.

It's actually just json-stream's own tokenizer (itself adapted from the NAYA project) ported to Rust almost verbatim and made available as a Python module using PyO3.

On my machine, it speeds up parsing by a factor of 4–10, depending on the nature of the data.

Installation

pip install json-stream-rs-tokenizer

This will install a prebuilt wheel if one is available for your platform; otherwise, it will try to build the package from the source distribution, which requires a Rust toolchain to be installed and available in order to succeed. Note that if the build fails, the package installation is still considered successful, but RustTokenizer (see below) won't be available for import. This is so that packages can depend on the library but fall back to their own implementation if neither a prebuilt wheel is available nor the build succeeds. Increase the installation command's verbosity with -v (repeat it for even more information, e.g. -vv) to see error messages when the build fails.
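
For example, a downstream package that wants to use RustTokenizer when it is available but fall back to json-stream's own pure-Python tokenizer otherwise could guard the import like this (a minimal sketch; the load_json helper is just for illustration):

from json_stream import load

try:
    from json_stream_rs_tokenizer import RustTokenizer
except ImportError:
    # the package isn't installed or its Rust extension failed to build
    RustTokenizer = None

def load_json(fp):
    # use the Rust tokenizer if it could be imported; otherwise let
    # json-stream use its default pure-Python tokenizer
    if RustTokenizer is not None:
        return load(fp, tokenizer=RustTokenizer)
    return load(fp)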

Note that editable/develop installs will sometimes (it is unclear exactly when) compile the Rust library in debug mode, which makes it run slower than the pure-Python tokenizer. When in doubt, run installation commands with --verbose to see the Rust compilation commands and verify that they used --release.

Usage

To use this package's RustTokenizer, simply pass it as the tokenizer argument to json-stream's load or visit:

from io import StringIO
from json_stream import load
from json_stream_rs_tokenizer import RustTokenizer

json_buf = StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }')

# uses the Rust tokenizer to load JSON:
d = load(json_buf, tokenizer=RustTokenizer)

for k, l in d.items():
  print(f"{k}: {' '.join(str(n) for n in l)}")

As a slightly more convenient alternative, the package also provides wrappers around json_stream's load and visit functions which do this for you, provided that json-stream is installed:

from io import StringIO

from json_stream_rs_tokenizer import load

d = load(StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }'))

# ...
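
The visit wrapper works the same way. A minimal sketch (assuming json-stream's usual visitor convention, where the callback receives each leaf value together with the tuple of keys/indices leading to it):

from io import StringIO

from json_stream_rs_tokenizer import visit

def print_leaf(item, path):
    # path is a tuple of keys/indices, e.g. ("a", 0) for the first
    # element of the list under "a"
    print(f"{path}: {item}")

visit(StringIO('{ "a": [1,2,3,4], "b": [5,6,7] }'), print_leaf)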

Limitations

  • For PyPy, the speedup is only 1.0-1.5x (much lower than that for CPython). This has yet to be investigated.
  • In builds that don't support PyO3's num-bigint extension (currently only PyPy builds and manual builds against Python's limited C API, Py_LIMITED_API), conversion of large integers is performed in Python rather than in Rust, at a very small runtime cost. Large integers are still parsed correctly either way (see the example below).
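
To illustrate the last point, here is a minimal sketch (not from the original documentation) showing that an integer well beyond 64 bits is parsed to an exact Python int on any build; only where the conversion happens differs:

from io import StringIO

from json_stream import load
from json_stream_rs_tokenizer import RustTokenizer

# an integer far beyond the 64-bit range; on builds without the
# num-bigint extension, the conversion to a Python int happens in
# Python rather than in Rust, but the result is identical
d = load(StringIO('{"n": 123456789012345678901234567890}'), tokenizer=RustTokenizer)
print(d["n"])  # 123456789012345678901234567890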

Benchmarks

The package comes with a script for rudimentary benchmarks on randomly generated JSON data. To run it, you'll need to install the optional benchmark dependencies and a version of json-stream with this patch applied:

pip install json_stream_rs_tokenizer[benchmark]
pip install --ignore-installed \
  git+https://github.com/smheidrich/json-stream.git@util-to-convert-to-py-std-types

You can then run the benchmark as follows:

python -m json_stream_rs_tokenizer.benchmark

Run it with --help to see more information.

License

MIT license. Refer to the LICENSE file for details.
