A high-performance streaming JSON query engine for out-of-memory files

crowley

A high-performance JSON query engine designed for fast starts, low and flat memory usage, and streaming over files too large to hold in memory.

It is primarily designed to substitute for ijson. If you're coming to crowley from ijson, see the IJSON Migration Guide.

crowley is written in Rust, with a SAX-style JSON event parser adapted from the json-event-parser crate and a regular expression query language adapted from the jsongrep crate.

Use cases

crowley is optimized for the following scenarios:

  • Queries over files too large to fit comfortably in memory. crowley streams through JSON files with bounded memory regardless of file size. A 37 GB file uses ~30 MB of RAM.
  • Queries on transient data. crowley quickly queries data which do not merit transformation into a more easily-queried structure such as a database or dataframe, because of time constraints or because the data is sensitive and cannot be loaded into an external application.
  • Queries over heterogeneous, deeply nested, and schemaless data which tools such as pandas, Polars, or DuckDB cannot ingest and transform. crowley's regular-language queries don't require schema inference.
  • Queries over many files in parallel. crowley natively supports searching over many files with the same query, using either a list of file paths or a pattern match. These files will be searched in parallel more quickly and with less memory overhead than ijson with a ProcessPool.

Usage

Single-file search

from crowley import Query

names = Query("data.json", "users[*].name")
ages = Query("data.json", "users[*].age")

names.count()       # 4
names.exists()      # True
names.values()      # ['Alice', 'Bob', 'Charlie', 'Diana']
ages.values()       # [30, 25, 35, 28]
names.agg("sum")    # nan
ages.agg("sum")     # 118.0
names.types()       # ['string']
ages.types()        # ['number']
names.mode()        # {'values': ['Alice', 'Bob', 'Charlie', 'Diana'], 'frequency': 1}
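
The contents of data.json are not shown here. As a rough guide to what each method reports, the following standard-library sketch works over a hypothetical document whose contents are assumed purely so that the results line up with the outputs above. It illustrates the meaning of the results only; it is not how crowley computes them.

```python
import json
from collections import Counter

# Hypothetical stand-in for data.json, chosen to match the outputs shown above.
doc = json.loads("""
{"users": [
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25},
  {"name": "Charlie", "age": 35},
  {"name": "Diana", "age": 28}
]}
""")

names = [user["name"] for user in doc["users"]]  # users[*].name
ages = [user["age"] for user in doc["users"]]    # users[*].age

count = len(names)            # number of matches
exists = count > 0            # at least one match
total_age = float(sum(ages))  # numeric aggregate over the matches

# Mode: every value tied for the highest frequency, plus that frequency.
freq = Counter(names)
top = max(freq.values())
mode = {"values": [v for v, n in freq.items() if n == top], "frequency": top}
```

Because every name occurs exactly once in this assumed document, all four names tie for the mode with a frequency of 1.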

Multi-file search

from crowley import Query

repo_names = Query("tests/github_daily_jsonl/2015*", "[*].repo.name")

repo_names.count() # [7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351, 11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473, 11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371]
repo_names.total_count() # 218939
repo_names.total_unique() # 65703
repo_names.mode()[0] # {'values': ['KenanSulayman/heartbeat'], 'frequency': 79}
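
To make the shape of these results concrete, here is a standard-library sketch of the same kind of summary over JSON Lines files: one count per file, a total count, a unique count, and the most frequent value. It assumes each line of a file is one record, which is how the sketch interprets [*] for JSON Lines input. The helper names, the sample data, and the use of a thread pool are all illustrative assumptions, not crowley's implementation.

```python
import glob
import json
import os
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def repo_names_in(path):
    """Stream one JSON Lines file and collect every repo.name value."""
    found = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            event = json.loads(line)  # only one parsed record is alive at a time
            repo = event.get("repo")
            if isinstance(repo, dict) and "name" in repo:
                found.append(repo["name"])
    return found


def summarize(pattern):
    """Per-file counts, total count, unique count, and most frequent value."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(repo_names_in, paths))
    freq = Counter(name for names in per_file for name in names)
    return {
        "count": [len(names) for names in per_file],
        "total_count": sum(freq.values()),
        "total_unique": len(freq),
        "top": freq.most_common(1),
    }


# Tiny hypothetical dataset so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "2015-01-01.json": ["a/x", "b/y", "a/x"],
        "2015-01-02.json": ["a/x", "c/z"],
    }
    for filename, repos in samples.items():
        with open(os.path.join(tmp, filename), "w", encoding="utf-8") as f:
            for repo in repos:
                f.write(json.dumps({"repo": {"name": repo}}) + "\n")
    summary = summarize(os.path.join(tmp, "2015*"))
```

Threads keep the sketch short; a pure-Python baseline would need a process pool to get CPU parallelism, and it would still collect the matched names in memory before summarizing them.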

Query language

The query language uses a regular-expression-inspired syntax for navigating JSON structure:

Query               Meaning
name                Field name in the root object
address.street      Field street inside address
users[*].name       name field of every element in the users array
*                   Any field in the root object
[*]                 Any element in the root array
users[0]            First element of users
users[1:3]          Elements at indices 1 and 2
(name | age)        Either name or age
(* | [*])*          Any value at any depth (recursive descent)
a?                  The value of a, if it exists
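
To illustrate what these operators select, the sketch below evaluates hand-built query steps against data that is already loaded in memory. The step tuples, the select function, and the sample document are hypothetical; crowley itself parses the query string and matches it against a stream of parser events, so this models the semantics only. Optional steps (a?) are not modeled, users[0] can be written as the slice 0:1, and alternatives are yielded in the order they are written rather than in document order.

```python
def children_of(value):
    """Direct children of a JSON value: object values or array elements."""
    if isinstance(value, dict):
        return list(value.values())
    if isinstance(value, list):
        return value
    return []


def select(value, steps):
    """Yield every value reached by following `steps` (a list) from `value`."""
    if not steps:
        yield value
        return
    step, rest = steps[0], steps[1:]
    kind = step[0]
    if kind == "field":                   # name
        if isinstance(value, dict) and step[1] in value:
            yield from select(value[step[1]], rest)
    elif kind == "any_field":             # *
        if isinstance(value, dict):
            for child in value.values():
                yield from select(child, rest)
    elif kind == "any_index":             # [*]
        if isinstance(value, list):
            for child in value:
                yield from select(child, rest)
    elif kind == "slice":                 # [start:stop]
        if isinstance(value, list):
            for child in value[step[1]:step[2]]:
                yield from select(child, rest)
    elif kind == "either":                # (a | b)
        for alternative in step[1]:
            yield from select(value, [alternative] + rest)
    elif kind == "descend":               # (* | [*])*
        yield from select(value, rest)    # zero levels down
        for child in children_of(value):  # one or more levels down
            yield from select(child, steps)


doc = {
    "users": [
        {"name": "Alice", "age": 30, "address": {"street": "1 Main St"}},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35},
    ]
}

all_names = list(select(doc, [("field", "users"), ("any_index",), ("field", "name")]))
sliced = list(select(doc, [("field", "users"), ("slice", 1, 3), ("field", "name")]))
either = list(select(doc["users"][0], [("either", [("field", "name"), ("field", "age")])]))
any_depth = list(select(doc, [("descend",), ("field", "street")]))
```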

Performance

Benchmarks measured on a Mac M3 Max with 32GB of RAM:

File: Flat GitHub log data, 34GB
Query: [*].repo.name

Count matches:
    crowley: 71.6s
    ijson: 128.8s
    Difference: 1.8x

Return matches:
    crowley: 116.0s
    ijson: 126.1s
    Difference: 1.09x

Return unique values:
    crowley: 125.7s
    ijson: 129.5s
    Difference: 1.03x

Return unique count:
    crowley: 122.1s
    ijson: 129.5s
    Difference: 1.06x

File: Nested GeoJSON, 30MB
Query: features[*].properties.name

Count matches:
    crowley: 138.44ms
    ijson: 421.85ms
    Difference: 3.0x

Existence check (true):
    crowley: 16µs
    ijson: 793µs
    Difference: 49x

Query: features[*].properties.scalerank

Sum matches:
    crowley: 184.88ms
    ijson: 425.89ms
    Difference: 2.3x

Query: features[*].properties.nonexistent

Existence check (false):
    crowley: 138.9ms
    ijson: 409.7ms
    Difference: 2.9x

On queries where the objective is to return values, crowley outperforms ijson by 3-10%. Where only a measure such as a count or an aggregate sum is returned, crowley can often outperform ijson by 2-3x, because it avoids materializing values unnecessarily.
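
The "Difference" figures in the listings above are consistent with the ijson time divided by the crowley time. The short check below recomputes a few of them from the reported timings; the function name is illustrative.

```python
def speedup(crowley_seconds, ijson_seconds):
    """How many times faster crowley is: ijson time divided by crowley time."""
    return ijson_seconds / crowley_seconds


# Timings copied from the benchmark listings above.
count_34gb = speedup(71.6, 128.8)                # count matches, 34GB log file
return_34gb = speedup(116.0, 126.1)              # return matches, 34GB log file
count_geojson = speedup(138.44e-3, 421.85e-3)    # count matches, 30MB GeoJSON

# Percentage form used in the summary: how much longer ijson takes.
return_34gb_percent = (return_34gb - 1) * 100
```

Rounded, these give 1.8x, 1.09x, and 3.0x, matching the listed values; the 1.09x case corresponds to ijson taking a little under 9% longer.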

But the real benefit comes from crowley's more expressive query language, which can efficiently express what would otherwise require Python loops around ijson.

It can extract multiple fields through disjunctions (at one or multiple levels) in a single pass without having to materialize the parent object:

import crowley
import ijson

# file_str is the path to the 30MB nested GeoJSON file benchmarked above

# get the number of matching objects
# 133.6ms
crowley.Query(file_str, "features[*].properties.(name | admin)").count()

# get the unique matching values
# 144.2ms
crowley.Query(file_str, "features[*].properties.(name | admin)").unique_values()

# get the number of matching objects
# 851.6ms
def ijson_two_passes():
    with open(file_str, "rb") as f:
        count1 = sum(1 for _ in ijson.items(f, "features.item.properties.name"))
    with open(file_str, "rb") as f:
        count2 = sum(1 for _ in ijson.items(f, "features.item.properties.admin"))
    return count1 + count2
ijson_two_passes()

# get the unique matching values
# 430ms
def ijson_two_fields():
    names = set()
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            if "name" in obj:
                names.add(obj["name"])
            if "admin" in obj:
                names.add(obj["admin"])
    return names
ijson_two_fields()

It can extract all property values without internal iteration:

# get the number of all matching property values by query
# 133.9ms
crowley.Query(file_str, "features[*].properties.*").count()

# get the number of all matching properties by internal iteration
# 427.9ms
def ijson_all_props():
    count = 0
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            count += len(obj)
    return count
ijson_all_props()

It can select ranges of array elements without manual index checking:

  • Note: this is one of the few places where crowley can be slower. If the array range is not at the root level, ijson with a Python break can stop sooner, because crowley must continue parsing the outer structure. For root-level array ranges, crowley remains faster. Applying the ijson approach to crowley (checking values manually and breaking out) makes crowley slower still.

Root-level array (github_array.json):
    crowley [0:3]: 22µs (crowley terminates early more quickly)
    ijson [0:3]+break: 234µs
    Difference: 10.6x

    crowley [97:102]: 464µs (crowley terminates early more quickly)
    ijson [97:102]+break: 923µs
    Difference: 1.98x

    crowley [*] (full): 49.4ms
    crowley [*]+break: 60.9ms

Nested array (ne_10m.json):
    crowley [0:3]: 131.4ms
    ijson [0:3]+break: 847µs (ijson is able to short-circuit faster!)
    Difference: 0.006x

    crowley [97:102]: 133.8ms
    ijson [97:102]+break: 11.5ms (ijson is able to short-circuit faster!)
    Difference: 0.086x

# start of array
crowley.Query(file_str, "features[0:3].properties.name", no_seek=True).values()

# middle of array
crowley.Query(file_str, "features[97:102].properties.name", no_seek=True).values()

def ijson_range_start():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if i < 3:
                result.append(name)
            else:
                break
    return result
ijson_range_start()

def ijson_range_mid():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if 97 <= i < 102:
                result.append(name)
            if i >= 101:
                break
    return result
ijson_range_mid()
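
The manual enumerate-and-break pattern in the ijson helpers above can also be written with itertools.islice, which stops pulling from a lazy iterator as soon as the requested window has been read. The sketch below uses a hypothetical stand-in generator instead of a real parser so that it runs without any data file; with ijson, the iterator returned by ijson.items(...) would take its place, consumed while the file is still open.

```python
from itertools import islice


def numbered_items(n, consumed):
    """Stand-in for a streaming parser: yields lazily and records how far it got."""
    for i in range(n):
        consumed.append(i)
        yield f"item-{i}"


consumed = []
# Equivalent of the [97:102] range: skip 97 items, take 5, then stop reading.
window = list(islice(numbered_items(1_000_000, consumed), 97, 102))
```

Only the first 102 items are ever produced, which is the same short-circuit the break statement achieves.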

It can even descend recursively, which ijson cannot do. The equivalent requires a non-streaming solution such as the standard json module, which loads the whole file into memory.

# get unique values of 'type' at any depth 
# 221.8ms : ['FeatureCollection', 'name', 'Feature', 'Polygon']
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).unique_values()

# get count of all matching objects at all depths
# 156.7ms : 17090
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).count()

# walk the entire json tree manually looking for matching keys
# 509.8ms
import json
def json_recursive_search(key):
    with open(file_str) as f:
        data = json.load(f)

    results = []
    def walk(obj):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    results.append(v)
                walk(v)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)
    walk(data)
    return results

values = json_recursive_search("type")
unique = set(str(x) for x in values)

Cold vs Hot Start

On cold starts (first query, no prior loading), crowley is 2-3x faster than pandas, 3-7x faster than DuckDB, and handles files that make Polars fail entirely due to schema inference errors.

On subsequent calls, methods such as count() or exists() return their pre-computed answer in O(1) with zero file I/O. Other methods like types() and agg() will determine whether reading only matched byte positions will be faster than a full sequential scan.
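
The general idea of answering later calls from cached match positions can be sketched with a toy class over a JSON Lines file. Everything here (the class, its attributes, the line-based format) is a hypothetical model meant to show the cold-scan and hot-call split described above; it makes no claim about how crowley stores or uses its offsets.

```python
import json
import os
import tempfile


class CachedLineQuery:
    """Toy model: one cold scan records match offsets, later calls reuse them."""

    def __init__(self, path, field):
        self.path = path
        self.field = field
        self.offsets = None  # byte offset of each matching line, set by the first scan
        self.scans = 0       # number of full sequential scans performed

    def _scan(self):
        self.offsets = []
        self.scans += 1
        with open(self.path, "rb") as f:
            while True:
                start = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip() and self.field in json.loads(line):
                    self.offsets.append(start)

    def count(self):
        if self.offsets is None:
            self._scan()          # cold: one sequential pass over the file
        return len(self.offsets)  # hot: answered from the cache, no file I/O

    def values(self):
        if self.offsets is None:
            self._scan()
        out = []
        with open(self.path, "rb") as f:
            for start in self.offsets:  # hot: seek straight to each match
                f.seek(start)
                out.append(json.loads(f.readline())[self.field])
        return out

    def clear_cache(self):
        self.offsets = None  # the next call scans the file again


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"name": "Alice"}\n{"id": 7}\n{"name": "Bob"}\n')
    query = CachedLineQuery(path, "name")
    first = query.count()   # triggers the scan
    second = query.count()  # served from cached offsets
    names = query.values()  # reads only the matching lines
    scans = query.scans
```

In this model the offsets list is exactly the memory that grows with the number of matches, which is the trade-off the next paragraph describes.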

However, on very large files with a large volume of matches, the cached byte offsets for matches can considerably exceed the memory usage from streaming itself, and these offsets remain in the Query object until it is dropped. The query's cache can be manually cleared with .clear_cache(), and cache accumulation can be deactivated at query creation with the no_seek=True kwarg. This can be configured globally with crowley.configure(no_seek=True).
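
A back-of-the-envelope estimate shows when cached offsets start to matter. The per-match cost is not documented here, so the sketch assumes one unsigned 64-bit offset (8 bytes) per match and compares it with the ~30 MB streaming figure quoted under "Use cases"; real overhead may well be higher.

```python
BYTES_PER_OFFSET = 8             # assumption: one 64-bit byte offset per match
STREAMING_BUDGET = 30 * 10**6    # the ~30 MB streaming figure quoted earlier


def offset_cache_bytes(matches, bytes_per_offset=BYTES_PER_OFFSET):
    """Estimated memory held by cached match offsets."""
    return matches * bytes_per_offset


# Number of matches at which the cache alone equals the streaming footprint.
break_even_matches = STREAMING_BUDGET // BYTES_PER_OFFSET

# The multi-file example above reported 218939 total matches.
multi_file_example = offset_cache_bytes(218_939)
```

Under this assumption the cache overtakes streaming memory at about 3.75 million matches, while the multi-file example would hold under 2 MB of offsets. Queries expected to match far more than that are the ones where no_seek=True or .clear_cache() is worth considering.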

Acknowledgments

Built on the DFA-based query engine from jsongrep by Micah Kepe, and the SAX parser from json-event-parser by the Oxigraph project.

This project benefits not only from the work of other developers, but also from their choice to make their source code public and freely reusable under the MIT and Apache 2.0 licenses.

Download files

Source distribution

pycrowley-0.1.0.tar.gz (93.7 kB)

Built distributions

pycrowley-0.1.0-cp314-cp314-manylinux_2_28_x86_64.whl (584.2 kB): CPython 3.14, manylinux (glibc 2.28+), x86-64
pycrowley-0.1.0-cp313-cp313-manylinux_2_28_x86_64.whl (585.1 kB): CPython 3.13, manylinux (glibc 2.28+), x86-64
pycrowley-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (585.3 kB): CPython 3.12, manylinux (glibc 2.28+), x86-64
pycrowley-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl (587.2 kB): CPython 3.11, manylinux (glibc 2.28+), x86-64
pycrowley-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (587.2 kB): CPython 3.10, manylinux (glibc 2.28+), x86-64
