A high-performance streaming JSON query engine for out-of-memory files

crowley

A high-performance JSON querying engine designed for fast starts, low and flat memory usage, and streaming over files too large to fit in memory.

It is primarily designed to substitute for ijson. If you're coming to crowley from ijson, see the IJSON Migration Guide.

crowley is written in Rust, with a SAX-style JSON event parser adapted from the json-event-parser crate and a regular-expression query language adapted from the jsongrep crate.

Use cases

crowley is optimized for the following scenarios:

Queries over files too large to fit comfortably in memory. crowley streams through JSON files with bounded memory regardless of file size. A 37 GB file uses ~30 MB of RAM.
Queries on transient data. crowley quickly queries data which do not merit transformation into a more easily-queried structure such as a database or dataframe, because of time constraints or because the data is sensitive and cannot be loaded into an external application.
Queries over heterogeneous, deeply-nested, and schemaless data which tools such as pandas, polars, or duckdb cannot ingest and transform. crowley's regular-language queries don't require schema inference.
Queries over many files in parallel. crowley natively supports searching over many files with the same query, using either a list of file paths or a pattern match. These files will be searched in parallel more quickly and with less memory overhead than ijson with a ProcessPool.

Usage

Single-file search

from crowley import Query

names = Query("data.json", "users[*].name")
ages = Query("data.json", "users[*].age")

names.count()       # 4
names.exists()      # True
names.values()      # ['Alice', 'Bob', 'Charlie', 'Diana']
ages.values()       # [30, 25, 35, 28]
names.agg("sum")    # nan
ages.agg("sum")     # 118.0
names.types()       # ['string']
ages.types()        # ['number']
names.mode()        # {'values': ['Alice', 'Bob', 'Charlie', 'Diana'], 'frequency': 1}

Multi-file search

from crowley import Query

repo_names = Query("tests/github_daily_jsonl/2015*", "[*].repo.name")

repo_names.count() # [7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351, 11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473, 11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371]
repo_names.total_count() # 218939
repo_names.total_unique() # 65703
repo_names.mode()[0] # {'values': ['KenanSulayman/heartbeat'], 'frequency': 79}
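
As a quick consistency check on the figures above, the per-file list returned by count() sums to the value reported by total_count(). The snippet below only reuses the numbers from the example output and does not call crowley; reading the list as one entry per matched file is an assumption based on that output.

```python
# Per-file counts copied from the example output of repo_names.count() above.
per_file_counts = [
    7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351,
    11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473,
    11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371,
]

files_searched = len(per_file_counts)  # assumed: one entry per file matched by the pattern
total = sum(per_file_counts)           # agrees with the total_count() output above

print(files_searched, total)  # 24 218939
```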

Query language

The query language uses a regular-expression-inspired syntax for navigating JSON structure:

Query	Meaning
`name`	Field `name` in the root object
`address.street`	Field `street` inside `address`
`users[*].name`	`name` field of every element in `users` array
`*`	Any field in the root object
`[*]`	Any element in the root array
`users[0]`	First element of `users`
`users[1:3]`	Elements at indices 1 and 2
`(name | age)`	Either `name` or `age`
`(* | [*])*`	Any value at any depth (recursive descent)
`a?`	Returns the value of `a` if it exists
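
To make the table concrete, here is a rough sketch of what several of these queries select, written as plain Python over an already-loaded document. The sample document is hypothetical (shaped like the data.json in the Usage section), and this only illustrates the meaning of each query, not how crowley evaluates it; crowley streams the file rather than loading it.

```python
import json

# Hypothetical sample document, shaped like the data.json used in the Usage section.
doc = json.loads("""
{
  "users": [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
    {"name": "Diana", "age": 28}
  ]
}
""")

# users[*].name -> the name field of every element of users
names = [user["name"] for user in doc["users"]]

# users[0] -> first element of users
first_user = doc["users"][0]

# users[1:3] -> elements at indices 1 and 2 (end index exclusive, like a Python slice)
middle_users = doc["users"][1:3]

# users[*].(name | age) -> either field of every element (collected here in document order)
names_or_ages = [
    user[key] for user in doc["users"] for key in ("name", "age") if key in user
]

# * -> any field in the root object
root_values = list(doc.values())

print(names)         # ['Alice', 'Bob', 'Charlie', 'Diana']
print(middle_users)  # [{'name': 'Bob', 'age': 25}, {'name': 'Charlie', 'age': 35}]
```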

Performance

Benchmarks measured on a Mac M3 Max with 32GB of RAM:

File: Flat GitHub log data, 34GB
Query: [*].repo.name

Count matches:
    crowley: 71.6s
    ijson: 128.8s
    Difference: 1.8x

Return matches:
    crowley: 116.0s
    ijson: 126.1s
    Difference: 1.09x

Return unique values:
    crowley: 125.7s
    ijson: 129.5s
    Difference: 1.03x

Return unique count:
    crowley: 122.1s
    ijson: 129.5s
    Difference: 1.06x

File: Nested GeoJSON, 30MB
Query: features[*].properties.name

Count matches:
    crowley: 138.44ms
    ijson: 421.85ms
    Difference: 3.0x

Existence check (true):
    crowley: 16µs
    ijson: 793µs
    Difference: 49x

Query: features[*].properties.scalerank

Sum matches:
    crowley: 184.88ms
    ijson: 425.89ms
    Difference: 2.3x

Query: features[*].properties.nonexistent

Existence check (false):
    crowley: 138.9ms
    ijson: 409.7ms
    Difference: 2.9x
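
Each "Difference" figure above appears to be the ijson time divided by the crowley time. The short check below recomputes a few of them from the reported timings; the dictionary labels and the speedup helper are illustrative names, not part of crowley.

```python
# Timings copied from the benchmarks above, as (crowley, ijson) pairs in seconds.
timings = {
    "34GB count matches": (71.6, 128.8),
    "34GB return matches": (116.0, 126.1),
    "30MB count matches": (0.13844, 0.42185),
    "30MB sum matches": (0.18488, 0.42589),
    "30MB existence check (false)": (0.1389, 0.4097),
}


def speedup(crowley_s, ijson_s):
    """How many times faster crowley was: ijson time divided by crowley time."""
    return ijson_s / crowley_s


ratios = {name: round(speedup(*pair), 2) for name, pair in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```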

On queries where the objective is to return values, crowley outperforms ijson by 3-10%. Where a summary such as a count or an aggregate sum is returned, crowley often outperforms ijson by 2-3x, because it avoids materializing values unnecessarily.

But the real benefit comes from crowley's more expressive query language, which can efficiently express what would otherwise require Python loops around ijson.

It can extract multiple fields through disjunctions (at one or multiple levels) in a single pass without having to materialize the parent object:

import crowley
import ijson

# file_str holds the path to the GeoJSON file benchmarked above

# get the number of matching values
# 133.6ms
crowley.Query(file_str, "features[*].properties.(name | admin)").count()

# get the unique matching values
# 144.2ms
crowley.Query(file_str, "features[*].properties.(name | admin)").unique_values()

# get the number of matching objects
# 851.6ms
def ijson_two_passes():
    with open(file_str, "rb") as f:
        count1 = sum(1 for _ in ijson.items(f, "features.item.properties.name"))
    with open(file_str, "rb") as f:
        count2 = sum(1 for _ in ijson.items(f, "features.item.properties.admin"))
    return count1 + count2
ijson_two_passes()

# get the unique matching values
# 430ms
def ijson_two_fields():
    names = set()
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            if "name" in obj:
                names.add(obj["name"])
            if "admin" in obj:
                names.add(obj["admin"])
    return names
ijson_two_fields()

It can extract all property values without internal iteration:

# get the number of all matching property values by query
# 133.9ms
crowley.Query(file_str, "features[*].properties.*").count()

# get the number of all matching properties by internal iteration
# 427.9ms
def ijson_all_props():
    count = 0
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            count += len(obj)
    return count
ijson_all_props()

It can select ranges of array elements without manual index checking:

Note: this is one of the few places where crowley can be slower. If the array range is not at the root level, ijson with Python break logic can stop sooner, while crowley must continue parsing the outer structure. For root-level array ranges, crowley remains faster. Applying the ijson approach to crowley (iterating over all matches, checking indices manually, and breaking out) makes crowley slower still.

Root-level array (github_array.json):
    crowley [0:3]: 22µs (crowley terminates early more quickly)
    ijson [0:3]+break: 234µs
    Difference: 10.6x

    crowley [97:102]: 464µs (crowley terminates early more quickly)
    ijson [97:102]+break: 923µs
    Difference: 1.98x

    crowley [*] (full): 49.4ms
    crowley [*]+break: 60.9ms

Nested array (ne_10m.json):
    crowley [0:3]: 131.4ms
    ijson [0:3]+break: 847µs (ijson is able to short-circuit faster!)
    Difference: 0.006x

    crowley [97:102]: 133.8ms
    ijson [97:102]+break: 11.5ms (ijson is able to short-circuit faster!)
    Difference: 0.086x

# start of array
crowley.Query(file_str, "features[0:3].properties.name", no_seek=True).values()

# middle of array
crowley.Query(file_str, "features[97:102].properties.name", no_seek=True).values()

def ijson_range_start():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if i < 3:
                result.append(name)
            else:
                break
    return result
ijson_range_start()

def ijson_range_mid():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if 97 <= i < 102:
                result.append(name)
            if i >= 101:
                break
    return result
ijson_range_mid()
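
The manual index checks in the ijson functions above can also be written with itertools.islice, which stops pulling from an iterator once the requested range is exhausted. The sketch below uses a hypothetical stand-in generator (names_stream) instead of ijson.items so that it runs without a data file; it only illustrates the early-termination pattern, and says nothing about the timings above.

```python
from itertools import islice


def names_stream(total, log):
    """Stand-in for a streaming parser: yields one value at a time.

    `log` records every index that was actually pulled from the stream.
    """
    for i in range(total):
        log.append(i)
        yield f"name-{i}"


consumed = []
# Equivalent of the `97 <= i < 102` check plus `break`: islice skips the first
# 97 items, yields the next 5, and then stops pulling from the generator.
middle = list(islice(names_stream(1_000, consumed), 97, 102))

print(middle)         # ['name-97', 'name-98', 'name-99', 'name-100', 'name-101']
print(len(consumed))  # 102
```

Only 102 of the 1,000 items are ever produced, which is the same short-circuit the break statements achieve by hand.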

It can even descend recursively, which ijson's path-based queries cannot express. The comparison below falls back to a non-streaming solution, the json module, which loads the whole file into memory.

# get unique values of 'type' at any depth 
# 221.8ms : ['FeatureCollection', 'name', 'Feature', 'Polygon']
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).unique_values()

# get count of all matching objects at all depths
# 156.7ms : 17090
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).count()

# walk the entire json tree manually looking for matching keys
# 509.8ms
import json
def json_recursive_search(key):
    with open(file_str) as f:
        data = json.load(f)

    results = []
    def walk(obj):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    results.append(v)
                walk(v)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)
    walk(data)
    return results

values = json_recursive_search("type")
unique = set(str(x) for x in values)

Cold vs Hot Start

On cold starts (first query, no prior loading), crowley is 2-3x faster than pandas, 3-7x faster than DuckDB, and handles files that make Polars fail entirely due to schema inference errors.

On subsequent calls, methods such as count() or exists() return their pre-computed answer in O(1) with zero file I/O. Other methods like types() and agg() will determine whether reading only matched byte positions will be faster than a full sequential scan.

However, on very large files with a large volume of matches, the cached byte offsets for matches can considerably exceed the memory usage from streaming itself, and these offsets remain in the Query object until it is dropped. The query's cache can be manually cleared with .clear_cache(), and cache accumulation can be deactivated at query creation with the no_seek=True kwarg. This can be configured globally with crowley.configure(no_seek=True).
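
The trade-off described above can be illustrated with a toy model. The class below is hypothetical and is not crowley's implementation: it works on a small JSON Lines file, remembers the byte offsets of matching lines after the first scan, answers later count() calls from the cache, and lets no_seek=True or clear_cache() give that memory back at the cost of rescanning. All names other than no_seek and clear_cache are invented for the illustration.

```python
import json
import os
import tempfile


class OffsetCachingQuery:
    """Toy model (hypothetical): finds records in a JSON Lines file that have `key`."""

    def __init__(self, path, key, no_seek=False):
        self.path = path
        self.key = key
        self.no_seek = no_seek  # True: never keep byte offsets between calls
        self._count = None      # cached answer for count()
        self._offsets = None    # cached byte offsets of matching lines
        self.scans = 0          # full sequential passes made so far

    def _scan(self):
        """One sequential pass over the file; returns the matched values."""
        values, offsets, pos = [], [], 0
        with open(self.path, "rb") as f:
            for line in f:
                record = json.loads(line)
                if self.key in record:
                    values.append(record[self.key])
                    offsets.append(pos)
                pos += len(line)
        self.scans += 1
        self._count = len(values)
        self._offsets = None if self.no_seek else offsets
        return values

    def count(self):
        if self._count is None:
            self._scan()
        return self._count  # later calls return the cached number, no file I/O

    def values(self):
        if self._offsets is None:
            return self._scan()  # no offsets cached: full sequential scan
        out = []
        with open(self.path, "rb") as f:
            for pos in self._offsets:  # read only the matched positions
                f.seek(pos)
                out.append(json.loads(f.readline())[self.key])
        return out

    def clear_cache(self):
        self._count = None
        self._offsets = None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.jsonl")
    with open(path, "w") as f:
        for i in range(5):
            record = {"id": i}
            if i % 2 == 0:
                record["repo"] = f"repo-{i}"
            f.write(json.dumps(record) + "\n")

    cached = OffsetCachingQuery(path, "repo")
    first_count = cached.count()     # cold: one sequential scan
    second_count = cached.count()    # hot: answered from the cache
    cached_values = cached.values()  # hot: seeks to the cached offsets
    cached_scans = cached.scans

    streaming = OffsetCachingQuery(path, "repo", no_seek=True)
    streaming.count()
    streaming_values = streaming.values()  # offsets were not kept: scans again
    streaming_scans = streaming.scans

print(cached_scans, streaming_scans)  # 1 2
```

In this toy, the cached query touches the file sequentially once, while the no_seek query rescans on every call but holds no offset list in memory, which mirrors the memory-versus-speed choice described above.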

Acknowledgments

Built on the DFA-based query engine from jsongrep by Micah Kepe, and the SAX parser from json-event-parser by the Oxigraph project.

This project benefits not only from the work of other developers, but also from their choice to make their source code public and freely reusable under the MIT and Apache 2.0 licenses.

This version

0.1.0

Apr 4, 2026

Download files

Source Distribution

pycrowley-0.1.0.tar.gz (93.7 kB view details)

Uploaded Apr 4, 2026 Source

Built Distributions

pycrowley-0.1.0-cp314-cp314-manylinux_2_28_x86_64.whl (584.2 kB view details)

Uploaded Apr 4, 2026 CPython 3.14, manylinux: glibc 2.28+ x86-64

pycrowley-0.1.0-cp313-cp313-manylinux_2_28_x86_64.whl (585.1 kB view details)

Uploaded Apr 4, 2026 CPython 3.13, manylinux: glibc 2.28+ x86-64

pycrowley-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (585.3 kB view details)

Uploaded Apr 4, 2026 CPython 3.12, manylinux: glibc 2.28+ x86-64

pycrowley-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl (587.2 kB view details)

Uploaded Apr 4, 2026 CPython 3.11, manylinux: glibc 2.28+ x86-64

pycrowley-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (587.2 kB view details)

Uploaded Apr 4, 2026 CPython 3.10, manylinux: glibc 2.28+ x86-64

File details

Details for the file pycrowley-0.1.0.tar.gz.

File metadata

Download URL: pycrowley-0.1.0.tar.gz
Upload date: Apr 4, 2026
Size: 93.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for pycrowley-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`32d58afc74336025194f991cfc70af8a7512c352752a2c8b23ff401650da7f69`
MD5	`0aa213c3819a610b4d504ca949518250`
BLAKE2b-256	`d6226c977c579cc609a7c44928c8a34761568e33761efc7667fdcdb7c67f6d28`

See more details on using hashes here.

File details

Details for the file pycrowley-0.1.0-cp314-cp314-manylinux_2_28_x86_64.whl.

File metadata

Download URL: pycrowley-0.1.0-cp314-cp314-manylinux_2_28_x86_64.whl
Upload date: Apr 4, 2026
Size: 584.2 kB
Tags: CPython 3.14, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for pycrowley-0.1.0-cp314-cp314-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`b4feaedc6f5a4ee8331ec807e8bf6c7ea3d32756fe4e3a2eb5634062935fa971`
MD5	`d9dd4495d5f8d811a4fc220239e2d0ac`
BLAKE2b-256	`48c4858b8a2b1244d318f8efa3eb27a5d8837e4a7ba10842be40e17caf2bb8d5`

See more details on using hashes here.

File details

Details for the file pycrowley-0.1.0-cp313-cp313-manylinux_2_28_x86_64.whl.

File metadata

Download URL: pycrowley-0.1.0-cp313-cp313-manylinux_2_28_x86_64.whl
Upload date: Apr 4, 2026
Size: 585.1 kB
Tags: CPython 3.13, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for pycrowley-0.1.0-cp313-cp313-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`2851dbecd1dd15044cfee75f52630eb8036585ce4862e1ca7f9c98e8f0c7eb09`
MD5	`774f252fb13ea2882d6ec227df563d85`
BLAKE2b-256	`6d1eeb2ccf47849dcbd9ea5b22436e8662701ef96343bbc958639a6f9ce9af14`

See more details on using hashes here.

File details

Details for the file pycrowley-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.

File metadata

Download URL: pycrowley-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl
Upload date: Apr 4, 2026
Size: 585.3 kB
Tags: CPython 3.12, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for pycrowley-0.1.0-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`bc47e5e0c79417e3675a7f85d6cff7b1371631cff12f170b48c4150dda4501e4`
MD5	`2bf6c9d4b9b2fdfc6349dc7575467eef`
BLAKE2b-256	`c3db8f28484263af5c8c112d05085a3024a080550c3f1f3482eb625251421fa0`

See more details on using hashes here.

File details

Details for the file pycrowley-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

Download URL: pycrowley-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl
Upload date: Apr 4, 2026
Size: 587.2 kB
Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for pycrowley-0.1.0-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`27e4a56e2eeb17b48502d288ab9acf922ec6275bc76a54ff384037626ca3509c`
MD5	`60bb27e7d7628a748807c034ae9139b5`
BLAKE2b-256	`13abf9fbf55bfa28e9639769d8df436f766aa1bb85fec151230abf9f3863f2e0`

See more details on using hashes here.

File details

Details for the file pycrowley-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

Download URL: pycrowley-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl
Upload date: Apr 4, 2026
Size: 587.2 kB
Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for pycrowley-0.1.0-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`9ebfc1c5b65c8e90e83d9e5b161352071229841676615c6e6a103c1286afcb22`
MD5	`3c823b86f0e46e7577c2e12b4c27fd80`
BLAKE2b-256	`58a431dcac4b3ac8e30f73fe7d289af999af7be41a317413e5b699d6b05af772`

See more details on using hashes here.
