A high-performance streaming JSON query engine for out-of-memory files
crowley
A high-performance JSON querying engine designed for fast starts, low and flat memory usage, and streaming over files too large to fit in memory.
It is primarily designed to substitute for ijson. If you're coming to crowley from ijson, see the IJSON Migration Guide.
Written in Rust, with a SAX-style JSON event parser adapted from the json-event-parser crate and a regular expression query language adapted from the jsongrep crate.
Use cases
crowley is optimized for the following scenarios:
- Queries over files too large to fit comfortably in memory. crowley streams through JSON files with bounded memory regardless of file size. A 37 GB file uses ~30 MB of RAM.
- Queries on transient data. crowley quickly queries data that does not merit transformation into a more easily-queried structure such as a database or dataframe, because of time constraints or because the data is sensitive and cannot be loaded into an external application.
- Queries over heterogeneous, deeply-nested, and schemaless data which tools such as pandas, polars, or duckdb cannot ingest and transform. crowley's regular-language queries don't require schema inference.
- Queries over many files in parallel. crowley natively supports searching over many files with the same query, using either a list of file paths or a pattern match. These files will be searched in parallel more quickly and with less memory overhead than ijson with a ProcessPool.
Usage
Single-file search
from crowley import Query
names = Query("data.json", "users[*].name")
ages = Query("data.json", "users[*].age")
names.count() # 4
names.exists() # True
names.values() # ['Alice', 'Bob', 'Charlie', 'Diana']
ages.values() # [30, 25, 35, 28]
names.agg("sum") # nan
ages.agg("sum") # 118.0
names.types() # ['string']
ages.types() # ['number']
names.mode() # {'values': ['Alice', 'Bob', 'Charlie', 'Diana'], 'frequency': 1}
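The outputs above are consistent with a data.json shaped like the one below. The file itself is not shown here, so this is an assumed reconstruction from the printed results, and it uses only the standard library rather than crowley. The plain-Python lists mirror what values() and agg("sum") are shown returning.

```python
import json
import os
import tempfile

# Assumed sample data, reconstructed from the outputs shown above.
sample = {
    "users": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35},
        {"name": "Diana", "age": 28},
    ]
}

# Written to a temporary directory here; the usage example reads "data.json".
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Read it back and compute the same answers with ordinary Python.
with open(path, encoding="utf-8") as f:
    users = json.load(f)["users"]

names = [u["name"] for u in users]
ages = [u["age"] for u in users]

print(len(names))  # 4
print(names)       # ['Alice', 'Bob', 'Charlie', 'Diana']
print(sum(ages))   # 118
```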
Multi-file search
from crowley import Query
repo_names = Query("tests/github_daily_jsonl/2015*", "[*].repo.name")
repo_names.count() # [7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351, 11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473, 11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371]
repo_names.total_count() # 218939
repo_names.total_unique() # 65703
repo_names.mode()[0] # {'values': ['KenanSulayman/heartbeat'], 'frequency': 79}
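In this run, the per-file counts returned by count() add up to the figure returned by total_count(). That total_count() is defined as this sum is an inference from the numbers above, not something stated here; the check below only confirms the arithmetic.

```python
# Per-file counts copied from the count() output above (one entry per matched file).
per_file_counts = [
    7702, 7427, 7234, 7387, 8273, 8971, 10307, 11351,
    11749, 11961, 12229, 12314, 6743, 12442, 13111, 12473,
    11601, 5971, 5869, 5887, 8322, 7105, 6139, 6371,
]

print(len(per_file_counts))  # 24
print(sum(per_file_counts))  # 218939
```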
Query language
The query language uses a regular-expression-inspired syntax for navigating JSON structure:
| Query | Meaning |
|---|---|
| `name` | Field name in the root object |
| `address.street` | Field street inside address |
| `users[*].name` | name field of every element in users array |
| `*` | Any field in the root object |
| `[*]` | Any element in the root array |
| `users[0]` | First element of users |
| `users[1:3]` | Elements at indices 1 and 2 |
| `(name \| age)` | Either name or age |
| `(* \| [*])*` | Any value at any depth (recursive descent) |
| `a?` | Returns the value of a if it exists |
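To make the meaning of these operators concrete, here is a minimal sketch that evaluates the same kinds of steps over an already-parsed Python object. It is not how crowley works (crowley streams the file and compiles queries to a DFA); the `select` function and the tuple-based step encoding are hypothetical names introduced only for illustration.

```python
from typing import Any, Iterator, Sequence

# Hypothetical step encoding (not crowley's API):
#   ("field", name)    name         object member by name
#   ("alt", names)     (a | b)      any of several member names
#   ("any_field",)     *            every member of an object
#   ("any_index",)     [*]          every element of an array
#   ("index", i)       [i]          one array element
#   ("slice", lo, hi)  [lo:hi]      elements lo .. hi-1
#   ("descend",)       (* | [*])*   the value itself and every descendant


def descendants(value: Any) -> Iterator[Any]:
    """Yield value and everything nested inside it, depth-first."""
    yield value
    if isinstance(value, dict):
        for child in value.values():
            yield from descendants(child)
    elif isinstance(value, list):
        for child in value:
            yield from descendants(child)


def select(value: Any, steps: Sequence[tuple]) -> Iterator[Any]:
    """Yield every value reached by applying steps to value, in document order."""
    if not steps:
        yield value
        return
    step, rest = steps[0], steps[1:]
    kind = step[0]
    is_obj, is_arr = isinstance(value, dict), isinstance(value, list)
    if kind == "field":
        children = [value[step[1]]] if is_obj and step[1] in value else []
    elif kind == "alt":
        children = [v for k, v in value.items() if k in step[1]] if is_obj else []
    elif kind == "any_field":
        children = list(value.values()) if is_obj else []
    elif kind == "any_index":
        children = value if is_arr else []
    elif kind == "index":
        children = value[step[1]:step[1] + 1] if is_arr else []
    elif kind == "slice":
        children = value[step[1]:step[2]] if is_arr else []
    elif kind == "descend":
        children = list(descendants(value))
    else:
        raise ValueError(f"unknown step: {step!r}")
    for child in children:
        yield from select(child, rest)


doc = {"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25},
                 {"name": "Charlie", "age": 35}, {"name": "Diana", "age": 28}]}

# users[*].name
print(list(select(doc, [("field", "users"), ("any_index",), ("field", "name")])))
# users[1:3].name
print(list(select(doc, [("field", "users"), ("slice", 1, 3), ("field", "name")])))
# users[0].(name | age)
print(list(select(doc, [("field", "users"), ("index", 0), ("alt", ["name", "age"])])))
# (* | [*])*.age
print(list(select(doc, [("descend",), ("field", "age")])))
```

Missing fields simply produce no matches, which is the behaviour the sketch assumes for every step; the table only states this explicitly for `a?`.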
Performance
Benchmarks measured on a Mac M3 Max with 32GB of RAM:
File: Flat GitHub log data, 34GB
Query: [*].repo.name
Count matches:
crowley: 71.6s
ijson: 128.8s
Difference: 1.8x
Return matches:
crowley: 116.0s
ijson: 126.1s
Difference: 1.09x
Return unique values:
crowley: 125.7s
ijson: 129.5s
Difference: 1.03x
Return unique count:
crowley: 122.1s
ijson: 129.5s
Difference: 1.06x
File: Nested GeoJSON, 30MB
Query: features[*].properties.name
Count matches:
crowley: 138.44ms
ijson: 421.85ms
Difference: 3.0x
Existence check (true):
crowley: 16µs
ijson: 793µs
Difference: 49x
Query: features[*].properties.scalerank
Sum matches:
crowley: 184.88ms
ijson: 425.89ms
Difference: 2.3x
Query: features[*].properties.nonexistent
Existence check (false):
crowley: 138.9ms
ijson: 409.7ms
Difference: 2.9x
On queries where the objective is to return values, crowley outperforms ijson by 3-10%. Where only a measure such as a count or an aggregate sum is returned, crowley can often outperform ijson by 2-3x by avoiding materializing values unnecessarily.
But the real benefit comes from crowley's more expressive query language, which can efficiently express what would otherwise require Python loops around ijson.
It can extract multiple fields through disjunctions (at one or multiple levels) in a single pass without having to materialize the parent object:
# get the number of matching objects
# 133.6ms
crowley.Query(file_str, "features[*].properties.(name | admin)").count()
# get the number of unique matches
# 144.2ms
crowley.Query(file_str, "features[*].properties.(name | admin)").unique_values()
# get the number of matching objects
# 851.6ms
def ijson_two_passes():
    with open(file_str, "rb") as f:
        count1 = sum(1 for _ in ijson.items(f, "features.item.properties.name"))
    with open(file_str, "rb") as f:
        count2 = sum(1 for _ in ijson.items(f, "features.item.properties.admin"))
    return count1 + count2
ijson_two_passes()
# get the number of unique matches
# 430ms
def ijson_two_fields():
    names = set()
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            if "name" in obj:
                names.add(obj["name"])
            if "admin" in obj:
                names.add(obj["admin"])
    return names
ijson_two_fields()
It can extract all property values without internal iteration:
# get the number of all matching property values by query
# 133.9ms
crowley.Query(file_str, "features[*].properties.*").count()
# get the number of all matching properties by internal iteration
# 427.9ms
def ijson_all_props():
    count = 0
    with open(file_str, "rb") as f:
        for obj in ijson.items(f, "features.item.properties"):
            count += len(obj)
    return count
ijson_all_props()
It can select ranges of array elements without manual index checking:
- Note: this is one of the few places crowley can be slower under some conditions. If the array range is not at the root level, ijson + Python break logic can stop more quickly, while crowley must continue parsing the outer structure. For root-level array ranges, crowley remains faster. However, using the same approach with crowley as with ijson (manually checking values and breaking out) makes crowley even slower.
Root-level array (github_array.json):
crowley [0:3]: 22µs (crowley terminates early more quickly)
ijson [0:3]+break: 234µs
Difference: 10.6x
crowley [97:102]: 464µs (crowley terminates early more quickly)
ijson [97:102]+break: 923µs
Difference: 1.98x
crowley [*] (full): 49.4ms
crowley [*]+break: 60.9ms
Nested array (ne_10m.json):
crowley [0:3]: 131.4ms
ijson [0:3]+break: 847µs (ijson is able to short-circuit faster!)
Difference: 0.006x
crowley [97:102]: 133.8ms
ijson [97:102]+break: 11.5ms (ijson is able to short-circuit faster!)
Difference: 0.086x
# start of array
crowley.Query(file_str, "features[0:3].properties.name", no_seek=True).values()
# middle of array
crowley.Query(file_str, "features[97:102].properties.name", no_seek=True).values()
def ijson_range_start():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if i < 3:
                result.append(name)
            else:
                break
    return result
ijson_range_start()
def ijson_range_mid():
    result = []
    with open(file_str, "rb") as f:
        for i, name in enumerate(ijson.items(f, "features.item.properties.name")):
            if 97 <= i < 102:
                result.append(name)
            if i >= 101:
                break
    return result
ijson_range_mid()
It can even descend recursively, which ijson cannot do. The closest alternative is a non-streaming solution such as json, which loads the whole file into memory:
# get unique values of 'type' at any depth
# 221.8ms : ['FeatureCollection', 'name', 'Feature', 'Polygon']
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).unique_values()
# get count of all matching objects at all depths
# 156.7ms : 17090
crowley.Query(file_str, "(* | [*])*.type", no_seek=True).count()
# walk the entire json tree manually looking for matching keys
# 509.8ms
import json
def json_recursive_search(key):
    with open(file_str) as f:
        data = json.load(f)
    results = []
    def walk(obj):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if k == key:
                    results.append(v)
                walk(v)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)
    walk(data)
    return results
values = json_recursive_search("type")
unique = set(str(x) for x in values)
Cold vs Hot Start
On cold starts (first query, no prior loading), crowley is 2-3x faster than pandas, 3-7x faster than DuckDB, and handles files that make Polars fail entirely due to schema inference errors.
On subsequent calls, methods such as count() or exists() return their pre-computed answer in O(1) with zero file I/O. Other methods like types() and agg() will determine whether reading only matched byte positions will be faster than a full sequential scan.
However, on very large files with a large volume of matches, the cached byte offsets for matches can considerably exceed the memory usage from streaming itself, and these offsets remain in the Query object until it is dropped. The query's cache can be manually cleared with .clear_cache(), and cache accumulation can be deactivated at query creation with the no_seek=True kwarg. This can be configured globally with crowley.configure(no_seek=True).
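The trade-off described above can be illustrated with a much simpler stand-in: remember the byte offset of each match during the first sequential pass, then seek straight to those offsets on later calls. The sketch below assumes a JSON Lines file and uses only the standard library; `scan_with_offsets` and `read_at` are hypothetical helpers, not crowley functions, and crowley's actual caching is not shown here.

```python
import json
import os
import tempfile


def scan_with_offsets(path, key):
    """First pass: read sequentially, recording where each matching record starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # byte position of the line about to be read
            line = f.readline()
            if not line:
                break
            if key in json.loads(line):
                offsets.append(pos)
    return offsets


def read_at(path, offsets):
    """Later passes: jump directly to the cached positions, skipping everything else."""
    records = []
    with open(path, "rb") as f:
        for pos in offsets:
            f.seek(pos)
            records.append(json.loads(f.readline()))
    return records


# Build a small JSON Lines file to run against.
rows = [{"id": 1, "name": "a"}, {"id": 2}, {"id": 3, "name": "c"}]
path = os.path.join(tempfile.mkdtemp(), "rows.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

offsets = scan_with_offsets(path, "name")   # cold start: full scan
print(len(offsets))                         # count is now answered from the cache
print([r["name"] for r in read_at(path, offsets)])
```

The cache costs one integer per match, which is why a query with a very large number of matches can end up holding more memory in offsets than the streaming pass itself needed.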
Acknowledgments
Built on the DFA-based query engine from jsongrep by Micah Kepe, and the SAX parser from json-event-parser by the Oxigraph project.
This project benefits not only from the work of other developers, but also from their choice to make their source code public and freely re-usable under the MIT and Apache 2.0 licenses.