arpeggio-shredder
Reversible shredding (flatten) and unshredding (rebuild) of semi-structured JSONL into Apache Arrow RecordBatches.
Source Code: https://codeberg.org/alwyna/shredder
This package provides a thin Python binding over a C++ core that converts JSON documents into a columnar “atoms” representation suitable for Arrow / Parquet workflows, while preserving the ability to reconstruct the original documents exactly.
The scope is intentionally narrow: flattening with reversibility, not general JSON processing.
Key properties
- Reversible: JSON → Arrow atoms → JSON
- Columnar-first: output is optimized for Arrow-native pipelines
- Deterministic identity: optional object and transaction tagging
- Minimal Python surface: high-level shredder package or raw shredder_ext extension
Installation
pip install arpeggio-shredder
Usage
Shredder 0.2.0 supports two explicit "codec" strategies: json (UTF-8 text) and arrow (native Struct/List DOM).
Known limitations / Non-goals for 0.2.x
- No support for array-of-array: Shredder throws a runtime_error on nested arrays (e.g., [[1,2]]).
- Row expansion semantics (no sibling back-fill): Fields encountered after an array may only be present in the last expanded row. Reconstruction remains correct via internal metadata.
- Depth/complexity caveat: The iterative core is frozen for 0.2.0. We prioritize stability over redesigning the traversal engine for extreme nesting.
- Codec strategy is explicit: codec="json" requires strings; codec="arrow" requires Arrow DOM (struct/list). No auto-detection.
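Because nested arrays are rejected at shred time, it can help to screen documents beforehand. The following is a minimal sketch using only the standard library. The helper contains_nested_array is hypothetical and not part of the shredder API, and it assumes that only an array placed directly inside another array is rejected.

```python
import json


def contains_nested_array(value, parent_is_array=False):
    """Return True if any array directly contains another array, e.g. [[1, 2]]."""
    if isinstance(value, list):
        if parent_is_array:
            return True
        return any(contains_nested_array(item, parent_is_array=True) for item in value)
    if isinstance(value, dict):
        # An object resets the context: [{"a": [1]}] is an array of objects,
        # not an array of arrays (assumption about what the engine rejects).
        return any(contains_nested_array(item) for item in value.values())
    return False


lines = [
    '{"id": 1, "tags": ["a", "b"]}',
    '{"id": 2, "grid": [[1, 2], [3, 4]]}',
    '{"id": 3, "items": [{"vals": [1, 2]}]}',
]

# Indices of lines that would need restructuring before shredding.
rejected = [i for i, line in enumerate(lines) if contains_nested_array(json.loads(line))]
print(rejected)
```

Only the second document is flagged here. Whether arrays nested through an intermediate object are accepted by the engine is not stated above, so treat that branch as a guess.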
1. JSON Text Codec (Default)
Warning: The json codec expects UTF-8 JSON strings (e.g., pa.string()), not Python dicts or Arrow structs. No silent casting is performed.
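Since the json codec does not cast Python dicts, records held as dicts need to be serialized first. A minimal sketch with the standard library follows; to_json_lines is a hypothetical helper name, and the compact separators are a stylistic choice rather than a requirement stated by the package.

```python
import json


def to_json_lines(records):
    """Serialize Python dicts to JSON strings, one string per document."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \u escapes.
    return [json.dumps(rec, ensure_ascii=False, separators=(",", ":")) for rec in records]


records = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "wörld"}]
lines = to_json_lines(records)
print(lines)
```

The resulting list of str can then be wrapped with pa.array(lines, type=pa.string()) before shredding.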
The classic workflow: JSONL → Arrow atoms → JSONL.
import pyarrow as pa
import shredder
# 1. Load JSONL as Arrow string array
lines = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
array = pa.array(lines, type=pa.string())
# 2. Shred into columnar "atoms"
# codec="json" is default, but can be explicit
atoms = shredder.shred(array, codec="json")
# 3. Unshred back to JSON
# Returns RecordBatch with single column 'doc'
reconstructed = shredder.unshred(atoms, codec="json")
json_docs = reconstructed.column(0)
for doc in json_docs:
    print(doc.as_py())
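To confirm a round trip, the reconstructed strings can be compared with the input. The text above promises exact reconstruction but does not say whether whitespace and key order are byte-identical, so this sketch compares parsed values instead. same_documents is a hypothetical helper, and the rebuilt list is a stand-in for the real output.

```python
import json


def same_documents(original_lines, rebuilt_lines):
    """Compare two JSONL sequences by parsed value, ignoring whitespace and key order."""
    if len(original_lines) != len(rebuilt_lines):
        return False
    return all(json.loads(a) == json.loads(b) for a, b in zip(original_lines, rebuilt_lines))


original = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
# Stand-in for [doc.as_py() for doc in json_docs]; this formatting is hypothetical.
rebuilt = ['{"id":1,"msg":"hello"}', '{"msg":"world","id":2}']
print(same_documents(original, rebuilt))
```

If byte-level equality matters for a pipeline, a plain original == rebuilt comparison is the stricter check.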
2. Arrow DOM Codec
New in 0.2.0: Treat Arrow nested structures as the source of truth. No JSON parsing involved.
import pyarrow as pa
import pyarrow.json as pajson
import shredder
# 1. Read JSONL into Arrow-native columns
# table = pajson.read_json("input.jsonl")
table = pa.table({"id": [1, 2], "val": [10.5, 20.0]})
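# 2. Convert to a single struct column (Arrow DOM input)
# Table columns are ChunkedArrays, so combine each into a plain Array first
columns = [col.combine_chunks() for col in table.columns]
doc_struct = pa.StructArray.from_arrays(columns, names=table.column_names)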
# 3. Use the stateful Shredder class
s = shredder.Shredder(codec="arrow")
# 4. Shred/Unshred
atoms = s.shred(doc_struct)
reconstructed = s.unshred(atoms)
# Output 'doc' column is the reconstructed StructArray
doc_out = reconstructed.column(0)
print(doc_out.to_pylist())
How it works
- Codecs:
  - JsonTextCodec: 1-col UTF-8 string → internal DOM → atoms.
  - ArrowDomCodec: 1-col Arrow struct/list/etc → internal DOM → atoms.
- Flattening: Nested objects are flattened into path-based columns (e.g., root_user_name).
- Row Expansion: Nested arrays are "shredded" by duplicating parent values for each array element, adding an __idx column to preserve order.
- Identity: The engine automatically injects __obj_id and __txn_id (BLAKE3 hashes of content) to ensure perfect reconstruction.
- Reversibility: Shredder preserves enough metadata to rebuild the exact original structure, regardless of the input codec.
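The flattening and row-expansion ideas can be illustrated in pure Python. This is only a conceptual sketch, not the C++ core's algorithm: the function name shred_rows, the per-array index column naming (path plus __idx), and the handling of empty arrays are all assumptions, and no identity or reconstruction metadata is produced.

```python
def shred_rows(value, path="root"):
    """Flatten one document into rows keyed by underscore-joined paths."""
    if isinstance(value, dict):
        rows = [{}]
        for key, child in value.items():
            child_rows = shred_rows(child, f"{path}_{key}")
            # Duplicate the values collected so far for every child row.
            rows = [{**row, **child_row} for row in rows for child_row in child_rows]
        return rows
    if isinstance(value, list):
        rows = []
        for idx, item in enumerate(value):
            if isinstance(item, list):
                # Mirrors the documented limitation on array-of-array input.
                raise ValueError(f"nested array at {path}")
            for child_row in shred_rows(item, path):
                rows.append({**child_row, f"{path}__idx": idx})
        # Empty arrays contribute no columns here; this sketch does not preserve them.
        return rows or [{}]
    return [{path: value}]


doc = {"user": {"name": "ada"}, "tags": ["x", "y"]}
rows = shred_rows(doc)
for row in rows:
    print(row)
```

The two printed rows share root_user_name and differ in root_tags and its index column, which is the duplication of parent values described in the list above.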
Native extension
The package ships with a prebuilt native extension (.so) built against Apache Arrow and exposed via pybind11.
- No runtime compilation
- No system Arrow installation required
- Shared libraries are bundled into the wheel
Platform support
- OS: Linux (manylinux-compatible)
- Architecture: x86_64
- Python: CPython 3.12
- ABI: glibc (manylinux)
Other platforms are not currently supported.
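A quick way to check whether an environment matches the list above before attempting an install is sketched here with the standard library. is_supported is a hypothetical helper, and it does not check the glibc version.

```python
import platform
import sys


def is_supported(system, machine, version, implementation):
    """Check an environment against the platform list above (Linux, x86_64, CPython 3.12)."""
    return (
        system == "Linux"
        and machine == "x86_64"
        and implementation == "CPython"
        and tuple(version[:2]) == (3, 12)
    )


supported_here = is_supported(
    platform.system(),
    platform.machine(),
    sys.version_info,
    platform.python_implementation(),
)
print("arpeggio-shredder wheel expected to install:", supported_here)
```

For the glibc requirement, platform.libc_ver() reports the C library version on Linux and could be added to the check.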
Relationship to the C++ project
This package is the Python distribution layer for the Shredder C++ project hosted on Codeberg: https://codeberg.org/alwyna/shredder.
License
This package is dual-licensed:
- AGPL-3.0 for open-source use and networked deployments
- Commercial license for proprietary or closed-source use
Commercial licensing is available via https://arpeggio.one/shop.
File details
Details for the file arpeggio_shredder-0.2.0-cp312-cp312-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: arpeggio_shredder-0.2.0-cp312-cp312-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 24.8 MB
- Tags: CPython 3.12, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a505199f267d1c6d6af039cfc3fabce44f71c90e3efad4ad339106b84ee4c10a |
| MD5 | 94c6496922bdeb13deee3283943a9d34 |
| BLAKE2b-256 | 6248a70a1808842edd561e406e19bcba97d45978fd43d26cda5146d9a1427fd6 |
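A downloaded wheel can be checked against the SHA256 digest listed above. This is a minimal sketch using hashlib; sha256_of and verify_wheel are hypothetical helper names.

```python
import hashlib

# SHA256 digest published for the 0.2.0 wheel in the table above.
EXPECTED_SHA256 = "a505199f267d1c6d6af039cfc3fabce44f71c90e3efad4ad339106b84ee4c10a"


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a large wheel is not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected=EXPECTED_SHA256):
    """Return True when the file at path matches the expected digest."""
    return sha256_of(path) == expected
```

For example, verify_wheel("arpeggio_shredder-0.2.0-cp312-cp312-manylinux_2_39_x86_64.whl") should return True for an intact download.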