
Reversible JSON ↔ Arrow shredding with schema stability

Project description

arpeggio-shredder

Reversible shredding (flatten) and unshredding (rebuild) of semi-structured JSONL into Apache Arrow RecordBatches.

Source Code: https://codeberg.org/alwyna/shredder

This package provides a thin Python binding over a C++ core that converts JSON documents into a columnar “atoms” representation suitable for Arrow / Parquet workflows, while preserving the ability to reconstruct the original documents exactly.

The scope is intentionally narrow: flattening with reversibility, not general JSON processing.

Key properties

  • Reversible: JSON → Arrow atoms → JSON
  • Columnar-first: output is optimized for Arrow-native pipelines
  • Deterministic identity: optional object and transaction tagging
  • Minimal Python surface: the high-level shredder package or the raw shredder_ext extension

Installation

pip install arpeggio-shredder

Usage

Shredder 0.2.0 supports two explicit "codec" strategies: json (UTF-8 text) and arrow (native Struct/List DOM).

Known limitations / Non-goals for 0.2.x

  • No support for array-of-array: Shredder throws a runtime_error on nested arrays (e.g., [[1,2]]).
  • Row expansion semantics (no sibling back-fill): Fields encountered after an array may only be present in the last expanded row. Reconstruction remains correct via internal metadata.
  • Depth/complexity caveat: The iterative core is frozen for 0.2.0. We prioritize stability over redesigning the traversal engine for extreme nesting.
  • Codec strategy is explicit: codec="json" requires strings; codec="arrow" requires Arrow DOM (struct/list). No auto-detection.
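The array-of-array restriction can be checked before shredding. Below is a stdlib-only sketch mirroring the 0.2.x rule; `has_nested_array` is a hypothetical helper for illustration, not part of the shredder API:

```python
import json

def has_nested_array(value, inside_array=False):
    """Return True if a parsed JSON value contains an array directly inside an array."""
    if isinstance(value, list):
        if inside_array:
            return True
        return any(has_nested_array(v, inside_array=True) for v in value)
    if isinstance(value, dict):
        # An object resets nesting: arrays inside an object inside an array are fine.
        return any(has_nested_array(v) for v in value.values())
    return False

# [[1, 2]] would raise a runtime_error in shredder 0.2.x; arrays of objects are allowed.
assert has_nested_array(json.loads("[[1, 2]]"))
assert not has_nested_array(json.loads('{"tags": [{"k": 1}, {"k": 2}]}'))
```

A guard like this lets a pipeline reject or reroute offending documents up front instead of handling a runtime_error mid-batch.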

1. JSON Text Codec (Default)

Warning: The json codec expects UTF-8 JSON strings (e.g., pa.string()), not Python dicts or Arrow structs. No silent casting is performed.

The classic workflow: JSONL → Arrow atoms → JSONL.

import pyarrow as pa
import shredder

# 1. Load JSONL as Arrow string array
lines = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
array = pa.array(lines, type=pa.string())

# 2. Shred into columnar "atoms"
# codec="json" is default, but can be explicit
atoms = shredder.shred(array, codec="json")

# 3. Unshred back to JSON
# Returns RecordBatch with single column 'doc'
reconstructed = shredder.unshred(atoms, codec="json")
json_docs = reconstructed.column(0)

for doc in json_docs:
    print(doc.as_py())
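When verifying the round trip, comparing parsed documents rather than raw strings avoids false mismatches from whitespace or key order. A stdlib-only sketch, with the two lists standing in for shredder's input and unshred output:

```python
import json

original = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
# Stand-in for unshred output; note the key order differs but the content does not.
round_tripped = ['{"msg": "hello", "id": 1}', '{"msg": "world", "id": 2}']

# Semantic equality: compare parsed documents, not raw bytes.
assert [json.loads(a) for a in original] == [json.loads(b) for b in round_tripped]
```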

2. Arrow DOM Codec

New in 0.2.0: Treat Arrow nested structures as the source of truth. No JSON parsing involved.

import pyarrow as pa
import pyarrow.json as pajson
import shredder

# 1. Read JSONL into Arrow-native columns
# table = pajson.read_json("input.jsonl")
table = pa.table({"id": [1, 2], "val": [10.5, 20.0]})

# 2. Convert to a single struct column (Arrow DOM input)
# StructArray.from_arrays needs plain Arrays, so flatten each ChunkedArray first
doc_struct = pa.StructArray.from_arrays(
    [col.combine_chunks() for col in table.columns],
    names=table.column_names,
)

# 3. Use the stateful Shredder class
s = shredder.Shredder(codec="arrow")

# 4. Shred/Unshred
atoms = s.shred(doc_struct)
reconstructed = s.unshred(atoms)

# Output 'doc' column is the reconstructed StructArray
doc_out = reconstructed.column(0)
print(doc_out.to_pylist())

How it works

  1. Codecs:
    • JsonTextCodec: single UTF-8 string column → internal DOM → atoms.
    • ArrowDomCodec: single Arrow struct/list/etc. column → internal DOM → atoms.
  2. Flattening: Nested objects are flattened into path-based columns (e.g., root_user_name).
  3. Row Expansion: Nested arrays are "shredded" by duplicating parent values for each array element, adding an __idx column to preserve order.
  4. Identity: The engine automatically injects __obj_id and __txn_id (BLAKE3 hashes of content) to ensure perfect reconstruction.
  5. Reversibility: Shredder preserves enough metadata to rebuild the exact original structure, regardless of the input codec.
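Steps 2–4 above can be modeled in a few lines of plain Python. This is an illustrative sketch of the described behavior, not the C++ engine: `shred_doc` is a hypothetical helper, and `hashlib.blake2b` stands in for BLAKE3, which is not in the standard library:

```python
import hashlib
import json

def shred_doc(doc):
    """Illustrative shred: flatten nested objects into path-based columns
    and expand one level of arrays into rows carrying an __idx column."""
    base = {}    # scalar columns, keyed by flattened path
    arrays = {}  # array columns awaiting row expansion

    def walk(obj, prefix):
        for key, value in obj.items():
            path = f"{prefix}_{key}"
            if isinstance(value, dict):
                walk(value, path)        # step 2: path-based flattening
            elif isinstance(value, list):
                arrays[path] = value     # step 3: expanded into rows below
            else:
                base[path] = value

    walk(doc, "root")
    # Step 4: deterministic content hash (blake2b as a stand-in for BLAKE3).
    obj_id = hashlib.blake2b(
        json.dumps(doc, sort_keys=True).encode(), digest_size=16
    ).hexdigest()
    if not arrays:
        return [{**base, "__obj_id": obj_id}]
    path, values = next(iter(arrays.items()))  # one array, for simplicity
    return [
        {**base, path: element, "__idx": idx, "__obj_id": obj_id}
        for idx, element in enumerate(values)
    ]

rows = shred_doc({"user": {"name": "ada"}, "tags": ["x", "y"]})
# Parent value root_user_name is duplicated across both expanded rows,
# while __idx preserves the original array order for reconstruction.
```

The real engine carries additional metadata (e.g., __txn_id and codec information) so that unshredding restores the exact original structure; the sketch only shows the shape of the columnar output.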

Native extension

The package ships with a prebuilt native extension (.so) built against Apache Arrow and exposed via pybind11.

  • No runtime compilation
  • No system Arrow installation required
  • Shared libraries are bundled into the wheel

Platform support

  • OS: Linux (manylinux-compatible)
  • Architecture: x86_64
  • Python: CPython 3.12
  • ABI: glibc (manylinux)

Other platforms are not currently supported.

Relationship to the C++ project

This package is the Python distribution layer for the Shredder C++ project hosted on Codeberg: https://codeberg.org/alwyna/shredder.

License

This package is dual-licensed:

  • AGPL-3.0 for open-source use and networked deployments
  • Commercial license for proprietary or closed-source use

Commercial licensing is available via https://arpeggio.one/shop.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

arpeggio_shredder-0.2.1-cp312-cp312-manylinux_2_39_x86_64.whl (24.8 MB): CPython 3.12, manylinux (glibc 2.39+), x86-64

File details

Hashes for arpeggio_shredder-0.2.1-cp312-cp312-manylinux_2_39_x86_64.whl:

  • SHA256: dca31c8f7edaf707a1852cce7eddf4c20334f175c7f04e32ae498791fc07d226
  • MD5: 31761de4c6bb1aad060b064222468d9a
  • BLAKE2b-256: 036642eab851045e2db5fb83e416ab887a06737b4f6d28095d58e91fc326bfd9
