
Reversible JSON ↔ Arrow shredding with schema stability

Project description

arpeggio-shredder

Reversible shredding (flatten) and unshredding (rebuild) of semi-structured JSONL into Apache Arrow RecordBatches.

Source Code: https://codeberg.org/alwyna/shredder

This package provides a thin Python binding over a C++ core that converts JSON documents into a columnar “atoms” representation suitable for Arrow / Parquet workflows, while preserving the ability to reconstruct the original documents exactly.

The scope is intentionally narrow: flattening with reversibility, not general JSON processing.

Key properties

  • Reversible: JSON → Arrow atoms → JSON
  • Columnar-first: output is optimized for Arrow-native pipelines
  • Deterministic identity: optional object and transaction tagging
  • Minimal Python surface: the high-level shredder package or the raw shredder_ext extension

Installation

pip install arpeggio-shredder

Usage

Shredder 0.2.0 supports two explicit "codec" strategies: json (UTF-8 text) and arrow (native Struct/List DOM).

Known limitations / Non-goals for 0.2.x

  • No support for array-of-array: Shredder throws a runtime_error on nested arrays (e.g., [[1,2]]).
  • Row expansion semantics (no sibling back-fill): Fields encountered after an array may only be present in the last expanded row. Reconstruction remains correct via internal metadata.
  • Depth/complexity caveat: The iterative core is frozen for 0.2.0. We prioritize stability over redesigning the traversal engine for extreme nesting.
  • Codec strategy is explicit: codec="json" requires strings; codec="arrow" requires Arrow DOM (struct/list). No auto-detection.
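The array-of-array restriction can be checked before shredding. Below is a stdlib-only sketch mirroring the 0.2.x rule; `has_nested_array` is a hypothetical helper for illustration, not part of the shredder API:

```python
import json

def has_nested_array(value, inside_array=False):
    """Return True if a parsed JSON value contains an array directly inside an array."""
    if isinstance(value, list):
        if inside_array:
            return True
        return any(has_nested_array(v, inside_array=True) for v in value)
    if isinstance(value, dict):
        # An object resets nesting: arrays inside an object inside an array are fine.
        return any(has_nested_array(v) for v in value.values())
    return False

# [[1, 2]] would raise a runtime_error in shredder 0.2.x; arrays of objects are allowed.
assert has_nested_array(json.loads("[[1, 2]]"))
assert not has_nested_array(json.loads('{"tags": [{"k": 1}, {"k": 2}]}'))
```

A guard like this lets a pipeline reject or reroute offending documents up front instead of handling a runtime_error mid-batch.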

1. JSON Text Codec (Default)

Warning: The json codec expects UTF-8 JSON strings (e.g., pa.string()), not Python dicts or Arrow structs. No silent casting is performed.

The classic workflow: JSONL → Arrow atoms → JSONL.

import pyarrow as pa
import shredder

# 1. Load JSONL as Arrow string array
lines = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
array = pa.array(lines, type=pa.string())

# 2. Shred into columnar "atoms"
# codec="json" is default, but can be explicit
atoms = shredder.shred(array, codec="json")

# 3. Unshred back to JSON
# Returns RecordBatch with single column 'doc'
reconstructed = shredder.unshred(atoms, codec="json")
json_docs = reconstructed.column(0)

for doc in json_docs:
    print(doc.as_py())
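When verifying the round trip, comparing parsed documents rather than raw strings avoids false mismatches from whitespace or key order. A stdlib-only sketch, with the two lists standing in for shredder's input and unshred output:

```python
import json

original = ['{"id": 1, "msg": "hello"}', '{"id": 2, "msg": "world"}']
# Stand-in for unshred output; note the key order differs but the content does not.
round_tripped = ['{"msg": "hello", "id": 1}', '{"msg": "world", "id": 2}']

# Semantic equality: compare parsed documents, not raw bytes.
assert [json.loads(a) for a in original] == [json.loads(b) for b in round_tripped]
```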

2. Arrow DOM Codec

New in 0.2.0: Treat Arrow nested structures as the source of truth. No JSON parsing involved.

import pyarrow as pa
import pyarrow.json as pajson
import shredder

# 1. Read JSONL into Arrow-native columns
# table = pajson.read_json("input.jsonl")
table = pa.table({"id": [1, 2], "val": [10.5, 20.0]})

# 2. Convert to a single struct column (Arrow DOM input)
# StructArray.from_arrays needs plain Arrays, so flatten each ChunkedArray first
doc_struct = pa.StructArray.from_arrays(
    [col.combine_chunks() for col in table.columns],
    names=table.column_names,
)

# 3. Use the stateful Shredder class
s = shredder.Shredder(codec="arrow")

# 4. Shred/Unshred
atoms = s.shred(doc_struct)
reconstructed = s.unshred(atoms)

# Output 'doc' column is the reconstructed StructArray
doc_out = reconstructed.column(0)
print(doc_out.to_pylist())

How it works

  1. Codecs:
    • JsonTextCodec: single UTF-8 string column → internal DOM → atoms.
    • ArrowDomCodec: single Arrow struct/list/etc. column → internal DOM → atoms.
  2. Flattening: Nested objects are flattened into path-based columns (e.g., root_user_name).
  3. Row Expansion: Nested arrays are "shredded" by duplicating parent values for each array element, adding an __idx column to preserve order.
  4. Identity: The engine automatically injects __obj_id and __txn_id (BLAKE3 hashes of content) to ensure perfect reconstruction.
  5. Reversibility: Shredder preserves enough metadata to rebuild the exact original structure, regardless of the input codec.
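Steps 2–4 above can be modeled in a few lines of plain Python. This is an illustrative sketch of the described behavior, not the C++ engine: `shred_doc` is a hypothetical helper, and `hashlib.blake2b` stands in for BLAKE3, which is not in the standard library:

```python
import hashlib
import json

def shred_doc(doc):
    """Illustrative shred: flatten nested objects into path-based columns
    and expand one level of arrays into rows carrying an __idx column."""
    base = {}    # scalar columns, keyed by flattened path
    arrays = {}  # array columns awaiting row expansion

    def walk(obj, prefix):
        for key, value in obj.items():
            path = f"{prefix}_{key}"
            if isinstance(value, dict):
                walk(value, path)        # step 2: path-based flattening
            elif isinstance(value, list):
                arrays[path] = value     # step 3: expanded into rows below
            else:
                base[path] = value

    walk(doc, "root")
    # Step 4: deterministic content hash (blake2b as a stand-in for BLAKE3).
    obj_id = hashlib.blake2b(
        json.dumps(doc, sort_keys=True).encode(), digest_size=16
    ).hexdigest()
    if not arrays:
        return [{**base, "__obj_id": obj_id}]
    path, values = next(iter(arrays.items()))  # one array, for simplicity
    return [
        {**base, path: element, "__idx": idx, "__obj_id": obj_id}
        for idx, element in enumerate(values)
    ]

rows = shred_doc({"user": {"name": "ada"}, "tags": ["x", "y"]})
# Parent value root_user_name is duplicated across both expanded rows,
# while __idx preserves the original array order for reconstruction.
```

The real engine carries additional metadata (e.g., __txn_id and codec information) so that unshredding restores the exact original structure; the sketch only shows the shape of the columnar output.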

Native extension

The package ships with a prebuilt native extension (.so) built against Apache Arrow and exposed via pybind11.

  • No runtime compilation
  • No system Arrow installation required
  • Shared libraries are bundled into the wheel

Platform support

  • OS: Linux (manylinux-compatible)
  • Architecture: x86_64
  • Python: CPython 3.12
  • ABI: glibc (manylinux)

Other platforms are not currently supported.

Relationship to the C++ project

This package is the Python distribution layer for the Shredder C++ project hosted on Codeberg: https://codeberg.org/alwyna/shredder.

License

This package is dual-licensed:

  • AGPL-3.0 for open-source use and networked deployments
  • Commercial license for proprietary or closed-source use

Commercial licensing is available via https://arpeggio.one/shop.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

arpeggio_shredder-0.2.1-cp312-cp312-manylinux_2_39_x86_64.whl (24.8 MB): CPython 3.12, manylinux (glibc 2.39+), x86-64

File details

Hashes for arpeggio_shredder-0.2.1-cp312-cp312-manylinux_2_39_x86_64.whl:

  • SHA256: dca31c8f7edaf707a1852cce7eddf4c20334f175c7f04e32ae498791fc07d226
  • MD5: 31761de4c6bb1aad060b064222468d9a
  • BLAKE2b-256: 036642eab851045e2db5fb83e416ab887a06737b4f6d28095d58e91fc326bfd9
