
JSON -> [*]


JSON2Vec

json2vec is a framework for learning embeddings directly from nested, semi-structured records (JSON, Parquet-like objects, and similar shapes) without flattening them into static feature tables.

It treats each dataset as a tree of contexts and fields, learns datatype-specific leaf embeddings, and routes those embeddings upward through context encoders to produce representations at multiple levels.

Why this repository exists

Most production ML systems on nested business data eventually accumulate:

  • brittle, duplicated feature engineering code,
  • train/serve skew between offline and online transforms,
  • heavy coupling to a static schema.

json2vec aims to make the model itself responsible for value encoding, masking, pruning, and reconstruction so the same logic can run in training and inference.

Ambition and scope

Ambition

  • Provide a reusable representation layer for hierarchical data.
  • Support changing schemas without rebuilding a separate feature store pipeline.
  • Make intermediate embeddings available for diagnostics and downstream modeling.
  • Serve in real time from a checkpoint alone, exactly reproducing the model's training-time logic.

Scope

  • Structured and semi-structured domains (finance, travel, operational telemetry, ecommerce, etc.).
  • Inputs that can be described as nested contexts with typed fields.

Current restrictions

  • Not a general multimodal system (images/audio/video are currently out of scope, but you may include pre-encoded embeddings).
  • Not schema-free; you must define structure and jmespath queries explicitly.
  • Field plugins currently implemented: number, category, dateparts, entity, vector.

Core features

  • Hierarchical modeling from structure definitions: You declare contexts and fields in a configuration file akin to a jsonschema; the model compiles this tree into addressable modules.
  • N-dimensional / ragged nested value support: Field tensorization handles arbitrarily nested list-like values, pads them to fixed shapes (ndarray-style tensors), and tracks value state (valued, null, padded, masked, pruned).
  • Featureless training flow: The model learns value encoding/normalization/tokenization within field plugins instead of depending on a separate handcrafted feature pipeline.
  • jmespath query extraction: Each field request has a query powered by jmespath, letting you pull values from deeply nested JSON-like records without flattening upstream.
  • SHIM processor support: Dataset processors (registered via @register) can mutate, filter, explode, or enrich observations before tensorization. This supports domain-specific logic without offline batch feature jobs.
  • Masking and pruning controls: p_mask and p_prune support self-supervised reconstruction and robustness to missing branches; permanent field pruning is supported per training session.
  • Pruning-based feature importance: Because pruning is native to the model path, you can run controlled ablations (field/context removal) and measure impact as an intrinsic importance signal.
  • Multi-level embedding outputs: You can emit intermediate embeddings at leaf/context/root addresses (session.output), not only final decoded predictions.
  • Shared train/serve logic: Training and online inference both use the same structure, field plugins, and processors, reducing train/serve skew risk.

Architecture at a glance

The model is a tree of modules:

  • Leaf nodes: datatype-specific embedders/decoders.
  • Context nodes: stacked rotary self-attention + learned-query cross-attention pooling.
  • Routing unit: "parcels" carrying tensor payloads with origin and destination addresses.
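The "parcel" idea can be sketched as follows; the class and field names are hypothetical stand-ins, not json2vec's actual API:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Parcel:
    # Hypothetical sketch: a payload tagged with tree addresses so context
    # nodes know where an embedding came from and where it should go.
    payload: Any                   # e.g. a tensor of leaf/context embeddings
    origin: tuple[str, ...]        # address of the producing node
    destination: tuple[str, ...]   # address of the consuming context node


def route(parcels: list[Parcel], address: tuple[str, ...]) -> list[Parcel]:
    """Collect the parcels addressed to a given context node."""
    return [p for p in parcels if p.destination == address]


leaf = Parcel([0.1, 0.2], origin=("customer", "accounts", "type"), destination=("customer", "accounts"))
root = Parcel([0.3], origin=("customer", "accounts"), destination=("customer",))
print([p.origin for p in route([leaf, root], ("customer", "accounts"))])
# [('customer', 'accounts', 'type')]
```

Addressing every module this way is what makes per-address pruning and intermediate embedding outputs cheap: both reduce to filtering parcels.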

Tree of encoding modules

Single context node

The repository also includes a full example architecture diagram used in the TaxML configuration.

Example configured module tree

Data path

The data path is iterable/streaming and designed for large datasets:

  • fetch -> read -> process -> shuffle -> batch -> transform -> mask -> prune

Supported sources/formats in the current code:

  • Local filesystem and S3.
  • ndjson, parquet, feather, avro, csv, orc, json.
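A stdlib-only sketch of this iterable pipeline, with stage names mirroring fetch -> read -> process -> shuffle -> batch (the real stages live in src/json2vec/data and additionally handle S3, columnar formats, masking, and pruning; field names here are illustrative):

```python
import io
import json
import random
from itertools import islice


def read_ndjson(stream):
    # read: decode one observation per line, lazily.
    for line in stream:
        yield json.loads(line)


def process(records):
    # process: a shim-style transform enriching observations before tensorization.
    for record in records:
        record["n_items"] = len(record.get("items", []))
        yield record


def shuffle(records, buffer_size=64, seed=0):
    # shuffle: a bounded buffer keeps memory constant on large datasets.
    rng = random.Random(seed)
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer


def batch(records, size=2):
    # batch: group the stream into fixed-size chunks for tensorization.
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


raw = io.StringIO('{"items": [1, 2]}\n{"items": []}\n{"items": [3, 4, 5]}\n')
batches = list(batch(shuffle(process(read_ndjson(raw))), size=2))
print([len(b) for b in batches])  # [2, 1]
```

Every stage is a generator over the previous one, so nothing is materialized beyond the shuffle buffer and the current batch.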

Pipeline stages

Potential use cases

  • Financial services: Customer/account/transaction/statement hierarchies for fraud detection, risk scoring, customer similarity, and anomaly detection.
  • Travel and pricing: Itinerary/flight/segment structures for offer quality modeling, tax/fee behavior, conversion propensity, and partner/carrier analysis.
  • E-commerce and marketplaces: User/session/order/item/event trees for ranking, return-risk prediction, abuse detection, and behavioral clustering.
  • Product telemetry and operations: Device/session/event streams for reliability monitoring, failure prediction, and root-cause-oriented embedding analysis.
  • Insurance and claims: Policy/claim/line-item/event structures for triage, severity estimation, and outlier detection.
  • Healthcare administration data: Patient/encounter/claim/procedure trees for cohort modeling and utilization pattern analysis (subject to compliance constraints).

Common task patterns across these domains:

  • Supervised prediction from nested records without flattening.
  • Similarity search and clustering on entity embeddings.
  • Counterfactual analysis via context/field pruning.
  • Robust multi-target inference when branches or fields are missing at runtime.

Repository layout

  • src/json2vec/architecture: model, encoders, attention/pooling, parcel flow.
  • src/json2vec/tensorfields: plugin system and datatype implementations.
  • src/json2vec/data: streaming dataset pipeline and tensor instantiation.
  • src/json2vec/processors: dataset-specific shims/transforms.
  • experiments/: self-contained Jsonnet experiment configs.
  • docs/summary.typ: short conceptual overview.
  • docs/whitepaper.typ: extended technical write-up.

Quickstart

1. Install

uv sync

2. Run a training workflow

uv run python -m json2vec --experiment taxml --name local-dev --notes "baseline run"

make train is a shorthand for launching the same workflow.

3. Run serving API

CHECKPOINT=/path/to/model.ckpt uv run python src/json2vec/inference/deployment.py

make serve runs the same deployment entrypoint.

Synthetic Examples

The examples/ directory contains runnable, shim-first tutorials where dataset.root is null and observations are generated by a registered processor.

Each use case has:

  • config.jsonnet: schema + session config.
  • run.py: shim registration and pipeline execution.

Try any of these:

uv run python examples/finance-risk/run.py --batches 2
uv run python examples/travel-pricing/run.py --batches 2
uv run python examples/operations-telemetry/run.py --batches 2

Configuration model

Experiment configuration is Jsonnet-based:

  • experiments/<name>.jsonnet: project-level settings and ordered sessions, with dataset and structure definitions inline per session.
  • dataset.root may be null when observations are generated entirely by the configured processor (useful for tutorials/examples).
  • runtime behavior is environment-driven: WANDB_API_KEY, NEPTUNE_API_TOKEN, COMET_API_KEY, MLFLOW_TRACKING_URI, JSON2VEC_LOGGER, JSON2VEC_NUM_WORKERS, JSON2VEC_PERSISTENT_WORKERS, JSON2VEC_PIN_MEMORY, JSON2VEC_SHARDING (file|chunk|record, default chunk) and JSON2VEC_CHUNK_BATCH_SIZE (default 4096).

Sessions support staged workflows (fit, validate, test, predict) and per-session controls:

  • p_mask, p_prune, permanent pruned addresses,
  • LR/scheduler parameters,
  • trainer args and early stopping.
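A session entry might look roughly like the following Jsonnet fragment; the key names besides p_mask, p_prune, and dataset.root are illustrative and may not match json2vec's actual schema:

```jsonnet
// Hypothetical session config sketch, not the repository's exact schema.
{
  project: 'taxml-demo',
  sessions: [
    {
      stage: 'fit',
      p_mask: 0.15,    // probability of masking a field value
      p_prune: 0.05,   // probability of pruning a branch
      pruned: [],      // addresses pruned permanently for this session
      dataset: {
        root: 's3://my-bucket/ndjson/',  // or null for processor-generated data
        processor: 'my_shim',
      },
    },
  ],
}
```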

Extensibility

Add a new field type

Implement and register:

  • Request
  • TensorField
  • Embedder
  • Decoder
  • loss
  • optional write

in src/json2vec/tensorfields/extensions/.

Add dataset-specific preprocessing

Register a processor in src/json2vec/processors/extensions/ with @register, then reference it from dataset config.
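The shape of such a processor can be sketched as below; json2vec's actual @register decorator and processor signature may differ, so a stand-in registry is defined here to keep the example self-contained:

```python
# Stand-in registry; json2vec provides its own @register in
# src/json2vec/processors.
PROCESSORS = {}


def register(name):
    def wrap(fn):
        PROCESSORS[name] = fn
        return fn
    return wrap


@register("explode_line_items")
def explode_line_items(observations):
    # Explode each order into one observation per line item,
    # carrying the parent order_id down to each item.
    for order in observations:
        for item in order.get("items", []):
            yield {"order_id": order["order_id"], **item}


orders = [{"order_id": "o-1", "items": [{"sku": "a"}, {"sku": "b"}]}]
print(list(PROCESSORS["explode_line_items"](orders)))
# [{'order_id': 'o-1', 'sku': 'a'}, {'order_id': 'o-1', 'sku': 'b'}]
```

Because the processor runs inside the streaming data path, this kind of explode/enrich logic replaces an offline batch feature job.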

Maturity notes

This repository is actively evolving. The design is stable enough for experimentation and internal workloads, and additional improvements are expected as plugin coverage and deployment ergonomics continue to mature.

License

Licensed under the Apache License, Version 2.0. See LICENSE. Attribution details are in NOTICE.

Bibliography

Reference material is listed in BIBLIOGRAPHY.md. Project citation metadata is available in CITATION.bib.
