JSON -> [*]
# JSON2Vec
json2vec is a framework for learning embeddings directly from nested, semi-structured records (JSON, Parquet-like objects, and similar shapes) without flattening them into static feature tables.
It treats each dataset as a tree of contexts and fields, learns datatype-specific leaf embeddings, and routes those embeddings upward through context encoders to produce representations at multiple levels.
## Why this repository exists
Most production ML systems on nested business data eventually accumulate:
- brittle, duplicated feature engineering code,
- train/serve skew between offline and online transforms,
- heavy coupling to a static schema.
json2vec aims to make the model itself responsible for value encoding, masking, pruning, and reconstruction so the same logic can run in training and inference.
## Ambition and scope

### Ambition
- Provide a reusable representation layer for hierarchical data.
- Support changing schemas without rebuilding a separate feature store pipeline.
- Make intermediate embeddings available for diagnostics and downstream modeling.
- Enable real-time serving built solely from a checkpoint, so inference exactly reproduces training-time model logic.
### Scope
- Structured and semi-structured domains (finance, travel, operational telemetry, ecommerce, etc.).
- Inputs that can be described as nested contexts with typed fields.
### Current restrictions
- Not a general multimodal system (images/audio/video are currently out of scope, but you may include pre-encoded embeddings).
- Not schema-free; you must define the structure and `jmespath` queries explicitly.
- Field plugins currently implemented: `number`, `category`, `dateparts`, `entity`, `vector`.
## Core features
- Hierarchical modeling from structure definitions: You declare contexts and fields in a configuration file akin to a `jsonschema`; the model compiles this tree into addressable modules.
- N-dimensional / ragged nested value support: Field tensorization handles arbitrarily nested list-like values, pads them to fixed shapes (`ndarray`-style tensors), and tracks value state (`valued`, `null`, `padded`, `masked`, `pruned`).
- Featureless training flow: The model learns value encoding/normalization/tokenization within field plugins instead of depending on a separate handcrafted feature pipeline.
- `jmespath` query extraction: Each field request has a `query` powered by `jmespath`, letting you pull values from deeply nested JSON-like records without flattening upstream.
- SHIM processor support: Dataset processors (registered via `@register`) can mutate, filter, explode, or enrich observations before tensorization. This supports domain-specific logic without offline batch feature jobs.
- Masking and pruning controls: `p_mask` and `p_prune` support self-supervised reconstruction and robustness to missing branches; permanent field pruning is supported per training session.
- Pruning-based feature importance: Because pruning is native to the model path, you can run controlled ablations (field/context removal) and measure impact as an intrinsic importance signal.
- Multi-level embedding outputs: You can emit intermediate embeddings at leaf/context/root addresses (`session.output`), not only final decoded predictions.
- Shared train/serve logic: Training and online inference both use the same structure, field plugins, and processors, reducing train/serve skew risk.
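To make the query-based extraction concrete, here is a standalone sketch of the idea. json2vec itself uses the real `jmespath` library; the `search` function below is a hypothetical stand-in that supports only dotted paths with `[*]` projections:

```python
# Minimal stand-in for jmespath-style extraction (illustrative only).
# json2vec relies on the real jmespath library; this sketch supports
# only dotted paths with [*] projections, to show the idea.
from typing import Any


def search(query: str, record: Any) -> list:
    """Resolve a dotted query like 'account.transactions[*].amount'."""
    current = [record]
    for part in query.split("."):
        project = part.endswith("[*]")
        key = part[:-3] if project else part
        next_level = []
        for node in current:
            value = node.get(key) if isinstance(node, dict) else None
            if value is None:
                continue
            if project and isinstance(value, list):
                next_level.extend(value)  # fan out over list elements
            else:
                next_level.append(value)
        current = next_level
    return current


record = {
    "account": {
        "id": "a-1",
        "transactions": [{"amount": 12.5}, {"amount": 3.0}],
    }
}

print(search("account.transactions[*].amount", record))  # [12.5, 3.0]
```

The same pattern is what a field `query` expresses: pull deeply nested values without flattening the record upstream.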
## Architecture at a glance
The model is a tree of modules:
- Leaf nodes: datatype-specific embedders/decoders.
- Context nodes: stacked rotary self-attention + learned-query cross-attention pooling.
- Routing unit: "parcels" carrying tensor payloads with `origin` and `destination` addresses.
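The parcel idea can be sketched as a payload tagged with origin and destination addresses in the context tree. All names below are illustrative, not json2vec's actual API:

```python
# Hypothetical sketch of parcel routing: a payload tagged with origin and
# destination addresses in the context tree. Names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Parcel:
    origin: tuple[str, ...]       # e.g. a leaf address
    destination: tuple[str, ...]  # e.g. the parent context
    payload: list[float]          # stands in for a tensor


def route(parcels, destination):
    """Collect every parcel addressed to a given context node."""
    return [p for p in parcels if p.destination == destination]


leaf = Parcel(("account", "transactions", "amount"), ("account",), [0.1, 0.2])
other = Parcel(("user", "age"), ("user",), [0.3])
print([p.origin for p in route([leaf, other], ("account",))])
```

A context encoder would then pool the payloads of all parcels routed to its address.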
The repository also includes a full example architecture diagram used in the TaxML configuration.
## Data path
The data path is iterable/streaming and designed for large datasets:
`fetch -> read -> process -> shuffle -> batch -> transform -> mask -> prune`
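The early stages of that pipeline can be sketched as composed generators. Function names here are illustrative; the real pipeline adds transform/mask/prune stages and tensorization:

```python
# Generator-pipeline sketch of the streaming data path
# (read -> process -> shuffle -> batch). Names are illustrative.
import random
from itertools import islice


def read(n):
    # Stand-in for fetching/reading records from a source.
    for i in range(n):
        yield {"id": i, "amount": float(i)}


def process(records):
    # Shim-style enrichment of each observation.
    for r in records:
        yield {**r, "amount_scaled": (r["amount"] + 1.0) ** 0.5}


def shuffle(records, buffer_size=8, seed=0):
    # Buffered shuffle, as streaming loaders typically do.
    rng = random.Random(seed)
    buf = []
    for r in records:
        buf.append(r)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:
        yield buf.pop(rng.randrange(len(buf)))


def batch(records, size):
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


pipeline = batch(shuffle(process(read(10))), size=4)
sizes = [len(b) for b in pipeline]
print(sizes)  # [4, 4, 2]
```

Because every stage is a generator, the whole path stays iterable and never materializes the dataset in memory.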
Supported sources/formats in the current code:
- Sources: local filesystem and S3.
- Formats: `ndjson`, `parquet`, `feather`, `avro`, `csv`, `orc`, `json`.
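As a minimal illustration of the read stage, ndjson can be streamed line by line with only the standard library (a sketch; the real readers also cover S3 and the columnar formats):

```python
# Stream ndjson records one line at a time using only the stdlib.
import io
import json


def iter_ndjson(stream):
    """Yield one decoded record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


raw = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
print([r["id"] for r in iter_ndjson(raw)])  # [1, 2, 3]
```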
## Potential use cases
- Financial services: Customer/account/transaction/statement hierarchies for fraud detection, risk scoring, customer similarity, and anomaly detection.
- Travel and pricing: Itinerary/flight/segment structures for offer quality modeling, tax/fee behavior, conversion propensity, and partner/carrier analysis.
- E-commerce and marketplaces: User/session/order/item/event trees for ranking, return-risk prediction, abuse detection, and behavioral clustering.
- Product telemetry and operations: Device/session/event streams for reliability monitoring, failure prediction, and root-cause-oriented embedding analysis.
- Insurance and claims: Policy/claim/line-item/event structures for triage, severity estimation, and outlier detection.
- Healthcare administration data: Patient/encounter/claim/procedure trees for cohort modeling and utilization pattern analysis (subject to compliance constraints).
Common task patterns across these domains:
- Supervised prediction from nested records without flattening.
- Similarity search and clustering on entity embeddings.
- Counterfactual analysis via context/field pruning.
- Robust multi-target inference when branches or fields are missing at runtime.
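For the similarity-search pattern above, a toy sketch: rank entities by cosine similarity over their embeddings. The embeddings here are fabricated stand-ins for vectors you would emit from intermediate addresses:

```python
# Toy similarity search over entity embeddings via cosine similarity.
# The vectors are fabricated; real ones would come from the model's
# intermediate embedding outputs.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


embeddings = {
    "customer-1": [1.0, 0.0, 0.2],
    "customer-2": [0.9, 0.1, 0.3],
    "customer-3": [0.0, 1.0, 0.0],
}


def nearest(query_id, k=1):
    scores = {
        other: cosine(embeddings[query_id], vec)
        for other, vec in embeddings.items()
        if other != query_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(nearest("customer-1"))  # ['customer-2']
```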
## Repository layout
- `src/json2vec/architecture`: model, encoders, attention/pooling, parcel flow.
- `src/json2vec/tensorfields`: plugin system and datatype implementations.
- `src/json2vec/data`: streaming dataset pipeline and tensor instantiation.
- `src/json2vec/processors`: dataset-specific shims/transforms.
- `experiments/`: self-contained Jsonnet experiment configs.
- `docs/summary.typ`: short conceptual overview.
- `docs/whitepaper.typ`: extended technical write-up.
## Quickstart

1. Install:

   ```shell
   uv sync
   ```

2. Run a training workflow:

   ```shell
   uv run python -m json2vec --experiment taxml --name local-dev --notes "baseline run"
   ```

   `make train` is a shorthand for launching the same workflow.

3. Run the serving API:

   ```shell
   CHECKPOINT=/path/to/model.ckpt uv run python src/json2vec/inference/deployment.py
   ```

   `make serve` runs the same deployment entrypoint.
## Synthetic Examples

The `examples/` directory contains runnable, shim-first tutorials where `dataset.root` is `null` and observations are generated by a registered processor.

Each use case has:
- `config.jsonnet`: schema + session config.
- `run.py`: shim registration and pipeline execution.

Try any of these:

```shell
uv run python examples/finance-risk/run.py --batches 2
uv run python examples/travel-pricing/run.py --batches 2
uv run python examples/operations-telemetry/run.py --batches 2
```
## Configuration model

Experiment configuration is Jsonnet-based:
- `experiments/<name>.jsonnet`: project-level settings and ordered sessions, with dataset and structure definitions inline per session.
- `dataset.root` may be `null` when observations are generated entirely by the configured processor (useful for tutorials/examples).
- Runtime behavior is environment-driven: `WANDB_API_KEY`, `NEPTUNE_API_TOKEN`, `COMET_API_KEY`, `MLFLOW_TRACKING_URI`, `JSON2VEC_LOGGER`, `JSON2VEC_NUM_WORKERS`, `JSON2VEC_PERSISTENT_WORKERS`, `JSON2VEC_PIN_MEMORY`, `JSON2VEC_SHARDING` (`file` | `chunk` | `record`, default `chunk`), and `JSON2VEC_CHUNK_BATCH_SIZE` (default `4096`).
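A hedged sketch of how such environment-driven settings might be read. json2vec's actual parsing may differ (e.g. the worker-count default here is an assumption), but the variable names and the documented `JSON2VEC_SHARDING`/`JSON2VEC_CHUNK_BATCH_SIZE` defaults match the list above:

```python
# Sketch of reading environment-driven runtime settings.
# Variable names and the sharding/chunk-size defaults follow the docs;
# the num_workers default of 0 is an assumption for illustration.
import os

VALID_SHARDING = {"file", "chunk", "record"}


def runtime_settings(env=os.environ):
    sharding = env.get("JSON2VEC_SHARDING", "chunk")
    if sharding not in VALID_SHARDING:
        raise ValueError(
            f"JSON2VEC_SHARDING must be one of {sorted(VALID_SHARDING)}"
        )
    return {
        "logger": env.get("JSON2VEC_LOGGER"),
        "num_workers": int(env.get("JSON2VEC_NUM_WORKERS", "0")),
        "sharding": sharding,
        "chunk_batch_size": int(env.get("JSON2VEC_CHUNK_BATCH_SIZE", "4096")),
    }


print(runtime_settings({}))
```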
Sessions support staged workflows (fit, validate, test, predict) and per-session controls:
- `p_mask`, `p_prune`, and permanent `pruned` addresses,
- LR/scheduler parameters,
- trainer args and early stopping.
## Extensibility

### Add a new field type
Implement and register, in `src/json2vec/tensorfields/extensions/`:
- `Request`
- `TensorField`
- `Embedder`
- `Decoder`
- `loss`
- optional `write`
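A hypothetical skeleton for the embedder/decoder pair of a new field type. The real base classes live under `src/json2vec/tensorfields/`, and their exact signatures may differ; this only shows the shape of a round-trippable plugin for a boolean field:

```python
# Hypothetical embedder/decoder pair for a boolean field type.
# The real plugin base classes and signatures may differ; this sketch
# only illustrates the encode/decode round trip a plugin supports.


class BooleanEmbedder:
    """Map a boolean (or null) value to a fixed-width vector."""

    def encode(self, value):
        if value is None:                 # null state gets a distinct code
            return [0.0, 0.0]
        return [1.0, 1.0] if value else [1.0, 0.0]


class BooleanDecoder:
    """Invert the embedder, e.g. for reconstruction loss targets."""

    def decode(self, vector):
        if vector[0] == 0.0:              # null marker
            return None
        return vector[1] >= 0.5


embedder, decoder = BooleanEmbedder(), BooleanDecoder()
for value in (True, False, None):
    assert decoder.decode(embedder.encode(value)) == value
print("round-trip ok")
```

A real plugin would additionally define a `Request` (with its `jmespath` query), tensorization via `TensorField`, and a `loss`.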
### Add dataset-specific preprocessing
Register a processor in `src/json2vec/processors/extensions/` with `@register`, then reference it from the dataset config.
## Maturity notes
This repository is actively evolving. The design is stable enough for experimentation and internal workloads, and additional improvements are expected as plugin coverage and deployment ergonomics continue to mature.
## License
Licensed under the Apache License, Version 2.0. See LICENSE.
Attribution details are in NOTICE.
## Bibliography
Reference material is listed in BIBLIOGRAPHY.md.
Project citation metadata is available in CITATION.bib.
## File details

Details for the file `json2vec-0.1.0.tar.gz`.

File metadata:
- Download URL: json2vec-0.1.0.tar.gz
- Size: 49.8 kB
- Tags: Source
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `917e0e06b33534faffceb867b6b0c32dac148261f54e2e424e6faa20410a7edd` |
| MD5 | `061af8bbba80abe7c7c367fd6bacd095` |
| BLAKE2b-256 | `97085833d14c535d892012e69263a98cc02105a66c66cba11393f0e74fe900c7` |

Provenance (attestation bundle for `json2vec-0.1.0.tar.gz`):
- Publisher: pypi-publish.yml on granthamtaylor/json2vec
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: json2vec-0.1.0.tar.gz
- Subject digest: 917e0e06b33534faffceb867b6b0c32dac148261f54e2e424e6faa20410a7edd
- Sigstore transparency entry: 1351692388
- Permalink: granthamtaylor/json2vec@d372b72cd0a58d2024beae92eb01b17e26b43b29
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/granthamtaylor
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@d372b72cd0a58d2024beae92eb01b17e26b43b29
- Trigger Event: release
## File details

Details for the file `json2vec-0.1.0-py3-none-any.whl`.

File metadata:
- Download URL: json2vec-0.1.0-py3-none-any.whl
- Size: 64.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f5d981810819b6be9387e8005e811b594b69fe7e7fa82940758a2fbbf40b0050` |
| MD5 | `4718d543094b9de0f7e0f9bdbce7db6d` |
| BLAKE2b-256 | `41f09fa1e17de50f84db55fc98ccfa3a3a764e1c2541402b039128bb20d8f55e` |

Provenance (attestation bundle for `json2vec-0.1.0-py3-none-any.whl`):
- Publisher: pypi-publish.yml on granthamtaylor/json2vec
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: json2vec-0.1.0-py3-none-any.whl
- Subject digest: f5d981810819b6be9387e8005e811b594b69fe7e7fa82940758a2fbbf40b0050
- Sigstore transparency entry: 1351692455
- Permalink: granthamtaylor/json2vec@d372b72cd0a58d2024beae92eb01b17e26b43b29
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/granthamtaylor
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@d372b72cd0a58d2024beae92eb01b17e26b43b29
- Trigger Event: release