
JSON -> [*]


JSON2Vec

json2vec is a framework for learning embeddings directly from nested, semi-structured records (JSON, Parquet-like objects, and similar shapes) without flattening them into static feature tables.

It treats each dataset as a tree of contexts and fields, learns datatype-specific leaf embeddings, and routes those embeddings upward through context encoders to produce representations at multiple levels.

Why this repository exists

Most production ML systems on nested business data eventually accumulate:

  • brittle, duplicated feature engineering code,
  • train/serve skew between offline and online transforms,
  • heavy coupling to a static schema.

json2vec aims to make the model itself responsible for value encoding, masking, pruning, and reconstruction so the same logic can run in training and inference.

Ambition and scope

Ambition

  • Provide a reusable representation layer for hierarchical data.
  • Support changing schemas without rebuilding a separate feature store pipeline.
  • Make intermediate embeddings available for diagnostics and downstream modeling.
  • Serve in real time from a checkpoint alone, exactly reproducing the model's training-time logic.

Scope

  • Structured and semi-structured domains (finance, travel, operational telemetry, ecommerce, etc.).
  • Inputs that can be described as nested contexts with typed fields.

Current restrictions

  • Not a general multimodal system (images/audio/video are currently out of scope, but you may include pre-encoded embeddings).
  • Not schema-free; you must define structure and jmespath queries explicitly.
  • Field plugins currently implemented: number, category, dateparts, entity, vector.

Core features

  • Hierarchical modeling from structure definitions: You declare contexts and fields in a configuration file akin to a jsonschema; the model compiles this tree into addressable modules.
  • N-dimensional / ragged nested value support: Field tensorization handles arbitrarily nested list-like values, pads them to fixed shapes (ndarray-style tensors), and tracks value state (valued, null, padded, masked, pruned).
  • Featureless training flow: The model learns value encoding/normalization/tokenization within field plugins instead of depending on a separate handcrafted feature pipeline.
  • jmespath query extraction: Each field request has a query powered by jmespath, letting you pull values from deeply nested JSON-like records without flattening upstream.
  • SHIM processor support: Dataset processors (registered via @register) can mutate, filter, explode, or enrich observations before tensorization. This supports domain-specific logic without offline batch feature jobs.
  • Masking and pruning controls: p_mask and p_prune support self-supervised reconstruction and robustness to missing branches; permanent field pruning is supported per training session.
  • Pruning-based feature importance: Because pruning is native to the model path, you can run controlled ablations (field/context removal) and measure impact as an intrinsic importance signal.
  • Multi-level embedding outputs: You can emit intermediate embeddings at leaf/context/root addresses (session.output), not only final decoded predictions.
  • Shared train/serve logic: Training and online inference both use the same structure, field plugins, and processors, reducing train/serve skew risk.

Architecture at a glance

The model is a tree of modules:

  • Leaf nodes: datatype-specific embedders/decoders.
  • Context nodes: stacked rotary self-attention + learned-query cross-attention pooling.
  • Routing unit: "parcels" carrying tensor payloads with origin and destination addresses.
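The "parcel" idea can be sketched as follows; the class and field names are hypothetical stand-ins, not json2vec's actual API:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Parcel:
    # Hypothetical sketch: a payload tagged with tree addresses so context
    # nodes know where an embedding came from and where it should go.
    payload: Any                   # e.g. a tensor of leaf/context embeddings
    origin: tuple[str, ...]        # address of the producing node
    destination: tuple[str, ...]   # address of the consuming context node


def route(parcels: list[Parcel], address: tuple[str, ...]) -> list[Parcel]:
    """Collect the parcels addressed to a given context node."""
    return [p for p in parcels if p.destination == address]


leaf = Parcel([0.1, 0.2], origin=("customer", "accounts", "type"), destination=("customer", "accounts"))
root = Parcel([0.3], origin=("customer", "accounts"), destination=("customer",))
print([p.origin for p in route([leaf, root], ("customer", "accounts"))])
# [('customer', 'accounts', 'type')]
```

Addressing every module this way is what makes per-address pruning and intermediate embedding outputs cheap: both reduce to filtering parcels.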

Tree of encoding modules

Single context node

The repository also includes a full example architecture diagram used in the TaxML configuration.

Example configured module tree

Data path

The data path is iterable/streaming and designed for large datasets:

  • fetch -> read -> process -> shuffle -> batch -> transform -> mask -> prune

Supported sources/formats in the current code:

  • Local filesystem and S3.
  • ndjson, parquet, feather, avro, csv, orc, json.
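A stdlib-only sketch of this iterable pipeline, with stage names mirroring fetch -> read -> process -> shuffle -> batch (the real stages live in src/json2vec/data and additionally handle S3, columnar formats, masking, and pruning; field names here are illustrative):

```python
import io
import json
import random
from itertools import islice


def read_ndjson(stream):
    # read: decode one observation per line, lazily.
    for line in stream:
        yield json.loads(line)


def process(records):
    # process: a shim-style transform enriching observations before tensorization.
    for record in records:
        record["n_items"] = len(record.get("items", []))
        yield record


def shuffle(records, buffer_size=64, seed=0):
    # shuffle: a bounded buffer keeps memory constant on large datasets.
    rng = random.Random(seed)
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer


def batch(records, size=2):
    # batch: group the stream into fixed-size chunks for tensorization.
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


raw = io.StringIO('{"items": [1, 2]}\n{"items": []}\n{"items": [3, 4, 5]}\n')
batches = list(batch(shuffle(process(read_ndjson(raw))), size=2))
print([len(b) for b in batches])  # [2, 1]
```

Every stage is a generator over the previous one, so nothing is materialized beyond the shuffle buffer and the current batch.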

Pipeline stages

Potential use cases

  • Financial services: Customer/account/transaction/statement hierarchies for fraud detection, risk scoring, customer similarity, and anomaly detection.
  • Travel and pricing: Itinerary/flight/segment structures for offer quality modeling, tax/fee behavior, conversion propensity, and partner/carrier analysis.
  • E-commerce and marketplaces: User/session/order/item/event trees for ranking, return-risk prediction, abuse detection, and behavioral clustering.
  • Product telemetry and operations: Device/session/event streams for reliability monitoring, failure prediction, and root-cause-oriented embedding analysis.
  • Insurance and claims: Policy/claim/line-item/event structures for triage, severity estimation, and outlier detection.
  • Healthcare administration data: Patient/encounter/claim/procedure trees for cohort modeling and utilization pattern analysis (subject to compliance constraints).

Common task patterns across these domains:

  • Supervised prediction from nested records without flattening.
  • Similarity search and clustering on entity embeddings.
  • Counterfactual analysis via context/field pruning.
  • Robust multi-target inference when branches or fields are missing at runtime.

Repository layout

  • src/json2vec/architecture: model, encoders, attention/pooling, parcel flow.
  • src/json2vec/tensorfields: plugin system and datatype implementations.
  • src/json2vec/data: streaming dataset pipeline and tensor instantiation.
  • src/json2vec/processors: dataset-specific shims/transforms.
  • experiments/: self-contained Jsonnet experiment configs.
  • docs/summary.typ: short conceptual overview.
  • docs/whitepaper.typ: extended technical write-up.

Quickstart

1. Install

uv sync

2. Run a training workflow

uv run python -m json2vec --experiment taxml --name local-dev --notes "baseline run"

make train is a shorthand for launching the same workflow.

3. Run serving API

CHECKPOINT=/path/to/model.ckpt uv run python src/json2vec/inference/deployment.py

make serve runs the same deployment entrypoint.

Synthetic Examples

The examples/ directory contains runnable, shim-first tutorials where dataset.root is null and observations are generated by a registered processor.

Each use case has:

  • config.jsonnet: schema + session config.
  • run.py: shim registration and pipeline execution.

Try any of these:

uv run python examples/finance-risk/run.py --batches 2
uv run python examples/travel-pricing/run.py --batches 2
uv run python examples/operations-telemetry/run.py --batches 2

Configuration model

Experiment configuration is Jsonnet-based:

  • experiments/<name>.jsonnet: project-level settings and ordered sessions, with dataset and structure definitions inline per session.
  • dataset.root may be null when observations are generated entirely by the configured processor (useful for tutorials/examples).
  • runtime behavior is environment-driven: WANDB_API_KEY, NEPTUNE_API_TOKEN, COMET_API_KEY, MLFLOW_TRACKING_URI, JSON2VEC_LOGGER, JSON2VEC_NUM_WORKERS, JSON2VEC_PERSISTENT_WORKERS, JSON2VEC_PIN_MEMORY, JSON2VEC_SHARDING (file|chunk|record, default chunk) and JSON2VEC_CHUNK_BATCH_SIZE (default 4096).

Sessions support staged workflows (fit, validate, test, predict) and per-session controls:

  • p_mask, p_prune, permanent pruned addresses,
  • LR/scheduler parameters,
  • trainer args and early stopping.
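A session entry might look roughly like the following Jsonnet fragment; the key names besides p_mask, p_prune, and dataset.root are illustrative and may not match json2vec's actual schema:

```jsonnet
// Hypothetical session config sketch, not the repository's exact schema.
{
  project: 'taxml-demo',
  sessions: [
    {
      stage: 'fit',
      p_mask: 0.15,    // probability of masking a field value
      p_prune: 0.05,   // probability of pruning a branch
      pruned: [],      // addresses pruned permanently for this session
      dataset: {
        root: 's3://my-bucket/ndjson/',  // or null for processor-generated data
        processor: 'my_shim',
      },
    },
  ],
}
```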

Extensibility

Add a new field type

Implement and register:

  • Request
  • TensorField
  • Embedder
  • Decoder
  • loss
  • optional write

in src/json2vec/tensorfields/extensions/.

Add dataset-specific preprocessing

Register a processor in src/json2vec/processors/extensions/ with @register, then reference it from dataset config.
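The shape of such a processor can be sketched as below; json2vec's actual @register decorator and processor signature may differ, so a stand-in registry is defined here to keep the example self-contained:

```python
# Stand-in registry; json2vec provides its own @register in
# src/json2vec/processors.
PROCESSORS = {}


def register(name):
    def wrap(fn):
        PROCESSORS[name] = fn
        return fn
    return wrap


@register("explode_line_items")
def explode_line_items(observations):
    # Explode each order into one observation per line item,
    # carrying the parent order_id down to each item.
    for order in observations:
        for item in order.get("items", []):
            yield {"order_id": order["order_id"], **item}


orders = [{"order_id": "o-1", "items": [{"sku": "a"}, {"sku": "b"}]}]
print(list(PROCESSORS["explode_line_items"](orders)))
# [{'order_id': 'o-1', 'sku': 'a'}, {'order_id': 'o-1', 'sku': 'b'}]
```

Because the processor runs inside the streaming data path, this kind of explode/enrich logic replaces an offline batch feature job.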

Maturity notes

This repository is actively evolving. The design is stable enough for experimentation and internal workloads, and additional improvements are expected as plugin coverage and deployment ergonomics continue to mature.

License

Licensed under the Apache License, Version 2.0. See LICENSE. Attribution details are in NOTICE.

Bibliography

Reference material is listed in BIBLIOGRAPHY.md. Project citation metadata is available in CITATION.bib.
