JSON2Vec
json2vec is a schema-driven framework for learning embeddings and task
heads directly from nested, semi-structured records without flattening them
into a fixed feature table first.
The central idea is that the schema is the encoder. A declared tree of contexts and typed fields becomes an addressable neural graph: leaf tensorfield plugins encode raw values; context nodes aggregate child embeddings with rotary self-attention and learned-query cross-attention pooling; and datatype-specific decoders reconstruct masked, pruned, or supervised targets from the surrounding hierarchy.
This makes json2vec a factory for structure-aware encoders rather than a
single domain model. Customer/account/transaction data, flight itineraries,
order fulfillment events, clickstream sessions, and other nested records can
all use the same machinery while keeping their proprietary data, schemas, and
trained checkpoints private.
What Makes This Different
- Attributed-distance embeddings. The model can emit embeddings at any configured field or context, not only at the root. That means two observations can be similar overall while still exposing which branch of the hierarchy accounts for the difference: customer profile, monthly statement, login session, transaction history, or any other declared context.
- Prune-trained counterfactuals. Training can periodically remove whole fields, not just mask individual values. At inference time, the same mechanism supports zero-shot ablation questions such as "what changes if device data is unavailable?" without retraining a separate model for every feature-removal scenario.
- One path for self-supervised and supervised learning. Masked values, pruned fields, and explicit supervised targets all flow through the same datatype-specific heads. A new tensorfield type brings its own embedding, decoding, loss, and writing logic, so the framework stays reusable as schemas grow.
- Schema evolution is a first-class workflow. Because modules are addressed by the schema tree, structures can be patched, fields can be added or removed, and selected fields can be pruned across sessions without rebuilding a separate feature pipeline.
- Production semantics for missingness. `null`, `padded`, `masked`, `pruned`, and `valued` are distinct states in the tensorfield type system. They are not collapsed into one generic missing-value bucket.
- Online state lives with the model. Stateful components such as category vocabularies, counters, and numeric normalization state are learned during streaming training and serialized with checkpoints, so deployment does not depend on a parallel tokenizer or normalizer artifact.
- Training-serving parity. The same configured graph is used for fitting, validation, testing, batch prediction, and LitServe-backed online inference.
The attributed embeddings and prune-trained ablations are model-level explanation primitives. They help answer where two records differ and how a prediction changes when an information source is withheld. They are not a complete compliance story by themselves, but they make governance and audit layers easier to build on top of the representation layer.
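The five missingness states named above can be sketched as a small enum. This is illustrative only; the state names mirror the documented tokens, but the real tensorfield type system tracks them per value inside the plugins.

```python
from enum import Enum, auto

class FieldState(Enum):
    """Illustrative sketch of the distinct per-value states json2vec
    tracks; names mirror the documented state tokens."""
    VALUED = auto()   # a real observed value
    NULL = auto()     # explicitly null in the source record
    PADDED = auto()   # filler added to align batch shapes
    MASKED = auto()   # hidden during training as a reconstruction target
    PRUNED = auto()   # whole field withheld (counterfactual ablation)

# MASKED and PRUNED values become supervised reconstruction targets;
# the other states are never confused with them.
trainable_targets = {FieldState.MASKED, FieldState.PRUNED}
```

Keeping the states distinct is what lets the same checkpoint answer "value was null" differently from "value was withheld".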
Where It Fits
Use json2vec when the hierarchy is part of the signal:
- customer, account, transaction, statement, device, and session records
- flight itineraries, legs, segments, and events
- orders, shipments, fulfillment events, and support histories
- entities with repeated sub-objects, evolving schemas, and mixed datatypes
- embedding retrieval, anomaly detection, counterfactual ablation, and multi-target prediction over nested records
What It Does Not Do
json2vec stops at the representation and typed prediction layer. It does not
try to be a feature store, governance system, rule engine, authorization layer,
decision-capture system, or audit platform. Those systems can consume
json2vec embeddings and predictions, but their policies and operational
controls remain separate concerns.
It also does not require users to publish data, schemas, checkpoints, or model parameters. The open-source layer is the reusable encoder and runtime infrastructure. Your data stays yours, as do your parameters; the framework assumes that model parameters will not be shared.
What Is In This Repository
This repository currently contains:
- the core library under `src/json2vec/`
- tensorfield plugins for `number`, `category`, `dateparts`, `entity`, `vector`, and `text`
- a processor registry for dataset-specific preprocessing
- a LitServe deployment entrypoint for serving from checkpoints
- tests covering structure loading, data processing, tensorfields, training helpers, logging, and inference
- diagrams plus longer design docs in `docs/`
It does not currently ship maintained example experiments or `make` shortcuts. Older references to `experiments/`, `examples/`, and `make train` were removed because they no longer reflect the checked-in code.
More examples based on publicly available data will be added soon to showcase implementation and expected behavior.
Install
For local development:
uv sync
If you want an editable install:
pip install -e .
The package requires Python >=3.12.
Core Concepts
- `Structure` defines the model tree.
- `Context` nodes describe hierarchical grouping and aggregation.
- Field `Request` nodes declare a `type`, a `query`, and type-specific options.
- `Address` values are stable paths such as `root/account/transaction/amount`.
- `jmespath` queries extract values from each observation.
- `TensorField` instances preserve typed content plus state tokens such as `valued`, `null`, `padded`, `masked`, and `pruned`.
- `Parcel` objects carry embeddings from leaves to parent contexts and then up the tree.
- `heritage` is the path from a leaf to the root; decoders use that path as context when reconstructing masked, pruned, or supervised targets.
- `Session` combines a dataset, structure, task, masking/pruning controls, and selected embedding outputs.
- `Experiment` is an ordered list of sessions loaded from config files.
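To make the query concept concrete, here is a stdlib-only sketch of what a query like `[*].id` extracts from a batch of observations. The real implementation uses the `jmespath` library; this hand-rolled helper only handles the simple `[*].key` projection pattern, for illustration.

```python
# Sketch of the `[*].<key>` projection pattern. json2vec uses jmespath;
# this helper is a stdlib stand-in for illustration only.
def project_key(observations: list, key: str) -> list:
    """Mimic the jmespath query `[*].<key>` over a list of records."""
    return [obs.get(key) for obs in observations]

records = [
    {"id": "a1", "amount": 12.5},
    {"id": "b2", "amount": 3.0},
    {"amount": 7.75},  # missing id -> would enter the `null` state
]

print(project_key(records, "id"))  # ['a1', 'b2', None]
```

Note how a missing key yields `None` rather than being dropped; the tensorfield layer then records it as `null` instead of silently shortening the sequence.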
Supported session tasks are:
`fit`, `validate`, `test`, and `predict`.
Supported dataset suffixes are:
`ndjson`, `parquet`, `feather`, `avro`, `csv`, `orc`, and `json`.
Supported dataset roots are local paths and `s3://...` URIs. If `dataset.root` is null, the pipeline runs in processor-driven mode and expects the configured processor to generate observations.
This will likely expand to support `@register`-based UDFs for arbitrary data sourcing and additional file formats.
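As a point of reference for the `ndjson` suffix, here is a minimal stdlib sketch of consuming a newline-delimited JSON stream, one observation per line. The real pipeline adds buffering, sampling, sharding, and the processor stage on top of this.

```python
import io
import json

def iter_ndjson(stream) -> list:
    """Parse newline-delimited JSON: one observation dict per line."""
    return [json.loads(line) for line in stream if line.strip()]

raw = '{"id": "a1", "amount": 12.5}\n{"id": "b2", "amount": 3.0}\n'
observations = iter_ndjson(io.StringIO(raw))
print(len(observations))  # 2
```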
How The Graph Runs
For each batch:
- Each field request extracts values with its `jmespath` query.
- The matching tensorfield plugin tensorizes those values, updates any online state allowed for the current split, and records trainable targets when masking or pruning occurs.
- Leaf embedders emit parcels to their parent contexts.
- Context nodes run bottom-up. Each context concatenates available child parcels, applies rotary transformer layers, compresses with learned-query cross-attention, and emits a new parcel to its parent.
- Leaf decoders consume the parcel sequence along their heritage path to reconstruct trainable targets.
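The bottom-up pass over the declared tree can be sketched as a toy post-order traversal. This is pure Python for illustration: real contexts aggregate child parcels with rotary attention and learned-query cross-attention pooling, not tuples, and leaves run actual embedders.

```python
# Toy sketch of the bottom-up context pass: each context node collects
# its children's parcels, "aggregates" them (here: a tagged tuple), and
# emits one parcel to its parent. Real aggregation uses rotary
# self-attention plus learned-query cross-attention pooling.
def run_context(node: dict):
    """Return the parcel a node emits to its parent."""
    child_parcels = [run_context(child) for child in node.get("fields", [])]
    if not child_parcels:          # leaf field: would embed the raw value
        return ("leaf", node["name"])
    return ("context", node["name"], child_parcels)

tree = {
    "name": "root",
    "fields": [
        {"name": "account", "fields": [{"name": "amount"}]},
        {"name": "identifier"},
    ],
}
parcel = run_context(tree)
print(parcel[0], parcel[1])  # context root
```

The post-order recursion guarantees that every context sees complete child parcels before it aggregates, which is what "context nodes run bottom-up" means in the batch loop above.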
Random `p_mask` corrupts individual values. Random `p_prune` removes whole field instances across an observation. Session-level pruned fields are always withheld and become supervised targets; session-level output addresses are serialized as embeddings during prediction.
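The difference between the two corruption probabilities can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: state tracking really lives in the tensorfield plugins, and the sentinel strings here stand in for proper state tokens.

```python
import random

# Sketch of per-batch corruption: p_mask hides individual values,
# p_prune withholds the whole field across the observation.
# The "masked"/"pruned" strings stand in for real state tokens.
def corrupt(values: list, p_mask: float, p_prune: float, rng: random.Random) -> list:
    if rng.random() < p_prune:
        # whole field removed: every value becomes a supervised target
        return ["pruned"] * len(values)
    return ["masked" if rng.random() < p_mask else v for v in values]

rng = random.Random(0)
print(corrupt([1.0, 2.0, 3.0], p_mask=0.15, p_prune=0.05, rng=rng))
```

Masking trains local reconstruction from surrounding context; pruning trains the model to cope with an entire information source being absent, which is what powers the zero-shot ablation questions described earlier.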
Minimal Training Workflow
The CLI entrypoint is:
uv run python -m json2vec --experiments /path/to/configs --experiment demo --name local-dev --notes "first run"
The same function is also exposed as the `train` console script after installation.
Config discovery is directory-based. json2vec can load `.json`, `.yaml`, `.yml`, `.toml`, and `.jsonnet` experiment files. If a config directory contains exactly one experiment file, `--experiment` can be omitted.
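Directory-based discovery can be sketched like this, assuming the documented suffixes. The helper name and the stem-as-name convention are assumptions for illustration; the real loader also parses each config format.

```python
from pathlib import Path

# Suffixes taken from the documented list of loadable experiment files.
SUFFIXES = {".json", ".yaml", ".yml", ".toml", ".jsonnet"}

def discover_experiments(config_dir: Path) -> dict:
    """Sketch: map experiment name (file stem) -> config file path."""
    return {
        path.stem: path
        for path in sorted(config_dir.iterdir())
        if path.suffix in SUFFIXES
    }
```

When the resulting mapping has exactly one entry, no `--experiment` flag is needed to disambiguate.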
A minimal YAML experiment looks like this:
```yaml
project: demo
sessions:
  - name: train
    task: fit
    learning_rate: 0.001
    p_mask: 0.15
    p_prune: 0.05
    output:
      - root
    dataset:
      root: /path/to/data
      sample_rate: 1.0
      file_buffer_size: 16
      observation_buffer_size: 16
      processor: default
      kwargs: {}
      suffix: ndjson
      patterns:
        train: .*
        validate: .*
        test: .*
        predict: .*
    structure:
      name: demo-structure
      type: structure
      batch_size: 2
      dropout: 0.1
      d_model: 16
      fields:
        name: root
        type: context
        context_size: 1
        n_outputs: 1
        n_layers: 1
        n_heads: 4
        n_linear: 1
        fields:
          - name: identifier
            type: category
            query: "[*].id"
            max_vocab_size: 1024
```
`fit` sessions write checkpoints to `models/`. In multi-session experiments, the output checkpoint from a `fit` session is automatically passed to later `validate`, `test`, or `predict` sessions.
To turn a field into a supervised target, include its address in `session.pruned` for a `fit`, `validate`, `test`, or `predict` session. The model will withhold that field from the encoder and use the same datatype-specific decoder that is used for masked/pruned reconstruction. To export embeddings, include field or context addresses in `session.output`.
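A hedged sketch of what that looks like in a session config, reusing the `root` and `identifier` names from the minimal example above (the address `root/identifier` is an assumption about how the example's addresses compose):

```yaml
sessions:
  - name: train
    task: fit
    pruned:
      - root/identifier   # withheld from the encoder, decoded as a target
    output:
      - root              # serialized as an embedding during prediction
```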
Inference And Serving
Batch prediction uses the same experiment/session machinery as training. Prediction outputs are written to `tmp/predictions/`.
Checkpoints carry the Lightning weights, serialized session configuration, and stateful tensorfield state such as online category vocabularies, numeric normalization buffers, and class-frequency counters. This tight coupling is intentional: the deployed model should not depend on a separate, manually synchronized tokenizer or normalizer artifact.
For online serving, the repository exposes `json2vec.inference.deployment.Deployment`, which wraps a checkpoint-backed model in LitServe. Runtime configuration is environment-driven:
- `JSON2VEC_CHECKPOINT` or `CHECKPOINT`
- `JSON2VEC_MAX_BATCH_SIZE`
- `JSON2VEC_BATCH_TIMEOUT`
- `JSON2VEC_WORKERS_PER_DEVICE`
- `JSON2VEC_ACCELERATOR`
- `JSON2VEC_TRACK_REQUESTS`
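The checkpoint fallback can be sketched with stdlib environment lookups. The precedence shown (prefixed variable wins over the bare one) is an assumption based on the order the two names are listed; verify against the deployment code before relying on it.

```python
import os

# Sketch of the documented fallback pair: try JSON2VEC_CHECKPOINT
# first, then CHECKPOINT. Precedence is assumed from listing order.
def resolve_checkpoint():
    return os.environ.get("JSON2VEC_CHECKPOINT") or os.environ.get("CHECKPOINT")

os.environ.pop("JSON2VEC_CHECKPOINT", None)  # ensure a clean demo state
os.environ["CHECKPOINT"] = "/models/demo.ckpt"
print(resolve_checkpoint())  # /models/demo.ckpt
```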
A minimal serve entrypoint is:
```python
from json2vec.inference.deployment import Deployment

Deployment.serve()
```
Processor Model
Processors are registered Python callables. The built-in default processor returns each observation unchanged.
Custom processors live under `src/json2vec/processors/extensions/` and are registered with either `@register.transformation` or `@register.generator`.
- transformation processors must return a single `dict`
- generator processors may yield `dict` objects or return a `list[dict]`
- every emitted object is wrapped as a single-item root context before tensorization

Configured `dataset.kwargs` are passed into the processor, with unsupported keyword arguments automatically ignored.
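The two processor contracts can be sketched as plain callables. In the real package these would carry the `@register.transformation` and `@register.generator` decorators; the function names and field keys below are made up for illustration.

```python
# Sketch of the two processor contracts, shown as plain callables so
# the example runs standalone; in json2vec they would be decorated with
# @register.transformation / @register.generator respectively.

def redact_email(observation: dict) -> dict:
    """Transformation contract: take one observation, return one dict."""
    cleaned = dict(observation)
    cleaned.pop("email", None)   # hypothetical preprocessing step
    return cleaned

def explode_transactions(observation: dict):
    """Generator contract: yield one dict per emitted observation."""
    for txn in observation.get("transactions", []):
        yield {"customer": observation["id"], **txn}

obs = {"id": "c1", "email": "x@y.z", "transactions": [{"amount": 5.0}]}
print(redact_email(obs))                # email key removed
print(list(explode_transactions(obs)))  # one record per transaction
```

Each emitted dict is then wrapped as a single-item root context before tensorization, so a generator that fans one record out into many produces many observations downstream.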
Tensorfield Plugins
The current built-in tensorfield types are:
`number`, `category`, `dateparts`, `entity`, `vector`, and `text`.
Each tensorfield plugin provides a request schema plus the model components needed to encode values, decode predictions, compute losses, and optionally serialize outputs.
The `text` tensorfield requires the optional `transformers` dependency and is not installed by default.
Runtime Environment
Training and dataloading behavior is controlled with environment variables such as:
- `JSON2VEC_LOGGER`
- `WANDB_API_KEY`
- `NEPTUNE_API_TOKEN`
- `COMET_API_KEY`
- `MLFLOW_TRACKING_URI`
- `JSON2VEC_TENSORBOARD_LOG_DIR`
- `JSON2VEC_CSV_LOG_DIR`
- `JSON2VEC_NUM_WORKERS`
- `JSON2VEC_PERSISTENT_WORKERS`
- `JSON2VEC_PIN_MEMORY`
- `JSON2VEC_SHARDING`
- `JSON2VEC_CHUNK_BATCH_SIZE`
Supported sharding strategies are `file`, `chunk`, and `record`.
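As an intuition for the coarsest strategy, file-level sharding can be sketched as a round-robin assignment of whole files to workers. The exact assignment semantics here are assumptions for illustration, not the library's implementation; `chunk` and `record` sharding split at finer granularity.

```python
# Toy sketch of `file` sharding: each worker takes every Nth file.
# Assignment order is an assumption, not json2vec's actual scheme.
def shard_files(files: list, worker: int, num_workers: int) -> list:
    return [f for i, f in enumerate(files) if i % num_workers == worker]

files = ["a.ndjson", "b.ndjson", "c.ndjson", "d.ndjson"]
print(shard_files(files, worker=0, num_workers=2))  # ['a.ndjson', 'c.ndjson']
```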
Repository Layout
- `src/json2vec/architecture`: model assembly, attention, pooling, and parcel routing
- `src/json2vec/data`: dataset fetch/read/process/batch/encode pipeline
- `src/json2vec/entrypoints`: training and evaluation orchestration
- `src/json2vec/inference`: serving and prediction callbacks
- `src/json2vec/logging`: tracking and runtime logging helpers
- `src/json2vec/processors`: processor registry and built-in extensions
- `src/json2vec/structs`: pydantic config models, enums, tree structures, and environment settings
- `src/json2vec/tensorfields`: tensorfield plugin system and built-in field types
- `tests/`: package test suite
- `docs/summary.typ` and `docs/whitepaper.typ`: longer written documentation
Diagrams
The repository includes architecture and pipeline diagrams under `docs/`.
Development
Run the test suite with:
uv run pytest
Run lint checks with:
uv run ruff check
License
Licensed under the Apache License, Version 2.0. See LICENSE and NOTICE.
References
- `BIBLIOGRAPHY.md`
- `CITATION.bib`