JSON2Vec
json2vec is a Python library for learning embeddings directly from nested, semi-structured records without flattening them into a fixed feature table first.
The model is defined as a tree of contexts and typed fields. Leaf tensorfield plugins encode raw values, context nodes aggregate them with attention, and the same configured pipeline is used for training, batch prediction, and online inference.
What Is In This Repository
This repository currently contains:
- the core library under `src/json2vec/`
- tensorfield plugins for `number`, `category`, `dateparts`, `entity`, `vector`, and `text`
- a processor registry for dataset-specific preprocessing
- a LitServe deployment entrypoint for serving from checkpoints
- tests covering structure loading, data processing, tensorfields, training helpers, logging, and inference
- diagrams plus longer design docs in `docs/`
It does not currently ship maintained example experiments or `make` shortcuts. Older references to `experiments/`, `examples/`, and `make train` were removed because they no longer reflect the checked-in code.
Install
For local development:
```shell
uv sync
```
If you want an editable install:
```shell
pip install -e .
```
The package requires Python >=3.12.
Core Concepts
- `Structure` defines the model tree.
- `Context` nodes describe hierarchical grouping and aggregation.
- `FieldRequest` nodes declare a `type`, a `query`, and type-specific options. `jmespath` queries extract values from each observation.
- `Session` combines a dataset, structure, task, and runtime controls.
- `Experiment` is an ordered list of sessions loaded from config files.
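To make the query concept concrete, here is a hedged, pure-Python sketch of what a query like `[*].id` extracts from an observation. json2vec evaluates real jmespath expressions; `query_ids` below is an illustrative helper that mimics only this one query shape.

```python
import json

# Stand-alone illustration of a field-request query. This is not json2vec's
# implementation; it only mimics the projection "[*].id".
observation = json.loads("""
[
  {"id": "a1", "amount": 3.5},
  {"id": "b2", "amount": 7.0}
]
""")

def query_ids(records: list) -> list:
    # "[*].id" projects the "id" field from every element of the array.
    return [record.get("id") for record in records]

print(query_ids(observation))  # ['a1', 'b2']
```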
Supported session tasks are `fit`, `validate`, `test`, and `predict`.
Supported dataset suffixes are `ndjson`, `parquet`, `feather`, `avro`, `csv`, `orc`, and `json`.
Supported dataset roots are local paths and `s3://...` URIs. If `dataset.root` is null, the pipeline runs in processor-driven mode and expects the configured processor to generate observations.
Minimal Training Workflow
The CLI entrypoint is:

```shell
uv run python -m json2vec --experiments /path/to/configs --experiment demo --name local-dev --notes "first run"
```
The same function is also exposed as the `train` console script after installation.
Config discovery is directory-based. json2vec can load `.json`, `.yaml`, `.yml`, `.toml`, and `.jsonnet` experiment files. If a config directory contains exactly one experiment file, `--experiment` can be omitted.
A minimal YAML experiment looks like this:

```yaml
project: demo
sessions:
  - name: train
    task: fit
    learning_rate: 0.001
    dataset:
      root: /path/to/data
      sample_rate: 1.0
      file_buffer_size: 16
      observation_buffer_size: 16
      processor: default
      kwargs: {}
      suffix: ndjson
      patterns:
        train: .*
        validate: .*
        test: .*
        predict: .*
    structure:
      name: demo-structure
      type: structure
      batch_size: 2
      dropout: 0.1
      d_model: 16
      fields:
        name: root
        type: context
        context_size: 1
        n_outputs: 1
        fields:
          - name: identifier
            type: category
            query: "[*].id"
            max_vocab_size: 1024
```
`fit` sessions write checkpoints to `models/`. In multi-session experiments, the output checkpoint from a `fit` session is automatically passed to later `validate`, `test`, or `predict` sessions.
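For example, a two-session experiment where the checkpoint from a fit session flows into a later validate session could be sketched like this (the `dataset` and `structure` blocks are elided; reuse the shapes from the minimal example):

```yaml
project: demo
sessions:
  - name: train
    task: fit
    learning_rate: 0.001
    dataset: {}       # fill in as in the minimal example
    structure: {}     # fill in as in the minimal example
  - name: holdout
    task: validate    # automatically receives the checkpoint from the fit session
    dataset: {}
    structure: {}
```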
Inference And Serving
Batch prediction uses the same experiment/session machinery as training. Prediction outputs are written to `tmp/predictions/`.
For online serving, the repository exposes `json2vec.inference.deployment.Deployment`, which wraps a checkpoint-backed model in LitServe. Runtime configuration is environment-driven:

- `JSON2VEC_CHECKPOINT` or `CHECKPOINT`
- `JSON2VEC_MAX_BATCH_SIZE`
- `JSON2VEC_BATCH_TIMEOUT`
- `JSON2VEC_WORKERS_PER_DEVICE`
- `JSON2VEC_ACCELERATOR`
- `JSON2VEC_TRACK_REQUESTS`
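A hedged sketch of how such environment-driven settings are typically read; the helper `env_int` and the fallback precedence shown here are illustrative assumptions, not json2vec's actual settings code.

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer setting from the environment, falling back to a default.
    value = os.environ.get(name)
    return int(value) if value is not None else default

# One plausible precedence: JSON2VEC_CHECKPOINT first, then CHECKPOINT.
checkpoint = os.environ.get("JSON2VEC_CHECKPOINT") or os.environ.get("CHECKPOINT")
max_batch_size = env_int("JSON2VEC_MAX_BATCH_SIZE", 1)
```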
A minimal serve entrypoint is:

```python
from json2vec.inference.deployment import Deployment

Deployment.serve()
```
Processor Model
Processors are registered Python callables. The built-in `default` processor returns each observation unchanged.
Custom processors live under `src/json2vec/processors/extensions/` and are registered with either `@register.transformation` or `@register.generator`.
- transformation processors must return a single `dict`
- generator processors may yield `dict` objects or return a `list[dict]`
- every emitted object is wrapped as a single-item root context before tensorization
Configured `dataset.kwargs` are passed into the processor, with unsupported keyword arguments automatically ignored.
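The rules above can be sketched in a self-contained way. The `register` class here is a minimal stand-in so the example runs on its own (the real registry lives in `json2vec.processors`), and `call_with_supported_kwargs` is an illustrative helper showing how unsupported kwargs could be dropped.

```python
import inspect

class register:
    """Minimal stand-in for json2vec's processor registry."""
    transformations: dict = {}
    generators: dict = {}

    @classmethod
    def transformation(cls, fn):
        cls.transformations[fn.__name__] = fn
        return fn

    @classmethod
    def generator(cls, fn):
        cls.generators[fn.__name__] = fn
        return fn

@register.transformation
def redact(observation: dict, field: str = "ssn") -> dict:
    # A transformation processor returns a single dict.
    return {k: v for k, v in observation.items() if k != field}

@register.generator
def explode(observation: dict) -> list[dict]:
    # A generator processor may yield dicts or return a list[dict].
    return [{"id": observation["id"], "item": item}
            for item in observation.get("items", [])]

def call_with_supported_kwargs(fn, observation: dict, **kwargs):
    # Mirrors the behavior described above: unsupported dataset.kwargs
    # are silently dropped before the processor is called.
    params = inspect.signature(fn).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return fn(observation, **accepted)
```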
Tensorfield Plugins
The current built-in tensorfield types are `number`, `category`, `dateparts`, `entity`, `vector`, and `text`.
Each tensorfield plugin provides a request schema plus the model components needed to encode values, decode predictions, compute losses, and optionally serialize outputs.
The `text` tensorfield requires the optional `transformers` dependency, which is not installed by default.
Runtime Environment
Training and dataloading behavior is controlled with environment variables such as:

- `JSON2VEC_LOGGER`
- `WANDB_API_KEY`
- `NEPTUNE_API_TOKEN`
- `COMET_API_KEY`
- `MLFLOW_TRACKING_URI`
- `JSON2VEC_TENSORBOARD_LOG_DIR`
- `JSON2VEC_CSV_LOG_DIR`
- `JSON2VEC_NUM_WORKERS`
- `JSON2VEC_PERSISTENT_WORKERS`
- `JSON2VEC_PIN_MEMORY`
- `JSON2VEC_SHARDING`
- `JSON2VEC_CHUNK_BATCH_SIZE`
Supported sharding strategies are `file`, `chunk`, and `record`.
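The record strategy, for instance, can be pictured as round-robin assignment of observations to dataloader workers. This sketch is illustrative only and is not json2vec's implementation.

```python
def shard_records(records: list, worker_id: int, num_workers: int) -> list:
    # Record-level sharding: worker w keeps every num_workers-th record,
    # so together the workers cover the dataset exactly once.
    return [r for i, r in enumerate(records) if i % num_workers == worker_id]

print(shard_records(list(range(10)), worker_id=0, num_workers=3))  # [0, 3, 6, 9]
```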
Repository Layout
- `src/json2vec/architecture`: model assembly, attention, pooling, and parcel routing
- `src/json2vec/data`: dataset fetch/read/process/batch/encode pipeline
- `src/json2vec/entrypoints`: training and evaluation orchestration
- `src/json2vec/inference`: serving and prediction callbacks
- `src/json2vec/logging`: tracking and runtime logging helpers
- `src/json2vec/processors`: processor registry and built-in extensions
- `src/json2vec/structs`: pydantic config models, enums, tree structures, and environment settings
- `src/json2vec/tensorfields`: tensorfield plugin system and built-in field types
- `tests/`: package test suite
- `docs/summary.typ` and `docs/whitepaper.typ`: longer written documentation
Diagrams
The repository includes architecture and pipeline diagrams in `docs/`.
Development
Run the test suite with:
```shell
uv run pytest
```
Run lint checks with:
```shell
uv run ruff check
```
License
Licensed under the Apache License, Version 2.0. See `LICENSE` and `NOTICE`.
References
- `BIBLIOGRAPHY.md`
- `CITATION.bib`