Independent banking event-sequence modeling toolkit inspired by arXiv:2604.08649; not affiliated with or endorsed by Revolut.
Project description
pragmatiq
pragmatiq is a developer-friendly implementation of ideas described in the PRAGMA paper (arXiv:2604.08649) by Ostroukhov et al. It turns user histories made of timestamped key-value events into embeddings that downstream teams can use for probes, LoRA fine-tuning, graph-based AML experiments, explainability, and serving.
The repository includes the full stack: a deterministic synthetic banking data generator, tokenizer, padding-free PyTorch model, training pipeline, batch embedding, ONNX/Triton serving, notebooks, and a Streamlit demo. It is built to run on CPU first; CUDA and flash-attn are accelerations, not requirements.
pragmatiq is an independent implementation inspired by the PRAGMA paper (arXiv 2604.08649). It is not affiliated with or endorsed by Revolut.
Contents
- Why this exists
- Quickstart
- What is included
- Architecture
- Repository map
- Workflows
- Using your own data
- Training on GPU
- Extending pragmatiq
- AML over the transfer graph
- Model sizes
- PRAGMA+Nemotron text-embedding variant
- Defaults where the paper is silent
- Serving with Triton
- Streamlit demo
- Observability
- Notebooks
- Development
- Citation
- License
Why this exists
The PRAGMA paper (arXiv:2604.08649) describes a practical foundation-model recipe for heterogeneous banking behavior: transactions, app sessions, trades, communications, profiles, and transfer relationships. The paper does not ship reference code.
pragmatiq provides an open implementation that researchers, ML engineers, and data scientists can run, inspect, adapt, and benchmark. The goal is not to claim novelty over PRAGMA. The goal is to make the implementation path concrete.
Quickstart
# pragmatiq is in public beta on PyPI (a pre-release, so --pre is required):
pip install --pre pragmatiq
pragmatiq quickstart
Or from a clone of the repo (for development):
git clone https://github.com/dynamiq-ai/pragmatiq.git && cd pragmatiq
pip install -e ".[dev]"
quickstart runs a CPU-capable synthetic pipeline end to end:
- generate synthetic users and event histories,
- fit the key-value-time tokenizer,
- pretrain a nano masked-language model,
- embed users,
- run a gradient-boosting credit-risk probe against a raw-count baseline.
For a smaller local smoke test:
pragmatiq quickstart --n-users 2000 --max-steps 80
For development from a source checkout:
pip install -e ".[dev]"
The full pipeline — including the gradient-boosting probe and the AML transfer-graph
GraphSAGE ablation — works with the plain install. Optional extras add focused
tooling: .[serve] for ONNX/Triton export and serving, .[demo] for the Streamlit
demo, .[extras] for the frozen text embedder plus experiment tracking (Weights &
Biases, TensorBoard), and .[full] for all of them.
The same workflow is available from Python:
from pragmatiq import api
result = api.quickstart(n_users=2000, max_steps=80)
print(result["message"])
The CLI is intentionally thin. Commands parse arguments, then call public
functions in pragmatiq/api.py, so notebook and production
callers use the same library surface.
What is included
| Area | What pragmatiq ships |
|---|---|
| Synthetic data | Agent-based banking simulator with deterministic seeds, causal labels, fraud/default/AML scenarios, and realism checks. |
| Tokenization | Key-value-time tokenizer for numeric bins, categorical values, descriptor text, profiles, and unseen [UNK] fallbacks. |
| Model | PRAGMA-style profile, event, and history encoders with TimeRoPE, padding-free varlen attention, and tied MLM head. |
| Training | CPU-safe pretraining, resume-safe checkpoints, probes, LoRA fine-tuning, and configurable heads/maskers/value encoders. |
| AML graph | GraphSAGE ablation over transfer graphs using isolated embeddings, pragmatiq features, and hand-crafted graph features. |
| Inference | Batch embedding, PragmaModel.from_pretrained(run), notebook-friendly embed_records, integrated-gradients attribution, ONNX, and Triton serving. |
| Publication assets | Model card, contribution guide, citation metadata, security policy, Apache-2.0 license, and GitHub templates. |
Architecture
At a high level, pragmatiq follows the PRAGMA recipe:
- Key and value embeddings share one table, also reused for tied MLM logits.
- Continuous time positions use the paper's
8 * ln(1 + delta_t / 8)transform. - Events are encoded independently, then combined with profile information in a bidirectional history encoder.
- Masking combines token, whole-event, and key-level masks;
[UNK]replacements are excluded from MLM loss. - Checkpoints include model, optimizer, scheduler, sampler position, RNG states, tokenizer hash, and resolved config.
- Registries let users customize heads, maskers, and value encoders without forking the library.
See docs/architecture.md for a teaching-grade walkthrough of
the encoders, temporal encoding, and objective, and MODEL_CARD.md for the
objective and limitations write-up.
Paper fidelity
What follows the PRAGMA paper, and what pragmatiq adds on top:
| Component | In the paper | pragmatiq |
|---|---|---|
Key–value–time tokenization, 8·ln(1+Δt/8) time transform, TimeRoPE on continuous log-seconds |
✓ | faithful |
Profile / event / history encoders; 3d MLM head [ẑ_e, z_h(EVT), z_h(USR)] → tied logits + label smoothing |
✓ | faithful |
Masking 15% token / 10% event / 10% key, 10% [UNK]-as-dropout excluded from loss |
✓ | faithful |
| Model sizes 10M / 100M / 1B | ✓ | faithful S/M/L (+ a CPU nano for CI) |
| Pre-training caps (event ≤24 tokens, profile ≤200, ≤6500 events/user) | ✓ | faithful |
| PRAGMA+Nemotron frozen-text-embedding variant (MSE reconstruction) | ✓ | implemented, switchable, off by default |
| Synthetic data generator | the paper uses real Revolut data | our addition — agent-based, deterministic |
| AML over the transfer graph (GraphSAGE ablation) | not in the paper | our addition, presented standalone |
| Downstream probe | LoRA + a probe across tasks | gradient boosting is our default probe head |
Repository map
pragmatiq/
├── pragmatiq/ # the library — all logic lives here
│ ├── api.py # public functions: synthesize / tokenize / pretrain / ...
│ ├── cli.py # Typer CLI (parses args, calls api.py, nothing else)
│ ├── registry.py # @register_head / @register_masker / @register_value_encoder
│ ├── validate.py # data-contract validation with actionable errors
│ ├── data/ # schema, tokenizer, sharding, collation, synthetic generator
│ ├── models/ # encoders, MLM head, LoRA, AML GNN (gnn.py)
│ ├── training/ # pretrainer, masking, Muon+AdamW, probe, finetuner
│ ├── inference/ # batch embedder, attribution (explain.py), ONNX export, benchmarks
│ └── experiments/ # run directories, metric logging, run comparison
├── configs/ # model / pretrain / tokenizer / synthetic / finetune YAMLs
├── notebooks/ # 01–04 guided walkthroughs (see Notebooks below)
├── demo/app.py # Streamlit demo
├── deploy/ # Triton model repo, docker-compose, Prometheus, demo Dockerfile
├── scripts/ # runpod_launch.py (GPU rental), gates/ (maintainer validation)
└── tests/ # the spec in executable form — useful usage examples
The tests/ directory doubles as living documentation: for example
tests/test_tokenizer.py shows the tokenizer
contract end to end, and tests/test_training.py
demonstrates bit-exact checkpoint resume.
Workflows
The quickstart is the best first run. Use the commands below when you want to drive each stage explicitly.
1. Generate and tokenize data
pragmatiq synth generate --out data/synth --config configs/data/synthetic.yaml
pragmatiq tokenize data/synth --out data/tokenized --n-workers 8
The generator takes --n-users, --seed, and --n-workers; the same seed
produces byte-identical output for any worker count. To fit generator priors to
aggregate statistics a bank can share (no raw data), use
pragmatiq synth calibrate --stats configs/data/aggregates.example.yaml.
2. Pretrain and embed
pragmatiq pretrain data/tokenized --name demo --model-size small --config configs/pretrain.yaml
pragmatiq embed data/tokenized --run runs/demo --out embeddings.parquet
Inspect and compare runs at any point:
pragmatiq runs list
pragmatiq runs compare demo other-run
3. Evaluate task signal
probe is the fastest CPU check. Use fine-tuning when you want supervised
adapter training on a labeled task.
pragmatiq probe data/tokenized --run runs/demo --label data/synth/labels/default_12m.parquet
pragmatiq finetune data/tokenized --run runs/demo --label data/synth/labels/default_12m.parquet \
--config configs/finetune/credit.yaml
The probe head is gradient boosting by default (HistGradientBoostingClassifier;
--probe-model logistic or lightgbm switch it), which captures the non-linear
structure of a learned embedding better than a linear head. Both ROC-AUC and PR-AUC
are reported — PR-AUC is the honest headline on the low-prevalence risk tasks — and
the raw-count baseline uses the same classifier, so the gap reflects the
representation, not the model family. On the default synthetic book the pragmatiq
probe beats that baseline. When a label table carries an eval_ts column, histories
are truncated at it before embedding — for both the probe and the baseline — so
metrics are forecasts, never hindcasts.
The embedding's value is not specific to credit. The multi-task probe benchmark
(scripts/benchmarks/multitask_probe.py) probes every user-level task against
the same raw-count baseline. The table below is auto-written by
write_multitask_report on an opt-in run (PRAGMATIQ_WRITE_RESULTS=1), carries
a provenance stamp, and is refused if it would replace a larger-scale result.
Event-level fraud/recurring (transaction/series-level) and comm_uplift
(a treatment-effect task — see pragmatiq uplift) are evaluated by their own
paths, not this user-embedding probe.
| task | probe AUC | raw-count baseline | probe - baseline | prevalence |
|---|---|---|---|---|
| default_12m | 0.770 | 0.543 | +0.228 | 0.03 |
| churn_6m | 0.744 | 0.609 | +0.135 | 0.12 |
| ltv_positive | 0.828 | 0.667 | +0.161 | 0.80 |
Illustrative synthetic-benchmark run (reproducible config): n_users=50000, model=small, steps=2000, seed=0. Treat these as a pipeline check, not a performance claim.
Using your own data
pragmatiq trains on a small parquet contract — four files, strict dtypes
(enforced by pragmatiq validate, defined in
pragmatiq/data/schema.py):
| File | Required columns |
|---|---|
events.parquet |
user_id (string), ts (timestamp[us]), source (string), fields (map<string,string>) |
profiles.parquet |
user_id (string), as_of (timestamp[us]), attributes (map<string,string>), lifelong (list<struct<key: string, ts: timestamp[us]>>) |
transfers.parquet |
optional, for the AML GNN; from_user, to_user (string), ts (timestamp[us]), amount (float64) |
labels/*.parquet |
optional task tables: forecast labels use user_id (string), eval_ts (timestamp[us]), label (int8); AML membership uses user_id, observed_through (timestamp[us]), label (int8) |
The subsections below walk through producing each file from a typical bank data warehouse.
Events: one row per event, fields as a string map
Every event is (user_id, ts, source, fields). source must be one of
transaction, app, trading, communication. fields is a flat
map<string,string>: stringify everything — the tokenizer decides per key
whether a field is numeric (percentile-binned into 64 buckets + a zero
bucket), categorical (one token per value up to 1000 distinct values), or
free text (byte-level BPE). A flattening export looks like this
(Trino/Spark-flavored SQL):
-- one row per event, sorted so each user's rows are adjacent and time-ordered
CREATE TABLE export.events AS
SELECT * FROM (
SELECT CAST(t.customer_id AS VARCHAR) AS user_id,
t.booked_at AS ts,
'transaction' AS source,
MAP(ARRAY['amount', 'currency', 'mcc', 'merchant', 'channel'],
ARRAY[CAST(t.amount AS VARCHAR), t.currency,
CAST(t.mcc AS VARCHAR), t.merchant_name, t.channel]) AS fields
FROM dwh.card_transactions t
UNION ALL
SELECT CAST(a.customer_id AS VARCHAR),
a.event_time,
'app',
MAP(ARRAY['screen', 'action', 'os'],
ARRAY[a.screen, a.action, a.os])
FROM dwh.app_events a
)
ORDER BY user_id, ts;
The ORDER BY (user_id, ts) is not cosmetic: pragmatiq validate errors on
out-of-time-order events within a user and warns when one user's rows appear in
non-adjacent blocks (they would tokenize as multiple records). Different
sources can carry completely different field keys — that heterogeneity is the
point of the key-value scheme.
Profiles: static attributes plus lifelong milestones
One row per user: attributes holds static key-values (country, age band,
plan), lifelong holds dated milestones (account opened, KYC passed, first
card), and as_of is the snapshot timestamp:
CREATE TABLE export.profiles AS
SELECT CAST(c.customer_id AS VARCHAR) AS user_id,
CURRENT_TIMESTAMP AS as_of,
MAP(ARRAY['country', 'age_band', 'plan'],
ARRAY[c.country, c.age_band, c.plan]) AS attributes,
ARRAY[ROW('account_opened', c.opened_at),
ROW('kyc_passed', c.kyc_at)] AS lifelong
FROM dwh.customers c;
Transfers and labels
transfers.parquet is the directed money-flow edge list consumed by the
AML GNN — only needed for graph experiments.
Forecast label tables are (user_id, eval_ts, label):
CREATE TABLE export.labels_default_12m AS
SELECT CAST(customer_id AS VARCHAR) AS user_id,
observation_date AS eval_ts, -- when the decision was made
CAST(defaulted_within_12m AS TINYINT) AS label
FROM risk.default_outcomes;
Why eval_ts matters. pragmatiq probe truncates each user's history at
their eval_ts before embedding, so the embedding only sees what the bank knew
at decision time and the reported AUC is a genuine forecast, not a hindcast on
data that already contains the outcome. The AML task is the deliberate
exception: mule detection is membership classification over observed activity,
so labels/aml.parquet uses (user_id, observed_through, label) and the full
horizon rather than a forecast cut (see MODEL_CARD.md).
Validate, tokenize, pretrain
pragmatiq validate data/mybank
pragmatiq tokenize data/mybank --out data/mybank-tokenized --n-workers 8
pragmatiq pretrain data/mybank-tokenized --name mybank-v1 --model-size small
validate catches dtype mismatches, null ids, null timestamps, unknown sources,
timestamp ordering problems, non-contiguous user rows, and cardinality
pathologies — each finding is a human-readable string with a concrete fix, and
the command exits nonzero on errors. Events must be grouped by user and ascending
in ts; sort by (user_id, ts) before tokenizing (a user whose rows are split
apart is refused rather than silently fragmented). tokenize --n-workers N
parallelizes encoding across processes with a byte-identical guarantee: shard
files are identical for any worker count.
To rehearse this path against sparse data before pointing it at a real feed,
generate synthetic data with missing_field_rate > 0 (a WorldConfig knob): that
fraction of event fields is dropped deterministically, so you can confirm the
validate → tokenize → train flow tolerates missing fields.
Time zones and the calendar features
pragmatiq derives the hour-of-day, day-of-week and day-of-month calendar features
from each event's timestamp, and by default treats timestamps as UTC. If your
data is in a local zone — so the behavioural day/night, weekend and payday
structure is local — set calendar_tz in the tokenizer config to an IANA zone:
# tokenizer.yaml passed via `pragmatiq tokenize --config`
calendar_tz: Europe/London # default: UTC
or programmatically, api.tokenize(raw, out, config={"calendar_tz": "Europe/London"}).
Pass timezone-aware timestamps so the instant itself is correct; only the
wall-clock calendar features are localized (DST included). calendar_tz is folded
into the tokenizer content hash, so a checkpoint refuses to load against a
tokenizer fitted with a different zone.
[UNK] and when to refit the tokenizer
At inference time, keys and values not seen during tokenizer fitting map to
[UNK] with a logged warning instead of raising KeyError. That makes drift
visible instead of fatal — but a rising [UNK] rate is a signal, not a
feature.
- Reuse a tokenizer (
pragmatiq tokenize newdata --tokenizer-dir data/mybank-tokenized/tokenizer ...) whenever you tokenize new data for an existing trained model. This is mandatory, not just convenient: checkpoints embed the tokenizer content hash andPragmaModel.from_pretrained()refuses to load against a mismatched tokenizer. - Refit a tokenizer (the default, no
--tokenizer-dir) when you are starting a new pretraining run — new bank, new event sources, or enough drift that[UNK]warnings are frequent. A refit changes the vocabulary, so it always implies pretraining a new model with it.
Training on GPU
Everything runs on CPU (slow but correct); a GPU is an acceleration, not a
requirement. The trainer is built on Lightning Fabric and auto-detects CUDA:
on GPU it trains in bf16-mixed precision, on CPU in fp32 — no flags
needed, pragmatiq pretrain is the same command on both.
flash-attn is optional. When flash-attn is installed on CUDA, varlen
attention uses flash_attn_varlen_func directly on the packed, padding-free
token stream. Otherwise pragmatiq falls back to PyTorch SDPA over per-segment
padded blocks built from cu_seqlens — the two paths agree to fp32 atol 1e-4
(checked in CI as the padding-equivalence test).
Knobs that matter
All defaults live in configs/pretrain.yaml and map
1:1 onto TrainConfig in
pragmatiq/training/pretrainer.py; any
key can be overridden via --config or programmatically through
api.pretrain(..., key=value).
| Knob | Default | What it does |
|---|---|---|
max_steps |
20000 |
Optimizer steps; also the cosine-schedule horizon. |
token_budget |
16384 |
Tokens per packed forward (the per-device memory knob — raise it on big GPUs). |
grad_accum_steps |
1 |
Micro-batches per optimizer step. Effective batch = token_budget × grad_accum × world_size; raise it for a large, stable batch on a memory-bound GPU without raising token_budget. |
devices / num_nodes |
auto / 1 |
Fabric DDP: per-node device count and host count (multi-node). |
lr_muon / lr_adamw |
3e-3 / 3e-4 |
Muon drives 2-D hidden weights; AdamW drives embeddings/norms/biases. |
warmup_steps |
500 |
Linear warmup before the cosine decay. |
weight_decay / grad_clip |
0.01 / 1.0 |
Applied to both optimizers. |
checkpoint_every_min |
15.0 |
Wall-clock minutes between full checkpoints. |
log_every |
20 |
Steps between metric logs and the stderr heartbeat. |
verbose |
true |
One-line heartbeat (step, loss, mlm_acc, tok/s, ETA); --quiet disables it. |
wandb / wandb_project |
false / pragmatiq |
Weights & Biases mirror (or pass --wandb). |
seed / nan_skip |
0 / true |
Reproducibility; NaN/inf losses dump the batch to debug/ and skip the step. |
max_consecutive_skips |
50 |
Abort if this many steps in a row are skipped for a non-finite loss/grad — a divergence guard so a broken run fails loud instead of burning compute. |
deterministic |
false |
Opt-in reproducible CUDA path (see Determinism); forces fp32 on GPU, at a throughput cost. |
masker, p_token, p_event, p_key, p_unk |
pragma, 0.15, 0.10, 0.10, 0.10 |
Masking strategy (swappable via @register_masker) and its rates. |
text_loss_weight |
1.0 |
Weight λ on the text MSE term in the Nemotron variant (loss = CE + λ·MSE); inert without a text encoder. |
Scaling to 1M–26M records, hands-off
Pointing pragmatiq at a large book should not require tuning the batch or schedule.
Pass config="auto" (or --config auto) and it sizes token_budget, grad_accum_steps,
max_steps, and warmup_steps from the shard index (user count, token distribution) and
the target device, with an explainable rationale logged at startup:
pragmatiq pretrain data/tokenized --name big --model-size medium --config auto
api.pretrain("data/tokenized", "big", model_size="medium", config="auto")
Explicit overrides still win (api.pretrain(..., config="auto", grad_accum_steps=8)), and
autoconfigure is callable directly to inspect the plan.
The three levers it sets are also usable by hand:
- Gradient accumulation decouples the effective batch from device memory.
grad_accum_stepsmicro-batches accumulate before one clip + optimizer step, so a stable batch is reached on a memory-bound GPU without raisingtoken_budget. With the default1the trajectory is byte-identical to no accumulation. - Multi-node DDP is
devices(per-node) ×num_nodes. The rank sampler shards the data per global rank with a per-rank masking seed, so adding ranks trains disjoint slices in lockstep. - Truncation caps (set on the tokenizer:
max_event_tokens=24,max_profile_tokens=200,max_events_per_user=6500) keep heavy-tailed real histories tractable; they do not bind at synthetic scale, so default output is unchanged.
Resume semantics
pragmatiq pretrain data/tokenized --name demo --resume auto
--resume auto picks up runs/demo/checkpoints/last.pt if it exists.
Checkpoints capture the model, both optimizers, the LR scheduler, the
sampler position, all RNG states (torch, numpy, CUDA, and the masking
generator), the tokenizer hash, and the resolved config — so an interrupted
run resumed mid-flight reproduces the exact batch and masking stream of an
uninterrupted one (this is tested bit-exactly). Resuming with a different
tokenizer is refused with a clear error.
Determinism
From a fixed seed, CPU runs are byte-identical — weight init, dropout, the masking stream, and shard/worker output are all seeded (CI-enforced for the generator and the resume test). GPU kernels use a different reduction order, so CPU and GPU outputs are never bit-identical to each other — pick one target and compare against itself.
The opt-in deterministic: true flag makes the GPU path reproducible on
fixed hardware. It enables torch.use_deterministic_algorithms, the cuDNN
deterministic path, flash-attn's deterministic backward, and (because bf16
backward on CUDA has no deterministic implementation upstream) trains in fp32
instead of bf16-mixed. With it on:
- GPU forward / embedding is reproducible run-to-run on the same hardware.
- GPU training is bit-exact in fp32 (same seed → same loss curve).
- GPU bf16 training is not bit-exact — the SDPA/flash bf16 backward stays
nondeterministic upstream. Leaving
deterministic: falsekeeps the default bf16-mixed path; a deterministic bf16 run is run-to-run stable to ~1e-3.
The flag is off by default, so default performance and behaviour are unchanged; turn it on only when you need a reproducible GPU run and can accept the fp32 throughput cost.
Multi-GPU and renting a GPU
Fabric launches DDP across all visible CUDA devices automatically — no flag
changes, just run on a multi-GPU host. Each rank trains a disjoint, equal-sized
shard of every epoch's batches (so the gradient all-reduce covers the full epoch
once) with an independent masking stream, and only rank 0 writes checkpoints and
metrics. For multi-node jobs set num_nodes (and devices per node) in the
config; the rank sharding and per-rank masking extend across the whole world. Pair
this with grad_accum_steps and config: auto (see
Scaling to 1M–26M records) for a large, stable
effective batch without per-device OOM.
As a reference point: the small model sustains roughly 79k tokens/s on a
single A100 at token_budget: 8192 (measured on a full-scale run; throughput is
logged per run in runs/<name>/metrics.jsonl, so measure on your own data).
scripts/runpod_launch.py is a turnkey path for
validating on a rented A100/H100: it creates a RunPod pod via the REST API,
syncs the repo over SSH (no GitHub required), installs, and runs the GPU
end-to-end pipeline:
export RUNPOD_API_KEY=...
python scripts/runpod_launch.py --gpu "NVIDIA A100 80GB PCIe" --run-name a100-smoke
python scripts/runpod_launch.py --terminate <pod_id>
Extending pragmatiq
Heads, maskers, and value encoders are resolved by name through
pragmatiq/registry.py, so you can plug in your own
without forking. Register a component with the matching decorator and reference it
by name in a config.
import torch.nn as nn
from pragmatiq.registry import register_head
@register_head("ranking")
class RankingHead(nn.Module):
"""A custom task head on the user embedding z_h[USR] -> [n_users, n_classes]."""
def __init__(self, dim: int, n_classes: int = 2) -> None:
super().__init__()
self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, n_classes))
def forward(self, user_repr):
return self.net(user_repr)
Select it in a fine-tune config and run as usual:
# myhead.yaml
head: ranking
n_classes: 2
lora_rank: 8
pragmatiq finetune data/tokenized --run runs/mybank-v1 \
--label data/mybank/labels/churn.parquet --config myhead.yaml
Registration is an import side effect, so import the module that defines your
component before finetune/get_head runs (keep it in a package your config or
entry point imports). @register_masker(name) and @register_value_encoder(name)
work the same way, referenced by the masker and value_encoder config keys.
AML over the transfer graph
Money laundering through mule rings is a relational problem: a mule is defined by who they transact with (fan-in of small credits, layering inside the ring, shared cash-out), not only by their own behavior. pragmatiq ships a transfer-graph extension that tests exactly how much of that signal a graph recovers. The AML GNN path is part of the core install.
The pieces
TransferGraphBuilder(pragmatiq/models/gnn.py) turnstransfers.parquetplus frozen pragmatiq user embeddings into a PyG graph: nodes are embedded users, directed edges are money flows (kept only between users that have features), edge attributes are amount and log-recency, node features are the embeddings, and node labels mark mules.AmlGNNis a 2–3 layer GraphSAGE node classifier with two deliberate design choices. Sum aggregation: mule detection hinges on fan-in, and mean aggregation normalizes degree away — sum keeps it, so message passing can recover the structural signal regardless of which node features are used. A raw-feature skip to the head: each node's own features are projected and concatenated with the graph representation before classification (with residual conv layers), because without it the stack over-smooths and loses the per-user signal an isolated probe already has.
The four-arm ablation
pragmatiq gnn (and api.gnn) runs four setups on identical, stratified
train/val/test splits shared across arms, over multiple seeds:
| Arm | Features | Graph? | Question it answers |
|---|---|---|---|
| (a) | pragmatiq embeddings | no — logistic probe | How far does a per-user embedding get alone? |
| (b) | pragmatiq embeddings | GraphSAGE | Does the transfer graph add to rich learned features? |
| (c) | hand-crafted node stats (degree, volume) | GraphSAGE | What does a fraud analyst's baseline + a graph achieve? |
| (d) | same hand-crafted stats | no — logistic control | Is the graph effect real, or just the features? |
The synthetic mule rings are multi-hop layered laundering chains. Money mules
are modeled as ordinary recruited accounts with no distinctive individual
behavior, and their amounts and counterparty degree are drawn to match ordinary
accounts — so 1-hop degree is not a trivial oracle ((d), a degree-only
logistic baseline, reaches only 0.604). The laundering legs are written to the
transfer ledger (transfers.parquet), not the card-event stream; the
discriminative signal is the multi-hop layering chain in that ledger.
The gated claim — relational recovery. A GraphSAGE over the
transfer graph recovers money-mule rings that a probe on the isolated per-user
embedding cannot: (c) 0.670 ≫ (a) 0.498, so the AML signal lives in the
multi-hop transfer structure an isolated embedding misses. Message passing adds
over the same features without a graph ((c) 0.670 > (d) 0.604). This is what
gate_6 gates, at both CI and full scale.
Reported, not gated — the honest limitation. The learned per-user embedding
adds only a little over the isolated probe ((b) 0.554 > (a) 0.498) and does
not beat hand-crafted features ((b) 0.554 < (c) 0.670). The isolated
embedding sits near chance, so the model does not capture the multi-hop laundering
signal in the per-user representation; recovering it in a learned representation
is the open challenge. This is consistent with the PRAGMA paper's own
observation that AML is a setting where the model underperforms because it
processes user histories in isolation — the GNN here is pragmatiq's honest
extension that probes that gap, not a "the learned embedding wins" result. See
MODEL_CARD.md and
notebooks/04_aml_gnn.ipynb.
Running it
pip install -e .
pragmatiq gnn data/tokenized --run runs/demo \
--transfers data/synth/transfers.parquet \
--aml-label data/synth/labels/aml.parquet \
--seeds 0,1,2 --epochs 150
The command embeds users with the trained run, builds the graph, fits each
arm per seed, and prints per-arm mean ± std ROC-AUC plus a verdict dict.
notebooks/04_aml_gnn.ipynb is the interactive
version of the same experiment.
Latest results
The table below is auto-written by write_aml_report in
pragmatiq/models/gnn.py on a passing full run
(opt-in via PRAGMATIQ_WRITE_RESULTS=1). Generated tables carry a provenance
stamp (node/edge/mule counts, seeds, epochs, git commit), and the writer
refuses to overwrite a larger-scale result with a smaller-scale one — a
CI-scale run can never masquerade as a full-scale result.
| setup | ROC-AUC (mean ± std over seeds) |
|---|---|
| (a) probe on isolated pragmatiq embeddings | 0.498 ± 0.010 |
| (b) GraphSAGE over transfers + pragmatiq features | 0.554 ± 0.026 |
| (c) GraphSAGE + hand-crafted node features | 0.670 ± 0.014 |
| (d) control: logistic regression on the same hand-crafted features, no graph | 0.604 ± 0.028 |
Relational recovery (gated): True — a GraphSAGE over the transfer graph recovers money-mule rings that a probe on isolated pragmatiq embeddings cannot ((c) > (a) = True), so the AML signal lives in the multi-hop transfer structure an isolated per-user embedding misses. Money mules are degree- and volume-matched to ordinary accounts, so the signal is the multi-hop layering chain, not 1-hop degree, and message passing adds over the same features without a graph ((c) > (d) = True). The gate requires both.
Reported, not gated: the learned per-user embedding adds a little over the isolated probe ((b) > (a) = True) but does not beat hand-crafted features ((b) > (c) = False). The isolated embedding sits near chance, so on this synthetic book the model does not capture the multi-hop laundering signal on its own — recovering it in a learned per-user representation is the open challenge (see MODEL_CARD.md).
provenance: n_nodes=12000, n_edges=344388, n_mules=607, seeds=[0, 1, 2], epochs=150, commit=a263737
Model sizes
| Size | d | Heads | Depths (profile / event / history) | Nominal | Actual params at ~28k vocab |
|---|---|---|---|---|---|
small |
192 | 3 | 1 / 5 / 2 | 10M | ~9.1M |
medium |
512 | 8 | 3 / 16 / 6 | 100M | ~94M |
large |
1024 | 16 | 9 / 45 / 18 | 1B | ~940M |
These sizes are the presets in ModelConfig.preset
(pragmatiq/models/pragmatiq.py); configs/model/{small,medium,large}.yaml
document them. Any architecture field (e.g. rope_base, dropout) can be
overridden by passing it in the pretrain config. The test suite checks the
model and MLM head parameter counts against the nominal sizes.
PRAGMA+Nemotron text-embedding variant
The paper describes a variant in which high-cardinality text fields (merchant names, device ids, free-text memos) are not split into BPE pieces. Instead a frozen text model maps each value's full string to a single vector, and the MLM objective reconstructs that continuous vector with MSE rather than predicting sub-word ids. On its text-heavy tasks the paper reports a meaningful credit lift at a modest latency cost, so pragmatiq ships it as a switchable option — off by default, so the BPE path is byte-identical to a build without it.
It is switchable from the data step alone. Tokenize in embed mode and pretrain
auto-builds the matching frozen encoder and MSE reconstruction head — no model flags:
pip install -e ".[extras]" # adds transformers (the frozen embedder)
pragmatiq tokenize data/synth --out data/tokenized --config configs/data/tokenizer_nemotron.yaml
pragmatiq pretrain data/tokenized --name nemo --model-size medium # text branch auto-wired
- Encoders are resolved by name from the registry (
@register_text_encoder): the productionnemotronembedder (frozen, mean-pooled,no_grad), and a deterministic, dependency-freehashstand-in so the whole path — embed-mode tokenization, the text input projection, the MSE branch, and masking that routes text to reconstruction — is exercised on CPU in CI without downloading a multi-GB model. - Objective: masked text tokens are reconstructed by a
Linear(3d → text_dim)head against the frozen vector; the loss isCE + λ·MSE(text_loss_weight, default 1.0). Ordinary tokens keep the cross-entropy MLM objective. - Serving handles both variants — build the Triton image with
PRAGMATIQ_TRITON_EXTRAS=nemotron(see Serving with Triton).
Defaults where the paper is silent
The paper leaves some engineering details unspecified. pragmatiq treats these
as defaults, exposes them in config, and marks source-level guesses with
# GUESS.
| Knob | Default | Where |
|---|---|---|
| Muon LR for 2-D hidden weights | 3e-3 |
configs/pretrain.yaml |
| AdamW LR for embeddings/norms/biases | 3e-4 |
configs/pretrain.yaml |
| Token budget per batch | 16384 |
configs/pretrain.yaml |
| Warmup steps | 500 |
configs/pretrain.yaml |
| Numeric percentile buckets plus zero bucket | 64 |
configs/data/tokenizer.yaml |
| Target total vocabulary | 28000 |
configs/data/tokenizer.yaml |
| RoPE base | 10000.0 |
ModelConfig size preset; override via the pretrain config (e.g. rope_base:) |
[UNK] fraction of masked positions |
10% |
MaskingStrategy(p_unk=0.10) |
Serving with Triton
The production serving path is a Triton python backend that runs the native
varlen PyTorch model — the exact no-padding forward used in training, not an
approximation. The model repository lives at
deploy/triton/model_repository/pragmatiq_embedder/:
config.pbtxt
declares the interface and
1/model.py
loads PragmaModel.from_pretrained(run_dir) once at startup and serves
embed_records for every request.
One-command deploy + smoke
scripts/deploy_serving.sh is the turnkey path: it
builds the serving image, boots tritonserver with your run mounted (the host GPU is
used automatically when present), waits for readiness, then sends a real embedding
request and verifies the [n_users, dim] response.
pragmatiq pretrain data/tokenized --name demo # any trained run works
bash scripts/deploy_serving.sh --run runs/demo # default model
bash scripts/deploy_serving.sh --run runs/nemo --variant nemotron # Nemotron variant
Bring the full stack up
deploy/docker-compose.yaml starts four services:
Triton, Prometheus, Grafana, and the Streamlit demo. The Triton service builds
from deploy/triton/Dockerfile, which installs pragmatiq
into Triton's Python (the python backend imports it; the stock image cannot run the
model) while leaving the image's CUDA build of torch untouched. Point PRAGMATIQ_RUN
at a trained run directory — the one with checkpoints/ and tokenizer/ — mounted
read-only at /models/run:
export PRAGMATIQ_RUN=$PWD/runs/demo
docker compose -f deploy/docker-compose.yaml up -d --build
# Nemotron variant serving: PRAGMATIQ_TRITON_EXTRAS=nemotron docker compose ... up -d --build
The Triton service has a readiness healthcheck (/v2/health/ready); CPU-first by
default, add a GPU reservation (commented in the compose file) for GPU serving.
| Service | Port | What it is |
|---|---|---|
| Triton HTTP / gRPC | 8000 / 8001 | Inference endpoints |
| Triton metrics | 8002 | Prometheus scrape target (5s interval, see deploy/prometheus/prometheus.yml) |
| Prometheus | 9090 | Metrics store |
| Grafana | 3000 | Dashboards (anonymous admin enabled; Prometheus at http://prometheus:9090) |
| Streamlit demo | 8501 | See Streamlit demo |
Request format
One request carries a JSON array of plain user records — the same dicts
PragmaModel.embed_records accepts (user_id, events as
{ts, source, fields} objects with ts in microseconds since epoch, optional
attributes and lifelong). The response is the [n_users, dim] fp32
embedding matrix. Batching happens inside the model: the varlen forward
packs all users in the request with no padding, which is why config.pbtxt
sets max_batch_size: 0 and uses a 2-instance group for request-level
concurrency instead of Triton dynamic batching (the trade-off is documented in
the config file).
curl -s localhost:8000/v2/models/pragmatiq_embedder/infer \
-H 'Content-Type: application/json' \
-d @- <<'JSON'
{
"inputs": [{
"name": "records_json",
"shape": [1],
"datatype": "BYTES",
"data": ["[{\"user_id\": \"u1\", \"events\": [{\"ts\": 1718200000000000, \"source\": \"transaction\", \"fields\": {\"amount\": \"42.50\", \"currency\": \"GBP\", \"merchant\": \"TESCO\"}}], \"attributes\": {\"country\": \"GB\"}, \"lifelong\": []}]"]
}]
}
JSON
Unseen keys or values in a request map to [UNK] with a logged warning —
serving never raises KeyError on vocabulary drift.
Benchmarking
pragmatiq benchmark data/tokenized --run runs/demo --device cuda
benchmark measures local batch-embedding throughput (users/s, tokens/s, and
a USD-per-million-users estimate) and writes deploy/benchmarks/RESULTS.md,
which also includes the ready-to-run perf_analyzer command for sweeping
p50/p95/p99 latency vs concurrency against the live Triton endpoint (latency
percentiles need a real endpoint, so that half is emitted as a command rather
than executed).
The ONNX export
pragmatiq export (requires the serve extra) writes a dense reformulation
of the model: the same weights run over padded tensors, so the exported graph
reproduces the native embeddings (validated against onnxruntime on export and
shape-dynamic in the user/event/token axes). The Triton python backend stays the
high-throughput path because it runs the native varlen model and skips the
padding the dense graph materializes — a deployment choice, not a fidelity gap.
Pick Triton for throughput, ONNX for portability.
Streamlit demo
demo/app.py is a small Streamlit app over a trained run and a
generated dataset: pick a synthetic user in the sidebar and see their event
timeline (recent transactions with amount/merchant), their embedding
computed live via embed_records (the raw embedding and its norm — attach
fine-tuned heads for calibrated fraud/credit/churn scores), and their ego
transfer graph (all transfers in and out of the selected user). For
per-event explanations, the library ships integrated-gradients attribution in
pragmatiq/inference/explain.py
(EventAttributor returns the top-k events behind a prediction).
pip install -e ".[demo]"
pragmatiq quickstart # or point the env vars at existing artifacts
PRAGMATIQ_RUN=runs/demo PRAGMATIQ_RAW=data/synth streamlit run demo/app.py
PRAGMATIQ_RUN (default runs/quickstart) is the trained run directory,
PRAGMATIQ_RAW (default data/synth) the generated dataset, and
PRAGMATIQ_SHARDS (default data/tokenized) the tokenized shards. The demo
also runs as the demo service in
deploy/docker-compose.yaml on port 8501.
Observability
Every training run always writes runs/<name>/metrics.jsonl — one JSON object
per logged step (total and per-masking-type losses, MLM accuracy, grad norm,
LR factor, tokens/sec, GPU memory) — plus run.yaml, meta.json, and a copy
of the tokenizer. On top of that:
- Console: long phases (synth, tokenize, embed) show tqdm progress bars in
terminals and notebooks, and rate-limited log lines when output is piped
(CI,
nohup). Pretraining prints a one-line heartbeat everylog_everysteps (step, loss, mlm_acc, tokens/sec, ETA); disable withverbose: falsein the pretrain config orpragmatiq --quiet .... Progress and logs go to stderr — stdout stays a single parseable JSON document. - Parallelism without nondeterminism:
synth generate --n-workersandtokenize --n-workersfan out across processes but produce byte-identical output for any worker count (CI-enforced), so you can scale CPU phases freely without losing reproducibility. - TensorBoard:
pip install -e ".[extras]", thentensorboard --logdir runs/<name>/tb. The mirror is on whenever thetensorboardpackage is installed; otherwise it is a silent no-op. - Weights & Biases:
pip install -e ".[extras]", then setwandb: true(and optionallywandb_project) in the pretrain config, or pass--wandbtopragmatiq pretrain.
Notebooks
The notebooks are the guided tour; each one runs top to bottom on CPU.
| Notebook | One line |
|---|---|
01_quickstart_and_data.ipynb |
Generate a synthetic banking book with the causal agent-based simulator and explore why its labels are learnable without leakage. |
02_tokenize_and_embed.ipynb |
Fit the key-value-time tokenizer, shard the book, and embed users from shards or from plain Python dicts. |
03_finetune_and_probe.ipynb |
Turn embeddings into task models two ways: a fast linear probe and a LoRA fine-tune. |
04_aml_gnn.ipynb |
Run the AML transfer-graph ablation and unpack the relational-recovery result, including its limitations. |
Development
pip install -e ".[dev]"
Run the fast local checks before opening a PR:
ruff check .
mypy pragmatiq
pytest tests/ -x -q
Every change is gated by tests in CI (lint, types, full suite on Python 3.11
and 3.12, plus an end-to-end CPU run). Maintainers can run the heavier
full-validation orchestrator —
scripts/gates/run_full_validation.sh
— before releases; set PRAGMATIQ_GATE_FULL=1 for full-scale runs. See
CONTRIBUTING.md for the development workflow.
Citation
If you use pragmatiq, cite this repository and the PRAGMA paper that inspired it:
- Software citation metadata:
CITATION.cff - Reference paper: PRAGMA: Revolut Foundation Model, arXiv:2604.08649
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pragmatiq-0.1.0b3.tar.gz.
File metadata
- Download URL: pragmatiq-0.1.0b3.tar.gz
- Upload date:
- Size: 200.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
06f1b0bd2693765b52aa5a23a4327011904932c2a547d74e1f0d72e22a1442a1
|
|
| MD5 |
2e691a5e7380fd28624c002159dc7330
|
|
| BLAKE2b-256 |
e5ac4b50e529b67a25d78a10e21575d8b5e488fd12ef691d110c2d3898711000
|
Provenance
The following attestation bundles were made for pragmatiq-0.1.0b3.tar.gz:
Publisher:
release.yml on dynamiq-ai/pragmatiq
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pragmatiq-0.1.0b3.tar.gz -
Subject digest:
06f1b0bd2693765b52aa5a23a4327011904932c2a547d74e1f0d72e22a1442a1 - Sigstore transparency entry: 1864699903
- Sigstore integration time:
-
Permalink:
dynamiq-ai/pragmatiq@58f2e943ca0f16f6e9a4ce095b79d62ea2270e25 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/dynamiq-ai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@58f2e943ca0f16f6e9a4ce095b79d62ea2270e25 -
Trigger Event:
push
-
Statement type:
File details
Details for the file pragmatiq-0.1.0b3-py3-none-any.whl.
File metadata
- Download URL: pragmatiq-0.1.0b3-py3-none-any.whl
- Upload date:
- Size: 208.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
04be5f942dafacfacf5981531da21bc5ef56c7546babb15dc7be6713668708ab
|
|
| MD5 |
6828485f6472663ef37d44a59883df81
|
|
| BLAKE2b-256 |
7c0b33b5de9d283fd0073420b47f8fbfb3583d74e47aa4ffdaeb8f7399d61bca
|
Provenance
The following attestation bundles were made for pragmatiq-0.1.0b3-py3-none-any.whl:
Publisher:
release.yml on dynamiq-ai/pragmatiq
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pragmatiq-0.1.0b3-py3-none-any.whl -
Subject digest:
04be5f942dafacfacf5981531da21bc5ef56c7546babb15dc7be6713668708ab - Sigstore transparency entry: 1864699913
- Sigstore integration time:
-
Permalink:
dynamiq-ai/pragmatiq@58f2e943ca0f16f6e9a4ce095b79d62ea2270e25 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/dynamiq-ai
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@58f2e943ca0f16f6e9a4ce095b79d62ea2270e25 -
Trigger Event:
push
-
Statement type: