Dynamic Sparse Attention with Landmark Tokens - High-performance Triton implementation

DSALT: Dynamic Sparse Attention with Landmark Tokens

DSALT is a PyTorch library implementing Dynamic Sparse Attention with Landmark Tokens, a memory-efficient attention mechanism for transformers. Each query attends to an adaptive local causal window plus a small set of global landmark tokens rather than to every preceding token, which avoids the O(N²) cost of dense attention. On CUDA it runs custom Triton kernels; everywhere else it falls back to a masked SDPA path, so the package stays importable and runnable on any platform (including CPU and Windows).

Install: pip install dsalt
Source: https://github.com/LeonardoCofone/dsalt-library
Paper: https://zenodo.org/records/19312826

✨ Key Features

Sparse attention: local adaptive window ∪ top-k landmark tokens per head.
GPU-portable: Triton kernels on CUDA, transparent SDPA fallback otherwise; correct AMP autodetect across GPU generations (bf16 only where natively supported, fp16 on T4-class cards).
One-shot autotune: Triton block sizes are benchmarked once per (head_dim, GPU) at the first launch, then reused for the whole run; heuristic fallback if benchmarking is impossible.
Packed-sequence training: concatenated sequences + cu_seqlens, fused forward/backward with online softmax.
Fused cross-entropy: optional Liger fused linear cross-entropy, or a memory-frugal chunked pure-PyTorch loss.
DDP training: single- and multi-GPU via DistributedDataParallel, gradient accumulation, cosine schedule with warm-up, checkpointing, and rich representation-health diagnostics.
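
The one-shot autotune idea can be pictured as a cache keyed by (head_dim, GPU): benchmark the candidates on first use, remember the winner, and fall back to a heuristic if timing fails. The sketch below is illustrative only. `pick_block_size`, the candidate list, the `benchmark` callback, and the fallback rule are all hypothetical and are not DSALT's actual internals.

```python
from typing import Callable, Sequence

# Hypothetical cache: (head_dim, gpu_name) -> chosen block size.
_BLOCK_CACHE: dict[tuple[int, str], int] = {}


def pick_block_size(
    head_dim: int,
    gpu_name: str,
    candidates: Sequence[int],
    benchmark: Callable[[int], float],
) -> int:
    """Benchmark once per (head_dim, gpu_name), then reuse the result."""
    key = (head_dim, gpu_name)
    if key in _BLOCK_CACHE:
        return _BLOCK_CACHE[key]
    try:
        # benchmark(block) returns a runtime in seconds; smaller is better.
        timings = {block: benchmark(block) for block in candidates}
        best = min(timings, key=timings.get)
    except Exception:
        # Assumed heuristic fallback when benchmarking is impossible.
        best = 64 if head_dim <= 64 else 32
    _BLOCK_CACHE[key] = best
    return best
```

Because the result is cached per key, the benchmarking cost is paid only on the first call for a given head dimension and device.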

🛠️ Installation

Requirements

Python 3.10+ (the codebase uses X | None / tuple[...] syntax)
PyTorch 2.0+
CUDA 11.0+ for the GPU path (CPU fallback always available)
Triton 2.0+ (optional; enables the GPU kernels, Linux/CUDA)
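
Since Triton is optional, it can help to check which requirements are present before training. The snippet below is a generic standard-library check, not a DSALT API; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is installed, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for name in ("torch", "triton"):
        print(f"{name}: {'found' if has_module(name) else 'missing'}")
```

If `triton` is reported missing, the package is expected to use the SDPA fallback described above.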

From PyPI

pip install dsalt                 # core
pip install "dsalt[triton]"       # + Triton GPU kernels
pip install "dsalt[dev]"          # + lint/type/test tooling

From source

git clone https://github.com/LeonardoCofone/dsalt-library.git
cd dsalt-library
pip install -e .

🚀 Quick Start

Inference

import torch
from dsalt.model import DSALTLMHeadModel

model = DSALTLMHeadModel(
    vocab_size=32000,
    d_model=1024,
    n_layers=24,
    n_heads=16,
    n_min=32,
    n_max=512,
    k_lmk=64,
    max_seq_len=2048,   # required
)

input_ids = torch.randint(0, 32000, (1, 1024))   # [batch, seq_len]
out = model(input_ids)                           # dict
logits = out["logits"]                           # [1, 1024, 32000]
print(logits.shape)

Computing the loss

When labels are given, the forward pass computes the loss internally (fused) and returns a dict with the keys "loss", "logits", and "aux_loss":

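# Reuses `model` and `input_ids` from the inference example above.
labels = torch.randint(0, 32000, (1, 1024))   # random targets, [batch, seq_len]
out  = model(input_ids, labels=labels)
loss = out["loss"]          # out["logits"] is None here (loss is fused)
loss.backward()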
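
To illustrate what a chunked cross-entropy does, here is a minimal pure-Python sketch: logits are produced a few positions at a time and positions labelled -100 are skipped. It is a conceptual model only, assuming a plain linear LM head; `chunked_lm_loss` and its arguments are hypothetical names, not the library's implementation.

```python
import math


def chunked_lm_loss(hidden, weight, labels, chunk_size=2, ignore_index=-100):
    """Mean cross-entropy over non-ignored tokens, computed chunk by chunk.

    hidden: list of T vectors (length d); weight: list of V vectors (length d);
    labels: list of T class ids, with ignore_index marking skipped positions.
    """
    total, count = 0.0, 0
    for start in range(0, len(labels), chunk_size):
        chunk_h = hidden[start:start + chunk_size]
        chunk_y = labels[start:start + chunk_size]
        for h, y in zip(chunk_h, chunk_y):
            if y == ignore_index:
                continue  # ignored positions contribute nothing
            # Logits for this position only: one dot product per vocab entry.
            logits = [sum(a * b for a, b in zip(h, w)) for w in weight]
            m = max(logits)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(x - m) for x in logits))
            total += lse - logits[y]
            count += 1
    return total / max(count, 1)


hidden = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]   # vocabulary of 2, d = 2
labels = [0, 1, -100]               # last position is ignored
loss = chunked_lm_loss(hidden, weight, labels)
```

In a tensor implementation the saving comes from never materialising the full [tokens, vocab] logit matrix; only one chunk of logits exists at a time.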

Building from a config

from dsalt.model import DSALTConfig, DSALTLMHeadModel

cfg = DSALTConfig(
    vocab_size=50257, d_model=512, n_layers=6, n_heads=8,
    n_min=64, n_max=256, k_lmk=16, max_seq_len=1024,
)
model = DSALTLMHeadModel.from_config(cfg)
cfg.save("config.json")               # reload with DSALTConfig.load(...)

🏗️ Architecture Overview

DSALT combines a per-token local causal window with global landmark tokens selected per head:

┌─ Local window (adaptive) ──┬─ Global landmarks ──┐
│  Recent tokens up to       │  Top-k informative  │
│  window size               │  tokens per head    │
└────────────────────────────┴─────────────────────┘
                 ↓                      ↓
              Sparse attention output  (W(i) ∪ L(i))
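
The union in the diagram can be made concrete with a small pure-Python sketch. It assumes the local window includes the query position itself and that landmarks are filtered to stay causal. The scores are placeholders: DSALT's real landmark score is the hybrid-energy score from the paper, which is not reproduced here. `top_k_landmarks` and `attended_positions` are hypothetical helper names.

```python
import heapq


def top_k_landmarks(scores, k):
    """Indices of the k highest-scoring positions (scores are placeholders)."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def attended_positions(i, window, landmarks):
    """Key positions visible to query i: local causal window W(i) ∪ landmarks L(i)."""
    local = set(range(max(0, i - window + 1), i + 1))    # assumed to include i itself
    causal_landmarks = {j for j in landmarks if j <= i}  # never look ahead
    return sorted(local | causal_landmarks)


scores = [0.9, 0.1, 0.8, 0.2, 0.1, 0.3, 0.0, 0.1, 0.2, 0.1, 0.4, 0.95]
landmarks = top_k_landmarks(scores, k=3)   # positions 0, 2, 11
visible = attended_positions(i=10, window=4, landmarks=landmarks)
```

Here query 10 sees positions 7 to 10 from its window plus landmarks 0 and 2; landmark 11 lies in the future and is dropped. Each query therefore touches at most `window + k` keys instead of all preceding positions.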

Components:

DSALTAttention: multi-head sparse attention over W(i) ∪ L(i) with RoPE/YaRN positions.
hybrid_scores_per_head: the single source of the hybrid-energy landmark score (§4.3), shared by both the SDPA path and the Triton kernel.
DSALTTransformerBlock / SwiGLUFFN: pre-norm block with a gated SwiGLU FFN.
DSALTLMHeadModel: embeddings + block stack + RMSNorm + (tied) LM head.
Triton kernels: fused forward (dsalt_triton_attention) and backward with online softmax and one-shot autotuned block sizes.
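
Online softmax, which the fused kernels rely on, computes a softmax-weighted sum in a single pass by keeping a running maximum and rescaling earlier partial sums whenever that maximum grows. The scalar sketch below shows the recurrence only; it is not the Triton kernel, and `online_softmax_weighted_sum` is a hypothetical name.

```python
import math


def online_softmax_weighted_sum(scores, values):
    """One-pass softmax(scores) · values using a running max and running sum.

    Expects non-empty, finite inputs of equal length.
    """
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    acc = 0.0           # sum of exp(score - running_max) * value seen so far
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale earlier partial results to the new reference maximum.
        scale = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        running_sum = running_sum * scale + weight
        acc = acc * scale + weight * v
        running_max = new_max
    return acc / running_sum
```

The result matches the usual two-pass softmax up to floating-point rounding, but the scores never need to be stored in full, which is what lets a kernel stream over key blocks.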

Note. The local window is frozen to (n_min + n_max) // 2 in this release (no learnable window predictor). The learned adaptivity is the per-head

🎯 Training

DSALTTrainer drives single- and multi-GPU (DDP) training. It expects packed batches: (input_ids, labels, cu_seqlens, max_seqlen), where cu_seqlens is an int32 offset tensor of shape [num_seqs + 1] and -100 labels are ignored.

from dsalt.model import DSALTLMHeadModel
from dsalt.training import DSALTTrainer

model = DSALTLMHeadModel(
    vocab_size=32000, d_model=768, n_layers=12, n_heads=12,
    n_min=32, n_max=256, k_lmk=32, max_seq_len=1024,
)

trainer = DSALTTrainer(
    model=model,
    train_loader=train_loader,   # yields (ids, labels, cu_seqlens, max_seqlen)
    val_loader=val_loader,
    lr=3e-4,
    total_steps=10_000,
    warmup_steps=1_000,
    mixed_precision="auto",      # bf16 on sm_80+, fp16 on T4-class, none on CPU
    save_dir="./checkpoints_dsalt",
    log_every=100,
)
trainer.train()
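
The packed layout the trainer expects can be sketched in plain Python: sequences are concatenated and `cu_seqlens` records the cumulative offsets, so it has `num_seqs + 1` entries. `pack_sequences` is a hypothetical helper, and the sketch builds only the ids and offsets; labels would follow the same flat layout.

```python
def pack_sequences(sequences):
    """Concatenate token-id lists and return (flat_ids, cu_seqlens, max_seqlen)."""
    flat_ids, cu_seqlens, max_seqlen = [], [0], 0
    for seq in sequences:
        flat_ids.extend(seq)
        cu_seqlens.append(len(flat_ids))   # running offset after this sequence
        max_seqlen = max(max_seqlen, len(seq))
    return flat_ids, cu_seqlens, max_seqlen


flat_ids, cu_seqlens, max_seqlen = pack_sequences([[5, 6, 7], [8, 9], [1, 2, 3, 4]])
# cu_seqlens == [0, 3, 5, 9]; sequence n is flat_ids[cu_seqlens[n]:cu_seqlens[n + 1]]
```

Before handing the offsets to the model they would be converted to a tensor, for example with `torch.tensor(cu_seqlens, dtype=torch.int32)`.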

Mixed precision

mixed_precision="auto" selects the dtype from the GPU's compute capability: bf16 on sm_80+ (A100/H100/L4/…), fp16 (with a GradScaler) below that (e.g. T4 sm_75), and no autocast on CPU. You can force "bf16", "fp16", or "none" explicitly.
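
The selection rule described above amounts to a small function of the compute capability. This is a sketch of the stated rule rather than the trainer's code; `resolve_mixed_precision` is a hypothetical name.

```python
from typing import Optional, Tuple


def resolve_mixed_precision(capability: Optional[Tuple[int, int]]) -> str:
    """Map a CUDA compute capability (major, minor) to 'bf16', 'fp16', or 'none'."""
    if capability is None:   # no CUDA device: run without autocast
        return "none"
    major, _minor = capability
    return "bf16" if major >= 8 else "fp16"   # sm_80+ has native bf16
```

In PyTorch, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple when `torch.cuda.is_available()` is true; pass `None` on CPU-only machines.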

Multi-GPU (DDP)

Launch one process per GPU and pass the distributed identity through; the trainer wraps the model in DistributedDataParallel when world_size > 1:

torchrun --nproc_per_node=2 your_train_script.py

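import os
# torchrun sets these environment variables for every process it launches.
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

trainer = DSALTTrainer(
    model=model, train_loader=train_loader, val_loader=val_loader,
    rank=rank, local_rank=local_rank, world_size=world_size,
    ddp_backend="nccl", total_steps=100_000,
)
trainer.train()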

Only DDP is supported in this release (no FSDP). The trainer also handles gradient accumulation, gradient clipping, cosine LR decay with warm-up, checkpointing (checkpoint_best/step_N/final.pt), and per-layer representation-health metrics. Resume with trainer.load_checkpoint(path).
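
For reference, a cosine schedule with warm-up is commonly written as below. This sketch assumes a linear warm-up from zero and a decay floor of `min_lr=0.0`; the trainer's exact formula may differ, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, base_lr=3e-4, warmup_steps=1_000, total_steps=10_000, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay to min_lr (assumed floor)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)   # hold at min_lr past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults shown, the rate climbs linearly for the first 1,000 steps, peaks at `base_lr`, and reaches the floor at step 10,000.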

📚 API Reference

# Top-level exports
from dsalt import (
    DSALTConfig, DSALTLMHeadModel,
    DSALTAttention, DSALTTransformerBlock, SwiGLUFFN,
    DSALTTrainer,
    dsalt_triton_attention,            # None when Triton is unavailable
    hybrid_scores_per_head,            # single source of the landmark score
    sparse_attention_forward, sparse_attention_forward_packed,
    RMSENorm, compute_window_sizes, apply_rotary_emb, build_rope_cache,
)

# Low-level Triton kernel (packed sequences, CUDA + Triton only)
from dsalt.kernels import dsalt_triton_attention
out = dsalt_triton_attention(q, k, v, lmk_indices, lmk_bias, w_sizes, cu_seqlens)

q, k, v are [total_len, n_heads, head_dim]; cu_seqlens is the int32 sequence-offset tensor. See FEATURE.md for the full signature and semantics of every component.

📖 Hyperparameter Guide

Full, source-verified defaults for DSALTLMHeadModel, DSALTConfig, DSALTAttention, and DSALTTrainer live in FEATURE.md. Highlights:

| Component | Required | Notable defaults |
|---|---|---|
| `DSALTLMHeadModel` | `vocab_size, d_model, n_layers, n_heads, n_min, n_max, k_lmk, max_seq_len` | `d_ff=None` (→ 8/3·d_model), `loss_fn="chunked"`, `tie_weights=True`, `yarn_scale=1.0` |
| `DSALTTrainer` | `model, train_loader, val_loader` | `lr=3e-4`, `max_grad_norm=0.5`, `warmup_steps=1000`, `mixed_precision="auto"` |

alpha is a learnable per-head parameter (init sigmoid ≈ 0.6), not a constructor flag. The auxiliary loss term is inert in this release (frozen window) and kept only for signature compatibility.

📄 License

Apache 2.0; see https://github.com/LeonardoCofone/dsalt-library/blob/main/LICENSE.

🤝 Contributing

Contributions are welcome; see CONTRIBUTING.md. Especially valuable are Triton kernel optimisation, new architectures (encoder / encoder-decoder), additional training strategies, documentation, and bug fixes.

Issues: https://github.com/LeonardoCofone/dsalt-library/issues
Discussions: https://github.com/LeonardoCofone/dsalt-library/discussions

📝 Citation

If you use DSALT in your research, please cite the paper:

@software{dsalt,
  author  = {Cofone, Leonardo},
  title   = {DSALT: Dynamic Sparse Attention with Landmark Tokens},
  url     = {https://github.com/LeonardoCofone/dsalt-library},
  note    = {https://zenodo.org/records/19312826},
}
