Dynamic Sparse Attention with Landmark Tokens — High-performance Triton implementation

DSALT: Dynamic Sparse Attention with Landmark Tokens

DSALT is a high‑performance PyTorch library that implements Dynamic Sparse Attention with Landmark Tokens – a memory‑efficient attention mechanism for transformers. It relies on Triton kernels and supports distributed training.

Install: pip install dsalt
Source: https://github.com/LeonardoCofone/dsalt-pytorch
Paper: https://zenodo.org/records/19312826
Feature guide: https://github.com/LeonardoCofone/dsalt-pytorch/blob/main/FEATURE.md

🚀 Key Features

Memory‑efficient sparse attention – Triton‑accelerated kernels provide 4–8× memory savings compared to dense attention.
Adaptive local windows – Token‑wise window sizes that grow with sequence position.
Global landmark tokens – Top‑k informative tokens per head selected via a hybrid energy scoring function.
Production‑ready training – Mixed‑precision, gradient checkpointing, and validation support.
Distributed training – Full DDP and FSDP support for multi‑GPU setups.
Numerical verification – CPU/GPU equivalence tests and gradient stability checks.
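
To make the first three features concrete, the sketch below builds the set of key positions each query may attend to: a causal local window plus a few landmark positions. It is a plain-Python illustration, not DSALT's implementation. The linear growth rule in `window_size` and both helper names are assumptions made for this example; the library itself learns window sizes and picks landmarks with its own scoring function.

```python
def window_size(pos, seq_len, n_min, n_max):
    """Hypothetical rule: grow linearly from n_min to n_max along the sequence."""
    if seq_len <= 1:
        return n_min
    frac = pos / (seq_len - 1)
    return round(n_min + frac * (n_max - n_min))


def attended_keys(pos, seq_len, n_min, n_max, landmarks):
    """Key indices visible to the query at `pos`: local window plus past landmarks."""
    w = window_size(pos, seq_len, n_min, n_max)
    local = set(range(max(0, pos - w + 1), pos + 1))  # causal window ending at pos
    global_ = {j for j in landmarks if j <= pos}       # never look at future positions
    return sorted(local | global_)


seq_len, landmarks = 16, [0, 5]
for pos in (0, 8, 15):
    print(pos, attended_keys(pos, seq_len, n_min=2, n_max=4, landmarks=landmarks))
```

Each query touches at most `n_max + len(landmarks)` keys instead of every earlier position, which is where the memory saving of a sparse pattern comes from.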

🛠️ Installation

Requirements

Python 3.8+
PyTorch 2.0+
CUDA 11.0+ (GPU) – CPU fallback is available
Triton 2.0+ (optional, enables GPU kernels)
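
A quick check of these requirements can be done with the standard library alone. The helper name `check_environment` is illustrative and not part of DSALT, and the sketch does not show how DSALT itself chooses between the Triton kernels and the CPU fallback.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Report whether Python is new enough and which packages are importable."""
    return {
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when a top-level package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "triton_installed": importlib.util.find_spec("triton") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

This only tests whether the packages can be found; it does not compare installed versions against the minimums listed above.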

From PyPI

pip install dsalt

From source

git clone https://github.com/LeonardoCofone/dsalt-pytorch.git
cd dsalt-pytorch
pip install -e .

Development setup

pip install -r requirements-dev.txt

🚀 Quick Start

1. Language‑model inference

import torch
from dsalt.model import DSALTLMHeadModel

model = DSALTLMHeadModel(
    vocab_size=32000,
    d_model=1024,
    n_layers=24,
    n_heads=16,
    n_min=32,
    n_max=512,
    k_lmk=64,
)

input_ids = torch.randint(0, 32000, (1, 1024))  # [batch, seq_len]
logits = model(input_ids)                     # [1, 1024, 32000]
print(logits.shape)

# With labels – loss is computed internally
labels = torch.randint(0, 32000, (1, 1024))
outputs = model(input_ids, labels=labels)
loss = outputs.loss
loss.backward()

2. Single‑GPU training

import torch
from torch.utils.data import DataLoader, TensorDataset
from dsalt.model import DSALTLMHeadModel
from dsalt.training import DSALTTrainer

vocab_size = 32000
seq_len = 512
train_dataset = TensorDataset(
    torch.randint(0, vocab_size, (1000, seq_len))
)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

model = DSALTLMHeadModel(
    vocab_size=vocab_size,
    d_model=768,
    n_layers=12,
    n_heads=12,
    n_min=32,
    n_max=256,
    k_lmk=32,
)

trainer = DSALTTrainer(
    model=model,
    train_loader=train_loader,
    lr=3e-4,
    total_steps=10_000,
    save_dir="checkpoints",
    dtype=torch.bfloat16,
    log_every=50,
)
trainer.train()
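
As a rough sanity check on total_steps, the arithmetic below relates it to the size of the toy dataset used above. It assumes one optimizer step per batch, no gradient accumulation, and that the trainer restarts the loader whenever it is exhausted; none of these is stated in this README.

```python
import math

num_samples = 1000      # rows in the toy TensorDataset above
batch_size = 8
total_steps = 10_000

# DataLoader keeps the final partial batch by default (drop_last=False), hence ceil.
steps_per_epoch = math.ceil(num_samples / batch_size)
epochs = total_steps / steps_per_epoch

print(f"{steps_per_epoch} steps per epoch, about {epochs:.0f} passes over the data")
```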

3. Multi‑GPU with DataParallel

import torch
import torch.nn as nn
from dsalt.model import DSALTLMHeadModel
from dsalt.training import DSALTTrainer

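# Constructor arguments are elided here; see the single-GPU example for a full set.
model = DSALTLMHeadModel(...).to("cuda")
model = nn.DataParallel(model)  # replicates the model on all visible GPUs

trainer = DSALTTrainer(
    model=model,
    train_loader=train_loader,  # DataLoader built as in the single-GPU example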
    lr=3e-4,
    total_steps=100_000,
    dtype=torch.bfloat16,
    save_dir="checkpoints",
)
trainer.train()

4. Multi‑GPU with FSDP (model sharding)

torchrun --nproc_per_node=2 train.py

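In train.py, configure the trainer with fsdp=True so that the model is sharded across the processes started by torchrun (two in the command above).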
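
torchrun passes each worker its position in the job through the environment variables RANK, LOCAL_RANK and WORLD_SIZE. The contents of train.py are not shown in this README, so the sketch below covers only that generic part: a hypothetical helper, `read_torchrun_env`, that a training script could call before building the model. How DSALTTrainer consumes these values is not covered here.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index on this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"process {info['rank']} of {info['world_size']} "
          f"(local rank {info['local_rank']})")
```

The defaults let the same script run as a single process without torchrun, which is convenient for debugging.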

🏗️ Architecture Overview

DSALT combines local causal windows (adaptive per token) with global landmark tokens (top‑k per head):

┌─ Local window (adaptive) ──┬─ Global landmarks ──┐
│ Recent N tokens            │ Top‑K informative   │
│ (window size grows)        │ tokens per head     │
└────────────────────────────┴─────────────────────┘
              ↓                        ↓
              Sparse attention output

Key components:

DSALTAttention – multi‑head sparse attention with adaptive windows and landmark selection.
WindowSizePredictor – learns per‑token window sizes.
HybridEnergyScorer (kernel) – computes landmark scores.
DSALTTransformer – stack of attention + feed‑forward layers.
Triton kernels – fused forward and backward passes for speed and memory efficiency.
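
Landmark selection amounts to a per-head top-k over some score. The sketch below shows only that selection step in plain Python. The score used here (squared L2 norm of each key) is a stand-in chosen for illustration, since the hybrid energy function is not specified in this README, and `select_landmarks` is a hypothetical helper rather than a DSALT API.

```python
import heapq


def select_landmarks(keys_per_head, k):
    """For each head, return the k key positions with the largest stand-in score."""
    selected = []
    for keys in keys_per_head:
        # Stand-in score (an assumption): squared L2 norm of each key vector.
        scores = [sum(x * x for x in key) for key in keys]
        top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
        selected.append(sorted(top))
    return selected


heads = [
    [[0.1, 0.0], [2.0, 1.0], [0.0, 0.3], [1.0, -1.5]],  # head 0: four 2-d keys
    [[1.0, 1.0], [0.0, 0.1], [0.5, 0.5], [0.2, 0.0]],   # head 1
]
print(select_landmarks(heads, k=2))
```

Because the selection runs independently for each head, different heads can keep different global positions.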

🎯 Training & Generation

The Quick Start examples above show complete training setups. DSALTTrainer handles:

Mixed‑precision (BF16 default)
Gradient checkpointing
Learning‑rate warm‑up and cosine decay
Optional window‑entropy regularisation (window_reg_coef)
Checkpointing and logging utilities
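
A warm-up plus cosine-decay schedule of the kind listed above can be written in a few lines. This is a generic sketch, not the trainer's code: `lr_at_step` is a hypothetical name, and the `warmup_steps` and `min_lr` defaults are assumptions, while `base_lr` and `total_steps` reuse the Quick Start values.

```python
import math


def lr_at_step(step, base_lr=3e-4, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1 after total_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at_step(step):.2e}")
```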

📚 API Reference (excerpt)

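import torch
from dsalt.model import DSALTLMHeadModel

model = DSALTLMHeadModel(vocab_size=32000, d_model=1024, n_layers=24,
                         n_heads=16, n_min=32, n_max=512, k_lmk=64)
input_ids = torch.randint(0, 32000, (1, 1024))  # [batch, seq_len]
# With return_window=True the model returns the windows alongside the logits.
logits, windows = model(input_ids, return_window=True)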

Low‑level kernel call:

from dsalt.kernels import dsalt_attention
out = dsalt_attention(Q, K, V, window_sizes, landmark_idx)
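
For intuition about what such a call computes, here is a slow single-head reference in plain Python. It is not the library's kernel and rests on several assumptions: Q, K and V are lists of vectors for one head, window_sizes[i] is the number of most recent positions (including i) visible to query i, and landmark_idx is a list of key positions shared by all queries. The real tensor layouts and argument semantics may differ.

```python
import math


def reference_sparse_attention(Q, K, V, window_sizes, landmark_idx):
    """Query i attends to its causal window plus any landmark at or before i."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        w = max(1, window_sizes[i])          # every query sees at least itself
        local = set(range(max(0, i - w + 1), i + 1))
        allowed = sorted(local | {j for j in landmark_idx if j <= i})
        logits = [scale * sum(a * b for a, b in zip(q, K[j])) for j in allowed]
        m = max(logits)                      # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        out.append([sum(wt * V[j][d] for wt, j in zip(weights, allowed)) / total
                    for d in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(reference_sparse_attention(Q, K, V, window_sizes=[1, 2, 2], landmark_idx=[0]))
```

With every window set to 1 and no landmarks, each query sees only itself and the output equals V, which makes a convenient smoke test.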

📊 Performance & Benchmarks (May 2026)

Attention type	Approx. memory (GB)	Memory relative to dense
Dense (O(N²))	~3.5	1.0×
FlashAttention 2	~1.8	0.5×
DSALT	~0.6	0.17×
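
The ratios implied by the table can be checked with a few lines of arithmetic. The memory values are copied from the table; nothing here re-runs the benchmark. The DSALT row works out to roughly 5.8 times less memory than dense, which falls inside the 4–8× range quoted under Key Features, and the last column of the table matches each row's memory divided by the dense figure.

```python
memory_gb = {"Dense": 3.5, "FlashAttention 2": 1.8, "DSALT": 0.6}

dense = memory_gb["Dense"]
for name, gb in memory_gb.items():
    relative = gb / dense   # fraction of the dense footprint
    saving = dense / gb     # how many times smaller than dense
    print(f"{name:<18} {relative:.2f}x of dense, {saving:.1f}x saving")
```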

📖 Hyperparameter Guide

All hyperparameters are documented in FEATURE.md. Typical configurations are provided for:

Mobile / Edge – tiny models, low memory.
Consumer GPU – e.g., RTX 4090, 24 GB.
Enterprise – H100 80 GB, optional FSDP.
Research – multi‑node, large models.

🧪 Testing

make test-cov          # Full test suite with coverage report
pytest tests/ -v       # Run tests directly

Key test modules:

tests/test_sparse_attn.py – kernel equivalence and backward.
tests/test_hybrid_energy.py – landmark scoring.
tests/test_dsalt_lm.py – language‑model wrapper.
tests/test_main.py – end‑to‑end smoke test.

📄 License

Apache License 2.0.

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines. Areas where help is especially valuable:

Triton kernel optimisation
New model architectures (encoder, encoder‑decoder)
Additional training strategies and samplers
Documentation and tutorials
Bug reports and fixes

📞 Support & Questions

Issues: https://github.com/LeonardoCofone/dsalt-pytorch/issues
Discussions: https://github.com/LeonardoCofone/dsalt-pytorch/discussions
Paper: https://zenodo.org/records/19312826

Last Updated: May 2026


