seqattn
A lightweight sequence-level attention abstraction library powered by flashinfer.
Overview
seqattn provides a minimal yet powerful wrapper around flashinfer's paged attention functionality, designed with the KISS (Keep It Simple, Stupid) principle. Instead of introducing new complex concepts, it offers clean abstractions for managing sequence-level KV cache operations.
Key Features
- Lightweight: Minimal overhead with clean, focused API
- Sequence-level abstraction: Manage attention at the sequence level rather than token level
- Paged KV cache: Efficient memory management with page-based allocation
- Reference counting: Safe memory sharing for prefix caching scenarios
- Head-wise operations: Support for head-wise paged attention patterns
- flashinfer integration: Built on top of the high-performance flashinfer library
Core Components
PagedKVCacheManager
Physical memory manager that handles:
- Page allocation and deallocation
- Reference counting for memory sharing
- Key-value cache storage with configurable layouts (NHD/HND)
- Direct integration with flashinfer's append operations
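A minimal sketch of direct page management is shown below. The constructor arguments mirror the quick-start example and the method names come from the API reference further down; the exact return type of allocate() and the precise release semantics are assumptions.
# Sketch of direct page management (method names from the API reference;
# the return type of allocate() is an assumption).
import torch
from seqattn import PagedKVCacheManager

cache_manager = PagedKVCacheManager(
    num_pages=256,
    page_size=16,
    num_heads=8,
    head_dim=64,
    dtype=torch.float16,
    device=torch.cuda.current_device()
)

pages = cache_manager.allocate(4)       # grab 4 physical pages
cache_manager.ref(pages)                # pin them for an extra consumer
cache_manager.unref(pages)              # drop that extra reference
cache_manager.release_pages(pages)      # free once the ref count reaches zero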
CacheDescriptor
Sequence-level coordinator that provides:
- Mapping from sequence IDs to page allocations
- Automatic page requirement calculation
- Batch operations for multiple sequences
- Packaging data for flashinfer consumption
FlashInferPackedData
Data structure containing all tensors required by flashinfer:
- Page indices and pointers
- Last page lengths for each sequence
- Device transfer utilities
Installation
pip install seqattn
FlashInfer Installation
Important: FlashInfer has complex distribution requirements and is not included as a direct dependency due to:
- PyTorch/CUDA Version Compatibility: FlashInfer requires specific PyTorch and CUDA version combinations
- Multiple Installation Channels: Different installation methods for different environments
- Hardware Requirements: Only supports specific GPU architectures (sm75, sm80, sm86, sm89, sm90)
For more information, see the flashinfer installation page.
Please install FlashInfer separately according to your environment:
Option 1 - Prebuilt wheels (Recommended):
# For PyTorch 2.6 + CUDA 12.6
pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/
# For other combinations, see: https://docs.flashinfer.ai/installation.html
Option 2 - JIT version from PyPI:
pip install flashinfer-python
Option 3 - From source:
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install --no-build-isolation --verbose .
Check your PyTorch CUDA version with:
python -c "import torch; print(torch.version.cuda)"
Quick Start
import torch
from seqattn import PagedKVCacheManager, CacheDescriptor
# Initialize cache manager
cache_manager = PagedKVCacheManager(
num_pages=1024,
page_size=16,
num_heads=32,
head_dim=128,
dtype=torch.float16,
device=torch.cuda.current_device()
)
# Create sequence descriptor
descriptor = CacheDescriptor(cache_manager)
# Allocate for sequences
seq_ids = [1, 2, 3]
seq_lengths = [100, 150, 80]
descriptor.allocate(seq_ids, seq_lengths)
# Pack for flashinfer
flashinfer_data = descriptor.pack_for_flashinfer(seq_ids)
# Use with your attention computation...
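As a rough continuation, here is a hedged sketch of appending new key/value tensors and releasing the sequences. The [total_tokens, num_heads, head_dim] layout and the CPU indptr convention are assumptions modeled on flashinfer's ragged append format, not documented seqattn behavior.
# Sketch: append new KV pairs, then release sequences when finished.
# Tensor layout [total_tokens, num_heads, head_dim] and the CSR-style
# append_indptr_cpu ([0, 100, 250, 330] for lengths 100/150/80) are assumptions.
total_tokens = sum(seq_lengths)
keys = torch.randn(total_tokens, 32, 128, dtype=torch.float16, device="cuda")
values = torch.randn_like(keys)
append_indptr_cpu = torch.tensor([0, 100, 250, 330], dtype=torch.int32)

cache_manager.append_kv(keys, values, flashinfer_data, append_indptr_cpu)

# Release the sequences (and their pages) once they are done
descriptor.release(seq_ids)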
Advanced Usage
Reference Counting for Prefix Caching
# Share pages between sequences with common prefixes
shared_pages = [0, 1, 2] # Pages containing shared prefix
cache_manager.ref(shared_pages) # Increment reference count
# Multiple sequences can now safely reference these pages
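The reverse path is sketched below under assumptions: per the API reference, unref decrements the count and pages are only freed once it reaches zero.
# Sketch: when a sequence that shared the prefix finishes, drop its reference.
# Pages are freed only after every reference is gone (whether release_pages
# must also be called explicitly afterwards is an assumption).
cache_manager.unref(shared_pages)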
Head-wise Operations
from seqattn import HeadIDGenerator
# Generate unique head IDs for head-wise attention
head_gen = HeadIDGenerator(num_kv_heads=32)
head_id = head_gen.get_head_id(seq_id=1, head_idx=5)
# Use head IDs exactly as you would sequence IDs (e.g. in CacheDescriptor calls).
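A hedged sketch of that last comment: head IDs stand in for sequence IDs in the CacheDescriptor calls shown earlier (the per-head lengths below are purely illustrative).
# Sketch: treat head IDs like sequence IDs when allocating and packing.
head_ids = [head_gen.get_head_id(seq_id=1, head_idx=i) for i in range(32)]
descriptor.allocate(head_ids, [64] * len(head_ids))   # illustrative lengths
packed = descriptor.pack_for_flashinfer(head_ids)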
API Reference
PagedKVCacheManager
- allocate(num_pages): Allocate pages and return indices
- ref(page_indices): Increment reference count for pages
- unref(page_indices): Decrement reference count
- release_pages(page_indices): Release pages when ref count reaches zero
- append_kv(keys, values, flashinfer_data, append_indptr_cpu): Append KV pairs
CacheDescriptor
- allocate(seq_ids, seq_new_lens): Allocate pages for sequences
- allocate_decoding(seq_ids): Allocate for single-token decoding
- release(seq_ids): Release sequences and their pages
- pack_for_flashinfer(seq_ids): Pack data for flashinfer consumption
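A hedged sketch of a single decoding step using these methods:
# Sketch of one decode step: grow each active sequence by one token,
# repack for flashinfer, and release everything when generation ends.
descriptor.allocate_decoding(seq_ids)
step_data = descriptor.pack_for_flashinfer(seq_ids)
# ... run paged attention with step_data ...
descriptor.release(seq_ids)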
Requirements
- Python >= 3.10
- torch
- numpy
- attrs
- flashinfer (install separately)
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.