
seqattn

Chinese README (中文版)

A lightweight sequence-level attention abstraction library powered by flashinfer.

Overview

seqattn provides a minimal yet powerful wrapper around flashinfer's paged attention functionality, designed with the KISS (Keep It Simple, Stupid) principle. Instead of introducing new complex concepts, it offers clean abstractions for managing sequence-level KV cache operations.

Key Features

  • Lightweight: Minimal overhead with clean, focused API
  • Sequence-level abstraction: Manage attention at the sequence level rather than token level
  • Paged KV cache: Efficient memory management with page-based allocation
  • Reference counting: Safe memory sharing for prefix caching scenarios
  • Head-wise operations: Support for head-wise paged attention patterns
  • flashinfer integration: Built on top of the high-performance flashinfer library

Core Components

PagedKVCacheManager

Physical memory manager that handles:

  • Page allocation and deallocation
  • Reference counting for memory sharing
  • Key-value cache storage with configurable layouts (NHD/HND)
  • Direct integration with flashinfer's append operations
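The allocation and reference-counting behavior can be illustrated with a minimal free-list sketch. This is an assumption about the general technique, not seqattn's actual implementation; the class and field names here are hypothetical.

```python
class PagePool:
    """Minimal free-list page allocator with reference counts (illustrative only)."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))   # indices of unused pages
        self.refs = [0] * num_pages          # per-page reference counts

    def allocate(self, n: int) -> list[int]:
        # Pop n pages from the free list; each starts with one reference
        pages = [self.free.pop() for _ in range(n)]
        for p in pages:
            self.refs[p] = 1
        return pages

    def ref(self, pages: list[int]) -> None:
        for p in pages:
            self.refs[p] += 1

    def unref(self, pages: list[int]) -> None:
        # A page returns to the free list only when its last reference drops
        for p in pages:
            self.refs[p] -= 1
            if self.refs[p] == 0:
                self.free.append(p)
```

Reference counting is what makes prefix caching safe: two sequences can point at the same physical pages, and the pages are reclaimed only after both have released them.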

CacheDescriptor

Sequence-level coordinator that provides:

  • Mapping from sequence IDs to page allocations
  • Automatic page requirement calculation
  • Batch operations for multiple sequences
  • Packaging data for flashinfer consumption
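The automatic page-requirement calculation is ceiling division over the page size; a one-line sketch (the function name is hypothetical, not part of seqattn's API):

```python
def pages_needed(seq_len: int, page_size: int) -> int:
    # Number of fixed-size pages required to hold seq_len tokens
    return (seq_len + page_size - 1) // page_size
```

For example, with `page_size=16`, a 100-token sequence needs 7 pages and an 80-token sequence needs exactly 5.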

FlashInferPackedData

Data structure containing all tensors required by flashinfer:

  • Page indices and pointers
  • Last page lengths for each sequence
  • Device transfer utilities
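To make the layout concrete, here is a hedged sketch of how such tensors might be derived from sequence lengths, in plain Python. The names `kv_indptr` and `last_page_len` are assumptions about the layout, not seqattn's confirmed attribute names.

```python
def pack_page_metadata(seq_lens: list[int], page_size: int):
    """Compute a CSR-style indptr over each sequence's pages and the
    number of valid tokens in each sequence's last page."""
    kv_indptr = [0]
    last_page_len = []
    for n in seq_lens:
        num_pages = (n + page_size - 1) // page_size
        kv_indptr.append(kv_indptr[-1] + num_pages)
        # Tokens occupying the final page (page_size when it is exactly full)
        last_page_len.append(n - (num_pages - 1) * page_size)
    return kv_indptr, last_page_len
```

For sequence lengths `[100, 150, 80]` and `page_size=16`, this yields an indptr of `[0, 7, 17, 22]` and last-page lengths `[4, 6, 16]`.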

Installation

pip install seqattn

FlashInfer Installation

Important: FlashInfer has complex distribution requirements and is not included as a direct dependency due to:

  1. PyTorch/CUDA Version Compatibility: FlashInfer requires specific PyTorch and CUDA version combinations
  2. Multiple Installation Channels: Different installation methods for different environments
  3. Hardware Requirements: Only supports specific GPU architectures (sm75, sm80, sm86, sm89, sm90)

For full details, see flashinfer's installation documentation.

Please install FlashInfer separately according to your environment:

Option 1 - Prebuilt wheels (Recommended):

# For PyTorch 2.6 + CUDA 12.6
pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/

# For other combinations, see: https://docs.flashinfer.ai/installation.html

Option 2 - JIT version from PyPI:

pip install flashinfer-python

Option 3 - From source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install --no-build-isolation --verbose .

Check your PyTorch CUDA version with:

python -c "import torch; print(torch.version.cuda)"

Quick Start

import torch
from seqattn import PagedKVCacheManager, CacheDescriptor

# Initialize cache manager
cache_manager = PagedKVCacheManager(
    num_pages=1024,
    page_size=16,
    num_heads=32,
    head_dim=128,
    dtype=torch.float16,
    device=torch.cuda.current_device()
)

# Create sequence descriptor
descriptor = CacheDescriptor(cache_manager)

# Allocate for sequences
seq_ids = [1, 2, 3]
seq_lengths = [100, 150, 80]
descriptor.allocate(seq_ids, seq_lengths)

# Pack for flashinfer
flashinfer_data = descriptor.pack_for_flashinfer(seq_ids)

# Use with your attention computation...

Advanced Usage

Reference Counting for Prefix Caching

# Share pages between sequences with common prefixes
shared_pages = [0, 1, 2]  # Pages containing shared prefix
cache_manager.ref(shared_pages)  # Increment reference count

# Multiple sequences can now safely reference these pages

Head-wise Operations

from seqattn import HeadIDGenerator

# Generate unique head IDs for head-wise attention
head_gen = HeadIDGenerator(num_kv_heads=32)
head_id = head_gen.get_head_id(seq_id=1, head_idx=5)
# Use head IDs as if they were sequence IDs.
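One plausible encoding, shown purely as an assumption about how distinct per-head IDs could be derived (seqattn's actual `HeadIDGenerator` scheme may differ):

```python
def make_head_id(seq_id: int, head_idx: int, num_kv_heads: int) -> int:
    # Hypothetical encoding: give each (sequence, head) pair its own slot,
    # so head IDs never collide across sequences
    return seq_id * num_kv_heads + head_idx
```

Each resulting ID can then be fed to the descriptor wherever a sequence ID is expected, giving every KV head its own independent page allocation.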

API Reference

PagedKVCacheManager

  • allocate(num_pages): Allocate pages and return indices
  • ref(page_indices): Increment reference count for pages
  • unref(page_indices): Decrement reference count
  • release_pages(page_indices): Release pages when ref count reaches zero
  • append_kv(keys, values, flashinfer_data, append_indptr_cpu): Append KV pairs
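The `append_indptr_cpu` argument is presumably a CSR-style prefix sum over the number of new tokens per sequence; a sketch of that construction (an assumption about the expected layout, not confirmed by the source):

```python
def build_append_indptr(new_token_counts: list[int]) -> list[int]:
    # Prefix sum: indptr[i] is the offset of sequence i's tokens in the
    # flattened key/value tensors being appended
    indptr = [0]
    for n in new_token_counts:
        indptr.append(indptr[-1] + n)
    return indptr
```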

CacheDescriptor

  • allocate(seq_ids, seq_new_lens): Allocate pages for sequences
  • allocate_decoding(seq_ids): Allocate for single-token decoding
  • release(seq_ids): Release sequences and their pages
  • pack_for_flashinfer(seq_ids): Pack data for flashinfer consumption
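For `allocate_decoding`, a sequence growing by one token needs a fresh page only when its last page is already full; a minimal sketch of that check (a hypothetical helper, not the library's API):

```python
def needs_new_page(cur_len: int, page_size: int) -> bool:
    # The next decoded token spills onto a new page exactly when the
    # current length is a multiple of the page size
    return cur_len % page_size == 0
```

With `page_size=16`, a sequence at length 16 needs a new page for its 17th token, while a sequence at length 100 still has room on its partially filled last page.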

Requirements

  • Python >= 3.10
  • torch
  • numpy
  • attrs
  • flashinfer (install separately)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
