
seqattn

Chinese README (中文版)

A lightweight sequence-level attention abstraction library powered by flashinfer.

Overview

seqattn provides a minimal yet powerful wrapper around flashinfer's paged attention functionality, designed with the KISS (Keep It Simple, Stupid) principle. Instead of introducing new complex concepts, it offers clean abstractions for managing sequence-level KV cache operations.

Key Features

  • Lightweight: Minimal overhead with clean, focused API
  • Sequence-level abstraction: Manage attention at the sequence level rather than token level
  • Paged KV cache: Efficient memory management with page-based allocation
  • Reference counting: Safe memory sharing for prefix caching scenarios
  • Head-wise operations: Support for head-wise paged attention patterns
  • flashinfer integration: Built on top of the high-performance flashinfer library

Core Components

PagedKVCacheManager

Physical memory manager that handles:

  • Page allocation and deallocation
  • Reference counting for memory sharing
  • Key-value cache storage with configurable layouts (NHD/HND)
  • Direct integration with flashinfer's append operations
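The allocation and reference-counting behavior can be illustrated with a minimal free-list sketch. This is an assumption about the general technique, not seqattn's actual implementation; the class and field names here are hypothetical.

```python
class PagePool:
    """Minimal free-list page allocator with reference counts (illustrative only)."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))   # indices of unused pages
        self.refs = [0] * num_pages          # per-page reference counts

    def allocate(self, n: int) -> list[int]:
        # Pop n pages from the free list; each starts with one reference
        pages = [self.free.pop() for _ in range(n)]
        for p in pages:
            self.refs[p] = 1
        return pages

    def ref(self, pages: list[int]) -> None:
        for p in pages:
            self.refs[p] += 1

    def unref(self, pages: list[int]) -> None:
        # A page returns to the free list only when its last reference drops
        for p in pages:
            self.refs[p] -= 1
            if self.refs[p] == 0:
                self.free.append(p)
```

Reference counting is what makes prefix caching safe: two sequences can point at the same physical pages, and the pages are reclaimed only after both have released them.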

CacheDescriptor

Sequence-level coordinator that provides:

  • Mapping from sequence IDs to page allocations
  • Automatic page requirement calculation
  • Batch operations for multiple sequences
  • Packaging data for flashinfer consumption
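The automatic page-requirement calculation is ceiling division over the page size; a one-line sketch (the function name is hypothetical, not part of seqattn's API):

```python
def pages_needed(seq_len: int, page_size: int) -> int:
    # Number of fixed-size pages required to hold seq_len tokens
    return (seq_len + page_size - 1) // page_size
```

For example, with `page_size=16`, a 100-token sequence needs 7 pages and an 80-token sequence needs exactly 5.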

FlashInferPackedData

Data structure containing all tensors required by flashinfer:

  • Page indices and pointers
  • Last page lengths for each sequence
  • Device transfer utilities
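To make the layout concrete, here is a hedged sketch of how such tensors might be derived from sequence lengths, in plain Python. The names `kv_indptr` and `last_page_len` are assumptions about the layout, not seqattn's confirmed attribute names.

```python
def pack_page_metadata(seq_lens: list[int], page_size: int):
    """Compute a CSR-style indptr over each sequence's pages and the
    number of valid tokens in each sequence's last page."""
    kv_indptr = [0]
    last_page_len = []
    for n in seq_lens:
        num_pages = (n + page_size - 1) // page_size
        kv_indptr.append(kv_indptr[-1] + num_pages)
        # Tokens occupying the final page (page_size when it is exactly full)
        last_page_len.append(n - (num_pages - 1) * page_size)
    return kv_indptr, last_page_len
```

For sequence lengths `[100, 150, 80]` and `page_size=16`, this yields an indptr of `[0, 7, 17, 22]` and last-page lengths `[4, 6, 16]`.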

Installation

pip install seqattn

FlashInfer Installation

Important: FlashInfer has complex distribution requirements and is not included as a direct dependency due to:

  1. PyTorch/CUDA Version Compatibility: FlashInfer requires specific PyTorch and CUDA version combinations
  2. Multiple Installation Channels: Different installation methods for different environments
  3. Hardware Requirements: Only supports specific GPU architectures (sm75, sm80, sm86, sm89, sm90)

For full details, see flashinfer's installation documentation.

Please install FlashInfer separately according to your environment:

Option 1 - Prebuilt wheels (Recommended):

# For PyTorch 2.6 + CUDA 12.6
pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6/

# For other combinations, see: https://docs.flashinfer.ai/installation.html

Option 2 - JIT version from PyPI:

pip install flashinfer-python

Option 3 - From source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install --no-build-isolation --verbose .

Check your PyTorch CUDA version with:

python -c "import torch; print(torch.version.cuda)"

Quick Start

import torch
from seqattn import PagedKVCacheManager, CacheDescriptor

# Initialize cache manager
cache_manager = PagedKVCacheManager(
    num_pages=1024,
    page_size=16,
    num_heads=32,
    head_dim=128,
    dtype=torch.float16,
    device=torch.cuda.current_device()
)

# Create sequence descriptor
descriptor = CacheDescriptor(cache_manager)

# Allocate for sequences
seq_ids = [1, 2, 3]
seq_lengths = [100, 150, 80]
descriptor.allocate(seq_ids, seq_lengths)

# Pack for flashinfer
flashinfer_data = descriptor.pack_for_flashinfer(seq_ids)

# Use with your attention computation...

Advanced Usage

Reference Counting for Prefix Caching

# Share pages between sequences with common prefixes
shared_pages = [0, 1, 2]  # Pages containing shared prefix
cache_manager.ref(shared_pages)  # Increment reference count

# Multiple sequences can now safely reference these pages

Head-wise Operations

from seqattn import HeadIDGenerator

# Generate unique head IDs for head-wise attention
head_gen = HeadIDGenerator(num_kv_heads=32)
head_id = head_gen.get_head_id(seq_id=1, head_idx=5)
# Use head IDs as if they were sequence IDs.
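One plausible encoding, shown purely as an assumption about how distinct per-head IDs could be derived (seqattn's actual `HeadIDGenerator` scheme may differ):

```python
def make_head_id(seq_id: int, head_idx: int, num_kv_heads: int) -> int:
    # Hypothetical encoding: give each (sequence, head) pair its own slot,
    # so head IDs never collide across sequences
    return seq_id * num_kv_heads + head_idx
```

Each resulting ID can then be fed to the descriptor wherever a sequence ID is expected, giving every KV head its own independent page allocation.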

API Reference

PagedKVCacheManager

  • allocate(num_pages): Allocate pages and return indices
  • ref(page_indices): Increment reference count for pages
  • unref(page_indices): Decrement reference count
  • release_pages(page_indices): Release pages when ref count reaches zero
  • append_kv(keys, values, flashinfer_data, append_indptr_cpu): Append KV pairs
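The `append_indptr_cpu` argument is presumably a CSR-style prefix sum over the number of new tokens per sequence; a sketch of that construction (an assumption about the expected layout, not confirmed by the source):

```python
def build_append_indptr(new_token_counts: list[int]) -> list[int]:
    # Prefix sum: indptr[i] is the offset of sequence i's tokens in the
    # flattened key/value tensors being appended
    indptr = [0]
    for n in new_token_counts:
        indptr.append(indptr[-1] + n)
    return indptr
```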

CacheDescriptor

  • allocate(seq_ids, seq_new_lens): Allocate pages for sequences
  • allocate_decoding(seq_ids): Allocate for single-token decoding
  • release(seq_ids): Release sequences and their pages
  • pack_for_flashinfer(seq_ids): Pack data for flashinfer consumption
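For `allocate_decoding`, a sequence growing by one token needs a fresh page only when its last page is already full; a minimal sketch of that check (a hypothetical helper, not the library's API):

```python
def needs_new_page(cur_len: int, page_size: int) -> bool:
    # The next decoded token spills onto a new page exactly when the
    # current length is a multiple of the page size
    return cur_len % page_size == 0
```

With `page_size=16`, a sequence at length 16 needs a new page for its 17th token, while a sequence at length 100 still has room on its partially filled last page.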

Requirements

  • Python >= 3.10
  • torch
  • numpy
  • attrs
  • flashinfer (install separately)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
