Fake flash-attn package for T4 / CPU / compatibility

Project description

fake-flash-attention ⚡️ (Fake)

A drop-in, pure-Python shim for the flash-attn package. It redirects all FlashAttention calls to PyTorch's native scaled_dot_product_attention (SDPA).

Why is this necessary?

Modern Large Language Models (LLMs) and popular libraries (like Hugging Face Transformers) often have hard-coded dependencies on the flash-attn package. However, the official flash-attn library has strict requirements:

  • NVIDIA GPU only: FlashAttention-2 requires Ampere, Ada, or Hopper architectures (e.g., RTX 30/40, A100, H100).
  • No Support for Older GPUs: Common GPUs like the NVIDIA T4 (a Turing card, standard in Google Colab) or GTX 10-series cards cannot run the official FlashAttention-2 kernels.
  • No CPU Support: Official flash-attn cannot be installed or run in CPU-only environments.
  • Complex Compilation: The build process is heavy and requires specific CUDA toolkit versions.

fake-flash-attention solves this by:

  1. API Parity: It exports the exact same functions (e.g., flash_attn_func) so that libraries don't crash with an ImportError.
  2. Hardware Portability: It leverages PyTorch's scaled_dot_product_attention, which is highly optimized and works on T4, older GPUs, and CPUs.
  3. Instant Setup: It is a pure-Python package with no C++/CUDA compilation required.
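The redirection described above can be sketched as follows. This is a minimal illustration, not the package's actual source: the name sdpa_flash_attn_func is hypothetical, and it assumes PyTorch 2.1 or newer when softmax_scale is given (the scale keyword of scaled_dot_product_attention). The main work is a layout change, since flash-attn takes (batch, seqlen, nheads, headdim) while SDPA takes (batch, nheads, seqlen, headdim).

```python
import torch
import torch.nn.functional as F


def sdpa_flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False):
    """Hypothetical stand-in for flash_attn_func built on PyTorch SDPA.

    q, k, v use the flash-attn layout: (batch, seqlen, nheads, headdim).
    """
    # SDPA expects (batch, nheads, seqlen, headdim), so swap axes 1 and 2.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))

    # Only pass `scale` when requested; SDPA defaults to 1/sqrt(headdim).
    extra = {} if softmax_scale is None else {"scale": softmax_scale}
    out = F.scaled_dot_product_attention(
        q, k, v, dropout_p=dropout_p, is_causal=causal, **extra
    )

    # Swap back to the flash-attn layout before returning.
    return out.transpose(1, 2).contiguous()


q = torch.randn(2, 5, 3, 8)
k = torch.randn(2, 5, 3, 8)
v = torch.randn(2, 5, 3, 8)
out = sdpa_flash_attn_func(q, k, v, causal=True)
```

One caveat with this sketch: with is_causal=True, SDPA aligns the causal mask to the top-left corner, while flash-attn 2.1 and later align it to the bottom-right when query and key lengths differ. The two agree only when both lengths are equal, as in the example.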

Installation

pip install fake-flash-attention

Note: If installing from source:

pip install .

Usage

If a library or script requires flash-attn, install this package in its place. Existing imports of flash_attn then resolve to the shim, with no code changes:

import torch
from flash_attn import flash_attn_func

# flash-attn layout: (batch, seqlen, nheads, headdim)
q = torch.randn(1, 256, 12, 64)
k = torch.randn(1, 256, 12, 64)
v = torch.randn(1, 256, 12, 64)

# This now uses PyTorch SDPA under the hood!
output = flash_attn_func(q, k, v, causal=True)

Supported Features

  • flash_attn_func
  • flash_attn_varlen_func
  • flash_attn_qkvpacked_func / flash_attn_kvpacked_func
  • ✅ FlashAttention-2 API compatibility
  • ✅ Device-agnostic (CPU, CUDA, MPS)
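To illustrate what a variable-length entry point involves, here is a rough sketch of packed-sequence attention on top of SDPA. It is an assumption about how such a shim could work, not the package's code. The name sdpa_varlen_attn is hypothetical, and its signature is simplified: the real flash_attn_varlen_func takes separate cu_seqlens_q / cu_seqlens_k and max_seqlen arguments, whereas this sketch assumes self-attention with shared boundaries.

```python
import torch
import torch.nn.functional as F


def sdpa_varlen_attn(q, k, v, cu_seqlens, causal=False):
    """Hypothetical varlen attention over packed sequences.

    q, k, v: (total_tokens, nheads, headdim), sequences concatenated along dim 0.
    cu_seqlens: cumulative boundaries, e.g. [0, 3, 7] for lengths 3 and 4.
    """
    out = torch.empty_like(q)
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Slice one sequence and reshape to SDPA's (1, nheads, seqlen, headdim).
        qi, ki, vi = (t[start:end].transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=causal)
        # Back to (seqlen, nheads, headdim) and write into the packed output.
        out[start:end] = oi.squeeze(0).transpose(0, 1)
    return out


q = torch.randn(7, 2, 4)
k = torch.randn(7, 2, 4)
v = torch.randn(7, 2, 4)
cu_seqlens = torch.tensor([0, 3, 7])
out = sdpa_varlen_attn(q, k, v, cu_seqlens, causal=True)
```

Looping per sequence keeps tokens from attending across sequence boundaries, at the cost of one SDPA call per sequence. A packed-QKV variant could follow the same idea by splitting a (batch, seqlen, 3, nheads, headdim) tensor with unbind(dim=2) before calling SDPA.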

Download files

Source Distribution

fake_flash_attention-2.6.3.post1.tar.gz (15.7 kB view details)

Uploaded Source

Built Distribution

fake_flash_attention-2.6.3.post1-py3-none-any.whl (16.3 kB view details)

Uploaded Python 3

File details

Details for the file fake_flash_attention-2.6.3.post1.tar.gz.

File hashes

Hashes for fake_flash_attention-2.6.3.post1.tar.gz
Algorithm Hash digest
SHA256 32b37903f75d69bcc3b11a82b390ae04a2d184bd586e205bfddded3768bda2bf
MD5 8a6c1c130d9e17b29b32fc6688fa2db5
BLAKE2b-256 cc03d3efa832c381b2e612d6a737eb414fec758b5d4c9e059aac01a3e788b245

File details

Details for the file fake_flash_attention-2.6.3.post1-py3-none-any.whl.

File hashes

Hashes for fake_flash_attention-2.6.3.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 e16229b43fc14ad29b04eacd52a908d74a44cb13eb73ea006eeb0b03c8bcbca4
MD5 eabcfea16395dad7aaad668226a7dcb0
BLAKE2b-256 e759c7c449d1b57d820e330f31c2b4189b4fb61feb749e3390935779c143e88c
