Fake flash-attn package for T4 / CPU / compatibility
Project description
fake-flash-attention ⚡️ (Fake)
A drop-in, pure-Python shim for the flash-attn package. It redirects all FlashAttention calls to PyTorch's native scaled_dot_product_attention (SDPA).
Why is this necessary?
Modern Large Language Models (LLMs) and popular libraries (like Hugging Face Transformers) often have hard-coded dependencies on the flash-attn package. However, the official flash-attn library has strict requirements:
- NVIDIA GPU only: FlashAttention-2 requires Ampere, Ada, or Hopper architectures (e.g., RTX 30/40, A100, H100).
- No Support for Older GPUs: Common GPUs like the NVIDIA T4 (Turing, standard in Google Colab) or GTX 10-series (Pascal) cards cannot run the official FlashAttention-2 kernels.
- No CPU Support: The official flash-attn cannot be installed or run in CPU-only environments.
- Complex Compilation: The build process is heavy and requires specific CUDA toolkit versions.
fake-flash-attention addresses these limitations through:
- API Parity: It exports the exact same functions (e.g., flash_attn_func) so that libraries don't crash with an ImportError.
- Hardware Portability: It leverages PyTorch's scaled_dot_product_attention, which is highly optimized and works on T4, older GPUs, and CPUs.
- Instant Setup: It is a pure-Python package with no C++/CUDA compilation required.
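The redirection to SDPA can be pictured as a thin wrapper that converts between the two tensor layouts. The following is a minimal sketch, not the package's actual source: the name `sdpa_flash_attn_func` is hypothetical, only the `causal` option is handled, and q and k are assumed to have the same sequence length.

```python
import torch
import torch.nn.functional as F


def sdpa_flash_attn_func(q, k, v, causal=False):
    """Hypothetical stand-in for flash_attn_func built on PyTorch SDPA.

    Inputs and output use the flash-attn layout:
    (batch, seqlen, nheads, headdim).
    """
    # SDPA expects (batch, nheads, seqlen, headdim), so swap axes 1 and 2.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    # Convert back to the flash-attn layout before returning.
    return out.transpose(1, 2)


q = torch.randn(2, 5, 3, 8)  # batch=2, seqlen=5, nheads=3, headdim=8
k = torch.randn(2, 5, 3, 8)
v = torch.randn(2, 5, 3, 8)
out = sdpa_flash_attn_func(q, k, v, causal=True)
print(out.shape)
```

When the query and key lengths differ, flash-attn (since 2.1) aligns the causal mask to the bottom-right corner while SDPA's `is_causal` aligns it to the top-left, so a faithful wrapper would likely need to build an explicit mask for that case.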
Installation
pip install fake-flash-attention
Note: If installing from source:
pip install .
Usage
If a library or script requires flash-attn, install this package. Existing code will work transparently:
from flash_attn import flash_attn_func
import torch
# Official flash-attn layout: (batch, seqlen, nheads, headdim)
q, k, v = (torch.randn(1, 256, 12, 64) for _ in range(3))
# This now uses PyTorch SDPA under the hood!
output = flash_attn_func(q, k, v, causal=True)
Supported Features
- ✅ flash_attn_func
- ✅ flash_attn_varlen_func
- ✅ flash_attn_qkvpacked_func/kvpacked
- ✅ FlashAttention-2 API compatibility
- ✅ Device-agnostic (CPU, CUDA, MPS)
Download files
File details
Details for the file fake_flash_attention-2.6.3.post1.tar.gz.
File metadata
- Download URL: fake_flash_attention-2.6.3.post1.tar.gz
- Upload date:
- Size: 15.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 32b37903f75d69bcc3b11a82b390ae04a2d184bd586e205bfddded3768bda2bf |
| MD5 | 8a6c1c130d9e17b29b32fc6688fa2db5 |
| BLAKE2b-256 | cc03d3efa832c381b2e612d6a737eb414fec758b5d4c9e059aac01a3e788b245 |
File details
Details for the file fake_flash_attention-2.6.3.post1-py3-none-any.whl.
File metadata
- Download URL: fake_flash_attention-2.6.3.post1-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e16229b43fc14ad29b04eacd52a908d74a44cb13eb73ea006eeb0b03c8bcbca4 |
| MD5 | eabcfea16395dad7aaad668226a7dcb0 |
| BLAKE2b-256 | e759c7c449d1b57d820e330f31c2b4189b4fb61feb749e3390935779c143e88c |
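A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a small sketch; the helper names `sha256_of` and `matches_digest` are hypothetical, and the path in the usage comment assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.lower()


# Example usage (assumes the wheel has been downloaded locally):
# matches_digest(
#     "fake_flash_attention-2.6.3.post1-py3-none-any.whl",
#     "e16229b43fc14ad29b04eacd52a908d74a44cb13eb73ea006eeb0b03c8bcbca4",
# )
```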