HiP Attention

HiP Attention extends the model context length without any additional training and can serve 3 million tokens on a single L40S 48GB GPU while achieving an estimated 7.24× speedup.

| Paper (arXiv, InfiniteHiP latest) | Paper (ICLR 2025) | SGLang Integration |

[!NOTE] You can try it in our Playground on DeepAuto.ai!

[!IMPORTANT] This is NOT yet free for commercial use. The license is FSL-1.1-MIT, which is free for non-commercial use and automatically converts to the MIT license two years after each release. Please refer to the LICENSE for more details.

News

  • 2025.01.26: Version 1.2 is now ready! The preprint is now available on arXiv.
  • 2025.01.22: HiP Attention has been accepted to ICLR 2025!
... More News ...
  • 2025.01.03: Version 1.2 will be released soon. The new version fully supports context extension and gives better control over the pruning hierarchy. It will also have better SGLang support (with proper KV offloading!)
  • 2024.10.05: Version 1.1 is now ready; check ainl-hip-offload. The KV offloading feature is in an alpha state.
  • 2024.09.09: Version 1.1 will be released soon. Please refer to the ainl-hip-attention2 branch for a preview. It will further reduce latency and improve accuracy, and it will fix most of the internal bugs of v1.0. It offers many more experimental options for further research (e.g., key access logs and a modular design of the masking kernel). As discussed in the Appendix, this release will (hopefully) include a KV offloading feature, based on either UVM or a custom cache management algorithm. SGLang will also be supported in this release. Please take a look at our company's fork for a preview.

Usage

After installation, you can import the hip_attn package from any project. hip is the code name of HiP Attention.

import torch
from hip_attn import hip_attention_12, HiPAttentionArgs12

device = 'cuda'

batch_size = 1
kv_len = 128 * 1024
q_len = 32 * 1024
num_heads = 32
num_kv_heads = 8
head_dims = 128
dtype = torch.bfloat16

q = torch.randn(
    (batch_size, q_len, num_heads, head_dims),
    dtype=dtype,
    device=device
)
k = torch.randn(
    (batch_size, kv_len, num_kv_heads, head_dims),
    dtype=dtype,
    device=device,
)
v = k.clone()

output, metadata = hip_attention_12(q=q, k=k, v=v, args=HiPAttentionArgs12())
print(output.shape)

# > torch.Size([1, 32768, 32, 128])
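
To get a feel for the sizes involved, the following is a rough back-of-the-envelope sketch in plain Python (no GPU or hip_attn required). It reuses num_heads, num_kv_heads, head_dims, and the bfloat16 element size from the example above. NUM_LAYERS is a hypothetical value chosen only for illustration and is not stated in this README, so treat the resulting totals as an estimate under that assumption rather than a measurement of any particular model.

```python
# Sizes taken from the usage example above. NUM_LAYERS is an assumption
# made for illustration only; it does not come from this README.
BYTES_PER_ELEM = 2   # torch.bfloat16 stores 2 bytes per element
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIMS = 128
NUM_LAYERS = 32      # hypothetical layer count


def kv_bytes_per_token_per_layer() -> int:
    """Bytes needed for one token's key and value vectors in one layer."""
    return 2 * NUM_KV_HEADS * HEAD_DIMS * BYTES_PER_ELEM


def kv_cache_gib(num_tokens: int, num_layers: int = NUM_LAYERS) -> float:
    """Size in GiB of a dense KV cache for num_tokens across num_layers."""
    return kv_bytes_per_token_per_layer() * num_tokens * num_layers / 2**30


# With grouped KV heads, each KV head is shared by this many query heads.
queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS

print(queries_per_kv_head)                       # query heads per KV head
print(kv_cache_gib(128 * 1024, num_layers=1))    # k and v from the example
print(kv_cache_gib(3_000_000))                   # 3M tokens, assumed layers
```

Under these assumed sizes, the k and v tensors from the example occupy 0.5 GiB together, while a dense cache for 3 million tokens would need roughly 366 GiB, far more than 48 GB of GPU memory. That gap is consistent with the KV offloading feature mentioned in the News section.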

Getting Started

Local development

Using uv (Recommended)

It’s recommended to use uv, a very fast Python environment manager, to create and manage Python environments. Please follow the documentation to install uv. After installing uv, you can create a new Python environment and install hip-attention using the following commands:

# Clone this repository
git clone git@github.com:DeepAuto-AI/hip-attention.git
cd hip-attention

# This installs all research and dev dependencies in .venv/
uv sync
uv run pre-commit install

Then you can run any Python program with uv run, which automatically picks up the .venv/ virtual environment:

  • Script: uv run src/hip_research/main/model_eval.py
  • Module: uv run -m src.hip_research.main.model_eval

Using pip and conda

# Clone this repository
git clone git@github.com:DeepAuto-AI/hip-attention.git
cd hip-attention

# Make new conda environment
conda create --name hip python=3.11
conda activate hip

# Default install
pip install -e "."
# (Optional) For research benchmarks and unit tests
pip install -e "hip-research"

# Optional, depends on your CUDA environment
export CUDACXX=/usr/local/cuda/bin/nvcc

# Install SGLang with support for HiP Attention
pip install -e ".[sglang]" \
"sglang[all] @ git+https://github.com/DeepAuto-AI/sglang.git@deepauto/release#subdirectory=python" \
--no-build-isolation \
--verbose \
--find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
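
After either installation route, a quick sanity check is to ask Python which version of the distribution it can see. The sketch below uses the standard-library importlib.metadata module and the distribution name hip-attn, which is the name used in the Docker build command in this README. The helper installed_version is a hypothetical name introduced here for illustration.

```python
import importlib.metadata


def installed_version(dist_name: str):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


# Prints a version string if hip-attn is installed in the active
# environment, otherwise prints None.
print(installed_version("hip-attn"))
```

If this prints None, the active environment is probably not the one you installed into (for example, the conda environment was not activated, or the script was not launched through uv run).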

Running

See the following pages for more details:

Building Docker

git clone git@github.com:DeepAuto-AI/hip-attention.git
cd hip-attention
docker build -t hip-attention:latest -t hip-attention:latest-sglang -t hip-attention:$(git rev-parse --short HEAD)-sglang -t hip-attention:v$(uv run python -c 'import importlib.metadata; print(importlib.metadata.version("hip-attn"))')-sglang -f Dockerfile.sglang .
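
The build command above attaches four tags to the same image. As a minimal sketch of that naming scheme, the hypothetical helper below assembles the same tag list from an image name, a package version, and a short commit hash. The version 1.2.5 matches the release files listed for this package; the commit hash abc1234 is a placeholder, not a real commit.

```python
def docker_tags(image: str, version: str, short_sha: str) -> list[str]:
    """Build the four tags used by the docker build command above."""
    return [
        f"{image}:latest",
        f"{image}:latest-sglang",
        f"{image}:{short_sha}-sglang",   # from: git rev-parse --short HEAD
        f"{image}:v{version}-sglang",    # from: importlib.metadata.version
    ]


# Placeholder commit hash; substitute the output of git rev-parse.
tags = docker_tags("hip-attention", "1.2.5", "abc1234")
for tag in tags:
    print(tag)
```

Tagging by both commit hash and package version lets a deployment pin an exact build while latest continues to track the newest image.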

Reproducing Experiments

See the "how to reproduce experiments" page.

Citation

@misc{lee2025_infinite_hip,
      title={InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU},
      author={Heejun Lee and Geon Park and Jaduk Suh and Sung Ju Hwang},
      year={2025},
      eprint={2502.08910},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08910},
}

@inproceedings{lee2025_hip_attention,
      title={A Training-Free Sub-quadratic Cost Transformer Model Serving Framework with Hierarchically Pruned Attention},
      author={Heejun Lee and Geon Park and Youngwan Lee and Jaduk Suh and Jina Kim and Wonyong Jeong and Bumsik Kim and Hyemin Lee and Myeongjae Jeon and Sung Ju Hwang},
      booktitle={The Thirteenth International Conference on Learning Representations},
      year={2025},
      url={https://openreview.net/forum?id=PTcMzQgKmn}
}

Contributing

Building and publishing

  • PyPI
rm -rf dist
uv build --no-sources
uv publish
  • Docker
docker login
docker build -t deepauto/hip-attention:latest -t deepauto/hip-attention:latest-sglang -t deepauto/hip-attention:$(git rev-parse --short HEAD)-sglang -t deepauto/hip-attention:v$(uv run python -c 'import importlib.metadata; print(importlib.metadata.version("hip-attn"))')-sglang -f Dockerfile.sglang .
docker push deepauto/hip-attention:latest
docker push deepauto/hip-attention:latest-sglang
docker push deepauto/hip-attention:$(git rev-parse --short HEAD)-sglang
docker push deepauto/hip-attention:v$(uv run python -c 'import importlib.metadata; print(importlib.metadata.version("hip-attn"))')-sglang
