EETQ
Easy & Efficient Quantization for Transformers
Features
- New🔥: GEMV kernels implemented for w8a16, improving performance by 10~30%.
- INT8 weight only PTQ
- High-performance GEMM kernels adapted from the original FasterTransformer code
- No need for quantization training
- Optimized attention layer using Flash-Attention V2
- Easy to use: adapt your PyTorch model with one line of code
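To make the INT8 weight-only PTQ feature concrete, here is a minimal NumPy sketch of per-channel symmetric int8 weight quantization, the general technique that w8a16 kernels build on. This is an illustrative reference, not EETQ's actual implementation; the function names are ours.

```python
# Sketch of INT8 weight-only post-training quantization (per-channel
# symmetric). Weights are stored as int8 plus one fp scale per output
# channel; activations stay in fp16/fp32 (hence "w8a16").
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Quantize float weights to int8 with one scale per output row."""
    # Map each row's max magnitude onto the symmetric int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # small reconstruction error
```

Because the scales are per-channel, weights are halved in size (vs. fp16) with small reconstruction error, which is why no quantization-aware training is needed.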
Getting started
Environment
- cuda:>=11.4
- python:>=3.8
- gcc:>= 7.4.0
- torch:>=1.14.0
- transformers:>=4.27.0
These are the minimum required versions; newer versions are recommended.
Installation
We recommend building with the provided Dockerfile.
```shell
$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .
```
If your machine has less than 96GB of RAM and many CPU cores, ninja might run
too many parallel compilation jobs and exhaust available RAM. To limit the
number of parallel jobs, set the environment variable MAX_JOBS:
```shell
$ MAX_JOBS=4 pip install .
```
Usage
- Quantize torch model
```python
from eetq.utils import eet_quantize
eet_quantize(torch_model)
```
- Quantize torch model and optimize with flash attention
```python
import torch
from transformers import AutoModelForCausalLM

# ... (model_name and config defined beforehand)
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)

from eetq.utils import eet_accelerator
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# inference
res = model.generate(...)
```
- Use EETQ in text-generation-inference

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
```

- Use EETQ in LoRAX

```shell
lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
```
Examples
Performance
- llama-13b (tested on an RTX 3090), prompt=1024, max_new_tokens=50
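A benchmark like the one above can be reproduced with a simple timing harness. A minimal sketch, where `generate_fn` is any callable that produces `new_tokens` tokens (e.g. a wrapper around `model.generate`); the helper name and parameters are ours, not part of EETQ's API:

```python
# Generic decode-throughput measurement: warm up, then time several
# generation runs and report tokens per second.
import time

def tokens_per_second(generate_fn, new_tokens: int, warmup: int = 1, iters: int = 3) -> float:
    """Average generated-token throughput of `generate_fn` over `iters` runs."""
    for _ in range(warmup):
        generate_fn()  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn()
    elapsed = time.perf_counter() - start
    return (new_tokens * iters) / elapsed
```

For example, `tokens_per_second(lambda: model.generate(**inputs, max_new_tokens=50), new_tokens=50)` would measure the setup described above.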
File details
Details for the file EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 14.3 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c332b2771ea1e56fe147dbefe846c6a706293cbcad48f9d5368ba139c9b271ab |
| MD5 | dcd0c76c6f019359ea014531db13048b |
| BLAKE2b-256 | 3f5b423d19fae13b42b2430d35e7fb70bb08a875d93206aa964f8b48bebe7d04 |