EETQ
Easy & Efficient Quantization for Transformers
Features
- New🔥: GEMV kernels implemented for w8a16, improving performance by 10~30%.
- INT8 weight only PTQ
- High-performance GEMM kernels adapted from the original FasterTransformer code
- No need for quantization training
- Optimized attention layer using Flash-Attention V2
- Easy to use: adapt your PyTorch model with one line of code
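To make the INT8 weight-only PTQ feature concrete, here is a minimal NumPy sketch of per-channel symmetric int8 weight quantization, the general technique that w8a16 kernels build on. This is an illustrative reference, not EETQ's actual implementation; the function names are ours.

```python
# Sketch of INT8 weight-only post-training quantization (per-channel
# symmetric). Weights are stored as int8 plus one fp scale per output
# channel; activations stay in fp16/fp32 (hence "w8a16").
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Quantize float weights to int8 with one scale per output row."""
    # Map each row's max magnitude onto the symmetric int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # small reconstruction error
```

Because the scales are per-channel, weights are halved in size (vs. fp16) with small reconstruction error, which is why no quantization-aware training is needed.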
Getting started
Environment
- cuda:>=11.4
- python:>=3.8
- gcc:>= 7.4.0
- torch:>=1.14.0
- transformers:>=4.27.0
These are the minimum required versions; newer versions are recommended.
Installation
We recommend building with the provided Dockerfile.
```shell
$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .
```
If your machine has less than 96GB of RAM and many CPU cores, ninja might run
too many parallel compilation jobs and exhaust available RAM. To limit the
number of parallel jobs, set the environment variable MAX_JOBS:
```shell
$ MAX_JOBS=4 pip install .
```
Usage
- Quantize torch model
```python
from eetq.utils import eet_quantize
eet_quantize(torch_model)
```
- Quantize torch model and optimize with flash attention
```python
import torch
from transformers import AutoModelForCausalLM

# ... (model_name and config defined beforehand)
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)

from eetq.utils import eet_accelerator
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# inference
res = model.generate(...)
```
- Use EETQ in text-generation-inference

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
```

- Use EETQ in LoRAX

```shell
lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
```
Examples
Performance
- llama-13b (tested on an RTX 3090), prompt=1024, max_new_tokens=50
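A benchmark like the one above can be reproduced with a simple timing harness. A minimal sketch, where `generate_fn` is any callable that produces `new_tokens` tokens (e.g. a wrapper around `model.generate`); the helper name and parameters are ours, not part of EETQ's API:

```python
# Generic decode-throughput measurement: warm up, then time several
# generation runs and report tokens per second.
import time

def tokens_per_second(generate_fn, new_tokens: int, warmup: int = 1, iters: int = 3) -> float:
    """Average generated-token throughput of `generate_fn` over `iters` runs."""
    for _ in range(warmup):
        generate_fn()  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn()
    elapsed = time.perf_counter() - start
    return (new_tokens * iters) / elapsed
```

For example, `tokens_per_second(lambda: model.generate(**inputs, max_new_tokens=50), new_tokens=50)` would measure the setup described above.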
File details
Details for the file EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 14.3 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c332b2771ea1e56fe147dbefe846c6a706293cbcad48f9d5368ba139c9b271ab |
| MD5 | dcd0c76c6f019359ea014531db13048b |
| BLAKE2b-256 | 3f5b423d19fae13b42b2430d35e7fb70bb08a875d93206aa964f8b48bebe7d04 |