
Easy and efficient quantization for transformers

Project description

EETQ

Easy & Efficient Quantization for Transformers


Features

  • New🔥: Implemented GEMV in w8a16, bringing a 10-30% performance improvement.
  • INT8 weight-only PTQ
    • High-performance GEMM kernels adapted from FasterTransformer (original code)
    • No quantization-aware training required
  • Optimized attention layer using Flash-Attention V2
  • Easy to use: adapt your PyTorch model with one line of code

Getting started

Environment

  • CUDA >= 11.4
  • Python >= 3.8
  • GCC >= 7.4.0
  • torch >= 1.14.0
  • transformers >= 4.27.0

These are the minimum supported versions; newer versions are recommended.
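
Before building, it can help to confirm the installed versions against the minimums above. The following is only a minimal check script, assuming torch, transformers, and gcc are already installed; it just prints what it finds.

# Print the versions relevant to the requirements above.
import subprocess
import sys

import torch
import transformers

print("python      :", sys.version.split()[0])    # expect >= 3.8
print("torch       :", torch.__version__)         # expect >= 1.14.0
print("transformers:", transformers.__version__)  # expect >= 4.27.0
print("cuda (torch):", torch.version.cuda)        # expect >= 11.4
print("gcc         :", subprocess.check_output(["gcc", "--version"], text=True).splitlines()[0])  # expect >= 7.4.0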

Installation

We recommend using the provided Dockerfile. To build and install from source:

$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .

If your machine has less than 96GB of RAM and many CPU cores, ninja may launch too many parallel compilation jobs and exhaust memory. To limit the number of parallel jobs, set the MAX_JOBS environment variable:

$ MAX_JOBS=4 pip install .

Usage

  1. Quantize a torch model:

from eetq.utils import eet_quantize
eet_quantize(torch_model)

  2. Quantize a torch model and optimize it with Flash-Attention:

...
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)
from eetq.utils import eet_accelerator
# quantize weights to INT8 and enable the fused attention kernel
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# inference
res = model.generate(...)

  3. Use EETQ in TGI (Text Generation Inference); see this PR.

text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...

  4. Use EETQ in LoRAX; see the docs.

lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...


Performance

  • llama-13b (tested on an RTX 3090), prompt=1024, max_new_tokens=50
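
The reported setting can be reproduced with a short timing loop such as the one below. This is only a sketch under assumptions not in the original: the model path is a placeholder, the prompt is synthetic filler, and quantization uses eet_accelerator exactly as in the Usage section.

# Rough timing sketch for the setting above (prompt=1024, max_new_tokens=50).
# "path/to/llama-13b" is a placeholder; replace it with a local checkpoint.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from eetq.utils import eet_accelerator

model_name = "path/to/llama-13b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# INT8 weight-only quantization with the fused attention kernel, as in Usage.
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# Synthetic 1024-token prompt.
inputs = tokenizer("hello " * 1024, return_tensors="pt", truncation=True, max_length=1024).to("cuda:0")

torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=50)
torch.cuda.synchronize()

elapsed = time.time() - start
print(f"{50 / elapsed:.1f} new tokens/s")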



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distribution

EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.3 MB)

Uploaded: CPython 3.10, manylinux (glibc 2.17+), x86-64

File details

Details for the file EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.


File hashes

Hashes for EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm    Hash digest
SHA256       c332b2771ea1e56fe147dbefe846c6a706293cbcad48f9d5368ba139c9b271ab
MD5          dcd0c76c6f019359ea014531db13048b
BLAKE2b-256  3f5b423d19fae13b42b2430d35e7fb70bb08a875d93206aa964f8b48bebe7d04

See more details on using hashes here.
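
For a quick local check, a downloaded wheel can be compared against the published SHA256 digest with a few lines of Python; the snippet below is illustrative and assumes the wheel sits in the current directory.

# Compare a downloaded wheel against the published SHA256 digest.
import hashlib

wheel = "EETQ-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
expected = "c332b2771ea1e56fe147dbefe846c6a706293cbcad48f9d5368ba139c9b271ab"

with open(wheel, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "MISMATCH")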
