
MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference and serving.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE's experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing the expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (sketched conceptually after this list).
  • Supporting LLM acceleration techniques (such as FlashAttention).
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving state-of-the-art latency and throughput when serving MoEs in a resource-constrained GPU environment (in comparison with HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp).
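
As a rough illustration of what "activation-aware" means above, here is a minimal conceptual sketch of an expert cache that keeps the most frequently activated experts resident on the GPU. This is not MoE-Infinity's actual implementation: the class and method names are invented for illustration, and a real system would overlap these weight copies with computation.

# Conceptual sketch only -- not MoE-Infinity's implementation.
# "Activation-aware" caching: evict the expert with the fewest recorded
# activations rather than using plain LRU.
from collections import defaultdict

class ActivationAwareExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity               # max experts resident on GPU
        self.activation_counts = defaultdict(int)
        self.resident = set()                  # expert ids currently on GPU

    def record_activation(self, expert_id: int) -> None:
        # Expert activation tracing: count router decisions per expert.
        self.activation_counts[expert_id] += 1

    def fetch(self, expert_id: int) -> None:
        # Ensure the expert is on the GPU, evicting the least-activated one.
        self.record_activation(expert_id)
        if expert_id in self.resident:
            return                             # cache hit: no host-to-GPU copy
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=self.activation_counts.__getitem__)
            self.resident.remove(victim)       # move victim back to host memory
        self.resident.add(expert_id)           # copy expert weights host -> GPU

The same activation statistics can also drive prefetching: experts that are frequently activated after the ones currently running can be copied to the GPU before the router requests them.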

MoE-Infinity is easy to use: it exposes a simple HuggingFace-style API (see Usage and Examples below).

Note: the open-source MoE-Infinity has been redesigned to be friendly to HuggingFace users. It therefore differs from the version reported in the paper, which prioritizes raw performance above all else; in particular, distributed inference is not yet supported in this open-source release.

Performance

Single A5000 GPU (24 GB memory), per-token latency in seconds for generation on a mixed dataset drawn from FLAN, BIG-Bench, and MMLU. Lower is better; an X marks model/system pairs that are not supported.

                     switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         0.230              0.239          0.895
Accelerate           1.043              3.071          6.633
DeepSpeed            4.578              8.381          2.486
Mixtral Offloading   X                  X              1.752
Ollama               X                  X              0.903

Single A5000 GPU, throughput in tokens/s for generation with batch size 32. Higher is better.

                     switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         69.105             30.300         12.579
Accelerate           5.788              4.344          1.245
DeepSpeed            7.416              4.334          7.727
Mixtral Offloading   X                  X              7.684
Ollama               X                  X              1.107

The Mixtral Offloading experiment was run with a batch size of 16, since a batch size of 32 caused out-of-memory errors on the GPU.

Ollama does not support batched generation, so its throughput is measured with a batch size of 1. At batch size 1, throughput is simply the reciprocal of per-token latency: 1 / 0.903 s ≈ 1.107 tokens/s, matching the table above.

Installation

We recommend installing MoE-Infinity in a virtual environment. You can create the provided conda environment, install from PyPI, or build from source.

Install from conda environment

conda env create --file environment.yml
conda activate moe-infinity

Install from PyPI

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12  # assumes conda; otherwise install the GCC 12 libstdc++ runtime via your system package manager

Install from Source

git clone https://github.com/TorchMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .
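
Whichever installation path you choose, a quick import test confirms that the package is visible in the active environment (this uses only standard Python attributes, no MoE-Infinity API):

# Sanity check: confirm which installation the interpreter picks up.
import moe_infinity
print("moe_infinity imported from:", moe_infinity.__file__)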

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command.

FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Once installed, MoE-Infinity automatically integrates with FlashAttention to improve performance.
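
To confirm that FlashAttention is present and recent enough, a quick version check works; the flash_attn package exposes a standard __version__ attribute:

# Check the installed FlashAttention version (expect >= 2.5.2).
import flash_attn
print("flash-attn version:", flash_attn.__version__)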

Usage and Examples

We provide a simple API that covers single-GPU and multi-GPU setups (multi-node, distributed inference is planned; see the note above). The following examples show how to use MoE-Infinity to run generation on a HuggingFace LLM.

Sample Code of Huggingface LLM Inference

import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'TheBloke/Mixtral-8x7B-v0.1-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),
    "device_memory_ratio": 0.75, # 75% of the device memory is used for caching, change the value according to your device memory size on OOM
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
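
Continuing the script above, standard HuggingFace generation arguments such as max_new_tokens may plausibly be passed through as well; treat this as an assumption about MoE.generate rather than a documented guarantee:

# Hypothetical continuation, assuming MoE.generate forwards standard
# HuggingFace generation kwargs (an assumption, not confirmed here).
output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))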

Running Inference

This command runs the script on selected GPUs.

CUDA_VISIBLE_DEVICES=0,1 python script.py

We also provide a complete example that runs inference on a HuggingFace LLM. The script downloads the model checkpoint, runs inference on the specified input text, and prints the output to the console.

CUDA_VISIBLE_DEVICES=0 python example/interface_example.py --model_name_or_path "mistralai/Mixtral-8x7B-Instruct-v0.1" --offload_dir <your local path on SSD> 

Release Plan

We plan to add the following features in the coming months:

  • vLLM as an additional inference runtime (PyTorch is currently the default), including support for KV cache offloading.
  • Expert parallelism for distributed MoE inference.
  • More to come (we welcome contributors to join us!)

Citation

If you use MoE-Infinity for your research, please cite our paper:

@misc{moe-infinity2024,
  title={MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving},
  author={Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  year={2024},
  url={https://arxiv.org/abs/2401.14361}
}
