
MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference and serving.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a minimal sketch of the caching idea follows this list).
  • Supporting LLM acceleration techniques such as FlashAttention.
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency and throughput when serving MoEs in a resource-constrained GPU environment, in comparison with HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp.
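
To make the caching idea concrete, here is a minimal sketch of an activation-aware expert cache (illustrative only, not MoE-Infinity's actual implementation; the class and method names are hypothetical): experts observed to activate more often are kept resident on the GPU, and the least-activated resident expert is evicted first.

# Minimal sketch of activation-aware expert caching (illustrative only;
# not MoE-Infinity's actual implementation). Experts with higher observed
# activation counts are kept on the GPU; the least-activated resident
# expert is evicted when the cache is full.
from collections import Counter

class ActivationAwareExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity          # max number of experts resident on GPU
        self.activations = Counter()      # expert_id -> observed activation count
        self.resident = set()             # expert ids currently on GPU

    def record(self, expert_id: int) -> None:
        """Trace an expert activation (e.g., from the router's top-k choice)."""
        self.activations[expert_id] += 1

    def fetch(self, expert_id: int) -> None:
        """Ensure an expert is on the GPU, evicting the least-activated one if needed."""
        self.record(expert_id)
        if expert_id in self.resident:
            return                        # cache hit: no transfer needed
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.activations[e])
            self.resident.remove(victim)  # evict the least-activated resident expert
        self.resident.add(expert_id)      # stands in for a host-to-GPU copy of expert weights

cache = ActivationAwareExpertCache(capacity=4)
for expert in [0, 1, 0, 2, 0, 3, 4, 0]:   # a toy trace of router decisions
    cache.fetch(expert)
print(cache.resident)                     # expert 0 stays resident; rarely used experts get evicted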

MoE-Infinity is easy to use.

Note: the open-source MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which prioritizes extreme performance above all else; as a result, distributed inference is currently not supported in this open-source version.


Performance

Single A5000 GPU (24 GB memory), per-token latency (seconds) for generation on a mixed dataset drawn from FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

Method               switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         0.230              0.239          0.895
Accelerate           1.043              3.071          6.633
DeepSpeed            4.578              8.381          2.486
Mixtral Offloading   X                  X              1.752
Ollama               X                  X              0.903

Single A5000 GPU, throughput (tokens/s) for generation with batch size 32. Higher throughput is better.

Method               switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         69.105             30.300         12.579
Accelerate           5.788              4.344          1.245
DeepSpeed            7.416              4.334          7.727
Mixtral Offloading   X                  X              7.684
Ollama               X                  X              1.107

The Mixtral Offloading experiment used a batch size of 16, since a batch size of 32 caused out-of-memory errors on the GPU.

Ollama does not support batched generation, so its throughput is measured with a batch size of 1.

Installation

We recommend installing MoE-Infinity in a virtual environment. You can install it from PyPI or build it from source.

Install in a conda environment

conda env create --file environment.yml
conda activate moe-infinity

Install from PyPI

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assumes conda; otherwise install libstdcxx-ng=12 (or gcc 12) via your system package manager

Install from Source

git clone https://github.com/TorchMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command.

FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Once installed, MoE-Infinity automatically integrates with FlashAttention to improve performance.
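
To confirm that FlashAttention built correctly, a quick check (assuming the standard flash_attn package layout) is:

python -c "import flash_attn; print(flash_attn.__version__)"  # expect >= 2.5.2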

Usage and Examples

We provide a simple API for diverse setups, including single-GPU and multi-GPU environments. The following example shows how to use MoE-Infinity to run generation with a Huggingface LLM.

Sample Code of Huggingface LLM Inference

import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'TheBloke/Mixtral-8x7B-v0.1-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),  # directory for offloaded expert weights
    "device_memory_ratio": 0.75,  # fraction of device memory used for expert caching; lower this on OOM
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
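
MoE.generate follows the HuggingFace generate API, so the usual generation arguments should pass through, for example (this assumes kwargs are forwarded to the underlying model, which is worth verifying against your installed version):

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)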

Running Inference

This command runs the script on selected GPUs.

CUDA_VISIBLE_DEVICES=0,1 python script.py

We provide a simple example of running inference on a Huggingface LLM. The script downloads the model checkpoint and runs inference on the specified input text; the output is printed to the console.

CUDA_VISIBLE_DEVICES=0 python example/interface_example.py --model_name_or_path "mistralai/Mixtral-8x7B-Instruct-v0.1" --offload_dir <your local path on SSD> 

Release Plan

We plan to release the following features in the coming months:

  • Support for vLLM as an additional inference runtime (PyTorch is currently the default engine), including KV cache offloading.
  • Expert parallelism for distributed MoE inference.
  • More to come (we welcome contributors to join us!).

Citation

If you use MoE-Infinity for your research, please cite our paper:

@article{moe-infinity2024,
  title={MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving},
  author={Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  journal={arXiv preprint arXiv:2401.14361},
  year={2024}
}
