
MoE-Infinity

MoE-Infinity is a cost-effective, fast, and easy-to-use library for Mixture-of-Experts (MoE) inference and serving.

MoE-Infinity is cost-effective yet fast:

  • Offloading MoE experts to host memory, allowing memory-constrained GPUs to serve MoE models.
  • Minimizing expert offloading overheads through several novel techniques: expert activation tracing, activation-aware expert prefetching, and activation-aware expert caching (a minimal sketch of the caching idea follows this list).
  • Supporting LLM acceleration techniques such as FlashAttention.
  • Supporting multi-GPU environments with numerous OS-level performance optimizations.
  • Achieving SOTA latency and throughput when serving MoEs in a resource-constrained GPU environment, in comparison with HuggingFace Accelerate, DeepSpeed, Mixtral-Offloading, and Ollama/llama.cpp.
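
To make the caching idea concrete, here is a minimal sketch of an activation-aware expert cache (illustrative only, not MoE-Infinity's actual implementation; the class and method names are hypothetical): experts observed to activate more often are kept resident on the GPU, and the least-activated resident expert is evicted first.

# Minimal sketch of activation-aware expert caching (illustrative only;
# not MoE-Infinity's actual implementation). Experts with higher observed
# activation counts are kept on the GPU; the least-activated resident
# expert is evicted when the cache is full.
from collections import Counter

class ActivationAwareExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity          # max number of experts resident on GPU
        self.activations = Counter()      # expert_id -> observed activation count
        self.resident = set()             # expert ids currently on GPU

    def record(self, expert_id: int) -> None:
        """Trace an expert activation (e.g., from the router's top-k choice)."""
        self.activations[expert_id] += 1

    def fetch(self, expert_id: int) -> None:
        """Ensure an expert is on the GPU, evicting the least-activated one if needed."""
        self.record(expert_id)
        if expert_id in self.resident:
            return                        # cache hit: no transfer needed
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.activations[e])
            self.resident.remove(victim)  # evict the least-activated resident expert
        self.resident.add(expert_id)      # stands in for a host-to-GPU copy of expert weights

cache = ActivationAwareExpertCache(capacity=4)
for expert in [0, 1, 0, 2, 0, 3, 4, 0]:   # a toy trace of router decisions
    cache.fetch(expert)
print(cache.resident)                     # expert 0 stays resident; rarely used experts get evicted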

MoE-Infinity is easy to use.

Note: the open-source MoE-Infinity has been redesigned to be friendly to HuggingFace users. It differs from the version reported in the paper, which prioritizes extreme performance above all else; as a result, distributed inference is currently not supported in this open-source version.


Performance

Single A5000 GPU (24 GB memory), per-token latency (seconds) for generation on a mixed dataset drawn from FLAN, BIG-Bench, and MMLU. Lower per-token latency is better.

Method               switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         0.230              0.239          0.895
Accelerate           1.043              3.071          6.633
DeepSpeed            4.578              8.381          2.486
Mixtral Offloading   X                  X              1.752
Ollama               X                  X              0.903

Single A5000 GPU, throughput (tokens/s) for generation with batch size 32. Higher throughput is better.

Method               switch-large-128   NLLB-MoE-54B   Mixtral-7x8b
MoE-Infinity         69.105             30.300         12.579
Accelerate           5.788              4.344          1.245
DeepSpeed            7.416              4.334          7.727
Mixtral Offloading   X                  X              7.684
Ollama               X                  X              1.107

The Mixtral Offloading experiment used a batch size of 16, since a batch size of 32 caused out-of-memory errors on the GPU.

Ollama does not support batched generation, so its throughput is measured with a batch size of 1.

Installation

We recommend installing MoE-Infinity in a virtual environment. You can install it from PyPI or build it from source.

Install in a conda environment

conda env create --file environment.yml
conda activate moe-infinity

Install from PyPI

pip install moe-infinity
conda install -c conda-forge libstdcxx-ng=12 # assumes conda; otherwise install libstdcxx-ng=12 (or gcc 12) via your system package manager

Install from Source

git clone https://github.com/TorchMoE/MoE-Infinity.git
cd MoE-Infinity
pip install -e .

Enable FlashAttention (Optional)

Install FlashAttention (>=2.5.2) for faster inference with the following command.

FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Once installed, MoE-Infinity automatically integrates with FlashAttention to improve performance.
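
To confirm that FlashAttention built correctly, a quick check (assuming the standard flash_attn package layout) is:

python -c "import flash_attn; print(flash_attn.__version__)"  # expect >= 2.5.2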

Usage and Examples

We provide a simple API for diverse setups, including single-GPU and multi-GPU environments. The following example shows how to use MoE-Infinity to run generation with a Huggingface LLM.

Sample Code of Huggingface LLM Inference

import os
from transformers import AutoTokenizer
from moe_infinity import MoE

user_home = os.path.expanduser('~')

checkpoint = 'TheBloke/Mixtral-8x7B-v0.1-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config = {
    "offload_path": os.path.join(user_home, "moe-infinity"),  # directory for offloaded expert weights
    "device_memory_ratio": 0.75,  # fraction of device memory used for expert caching; lower this on OOM
}

model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
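
MoE.generate follows the HuggingFace generate API, so the usual generation arguments should pass through, for example (this assumes kwargs are forwarded to the underlying model, which is worth verifying against your installed version):

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)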

Running Inference

This command runs the script on selected GPUs.

CUDA_VISIBLE_DEVICES=0,1 python script.py

We provide a simple example of running inference on a Huggingface LLM. The script downloads the model checkpoint and runs inference on the specified input text; the output is printed to the console.

CUDA_VISIBLE_DEVICES=0 python example/interface_example.py --model_name_or_path "mistralai/Mixtral-8x7B-Instruct-v0.1" --offload_dir <your local path on SSD> 

Release Plan

We plan to release the following features in the coming months:

  • Support for vLLM as an additional inference runtime (PyTorch is currently the default engine), including KV cache offloading.
  • Expert parallelism for distributed MoE inference.
  • More to come (we welcome contributors to join us!).

Citation

If you use MoE-Infinity for your research, please cite our paper:

@article{moe-infinity2024,
  title={MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving},
  author={Leyang Xue and Yao Fu and Zhan Lu and Luo Mai and Mahesh Marina},
  journal={arXiv preprint arXiv:2401.14361},
  year={2024}
}
