
A high-performance inference system for large language models.

Project description

ScaleLLM

An efficient LLM Inference solution

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3, Gemma, Bloom, GPT-NeoX, and more.

ScaleLLM is under active development. We are committed to continually improving its efficiency and adding new features. Feel free to explore our Roadmap for more details.

Getting Started

ScaleLLM is available as a Python Wheel package on PyPI. You can install it using pip:

# Install scalellm with CUDA 12.1 and PyTorch 2.4.0
pip install scalellm

If you want to install ScaleLLM with a different version of CUDA or PyTorch, you can pip install it by providing the index URL for that version. For example, to install ScaleLLM with CUDA 12.1 and PyTorch 2.2.2, use the following command:

pip install scalellm -i https://whl.vectorch.com/cu121/torch2.2.2/

Build from source

If no wheel package is available for your configuration, you can build ScaleLLM from source code. You can clone the repository and install it locally using the following commands:

git clone --recursive https://github.com/vectorch-ai/ScaleLLM.git
cd ScaleLLM
python setup.py bdist_wheel
pip install dist/scalellm-*.whl

OpenAI-Compatible Server

You can start the OpenAI-compatible REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct
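
Once the server is up, you can sanity-check it by listing the loaded models through the OpenAI-compatible endpoint (the server listens on port 8080, as in the examples throughout this document):

curl http://localhost:8080/v1/models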

Chatbot UI

A local Chatbot UI is also available on localhost:3000. You can start it with the latest image using the following commands:

docker pull docker.io/vectorchai/chatbot-ui:latest
docker run -it --net=host \
  -e OPENAI_API_HOST=http://127.0.0.1:8080 \
  -e OPENAI_API_KEY=YOUR_API_KEY \
  docker.io/vectorchai/chatbot-ui:latest

Usage Examples

You can use ScaleLLM for offline batch inference or online distributed inference. Below are some examples to help you get started. More examples can be found in the examples folder.

Chat Completions

Start the REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

You can query the chat completions endpoint with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

or with the openai Python client:

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()
print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.content:
        print(delta.content, end="")
print()

Completions

Start the REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B

For regular completions, you can use this example:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
    "stream": true
  }'

or with the openai Python client:

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()

print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.completions.create(
    model=model,
    prompt="hello",
    max_tokens=32,
    temperature=0.7,
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    if choice.text:
        print(choice.text, end="")
print()

Advanced Features

CUDA Graph

CUDA Graph can improve performance by reducing the overhead of launching kernels. ScaleLLM enables CUDA Graph for decoding by default. In addition, it allows users to specify which batch sizes to capture by setting the --cuda_graph_batch_sizes flag.

For example:

python3 -m scalellm.serve.api_server \
  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable_cuda_graph=true \
  --cuda_graph_batch_sizes=1,2,4,8

CUDA Graph comes with limitations that can cause problems during development and debugging. If you encounter any issues related to it, you can disable CUDA Graph by setting the --enable_cuda_graph=false flag.

Prefix Cache

The KV cache is a technique that caches intermediate KV states to avoid redundant computation during LLM inference. Prefix Cache extends this idea by allowing KV caches with the same prefix to be shared among different requests.

ScaleLLM supports Prefix Cache and enables it by default. You can disable it by setting the --enable_prefix_cache=false flag.
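
For example, to launch the server with the prefix cache turned off (the model name simply reuses the one from the earlier examples):

python3 -m scalellm.serve.api_server \
  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable_prefix_cache=false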

Chunked Prefill

Chunked Prefill splits a long user prompt into multiple chunks and fills the remaining batch slots with decode requests. This technique can improve decoding throughput and avoid the long stalls that hurt the user experience, although it may slightly increase Time to First Token (TTFT). ScaleLLM supports Chunked Prefill, and its behavior can be controlled with the following flags (an example launch follows the list):

  • --max_tokens_per_batch: the maximum number of tokens per batch; default is 512.
  • --max_seqs_per_batch: the maximum number of sequences per batch; default is 128.
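
For example, the following launch adjusts both per-batch budgets; the specific values are illustrative rather than tuned recommendations:

python3 -m scalellm.serve.api_server \
  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max_tokens_per_batch=1024 \
  --max_seqs_per_batch=64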

Speculative Decoding

Speculative Decoding is a commonly used technique to speed up LLM inference without changing the output distribution. During inference, it employs an economical approximation to generate speculative tokens, which are subsequently validated by the target model. For now, ScaleLLM supports Speculative Decoding with a draft model that generates the draft tokens; it can be enabled by configuring a draft model and setting the number of speculative tokens.

For example:

python3 -m scalellm.serve.api_server \
  --model=google/gemma-7b-it \
  --draft_model=google/gemma-2b-it \
  --num_speculative_tokens=5 \
  --device=cuda:0 \
  --draft_device=cuda:0

Quantization

Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), with seamless integration of the following libraries: autogptq, exllama, exllamav2, and awq.

By default, exllamav2 is used for GPTQ 4-bit quantization. However, you can choose a specific implementation by configuring the --qlinear_gptq_impl option, which accepts auto, exllama, or exllamav2.
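
For example, to serve a GPTQ-quantized checkpoint with the exllamav2 implementation selected explicitly (the model name below is only an illustration; substitute any GPTQ-quantized model):

python3 -m scalellm.serve.api_server \
  --model=TheBloke/Llama-2-7B-Chat-GPTQ \
  --qlinear_gptq_impl=exllamav2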

Supported Models

Models    | Tensor Parallel | Quantization | Chat API | HF models examples
Aquila    | Yes | Yes | Yes | BAAI/Aquila-7B, BAAI/AquilaChat-7B
Bloom     | Yes | Yes | No  | bigscience/bloom
Baichuan  | Yes | Yes | Yes | baichuan-inc/Baichuan2-7B-Chat
ChatGLM3  | Yes | Yes | Yes | THUDM/chatglm3-6b
Gemma     | Yes | Yes | Yes | google/gemma-2b
GPT_j     | Yes | Yes | No  | EleutherAI/gpt-j-6b
GPT_NeoX  | Yes | Yes | No  | EleutherAI/gpt-neox-20b
GPT2      | Yes | Yes | No  | gpt2
InternLM  | Yes | Yes | Yes | internlm/internlm-7b
Llama3/2  | Yes | Yes | Yes | meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-8B, meta-llama/Llama-2-7b
Mistral   | Yes | Yes | Yes | mistralai/Mistral-7B-v0.1
MPT       | Yes | Yes | Yes | mosaicml/mpt-30b
Phi2      | Yes | Yes | No  | microsoft/phi-2
Qwen      | Yes | Yes | Yes | Qwen/Qwen-72B-Chat
Yi        | Yes | Yes | Yes | 01-ai/Yi-6B, 01-ai/Yi-34B-Chat-4bits, 01-ai/Yi-6B-200K

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on GitHub Issues.

Limitations

There are several known limitations we are looking to address in the coming months, including:

  • Only GPUs newer than the Turing architecture are supported.

Contributing

If you have any questions or want to contribute, please don't hesitate to ask in our Discussions forum or join our Discord chat room. We welcome your input and contributions to make ScaleLLM even better. Please follow Contributing.md to get started.

Acknowledgements

The following open-source projects have been used in this project, either in their original form or modified to meet our needs:

License

This project is released under the Apache 2.0 license.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

scalellm-0.1.8.post1-cp312-cp312-manylinux1_x86_64.whl (80.4 MB, CPython 3.12)
scalellm-0.1.8.post1-cp311-cp311-manylinux1_x86_64.whl (80.4 MB, CPython 3.11)
scalellm-0.1.8.post1-cp310-cp310-manylinux1_x86_64.whl (80.4 MB, CPython 3.10)
scalellm-0.1.8.post1-cp39-cp39-manylinux1_x86_64.whl (80.4 MB, CPython 3.9)
scalellm-0.1.8.post1-cp38-cp38-manylinux1_x86_64.whl (80.4 MB, CPython 3.8)

File details

scalellm-0.1.8.post1-cp312-cp312-manylinux1_x86_64.whl
  SHA256: 58b799bc8f792c41d657cfa33ef651385b7fea0801f6125b10a180a69160a2fd
  MD5: c06453d2a295a1b0a48c45681f4c73f6
  BLAKE2b-256: af5580808fb014ea9bbe2d1f8c11da681bf94c7d514d1a54ff04488574d20bf3

scalellm-0.1.8.post1-cp311-cp311-manylinux1_x86_64.whl
  SHA256: 2012937d7102a88744a4a8aa98e1e4becd8598d965dea706cb960c4595c70bdd
  MD5: 15ced50a4a2b2ffc183c883fbe6018a5
  BLAKE2b-256: f5e2638faa0176bf372f6beb7947ad951fc732a3e9bf5e05cb1c606cee67705d

scalellm-0.1.8.post1-cp310-cp310-manylinux1_x86_64.whl
  SHA256: 237ebdba747e28fa3160195b85991e47a102978739e5af64f58618767cf55439
  MD5: 72392f59681405f2d91180d86b1d54ef
  BLAKE2b-256: 2a3bdf16bb720e811656b1bdb315f6602e2d50ff8f4a6ad86443507a7e0835fb

scalellm-0.1.8.post1-cp39-cp39-manylinux1_x86_64.whl
  SHA256: a9d51afa69ddbd5693874032385ce5a1eefaf1dcaa8beb329c9ec451e3c4a811
  MD5: 69f02df715287f5e6aa2b898827ec2dd
  BLAKE2b-256: 15ec4d035381b2a28de53932afbebf134528bc23e2b9f948ce4217ad55c1cdf3

scalellm-0.1.8.post1-cp38-cp38-manylinux1_x86_64.whl
  SHA256: d9f1c52acf96fd587bd39b5415875852ec01d7ae6a3b0a1146f0d7e1a155d8b1
  MD5: 60af6b69a1ebfd3e86b966c3f4e032e3
  BLAKE2b-256: 1eafa1eb10e3dcdd0e1520a278a8a514b89ee611205f6b3caf9383fde554fc87
