No project description provided

Project description

ServerlessLLM

ServerlessLLM

ServerlessLLM (sllm, pronounced "slim") is an open-source serverless framework designed to make custom and elastic LLM deployment easy, fast, and affordable. As LLMs grow in size and complexity, deploying them on AI hardware has become increasingly costly and technically challenging, limiting custom LLM deployment to only a select few. ServerlessLLM solves these challenges with a full-stack, LLM-centric serverless system design, optimizing everything from checkpoint formats and inference runtimes to the storage layer and cluster scheduler.

Curious about how it works under the hood? Check out our System Walkthrough for a deep dive into the technical design—perfect if you're exploring your own research or building with ServerlessLLM.

News

[03/25] We're excited to share that we'll be giving a ServerlessLLM tutorial at the SESAME workshop, co-located with ASPLOS/EuroSys 2025 in Rotterdam, Netherlands, on March 31. Slides | More info
[11/24] We have added experimental support of fast checkpoint loading for AMD GPUs (ROCm) when using with vLLM, PyTorch and HuggingFace Accelerate. Please refer to the documentation for more details.
[10/24] ServerlessLLM was invited to present at a global AI tech vision forum in Singapore.
[10/24] We hosted the first ServerlessLLM developer meetup in Edinburgh, attracting over 50 attendees both offline and online. Together, we brainstormed many exciting new features to develop. If you have great ideas, we’d love for you to join us!
[10/24] We made the first public release of ServerlessLLM. Check out the details of the release here.
[09/24] ServerlessLLM now supports embedding-based RAG + LLM deployment. We’re preparing a blog and demo—stay tuned!
[08/24] ServerlessLLM added support for vLLM.
[07/24] We presented ServerlessLLM at Nvidia's headquarters.
[06/24] ServerlessLLM officially went public.

Goals

ServerlessLLM is designed to support multiple LLMs in efficiently sharing limited AI hardware and dynamically switching between them on demand, which can increase hardware utilization and reduce the cost of LLM services. This multi-LLM scenario, commonly referred to as Serverless, is highly sought after by AI practitioners, as seen in solutions like Serverless Inference, Inference Endpoints, and Model Endpoints. However, these existing offerings often face performance overhead and scalability challenges, which ServerlessLLM effectively addresses through three key capabilities:

ServerlessLLM is Fast:

Supports leading LLM inference libraries like vLLM and HuggingFace Transformers. Through vLLM, ServerlessLLM can support various types of AI hardware (summarized by vLLM at here)
Achieves 5-10X faster loading speeds compared to Safetensors and the PyTorch Checkpoint Loader.
Features an optimized model loading scheduler, offering 5-100X lower start-up latency than Ray Serve and KServe.

ServerlessLLM is Cost-Efficient:

Allows multiple LLM models to share GPUs with minimal model switching overhead and supports seamless inference live migration.
Maximizes the use of local storage on multi-GPU servers, reducing the need for expensive storage servers and excessive network bandwidth.

ServerlessLLM is Easy-to-Use:

Simplifies deployment through Ray Cluster and Kubernetes via KubeRay.
Supports seamless deployment of HuggingFace Transformers and custom LLM models.
Supports NVIDIA and AMD GPUs
Easily integrates with the OpenAI Query API.

Getting Started

Install ServerlessLLM with pip or from source.

conda create -n sllm python=3.10 -y
conda activate sllm
pip install serverless-llm

Start a local ServerlessLLM cluster using the Quick Start Guide.
Want to try fast checkpoint loading in your own code? Check out the ServerlessLLM Store Guide.

Documentation

To install ServerlessLLM, please follow the steps outlined in our documentation. ServerlessLLM also offers Python APIs for loading and unloading checkpoints, as well as CLI tools to launch an LLM cluster. Both the CLI tools and APIs are demonstrated in the documentation.

Benchmark

Benchmark results for ServerlessLLM can be found here.

Community

ServerlessLLM is maintained by a global team of over 10 developers, and this number is growing. If you're interested in learning more or getting involved, we invite you to join our community on Discord and WeChat. Share your ideas, ask questions, and contribute to the development of ServerlessLLM. For becoming a contributor, please refer to our Contributor Guide.

Citation

If you use ServerlessLLM for your research, please cite our paper:

@inproceedings{fu2024serverlessllm,
  title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
  author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
  booktitle={18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
  pages={135--153},
  year={2024}
}

Project details

Release history Release notifications | RSS feed

This version

0.8.0

Nov 3, 2025

0.7.0

Jun 6, 2025

0.6.3

Mar 30, 2025

0.6.2

Feb 14, 2025

0.6.1

Feb 12, 2025

0.6.0

Dec 17, 2024

0.5.2

Dec 3, 2024

0.5.1

Oct 28, 2024

0.5.0

Oct 21, 2024

0.0.0

Jun 25, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

serverless_llm-0.8.0.tar.gz (48.4 kB view details)

Uploaded Nov 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

serverless_llm-0.8.0-py3-none-any.whl (70.6 kB view details)

Uploaded Nov 3, 2025 Python 3

File details

Details for the file serverless_llm-0.8.0.tar.gz.

File metadata

Download URL: serverless_llm-0.8.0.tar.gz
Upload date: Nov 3, 2025
Size: 48.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for serverless_llm-0.8.0.tar.gz
Algorithm	Hash digest
SHA256	`e59033d57ea5e82bc0cacb5c3294ce71ba154948f2fb673f8085f09bfaecda79`
MD5	`0cad942362a4e43ee6364f233b88e4c5`
BLAKE2b-256	`fedbe12484d1b8d9448b9fb4c562130ac997d897ffe55d027d6797f2c0e68663`

See more details on using hashes here.

File details

Details for the file serverless_llm-0.8.0-py3-none-any.whl.

File metadata

Download URL: serverless_llm-0.8.0-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 70.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for serverless_llm-0.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f782573bc3f74dfdf09621691cf92948a3c9750a060f463f8632c90a067c7943`
MD5	`7fdca97b9082c4bd6bdcf543560e712e`
BLAKE2b-256	`0733252159cebfa50b4bba0a6cfe3b3a25e2607471b936ecb0a90a1a77d1b9b7`

See more details on using hashes here.

serverless-llm 0.8.0

Navigation

Verified details

Maintainers

Unverified details

Project description

ServerlessLLM

News

Goals

Getting Started

Documentation

Benchmark

Community

Citation

Project details

Verified details

Maintainers

Unverified details

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes