☸️ Easy, advanced inference platform for large language models on Kubernetes
llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.
🌱 llmaz is alpha now, so the API may change before graduating to Beta.
Architecture
Features Overview
- Ease of Use: People can quickly deploy an LLM service with minimal configuration.
- Broad Backend Support: llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, Text-Generation-Inference, SGLang, and llama.cpp. Find the full list of supported backends here.
- Model Distribution: Out-of-the-box model cache system with Manta.
- Accelerator Fungibility: llmaz supports serving the same LLM with various accelerators to optimize cost and performance; see the sketch after this list.
- SOTA Inference: llmaz supports cutting-edge research like Speculative Decoding and Splitwise (WIP) on Kubernetes.
- Various Model Providers: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores, and automatically handles model loading with no effort required from users.
- Multi-Host Support: llmaz supports both single-host and multi-host scenarios with LWS from day 0.
- Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster-Autoscaler or Karpenter to meet elastic demands.
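As an illustration of accelerator fungibility, a single OpenModel could declare more than one inference flavor so the same model can land on whichever accelerator is available. The sketch below mirrors the Quick Start example further down; the extra a10 flavor and the use of multiple flavors in one spec are illustrative assumptions, not copied from the official docs.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: a10 # illustrative: preferred GPU type
    requests:
      nvidia.com/gpu: 1
  - name: t4  # fallback GPU type, as in the Quick Start example
    requests:
      nvidia.com/gpu: 1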
Quick Start
Installation
Read the Installation for guidance.
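As a rough orientation only: assuming the project ships a Helm chart (the chart repository URL, chart name, and namespace below are placeholders, not authoritative values), an install could look like the following. Always follow the Installation guide for the real commands.

helm repo add llmaz <chart-repo-url>   # placeholder URL, see the Installation guide
helm repo update
helm install llmaz llmaz/llmaz --namespace llmaz-system --create-namespace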
Deploy
Here's a toy example for deploying facebook/opt-125m. All you need to do is apply a Model and a Playground.
Please refer to examples to learn more.
Note: if your model needs a Hugging Face token for weight downloads, please run the following beforehand:
kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>
Model
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1
Inference Playground
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
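With the two manifests above saved locally (the file names below are illustrative), deploying them is a plain kubectl apply:

kubectl apply -f model.yaml      # the OpenModel above
kubectl apply -f playground.yaml # the Playground above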
Test
Expose the service
kubectl port-forward pod/opt-125m-0 8080:8080
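If the pod is still starting up, the port-forward above may fail; one way to wait for readiness first (pod name taken from the command above, timeout chosen arbitrarily):

kubectl wait --for=condition=Ready pod/opt-125m-0 --timeout=300s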
Get registered models
curl http://localhost:8080/v1/models
Send a completion request
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
More than quick-start
If you want to learn more about this project, please refer to develop.md.
Roadmap
- Gateway support for traffic routing
- Metrics support
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term
Community
Join us for more discussions:
- Slack Channel: #llmaz
Contributions
All kinds of contributions are welcome! Please follow CONTRIBUTING.md.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file llmaz-0.0.1.tar.gz.
File metadata
- Download URL: llmaz-0.0.1.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.4.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | a05dd0104af8faece8304265e83ad4ef07e75a28a37f0817e4fd0da3b0772b2c
MD5 | bf4938602f8d3b01f34f0dda8b48aead
BLAKE2b-256 | a14d3e666b2afece4dfdad424462781e66324f46518a770335cbe2caa9702418
File details
Details for the file llmaz-0.0.1-py3-none-any.whl.
File metadata
- Download URL: llmaz-0.0.1-py3-none-any.whl
- Upload date:
- Size: 16.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.4.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1e61b2a6642dc5b355af7c62eac27a6cfae1f59b752ea1c991646b540a8ecc5d
MD5 | 9090db1acd6a82143048bc479adcbc36
BLAKE2b-256 | cdf3ff196d8f54b2667ec425c74e70d93b48c01fef3937327f86625895ac9073