Skip to main content

☸️ Easy, advanced inference platform for large language models on Kubernetes

Project description

llmaz

Easy, advanced inference platform for large language models on Kubernetes

stability-alpha GoReport Widget Latest Release

llmaz (pronounced /lima:z/), aims to provide a Production-Ready inference platform for large language models on Kubernetes. It closely integrates with the state-of-the-art inference backends to bring the leading-edge researches to cloud.

🌱 llmaz is alpha now, so API may change before graduating to Beta.

Architecture

architecture

Features Overview

  • Easy of Use: People can quick deploy a LLM service with minimal configurations.
  • Broad Backends Support: llmaz supports a wide range of advanced inference backends for different scenarios, like vLLM, Text-Generation-Inference, SGLang, llama.cpp. Find the full list of supported backends here.
  • Model Distribution: Out-of-the-box model cache system with Manta.
  • Accelerator Fungibility: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
  • SOTA Inference: llmaz supports the latest cutting-edge researches like Speculative Decoding or Splitwise(WIP) to run on Kubernetes.
  • Various Model Providers: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
  • Multi-hosts Support: llmaz supports both single-host and multi-hosts scenarios with LWS from day 0.
  • Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster-Autoscaler or Karpenter to meet elastic demands.

Quick Start

Installation

Read the Installation for guidance.

Deploy

Here's a toy example for deploying facebook/opt-125m, all you need to do is to apply a Model and a Playground.

Please refer to examples to learn more.

Note: if your model needs Huggingface token for weight downloads, please run kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token> ahead.

Model

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1

Inference Playground

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m

Test

Expose the service

kubectl port-forward pod/opt-125m-0 8080:8080

Get registered models

curl http://localhost:8080/v1/models

Request a query

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
}'

More than quick-start

If you want to learn more about this project, please refer to develop.md.

Roadmap

  • Gateway support for traffic routing
  • Metrics support
  • Serverless support for cloud-agnostic users
  • CLI tool support
  • Model training, fine tuning in the long-term

Community

Join us for more discussions:

Contributions

All kinds of contributions are welcomed ! Please following CONTRIBUTING.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmaz-0.0.1.tar.gz (12.4 kB view details)

Uploaded Source

Built Distribution

llmaz-0.0.1-py3-none-any.whl (16.4 kB view details)

Uploaded Python 3

File details

Details for the file llmaz-0.0.1.tar.gz.

File metadata

  • Download URL: llmaz-0.0.1.tar.gz
  • Upload date:
  • Size: 12.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.4.0

File hashes

Hashes for llmaz-0.0.1.tar.gz
Algorithm Hash digest
SHA256 a05dd0104af8faece8304265e83ad4ef07e75a28a37f0817e4fd0da3b0772b2c
MD5 bf4938602f8d3b01f34f0dda8b48aead
BLAKE2b-256 a14d3e666b2afece4dfdad424462781e66324f46518a770335cbe2caa9702418

See more details on using hashes here.

File details

Details for the file llmaz-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: llmaz-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 16.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/23.4.0

File hashes

Hashes for llmaz-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 1e61b2a6642dc5b355af7c62eac27a6cfae1f59b752ea1c991646b540a8ecc5d
MD5 9090db1acd6a82143048bc479adcbc36
BLAKE2b-256 cdf3ff196d8f54b2667ec425c74e70d93b48c01fef3937327f86625895ac9073

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page