An MLOps library for LLM deployment w/ the vLLM engine on RunPod's infra.
SuperLaser
⚠️ Not yet ready for primetime ⚠️
SuperLaser provides a comprehensive suite of tools and scripts for deploying LLMs onto RunPod's pod and serverless infrastructure. Deployments run a containerized vLLM engine at runtime, providing memory-efficient, high-performance inference.
Features
- Scalable Deployment: Easily scale your LLM inference tasks with vLLM and RunPod serverless capabilities.
- Cost-Effective: Optimize resource and hardware usage with tensor parallelism and flexible GPU selection.
- Uses OpenAI's API: Query your endpoint through the OpenAI client with chat, completion, and streaming options.
Install
```
pip install superlaser
```
Before you begin, ensure you have:
- A RunPod account.
RunPod Config
The first step is to obtain an API key from RunPod. In your account's console, open the Settings section and click on API Keys.
After obtaining a key, set it as an environment variable:
```
export RUNPOD_API_KEY=<YOUR-API-KEY>
```
Configure Template
Before spinning up a serverless endpoint, let's first configure a template that we'll pass to the endpoint during staging. A template lets you choose a serverless or pod asset, your Docker image, and disk sizes for the container and volume.
Configure your template with the following attributes:
```python
import os

from superlaser import RunpodHandler as runpod

api_key = os.environ.get("RUNPOD_API_KEY")

template_data = runpod.set_template(
    serverless="true",                                      # Serverless (vs. pod) asset
    template_name="superlaser-inf",                         # Give a name to your template
    container_image="runpod/worker-vllm:0.3.1-cuda12.1.0",  # Docker image stub
    model_name="mistralai/Mistral-7B-v0.1",                 # Hugging Face model stub
    max_model_length=340,                                   # Max tokens the engine handles per request
    container_disk=15,                                      # Container disk size in GB
    volume_disk=15,                                         # Volume disk size in GB
)
```
Create Template on RunPod
```python
template = runpod(api_key, data=template_data)
print(template().text)
```
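The call above prints the raw API response. To capture the template ID programmatically, here's a minimal sketch, assuming the response body is JSON with the ID under an `id` key (inspect the printed output to confirm the exact shape):

```python
import json

# Assumption: the response body is JSON and the template ID sits under "id".
template_id = json.loads(template().text)["id"]
print(template_id)
```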
Configure Endpoint
After your template is created, it will return a data dictionary that includes your template ID. We will pass this template ID when configuring the serverless endpoint in the section below:
```python
endpoint_data = runpod.set_endpoint(
    gpu_ids="AMPERE_24",        # Options: "AMPERE_16,AMPERE_24,AMPERE_48,AMPERE_80,ADA_24"
    idle_timeout=5,             # Seconds a worker stays alive with no traffic
    name="vllm_endpoint",
    scaler_type="QUEUE_DELAY",  # Scale workers based on request queue delay
    scaler_value=1,             # Queue-delay threshold (seconds) that triggers scaling
    template_id="template-id",  # Template ID returned in the previous step
    workers_max=1,
    workers_min=0,              # Scale to zero when idle
)
```
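The options string in the `gpu_ids` comment reads like a comma-separated list, and RunPod ranks comma-separated GPU tiers in priority order. Assuming `set_endpoint` forwards the string unchanged (worth verifying against the SuperLaser source), a variant with a fallback tier:

```python
# Assumption: gpu_ids is forwarded to RunPod as-is, so a comma-separated
# string ranks GPU tiers in priority order (24 GB Ampere first, then 48 GB).
endpoint_data = runpod.set_endpoint(
    gpu_ids="AMPERE_24,AMPERE_48",
    idle_timeout=5,
    name="vllm_endpoint",
    scaler_type="QUEUE_DELAY",
    scaler_value=1,
    template_id="template-id",
    workers_max=1,
    workers_min=0,
)
```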
Start Endpoint on RunPod
```python
endpoint = runpod(api_key, data=endpoint_data)
print(endpoint().text)
```
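Before sending traffic, you can check that workers can come up. A quick sketch using RunPod's serverless health route (assuming the standard `/v2/{endpoint_id}/health` path; `your-endpoint-id` is a placeholder for the ID returned above):

```python
import requests

endpoint_id = "your-endpoint-id"  # placeholder -- use the ID from the response above

# RunPod's serverless health route reports worker and job counts.
health = requests.get(
    f"https://api.runpod.ai/v2/{endpoint_id}/health",
    headers={"Authorization": f"Bearer {api_key}"},
)
print(health.json())
```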
Call Endpoint
After your endpoint is staged, it will return a dictionary with your endpoint ID. Pass this endpoint ID to the OpenAI client and start making API requests!
```python
from openai import OpenAI

endpoint_id = "your-endpoint-id"

client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)
```
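As a quick sanity check, you can list what the endpoint serves, assuming your worker-vllm image exposes the OpenAI-compatible `/models` route (recent versions do, but verify for your image tag):

```python
# Assumes the worker exposes the OpenAI-compatible /models route.
for model in client.models.list():
    print(model.id)
```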
Chat w/ Streaming
```python
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "To be or not to be"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
Completion w/ Streaming
```python
stream = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="To be or not to be",
    temperature=0,
    max_tokens=100,
    stream=True,
)

for response in stream:
    print(response.choices[0].text or "", end="", flush=True)
```
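Chat w/o Streaming
The features list also mentions non-streaming use; omit `stream=True` and the client returns the full completion in a single response:

```python
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "To be or not to be"}],
    temperature=0,
    max_tokens=100,
)
print(response.choices[0].message.content)
```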
Hashes for superlaser-0.0.6-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 4defd824787c03c8c146b13736999a9ebadecca3b79c4e33411d2b7c4b014e0f
MD5 | 12f30a32c161a40e4f7719d45b8297c7
BLAKE2b-256 | fef22fed79e68cf12918f12ea761227bd1a4161a2c541a5d0795e01af0431b6b