Minimal Python SDK for the vLLM API
Project description
vLLM SDK
Minimal Python SDK for the vLLM API. This package provides a lightweight client library for interacting with vLLM API servers, with only httpx and pydantic as dependencies.
Installation
pip install vllm-sdk
Quick Start
import asyncio
import os
from dotenv import load_dotenv
from vllm_sdk import AsyncClient, ChatMessage, Variant
load_dotenv()
MODEL_70B = "meta-llama/Llama-3.3-70B-Instruct"
async def main() -> None:
api_key = os.getenv("API_KEY")
if not api_key:
raise RuntimeError("Set API_KEY in your environment or .env file")
# Create async client and model variant
client = AsyncClient(api_key=api_key)
variant = Variant(MODEL_70B)
# Basic non-streaming chat completion
response = await client.chat.completions.create(
model=variant,
messages=[ChatMessage(role="user", content="Say hello in one sentence.")],
max_completion_tokens=50,
temperature=0.7,
)
print(response.choices[0].message.content)
asyncio.run(main())
Features
- Minimal Dependencies: Only requires
httpxandpydantic - Type Safety: Full Pydantic schema validation for requests and responses
- Async Support: Built on
httpxfor async/await support - Streaming: Support for streaming chat completions
- Feature Search: Search SAE features by semantic similarity
API Reference
HTTP API Routes
-
POST
/v1/chat/completions: Create chat completions.- Request body:
ChatCompletionRequest - Response:
ChatCompletionResponsefor non-streaming calls, or a server-sent events stream ofChatCompletionChunkobjects whenstream=True.
- Request body:
-
POST
/v1/features/search: Search SAE features by semantic similarity.- Request body:
FeatureSearchRequest - Response:
FeatureSearchResponse
- Request body:
-
POST
/v1/features/rerank: Rerank an existing list of SAE features for a new query.- Request body:
FeatureRerankRequest - Response:
FeatureRerankResponse
- Request body:
-
POST
/v1/chat_attribution/inspect: Inspect which SAE features are most active for a given chat trace.- Request body:
FeatureInspectionRequest - Response:
FeatureInspectionResponse
- Request body:
-
POST
/v1/chat_attribution/activations: Retrieve raw SAE feature activations for a chat trace.- Request body:
ActivationsRequest - Response:
ActivationsResponse
- Request body:
-
POST
/v1/chat_attribution/logits: Retrieve token logits for a chat trace, optionally with feature interventions applied.- Request body:
LogitsRequest - Response:
LogitsResponse
- Request body:
-
POST
/v1/chat_attribution/contrast: Compute features that distinguish two datasets of conversations.- Request body:
ContrastRequest - Response:
ContrastResponse
- Request body:
Client classes
The SDK exposes both synchronous and asynchronous clients:
- Client: Synchronous client.
- AsyncClient: Asynchronous client (recommended for most applications).
- Variant: Helper object that bundles a model name and optional SAE feature interventions.
Methods
-
Chat completions
- Sync:
client.chat.completions.create(...)→ChatCompletionResponse - Async:
await async_client.chat.completions.create(...)→ChatCompletionResponse - Async streaming:
async for chunk in async_client.chat.completions.create_stream(...): ...→ yieldsChatCompletionChunk
- Sync:
-
Feature search and rerank
- Search:
client.features.search(...)/await async_client.features.search(...)→FeatureSearchResponse - Rerank:
client.features.rerank(...)/await async_client.features.rerank(...)→FeatureRerankResponse
- Search:
-
Chat attribution
- Inspect features:
client.features.inspect(...)/await async_client.features.inspect(...)→FeatureInspectionResponse - Raw activations:
client.features.activations(...)/await async_client.features.activations(...)→ActivationsResponse - Logits:
client.features.logits(...)/await async_client.features.logits(...)→LogitsResponse - Contrast datasets:
client.features.contrast(...)/await async_client.features.contrast(...)→ContrastResponse
- Inspect features:
Schemas
All request and response models live in vllm_sdk.schemas and are used by the client methods above:
-
Core
ModelName- Enum of supported model namesRoleLiteral- Literal type for message roles ("system","user","assistant")InterventionSpec- Single SAE feature intervention (index, strength, mode)
-
Chat completions
ChatMessage- Individual chat messageChatCompletionRequest- Chat completion request payloadChatCompletionMessage- Assistant message in a completionChatCompletionChoice- Single choice in a completionChatCompletionUsage- Token usage informationChatCompletionResponse- Non-streaming chat completion responseChatCompletionDelta- Incremental update in a streamed responseChatCompletionChunkChoice- Choice within a streamed chunkChatCompletionChunk- Streaming chunk object
-
Feature search and rerank
FeatureItem- Single SAE feature (id, label, layer, index)FeatureSearchRequest/FeatureSearchResponse- Feature search request/responseFeatureRerankRequest/FeatureRerankResponse- Feature rerank request/response
-
Chat attribution / interpretability
FeatureInspectionRequest/FeatureInspectionResponse- Feature inspection over a conversationFeatureInspection- Single inspected feature with activation scoreActivationsRequest/ActivationsResponse- Raw feature activations over a conversationLogitsRequest/LogitsResponse- Token logits, optionally with interventions and index rangesContrastRequest/ContrastResponse- Features that distinguish two datasets of conversations
Examples
chat completion
import asyncio
import os
from dotenv import load_dotenv
from vllm_sdk import AsyncClient, ChatMessage, Variant
load_dotenv()
MODEL_70B = "meta-llama/Llama-3.3-70B-Instruct"
async def main() -> None:
api_key = os.getenv("API_KEY")
client = AsyncClient(api_key=api_key)
variant = Variant(MODEL_70B)
chat_response = await client.chat.completions.create(
model=variant,
messages=[{"role": "user", "content": "Say hello in one sentence."}],
max_completion_tokens=50,
temperature=0.7,
)
print(f" Content: {chat_response.choices[0].message.content}")
asyncio.run(main())
Sample output:
Content: Hello, it's nice to meet you and I'm here to help with any questions or topics you'd like to discuss!
Feature search
feature_response = await client.features.search(
query="roleplay as pirate",
model=variant,
top_k=3,
)
print(f" Found {len(feature_response.data)} features:")
for i, feature in enumerate(feature_response.data[:3], 1):
print(
f" {i}. {feature.label} "
f"(layer {feature.layer}, dim {feature.dimension} "
f"feature_index_in_sae {feature.index_in_sae}) feature label {feature.id}"
)
Sample output (truncated):
Found 3 features:
1. The assistant should engage with pirate-themed content or roleplay as a pirate (layer 50, dim None feature_index_in_sae 11828) feature id 1695f2b7-b149-4f65-a433-a966158180f2
2. The assistant should roleplay as a pirate (layer 50, dim None feature_index_in_sae 34737) feature id 594809d3-b146-4f44-b58f-cb6b3f011737
3. Descriptions of pirate capabilities and ongoing actions (layer 50, dim None feature_index_in_sae 24694) feature id ddd50c94-2379-4b4c-a303-648ccff11971
Chat completion with interventions
# Use the first two features from the previous search as intervention with steering strength 0.40
variant.set(features=[feature_response.data[0], feature_response.data[1]], strength=0.40)
chat_response = await client.chat.completions.create(
model=variant,
messages=[
{
"role": "user",
"content": "About 100 words, please give me some tourist information about Tokyo.",
}
],
max_completion_tokens=512,
temperature=0.7,
)
print(f" Content: {chat_response.choices[0].message.content}")
Sample output:
Content: Tokyo be a grand place to visit, matey! Ye be lookin' fer a swashbucklin' good time, eh? Well, hoist the colors and set sail fer the Tokyo Bay, where ye can find the famous Tsukiji Market, full o' seafood and hidden treasure! Then, chart a course fer the Shibuya Crossing, where ye can find yerself a chest overflowin' with gold doubloons... er, shopping and dining booty! And don't ferget to visit the Meiji Temple, a grand ol' ship o' a building, full o' history and mystery! Arrr!
Streaming chat completion
from vllm_sdk.schemas import ChatCompletionChunk
response_text = ""
chat_response = client.chat.completions.create_stream(
model=variant,
messages=[
{
"role": "user",
"content": "About 100 words, please give me some tourist information about Tokyo.",
}
],
max_completion_tokens=512,
temperature=0.7,
)
i = 0
async for chunk in chat_response:
if chunk.choices[0].delta.content:
i += 1
response_text += chunk.choices[0].delta.content
if i % 50 == 0:
print(f"response_text after {i} tokens: {response_text}")
print(f" Response text: {response_text}")
Sample output:
response_text after 50 tokens: Tokyo be a grand place fer visitin'! Yer lookin' fer a swashbucklin' good time, eh? Well, matey, ye should set sail fer the Tokyo Bay, where ye can find yerself a
response_text after 100 tokens: Tokyo be a grand place fer visitin'! Yer lookin' fer a swashbucklin' good time, eh? Well, matey, ye should set sail fer the Tokyo Bay, where ye can find yerself a fine spot o' treasure at the Tsukiji Market. And don't be forgettin' to chart a course fer the Meiji Temple, where ye can find yerself a chest overflowin' with golden doubloons... er, I
response_text after 150 tokens: Tokyo be a grand place fer visitin'! Yer lookin' fer a swashbucklin' good time, eh? Well, matey, ye should set sail fer the Tokyo Bay, where ye can find yerself a fine spot o' treasure at the Tsukiji Market. And don't be forgettin' to chart a course fer the Meiji Temple, where ye can find yerself a chest overflowin' with golden doubloons... er, I mean, a fine spot o' relaxation! And if ye be lookin' fer a bit o' adventure, hoist the colors and head to the Shibuya Crossing, where ye can find yerself in the midst o' a swash
Response text: Tokyo be a grand place fer visitin'! Yer lookin' fer a swashbucklin' good time, eh? Well, matey, ye should set sail fer the Tokyo Bay, where ye can find yerself a fine spot o' treasure at the Tsukiji Market. And don't be forgettin' to chart a course fer the Meiji Temple, where ye can find yerself a chest overflowin' with golden doubloons... er, I mean, a fine spot o' relaxation! And if ye be lookin' fer a bit o' adventure, hoist the colors and head to the Shibuya Crossing, where ye can find yerself in the midst o' a swashbucklin' good time! Arrr!
Feature inspection, activations, logits, and contrast
from vllm_sdk import ChatMessage
feature_inspection = await client.features.inspect(
model=variant,
messages=[
ChatMessage(
role="user",
content="About 100 words, please give me some tourist information about Tokyo.",
),
ChatMessage(
role="assistant",
content="Ahoy, matey! Here be the best places to visit in Tokyo, the scurvy dog's life for ye: ...",
),
],
top_k=10,
)
print(f" Features: {feature_inspection.features}")
feature_activations = await client.features.activations(
model=variant,
messages=[
ChatMessage(
role="user",
content="About 100 words, please give me some tourist information about Tokyo.",
),
ChatMessage(
role="assistant",
content="Ahoy, matey! If ye be lookin' fer a swashbucklin' adventure, Tokyo be the place fer ye! ...",
),
],
)
print(f" Activations: {len(feature_activations.activations)} activations")
print(f" Activations max: {max(feature_activations.activations)}")
print(
f\" Activations number of non-zero activations: "
f\"{sum(1 for x in feature_activations.activations if x != 0)}\"
)
logits = await client.features.logits(
model=variant,
messages=[
{"role": "user", "content": "Say hello in one sentence."},
{
"role": "assistant",
"content": "Ahoy, matey! If ye be lookin' fer a swashbucklin' adventure, Tokyo be the place fer ye! ...",
},
],
end_idx=142,
)
print(f" Logits: {len(logits.logits)} logits")
print(f" Logits max: {max(logits.logits.values())}")
top = sorted(logits.logits.items(), key=lambda x: x[1], reverse=True)[:10]
print(f" Top 10 logits: {top}")
Sample output (truncated):
Features: [FeatureInspection(feature=FeatureItem(id='1695f2b7-b149-4f65-a433-a966158180f2', label='The assistant should engage with pirate-themed content or roleplay as a pirate', ...), ...]
Activations: 65536 activations
Activations max: 3.90625
Activations number of non-zero activations: 5990
Logits: 126948 logits
Logits max: 27.875
Top 10 logits: [(' plank', 27.875), (' gang', 16.875), (' pl', 15.0625), ...]
Contrast and rerank
default_conversation = [
[
{"role": "user", "content": "Hello how are you?"},
{
"role": "assistant",
"content": "I am a helpful assistant. How can I help you?",
},
]
]
joke_conversation = [
[
{"role": "user", "content": "Hello how are you?"},
{
"role": "assistant",
"content": "What do you call an alligator in a vest? An investigator!",
},
]
]
contrast = await client.features.contrast(
model=variant,
dataset_1=default_conversation,
dataset_2=joke_conversation,
k_to_add=30,
k_to_remove=30,
)
print(f" Contrast added features: {contrast.top_to_add}")
print(f" Contrast removed features: {contrast.top_to_remove}")
rerank = await client.features.rerank(
query="funny",
model=variant,
features=contrast.top_to_remove,
top_k=10,
)
print(f" ✓ Rerank successful")
print(f" Reranked features: {rerank.data}")
Sample output (truncated):
Contrast added features: [FeatureItem(id='8bb53ed7-ed2f-4166-9161-13df6006451d', label='Assistant responding to casual greetings about its wellbeing', ...), ...]
Contrast removed features: [FeatureItem(id='20addfb3-7f1e-4fb3-be0d-54f70c61a27d', label='Action phrases in joke setups and story narratives', ...), ...]
Reranked features: [FeatureItem(id='fff8afce-908c-4d4b-a161-389eb8b83c4a', label='Transition between joke setup and punchline', ...), ...]
License
Apache 2.0
Links
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vllm_sdk-0.2.0.tar.gz.
File metadata
- Download URL: vllm_sdk-0.2.0.tar.gz
- Upload date:
- Size: 20.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
357685f5a67e0fbd1caba78f56eb1e9347a9e75ca626bc0c22275fd0f9ee0e50
|
|
| MD5 |
284246ee6c8051fc0b76a0a928e7fcde
|
|
| BLAKE2b-256 |
0092f29315027f673d4729395dd89c1fc68283560caf5ac7ee4538c68babff8b
|
File details
Details for the file vllm_sdk-0.2.0-py3-none-any.whl.
File metadata
- Download URL: vllm_sdk-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
498f467cde5973765ed5fdb46f38024af9a7c28336c148aa10d3c265d7560793
|
|
| MD5 |
2a93d816adf6130997c20f42c7159565
|
|
| BLAKE2b-256 |
8f9198363f4d78596c51bfedee1c8024e1afde99fc42654b696c6248476125d0
|