gpuhost-client
Python client for the ComputeGateway GPU Model Host — embeddings, entity extraction, reranking, translation, OCR, LLM completion, search and fetch — all exposed via a single typed client with production-tuned default batch sizes baked in.
```
pip install gpuhost-client
```
📘 Full user guide: docs/USER_GUIDE.md — every endpoint, batching strategy, error handling, retries, recipes, and FAQ.
Quick start
```python
from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# Rerank — multiple queries in one round-trip (each with its own document set)
batched = client.rerank_batch([
    {"query": "what is RAG?", "documents": docs_a, "top_k": 5},
    {"query": "what is fine-tuning?", "documents": docs_b, "top_k": 5},
])

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# Batch OCR — list of any of the above
texts = client.ocr(["./a.png", img_bytes, b64_str])

# Search — single query (Google web by default)
hits = client.search("ComputeGateway gpu host", top_n=5)

# Search — many queries in parallel (I/O-bound; raise max_parallel)
all_hits = client.search(["q1", "q2", "q3"], provider="bing", mode="news", max_parallel=8)

# Fetch — render a URL to markdown (or many in parallel)
page = client.fetch("https://example.com")
pages = client.fetch(["https://a.com", "https://b.com"], max_parallel=8)

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

# LLM — fan-out batch (each item is a full request, partial-success aware)
results = client.llm_batch([
    {"messages": [{"role": "user", "content": "translate: hi"}], "model": "gpt-4o-mini"},
    {"messages": [{"role": "user", "content": "translate: hello"}], "model": "gpt-4o-mini"},
])

client.close()
```
The client is also a context manager:
```python
with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)
```
Single vs batch — the same method handles both
Every inference method accepts either a scalar or a sequence:
| Method | Scalar input → returns | List input → returns |
|---|---|---|
| `embed(texts)` | `list[float]` (one vector) | `list[list[float]]` aligned to input |
| `entities(text)` | `list[entity-dict]` | `list[list[entity-dict]]` |
| `translate(text)` | `str` | `list[str]` |
| `ocr(images)` | `dict` (one OCR result) | `list[dict]` |
| `search(query)` | `dict` | `list[dict]` |
| `fetch(url)` | `dict` | `list[dict]` |
When you pass a list, the client:

- Auto-chunks into requests of `OPTIMAL_BATCH_SIZE[endpoint]` items.
- Calls the dedicated `/v1/<endpoint>/batch` alias — guaranteed batch envelope, indexed per-item results.
- Reassembles results in input order before returning.
- Raises `GPUHostHTTPError` on the first failed item, unless you opt in to error inclusion via the corresponding option on `client.llm_batch(...)` (see the user guide).
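To make the auto-chunking concrete, here is a sketch of the arithmetic only — the chunking, batch-alias calls, and reassembly all happen inside the client, so you never write this yourself:

```python
import math

texts = [f"doc {i}" for i in range(1000)]

# With the default embed batch size of 128 (see the table below),
# 1,000 texts become ceil(1000 / 128) = 8 batch requests.
bs = 128
chunks = [texts[i:i + bs] for i in range(0, len(texts), bs)]
print(math.ceil(len(texts) / bs), len(chunks))  # 8 8

# In practice this is a single call, with results aligned to the input:
# vecs = client.embed(texts)
# assert len(vecs) == len(texts)
```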
Recommended batch sizes (sweet spots)
Defaults are sourced from the production T4 sweep at rev `ca-nas-prd-wus3--0000078`. They are the balanced point — lowest p95 within ~90% of peak items/s — and are safe for interactive callers.
| Endpoint / model | Default `batch_size` | Speed-up over bs=1 |
|---|---|---|
| `embed-baseline` (Qwen3) | 128 | 23.9× |
| `embed-mpnet-legacy` | 512 | 22.3× |
| `entity-gliner` | 128 | 8.6× |
| `rerank` (bge-v2-m3) | 32 | 5.6× |
| `translate` (any pair) | 256 | ≈34× |
| `ocr-paddle` | 8 | 1.4× |
Override per call when steady-state throughput matters more than tail latency:
```python
vecs = client.embed(big_list, batch_size=256)   # peak embed throughput
out = client.translate(many_sents, src="zh",
                       batch_size=512, max_parallel=4)
```
The constants are exposed for inspection / overrides:
```python
from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP
```
`SERVER_BATCH_CAP` reflects the gateway's hard-rejection thresholds (T-037/T-038). Anything above the cap is rejected with HTTP 400 — the client clamps user-supplied `batch_size` to the cap as a safety net, but the gateway is the source of truth.
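As a quick sanity check before a large job, you can inspect the shipped constants and reproduce the clamp yourself. Treat the sketch below as illustrative: the exact keys of the two mappings (e.g. `"embed-baseline"`) are an assumption — print them in your installed version to see the real names.

```python
from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

print(OPTIMAL_BATCH_SIZE)   # per-endpoint defaults the client will use
print(SERVER_BATCH_CAP)     # gateway hard-rejection thresholds (T-037/T-038)

# Hypothetical key name — check the printed mappings for the real ones.
requested = 1024
cap = SERVER_BATCH_CAP.get("embed-baseline", requested)
effective = min(requested, cap)  # mirrors the client-side clamp described above
print(f"requested={requested} cap={cap} effective={effective}")
```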
Errors
All errors derive from `GPUHostError`:

- `GPUHostHTTPError` — non-2xx response. Carries `status_code`, `code` (e.g. `BadRequest`, `ModelLoadRejected`), `message`, `retryable`, `retry_after_ms`, `request_id`.
- `GPUHostQuotaError` — HTTP 429 specifically (a subclass of the HTTP error).
- `GPUHostTimeoutError` — transport-level timeout.
```python
from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
```
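One common recipe on top of these fields is a bounded retry that honours `retryable` and `retry_after_ms`. This is a sketch, not part of the client; the attempt count and fallback backoff are arbitrary choices:

```python
import time

from gpuhost_client import GPUHostHTTPError

def embed_with_retry(client, texts, attempts=3):
    """Retry only when the gateway marks the failure as retryable."""
    for attempt in range(attempts):
        try:
            return client.embed(texts)
        except GPUHostHTTPError as e:
            if not e.retryable or attempt == attempts - 1:
                raise
            # Prefer the server-suggested delay; fall back to a simple backoff.
            time.sleep((e.retry_after_ms or 1000 * (attempt + 1)) / 1000)
```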
OCR input flexibility
`ocr(...)` accepts:

- raw bytes (`bytes` / `bytearray`),
- a filesystem path (`str` or `pathlib.Path`),
- an already-encoded base64 string.
MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list to OCR many images in one call — the client uses the `/v1/ocr/batch` alias and reassembles results.
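For example, the three input forms can be mixed freely in a single batch call (the file names below are placeholders):

```python
import base64
from pathlib import Path

raw_bytes = Path("./receipt.jpg").read_bytes()            # raw bytes
b64_image = base64.b64encode(raw_bytes).decode("ascii")   # base64 string

# One call; results come back in the same order as the inputs.
results = client.ocr(["./scan.png", raw_bytes, b64_image])
```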
LLM streaming
`llm_stream(...)` yields one chunk dict per SSE `data:` line. The iterator exits when the gateway emits `data: [DONE]`. Network/timeout errors propagate as exceptions, not as in-stream events.
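A minimal consumption loop looks like the sketch below. The shape of each chunk dict follows whatever the gateway puts on the SSE `data:` line, so the `"content"` key used here is an assumption — print one chunk to see the real schema:

```python
parts = []
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    # Assumed field name; inspect `chunk` once to confirm it.
    piece = chunk.get("content", "")
    print(piece, end="", flush=True)
    parts.append(piece)

full_text = "".join(parts)
```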
Async?
Out of scope for v0.1. The synchronous client is built on httpx.Client
which uses connection pooling, so concurrent callers should construct one
client per process/thread group rather than per request. An asyncio-based
sibling is on the roadmap and will share the same method names.
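In the meantime, one shared client plus a thread pool is a reasonable way to fan out I/O-bound calls. The sketch below is just that pattern, not a package API — and for plain list inputs the built-in `max_parallel` option already covers the same ground:

```python
from concurrent.futures import ThreadPoolExecutor

from gpuhost_client import GPUHostClient

urls = ["https://a.com", "https://b.com", "https://c.com"]

# One client shared by all workers; httpx.Client pools connections,
# so avoid constructing a fresh client per request.
with GPUHostClient(host=..., api_key=...) as client:
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(client.fetch, urls))
```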
Compatibility
| Client 0.1.x | GPU Model Host | API surface |
|---|---|---|
| ✓ | rev ≥ 76 | `/v1/...` (T-037 caps + T-038 rerank cap) |
License
MIT. See LICENSE.
Download files
Source Distribution: `gpuhost_client-0.1.4.tar.gz` (23.2 kB)

Built Distribution: `gpuhost_client-0.1.4-py3-none-any.whl` (14.5 kB)
File details
Details for the file gpuhost_client-0.1.4.tar.gz.
File metadata
- Download URL: gpuhost_client-0.1.4.tar.gz
- Upload date:
- Size: 23.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1df19e793576a3544f1d0717f870e80cc9c7bdab28d3c85ead6bad0813364bb7` |
| MD5 | `edd6b9dcb02ea9a65950f000136a524a` |
| BLAKE2b-256 | `0fe59d4d1da7ae553d0308e296b16554f22acc3346eefe788be713822a009226` |
File details
Details for the file gpuhost_client-0.1.4-py3-none-any.whl.
File metadata
- Download URL: gpuhost_client-0.1.4-py3-none-any.whl
- Upload date:
- Size: 14.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `823de3b42ee31031bba71d4e67c1eade739132d18895610f7b3923625bac9e50` |
| MD5 | `cf5a35cccca132a381cdec577483e9ca` |
| BLAKE2b-256 | `ecef21533358ab0c3b1477a8ba19591f47777eae74dbf854fed5cc4ec6ffbe18` |