
gpuhost-client

Python client for the ComputeGateway GPU Model Host — embeddings, entity extraction, reranking, translation, OCR, LLM completion, search and fetch — all exposed via a single typed client with production-tuned default batch sizes baked in.

pip install gpuhost-client

Quick start

from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://<your-gateway-host>",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

client.close()

The client is also a context manager:

with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)

Single vs batch — the same method handles both

Every inference method accepts either a scalar or a sequence:

Method             Scalar input → returns        List input → returns
embed(texts)       list[float] (one vector)      list[list[float]] aligned to input
entities(text)     list[entity-dict]             list[list[entity-dict]]
translate(text)    str                           list[str]
ocr(images)        dict (one OCR result)         list[dict]
search(query)      dict                          list[dict]
fetch(url)         dict                          list[dict]
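The scalar-vs-list convention can be sketched as a small normalization step (a hypothetical helper for illustration, not the client's actual internals):

```python
def normalize_input(x):
    """Return (items, was_scalar). Strings, bytes, and dicts count as
    scalars even though they are iterable; lists/tuples are batches."""
    if isinstance(x, (str, bytes, dict)):
        return [x], True
    if isinstance(x, (list, tuple)):
        return list(x), False
    return [x], True  # any other scalar (e.g. a Path)

def denormalize_output(results, was_scalar):
    # Unwrap the single result when the caller passed a scalar.
    return results[0] if was_scalar else results
```

This is why `embed("hello")` returns one vector while `embed(["a", "b"])` returns a list of vectors aligned to the input.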

When you pass a list, the client:

  1. Auto-chunks into requests of OPTIMAL_BATCH_SIZE[endpoint] items.
  2. Calls the dedicated /v1/<endpoint>/batch alias — guaranteed batch envelope, indexed per-item results.
  3. Reassembles results in input order before returning.
  4. Raises GPUHostHTTPError on the first failed item, unless you opt in to per-item error inclusion on the batch methods (e.g. client.llm_batch(...)).
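The chunk-and-reassemble steps above can be sketched as follows (a minimal illustration; the batch-size table and the request call are stand-ins for the real internals):

```python
from typing import Callable

OPTIMAL_BATCH_SIZE = {"embed": 128}  # illustrative values only

def run_batched(endpoint: str, items: list, call: Callable[[list], list]) -> list:
    """Chunk `items` at the endpoint's batch size, call the batch
    endpoint per chunk, and reassemble results in input order."""
    size = OPTIMAL_BATCH_SIZE[endpoint]
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    results: list = []
    for chunk in chunks:  # sequential here; the real client may parallelize
        results.extend(call(chunk))
    return results
```

Because each chunk's results are extended in order, the output list lines up index-for-index with the input list.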

Recommended batch sizes (sweet spots)

Defaults are sourced from the production T4 sweep at rev ca-nas-prd-wus3--0000078. They are the balanced point — lowest p95 within ~90 % of peak items/s — safe for interactive callers.

Endpoint / model           Default batch_size    Speed-up over bs=1
embed-baseline (Qwen3)     128                   23.9×
embed-mpnet-legacy         512                   22.3×
entity-gliner              128                   8.6×
rerank (bge-v2-m3)         32                    5.6×
translate (any pair)       256                   ≈34×
ocr-paddle                 8                     1.4×

Override per call when steady-state throughput matters more than tail latency:

vecs = client.embed(big_list, batch_size=256)        # peak embed throughput
out  = client.translate(many_sents, src="zh",
                        batch_size=512, max_parallel=4)

The constants are exposed for inspection / overrides:

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

SERVER_BATCH_CAP reflects the gateway's hard-rejection thresholds (T-037/T-038). Anything above the cap is rejected with HTTP 400 — the client clamps user-supplied batch_size to the cap as a safety net, but the gateway is the source of truth.
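The clamping safety net amounts to a one-line min against the cap table (cap values below are illustrative, not the gateway's real thresholds):

```python
SERVER_BATCH_CAP = {"embed": 512, "rerank": 64}  # illustrative values only

def effective_batch_size(endpoint: str, requested: int) -> int:
    # Clamp the user-supplied batch_size to the gateway's hard cap,
    # so oversized batches never trigger an HTTP 400 rejection.
    return min(requested, SERVER_BATCH_CAP[endpoint])
```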

Errors

All errors derive from GPUHostError:

  • GPUHostHTTPError — non-2xx response. Carries status_code, code (e.g. BadRequest, ModelLoadRejected), message, retryable, retry_after_ms, request_id.
  • GPUHostQuotaError — HTTP 429 specifically (subclass of HTTP error).
  • GPUHostTimeoutError — transport-level timeout.

from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
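Since GPUHostHTTPError carries retryable and retry_after_ms, a caller-side retry loop might look like this (a sketch; the exception class here is a stand-in with the documented fields so the snippet is self-contained):

```python
import time

class GPUHostHTTPError(Exception):
    """Stand-in mirroring the documented error fields."""
    def __init__(self, status_code, retryable=False, retry_after_ms=None):
        self.status_code = status_code
        self.retryable = retryable
        self.retry_after_ms = retry_after_ms

def with_retries(fn, max_attempts=3):
    """Retry only when the server marks the error as retryable,
    honoring its suggested backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except GPUHostHTTPError as e:
            if not e.retryable or attempt == max_attempts - 1:
                raise
            time.sleep((e.retry_after_ms or 1000) / 1000)
```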

OCR input flexibility

ocr(...) accepts:

  • raw bytes (bytes/bytearray),
  • a filesystem path (str or pathlib.Path),
  • an already-encoded base64 string.

MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list to OCR many images in one call — the client uses the /v1/ocr/batch alias and reassembles results.

LLM streaming

llm_stream(...) yields one chunk dict per SSE data: line. The iterator exits when the gateway emits data: [DONE]. Network/timeout errors propagate as exceptions, not as in-stream events.
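The SSE framing amounts to: parse each data: line as JSON and stop at the [DONE] sentinel. A minimal sketch over an iterable of raw lines (the real client reads them from the HTTP response):

```python
import json

def iter_sse_chunks(lines):
    """Yield one parsed chunk dict per `data:` line until `data: [DONE]`."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```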

Async?

Out of scope for v0.1. The synchronous client is built on httpx.Client which uses connection pooling, so concurrent callers should construct one client per process/thread group rather than per request. An asyncio-based sibling is on the roadmap and will share the same method names.
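One common way to follow that advice is a lazily created, module-level client shared across calls (names here are hypothetical; `factory` stands in for constructing a GPUHostClient):

```python
import threading

_client = None
_lock = threading.Lock()

def get_client(factory):
    """Create one shared client per process instead of per request,
    so httpx connection pooling actually gets reused."""
    global _client
    with _lock:
        if _client is None:
            _client = factory()
        return _client
```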

Compatibility

Client    GPU Model Host API surface
0.1.x     rev ≥ 76, /v1/... (T-037 caps + T-038 rerank cap)

License

MIT. See LICENSE.


