
gpuhost-client

Python client for the ComputeGateway GPU Model Host — embeddings, entity extraction, reranking, translation, OCR, LLM completion, search and fetch — all exposed via a single typed client with production-tuned default batch sizes baked in.

pip install gpuhost-client

📘 Full user guide: docs/USER_GUIDE.md — every endpoint, batching strategy, error handling, retries, recipes, and FAQ.

Quick start

from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

client.close()

The client is also a context manager:

with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)

Single vs batch — the same method handles both

Every inference method accepts either a scalar or a sequence:

Method             Scalar input → returns        List input → returns
embed(texts)       list[float] (one vector)      list[list[float]], aligned to input
entities(text)     list[entity-dict]             list[list[entity-dict]]
translate(text)    str                           list[str]
ocr(images)        dict (one OCR result)         list[dict]
search(query)      dict                          list[dict]
fetch(url)         dict                          list[dict]

When you pass a list, the client:

  1. Auto-chunks into requests of OPTIMAL_BATCH_SIZE[endpoint] items.
  2. Calls the dedicated /v1/<endpoint>/batch alias — guaranteed batch envelope, indexed per-item results.
  3. Reassembles results in input order before returning.
  4. Raises GPUHostHTTPError on the first failed item, unless you opt in to per-item error inclusion (supported on batch methods such as client.llm_batch).
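
The chunking arithmetic in step 1 can be sketched independently of the client. This is an illustrative reimplementation, not the client's actual code; 128 is the embed-baseline default from the table below.

```python
# Illustrative sketch of list-input chunking: split into slices of the
# endpoint's optimal batch size, then reassemble in input order.
def chunk(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"doc-{i}" for i in range(1000)]
batches = list(chunk(texts, 128))

print(len(batches))       # 1000 items at 128 per request -> 8 requests
print(len(batches[-1]))   # the last request carries the 104-item remainder
```

Because slices preserve order, flattening the per-batch results reproduces the input order exactly, which is what step 3 relies on.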

Recommended batch sizes (sweet spots)

Defaults are sourced from the production T4 sweep at rev ca-nas-prd-wus3--0000078. They sit at the balanced point (lowest p95 while retaining ~90% of peak items/s), which makes them safe for interactive callers.

Endpoint / model          Default batch_size    Speed-up over bs=1
embed-baseline (Qwen3)    128                   23.9×
embed-mpnet-legacy        512                   22.3×
entity-gliner             128                   8.6×
rerank (bge-v2-m3)        32                    5.6×
translate (any pair)      256                   ≈34×
ocr-paddle                8                     1.4×

Override per call when steady-state throughput matters more than tail latency:

vecs = client.embed(big_list, batch_size=256)        # peak embed throughput
out  = client.translate(many_sents, src="zh",
                        batch_size=512, max_parallel=4)

The constants are exposed for inspection / overrides:

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

SERVER_BATCH_CAP reflects the gateway's hard-rejection thresholds (T-037/T-038). Anything above the cap is rejected with HTTP 400 — the client clamps user-supplied batch_size to the cap as a safety net, but the gateway is the source of truth.
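
The clamping behavior amounts to a min() against the cap. A minimal sketch; the cap values below are made up for the example, the real numbers live in SERVER_BATCH_CAP.

```python
# Hypothetical cap values for illustration only -- inspect the real
# SERVER_BATCH_CAP constant for the gateway's actual thresholds.
SERVER_BATCH_CAP = {"embed": 512, "rerank": 64}

def effective_batch_size(endpoint, requested):
    """Clamp a user-supplied batch_size to the gateway's hard cap."""
    cap = SERVER_BATCH_CAP.get(endpoint)
    return requested if cap is None else min(requested, cap)

print(effective_batch_size("embed", 1024))   # clamped down to 512
print(effective_batch_size("rerank", 32))    # under the cap: unchanged
```

The clamp only saves you from an HTTP 400; it does not change the fact that the gateway, not the client, decides what is accepted.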

Errors

All errors derive from GPUHostError:

  • GPUHostHTTPError — non-2xx response. Carries status_code, code (e.g. BadRequest, ModelLoadRejected), message, retryable, retry_after_ms, request_id.
  • GPUHostQuotaError — HTTP 429 specifically (a subclass of GPUHostHTTPError).
  • GPUHostTimeoutError — transport-level timeout.

from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
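
Since quota errors carry retry_after_ms, a retry wrapper is natural. A self-contained sketch: the QuotaError class below is a stand-in that mimics the documented attributes (retryable, retry_after_ms) so the example runs without the library; in real code you would catch GPUHostQuotaError instead.

```python
import time

class QuotaError(Exception):
    """Stand-in for GPUHostQuotaError with the documented attributes."""
    def __init__(self, retry_after_ms):
        super().__init__("rate limited")
        self.retryable = True
        self.retry_after_ms = retry_after_ms

def call_with_retries(fn, max_attempts=3):
    """Retry a callable, honoring the server-suggested retry_after_ms."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaError as e:
            if attempt == max_attempts - 1 or not e.retryable:
                raise
            time.sleep(e.retry_after_ms / 1000)

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 2:
        raise QuotaError(retry_after_ms=10)
    return "ok"

print(call_with_retries(flaky))  # fails once, succeeds on the second attempt
```

Honoring retry_after_ms rather than a fixed backoff keeps you aligned with whatever the gateway's rate limiter is actually asking for.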

OCR input flexibility

ocr(...) accepts:

  • raw bytes (bytes/bytearray),
  • a filesystem path (str or pathlib.Path),
  • an already-encoded base64 string.

MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list to OCR many images in one call — the client uses the /v1/ocr/batch alias and reassembles results.
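
The magic-byte sniffing can be approximated like this. A simplified sketch, not the client's actual code; it covers the four formats listed above.

```python
def sniff_mime(data: bytes):
    """Guess an image MIME type from leading magic bytes, else None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container whose format tag sits at bytes 8-11
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None

print(sniff_mime(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # image/png
print(sniff_mime(b"RIFF\x00\x00\x00\x00WEBP"))         # image/webp
```

Sniffing bytes rather than trusting file extensions is what lets ocr(...) accept raw bytes and base64 payloads with no filename at all.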

LLM streaming

llm_stream(...) yields one chunk dict per SSE data: line. The iterator exits when the gateway emits data: [DONE]. Network/timeout errors propagate as exceptions, not as in-stream events.
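
The framing is the standard SSE data-line protocol. A simplified parser sketch (not the client's internals), assuming each chunk payload is JSON:

```python
import json

def iter_sse_chunks(lines):
    """Yield one parsed chunk dict per 'data:' line, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip comments, event names, blank keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

raw = [
    'data: {"delta": "Hel"}',
    'data: {"delta": "lo"}',
    "data: [DONE]",
    'data: {"delta": "never reached"}',
]
chunks = list(iter_sse_chunks(raw))
print("".join(c["delta"] for c in chunks))  # Hello
```

Because [DONE] terminates the generator rather than being yielded, a plain for loop over llm_stream(...) ends cleanly without the caller ever seeing the sentinel.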

Async?

Out of scope for v0.1. The synchronous client is built on httpx.Client which uses connection pooling, so concurrent callers should construct one client per process/thread group rather than per request. An asyncio-based sibling is on the roadmap and will share the same method names.
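
"One client per process" in practice means constructing the client once and sharing it across workers. A sketch of that pattern; the StubClient below stands in for GPUHostClient so the example is self-contained, since with a real client connection pooling makes this sharing cheap.

```python
from concurrent.futures import ThreadPoolExecutor

class StubClient:
    """Stand-in for GPUHostClient for illustration purposes."""
    def embed(self, text):
        return [0.0, 0.0, 0.0]  # pretend embedding vector

client = StubClient()  # construct ONCE, not per request

# Fan requests out across threads while reusing the single client.
with ThreadPoolExecutor(max_workers=4) as pool:
    vectors = list(pool.map(client.embed, ["a", "b", "c", "d"]))

print(len(vectors))  # 4
```

The anti-pattern this avoids is building a new client (and a new connection pool) inside each worker function, which defeats pooling entirely.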

Compatibility

Client    GPU Model Host API surface
0.1.x     rev ≥ 76, /v1/... (T-037 caps + T-038 rerank cap)

License

MIT. See LICENSE.
