Python client for the ComputeGateway GPU Model Host (embeddings, entities, rerank, translate, OCR, LLM, search, fetch).

Project description

gpuhost-client

Python client for the ComputeGateway GPU Model Host — embeddings, entity extraction, reranking, translation, OCR, LLM completion, search and fetch — all exposed via a single typed client with production-tuned default batch sizes baked in.

pip install gpuhost-client

📘 Full user guide: docs/USER_GUIDE.md — every endpoint, batching strategy, error handling, retries, recipes, and FAQ.

Quick start

from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# Rerank — multiple queries in one round-trip (each with its own document set)
batched = client.rerank_batch([
    {"query": "what is RAG?",         "documents": docs_a, "top_k": 5},
    {"query": "what is fine-tuning?", "documents": docs_b, "top_k": 5},
])

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# Batch OCR — list of any of the above
texts = client.ocr(["./a.png", img_bytes, b64_str])

# Search — single query (Google web by default)
hits = client.search("ComputeGateway gpu host", top_n=5)

# Search — many queries in parallel (I/O-bound; raise max_parallel)
all_hits = client.search(["q1", "q2", "q3"], provider="bing", mode="news", max_parallel=8)

# Fetch — render a URL to markdown (or many in parallel)
page  = client.fetch("https://example.com")
pages = client.fetch(["https://a.com", "https://b.com"], max_parallel=8)

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

# LLM — fan-out batch (each item is a full request, partial-success aware)
results = client.llm_batch([
    {"messages": [{"role": "user", "content": "translate: hi"}],    "model": "gpt-4o-mini"},
    {"messages": [{"role": "user", "content": "translate: hello"}], "model": "gpt-4o-mini"},
])

client.close()

The client is also a context manager:

with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)

Single vs batch — the same method handles both

Every inference method accepts either a scalar or a sequence:

Method            Scalar input → returns       List input → returns
embed(texts)      list[float] (one vector)     list[list[float]], aligned to input
entities(text)    list[entity-dict]            list[list[entity-dict]]
translate(text)   str                          list[str]
ocr(images)       dict (one OCR result)        list[dict]
search(query)     dict                         list[dict]
fetch(url)        dict                         list[dict]

When you pass a list, the client:

  1. Auto-chunks into requests of OPTIMAL_BATCH_SIZE[endpoint] items.
  2. Calls the dedicated /v1/<endpoint>/batch alias — guaranteed batch envelope, indexed per-item results.
  3. Reassembles results in input order before returning (see the snippet after this list).
  4. Raises GPUHostHTTPError on the first failed item, unless you opt in to error inclusion on the batch call (e.g. via client.llm_batch(...)).
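
For a quick sanity check of the ordering guarantee mentioned in step 3 (the input strings below are placeholders):

# One result per input item, returned in the same order as the inputs.
texts = ["first sentence", "second sentence", "third sentence"]
vecs = client.embed(texts)
assert len(vecs) == len(texts)   # aligned one-to-one with the input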

Recommended batch sizes (sweet spots)

Defaults are sourced from the production T4 sweep at rev ca-nas-prd-wus3--0000078. They are the balanced point — lowest p95 within ~90 % of peak items/s — safe for interactive callers.

Endpoint / model          Default batch_size   Speed-up over bs=1
embed-baseline (Qwen3)    128                  23.9 ×
embed-mpnet-legacy        512                  22.3 ×
entity-gliner             128                  8.6 ×
rerank (bge-v2-m3)        32                   5.6 ×
translate (any pair)      256                  ≈34 ×
ocr-paddle                8                    1.4 ×

Override per call when steady-state throughput matters more than tail latency:

vecs = client.embed(big_list, batch_size=256)        # peak embed throughput
out  = client.translate(many_sents, src="zh",
                        batch_size=512, max_parallel=4)

The constants are exposed for inspection / overrides:

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

SERVER_BATCH_CAP reflects the gateway's hard-rejection thresholds (T-037/T-038). Anything above the cap is rejected with HTTP 400 — the client clamps user-supplied batch_size to the cap as a safety net, but the gateway is the source of truth.
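
For example, you can peek at both tables before tuning a call (the exact contents vary by server revision, so treat this as inspection rather than a contract):

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

print(OPTIMAL_BATCH_SIZE)   # per-endpoint defaults used for auto-chunking
print(SERVER_BATCH_CAP)     # gateway hard limits; user batch_size is clamped to these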

Errors

All errors derive from GPUHostError:

  • GPUHostHTTPError — non-2xx response. Carries status_code, code (e.g. BadRequest, ModelLoadRejected), message, retryable, retry_after_ms, request_id.
  • GPUHostQuotaError — HTTP 429 specifically (subclass of HTTP error).
  • GPUHostTimeoutError — transport-level timeout.

from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
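
A minimal retry sketch built on the documented retryable and retry_after_ms fields (the backoff policy here is illustrative, not part of the client):

import time

from gpuhost_client import GPUHostHTTPError

def embed_with_retry(client, texts, attempts=3):
    for attempt in range(attempts):
        try:
            return client.embed(texts)
        except GPUHostHTTPError as e:
            if not e.retryable or attempt == attempts - 1:
                raise
            time.sleep((e.retry_after_ms or 1000) / 1000)   # fall back to 1 s if no hint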

OCR input flexibility

ocr(...) accepts:

  • raw bytes (bytes/bytearray),
  • a filesystem path (str or pathlib.Path),
  • an already-encoded base64 string.

MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list to OCR many images in one call — the client uses the /v1/ocr/batch alias and reassembles results.
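
For instance, all of the following are equivalent ways to OCR the same image (the path is a placeholder):

from pathlib import Path

path = Path("./scan.png")
text_a = client.ocr(path)                 # pathlib.Path
text_b = client.ocr(str(path))            # string path
text_c = client.ocr(path.read_bytes())    # raw bytes; MIME type sniffed from magic bytes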

LLM streaming

llm_stream(...) yields one chunk dict per SSE data: line. The iterator exits when the gateway emits data: [DONE]. Network/timeout errors propagate as exceptions, not as in-stream events.
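
In practice that means failures are handled by wrapping the loop, not by inspecting chunks (a sketch; the chunk payload is whatever the gateway emits per data: line):

from gpuhost_client import GPUHostTimeoutError

try:
    for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
        print(chunk)                      # one dict per SSE `data:` line
except GPUHostTimeoutError:
    print("stream timed out mid-response")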

Async?

Out of scope for v0.1. The synchronous client is built on httpx.Client which uses connection pooling, so concurrent callers should construct one client per process/thread group rather than per request. An asyncio-based sibling is on the roadmap and will share the same method names.
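
A sketch of that pattern, assuming the client's methods can be called concurrently from multiple threads (httpx.Client itself is thread-safe; the host and API key below are placeholders):

from concurrent.futures import ThreadPoolExecutor

from gpuhost_client import GPUHostClient

with GPUHostClient(host="https://...", api_key="...") as client:
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_vecs = pool.submit(client.embed, ["a", "b", "c"])
        fut_page = pool.submit(client.fetch, "https://example.com")
        vecs, page = fut_vecs.result(), fut_page.result()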

Compatibility

Client   GPU Model Host API surface
0.1.x    rev ≥ 76, /v1/... (T-037 caps + T-038 rerank cap)

License

MIT. See LICENSE.
