Python client for the ComputeGateway GPU Model Host (embeddings, entities, rerank, translate, OCR, LLM, search, fetch).

Project description

gpuhost-client

Python client for the ComputeGateway GPU Model Host — embeddings, entity extraction, reranking, translation, OCR, LLM completion, search and fetch — all exposed via a single typed client with production-tuned default batch sizes baked in.

pip install gpuhost-client

📘 Full user guide: docs/USER_GUIDE.md — every endpoint, batching strategy, error handling, retries, recipes, and FAQ.

Quick start

from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# Rerank — multiple queries in one round-trip (each with its own document set)
batched = client.rerank_batch([
    {"query": "what is RAG?",         "documents": docs_a, "top_k": 5},
    {"query": "what is fine-tuning?", "documents": docs_b, "top_k": 5},
])

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# Batch OCR — list of any of the above
texts = client.ocr(["./a.png", img_bytes, b64_str])

# Search — single query (Google web by default)
hits = client.search("ComputeGateway gpu host", top_n=5)

# Search — many queries in parallel (I/O-bound; raise max_parallel)
all_hits = client.search(["q1", "q2", "q3"], provider="bing", mode="news", max_parallel=8)

# Fetch — render a URL to markdown (or many in parallel)
page  = client.fetch("https://example.com")
pages = client.fetch(["https://a.com", "https://b.com"], max_parallel=8)

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

# LLM — fan-out batch (each item is a full request, partial-success aware)
results = client.llm_batch([
    {"messages": [{"role": "user", "content": "translate: hi"}],    "model": "gpt-4o-mini"},
    {"messages": [{"role": "user", "content": "translate: hello"}], "model": "gpt-4o-mini"},
])

client.close()

The client is also a context manager:

with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)

Single vs batch — the same method handles both

Every inference method accepts either a scalar or a sequence:

Method            Scalar input → returns       List input → returns
embed(texts)      list[float] (one vector)     list[list[float]], aligned to input
entities(text)    list[entity-dict]            list[list[entity-dict]]
translate(text)   str                          list[str]
ocr(images)       dict (one OCR result)        list[dict]
search(query)     dict                         list[dict]
fetch(url)        dict                         list[dict]

When you pass a list, the client:

  1. Auto-chunks into requests of OPTIMAL_BATCH_SIZE[endpoint] items.
  2. Calls the dedicated /v1/<endpoint>/batch alias — guaranteed batch envelope, indexed per-item results.
  3. Reassembles results in input order before returning (see the snippet after this list).
  4. Raises GPUHostHTTPError on the first failed item, unless you opt in to error inclusion on the batch call (e.g. via client.llm_batch(...)).
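
For a quick sanity check of the ordering guarantee mentioned in step 3 (the input strings below are placeholders):

# One result per input item, returned in the same order as the inputs.
texts = ["first sentence", "second sentence", "third sentence"]
vecs = client.embed(texts)
assert len(vecs) == len(texts)   # aligned one-to-one with the input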

Recommended batch sizes (sweet spots)

Defaults are sourced from the production T4 sweep at rev ca-nas-prd-wus3--0000078. They are the balanced point — lowest p95 within ~90 % of peak items/s — safe for interactive callers.

Endpoint / model          Default batch_size   Speed-up over bs=1
embed-baseline (Qwen3)    128                  23.9 ×
embed-mpnet-legacy        512                  22.3 ×
entity-gliner             128                  8.6 ×
rerank (bge-v2-m3)        32                   5.6 ×
translate (any pair)      256                  ≈34 ×
ocr-paddle                8                    1.4 ×

Override per call when steady-state throughput matters more than tail latency:

vecs = client.embed(big_list, batch_size=256)        # peak embed throughput
out  = client.translate(many_sents, src="zh",
                        batch_size=512, max_parallel=4)

The constants are exposed for inspection / overrides:

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

SERVER_BATCH_CAP reflects the gateway's hard-rejection thresholds (T-037/T-038). Anything above the cap is rejected with HTTP 400 — the client clamps user-supplied batch_size to the cap as a safety net, but the gateway is the source of truth.
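
For example, you can peek at both tables before tuning a call (the exact contents vary by server revision, so treat this as inspection rather than a contract):

from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

print(OPTIMAL_BATCH_SIZE)   # per-endpoint defaults used for auto-chunking
print(SERVER_BATCH_CAP)     # gateway hard limits; user batch_size is clamped to these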

Errors

All errors derive from GPUHostError:

  • GPUHostHTTPError — non-2xx response. Carries status_code, code (e.g. BadRequest, ModelLoadRejected), message, retryable, retry_after_ms, request_id.
  • GPUHostQuotaError — HTTP 429 specifically (subclass of HTTP error).
  • GPUHostTimeoutError — transport-level timeout.

from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
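
A minimal retry sketch built on the documented retryable and retry_after_ms fields (the backoff policy here is illustrative, not part of the client):

import time

from gpuhost_client import GPUHostHTTPError

def embed_with_retry(client, texts, attempts=3):
    for attempt in range(attempts):
        try:
            return client.embed(texts)
        except GPUHostHTTPError as e:
            if not e.retryable or attempt == attempts - 1:
                raise
            time.sleep((e.retry_after_ms or 1000) / 1000)   # fall back to 1 s if no hint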

OCR input flexibility

ocr(...) accepts:

  • raw bytes (bytes/bytearray),
  • a filesystem path (str or pathlib.Path),
  • an already-encoded base64 string.

MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list to OCR many images in one call — the client uses the /v1/ocr/batch alias and reassembles results.
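
For instance, all of the following are equivalent ways to OCR the same image (the path is a placeholder):

from pathlib import Path

path = Path("./scan.png")
text_a = client.ocr(path)                 # pathlib.Path
text_b = client.ocr(str(path))            # string path
text_c = client.ocr(path.read_bytes())    # raw bytes; MIME type sniffed from magic bytes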

LLM streaming

llm_stream(...) yields one chunk dict per SSE data: line. The iterator exits when the gateway emits data: [DONE]. Network/timeout errors propagate as exceptions, not as in-stream events.
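
In practice that means failures are handled by wrapping the loop, not by inspecting chunks (a sketch; the chunk payload is whatever the gateway emits per data: line):

from gpuhost_client import GPUHostTimeoutError

try:
    for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
        print(chunk)                      # one dict per SSE `data:` line
except GPUHostTimeoutError:
    print("stream timed out mid-response")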

Async?

Out of scope for v0.1. The synchronous client is built on httpx.Client which uses connection pooling, so concurrent callers should construct one client per process/thread group rather than per request. An asyncio-based sibling is on the roadmap and will share the same method names.
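
A sketch of that pattern, assuming the client's methods can be called concurrently from multiple threads (httpx.Client itself is thread-safe; the host and API key below are placeholders):

from concurrent.futures import ThreadPoolExecutor

from gpuhost_client import GPUHostClient

with GPUHostClient(host="https://...", api_key="...") as client:
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_vecs = pool.submit(client.embed, ["a", "b", "c"])
        fut_page = pool.submit(client.fetch, "https://example.com")
        vecs, page = fut_vecs.result(), fut_page.result()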

Compatibility

Client   GPU Model Host API surface
0.1.x    rev ≥ 76, /v1/... (T-037 caps + T-038 rerank cap)

License

MIT. See LICENSE.
