Azure OpenAI client wrapper with rate limiting, cost tracking, and retry logic

Azure LLM Toolkit (v0.2.4)

A Python toolkit that wraps Azure OpenAI interactions with production-friendly features:

  • Rate limiting (RPM / TPM)
  • Cost estimation & pluggable cost tracking
  • Reasoning token tracking for o1/GPT-5 models
  • Retry logic and circuit-breaker patterns
  • Disk-based caching for embeddings & chat completions
  • Batch embedding (Polars-based high-performance embedder)
  • Enhanced logging with timeout and performance monitoring
  • Utilities: token counting, streaming, reranking helpers

This repository is packaged as azure-llm-toolkit (see pyproject.toml, version 0.2.4).


Key components (API surface)

Top-level imports you will typically use:

  • AzureConfig — configuration loader for environment / constructor-based config
  • AzureLLMClient — async client with:
    • embed_text(...) — embed a single text (async)
    • chat_completion(...) — chat completion (async)
    • chat_completion_stream(...) — streaming chat completions (async generator)
    • token counting helpers: count_tokens(...), count_message_tokens(...)
    • cost estimation helpers: estimate_embedding_cost(...), estimate_chat_cost(...)
  • AzureLLMClientSync — synchronous wrapper that runs the async client in an event loop
  • PolarsBatchEmbedder — high-performance batch embedder for large datasets (async)
  • CostEstimator, CostTracker, InMemoryCostTracker — cost estimation and tracking
  • RateLimiter, RateLimiterPool — rate limiting primitives
  • CacheManager, EmbeddingCache, ChatCache — disk-based caches for embeddings / chat responses
  • LogprobReranker, create_reranker — logprob-based reranker utilities
  • detect_embedding_dimension(config) — probe or read cached embedding dimensionality

(See the package azure_llm_toolkit.__init__ for the full exported list.)


Installation

Install from PyPI:

pip install azure-llm-toolkit

Or install editable from source:

git clone https://github.com/tsoernes/azure-llm-toolkit.git
cd azure-llm-toolkit
pip install -e .

Development extras:

pip install -e ".[dev]"

GPT-5 Model Support

The toolkit automatically handles parameter conversion for GPT-5 models:

  • Automatic max_tokens → max_completion_tokens conversion: GPT-5 models require max_completion_tokens instead of max_tokens. The client automatically converts this parameter and logs a warning.
  • Automatic temperature removal: GPT-5 models don't support the temperature parameter. The client automatically removes it and logs a warning.
  • Case-insensitive detection: Works with any model name containing "gpt-5" (e.g., "gpt-5", "GPT-5-mini", "gpt-5-turbo").
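
The conversion rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the toolkit's actual implementation: the function name normalize_chat_params and the logger name are hypothetical, and only the three behaviours listed above (rename, removal, case-insensitive match) are modelled.

```python
import logging

logger = logging.getLogger("param_normalizer")  # hypothetical logger name


def normalize_chat_params(model: str, params: dict) -> dict:
    """Return a copy of params adjusted for GPT-5-style models (illustrative only)."""
    adjusted = dict(params)
    # Case-insensitive substring match on the model name.
    if "gpt-5" not in model.lower():
        return adjusted
    if "max_tokens" in adjusted:
        # Rename rather than drop, so the caller's limit is preserved.
        adjusted["max_completion_tokens"] = adjusted.pop("max_tokens")
        logger.warning("max_tokens converted to max_completion_tokens for %s", model)
    if "temperature" in adjusted:
        adjusted.pop("temperature")
        logger.warning("temperature removed for %s", model)
    return adjusted


print(normalize_chat_params("GPT-5-mini", {"max_tokens": 100, "temperature": 0.7}))
print(normalize_chat_params("gpt-4o", {"max_tokens": 100, "temperature": 0.7}))
```

Returning a copy keeps the caller's dictionary untouched, so the same arguments can be reused across deployments.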

This means you can use the same code for both GPT-4 and GPT-5 models without modification:

from azure_llm_toolkit import AzureConfig, AzureLLMClient

client = AzureLLMClient(AzureConfig())

# Works with both GPT-4 and GPT-5 models
# For GPT-5, max_tokens is automatically converted to max_completion_tokens
# and temperature is automatically removed
# (await must run inside an async function, e.g. under asyncio.run)
result = await client.chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
    temperature=0.7,  # Automatically removed for GPT-5
)

Vision Model Support

The toolkit supports vision models with both URL-based and base64-encoded images. The snippets in this section use await directly, so run them inside an async function (for example under asyncio.run), as in the quick start examples.

Simple vision message:

from azure_llm_toolkit import AzureConfig, AzureLLMClient

client = AzureLLMClient(AzureConfig())

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }
]

result = await client.chat_completion(messages=messages, max_tokens=100)
print(result.content)

Multiple images:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these images:"},
            {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
        ],
    }
]

result = await client.chat_completion(messages=messages, max_tokens=150)

Base64-encoded images:

import base64

# Read the image and encode it for use in a data URL
with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
        ],
    }
]

result = await client.chat_completion(messages=messages, max_tokens=100)

Token counting with vision messages:

The toolkit automatically extracts text from vision messages for token estimation:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }
]

# Token counting works correctly with vision messages
token_count = client.count_message_tokens(messages)
print(f"Estimated tokens: {token_count}")
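
Extracting text from vision messages for token estimation can be pictured with the following sketch. It is an assumption about the general approach, not the toolkit's code: extract_text is a hypothetical helper, and image parts are simply skipped, so any tokens the service charges for images are not reflected here.

```python
from typing import Any, Dict, List


def extract_text(messages: List[Dict[str, Any]]) -> str:
    """Collect the text parts of chat messages, ignoring image parts (illustrative)."""
    parts: List[str] = []
    for message in messages:
        content = message.get("content", "")
        if isinstance(content, str):
            # Plain messages carry their text directly.
            parts.append(content)
            continue
        for item in content:
            # Vision messages use a list of typed parts; keep only the text ones.
            if item.get("type") == "text":
                parts.append(item.get("text", ""))
    return "\n".join(parts)


vision_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }
]
print(extract_text(vision_messages))
```

The resulting string can then be passed to any tokenizer to obtain a text-only estimate.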

Configuration

The library loads configuration from environment variables by default. Common variables:

  • AZURE_OPENAI_API_KEY (or OPENAI_API_KEY) — REQUIRED
  • AZURE_ENDPOINT (or AZURE_OPENAI_ENDPOINT) — REQUIRED (e.g. https://your-resource.openai.azure.com)
  • AZURE_API_VERSION — default: 2024-12-01-preview
  • AZURE_CHAT_DEPLOYMENT — default: gpt-5-mini
  • AZURE_RERANKER_DEPLOYMENT — default: gpt-4o-east-US
  • AZURE_EMBEDDING_DEPLOYMENT — default: text-embedding-3-large
  • AZURE_TIMEOUT_SECONDS — request timeout in seconds (default: None = infinite, recommended for reasoning models)
  • AZURE_MAX_RETRIES — default: 5
  • TOKENIZER_MODEL — model used by tiktoken for token counting (defaults to chat deployment)
  • FORCE_EMBED_DIM — optional integer to force embedding dim (useful in tests/offline)

You can also pass these values directly when constructing AzureConfig(...).
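
To make the fallback and default behaviour concrete, here is a small standalone sketch of environment-based loading. It is not AzureConfig itself: Settings, first_set, and load_settings are hypothetical names, only a subset of the variables is covered, and the order in which alternative variable names are tried is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:  # hypothetical stand-in, not the toolkit's AzureConfig
    api_key: str
    endpoint: str
    api_version: str
    chat_deployment: str
    timeout_seconds: Optional[float]
    max_retries: int


def first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    api_key = first_set(env, "AZURE_OPENAI_API_KEY", "OPENAI_API_KEY")
    endpoint = first_set(env, "AZURE_ENDPOINT", "AZURE_OPENAI_ENDPOINT")
    if not api_key or not endpoint:
        raise ValueError("API key and endpoint are required")
    timeout = env.get("AZURE_TIMEOUT_SECONDS")
    return Settings(
        api_key=api_key,
        endpoint=endpoint,
        api_version=env.get("AZURE_API_VERSION", "2024-12-01-preview"),
        chat_deployment=env.get("AZURE_CHAT_DEPLOYMENT", "gpt-5-mini"),
        # An unset timeout stays None, meaning no request timeout.
        timeout_seconds=float(timeout) if timeout else None,
        max_retries=int(env.get("AZURE_MAX_RETRIES", "5")),
    )


demo = load_settings(
    {"OPENAI_API_KEY": "placeholder", "AZURE_ENDPOINT": "https://your-resource.openai.azure.com"}
)
print(demo.chat_deployment, demo.max_retries, demo.timeout_seconds)
```

Passing a plain dictionary instead of os.environ, as in the demo, is convenient in tests because nothing leaks from the real environment.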


Quick start — async (basic)

Below are succinct examples showing common workflows.

Embed a single text (async):

import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()  # loads from env by default
    client = AzureLLMClient(config=config)

    emb = await client.embed_text("Hello, world!")
    print(f"Embedding length: {len(emb)}")
    print(f"First 8 dims: {emb[:8]}")

asyncio.run(main())

Chat completion (async):

import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config)

    messages = [{"role": "user", "content": "Explain supervised learning in simple terms."}]
    result = await client.chat_completion(messages=messages, system_prompt="You are a helpful assistant.")
    print("Response:")
    print(result.content)
    print("Usage (tokens):", result.usage.total_tokens)

asyncio.run(main())

Streaming chat completion:

import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def stream_example():
    client = AzureLLMClient(AzureConfig())
    async for chunk in client.chat_completion_stream(
        messages=[{"role":"user","content":"Tell me a short story about a robot."}],
        system_prompt="You are a creative storyteller."
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_example())

Quick start — batch embeddings (Polars)

When embedding large corpora, use PolarsBatchEmbedder which tokenizes in parallel, batches intelligently, and supports weighted averaging for splits.

The batch embedder uses a dual rate-limiting approach:

  • Built-in batching with sleep delays between batches (always active)
  • Optional integration with RateLimiter for coordinated throttling (set use_rate_limiting=True)
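
The batching-plus-delay idea can be sketched without any dependencies. This is an illustration under assumptions rather than the PolarsBatchEmbedder algorithm: batch_by_token_budget, pause_seconds, and the whitespace tokenizer are hypothetical, and the real embedder uses a proper tokenizer and its own limits.

```python
from typing import Callable, Iterator, List


def batch_by_token_budget(
    texts: List[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_batch: int,
    max_items_per_batch: int,
) -> Iterator[List[str]]:
    """Greedily group texts so each batch stays under both limits."""
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        n = count_tokens(text)
        full = batch and (
            batch_tokens + n > max_tokens_per_batch or len(batch) >= max_items_per_batch
        )
        if full:
            yield batch
            batch, batch_tokens = [], 0
        # A single oversized text still gets a batch of its own.
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch


def pause_seconds(batch_tokens: int, tokens_per_minute: int) -> float:
    """Delay that keeps average throughput at or below the per-minute budget."""
    return 60.0 * batch_tokens / tokens_per_minute


def whitespace_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())


docs = [f"Document {i}" for i in range(10)]
batches = list(
    batch_by_token_budget(docs, whitespace_tokens, max_tokens_per_batch=8, max_items_per_batch=3)
)
print([len(b) for b in batches])
```

Sleeping for pause_seconds after each batch approximates a tokens-per-minute cap without shared state; a shared limiter becomes useful when several clients draw on the same quota.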

Example (async):

import asyncio
import polars as pl
from azure_llm_toolkit import AzureConfig, PolarsBatchEmbedder

async def main():
    config = AzureConfig()
    embedder = PolarsBatchEmbedder(config=config, max_tokens_per_minute=450_000, max_lists_per_query=1024)

    df = pl.DataFrame({"id": list(range(1000)), "text": [f"Document {i}" for i in range(1000)]})
    result_df = await embedder.embed_dataframe(df, text_column="text", verbose=True)

    # result_df includes columns: text, text.token_count, text.embedding
    print("Embedded rows:", len(result_df))

asyncio.run(main())

For more examples including rate limiter integration, cost tracking, and handling large datasets, see examples/polars_batch_embedder_comprehensive.py.
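
Weighted averaging for splits, mentioned at the start of this section, usually means combining the embeddings of a long text's chunks in proportion to chunk size. The sketch below assumes token counts are used as the weights and returns the unnormalized mean; whether the toolkit renormalizes the result is not stated here, and weighted_average_embedding is a hypothetical name.

```python
from typing import List, Sequence


def weighted_average_embedding(
    chunk_embeddings: Sequence[Sequence[float]],
    chunk_token_counts: Sequence[int],
) -> List[float]:
    """Average chunk vectors, weighting each by its token count (illustrative)."""
    if len(chunk_embeddings) != len(chunk_token_counts) or not chunk_embeddings:
        raise ValueError("need one token count per chunk embedding")
    total = sum(chunk_token_counts)
    dim = len(chunk_embeddings[0])
    return [
        sum(vec[i] * weight for vec, weight in zip(chunk_embeddings, chunk_token_counts)) / total
        for i in range(dim)
    ]


# Two chunks of one document: the first is three times as long as the second.
combined = weighted_average_embedding([[1.0, 0.0], [0.0, 1.0]], [3, 1])
print(combined)
```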


Caching

If enabled, the client caches embeddings and chat completions on disk (content-based keys). Example usage:

import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config, enable_cache=True)

    # First call: hits the API and stores the result on disk
    emb1 = await client.embed_text("Cache demo text", use_cache=True)

    # Second call: same content, so it should be a cache hit
    emb2 = await client.embed_text("Cache demo text", use_cache=True)

asyncio.run(main())

You can access cache statistics via client.cache_manager.get_stats() when CacheManager is used.
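
Content-based keys mean the cache entry is derived from what is being requested rather than from when or where it was requested. The following sketch shows one way to do that with a hash; it is an assumption for illustration only. cache_key, TinyDiskCache, and fake_embed are hypothetical and do not reflect the on-disk format used by CacheManager.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def cache_key(model: str, text: str) -> str:
    """Derive a stable key from the request content."""
    payload = json.dumps({"model": model, "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyDiskCache:
    """One JSON file per key, with simple hit/miss counters (illustrative)."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, model: str, text: str, compute: Callable[[str], List[float]]
    ) -> List[float]:
        path = self.directory / f"{cache_key(model, text)}.json"
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text())
        self.misses += 1
        value = compute(text)
        path.write_text(json.dumps(value))
        return value


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding call.
    return [float(len(text)), 0.0]


cache = TinyDiskCache(Path(tempfile.mkdtemp()))
first = cache.get_or_compute("text-embedding-3-large", "Cache demo text", fake_embed)
second = cache.get_or_compute("text-embedding-3-large", "Cache demo text", fake_embed)
print(cache.hits, cache.misses)
```

Including the model name in the key prevents embeddings from one deployment being served for another.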


Rate limiting

By default, AzureLLMClient creates a RateLimiterPool to throttle requests. You can provide a custom pool:

from azure_llm_toolkit import AzureConfig, AzureLLMClient, RateLimiterPool

pool = RateLimiterPool(default_rpm=3000, default_tpm=300_000)
client = AzureLLMClient(config=AzureConfig(), rate_limiter_pool=pool, enable_rate_limiting=True)

The Polars embedder also respects token/list limits configured at construction.
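
For intuition about what RPM / TPM throttling involves, here is a small sliding-window sketch. It is not the toolkit's RateLimiter: SlidingWindowLimiter and its methods are hypothetical, and a 60-second trailing window is only one of several possible strategies (token buckets are another).

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowLimiter:
    """Tracks requests and tokens over the trailing 60 seconds (illustrative)."""

    def __init__(self, rpm: int, tpm: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Drop events that have left the 60-second window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def wait_time(self, tokens: int) -> float:
        """Seconds to wait before a request of this size should be retried."""
        now = self.clock()
        self._prune(now)
        if not self.events:
            # Nothing in the window; even an oversized request is let through.
            return 0.0
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) < self.rpm and used_tokens + tokens <= self.tpm:
            return 0.0
        # Wait until the oldest event expires, then the caller checks again.
        return 60.0 - (now - self.events[0][0])

    def record(self, tokens: int) -> None:
        self.events.append((self.clock(), tokens))


# A fake clock makes the behaviour easy to inspect without sleeping.
fake_now = [0.0]
limiter = SlidingWindowLimiter(rpm=2, tpm=100, clock=lambda: fake_now[0])
limiter.record(60)
fake_now[0] = 30.0
print(limiter.wait_time(60))
```

In an async client the returned value would typically be passed to asyncio.sleep before re-checking.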


Examples

The examples/ directory contains comprehensive runnable demonstrations:

  • basic_usage.py — Simple async client usage for embeddings and chat
  • sync_client_example.py — Synchronous wrapper examples
  • batch_embedding_example.py — Basic batch embedding patterns
  • polars_batch_embedder_comprehensive.py — High-performance batch embeddings with Polars
  • caching_example.py — Disk-based caching for embeddings and completions
  • function_calling_example.py — Function/tool calling with Azure OpenAI
  • reasoning_tokens_example.py — Tracking reasoning tokens for o1/GPT-5 models
  • reranker_demo_simple.py — Basic logprob-based reranking
  • reranker_example.py — Advanced reranking with scoring
  • reranker_rate_limiting_example.py — Reranking with rate limit coordination
  • prometheus_demo_simple.py — Basic Prometheus metrics export
  • prometheus_live_demo.py — Live metrics collection and export
  • prometheus_dashboard_example.py — Dashboard generation for monitoring
  • otel_jaeger_demo.py — OpenTelemetry tracing with Jaeger integration
  • check_batch_quota.py — Check Azure batch API quotas

Browse examples at: https://github.com/tsoernes/azure-llm-toolkit/tree/master/examples


Cost estimation & tracking

Use CostEstimator to estimate costs before making calls; use InMemoryCostTracker (or implement CostTracker) to record costs after calls.

Estimate cost for a chat:

from azure_llm_toolkit import AzureConfig, AzureLLMClient

config = AzureConfig()
client = AzureLLMClient(config=config)
est = client.estimate_chat_cost(messages=[{"role":"user","content":"Hello"}], estimated_output_tokens=200)
print("Estimated cost:", est)

Record costs automatically by passing a CostTracker to the client (example in docs and tests). InMemoryCostTracker can be used for quick local tracking.
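
The arithmetic behind estimation and tracking is simple enough to sketch. Everything below is hypothetical: Pricing, estimate_cost, and SimpleTracker are not the toolkit's classes, and the prices are made-up placeholders rather than real Azure rates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # currency units per 1M input tokens (placeholder)
    output_per_million: float  # currency units per 1M output tokens (placeholder)


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost = tokens x price-per-token, summed over input and output."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


@dataclass
class SimpleTracker:
    """Accumulates per-call costs in memory (illustrative)."""

    records: List[float] = field(default_factory=list)

    def record(self, cost: float) -> None:
        self.records.append(cost)

    @property
    def total(self) -> float:
        return sum(self.records)


placeholder = Pricing(input_per_million=1.0, output_per_million=4.0)  # made-up prices
estimate = estimate_cost(1_000, 200, placeholder)

tracker = SimpleTracker()
tracker.record(estimate)
tracker.record(estimate)
print(f"{estimate:.6f} {tracker.total:.6f}")
```

Estimating before a call uses a guessed output length, while tracking after a call can use the token counts reported in the response usage.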


Reranker (logprob-based)

The toolkit includes a logprob-based reranker that uses token log probabilities to produce calibrated relevance scores. Typical flow:

  • Retrieve candidate docs via vector DB
  • Use LogprobReranker / create_reranker to score documents
  • Optionally rerank and return top-K

Example (async):

import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient
from azure_llm_toolkit.reranker import create_reranker

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config)

    reranker = create_reranker(client=client, model="gpt-4o")
    results = await reranker.rerank("What is machine learning?", ["Doc A text", "Doc B text"], top_k=3)

    for r in results:
        print(r.score, r.document)

asyncio.run(main())

Note: the reranker requires a model that supports logprobs.
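
One common way to turn log probabilities into a relevance score is to ask the model a yes/no relevance question and compare the probabilities of the two answer tokens. The sketch below assumes that scheme; the token names, relevance_from_logprobs, and rank are hypothetical, and the toolkit's LogprobReranker may score documents differently.

```python
import math
from typing import Dict, List, Tuple


def relevance_from_logprobs(logprobs: Dict[str, float], yes: str = "yes", no: str = "no") -> float:
    """Normalize P(yes) against P(yes) + P(no) to get a score in [0, 1]."""
    p_yes = math.exp(logprobs.get(yes, float("-inf")))
    p_no = math.exp(logprobs.get(no, float("-inf")))
    total = p_yes + p_no
    # If neither token appears, fall back to an uninformative score.
    return p_yes / total if total > 0 else 0.5


def rank(scored_docs: List[Tuple[str, float]], top_k: int) -> List[Tuple[str, float]]:
    """Sort documents by score, highest first, and keep the top_k."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical logprobs for the first answer token, one dict per document.
doc_logprobs = {
    "Doc A text": {"yes": math.log(0.8), "no": math.log(0.2)},
    "Doc B text": {"yes": math.log(0.3), "no": math.log(0.6)},
}
scored = [(doc, relevance_from_logprobs(lp)) for doc, lp in doc_logprobs.items()]
print(rank(scored, top_k=1))
```

Normalizing over only the two answer tokens discards probability mass assigned to other tokens, which keeps scores comparable across documents.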


Synchronous usage (legacy code)

The AzureLLMClientSync provides blocking wrappers:

from azure_llm_toolkit import AzureConfig, AzureLLMClientSync

client = AzureLLMClientSync(config=AzureConfig())
embedding = client.embed_text("Hello sync world")
response = client.chat_completion(messages=[{"role":"user","content":"Hi"}])
print(response.content)

(Under the hood this runs the async client in an event loop or a background thread if already inside an event loop.)
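
The event-loop-or-background-thread behaviour can be illustrated with the standard library alone. This is a sketch of the general pattern, not the code in sync_client.py: run_sync, fake_embed_async, and caller_inside_loop are hypothetical names.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code (illustrative)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here; use a fresh loop in a helper thread.
    outcome: Dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # propagate any failure to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fake_embed_async(text: str) -> list:
    # Stand-in for an async API call.
    await asyncio.sleep(0)
    return [float(len(text))]


async def caller_inside_loop() -> list:
    # Simulates blocking code being invoked while a loop is already running.
    return run_sync(fake_embed_async("abc"))


print(run_sync(fake_embed_async("Hi")))
print(asyncio.run(caller_inside_loop()))
```

Note that the thread-based path blocks the calling event loop until the coroutine finishes, which is acceptable for legacy call sites but not for latency-sensitive async code.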


Utilities

  • detect_embedding_dimension(config) — probe the configured embedding deployment to detect vector dimensionality (with caching).
  • AzureConfig.count_tokens(...) and client helpers for token counting.
  • Streaming sinks, tools for function-calling integrations, health checks, metrics collector interfaces (Prometheus / OpenTelemetry helpers), and more — see src/azure_llm_toolkit/ for modules and docstrings.

Development & testing

Install dev dependencies:

pip install -e ".[dev]"

Run tests:

pytest -q

Type checking:

basedpyright src/
mypy src/

Formatting & linting:

ruff format .
ruff check .

Contributing

  1. Fork the repo
  2. Create a branch (git checkout -b feature/awesome)
  3. Add tests for new functionality
  4. Ensure tests and static checks pass
  5. Open a PR with a clear description

See CONTRIBUTING.md for more details.


License

MIT — see the LICENSE file.


Where to look next (code entry points)

  • src/azure_llm_toolkit/client.py — async client implementation and chat/embedding primitives
  • src/azure_llm_toolkit/config.py — configuration and tokenization helpers
  • src/azure_llm_toolkit/batch_embedder.py — PolarsBatchEmbedder implementation
  • src/azure_llm_toolkit/sync_client.py — synchronous wrapper
  • src/azure_llm_toolkit/reranker.py — reranking utilities
  • src/azure_llm_toolkit/cache.py — caching primitives

If you need curated examples, the examples/ directory contains runnable demos for caching, batching, reranking, and Prometheus / dashboard integrations.

