Azure OpenAI client wrapper with rate limiting, cost tracking, and retry logic
Azure LLM Toolkit (v0.1.5)
A Python toolkit that wraps Azure OpenAI interactions with production-friendly features:
- Rate limiting (RPM / TPM)
- Cost estimation & pluggable cost tracking
- Retry logic and circuit-breaker patterns
- Disk-based caching for embeddings & chat completions
- Batch embedding (Polars-based high-performance embedder)
- Utilities: token counting, streaming, reranking helpers
This repository is packaged as azure-llm-toolkit (see pyproject.toml, version 0.1.5).
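To make the retry item in the feature list concrete, here is a minimal sketch of retrying with exponential backoff and jitter. It is not the toolkit's implementation: `call_with_retries`, `base_delay`, and the injectable `sleep` parameter are hypothetical names chosen for illustration, and the sketch retries on any exception, whereas production code would normally retry only on transient errors such as rate-limit or timeout responses.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying up to max_retries times after a failure.

    Hypothetical helper for illustration; not part of azure-llm-toolkit.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == max_retries:
                raise
            # Exponential backoff with full jitter: wait a random time
            # between 0 and base_delay * 2**attempt seconds.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the delays instead of actually waiting.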
Key components (API surface)
Top-level imports you will typically use:
- AzureConfig: configuration loader for environment / constructor-based config
- AzureLLMClient: async client that provides
  - embed_text(...): embed a single text (async)
  - chat_completion(...): chat completion (async)
  - chat_completion_stream(...): streaming chat completions (async generator)
  - token counting helpers: count_tokens(...), count_message_tokens(...)
  - cost estimation helpers: estimate_embedding_cost(...), estimate_chat_cost(...)
- AzureLLMClientSync: synchronous wrapper that runs the async client in an event loop
- PolarsBatchEmbedder: high-performance batch embedder for large datasets (async)
- CostEstimator, CostTracker, InMemoryCostTracker: cost estimation and tracking
- RateLimiter, RateLimiterPool: rate limiting primitives
- CacheManager, EmbeddingCache, ChatCache: disk-based caches for embeddings / chat responses
- LogprobReranker, create_reranker: logprob-based reranker utilities
- detect_embedding_dimension(config): probe or read cached embedding dimensionality
(See the package azure_llm_toolkit.__init__ for the full exported list.)
Installation
Install from PyPI:
pip install azure-llm-toolkit
Or install editable from source:
git clone https://github.com/torsteinsornes/azure-llm-toolkit.git
cd azure-llm-toolkit
pip install -e .
Development extras:
pip install -e ".[dev]"
Configuration
The library loads configuration from environment variables by default. Common variables:
- AZURE_OPENAI_API_KEY (or OPENAI_API_KEY): REQUIRED
- AZURE_ENDPOINT (or AZURE_OPENAI_ENDPOINT): REQUIRED (e.g. https://your-resource.openai.azure.com)
- AZURE_API_VERSION: default 2024-12-01-preview
- AZURE_CHAT_DEPLOYMENT: default gpt-5-mini
- AZURE_RERANKER_DEPLOYMENT: default gpt-4o-east-US
- AZURE_EMBEDDING_DEPLOYMENT: default text-embedding-3-large
- AZURE_TIMEOUT_SECONDS: request timeout in seconds (default None = infinite, recommended for reasoning models)
- AZURE_MAX_RETRIES: default 5
- TOKENIZER_MODEL: model used by tiktoken for token counting (defaults to chat deployment)
- FORCE_EMBED_DIM: optional integer to force embedding dim (useful in tests/offline)
You can also pass these values directly when constructing AzureConfig(...).
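The lookup order described above (primary variable name, then its alias, then a default) can be sketched as follows. This is only an illustration of the documented behavior, not the actual AzureConfig code: `load_settings` and the keys of the returned dictionary are hypothetical, and only a subset of the variables is handled.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Illustrative loader; a stand-in for what AzureConfig might do."""
    env = os.environ if env is None else env

    # Required values: try the primary name first, then the documented alias.
    api_key = env.get("AZURE_OPENAI_API_KEY") or env.get("OPENAI_API_KEY")
    endpoint = env.get("AZURE_ENDPOINT") or env.get("AZURE_OPENAI_ENDPOINT")
    if not api_key or not endpoint:
        raise ValueError("API key and endpoint are required")

    timeout_raw = env.get("AZURE_TIMEOUT_SECONDS")
    return {
        "api_key": api_key,
        "endpoint": endpoint,
        "api_version": env.get("AZURE_API_VERSION", "2024-12-01-preview"),
        "max_retries": int(env.get("AZURE_MAX_RETRIES", "5")),
        # None means "no timeout", matching the documented default.
        "timeout_seconds": float(timeout_raw) if timeout_raw else None,
    }
```

Accepting a mapping instead of reading `os.environ` directly makes the function easy to exercise in tests without touching the real environment.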
Quick start — async (basic)
Below are succinct examples showing common workflows.
Embed a single text (async):
import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()  # loads from env by default
    client = AzureLLMClient(config=config)
    emb = await client.embed_text("Hello, world!")
    print(f"Embedding length: {len(emb)}")
    print(f"First 8 dims: {emb[:8]}")

asyncio.run(main())
Chat completion (async):
import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config)
    messages = [{"role": "user", "content": "Explain supervised learning in simple terms."}]
    result = await client.chat_completion(messages=messages, system_prompt="You are a helpful assistant.")
    print("Response:")
    print(result.content)
    print("Usage (tokens):", result.usage.total_tokens)

asyncio.run(main())
Streaming chat completion:
import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def stream_example():
    client = AzureLLMClient(AzureConfig())
    async for chunk in client.chat_completion_stream(
        messages=[{"role": "user", "content": "Tell me a short story about a robot."}],
        system_prompt="You are a creative storyteller.",
    ):
        print(chunk, end="", flush=True)

asyncio.run(stream_example())
Quick start — batch embeddings (Polars)
When embedding large corpora, use PolarsBatchEmbedder which tokenizes in parallel, batches intelligently, and supports weighted averaging for splits.
The batch embedder uses a dual rate-limiting approach:
- Built-in batching with sleep delays between batches (always active)
- Optional integration with RateLimiter for coordinated throttling (set use_rate_limiting=True)
Example (async):
import asyncio
import polars as pl
from azure_llm_toolkit import AzureConfig, PolarsBatchEmbedder

async def main():
    config = AzureConfig()
    embedder = PolarsBatchEmbedder(config=config, max_tokens_per_minute=450_000, max_lists_per_query=1024)
    df = pl.DataFrame({"id": list(range(1000)), "text": [f"Document {i}" for i in range(1000)]})
    result_df = await embedder.embed_dataframe(df, text_column="text", verbose=True)
    # result_df includes columns: text, text.token_count, text.embedding
    print("Embedded rows:", len(result_df))

asyncio.run(main())
For more examples including rate limiter integration, cost tracking, and handling large datasets, see examples/polars_batch_embedder_comprehensive.py.
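The "weighted averaging for splits" mentioned at the start of this section can be pictured with a small standalone sketch. It assumes that a text too long for one request is split into chunks, each chunk is embedded separately, and the chunk vectors are combined with weights proportional to their token counts. Both that weighting scheme and the name `weighted_average_embedding` are assumptions made for illustration; the toolkit's actual strategy (including whether it re-normalizes the result) may differ.

```python
from typing import List, Sequence


def weighted_average_embedding(
    chunk_embeddings: Sequence[Sequence[float]],
    chunk_token_counts: Sequence[int],
) -> List[float]:
    """Combine per-chunk embeddings into one vector, weighted by token count.

    Hypothetical helper; not taken from PolarsBatchEmbedder.
    """
    if not chunk_embeddings or len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("need one token count per chunk embedding")
    total_tokens = sum(chunk_token_counts)
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")

    combined = [0.0] * len(chunk_embeddings[0])
    for vector, tokens in zip(chunk_embeddings, chunk_token_counts):
        weight = tokens / total_tokens  # longer chunks contribute more
        for i, value in enumerate(vector):
            combined[i] += weight * value
    return combined
```

For example, two chunks with 3 tokens and 1 token contribute with weights 0.75 and 0.25 respectively.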
Caching
If enabled, the client caches embeddings and chat completions on disk (content-based keys). Example usage:
import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config, enable_cache=True)
    # First call: hits the API
    emb1 = await client.embed_text("Cache demo text", use_cache=True)
    # Second call: should be a cache hit
    emb2 = await client.embed_text("Cache demo text", use_cache=True)

asyncio.run(main())
You can access cache statistics via client.cache_manager.get_stats() when CacheManager is used.
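To show what "content-based keys" can mean in practice, the sketch below hashes the model name together with the input text and stores each embedding as a JSON file named after that hash. It is a simplified stand-in, not the toolkit's CacheManager: `cache_key`, `TinyDiskCache`, and the one-file-per-entry layout are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def cache_key(model: str, text: str) -> str:
    """Derive a stable key from the request content (hypothetical scheme)."""
    payload = json.dumps({"model": model, "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TinyDiskCache:
    """Minimal disk cache: one JSON file per key, plus hit/miss counters."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path(self, model: str, text: str) -> Path:
        return self.directory / f"{cache_key(model, text)}.json"

    def get(self, model: str, text: str) -> Optional[List[float]]:
        path = self._path(model, text)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        return None

    def set(self, model: str, text: str, embedding: List[float]) -> None:
        self._path(model, text).write_text(json.dumps(embedding), encoding="utf-8")


# Example: a cache rooted in a temporary directory.
cache = TinyDiskCache(Path(tempfile.mkdtemp()))
```

Because the key depends on both the model and the text, the same text embedded with a different deployment does not collide with an existing entry.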
Rate limiting
By default, AzureLLMClient creates a RateLimiterPool to throttle requests. You can provide a custom pool:
from azure_llm_toolkit import AzureConfig, AzureLLMClient, RateLimiterPool
pool = RateLimiterPool(default_rpm=3000, default_tpm=300_000)
client = AzureLLMClient(config=AzureConfig(), rate_limiter_pool=pool, enable_rate_limiting=True)
The Polars embedder also respects token/list limits configured at construction.
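As a rough picture of what an RPM / TPM limit enforces, the following sketch tracks requests and tokens over a sliding 60-second window and rejects a request that would exceed either budget. This is not how RateLimiter or RateLimiterPool are implemented; `SlidingWindowLimiter`, `try_acquire`, and the injectable `clock` are hypothetical, and a real async limiter would typically wait rather than return False.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowLimiter:
    """Tracks requests and tokens over the last 60 seconds (illustrative only)."""

    def __init__(
        self,
        rpm: int,
        tpm: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Drop events that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60.0:
            self.events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if both budgets allow it."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

A caller would estimate the token count of a request first (for example with the token counting helpers) and only send the request once `try_acquire` succeeds.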
Cost estimation & tracking
Use CostEstimator to estimate costs before making calls; use InMemoryCostTracker (or implement CostTracker) to record costs after calls.
Estimate cost for a chat:
from azure_llm_toolkit import AzureConfig, AzureLLMClient, CostEstimator
config = AzureConfig()
client = AzureLLMClient(config=config)
est = client.estimate_chat_cost(messages=[{"role":"user","content":"Hello"}], estimated_output_tokens=200)
print("Estimated cost:", est)
Record costs automatically by passing a CostTracker to the client (example in docs and tests). InMemoryCostTracker can be used for quick local tracking.
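The arithmetic behind a chat cost estimate is simple: multiply input and output token counts by their per-token prices and add the results. The sketch below shows that calculation in isolation. The `Pricing` class, the function name `estimate_chat_cost_usd`, and especially the prices are placeholders for illustration; they are not the toolkit's API and not real Azure prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in USD (placeholder structure)."""

    input_per_million: float
    output_per_million: float


def estimate_chat_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the estimated cost in USD for one chat call."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Made-up prices for demonstration only; look up current prices for your deployment.
EXAMPLE_PRICING = Pricing(input_per_million=1.0, output_per_million=4.0)
```

Output tokens are usually priced higher than input tokens, which is why an estimate needs a guess for the output length (the `estimated_output_tokens` argument in the example above).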
Reranker (logprob-based)
The toolkit includes a logprob-based reranker that uses token log probabilities to produce calibrated relevance scores. Typical flow:
- Retrieve candidate docs via vector DB
- Use LogprobReranker / create_reranker to score documents
- Optionally rerank and return top-K
Example (async):
import asyncio
from azure_llm_toolkit import AzureConfig, AzureLLMClient
from azure_llm_toolkit.reranker import create_reranker

async def main():
    config = AzureConfig()
    client = AzureLLMClient(config=config)
    reranker = create_reranker(client=client, model="gpt-4o")
    results = await reranker.rerank("What is machine learning?", ["Doc A text", "Doc B text"], top_k=3)
    for r in results:
        print(r.score, r.document)

asyncio.run(main())
Note: the reranker requires a model that supports logprobs.
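One common way to turn token log probabilities into a relevance score is to ask the model a yes/no question ("Is this document relevant to the query?") and compare the probability mass on the two answers. The sketch below shows only that final conversion step. It is an assumption about how such a reranker could work, not a description of LogprobReranker; `relevance_from_logprobs` and the "yes" / "no" token names are hypothetical.

```python
import math
from typing import Dict


def relevance_from_logprobs(top_logprobs: Dict[str, float]) -> float:
    """Return P(yes) / (P(yes) + P(no)) from token log probabilities.

    top_logprobs maps candidate answer tokens to their log probabilities.
    """
    # exp() converts a log probability back to a probability.
    p_yes = math.exp(top_logprobs["yes"]) if "yes" in top_logprobs else 0.0
    p_no = math.exp(top_logprobs["no"]) if "no" in top_logprobs else 0.0
    total = p_yes + p_no
    if total == 0.0:
        return 0.5  # neither token was reported: no signal either way
    return p_yes / total
```

Normalizing over just the two answer tokens gives a score between 0 and 1 that can be sorted directly to pick the top-K documents.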
Synchronous usage (legacy code)
The AzureLLMClientSync provides blocking wrappers:
from azure_llm_toolkit import AzureConfig, AzureLLMClientSync
client = AzureLLMClientSync(config=AzureConfig())
embedding = client.embed_text("Hello sync world")
response = client.chat_completion(messages=[{"role":"user","content":"Hi"}])
print(response.content)
(Under the hood this runs the async client in an event loop or a background thread if already inside an event loop.)
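The event-loop-or-background-thread behavior described above can be sketched with the standard library alone. This is a generic pattern, not the code in sync_client.py; `run_sync` is a hypothetical helper name.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running in this thread. Run the coroutine on a
    # fresh loop in a helper thread and block until it finishes.
    outcome: dict = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Note that the second path blocks the calling thread, and therefore the already-running loop, until the coroutine completes, so it suits legacy call sites better than latency-sensitive async code.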
Utilities
- detect_embedding_dimension(config): probe the configured embedding deployment to detect vector dimensionality (with caching).
- AzureConfig.count_tokens(...) and client helpers for token counting.
- Streaming sinks, tools for function-calling integrations, health checks, metrics collector interfaces (Prometheus / OpenTelemetry helpers), and more; see src/azure_llm_toolkit/ for modules and docstrings.
Development & testing
Install dev dependencies:
pip install -e ".[dev]"
Run tests:
pytest -q
Type checking:
basedpyright src/
mypy src/
Formatting & linting:
ruff format .
ruff check .
Contributing
- Fork the repo
- Create a branch (git checkout -b feature/awesome)
- Add tests for new functionality
- Ensure tests and static checks pass
- Open a PR with a clear description
See CONTRIBUTING.md for more details.
License
MIT — see the LICENSE file.
Where to look next (code entry points)
- src/azure_llm_toolkit/client.py: async client implementation and chat/embedding primitives
- src/azure_llm_toolkit/config.py: configuration and tokenization helpers
- src/azure_llm_toolkit/batch_embedder.py: PolarsBatchEmbedder implementation
- src/azure_llm_toolkit/sync_client.py: synchronous wrapper
- src/azure_llm_toolkit/reranker.py: reranking utilities
- src/azure_llm_toolkit/cache.py: caching primitives
If you need curated examples, the examples/ directory contains runnable demos for caching, batching, reranking, and Prometheus / dashboard integrations.