
A client for interacting with LLM completion APIs and tracking usage.


llm-api-client 🤖⚡


A Python helper library for efficiently managing concurrent, rate-limited API requests to LLM providers via LiteLLM.

It provides an APIClient that handles:

  • Concurrency: Making multiple API calls simultaneously using threads.
  • Rate Limiting: Respecting API limits for requests per minute (RPM) and tokens per minute (TPM).
  • Retries: Automatically retrying failed requests.
  • Request Sanitization: Cleaning up request parameters to ensure compatibility with different models/providers.
  • LLM Context Management: Truncating message history to fit within model context windows.
  • Usage Tracking: Monitoring API costs, token counts, and response times via an integrated APIUsageTracker.
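To make the rate-limiting idea concrete, here is a minimal sketch of a thread-safe sliding-window limiter for requests per minute. It is not the library's implementation; the class name SlidingWindowLimiter and its clock and sleep parameters are hypothetical, chosen so the behavior can be checked without waiting a real minute. A TPM limit could be handled the same way by recording token counts instead of single requests.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Illustrative limiter: at most `max_per_minute` acquisitions per 60 s window."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.max_per_minute = max_per_minute
        self._clock = clock      # injectable so tests can use a fake clock
        self._sleep = sleep
        self._stamps = deque()   # timestamps of requests inside the window
        self._lock = threading.Lock()

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop timestamps that have left the 60-second window.
                while self._stamps and now - self._stamps[0] >= 60.0:
                    self._stamps.popleft()
                if len(self._stamps) < self.max_per_minute:
                    self._stamps.append(now)
                    return
                # Time until the oldest recorded request leaves the window.
                wait = 60.0 - (now - self._stamps[0])
            # Sleep outside the lock so other threads are not blocked.
            self._sleep(wait)
```

Each worker thread would call acquire() before sending its request, so the pool as a whole stays under the configured rate.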

Full documentation, including the API reference and more examples: https://andrefcruz.github.io/llm-api-client/

Installation

Install the package directly from PyPI:

pip install llm-api-client

Quick start

Single request

from llm_api_client import APIClient

client = APIClient()  # Defaults approximate OpenAI Tier 4 limits

responses = client.make_requests([
    {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }
])

print(responses[0].choices[0].message.content)

With retries

responses = client.make_requests_with_retries([
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi!"}]}
], max_retries=2)

Disable request sanitization (advanced)

responses = client.make_requests([
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi!"}]}
], sanitize=False)

Control concurrency and rate limits

client = APIClient(max_requests_per_minute=600, max_tokens_per_minute=100_000, max_workers=50)

Usage

The primary way to interact with the APIClient is through its make_requests and make_requests_with_retries methods. Both handle concurrent execution and rate limiting; make_requests_with_retries additionally retries failed requests.

Here's a basic example of using APIClient to make multiple completion requests concurrently:

import os
from llm_api_client import APIClient

# Ensure your API key is set (e.g., OPENAI_API_KEY environment variable)
# os.environ["OPENAI_API_KEY"] = "your-api-key"

# Create a client with specific rate limits (adjust as needed)
# Defaults approximate OpenAI Tier 4 limits if not specified.
client = APIClient(
    max_requests_per_minute=1000,
    max_tokens_per_minute=100000
)

# Prepare your API requests
prompts = [
    "Explain the theory of relativity in simple terms.",
    "Write a short poem about a cat.",
    "What is the capital of France?",
]

requests_data = [
    {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        # Add other parameters like temperature, max_tokens etc. if needed
        # "temperature": 0.7,
        # "max_tokens": 150,
    }
    for prompt in prompts
]

# Make the requests concurrently
# Use make_requests_with_retries for built-in retry logic
responses = client.make_requests(requests_data)

# Process the responses
for i, response in enumerate(responses):
    if response:
        # Access response content (structure depends on the API/model)
        # For OpenAI/LiteLLM completion:
        try:
            message_content = response.choices[0].message.content
            print(f"Response {i+1}: {message_content[:100]}...") # Print first 100 chars
        except (AttributeError, IndexError, TypeError) as e:
            print(f"Response {i+1}: Could not parse response content. Error: {e}")
            print(f"Raw response: {response}")
    else:
        print(f"Response {i+1}: Request failed.")
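The request-building and response-parsing steps above can be factored into two small helpers. This is a sketch only: build_requests and extract_text are hypothetical names, not part of llm-api-client, and the stand-in response object simply mimics the choices[0].message.content shape used in the example so the code runs without an API key.

```python
from types import SimpleNamespace


def build_requests(prompts, model="gpt-4o-mini", **params):
    """Turn plain prompt strings into request dicts in the format shown above."""
    return [
        {"model": model, "messages": [{"role": "user", "content": p}], **params}
        for p in prompts
    ]


def extract_text(response):
    """Return the first choice's text, or None if the response is missing or malformed."""
    if not response:
        return None
    try:
        return response.choices[0].message.content
    except (AttributeError, IndexError, TypeError):
        return None


# Stand-in object mimicking the response shape; a real call would return
# the objects produced by client.make_requests(...).
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hello!"))]
)
texts = [extract_text(r) for r in [fake, None]]
```

With a real client, the same helpers would be used as extract_text(r) for each r in client.make_requests(build_requests(prompts)).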

Usage statistics and tracking

APIClient integrates an APIUsageTracker that accumulates cost, token usage, and response time stats across all calls.

Quick peek:

print(client.tracker)  # human-readable summary
print(client.tracker.details)  # machine-friendly dict

print(f"Total cost: ${client.tracker.total_cost:.4f}")
print(f"Total prompt tokens: {client.tracker.total_prompt_tokens}")
print(f"Total completion tokens: {client.tracker.total_completion_tokens}")
print(f"Number of API calls: {client.tracker.num_api_calls}")
print(f"Mean response time: {client.tracker.mean_response_time:.2f}s")

See tracker API: https://andrefcruz.github.io/llm-api-client/api.html#module-llm_api_client.api_tracker
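For logging or reporting, the tracker attributes shown above can be collected into a single dict. The helper below is a sketch: summarize_usage is a hypothetical name, it assumes the attributes are plain numbers, and it uses a stand-in object rather than a real APIUsageTracker so it runs without making API calls.

```python
from types import SimpleNamespace


def summarize_usage(tracker):
    """Collect the tracker attributes shown above, plus a derived per-call cost."""
    calls = tracker.num_api_calls
    return {
        "total_cost": tracker.total_cost,
        "total_tokens": tracker.total_prompt_tokens + tracker.total_completion_tokens,
        "num_api_calls": calls,
        # Avoid dividing by zero before any call has been made.
        "cost_per_call": tracker.total_cost / calls if calls else 0.0,
    }


# Stand-in with the same attribute names as client.tracker (values are made up).
fake_tracker = SimpleNamespace(
    total_cost=0.5,
    total_prompt_tokens=1200,
    total_completion_tokens=300,
    num_api_calls=4,
)
summary = summarize_usage(fake_tracker)
```

In real use, summarize_usage(client.tracker) would be called after a batch of requests completes.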

Client Parameters

The APIClient constructor accepts:

  • max_requests_per_minute (int, default 10000): Maximum API requests per minute (RPM).
  • max_tokens_per_minute (int, default 2000000): Maximum tokens per minute (TPM).
  • max_workers (int, optional): Maximum worker threads. If not set, defaults to max_requests_per_minute when provided, otherwise to CPU count * 20.
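The max_workers default described above can be read as the following rule. This is a literal sketch of the sentence, not the library's code; default_max_workers is a hypothetical helper, and treating an unspecified rate limit as None is an assumption.

```python
import os


def default_max_workers(max_requests_per_minute=None):
    """Pick a worker count following the rule described above (illustrative only)."""
    if max_requests_per_minute is not None:
        # One worker per allowed request per minute.
        return max_requests_per_minute
    # os.cpu_count() may return None on some platforms; fall back to 1.
    return (os.cpu_count() or 1) * 20
```

Since the requests are I/O-bound, a worker count well above the CPU count is reasonable: most threads spend their time waiting on the network.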

Method Parameters

Both make_requests and make_requests_with_retries accept the following core parameters:

  • requests (list[dict]): A list in which each dictionary holds the parameters for a single API call (e.g., model, messages, temperature). Each dictionary follows the OpenAI API format, as passed through LiteLLM.
  • max_workers (int, optional): Maximum concurrent threads. Defaults to the client's configured worker count (set via the constructor).
  • sanitize (bool, optional): If True (default), the client will attempt to remove parameters that are incompatible with the specified model and provider before making the request. It also truncates message history to fit the model's context window.
  • timeout (float, optional): The maximum number of seconds to wait for all requests to complete. If None (default), it waits indefinitely.
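As an illustration of the context-window truncation that sanitize performs, the sketch below drops the oldest non-system messages until the history fits a token budget. It is one possible strategy, not necessarily the one the library uses; truncate_messages is a hypothetical name, and the default whitespace-based token count is a crude stand-in for a real tokenizer.

```python
def truncate_messages(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Drop the oldest non-system messages until the history fits in max_tokens."""
    # System messages are kept and placed first, which is where they usually sit.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep at least the most recent message, even if it alone is too long.
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question"},
]
trimmed = truncate_messages(history, max_tokens=5)
```

Here the first user and assistant turns are dropped, leaving the system message and the latest user message.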

The make_requests_with_retries method includes one additional parameter:

  • max_retries (int, optional): The maximum number of times to retry a failed request. Defaults to 2.
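The retry behavior can be pictured as re-sending only the requests that failed. The sketch below assumes a failed request is reported as None in the response list, which matches the falsy check in the usage example but is otherwise an assumption; retry_failed and send_batch are hypothetical names, not the library's API.

```python
def retry_failed(send_batch, requests, max_retries=2):
    """Send all requests once, then re-send only those whose response is None."""
    responses = list(send_batch(requests))
    for _ in range(max_retries):
        failed = [i for i, r in enumerate(responses) if r is None]
        if not failed:
            break
        retried = send_batch([requests[i] for i in failed])
        # Write each retried response back into its original position.
        for i, r in zip(failed, retried):
            responses[i] = r
    return responses


# Fake sender for demonstration: request "b" fails on its first attempt only.
attempts = {}


def flaky(batch):
    out = []
    for req in batch:
        attempts[req] = attempts.get(req, 0) + 1
        out.append(None if req == "b" and attempts[req] < 2 else req.upper())
    return out


result = retry_failed(flaky, ["a", "b"])
```

Responses stay aligned with the input order, so the caller can still match each response to its request by index.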
