High-performance parallel LLM API request tool with caching and multiple provider support

FastLLM

High-performance parallel LLM API request tool with support for multiple providers and caching capabilities.

Features

  • Parallel request processing with configurable concurrency
    • Lets you process 20,000+ prompt tokens per second and 1,500+ output tokens per second, even with extremely large LLMs such as DeepSeek-V3.
  • Built-in caching support (in-memory and disk-based)
  • Progress tracking with token usage statistics
  • Support for multiple LLM providers (OpenAI, OpenRouter, etc.)
  • OpenAI-style API for request batching
  • Retry mechanism with configurable attempts and delays
  • Request deduplication and response ordering

Installation

Use pip:

pip install fastllm-kit

Alternatively, use uv:

uv pip install fastllm-kit

Important: FastLLM does not yet support libsqlite3 version 3.49.1; please use version 3.49.0 or lower. See this issue for more details. This may affect users of conda environments.
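
To see which SQLite library your Python interpreter is linked against, a quick check along these lines may help. The `MAX_SUPPORTED` constant and the `sqlite_is_supported` helper are illustrative names based only on the note above, not part of FastLLM.

```python
import sqlite3

# Assumption taken from the note above: 3.49.0 is the newest supported release.
MAX_SUPPORTED = (3, 49, 0)


def sqlite_is_supported(version_info=None):
    """Return True if the given (or linked) SQLite version is <= MAX_SUPPORTED."""
    if version_info is None:
        # Version of the SQLite library the interpreter is linked against.
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info[:3]) <= MAX_SUPPORTED


print("SQLite library version:", sqlite3.sqlite_version)
print("Within the supported range:", sqlite_is_supported())
```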

For development:

# Clone the repository
git clone https://github.com/Rexhaif/fastllm.git
cd fastllm

# Create a virtual environment and install dependencies
uv venv
uv pip install -e ".[dev]"

Dependencies

FastLLM requires Python 3.9 or later and depends on the following packages:

  • httpx (^0.27.2) - For async HTTP requests
  • pydantic (^2.10.6) - For data validation and settings management
  • rich (^13.9.4) - For beautiful terminal output and progress bars
  • diskcache (^5.6.3) - For persistent disk caching
  • asyncio (^3.4.3) - For asynchronous operations
  • anyio (^4.8.0) - For async I/O operations
  • tqdm (^4.67.1) - For progress tracking
  • typing_extensions (^4.12.2) - For enhanced type hints

Development dependencies:

  • ruff (^0.3.7) - For linting and formatting
  • pytest (^8.3.4) - For testing
  • pytest-asyncio (^0.23.8) - For async tests
  • pytest-cov (^4.1.0) - For test coverage
  • black (^24.10.0) - For code formatting
  • coverage (^7.6.10) - For code coverage reporting

Development

The project uses just for task automation and uv for dependency management.

Common tasks:

# Install dependencies
just install

# Run tests
just test

# Format code
just format

# Run linting
just lint

# Clean up cache files
just clean

Quick Start

from fastllm import RequestBatch, RequestManager, OpenAIProvider, InMemoryCache

# Create a provider
provider = OpenAIProvider(
    api_key="your-api-key",
    # Optional: custom API base URL
    api_base="https://api.openai.com/v1",
)

# Create a cache provider (optional)
cache = InMemoryCache()  # or DiskCache(directory="./cache")

# Create a request manager
manager = RequestManager(
    provider=provider,
    concurrency=50,  # Number of concurrent requests
    show_progress=True,  # Show progress bar
    caching_provider=cache,  # Enable caching
)

# Create a batch of requests
request_ids = []  # Store request IDs for later use
with RequestBatch() as batch:
    # Add requests to the batch
    for i in range(10):
        # create() returns the request ID (caching key)
        request_id = batch.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"What is {i} + {i}?"
            }],
            temperature=0.7,
            include_reasoning=True,  # Optional: include model reasoning
        )
        request_ids.append(request_id)

# Process the batch
responses = manager.process_batch(batch)

# Process responses
for request_id, response in zip(request_ids, responses):
    print(f"Request {request_id}: {response.response.choices[0].message.content}")
        
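# You can use request IDs to check cache status.
# cache.exists() is a coroutine, so it has to run inside an event loop
# (in a Jupyter notebook you can await it directly instead).
import asyncio

async def report_cache_status():
    for request_id in request_ids:
        is_cached = await cache.exists(request_id)
        print(f"Request {request_id} is {'cached' if is_cached else 'not cached'}")

asyncio.run(report_cache_status())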

Advanced Usage

Async Support

FastLLM can be used both synchronously and asynchronously, and works seamlessly in regular Python environments, async applications, and Jupyter notebooks:

import asyncio
from fastllm import RequestBatch, RequestManager, OpenAIProvider

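# Works in regular scripts and Jupyter notebooks
provider = OpenAIProvider(api_key="your-api-key")
manager = RequestManager(provider=provider)

with RequestBatch() as batch:
    batch.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )

responses = manager.process_batch(batch)  # Just works!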

# Works in async applications
async def process_requests():
    provider = OpenAIProvider(api_key="your-api-key")
    manager = RequestManager(provider=provider)
    
    with RequestBatch() as batch:
        batch.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello!"}]
        )
    
    responses = manager.process_batch(batch)
    return responses

# Entry point: asyncio.run() starts an event loop and runs main() in it
async def main():
    responses = await process_requests()
    print(responses)

asyncio.run(main())
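
How a synchronous call can work both with and without a running event loop is not described in this README. One common approach, shown here purely as an assumption and not as FastLLM's actual mechanism, is to run the coroutine in a worker thread when a loop is already active. `run_sync` and `work` are hypothetical names.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_sync(make_coro):
    """Run the coroutine returned by make_coro() and block until it finishes.

    With no loop running in this thread, asyncio.run() is used directly.
    Otherwise the coroutine runs in a worker thread with its own loop.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(make_coro())
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(make_coro())).result()


async def work():
    await asyncio.sleep(0.001)
    return "done"


# Plain script: no event loop is running.
plain = run_sync(work)


# Inside a running loop, as in a notebook or an async application.
async def caller():
    return run_sync(work)  # blocks this loop until the worker thread finishes


nested = asyncio.run(caller())
print(plain, nested)
```

Note that this pattern blocks the calling event loop while the worker thread runs, so it trades responsiveness for a simpler call site.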

Caching Configuration

FastLLM supports both in-memory and disk-based caching, with request IDs serving as cache keys:

from fastllm import InMemoryCache, DiskCache, RequestBatch

# Create a batch and get request IDs
with RequestBatch() as batch:
    request_id = batch.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(f"Request ID (cache key): {request_id}")

# In-memory cache (faster, but cleared when process ends)
cache = InMemoryCache()

# Disk cache (persistent, with optional TTL and size limits)
cache = DiskCache(
    directory="./cache",
    ttl=3600,  # Cache TTL in seconds
    size_limit=int(2e9)  # 2GB size limit
)

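# Cache lookups are coroutines, so call them from async code
# (top-level await only works in notebooks and the asyncio REPL).
async def get_if_cached(request_id):
    # Check if a response is cached
    if await cache.exists(request_id):
        # Get cached response if available
        return await cache.get(request_id)
    return None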
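
To make the idea of request IDs as cache keys concrete, here is a minimal sketch of a content-derived key. How FastLLM actually computes its request IDs is not stated here; hashing the canonical JSON of the request is an assumption, and `request_key` and `deduplicate` are hypothetical helpers.

```python
import hashlib
import json


def request_key(request: dict) -> str:
    """Derive a stable key from the request contents (assumed scheme)."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(requests):
    """Return (unique, keys): unique maps key -> request, keys[i] is the key of requests[i]."""
    unique = {}
    keys = []
    for req in requests:
        key = request_key(req)
        unique.setdefault(key, req)  # keep the first request seen for each key
        keys.append(key)
    return unique, keys


a = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}
b = {"messages": [{"role": "user", "content": "Hello!"}], "model": "gpt-3.5-turbo"}
c = {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Bye!"}]}

unique, keys = deduplicate([a, b, c])
print(len(unique), keys[0] == keys[1])
```

With a key of this kind, identical requests map to the same cache entry, and the `keys` list lets responses be returned in the original request order.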

Custom Providers

Create your own provider by inheriting from the base Provider class:

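from fastllm import Provider
from typing import Any
import httpx

# YourResponseType is a placeholder: replace it with your own response
# class (for example, a pydantic model) before using this code.
class CustomProvider(Provider[YourResponseType]):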
    def get_request_headers(self) -> dict[str, str]:
        return {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

    async def make_request(
        self,
        client: httpx.AsyncClient,
        request: dict[str, Any],
        timeout: float,
    ) -> YourResponseType:
        # Implement your request logic here and return a YourResponseType
        raise NotImplementedError
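
For a filled-in picture of the pattern, the sketch below mirrors the shape shown above without requiring fastllm, httpx, or network access. Everything in it is a stand-in: the local `Provider` base class, `FakeClient`, `EchoResponse`, and the `/echo` endpoint are hypothetical and only imitate the interface used in the snippet above.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Generic, TypeVar

T = TypeVar("T")


class Provider(ABC, Generic[T]):
    """Local stand-in for the library's base class (assumed shape)."""

    def __init__(self, api_key: str, api_base: str):
        self.api_key = api_key
        self.api_base = api_base

    @abstractmethod
    def get_request_headers(self) -> dict: ...

    @abstractmethod
    async def make_request(self, client: Any, request: dict, timeout: float) -> T: ...


@dataclass
class EchoResponse:
    text: str


class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake never fails

    def json(self):
        return self._payload


class FakeClient:
    """Imitates the small part of an async HTTP client used below."""

    async def post(self, url, json=None, headers=None, timeout=None):
        return FakeResponse({"text": json["prompt"][::-1]})


class EchoProvider(Provider[EchoResponse]):
    def get_request_headers(self) -> dict:
        return {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

    async def make_request(self, client, request, timeout) -> EchoResponse:
        response = await client.post(
            f"{self.api_base}/echo",
            json=request,
            headers=self.get_request_headers(),
            timeout=timeout,
        )
        response.raise_for_status()
        return EchoResponse(text=response.json()["text"])


provider = EchoProvider(api_key="your-api-key", api_base="https://example.invalid")
result = asyncio.run(provider.make_request(FakeClient(), {"prompt": "hello"}, timeout=5.0))
print(result.text)
```

In a real provider, the client would be the `httpx.AsyncClient` passed in by the library, and the response type would describe your API's actual payload.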

Progress Tracking

The progress bar shows:

  • Request completion progress
  • Tokens per second (prompt and completion)
  • Cache hit/miss statistics
  • Estimated time remaining
  • Total elapsed time
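
The figures in the list above follow from a handful of counters. The sketch below shows the arithmetic only; `ProgressStats` is a hypothetical class, not FastLLM's progress tracker, and the linear time estimate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProgressStats:
    total_requests: int
    completed: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cache_hits: int = 0

    def record(self, prompt_tokens, completion_tokens, cached=False):
        """Register one finished request and its token usage."""
        self.completed += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        if cached:
            self.cache_hits += 1

    def summary(self, elapsed_seconds):
        """Compute throughput, cache hit ratio, and a linear ETA."""
        if elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        rate = self.completed / elapsed_seconds  # requests per second
        remaining = self.total_requests - self.completed
        return {
            "prompt_tok_per_s": self.prompt_tokens / elapsed_seconds,
            "completion_tok_per_s": self.completion_tokens / elapsed_seconds,
            "cache_hit_ratio": self.cache_hits / self.completed if self.completed else 0.0,
            "eta_seconds": remaining / rate if rate else float("inf"),
        }


stats = ProgressStats(total_requests=10)
for _ in range(4):
    stats.record(prompt_tokens=100, completion_tokens=20)
stats.record(prompt_tokens=100, completion_tokens=20, cached=True)

report = stats.summary(elapsed_seconds=2.0)
print(report)
```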

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
