High-performance parallel LLM API request tool with caching and multiple provider support

FastLLM

High-performance parallel LLM API request tool with support for multiple providers and caching capabilities.

Features

  • Parallel request processing with configurable concurrency
    • Can process 20,000+ prompt tokens per second and 1,500+ output tokens per second, even with very large LLMs such as DeepSeek-V3.
  • Built-in caching support (in-memory and disk-based)
  • Progress tracking with token usage statistics
  • Support for multiple LLM providers (OpenAI, OpenRouter, etc.)
  • OpenAI-style API for request batching
  • Retry mechanism with configurable attempts and delays
  • Request deduplication and response ordering
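
The concurrency, retry, deduplication, and ordering features listed above can be illustrated with a small standalone sketch. This is not FastLLM's implementation: `run_batch`, `worker`, `retry_attempts`, and `retry_delay` are hypothetical names chosen for illustration, and the sketch uses only the standard library.

```python
import asyncio
from typing import Awaitable, Callable, Hashable, Sequence, TypeVar

K = TypeVar("K", bound=Hashable)
R = TypeVar("R")


async def run_batch(
    keys: Sequence[K],
    worker: Callable[[K], Awaitable[R]],
    concurrency: int = 50,
    retry_attempts: int = 3,
    retry_delay: float = 0.1,
) -> list[R]:
    """Call `worker` once per unique key and return results in input order."""
    if retry_attempts < 1:
        raise ValueError("retry_attempts must be at least 1")

    # The semaphore caps how many worker calls are in flight at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def call_with_retry(key: K) -> R:
        for attempt in range(1, retry_attempts + 1):
            try:
                async with semaphore:
                    return await worker(key)
            except Exception:
                if attempt == retry_attempts:
                    raise  # out of attempts: surface the last error
                await asyncio.sleep(retry_delay)
        raise AssertionError("unreachable")

    # Deduplicate while keeping first-seen order.
    unique_keys = list(dict.fromkeys(keys))
    # gather() returns results in the order the awaitables were passed in.
    results = await asyncio.gather(*(call_with_retry(k) for k in unique_keys))
    by_key = dict(zip(unique_keys, results))
    # Expand back to the original order, duplicates included.
    return [by_key[k] for k in keys]
```

With this sketch, a duplicated key triggers only one call to `worker`, and the returned list always lines up with the input sequence regardless of which call finishes first.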

Installation

Use pip:

pip install fastllm-kit

Alternatively, use uv:

uv pip install fastllm-kit

Important: fastllm does not yet support libsqlite3.49.1; please use libsqlite3.49.0 or lower. See this issue for more details. This may affect users with conda environments.
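
To see which SQLite library your Python interpreter is linked against, you can read `sqlite3.sqlite_version` from the standard library. The sketch below assumes that "libsqlite3.49.0" in the note above corresponds to SQLite version 3.49.0; the helper names `parse_version` and `is_supported` are hypothetical.

```python
import sqlite3

# Assumption: the highest version the note above describes as working.
MAX_SUPPORTED = (3, 49, 0)


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '3.49.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_supported(version: str = sqlite3.sqlite_version) -> bool:
    """Return True if the SQLite version is at or below MAX_SUPPORTED."""
    return parse_version(version) <= MAX_SUPPORTED


if __name__ == "__main__":
    status = "ok" if is_supported() else "newer than 3.49.0"
    print(f"SQLite {sqlite3.sqlite_version}: {status}")
```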

For development:

# Clone the repository
git clone https://github.com/Rexhaif/fastllm.git
cd fastllm

# Create a virtual environment and install dependencies
uv venv
uv pip install -e ".[dev]"

Dependencies

FastLLM requires Python 3.9 or later and depends on the following packages:

  • httpx (^0.27.2) - For async HTTP requests
  • pydantic (^2.10.6) - For data validation and settings management
  • rich (^13.9.4) - For beautiful terminal output and progress bars
  • diskcache (^5.6.3) - For persistent disk caching
  • asyncio (^3.4.3) - For asynchronous operations
  • anyio (^4.8.0) - For async I/O operations
  • tqdm (^4.67.1) - For progress tracking
  • typing_extensions (^4.12.2) - For enhanced type hints

Development dependencies:

  • ruff (^0.3.7) - For linting and formatting
  • pytest (^8.3.4) - For testing
  • pytest-asyncio (^0.23.8) - For async tests
  • pytest-cov (^4.1.0) - For test coverage
  • black (^24.10.0) - For code formatting
  • coverage (^7.6.10) - For code coverage reporting

Development

The project uses just for task automation and uv for dependency management.

Common tasks:

# Install dependencies
just install

# Run tests
just test

# Format code
just format

# Run linting
just lint

# Clean up cache files
just clean

Quick Start

from fastllm import RequestBatch, RequestManager, OpenAIProvider, InMemoryCache

# Create a provider
provider = OpenAIProvider(
    api_key="your-api-key",
    # Optional: custom API base URL
    api_base="https://api.openai.com/v1",
)

# Create a cache provider (optional)
cache = InMemoryCache()  # or DiskCache(directory="./cache")

# Create a request manager
manager = RequestManager(
    provider=provider,
    concurrency=50,  # Number of concurrent requests
    show_progress=True,  # Show progress bar
    caching_provider=cache,  # Enable caching
)

# Create a batch of requests
request_ids = []  # Store request IDs for later use
with RequestBatch() as batch:
    # Add requests to the batch
    for i in range(10):
        # create() returns the request ID (caching key)
        request_id = batch.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"What is {i} + {i}?"
            }],
            temperature=0.7,
            include_reasoning=True,  # Optional: include model reasoning
        )
        request_ids.append(request_id)

# Process the batch
responses = manager.process_batch(batch)

# Process responses
for request_id, response in zip(request_ids, responses):
    print(f"Request {request_id}: {response.response.choices[0].message.content}")
        
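# You can use request IDs to check cache status.
# Note: cache.exists() is awaited, so this part needs an async context.
# Top-level `await` works in Jupyter/IPython; in a plain script, move these
# lines into an `async def` function and run it with asyncio.run().
for request_id in request_ids:
    is_cached = await cache.exists(request_id)
    print(f"Request {request_id} is {'cached' if is_cached else 'not cached'}")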

Advanced Usage

Async Support

FastLLM can be used both synchronously and asynchronously, and works seamlessly in regular Python environments, async applications, and Jupyter notebooks:

import asyncio
from fastllm import RequestBatch, RequestManager, OpenAIProvider

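# Build a batch to process
with RequestBatch() as batch:
    batch.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )

# Works in Jupyter notebooks
provider = OpenAIProvider(api_key="your-api-key")
manager = RequestManager(provider=provider)
responses = manager.process_batch(batch)  # Just works!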

# Works in async applications
async def process_requests():
    provider = OpenAIProvider(api_key="your-api-key")
    manager = RequestManager(provider=provider)
    
    with RequestBatch() as batch:
        batch.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello!"}]
        )
    
    responses = manager.process_batch(batch)
    return responses

# Entry point for a script: asyncio.run() starts an event loop and runs main() in it
async def main():
    responses = await process_requests()
    print(responses)

asyncio.run(main())

Caching Configuration

FastLLM supports both in-memory and disk-based caching, with request IDs serving as cache keys:

from fastllm import InMemoryCache, DiskCache, RequestBatch

# Create a batch and get request IDs
with RequestBatch() as batch:
    request_id = batch.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(f"Request ID (cache key): {request_id}")

# In-memory cache (faster, but cleared when process ends)
cache = InMemoryCache()

# Disk cache (persistent, with optional TTL and size limits)
cache = DiskCache(
    directory="./cache",
    ttl=3600,  # Cache TTL in seconds
    size_limit=int(2e9)  # 2GB size limit
)

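# Check if a response is cached.
# Note: the cache methods are awaited, so run this inside an async function
# (or in Jupyter/IPython, where top-level `await` is allowed).
is_cached = await cache.exists(request_id)

# Get cached response if available
if is_cached:
    response = await cache.get(request_id)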
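
One way to picture request IDs acting as cache keys is as a deterministic hash of the request parameters, so that identical requests map to the same key. The sketch below is an assumption for illustration only: the source does not say how FastLLM derives its IDs, and `request_cache_key` is a hypothetical helper.

```python
import hashlib
import json
from typing import Any


def request_cache_key(request: dict[str, Any]) -> str:
    """Derive a stable key from a JSON-serializable request payload."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(
        request, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key = request_cache_key(
    {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    }
)
print(f"Illustrative cache key: {key}")
```

Under this scheme, changing any parameter (for example `temperature`) yields a different key, while reordering the same parameters does not.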

Custom Providers

Create your own provider by inheriting from the base Provider class:

from fastllm import Provider
from typing import Any
import httpx

# YourResponseType is a placeholder: replace it with your own response model.
class CustomProvider(Provider[YourResponseType]):
    def get_request_headers(self) -> dict[str, str]:
        return {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

    async def make_request(
        self,
        client: httpx.AsyncClient,
        request: dict[str, Any],
        timeout: float,
    ) -> YourResponseType:
        # Implement your request logic here: send `request` with `client`
        # and convert the result into YourResponseType.
        raise NotImplementedError

Progress Tracking

The progress bar shows:

  • Request completion progress
  • Tokens per second (prompt and completion)
  • Cache hit/miss statistics
  • Estimated time remaining
  • Total elapsed time
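
The statistics listed above reduce to a few counters and divisions. The following sketch shows one way to compute them; it is not FastLLM's internal tracker, and `ProgressStats` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ProgressStats:
    total_requests: int
    completed: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cache_hits: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int, cached: bool = False) -> None:
        """Account for one finished request."""
        self.completed += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        if cached:
            self.cache_hits += 1

    def rates(self, elapsed: float) -> tuple:
        """Return (prompt tokens/s, completion tokens/s) over `elapsed` seconds."""
        if elapsed <= 0:
            return (0.0, 0.0)
        return (self.prompt_tokens / elapsed, self.completion_tokens / elapsed)

    def eta(self, elapsed: float) -> float:
        """Estimate remaining seconds from the average time per finished request."""
        if self.completed == 0:
            return float("inf")
        remaining = self.total_requests - self.completed
        return elapsed / self.completed * remaining
```

Cache misses follow as `completed - cache_hits`, and the elapsed time would typically come from `time.monotonic()` differences.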

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
