Rollouts
A high-quality Python package for generating multiple LLM responses with built-in resampling, caching, and provider abstraction.
Features
- Simple Interface: Both synchronous and asynchronous APIs
- Multiple Providers: Support for OpenRouter, Fireworks, Together, and more
- Smart Caching: Automatic response caching to reduce API costs
- Parameter Override: Override any setting at generation time
- Presets: Built-in presets for common use cases
- Type Safety: Full type hints and dataclass models
- Production Ready: Comprehensive error handling and retries
Installation
pip install rollouts
Examples
See example.py for comprehensive examples of all package features:
# Set your API key
export OPENROUTER_API_KEY="your-key-here"
# Run the examples
python example.py
Quick Start
Synchronous Usage
from rollouts import RolloutsClient
# Create client with default settings
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    temperature=0.7,
    max_tokens=1000
)

# Generate multiple responses
rollouts = client.generate("What is the meaning of life?", n_samples=5)

# Access responses
for response in rollouts:
    print(response.full)
Asynchronous Usage
import asyncio
from rollouts import RolloutsClient
async def main():
    client = RolloutsClient(model="qwen/qwen3-30b-a3b")

    # Generate responses for multiple prompts concurrently
    results = await asyncio.gather(
        client.agenerate("Explain quantum computing", n_samples=3),
        client.agenerate("Write a haiku", n_samples=5, temperature=1.2)
    )

    for rollouts in results:
        print(f"Generated {len(rollouts)} responses")

asyncio.run(main())
Using Presets
from rollouts import create_client
# Create client with a preset configuration
client = create_client(
    model="qwen/qwen3-30b-a3b",
    preset="creative"  # High temperature, more diverse outputs
)
responses = client.generate("Write a story", n_samples=3)
Available presets:
- deterministic: Temperature 0, best for factual responses
- focused: Low temperature (0.3), focused but not rigid
- balanced: Medium temperature (0.7), good default
- creative: High temperature (1.2), diverse outputs
Thinking Injection (Advanced)
Some models support "thinking injection" where you can control the reasoning process by injecting partial thoughts:
# Works with DeepSeek R1, QwQ, Qwen models
prompt = "Calculate 10*5 <think>Let me calculate: 10*5="
result = client.generate(prompt, n_samples=1)
# Model continues from "=" and completes the calculation
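The continued reasoning and the final answer land in the Response fields described in the API Reference below. A minimal sketch, assuming the client from the Quick Start and a model that supports injection:
prompt = "Calculate 10*5 <think>Let me calculate: 10*5="
rollouts = client.generate(prompt, n_samples=1)
for response in rollouts:
    print("Reasoning:", response.reasoning)  # continuation of the injected thought
    print("Answer:", response.content)       # text after the </think> separator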
Supported models:
- ✅ DeepSeek R1 and variants
- ✅ QwQ models
- ✅ Qwen models
- ✅ Claude/Anthropic models
- ❌ GPT-OSS models (no injection support on OpenRouter)
- ❌ Gemini thinking models (internal reasoning only)
For more details, see the THINK_INJECTION.md documentation.
Advanced Usage
Parameter Override
Override any default setting at generation time:
client = RolloutsClient(model="qwen/qwen3-30b-a3b", temperature=0.7)
# Override temperature for this specific generation
rollouts = client.generate(
    "Be creative!",
    n_samples=5,
    temperature=1.5,  # Override default
    max_tokens=2000   # Override default
)
Custom Configuration
from rollouts import RolloutsClient, Config
# Create custom configuration
config = Config(
    model="qwen/qwen3-30b-a3b",
    temperature=0.8,
    top_p=0.95,
    max_tokens=2000,
    presence_penalty=0.1,
    frequency_penalty=0.1
)
# Use configuration
client = RolloutsClient(**config.to_dict())
Caching
Responses are automatically cached to disk:
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    use_cache=True,       # Default
    cache_dir="my_cache"  # Custom cache directory
)
# First call: generates responses
rollouts1 = client.generate("What is 2+2?", n_samples=3)
# Second call: uses cached responses (instant)
rollouts2 = client.generate("What is 2+2?", n_samples=3)
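One way to verify the cache is working is to time two identical calls; the second should return near-instantly because no API call is made:
import time

start = time.perf_counter()
client.generate("What is 2+2?", n_samples=3)  # cache miss: hits the API
print(f"First call:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
client.generate("What is 2+2?", n_samples=3)  # cache hit: served from disk
print(f"Second call: {time.perf_counter() - start:.2f}s")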
OpenRouter Implicit Prompt Caching
In addition to this package's local response caching, OpenRouter provides automatic server-side prompt caching for many models. This can significantly reduce costs on repeated API calls with similar prompts:
- Cost savings: Cache reads are typically charged at 0.25x to 0.5x the original input token price
- Automatic: Most models (OpenAI, DeepSeek, Grok, Gemini 2.5) enable caching automatically with no configuration needed
- Smart routing: OpenRouter automatically routes to the same provider to maximize cache hits
This server-side caching works independently from this package's local cache. While our local cache eliminates API calls entirely for identical requests, OpenRouter's prompt caching reduces costs when you make similar (but not identical) requests. For full details on pricing and supported models, see OpenRouter's Prompt Caching documentation.
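To take advantage of server-side prompt caching, structure requests so they share a long identical prefix and vary only the tail. An illustrative sketch, reusing the client from the previous example (the report text is a hypothetical stand-in):
# Hypothetical example: a long, identical prefix shared across requests lets
# OpenRouter serve those prefix tokens from its prompt cache at a reduced rate.
long_report_text = "..."  # imagine a multi-page document here
shared_context = "You are analyzing the following report:\n" + long_report_text

# Similar but not identical prompts: the local cache misses, yet the shared
# prefix can still hit OpenRouter's prompt cache and cost less.
summary = client.generate(shared_context + "\n\nSummarize the report.", n_samples=1)
risks = client.generate(shared_context + "\n\nList the key risks.", n_samples=1)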
API Reference
RolloutsClient
Main client class for generating responses.
Parameters:
- model (str, required): Model identifier
- temperature (float): Sampling temperature (0.0-2.0)
- top_p (float): Nucleus sampling parameter
- max_tokens (int): Maximum tokens to generate
- top_k (int): Top-k sampling parameter
- presence_penalty (float): Presence penalty (-2.0 to 2.0)
- frequency_penalty (float): Frequency penalty (-2.0 to 2.0)
- api_key (str): API key (uses env variable if None)
- use_cache (bool): Enable caching
- verbose (bool): Print debug information
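Putting the parameters together, a fully specified client might look like this (the values are illustrative, not recommendations):
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",  # required
    temperature=0.7,             # 0.0-2.0
    top_p=0.95,
    max_tokens=1000,
    top_k=40,
    presence_penalty=0.0,        # -2.0 to 2.0
    frequency_penalty=0.0,       # -2.0 to 2.0
    api_key=None,                # falls back to OPENROUTER_API_KEY
    use_cache=True,
    verbose=False
)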
Rollouts
Container for multiple responses.
Attributes:
- prompt: The input prompt
- responses: List of Response objects
- num_responses: Number of responses requested
- temperature, top_p, max_tokens: Generation parameters
- model: Model information
Methods:
- get_texts(): Get all full response texts (includes reasoning + content)
- get_reasonings(): Get reasoning portions only
- get_contents(): Get content portions only (post-reasoning text)
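A short sketch of the container and its accessors:
rollouts = client.generate("Explain entropy", n_samples=3)
print(rollouts.num_responses)           # number of responses requested
full_texts = rollouts.get_texts()       # reasoning + content for each response
reasonings = rollouts.get_reasonings()  # reasoning portions only
contents = rollouts.get_contents()      # post-reasoning text only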
Response
Individual response from the model.
Key Fields:
- full: The complete response text, formatted as reasoning_text + "\n</think>\n" + content_text
- content: The post-reasoning text (what comes after </think>)
- reasoning: The reasoning/thinking text (what comes before </think>)
- usage: Token usage statistics
- finish_reason: Why the response ended (e.g., "stop", "length")
Understanding the Think Token Format:
The full field is always structured with a </think> separator between reasoning and content:
reasoning_text
</think>
content_text
This format is used consistently even for models that don't natively use <think> tags:
- Models with native think support (DeepSeek R1, QwQ, Qwen): The reasoning appears naturally
- GPT-OSS models: OpenRouter returns reasoning in a separate field, which we format into this structure
- Models without reasoning: The full field contains just the content (no reasoning section)
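Given that structure, the relationship between the three fields can be checked directly. A minimal sketch, assuming reasoning is empty for models without a reasoning section:
for response in rollouts:
    if response.reasoning:
        # Documented format: reasoning, the </think> separator, then content
        assert response.full == response.reasoning + "\n</think>\n" + response.content
    else:
        # No reasoning section: full contains just the content
        assert response.full == response.content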
Important Note for GPT-OSS Models:
GPT-OSS models (like gpt-oss-20b and gpt-oss-120b) use OpenAI's Harmony format internally. On OpenRouter:
- Reasoning is returned in a separate reasoning field by the API
- You cannot inject or control thinking tokens for these models
- The </think> separator is added by this library for consistency
- If you need to control reasoning, use models like DeepSeek R1 or QwQ instead
Example accessing Response fields:
for response in rollouts:
    print(f"Full response: {response.full}")
    print(f"Just content: {response.content}")
    print(f"Just reasoning: {response.reasoning}")
    print(f"Tokens used: {response.usage.total_tokens}")
API Key Configuration
There are three ways to provide API keys:
1. Environment Variable (recommended for development)
export OPENROUTER_API_KEY="your-key-here"
2. Pass to Client (recommended for production)
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    api_key="your-key-here"
)
3. Pass at Generation Time (for per-request keys)
client = RolloutsClient(model="qwen/qwen3-30b-a3b")
responses = client.generate(
    "Your prompt",
    n_samples=5,
    api_key="different-key-here"  # Overrides any default
)
Note: API keys are never cached or stored to disk.
Known Limitations
Logprobs Not Supported
This package does not currently support logprobs (log probabilities). If you try to use top_logprobs, you'll get a NotImplementedError:
# This will raise an error:
client = RolloutsClient(
    model="openai/gpt-3.5-turbo",
    top_logprobs=5  # ❌ Not supported
)
Why? OpenRouter's logprobs support is inconsistent across providers: testing against multiple providers showed that logprobs do not work reliably through OpenRouter's API. Until this is resolved upstream, the feature is not implemented in this package.
If you need logprobs, you may need to use the providers' APIs directly rather than through OpenRouter.
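If your code might receive a configuration that sets top_logprobs, you can catch the documented error and retry without it. A minimal sketch:
try:
    client = RolloutsClient(model="openai/gpt-3.5-turbo", top_logprobs=5)
except NotImplementedError:
    # logprobs are not supported through OpenRouter; drop the parameter
    client = RolloutsClient(model="openai/gpt-3.5-turbo")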
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.