Rollouts
A high-quality Python package for generating multiple LLM responses with built-in resampling, caching, and provider abstraction.
Features
- Simple Interface: Both synchronous and asynchronous APIs
- Multiple Providers: Support for OpenRouter, Fireworks, Together, and more
- Smart Caching: Automatic response caching to reduce API costs
- Parameter Override: Override any setting at generation time
- Presets: Built-in presets for common use cases
- Type Safety: Full type hints and dataclass models
- Production Ready: Comprehensive error handling and retries
Installation
pip install rollouts
Examples
See example.py for comprehensive examples of all package features:
# Set your API key
export OPENROUTER_API_KEY="your-key-here"
# Run the examples
python example.py
Quick Start
Synchronous Usage
from rollouts import RolloutsClient
# Create client with default settings
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    temperature=0.7,
    max_tokens=1000
)

# Generate multiple responses
rollouts = client.generate("What is the meaning of life?", n_samples=5)

# Access responses
for response in rollouts:
    print(response.full)
Asynchronous Usage
import asyncio
from rollouts import RolloutsClient
async def main():
    client = RolloutsClient(model="qwen/qwen3-30b-a3b")

    # Generate responses for multiple prompts concurrently
    results = await asyncio.gather(
        client.agenerate("Explain quantum computing", n_samples=3),
        client.agenerate("Write a haiku", n_samples=5, temperature=1.2)
    )

    for rollouts in results:
        print(f"Generated {len(rollouts)} responses")

asyncio.run(main())
Using Presets
from rollouts import create_client
# Create client with a preset configuration
client = create_client(
    model="qwen/qwen3-30b-a3b",
    preset="creative"  # High temperature, more diverse outputs
)
responses = client.generate("Write a story", n_samples=3)
Available presets:
- deterministic: Temperature 0, best for factual responses
- focused: Low temperature (0.3), focused but not rigid
- balanced: Medium temperature (0.7), good default
- creative: High temperature (1.2), diverse outputs
Thinking Injection (Advanced)
Some models support "thinking injection" where you can control the reasoning process by injecting partial thoughts:
# Works with DeepSeek R1, QwQ, Qwen models
prompt = "Calculate 10*5 <think>Let me calculate: 10*5="
result = client.generate(prompt, n_samples=1)
# Model continues from "=" and completes the calculation
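The continued reasoning and the final answer land in the Response fields described in the API Reference below. A minimal sketch, assuming the client from the Quick Start and a model that supports injection:
prompt = "Calculate 10*5 <think>Let me calculate: 10*5="
rollouts = client.generate(prompt, n_samples=1)
for response in rollouts:
    print("Reasoning:", response.reasoning)  # continuation of the injected thought
    print("Answer:", response.content)       # text after the </think> separator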
Supported models:
- ✅ DeepSeek R1 and variants
- ✅ QwQ models
- ✅ Qwen models
- ✅ Claude/Anthropic models
- ❌ GPT-OSS models (no injection support on OpenRouter)
- ❌ Gemini thinking models (internal reasoning only)
For more details, see the THINK_INJECTION.md documentation.
Advanced Usage
Parameter Override
Override any default setting at generation time:
client = RolloutsClient(model="qwen/qwen3-30b-a3b", temperature=0.7)
# Override temperature for this specific generation
rollouts = client.generate(
    "Be creative!",
    n_samples=5,
    temperature=1.5,  # Override default
    max_tokens=2000   # Override default
)
Custom Configuration
from rollouts import RolloutsClient, Config
# Create custom configuration
config = Config(
    model="qwen/qwen3-30b-a3b",
    temperature=0.8,
    top_p=0.95,
    max_tokens=2000,
    presence_penalty=0.1,
    frequency_penalty=0.1
)
# Use configuration
client = RolloutsClient(**config.to_dict())
Caching
Responses are automatically cached to disk:
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    use_cache=True,       # Default
    cache_dir="my_cache"  # Custom cache directory
)
# First call: generates responses
rollouts1 = client.generate("What is 2+2?", n_samples=3)
# Second call: uses cached responses (instant)
rollouts2 = client.generate("What is 2+2?", n_samples=3)
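One way to verify the cache is working is to time two identical calls; the second should return near-instantly because no API call is made:
import time

start = time.perf_counter()
client.generate("What is 2+2?", n_samples=3)  # cache miss: hits the API
print(f"First call:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
client.generate("What is 2+2?", n_samples=3)  # cache hit: served from disk
print(f"Second call: {time.perf_counter() - start:.2f}s")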
OpenRouter Implicit Prompt Caching
In addition to this package's local response caching, OpenRouter provides automatic server-side prompt caching for many models. This can significantly reduce costs on repeated API calls with similar prompts:
- Cost savings: Cache reads are typically charged at 0.25x to 0.5x the original input token price
- Automatic: Most models (OpenAI, DeepSeek, Grok, Gemini 2.5) enable caching automatically with no configuration needed
- Smart routing: OpenRouter automatically routes to the same provider to maximize cache hits
This server-side caching works independently from this package's local cache. While our local cache eliminates API calls entirely for identical requests, OpenRouter's prompt caching reduces costs when you make similar (but not identical) requests. For full details on pricing and supported models, see OpenRouter's Prompt Caching documentation.
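To take advantage of server-side prompt caching, structure requests so they share a long identical prefix and vary only the tail. An illustrative sketch, reusing the client from the previous example (the report text is a hypothetical stand-in):
# Hypothetical example: a long, identical prefix shared across requests lets
# OpenRouter serve those prefix tokens from its prompt cache at a reduced rate.
long_report_text = "..."  # imagine a multi-page document here
shared_context = "You are analyzing the following report:\n" + long_report_text

# Similar but not identical prompts: the local cache misses, yet the shared
# prefix can still hit OpenRouter's prompt cache and cost less.
summary = client.generate(shared_context + "\n\nSummarize the report.", n_samples=1)
risks = client.generate(shared_context + "\n\nList the key risks.", n_samples=1)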
API Reference
RolloutsClient
Main client class for generating responses.
Parameters:
- model (str, required): Model identifier
- temperature (float): Sampling temperature (0.0-2.0)
- top_p (float): Nucleus sampling parameter
- max_tokens (int): Maximum tokens to generate
- top_k (int): Top-k sampling parameter
- presence_penalty (float): Presence penalty (-2.0 to 2.0)
- frequency_penalty (float): Frequency penalty (-2.0 to 2.0)
- api_key (str): API key (uses env variable if None)
- use_cache (bool): Enable caching
- verbose (bool): Print debug information
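Putting the parameters together, a fully specified client might look like this (the values are illustrative, not recommendations):
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",  # required
    temperature=0.7,             # 0.0-2.0
    top_p=0.95,
    max_tokens=1000,
    top_k=40,
    presence_penalty=0.0,        # -2.0 to 2.0
    frequency_penalty=0.0,       # -2.0 to 2.0
    api_key=None,                # falls back to OPENROUTER_API_KEY
    use_cache=True,
    verbose=False
)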
Rollouts
Container for multiple responses.
Attributes:
- prompt: The input prompt
- responses: List of Response objects
- num_responses: Number of responses requested
- temperature, top_p, max_tokens: Generation parameters
- model: Model information
Methods:
- get_texts(): Get all full response texts (includes reasoning + content)
- get_reasonings(): Get reasoning portions only
- get_contents(): Get content portions only (post-reasoning text)
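A short sketch of the container and its accessors:
rollouts = client.generate("Explain entropy", n_samples=3)
print(rollouts.num_responses)           # number of responses requested
full_texts = rollouts.get_texts()       # reasoning + content for each response
reasonings = rollouts.get_reasonings()  # reasoning portions only
contents = rollouts.get_contents()      # post-reasoning text only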
Response
Individual response from the model.
Key Fields:
- full: The complete response text, formatted as reasoning_text + "\n</think>\n" + content_text
- content: The post-reasoning text (what comes after </think>)
- reasoning: The reasoning/thinking text (what comes before </think>)
- usage: Token usage statistics
- finish_reason: Why the response ended (e.g., "stop", "length")
Understanding the Think Token Format:
The full field is always structured with a </think> separator between reasoning and content:
reasoning_text
</think>
content_text
This format is used consistently even for models that don't natively use <think> tags:
- Models with native think support (DeepSeek R1, QwQ, Qwen): The reasoning appears naturally
- GPT-OSS models: OpenRouter returns reasoning in a separate field, which we format into this structure
- Models without reasoning: The full field contains just the content (no reasoning section)
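Given that structure, the relationship between the three fields can be checked directly. A minimal sketch, assuming reasoning is empty for models without a reasoning section:
for response in rollouts:
    if response.reasoning:
        # Documented format: reasoning, the </think> separator, then content
        assert response.full == response.reasoning + "\n</think>\n" + response.content
    else:
        # No reasoning section: full contains just the content
        assert response.full == response.content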
Important Note for GPT-OSS Models:
GPT-OSS models (like gpt-oss-20b and gpt-oss-120b) use OpenAI's Harmony format internally. On OpenRouter:
- Reasoning is returned in a separate reasoning field by the API
- You cannot inject or control thinking tokens for these models
- The </think> separator is added by this library for consistency
- If you need to control reasoning, use models like DeepSeek R1 or QwQ instead
Example accessing Response fields:
for response in rollouts:
    print(f"Full response: {response.full}")
    print(f"Just content: {response.content}")
    print(f"Just reasoning: {response.reasoning}")
    print(f"Tokens used: {response.usage.total_tokens}")
API Key Configuration
There are three ways to provide API keys:
1. Environment Variable (recommended for development)
export OPENROUTER_API_KEY="your-key-here"
2. Pass to Client (recommended for production)
client = RolloutsClient(
    model="qwen/qwen3-30b-a3b",
    api_key="your-key-here"
)
3. Pass at Generation Time (for per-request keys)
client = RolloutsClient(model="qwen/qwen3-30b-a3b")
responses = client.generate(
    "Your prompt",
    n_samples=5,
    api_key="different-key-here"  # Overrides any default
)
Note: API keys are never cached or stored to disk.
Known Limitations
Logprobs Not Supported
This package does not currently support logprobs (log probabilities). If you try to use top_logprobs, you'll get a NotImplementedError:
# This will raise an error:
client = RolloutsClient(
    model="openai/gpt-3.5-turbo",
    top_logprobs=5  # ❌ Not supported
)
Why? OpenRouter's logprobs support is inconsistent across providers: testing against multiple providers showed that logprobs do not work reliably through OpenRouter's API. Until this is resolved upstream, the feature is not implemented in this package.
If you need logprobs, you may need to use the providers' APIs directly rather than through OpenRouter.
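If your code might receive a configuration that sets top_logprobs, you can catch the documented error and retry without it. A minimal sketch:
try:
    client = RolloutsClient(model="openai/gpt-3.5-turbo", top_logprobs=5)
except NotImplementedError:
    # logprobs are not supported through OpenRouter; drop the parameter
    client = RolloutsClient(model="openai/gpt-3.5-turbo")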
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.