A library for running controlled experiments with LLMs using different sampling methods


Try it yourself | Installation | Quick Start | Reproduce Experiments | Citation


Try it yourself

Example 1: Prepend to your own prompts in a chat interface

Copy and paste this prompt into any chat interface (ChatGPT, Claude, Gemini, etc.):

Generate 10 responses to the user query, each within a separate <response> tag. Each response should be 50-100 words.
Each <response> must include a <text> and a numeric <probability>. Randomly sample the responses from the full distribution.
Return ONLY the responses in JSON format, with no additional explanations or text.

<user_query>Write a short story about a bear.</user_query>
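
If you want to consume the output programmatically, the returned JSON is straightforward to parse. A minimal sketch, assuming the model returns a JSON array of objects with text and probability fields (the exact shape can vary by model and may need light cleanup):

import json

# Example raw output (shape assumed; real models may wrap it differently).
raw = '''[
  {"text": "The bear padded down to the river at dawn...", "probability": 0.20},
  {"text": "Nobody in the village believed a bear could paint...", "probability": 0.10}
]'''

responses = json.loads(raw)
for r in sorted(responses, key=lambda r: r["probability"], reverse=True):
    print(f"p={r['probability']:.2f}  {r['text']}")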

Example 2: Query via API

Use this curl command to try VS-Standard with the OpenAI API. Replace gpt-4.1 with your model of choice:

export OPENAI_API_KEY="your_openai_key"
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "messages": [
      {
        "role": "system",
        "content": "Generate 10 responses to the input prompt, each within a separate <response> tag. Each response should be 50-100 words. Each <response> must include a <text> and a numeric <probability>. Randomly sample the responses from the full distribution. Return ONLY the responses, with no additional explanations or text."
      },
      {
        "role": "user",
        "content": "Write a short story about a bear."
      }
    ],
    "temperature": 1.0
  }'
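
The same request in Python, using the official openai client (a sketch; requires pip install openai and the OPENAI_API_KEY environment variable set as above):

from openai import OpenAI

SYSTEM_PROMPT = (
    "Generate 10 responses to the input prompt, each within a separate "
    "<response> tag. Each response should be 50-100 words. Each <response> "
    "must include a <text> and a numeric <probability>. Randomly sample the "
    "responses from the full distribution. Return ONLY the responses, with "
    "no additional explanations or text."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4.1",  # swap in your model of choice
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a short story about a bear."},
    ],
    temperature=1.0,
)
print(completion.choices[0].message.content)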

📓 Interactive Notebooks

Explore verbalized sampling with our interactive Jupyter notebooks:

| Notebook | Description | Code | Run it Yourself! |
| --- | --- | --- | --- |
| Direct vs. Verbalized Sampling | Head-to-head comparison demonstrating VS effectiveness: 2-3x diversity improvement in creative tasks while maintaining quality | View on GitHub | Open In Colab |
| Image Generation with VS | Visual comparison of Direct Prompting vs. Verbalized Sampling for text-to-image generation, showcasing creative diversity in artistic styles | View on GitHub | Open In Colab |
| Complete Framework Tutorial | Step-by-step guide to using verbalized sampling: API basics, transforms, selection methods, recipes, and advanced features | View on GitHub | Open In Colab |

💡 Tip: Start with Direct vs. Verbalized Sampling to see the improvement first-hand, then explore Image Generation with VS for visual results, or dive into the Complete Framework Tutorial to learn the full API!

Introduction

Verbalized Sampling (VS) is a prompting strategy that mitigates mode collapse in Large Language Models by explicitly requesting multiple responses together with their associated probabilities. Key features:

  • Training-Free: Works with any LLM without fine-tuning—simply apply VS prompts to unlock diversity.
  • Model-Agnostic: Compatible with GPT, Claude, Gemini, and open models like Llama and Qwen.
  • Measurable Impact: Achieves 2-3x diversity improvement in creative writing while maintaining quality.
  • Versatile Applications: Supports creative writing, synthetic data generation, open-ended QA.
  • Complete Framework: Includes task implementations, evaluation metrics, and reproducible experiments from our paper.
  • Easy to Use: Simple CLI and Python API for running experiments and comparing methods.

Updates

  • 🎉 10/01/2025: We released our paper, code, and package. Check the release page for more details.

Installation

# Lightweight install (API-based models only)
pip install verbalized-sampling

# With GPU support for local models (vLLM, torch, transformers)
pip install "verbalized-sampling[gpu]"

# Development install
pip install "verbalized-sampling[dev]"

# Complete install
pip install "verbalized-sampling[gpu,dev]"

API Keys Setup

export OPENAI_API_KEY="your_openai_key"
export OPENROUTER_API_KEY="your_openrouter_key"

Quick Start

Command Line Interface

# List available tasks and methods
verbalize list-tasks
verbalize list-methods

# Run an experiment
verbalize run \
    --task joke \
    --model "gpt-4.1" \
    --methods "vs_standard direct vs_cot vs_multi" \
    --num-responses 50

# Run a quick test (TODO: add this support to the CLI)
verbalize run \
    --task joke \
    --prompt "Write a joke about the weather." \
    --model "gpt-4.1" \
    --methods "direct vs_standard sequence vs_multi" \
    --num-responses 50 \
    --metrics "diversity length ngram joke_quality"

# Run a multi-turn persuasion dialogue experiment
verbalize dialogue \
  --persuader-model "gpt-4.1" \
  --persuadee-model "gpt-4.1" \
  --method direct \
  --num-conversations 5 \
  --num-samplings 4 \
  --max-turns 10 \
  --word-limit 160 \
  --temperature 0.7 \
  --top-p 0.9 \
  --max-tokens 500 \
  --response-selection probability \
  --evaluate \
  --output-file results/dialogue/persuasion_vs_standard.jsonl

Python API

from verbalized_sampling.pipeline import run_quick_comparison
from verbalized_sampling.tasks import Task
from verbalized_sampling.prompts import Method

# Run a quick comparison
results = run_quick_comparison(
    task=Task.JOKE,
    methods=[Method.DIRECT, Method.VS_STANDARD],
    model_name="anthropic/claude-sonnet-4",
    metrics=["diversity", "length", "ngram"],
    num_responses=50,
)

print(f"VS Diversity: {results['VS_STANDARD']['diversity']:.2f}")
print(f"Direct Diversity: {results['DIRECT']['diversity']:.2f}")

Example Usage

from verbalized_sampling.tasks import get_task, Task
from verbalized_sampling.prompts import Method

# Create a task
task = get_task(Task.STORY, num_prompts=10, random_seed=42)

# Generate diverse responses
# (assumes `model` is any LLM interface; see verbalized_sampling.llms)
vs_prompt = task.get_prompt(Method.VS_STANDARD, num_samples=5, prompt_index=0)
responses = model.generate(vs_prompt)
parsed = task.parse_response(Method.VS_STANDARD, responses)
# Returns: [{"response": "...", "probability": 0.15}, ...]

# Chain-of-thought reasoning
cot_prompt = task.get_prompt(Method.VS_COT, num_samples=3)
cot_responses = model.generate(cot_prompt)
parsed_cot = task.parse_response(Method.VS_COT, cot_responses)
# Returns: [{"reasoning": "...", "response": "...", "probability": 0.22}, ...]

Reproducing Paper Results

For detailed instructions on reproducing all experiments from our paper, including exact commands, parameter settings, and expected outputs, see:

📊 EXPERIMENTS.md - Complete Experiment Replication Guide

The guide provides a one-to-one mapping between paper sections (§5-8) and the corresponding experiment scripts.

Key Results

Our experiments demonstrate consistent improvements across tasks and models:

  • Creative Writing: 2-3x diversity improvement while maintaining quality
  • Bias Mitigation: Near-uniform sampling (KL divergence of 0.027, vs. 0.926 for direct prompting)
  • Emergent Scaling: Larger models show greater benefits from VS
  • Safety: Preserved refusal rates for harmful content
  • Tunable Diversity: Control output diversity via probability thresholds (see the sketch below)
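
On the last point, one way to realize a probability threshold is to filter parsed responses after the fact: keeping only responses the model itself marks as low-probability biases the pool toward the tail of the distribution, i.e., toward less typical outputs. A minimal post-hoc sketch (the paper applies its threshold at prompting time; this hypothetical helper just illustrates the knob):

def filter_by_threshold(parsed, tau=0.10):
    """Keep responses with verbalized probability below tau.

    Lower tau -> only tail (less typical) responses survive; tau=1.0 keeps all.
    """
    return [r for r in parsed if r["probability"] < tau]

parsed = [
    {"response": "A classic knock-knock weather joke...", "probability": 0.45},
    {"response": "An absurdist joke about sentient fog...", "probability": 0.08},
]
print(filter_by_threshold(parsed, tau=0.10))  # only the 0.08 response survives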

Repository Structure

verbalized_sampling/           # Main package
├── tasks/                     # Task implementations
│   ├── creativity/           # Creative writing tasks
│   ├── synthetic_data/       # Data generation tasks
│   ├── bias/                # Bias mitigation tasks
│   └── safety/              # Safety evaluation
├── prompts/                  # VS method implementations
├── llms/                     # Model interfaces
├── evals/                    # Evaluation metrics
└── cli.py                    # Command line interface

scripts/tasks/                 # Experimental scripts
├── run_poem.py               # Poetry experiments
├── run_story.py              # Story generation
├── run_jokes.py              # Joke writing
├── run_positive_*.py         # Synthetic data generation
├── run_rng.py                # Random number generation
├── run_state_name.py         # Geographic bias
└── run_safety.py             # Safety evaluation

Development

# Install development dependencies
pip install -e ".[dev]"

# Code formatting and linting
black .
isort .
ruff check .
mypy .

# Run tests
pytest

Citation

If you use Verbalized Sampling in your research, please cite our paper:

@misc{zhang2025verbalizedsamplingmitigatemode,
  title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity},
  author={Jiayi Zhang and Simon Yu and Derek Chong and Anthony Sicilia and Michael R. Tomz and Christopher D. Manning and Weiyan Shi},
  year={2025},
  eprint={2510.01171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.01171}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
