AudioJudge 🎵
A Python wrapper for audio comparison and evaluation using a Large Audio Model as Judge (i.e., LAM-as-a-Judge or AudioJudge) with support for in-context learning and flexible audio concatenation strategies.
Features
- Multi-Model Support: Works with OpenAI and Google audio models (the GPT-4o-audio family and the Gemini 1.5/2.0/2.5 Flash families)
- Flexible Audio Comparison: Support for both pairwise and pointwise audio evaluation
- In-Context Learning: Provide examples to improve model performance
- Audio Concatenation: Multiple strategies for combining audio files
- Smart Caching: Built-in API response caching to reduce costs and latency
Installation
```shell
pip install audiojudge  # requires Python >= 3.10
```
Quick Start
```python
from audiojudge import AudioJudge

# Initialize with API keys
judge = AudioJudge(
    openai_api_key="your-openai-key",
    google_api_key="your-google-key"
)

# Simple pairwise comparison
result = judge.judge_audio(
    audio1_path="audio1.wav",
    audio2_path="audio2.wav",
    system_prompt="Compare these two audio clips for quality.",
    model="gpt-4o-audio-preview"
)
print(result["response"])
```
Configuration
Environment Variables
Set your API keys as environment variables:
```shell
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key"
export EVAL_CACHE_DIR=".audio_cache"    # optional
export EVAL_DISABLE_CACHE="false"       # optional
```
AudioJudge Parameters
```python
judge = AudioJudge(
    openai_api_key=None,            # OpenAI API key (optional if env var is set)
    google_api_key=None,            # Google API key (optional if env var is set)
    temp_dir="temp_audio",          # directory for temporary concatenated audio files
    signal_folder="signal_audios",  # TTS signal files used during concatenation;
                                    # defaults ship with the package, and new ones
                                    # are generated via TTS if needed
    cache_dir=None,                 # API cache directory (default: .eval_cache)
    cache_expire_seconds=2592000,   # cache expiration (30 days)
    disable_cache=False             # disable caching
)
```
Core Methods
1. Pairwise Audio Comparison
1.1. Pairwise Comparison without Instruction Audio
Compare two audio files and get a model response directly:
```python
result = judge.judge_audio(
    audio1_path="speaker1.wav",
    audio2_path="speaker2.wav",
    system_prompt="Which speaker sounds more professional?",  # evaluation criteria, placed first
    user_prompt="Analyze both speakers and provide your assessment.",  # optional instructions, placed last
    model="gpt-4o-audio-preview",
    temperature=0.1,  # 0.0 is not supported by some API backends
    max_tokens=500    # maximum response length
)

if result["success"]:
    print(f"Model response: {result['response']}")
else:
    print(f"Error: {result['error']}")
```
1.2. Pairwise Comparison with Instruction Audio
For scenarios where both audio clips are responses to the same instruction (e.g., comparing two speech-in speech-out systems):
```python
result = judge.judge_audio(
    audio1_path="system_a_response.wav",          # response from system A
    audio2_path="system_b_response.wav",          # response from system B
    instruction_path="original_instruction.wav",  # instruction both systems responded to
    system_prompt="Compare which response better follows the given instruction.",
    model="gpt-4o-audio-preview"
)
print(f"Better response: {result['response']}")
```
2. Pointwise Audio Evaluation
Evaluate a single audio file:
```python
result = judge.judge_audio_pointwise(
    audio_path="speech.wav",
    system_prompt="Rate the speech quality from 1-10.",
    model="gpt-4o-audio-preview"
)
print(f"Quality rating: {result['response']}")
```
In-Context Learning
Improve model performance by providing examples:
Pairwise Examples
```python
from audiojudge.utils import AudioExample

# Create examples
examples = [
    AudioExample(
        audio1_path="example1_good.wav",
        audio2_path="example1_bad.wav",
        output="Audio 1 is better quality with clearer speech."
        # Optional: instruction_path="instruction1.wav" for instruction-based evaluation
    ),
    AudioExample(
        audio1_path="example2_noisy.wav",
        audio2_path="example2_clean.wav",
        output="Audio 2 is better due to less background noise."
    )
]

# Use examples in evaluation
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare audio quality and choose the better one.",
    examples=examples,
    model="gpt-4o-audio-preview"
)
```
Pointwise Examples
```python
from audiojudge.utils import AudioExamplePointwise

examples = [
    AudioExamplePointwise(
        audio_path="high_quality.wav",
        output="9/10 - Excellent clarity and no background noise"
    ),
    AudioExamplePointwise(
        audio_path="medium_quality.wav",
        output="6/10 - Acceptable quality with minor distortions"
    )
]

result = judge.judge_audio_pointwise(
    audio_path="test_audio.wav",
    system_prompt="Rate the audio quality from 1-10 with explanation.",
    examples=examples,
    model="gpt-4o-audio-preview"
)
```
Audio Concatenation Methods
Control how audio files are combined for model input:
Available Methods
For Pairwise Evaluation:
- `no_concatenation`: keep all audio files separate
- `pair_example_concatenation`: concatenate each example pair
- `examples_concatenation`: concatenate all examples into one file
- `test_concatenation`: concatenate the test audio pair
- `examples_and_test_concatenation` (default): concatenate all examples and the test audio; shown to be the most effective prompting strategy
For Pointwise Evaluation:
- `no_concatenation` (default): keep all audio files separate
- `examples_concatenation`: concatenate all examples into one file
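Conceptually, concatenation stitches several clips (separated by spoken signal markers) into one file before sending it to the model. A rough illustration of the idea using only the stdlib `wave` module; AudioJudge's actual implementation, which uses TTS-generated signal audio rather than silence, differs:

```python
import wave

def concatenate_wavs(paths, out_path, gap_seconds=0.5):
    """Join several mono WAV files (same format) into one, with a short
    silent gap between clips as a stand-in for spoken signal markers.
    Illustrative sketch only, not AudioJudge's implementation."""
    frames, params = [], None
    for path in paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()
            frames.append(w.readframes(w.getnframes()))
    # Silence: gap_seconds worth of zero-valued frames
    gap = b"\x00" * int(params.framerate * gap_seconds) * params.sampwidth
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(gap.join(frames))

# e.g. concatenate_wavs(["audio1.wav", "audio2.wav"], "combined.wav")
```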
Example Usage
```python
# Pairwise: keep everything separate
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare these audio clips.",
    concatenation_method="no_concatenation"
)

# Pairwise: concatenate all for better context (recommended)
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare these audio clips.",
    examples=examples,
    concatenation_method="examples_and_test_concatenation"
)

# Pointwise: with example concatenation
result = judge.judge_audio_pointwise(
    audio_path="test.wav",
    system_prompt="Rate the audio quality from 1-10.",
    examples=pointwise_examples,
    concatenation_method="examples_concatenation"
)
```
Instruction Audio
Use audio files as instructions for more complex tasks:
With Examples
```python
# Examples with instruction audio
examples = [
    AudioExample(
        audio1_path="example1.wav",
        audio2_path="example2.wav",
        instruction_path="instruction_example.wav",
        output="Audio 1 follows the instruction better."
    )
]

result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    instruction_path="instruction.wav",
    system_prompt="Follow the audio instruction to evaluate these clips.",
    examples=examples,
    model="gpt-4o-audio-preview"
)
```
Supported Models
OpenAI Models
- `gpt-4o-audio-preview` (recommended)
- `gpt-4o-mini-audio-preview`
Google Models
- `gemini-1.5-flash`
- `gemini-2.0-flash`
- `gemini-2.5-flash`
```python
# Using different models
result_gpt = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare quality.",
    model="gpt-4o-audio-preview"
)

result_gemini = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare quality.",
    model="gemini-2.0-flash"
)
```
Caching
AudioJudge includes intelligent caching to reduce API costs and improve performance:
Cache Management
```python
# Clear entire cache
judge.clear_cache()

# Clear only failed (None) responses
valid_entries = judge.clear_none_cache()
print(f"Kept {valid_entries} valid cache entries")

# Get cache statistics
stats = judge.get_cache_stats()
print(f"Cache entries: {stats['total_entries']}")
```
Cache Configuration
```python
# Disable caching
judge = AudioJudge(disable_cache=True)

# Custom cache directory and expiration
judge = AudioJudge(
    cache_dir="my_audio_cache",
    cache_expire_seconds=86400  # 1 day
)
```
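A cache hit requires the request to match exactly, so changing any input (audio content, prompt, or model) triggers a fresh API call. A simplified sketch of content-based cache keying; AudioJudge's real key derivation is internal and may differ:

```python
import hashlib

def cache_key(audio_paths, system_prompt, model, temperature=0.1):
    """Derive a stable cache key from audio file contents and request
    parameters. Illustrative only; not AudioJudge's actual scheme."""
    h = hashlib.sha256()
    for path in audio_paths:
        with open(path, "rb") as f:
            h.update(f.read())  # hash bytes, so edits to the file bust the cache
    h.update(f"{system_prompt}|{model}|{temperature}".encode())
    return h.hexdigest()
```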
Advanced Usage
Error Handling
```python
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare these audio clips."
)

if result["success"]:
    response = result["response"]
    model_used = result["model"]
    print(f"Success with {model_used}: {response}")
else:
    error_message = result["error"]
    print(f"Evaluation failed: {error_message}")
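Since transient API failures surface as `success=False` rather than exceptions, batch runs benefit from a thin retry wrapper. A sketch; the wrapper and its backoff policy are illustrative, not part of the package:

```python
import time

def judge_with_retry(judge_fn, max_attempts=3, backoff_seconds=1.0, **kwargs):
    """Call a judge method (e.g. judge.judge_audio) and retry on failure.

    judge_fn is any callable returning AudioJudge's result dict with a
    'success' key. Illustrative helper, not part of the package.
    """
    result = {"success": False, "error": "not attempted"}
    for attempt in range(1, max_attempts + 1):
        result = judge_fn(**kwargs)
        if result.get("success"):
            return result
        if attempt < max_attempts:
            time.sleep(backoff_seconds * (2 ** (attempt - 1)))  # exponential backoff
    return result  # last failing result

# e.g. judge_with_retry(judge.judge_audio, audio1_path="a.wav", audio2_path="b.wav",
#                       system_prompt="Compare quality.")
```

Note that cached failures can be purged with `judge.clear_none_cache()` before retrying.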
Temperature and Token Control
```python
# Near-deterministic output (0.0 is not supported by some API backends)
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Compare quality.",
    temperature=0.000001,
    max_tokens=100
)

# More creative output
result = judge.judge_audio(
    audio1_path="test1.wav",
    audio2_path="test2.wav",
    system_prompt="Describe these audio clips creatively.",
    temperature=0.8,
    max_tokens=500
)
```
Best Practices
1. System Prompt Design
```python
# Good: specific and clear
system_prompt = """
You are an audio quality expert. Compare two audio clips and determine which has:
1. Better speech clarity
2. Less background noise
3. More natural sound
Respond with: "Audio 1" or "Audio 2" followed by your reasoning.
"""

# Avoid: vague instructions
system_prompt = "Which audio is better?"
```
2. Example Selection
```python
# Use diverse, representative examples
examples = [
    AudioExample(
        audio1_path="clear.wav",
        audio2_path="muffled.wav",
        output="Audio 1 - clearer speech"
    ),
    AudioExample(
        audio1_path="noisy.wav",
        audio2_path="clean.wav",
        output="Audio 2 - less background noise"
    ),
    AudioExample(
        audio1_path="fast.wav",
        audio2_path="normal.wav",
        output="Audio 2 - better pacing"
    )
]
```
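When tuning an example set, it helps to measure agreement with gold labels on a held-out set. A sketch of such a loop; the harness, its label-matching logic, and the dataset layout are all hypothetical:

```python
def evaluate_examples(judge_fn, test_pairs, examples):
    """Fraction of test pairs where the judge's response mentions the
    gold verdict. test_pairs is a list of (audio1, audio2, gold) tuples
    where gold is 'Audio 1' or 'Audio 2'. Illustrative harness only,
    not part of the package."""
    correct = 0
    for audio1, audio2, gold in test_pairs:
        result = judge_fn(
            audio1_path=audio1,
            audio2_path=audio2,
            system_prompt="Compare audio quality and choose the better one.",
            examples=examples,
        )
        # Crude matching: count a hit if the gold verdict appears verbatim
        if result.get("success") and gold in result.get("response", ""):
            correct += 1
    return correct / len(test_pairs)

# e.g. evaluate_examples(judge.judge_audio, labeled_pairs, examples)
```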
3. Concatenation Strategy
- Use `no_concatenation` for simple cases or when preserving individual audio quality is crucial
- Use `examples_and_test_concatenation` when you have examples (recommended for best performance)
- Consider model context limits when choosing a strategy
4. Model Selection
- GPT-4o Audio: Best for complex reasoning and detailed analysis
- Gemini 2.0+: Good for general comparisons, potentially faster and more cost-effective
Research and Experiments
This package is based on research in audio evaluation using large audio models. The experimental code and evaluation scripts used in our research are available in the experiments/ folder for reproducing the results.
Example Usage
Additional usage examples can be found in the examples/ folder, which wraps some of our experiments into the package for demonstration:
- `examples/audiojudge_usage.py`: pairwise comparison without instruction audio (datasets: somos, thaimos, tmhintq, pronunciation, speed, speaker evaluations)
- `examples/audiojudge_usage_with_instruction.py`: pairwise comparison with instruction audio (datasets: system-level comparisons including ChatbotArena and SpeakBench)
- `examples/audiojudge_usage_pointwise.py`: pointwise evaluation (datasets: somos, thaimos, tmhintq)
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues and questions:
- GitHub Issues: Create an issue