Beacon

A benchmarking toolkit that evaluates and compares RAG chunking strategies against your actual queries to find the optimal configuration.

Why Beacon?

RAG quality is largely determined by retrieval quality, and retrieval quality depends heavily on chunking. Yet most teams:

  • Use default chunk sizes (512 tokens) without evaluation
  • Don't know if semantic chunking would improve their specific use case
  • Spend weeks manually testing different configurations
  • Have no metrics to compare strategies objectively

Beacon automates this with reproducible benchmarks.

Features

  • Multi-Strategy Evaluation - Compare 5+ chunking strategies in one run
  • Standard IR Metrics - MRR, Recall@K, NDCG, Precision, MAP
  • Bayesian Auto-Tuning - Find optimal chunk size and overlap automatically
  • HTML Reports - Beautiful, shareable benchmark reports
  • Local-First - Run entirely offline with local embeddings
  • Framework Agnostic - Works with any RAG implementation

Installation

pip install beacon-eval

For auto-tuning support:

pip install "beacon-eval[optuna]"

Quick Start

1. Create a configuration file

beacon init

This creates a beacon.yaml configuration file.

2. Prepare your queries

Create a queries.jsonl file with your test queries:

{"query": "What is the refund policy?", "relevant_doc_ids": ["policy.pdf"]}
{"query": "How do I reset my password?", "relevant_doc_ids": ["faq.pdf"]}

3. Run the benchmark

beacon run beacon.yaml

Configuration

name: my_benchmark
documents:
  - ./docs/*.pdf
  - ./docs/*.txt
queries: ./queries.jsonl
embedding_model: all-MiniLM-L6-v2
top_k: 10

strategies:
  - name: fixed_256
    type: fixed_size
    chunk_size: 256
    overlap: 25

  - name: fixed_512
    type: fixed_size
    chunk_size: 512
    overlap: 50

  - name: sentence_based
    type: sentence
    chunk_size: 512

  - name: recursive
    type: recursive
    chunk_size: 512
    overlap: 50

output_dir: ./results
generate_html_report: true
export_csv: true
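
Before a long run it can help to sanity-check the strategy entries. The following is a minimal sketch that treats each `strategies` entry as a plain dict; the function `check_strategies` is hypothetical, and the rule that `overlap` must be smaller than `chunk_size` is an assumption made here, not a documented Beacon constraint.

```python
def check_strategies(strategies):
    """Return a list of human-readable problems found in strategy entries."""
    problems = []
    seen = set()
    for entry in strategies:
        name = entry.get("name", "<unnamed>")
        if name in seen:
            problems.append(f"{name}: duplicate strategy name")
        seen.add(name)
        size = entry.get("chunk_size")
        overlap = entry.get("overlap", 0)
        if size is not None and size <= 0:
            problems.append(f"{name}: chunk_size must be positive")
        if size is not None and not 0 <= overlap < size:
            problems.append(f"{name}: overlap must be in [0, chunk_size)")
    return problems


# The four entries from the configuration above, written as plain dicts.
strategies = [
    {"name": "fixed_256", "type": "fixed_size", "chunk_size": 256, "overlap": 25},
    {"name": "fixed_512", "type": "fixed_size", "chunk_size": 512, "overlap": 50},
    {"name": "sentence_based", "type": "sentence", "chunk_size": 512},
    {"name": "recursive", "type": "recursive", "chunk_size": 512, "overlap": 50},
]
print(check_strategies(strategies))  # prints []
```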

Chunking Strategies

Strategy     Description
fixed_size   Split by fixed token/character count
sentence     Split by sentence boundaries
paragraph    Split by paragraph boundaries
semantic     Split by semantic similarity
recursive    Recursively split with multiple separators
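
To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not Beacon's implementation: the function `fixed_size_chunks` is hypothetical, and it assumes whitespace-separated words as tokens, whereas a real tokenizer would count differently.

```python
def fixed_size_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = "the quick brown fox jumps over the lazy dog".split()
for chunk in fixed_size_chunks(tokens, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# the quick brown fox
# fox jumps over the
# the lazy dog
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when it is short enough.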

Python API

from pathlib import Path

from beacon import BenchmarkConfig, ChunkingStrategy
from beacon.runner import run_benchmark
from beacon.parsers import load_documents, load_queries

# Load your data
documents = load_documents([Path("./docs")])
queries = load_queries(Path("./queries.jsonl"))

# Define strategies
strategies = [
    ChunkingStrategy(name="small", strategy_type="fixed_size", chunk_size=256),
    ChunkingStrategy(name="medium", strategy_type="fixed_size", chunk_size=512),
    ChunkingStrategy(name="large", strategy_type="fixed_size", chunk_size=1024),
]

# Run benchmark
config = BenchmarkConfig(
    name="my_benchmark",
    documents=[Path("./docs")],
    queries_file=Path("./queries.jsonl"),
    strategies=strategies,
)

result = run_benchmark(config)
print(f"Best strategy: {result.best_strategy}")

Auto-Tuning

Find optimal chunking parameters automatically:

from pathlib import Path

from beacon.tuner import auto_tune
from beacon.parsers import load_documents, load_queries
from beacon.models import ChunkingStrategyType

documents = load_documents([Path("./docs")])
queries = load_queries(Path("./queries.jsonl"))

result = auto_tune(
    documents=documents,
    queries=queries,
    strategy_type=ChunkingStrategyType.FIXED_SIZE,
    metric="mrr",
    n_trials=50,
    chunk_size_range=(100, 2000),
)

print(f"Optimal chunk size: {result['chunk_size']}")
print(f"Optimal overlap: {result['chunk_overlap']}")
print(f"Best MRR: {result['best_mrr']:.4f}")

Metrics

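Metric        Description                                                      Good Score
MRR           Mean Reciprocal Rank (1/rank of the first relevant result)       > 0.7
Recall@K      Fraction of relevant docs retrieved in the top K                 > 0.8
NDCG@K        Normalized Discounted Cumulative Gain over the top K             > 0.75
Precision@K   Fraction of the top K results that are relevant                  > 0.6
MAP           Mean Average Precision across queries                            > 0.6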
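
The definitions above can be written out for a single query as follows. This is a minimal reference sketch using the textbook formulas with binary relevance, not Beacon's own code; the function names and the document IDs `terms.pdf` and `pricing.pdf` are made up for illustration.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc_id in relevant for doc_id in ranked[:k]) / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["faq.pdf", "policy.pdf", "terms.pdf", "pricing.pdf"]
relevant = {"policy.pdf", "pricing.pdf"}
print(reciprocal_rank(ranked, relevant))    # 0.5
print(recall_at_k(ranked, relevant, 3))     # 0.5
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries in the test set.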

CLI Commands

# Initialize sample configuration
beacon init

# Run benchmark
beacon run beacon.yaml

# List available strategies
beacon strategies

# Compare multiple benchmark results
beacon compare results1.json results2.json

Output

Beacon generates:

  • results.json - Full results in JSON format
  • results.csv - Comparison table in CSV
  • report.html - Interactive HTML report with charts

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Security

If you discover a security vulnerability, please see SECURITY.md for reporting guidelines.

License

MIT License - see LICENSE for details.
