Fair embedding model evaluation with independent parameter optimization

embedding-eval

Why This Package?

Most embedding comparisons are unfair because they tune parameters (chunk size, overlap, top-k) for one model and then apply them to every model. This package implements a fair comparison methodology:

| Approach | Description | Fair? |
|----------|-------------|-------|
| Unfair | Optimize parameters for Model A, apply to all models | No |
| Fair | Each model gets its own optimized parameters | Yes |

Key Features

  • Independent Optimization: Each model gets its own best chunk_size, overlap, and top_k
  • Binary Evaluation: Simple substring matching, no LLM cost, reproducible
  • Confidence Intervals: Reports 95% confidence intervals using the Wilson score interval
  • Minimal Dependencies: Core functionality requires only sentence-transformers and tiktoken
  • No External Services: InMemoryVectorStore requires no database setup

Installation

pip install embedding-eval

# For OpenAI models
pip install embedding-eval[openai]

Quick Start

from embedding_eval import run_fair_comparison

# Your document content (a context manager closes the file after reading)
with open("document.txt", encoding="utf-8") as f:
    doc_content = f.read()

# Q&A pairs where answers appear VERBATIM in the document
qa_pairs = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "When was the company founded?", "answer": "1995"},
    # ... more pairs (recommend 80+ for statistical power)
]

# Compare models with independent optimization
results = run_fair_comparison(
    models=["st:bge-base", "st:minilm"],
    doc_content=doc_content,
    qa_pairs=qa_pairs,
)

# Results include baseline + optimized + confidence intervals
for r in results:
    print(f"{r.model_name}:")
    print(f"  Baseline: {r.baseline_accuracy:.1f}%")
    print(f"  Optimized: {r.best_accuracy:.1f}% (95% CI: [{r.ci_lower:.1f}%, {r.ci_upper:.1f}%])")
    print(f"  Best params: {r.best_params}")

Methodology

Fair Comparison = Independent Optimization

┌────────────────────────────────────────────────────────────────┐
│  FAIR COMPARISON METHODOLOGY                                   │
│                                                                │
│  For each model:                                               │
│    1. Grid search over chunk_size × overlap × top_k            │
│    2. Find best parameters for THIS model                      │
│    3. Report: baseline + optimized + 95% CI                    │
│                                                                │
│  Compare models using their respective best configurations     │
└────────────────────────────────────────────────────────────────┘
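The per-model loop in the box above can be sketched as follows. This is a minimal illustration, not the package's implementation: `best_params_per_model` and `toy_score` are hypothetical names, and the scoring function stands in for whatever retrieval accuracy measurement is actually used.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Params = Tuple[int, int, int]  # (chunk_size, overlap, top_k)


def best_params_per_model(
    models: List[str],
    score: Callable[[str, Params], float],
    chunk_sizes: List[int],
    overlaps: List[int],
    top_ks: List[int],
) -> Dict[str, Tuple[Params, float]]:
    """Run an independent grid search for each model and keep its own best result."""
    grid = [
        (size, overlap, k)
        for size, overlap, k in product(chunk_sizes, overlaps, top_ks)
        if overlap < size  # skip settings where the overlap swallows the chunk
    ]
    results = {}
    for model in models:
        scored = [(params, score(model, params)) for params in grid]
        results[model] = max(scored, key=lambda item: item[1])
    return results


# Hypothetical scoring function: each model peaks at a different chunk size.
def toy_score(model: str, params: Params) -> float:
    preferred = {"model_a": 256, "model_b": 512}[model]
    chunk_size, _overlap, top_k = params
    return 1.0 - abs(chunk_size - preferred) / 1000 + top_k / 1000


best = best_params_per_model(
    ["model_a", "model_b"], toy_score, [256, 384, 512], [25, 50], [5, 10]
)
for model, (params, accuracy) in best.items():
    print(model, params, round(accuracy, 3))
```

With the toy scores, each model ends up with a different best chunk size, which is exactly the situation where reusing one model's parameters for another would understate the second model's accuracy.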

Binary Evaluation

We use substring matching to check if the expected answer appears in retrieved chunks:

from embedding_eval import BinaryEvaluator

evaluator = BinaryEvaluator()
score = evaluator.evaluate(
    question="What is the capital?",
    expected_answer="Paris",
    retrieved_chunks=["France is a country. Paris is its capital."]
)
print(score.score)  # 1.0 (answer found)
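For intuition, the substring check might look roughly like the sketch below. The function names are hypothetical, and the lowercasing and whitespace normalization are assumptions; the package's `BinaryEvaluator` may match more strictly.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return " ".join(text.lower().split())


def binary_score(expected_answer: str, retrieved_chunks: List[str]) -> float:
    """Return 1.0 if the answer is a substring of any retrieved chunk, else 0.0."""
    needle = normalize(expected_answer)
    if not needle:
        return 0.0  # an empty answer would otherwise match everything
    found = any(needle in normalize(chunk) for chunk in retrieved_chunks)
    return 1.0 if found else 0.0


chunks = ["France is a country. Paris is its capital."]
print(binary_score("Paris", chunks))
print(binary_score("Berlin", chunks))
```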

Why binary evaluation?

  • Simple and reproducible
  • No LLM cost ($0 vs ~$0.03/question for LLM evaluation)
  • Proven effective for parameter optimization (see EDD-005)
  • RAGAS and LLM evaluation add cost without improving decisions

Statistical Requirements

| Sample Size | 95% CI Half-Width | Can Detect |
|-------------|-------------------|------------|
| 50 | ±11% | >22% differences |
| 80 | ±9% | >18% differences |
| 100 | ±8% | >16% differences |

Recommendation: Use 80+ questions with 20%+ multi-hop for meaningful comparisons.
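To see where interval widths of this size come from, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` is hypothetical and not part of the package. The loop assumes an accuracy of 80%, at which the half-widths come out close to the table's figures; that the table was computed at this accuracy is an inference, not something the source states.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


# Assumed accuracy of 80% at each sample size from the table.
for n in (50, 80, 100):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n}: [{low:.1%}, {high:.1%}], half-width ±{(high - low) / 2:.1%}")
```

Two models whose intervals overlap heavily cannot be reliably ranked, which is why the detectable difference is roughly twice the half-width.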

Q&A Fixture Format

[
  {
    "question": "What does BATNA stand for?",
    "answer": "Best Alternative To a Negotiated Agreement",
    "category": "exact",
    "difficulty": "medium"
  }
]

Important: Answers must appear verbatim in the document.
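A quick way to check this requirement by hand is sketched below, under the strictest reading of "verbatim" (exact, case-sensitive substring). The helper `find_missing_answers` is hypothetical; the package's own `validate_fixture` is the supported route.

```python
from typing import Dict, List


def find_missing_answers(
    qa_pairs: List[Dict[str, str]], doc_content: str
) -> List[Dict[str, str]]:
    """Return the Q&A pairs whose answer is not an exact substring of the document."""
    return [pair for pair in qa_pairs if pair["answer"] not in doc_content]


doc = "BATNA stands for Best Alternative To a Negotiated Agreement."
pairs = [
    {"question": "What does BATNA stand for?",
     "answer": "Best Alternative To a Negotiated Agreement"},
    {"question": "What does BATNA mean?",
     "answer": "best alternative to a negotiated agreement"},  # wrong casing
]
missing = find_missing_answers(pairs, doc)
for pair in missing:
    print("Not found verbatim:", pair["answer"])
```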

Question Categories

| Category | Description | Example |
|----------|-------------|---------|
| exact | Answer appears verbatim in document | "What year was the company founded?" → "1995" |
| reformulated | Question rephrased, same verbatim answer | "When did the company start?" → "1995" |
| multi_hop | Requires connecting multiple facts | "What is the phone number of the org that teaches X?" |
| fine_detail | Specific numbers, dates, codes | "What is the gross margin percentage?" → "65%" |
| implicit | Requires inference from document content | "Is the company profitable?" (inferred from financials) |
| negation | Asks what is NOT something | "Which technique is NOT used for divergence?" |

Difficulty Levels

| Difficulty | Description | Typical Accuracy |
|------------|-------------|------------------|
| easy | Direct lookup, common terms | 95-100% |
| medium | Some vocabulary variation | 80-95% |
| hard | Multi-hop, specific details, vocabulary gaps | 60-85% |

Fixture Guidelines

For statistically meaningful comparisons:

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| Total questions | 50 | 80+ |
| Multi-hop questions | 10% | 20%+ |
| Hard questions | 30% | 40%+ |

Why multi-hop matters: Multi-hop questions are the primary differentiator between retrieval strategies. Simple exact-match questions often hit ceiling effects (100% accuracy across all strategies).
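The minimums above can be checked against a fixture with a few lines of code. This is an illustrative sketch: `check_fixture_mix` and its parameter names are hypothetical, with defaults taken from the "Minimum" column of the table.

```python
from collections import Counter
from typing import Dict, List


def check_fixture_mix(
    qa_pairs: List[Dict[str, str]],
    min_questions: int = 50,
    min_multihop_pct: float = 10.0,
    min_hard_pct: float = 30.0,
) -> Dict[str, bool]:
    """Report whether the fixture meets each minimum from the guidelines table."""
    total = len(qa_pairs)
    categories = Counter(pair.get("category", "unknown") for pair in qa_pairs)
    difficulties = Counter(pair.get("difficulty", "unknown") for pair in qa_pairs)
    multihop_pct = 100.0 * categories["multi_hop"] / total if total else 0.0
    hard_pct = 100.0 * difficulties["hard"] / total if total else 0.0
    return {
        "questions": total >= min_questions,
        "multi_hop": multihop_pct >= min_multihop_pct,
        "hard": hard_pct >= min_hard_pct,
    }


# Toy fixture: 80 questions, 25% multi-hop, 50% hard.
fixture = (
    [{"category": "exact", "difficulty": "easy"}] * 40
    + [{"category": "multi_hop", "difficulty": "hard"}] * 20
    + [{"category": "fine_detail", "difficulty": "hard"}] * 20
)
print(check_fixture_mix(fixture))
```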

For detailed guidance, see Creating Q&A Fixtures.

Model Specifications

| Format | Example | Description |
|--------|---------|-------------|
| st:<model> | st:bge-base | SentenceTransformers (free, local) |
| openai:<model> | openai:text-embedding-3-small | OpenAI API (requires key) |
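The `<provider>:<model>` convention is easy to split apart. The sketch below shows one way to do it; `ModelSpec`, `parse_model_spec`, and `KNOWN_PROVIDERS` are hypothetical names for illustration and say nothing about how the package parses specs internally.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    provider: str
    model: str


# Providers listed in the table above.
KNOWN_PROVIDERS = {"st", "openai"}


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider:model' at the first colon and validate the provider."""
    provider, sep, model = spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected '<provider>:<model>', got {spec!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return ModelSpec(provider, model)


print(parse_model_spec("st:bge-base"))
print(parse_model_spec("openai:text-embedding-3-small"))
```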

Recommended Models

| Use Case | Model | Accuracy | Cost |
|----------|-------|----------|------|
| Best value | st:bge-base | 94.4% | Free |
| Quality-first | openai:text-embedding-3-small | 97.3% | ~$0.02/1M tokens |
| Fast prototyping | st:minilm | 89.7% | Free |

API Reference

Core Functions

# Compare multiple models
from embedding_eval import run_fair_comparison
results = run_fair_comparison(
    models=["st:bge-base", "st:minilm"],
    doc_content=text,
    qa_pairs=pairs,
    chunk_sizes=[256, 384, 512],  # optional
    overlaps=[25, 50, 100],       # optional
    top_ks=[5, 10, 15],           # optional
)

# Optimize single model
from embedding_eval import optimize_model
result = optimize_model(
    model_spec="st:bge-base",
    doc_content=text,
    qa_pairs=pairs,
)

Components

# Chunking
from embedding_eval.chunking import FixedSizeChunker
chunker = FixedSizeChunker(chunk_size=512, overlap=50)
chunks = chunker.chunk(document)

# Embedding
from embedding_eval.adapters.embedding import SentenceTransformerEmbedding
embedding = SentenceTransformerEmbedding(model="bge-base")
vectors = embedding.embed_documents(texts)

# Vector Store (no external deps)
from embedding_eval.adapters.vector import InMemoryVectorStore
store = InMemoryVectorStore()
store.connect()
store.create_collection("test", dimensions=768)
store.upsert("test", ids, embeddings, texts=texts)
results = store.search("test", query_embedding, top_k=10)

# Evaluation
from embedding_eval import BinaryEvaluator
evaluator = BinaryEvaluator()
score = evaluator.evaluate(question, answer, chunks)

# Fixture Validation
from embedding_eval import validate_fixture

result = validate_fixture(qa_pairs, doc_content)
print(result.summary)

if not result.is_valid:
    for issue in result.answer_issues:
        print(f"  {issue.answer}: {issue.issue_type}")
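To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal sliding-window sketch. It is not the package's `FixedSizeChunker`: the function name is hypothetical, and it splits on whitespace for simplicity, whereas the package lists tiktoken as a dependency and presumably counts model tokens.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, advancing by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last `overlap` tokens of the previous one, so an answer that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.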

CLI Usage

Validate Fixtures

Validate your Q&A fixture before running evaluations:

# Basic validation (checks structure, distribution, thresholds)
embedding-eval validate --qa fixture.json

# With answer verification against document
embedding-eval validate --qa fixture.json --doc document.txt

# JSON output for programmatic use
embedding-eval validate --qa fixture.json --json

# Custom thresholds
embedding-eval validate --qa fixture.json \
    --min-questions 30 \
    --min-multihop 5 \
    --min-hard 20

Example output:

==================================================
FIXTURE VALIDATION REPORT
==================================================

Total questions: 80
Answers verified: 80/80
  ✓ All answers found verbatim
  ✓ No duplicate questions

Category Distribution:
  exact: 50.0% (40)
  multi_hop: 25.0% (20)
  fine_detail: 15.0% (12)
  reformulated: 10.0% (8)

Difficulty Distribution:
  easy: 15.0% (12)
  medium: 45.0% (36)
  hard: 40.0% (32)

Threshold Checks:
  ✓ Questions: 80 (≥80 recommended)
  ✓ Multi-hop: 25.0% (≥20% recommended)
  ✓ Hard: 40.0% (≥40% recommended)

==================================================
STATUS: ✓ VALID (meets minimum requirements)
==================================================

Compare Models

Run fair comparison from the command line:

embedding-eval compare \
    --doc document.txt \
    --qa fixture.json \
    --models st:bge-base st:minilm \
    --output results.json

Key Research Findings

Based on comprehensive evaluation (712 questions across 5 document types):

  1. Chunking matters most: Section-aware chunking improved the retrieval rank for question Q33 from 66 to 2, a larger effect than any algorithm change

  2. Recommended configuration:

    config = {
        "chunk_size": 512,
        "overlap": 50,
        "top_k": 10,
    }
    # Accuracy: 94.0% with BGE-base
    
  3. What NOT to do:

    • Graph retrieval causes a regression of 4.6% to 4.9% on most documents
    • Small chunks (128 tokens) generalize poorly
    • Query expansion helps vocabulary mismatch but can hurt precision

License

MIT

Contributing

Contributions welcome! Please open an issue first to discuss proposed changes.
