embedding-eval

Fair embedding model evaluation with independent parameter optimization.

Why This Package?

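Many embedding comparisons are unfair because they tune chunking and retrieval parameters for one model and then apply those same parameters to every other model. This package implements a fair comparison methodology in which each model is optimized independently: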

Approach	Description	Fair?
Unfair	Optimize parameters for Model A, apply to all models	❌
Fair	Each model gets its own optimized parameters	✅

Key Features

- Independent Optimization: Each model gets its own best chunk_size, overlap, and top_k
- Binary Evaluation: Simple substring matching with no LLM cost and reproducible results
- Confidence Intervals: Reports 95% confidence intervals using the Wilson score interval
- Minimal Dependencies: Core functionality requires only sentence-transformers and tiktoken
- No External Services: InMemoryVectorStore requires no database setup

Installation

pip install embedding-eval

# For OpenAI models
pip install embedding-eval[openai]

Quick Start

from embedding_eval import run_fair_comparison

# Your document content (read with an explicit encoding and close the file afterwards)
with open("document.txt", encoding="utf-8") as f:
    doc_content = f.read()

# Q&A pairs where answers appear VERBATIM in the document
qa_pairs = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "When was the company founded?", "answer": "1995"},
    # ... more pairs (recommend 80+ for statistical power)
]

# Compare models with independent optimization
results = run_fair_comparison(
    models=["st:bge-base", "st:minilm"],
    doc_content=doc_content,
    qa_pairs=qa_pairs,
)

# Results include baseline + optimized + confidence intervals
for r in results:
    print(f"{r.model_name}:")
    print(f"  Baseline: {r.baseline_accuracy:.1f}%")
    print(f"  Optimized: {r.best_accuracy:.1f}% (95% CI: [{r.ci_lower:.1f}%, {r.ci_upper:.1f}%])")
    print(f"  Best params: {r.best_params}")

Methodology

Fair Comparison = Independent Optimization

┌────────────────────────────────────────────────────────────────┐
│  FAIR COMPARISON METHODOLOGY                                   │
│                                                                │
│  For each model:                                               │
│    1. Grid search over chunk_size × overlap × top_k            │
│    2. Find best parameters for THIS model                      │
│    3. Report: baseline + optimized + 95% CI                    │
│                                                                │
│  Compare models using their respective best configurations     │
└────────────────────────────────────────────────────────────────┘
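The per-model loop in the box above can be sketched in plain Python. This is only an illustration of the idea, not the package's implementation: `grid_search`, `compare_fairly`, and the toy scoring functions are hypothetical names, and a real scoring function would chunk, embed, retrieve, and evaluate the Q&A pairs.

```python
from itertools import product


def grid_search(score_fn, chunk_sizes, overlaps, top_ks):
    """Return (best_params, best_score) for one model over the full grid."""
    best_params, best_score = None, float("-inf")
    for chunk_size, overlap, top_k in product(chunk_sizes, overlaps, top_ks):
        if overlap >= chunk_size:
            continue  # an overlap as large as the chunk makes no progress
        score = score_fn(chunk_size, overlap, top_k)
        if score > best_score:  # ties keep the first configuration found
            best_params = {"chunk_size": chunk_size, "overlap": overlap, "top_k": top_k}
            best_score = score
    return best_params, best_score


def compare_fairly(score_fns, chunk_sizes, overlaps, top_ks):
    """Optimize every model independently, then compare best against best."""
    return {
        name: grid_search(fn, chunk_sizes, overlaps, top_ks)
        for name, fn in score_fns.items()
    }


# Hypothetical stand-ins for real evaluation runs: each model peaks at a
# different configuration, which is exactly what shared parameters would hide.
toy_scores = {
    "model_a": lambda c, o, k: 100 - abs(c - 256) / 10 - abs(k - 5),
    "model_b": lambda c, o, k: 100 - abs(c - 512) / 10 - abs(k - 10),
}

results = compare_fairly(toy_scores, [256, 384, 512], [25, 50, 100], [5, 10, 15])
for name, (params, score) in results.items():
    print(name, params, round(score, 1))
```

In this toy setup, applying model_a's best parameters to model_b would understate model_b's score, which is the unfairness the methodology is meant to avoid.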

Binary Evaluation

We use substring matching to check if the expected answer appears in retrieved chunks:

from embedding_eval import BinaryEvaluator

evaluator = BinaryEvaluator()
score = evaluator.evaluate(
    question="What is the capital?",
    expected_answer="Paris",
    retrieved_chunks=["France is a country. Paris is its capital."]
)
print(score.score)  # 1.0 (answer found)

Why binary evaluation?

- Simple and reproducible
- No LLM cost ($0 vs ~$0.03/question for LLM evaluation)
- Proven effective for parameter optimization (see EDD-005)
- RAGAS and LLM evaluation add cost without improving decisions
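To make the check concrete, here is a minimal sketch of binary substring evaluation. It is not the code behind `BinaryEvaluator`; the function names are hypothetical, and the case-insensitive default is an assumption made for this example.

```python
def answer_found(expected_answer, retrieved_chunks, case_sensitive=False):
    """Return True if the expected answer is a substring of any retrieved chunk."""
    if not case_sensitive:
        # Assumption for this sketch: compare case-insensitively by default.
        expected_answer = expected_answer.casefold()
        retrieved_chunks = [chunk.casefold() for chunk in retrieved_chunks]
    return any(expected_answer in chunk for chunk in retrieved_chunks)


def accuracy(qa_pairs, retrieve):
    """Percentage of questions whose answer appears in the retrieved chunks.

    `retrieve` is a hypothetical callable mapping a question to a list of chunks.
    """
    hits = sum(answer_found(qa["answer"], retrieve(qa["question"])) for qa in qa_pairs)
    return 100.0 * hits / len(qa_pairs)


chunks = ["France is a country. Paris is its capital."]
print(answer_found("Paris", chunks))
print(answer_found("Berlin", chunks))
```

Because the score depends only on string containment, repeated runs over the same retrieved chunks give identical results.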

Statistical Requirements

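Sample Size	95% CI Half-Width (approx.)	Can Detect
50	±11%	>22% differences
80	±9%	>18% differences
100	±8%	>16% differences

The half-widths are approximate and correspond to accuracies of around 80%; intervals are wider when accuracy is closer to 50%.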

Recommendation: Use 80+ questions with 20%+ multi-hop for meaningful comparisons.
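The interval widths can be reproduced with the standard Wilson score formula. The sketch below is a standalone implementation for illustration; `wilson_interval` is a hypothetical name, and the 80% accuracy used in the loop is an assumption chosen because it roughly matches the table.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Assumed accuracy of 80% at each sample size from the table.
for n in (50, 80, 100):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n}: [{low:.3f}, {high:.3f}], half-width {100 * (high - low) / 2:.1f}%")
```

Two models whose intervals overlap heavily cannot be separated with confidence, which is why small fixtures only reveal large accuracy gaps.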

Q&A Fixture Format

[
  {
    "question": "What does BATNA stand for?",
    "answer": "Best Alternative To a Negotiated Agreement",
    "category": "exact",
    "difficulty": "medium"
  }
]

Important: Answers must appear verbatim in the document.
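A quick way to check the verbatim requirement before running an evaluation is sketched below. This is not the package's `validate_fixture`; `missing_answers` is a hypothetical helper, and the exact, case-sensitive substring test is an assumption.

```python
import json


def missing_answers(qa_pairs, doc_content):
    """Return the Q&A pairs whose answer does not appear verbatim in the document."""
    return [qa for qa in qa_pairs if qa["answer"] not in doc_content]


# Illustrative data only; a real fixture would be loaded from a JSON file.
fixture_json = """
[
  {"question": "What does BATNA stand for?",
   "answer": "Best Alternative To a Negotiated Agreement",
   "category": "exact", "difficulty": "medium"},
  {"question": "When was the company founded?",
   "answer": "1995", "category": "exact", "difficulty": "easy"}
]
"""
doc_content = "BATNA stands for Best Alternative To a Negotiated Agreement."

qa_pairs = json.loads(fixture_json)
for qa in missing_answers(qa_pairs, doc_content):
    print("Not found verbatim:", qa["answer"])
```

A pair flagged here can never score as correct under substring matching, so it lowers every model's accuracy equally and only adds noise.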

Question Categories

Category	Description	Example
`exact`	Answer appears verbatim in document	"What year was the company founded?" → "1995"
`reformulated`	Question rephrased, same verbatim answer	"When did the company start?" → "1995"
`multi_hop`	Requires connecting multiple facts	"What is the phone number of the org that teaches X?"
`fine_detail`	Specific numbers, dates, codes	"What is the gross margin percentage?" → "65%"
`implicit`	Requires inference from document content	"Is the company profitable?" (inferred from financials)
`negation`	Asks what is NOT something	"Which technique is NOT used for divergence?"

Difficulty Levels

Difficulty	Description	Typical Accuracy
`easy`	Direct lookup, common terms	95-100%
`medium`	Some vocabulary variation	80-95%
`hard`	Multi-hop, specific details, vocabulary gaps	60-85%

Fixture Guidelines

For statistically meaningful comparisons:

Requirement	Minimum	Recommended
Total questions	50	80+
Multi-hop questions	10%	20%+
Hard questions	30%	40%+

Why multi-hop matters: Multi-hop questions are the primary differentiator between retrieval strategies. Simple exact-match questions often hit ceiling effects (100% accuracy across all strategies).
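The guideline thresholds can be checked with a few lines of Python. This is a sketch with hypothetical helper names (`fixture_stats`, `meets_thresholds`), not the package's validator; the default thresholds are the "Minimum" column from the table above.

```python
from collections import Counter


def fixture_stats(qa_pairs):
    """Summarize fixture size plus multi-hop and hard-question percentages."""
    total = len(qa_pairs)
    categories = Counter(qa.get("category", "unknown") for qa in qa_pairs)
    difficulties = Counter(qa.get("difficulty", "unknown") for qa in qa_pairs)
    return {
        "total": total,
        "multi_hop_pct": 100.0 * categories["multi_hop"] / total if total else 0.0,
        "hard_pct": 100.0 * difficulties["hard"] / total if total else 0.0,
    }


def meets_thresholds(stats, min_questions=50, min_multihop_pct=10.0, min_hard_pct=30.0):
    """Check a stats dict against the minimum (default) or custom thresholds."""
    return (
        stats["total"] >= min_questions
        and stats["multi_hop_pct"] >= min_multihop_pct
        and stats["hard_pct"] >= min_hard_pct
    )


# Synthetic fixture: 20 hard multi-hop, 12 hard exact, 48 easy exact questions.
pairs = (
    [{"category": "multi_hop", "difficulty": "hard"}] * 20
    + [{"category": "exact", "difficulty": "hard"}] * 12
    + [{"category": "exact", "difficulty": "easy"}] * 48
)
stats = fixture_stats(pairs)
print(stats, meets_thresholds(stats), meets_thresholds(stats, 80, 20.0, 40.0))
```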

For detailed guidance, see Creating Q&A Fixtures.

Model Specifications

Format	Example	Description
`st:<model>`	`st:bge-base`	SentenceTransformers (free, local)
`openai:<model>`	`openai:text-embedding-3-small`	OpenAI API (requires key)
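The `<provider>:<model>` format is simple to split. The sketch below shows one way to parse it; `parse_model_spec` is a hypothetical helper and may not match how the package resolves specifications internally.

```python
KNOWN_PROVIDERS = {"st", "openai"}  # the two prefixes listed in the table above


def parse_model_spec(spec):
    """Split 'provider:model' into its two parts, validating the prefix."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>:<model>', got {spec!r}")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return provider, model


print(parse_model_spec("st:bge-base"))
print(parse_model_spec("openai:text-embedding-3-small"))
```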

Recommended Models

Use Case	Model	Accuracy	Cost
Best value	`st:bge-base`	94.4%	Free
Quality-first	`openai:text-embedding-3-small`	97.3%	~$0.02/1M tokens
Fast prototyping	`st:minilm`	89.7%	Free

API Reference

Core Functions

# Compare multiple models
# (text is the document string and pairs is the Q&A list, as in Quick Start)
from embedding_eval import run_fair_comparison
results = run_fair_comparison(
    models=["st:bge-base", "st:minilm"],
    doc_content=text,
    qa_pairs=pairs,
    chunk_sizes=[256, 384, 512],  # optional
    overlaps=[25, 50, 100],       # optional
    top_ks=[5, 10, 15],           # optional
)

# Optimize single model
from embedding_eval import optimize_model
result = optimize_model(
    model_spec="st:bge-base",
    doc_content=text,
    qa_pairs=pairs,
)

Components

# Chunking
from embedding_eval.chunking import FixedSizeChunker
chunker = FixedSizeChunker(chunk_size=512, overlap=50)
chunks = chunker.chunk(document)

# Embedding
from embedding_eval.adapters.embedding import SentenceTransformerEmbedding
embedding = SentenceTransformerEmbedding(model="bge-base")
vectors = embedding.embed_documents(texts)

# Vector Store (no external deps)
from embedding_eval.adapters.vector import InMemoryVectorStore
store = InMemoryVectorStore()
store.connect()
store.create_collection("test", dimensions=768)
store.upsert("test", ids, embeddings, texts=texts)
results = store.search("test", query_embedding, top_k=10)

# Evaluation
from embedding_eval import BinaryEvaluator
evaluator = BinaryEvaluator()
score = evaluator.evaluate(question, answer, chunks)

# Fixture Validation
from embedding_eval import validate_fixture

result = validate_fixture(qa_pairs, doc_content)
print(result.summary)

if not result.is_valid:
    for issue in result.answer_issues:
        print(f"  {issue.answer}: {issue.issue_type}")
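To illustrate how chunk_size and overlap interact in the chunking component, here is a minimal sketch of fixed-size chunking with overlap. It is not the `FixedSizeChunker` implementation: `fixed_size_chunks` is a hypothetical function, and it operates on any token list (whitespace-split words here) rather than tiktoken tokens.

```python
def fixed_size_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer in this sketch.
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With chunk_size=512 and overlap=50, each new chunk starts 462 tokens after the previous one, so larger overlaps produce more chunks for the same document.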

CLI Usage

Validate Fixtures

Validate your Q&A fixture before running evaluations:

# Basic validation (checks structure, distribution, thresholds)
embedding-eval validate --qa fixture.json

# With answer verification against document
embedding-eval validate --qa fixture.json --doc document.txt

# JSON output for programmatic use
embedding-eval validate --qa fixture.json --json

# Custom thresholds
embedding-eval validate --qa fixture.json \
    --min-questions 30 \
    --min-multihop 5 \
    --min-hard 20

Example output:

==================================================
FIXTURE VALIDATION REPORT
==================================================

Total questions: 80
Answers verified: 80/80
  ✓ All answers found verbatim
  ✓ No duplicate questions

Category Distribution:
  exact: 50.0% (40)
  multi_hop: 25.0% (20)
  fine_detail: 15.0% (12)
  reformulated: 10.0% (8)

Difficulty Distribution:
  easy: 15.0% (12)
  medium: 45.0% (36)
  hard: 40.0% (32)

Threshold Checks:
  ✓ Questions: 80 (≥80 recommended)
  ✓ Multi-hop: 25.0% (≥20% recommended)
  ✓ Hard: 40.0% (≥40% recommended)

==================================================
STATUS: ✓ VALID (meets minimum requirements)
==================================================

Compare Models

Run fair comparison from the command line:

embedding-eval compare \
    --doc document.txt \
    --qa fixture.json \
    --models st:bge-base st:minilm \
    --output results.json

Key Research Findings

Based on comprehensive evaluation (712 questions across 5 document types):

Chunking matters most: Section-aware chunking improved Q33 from rank 66 → rank 2 (more impact than any algorithm change)

Recommended configuration:

config = {
    "chunk_size": 512,
    "overlap": 50,
    "top_k": 10,
}
# Accuracy: 94.0% with BGE-base

What NOT to do:
- Graph retrieval reduces accuracy by 4.6% to 4.9% on most documents
- Small chunks (128 tokens) generalize poorly
- Query expansion helps vocabulary mismatch but can hurt precision

License

MIT

Contributing

Contributions welcome! Please open an issue first to discuss proposed changes.

Release history

0.2.0 (this version): Jan 16, 2026
0.1.0: Jan 16, 2026
