Finesse Benchmark: Evaluating Long-Context Embedders with Semantic Precision
vBERT: Vector Sequence Merger for Limited Embedding Context Extension
Introduction
The Finesse Benchmark is a sophisticated evaluation framework designed to assess the performance of long-context embedding models on semantic understanding and information retention. Unlike traditional benchmarks that rely on superficial metrics, Finesse focuses on Relative Semantic Similarity (RSS)—a robust metric that measures how well models distinguish between relevant ("memory") and irrelevant ("noise") chunks in long sequences.
Key Features
- Modular Evaluation Modes: Supports `merger_mode` (using a sequence-merger with a base embedder), `native_mode` (direct long-context embedders like Snowflake Arctic Embed), and `byok_mode` (Bring Your Own Keys for external APIs via LiteLLM).
- Dynamic Probe Generation: Creates synthetic probes from atomic text chunks in the dataset, masking portions to test reconstruction accuracy.
- Top-Down and Bottom-Up Scoring: Combines Top-Down (TD) scoring for contextual coherence (how well the model separates memory from noise) with Bottom-Up (BU) scoring for individual chunk integrity (how well each chunk recognizes itself in compositions).
- Latency Measurement: Tracks computational efficiency by measuring `total_latency` (full embedding process time in milliseconds) and `synthesis_latency` (merging-specific time in milliseconds), enabling analysis of accuracy versus speed trade-offs.
- Reproducibility and Integrity: Outputs include self-contained content hashes and optional model hashes for notarization and verification.
- CLI-Driven Workflow: Simple commands (`init`, `generate`, `score`, `checksum`) for end-to-end evaluation.
- Dataset: Uses the enzoescipy/finesse-benchmark-database on Hugging Face, which provides domain-diverse atomic chunks grouped by `string_id`.
- Results Repository: All official and community-submitted benchmark results are archived in the enzoescipy/finesse-benchmark-results dataset. For interactive leaderboards and submission guidelines, see the enzoescipy/finesse-benchmark-space.
Finesse is built with Pydantic for configuration validation, Typer for the CLI, and Torch for efficient embedding computations.
Installation
Install via pip:
pip install finesse-benchmark
- Requires Python 3.8+.
- For GPU acceleration: ensure CUDA is installed and set `device: "cuda"` in the config.
- Hugging Face models are downloaded automatically (using the `transformers` cache).
For BYOK mode (e.g., OpenAI), install additional dependencies:
pip install litellm
Quick Start
1. Initialize Config (Optional)
Generate a default benchmark.yaml template:
finesse init --output benchmark.yaml
For official leaderboard submissions, use the standardized config:
finesse init --leaderboard
This copies the official benchmark.leaderboard.yaml to benchmark.yaml, ensuring reproducibility with fixed seed (42), balanced sequence lengths (4-32), and samples per length (25).
Alternatively, manually copy:
cp "benchmark.leaderboard.yaml" "benchmark.yaml"
Note: For leaderboard submissions, use the official config as-is to maintain fairness; edit it only for custom evaluations.
Edit benchmark.yaml to select mode, models, and probe settings. For BYOK mode, see the dedicated section below.
2. Generate Raw Embeddings
Run the evaluation to generate raw probe and synthesis embeddings:
finesse generate --config benchmark.yaml --output results --samples 5 --seed 42
- This saves a `.pt` file (e.g., `embeddings_merger_mode_finesse-benchmark-database.pt`) containing the raw data and config.
- Overrides: Use `--dataset-path` for custom HF datasets and `--samples` for more evaluations per length.
3. Score the Embeddings
Compute RSS scores from the raw data:
finesse score --pt-path results/embeddings_merger_mode_finesse-benchmark-database.pt --output results
- Outputs `benchmark_results.json` with `average_rss` (the final score) and per-length scores.
- Scores are normalized and scaled (multiplied by 500 for interpretability).
4. Verify Integrity (Checksum)
Validate the results for tampering or reproducibility:
finesse checksum --json-path results/benchmark_results.json
For full provenance (model unchanged), provide the model ID:
finesse checksum --json-path results/benchmark_results.json --model-path Snowflake/snowflake-arctic-embed-l-v2.0
- ✅ Success if content and model hashes match.
- Only Hugging Face model IDs (e.g., `org/repo`) are accepted for `--model-path`.
Detailed CLI Reference
All commands use Typer for an intuitive interface. Run `finesse --help` for an overview.
finesse init
Generates a commented benchmark.yaml template or copies the official leaderboard config.
Options:
- `--leaderboard`: Use the official leaderboard configuration (copies `benchmark.leaderboard.yaml` to the output path; default: False).
- `--output`: Path for the saved YAML (default: `benchmark.yaml`).
Examples:
- Default template: `finesse init --output my_config.yaml`
- Leaderboard config: `finesse init --leaderboard`
The template includes examples for all modes and validates against BenchmarkConfig before saving.
finesse generate
Generates raw embeddings from the dataset using the specified config.
Options:
- `--config` (required): Path to `benchmark.yaml`.
- `--dataset-path`: Override the HF dataset path (default: from config).
- `--output`: Directory for `.pt` files (default: `results`).
- `--samples`: Samples per sequence length (overrides config).
- `--seed`: Random seed for reproducibility (overrides config).
Output:
- `.pt` file: A Torch-serialized dict with `config` (the full config as a dict) and `raw_results` (embeddings per length).
Example:
finesse generate --config benchmark.yaml --output ./my_results --samples 10
finesse score
Computes TD/BU scores and final RSS from raw .pt data.
Options:
- `--pt-path` (required): Path to the `.pt` file from `generate`.
- `--output`: Directory for the JSON (default: `results`).
Scoring Logic (simplified):
- TD Score: Quartile gap between memory and noise similarities (excludes first/last synthesis steps for stability).
- BU Score: Similar gap from individual chunk perspectives.
- Final RSS: `((avg_TD + avg_BU) / 2) - |avg_TD - avg_BU|` per length, averaged across lengths, then scaled by 500.
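For example, plugging in an average TD of 0.42 and an average BU of 0.35 for one length (illustrative numbers only):

```python
# Illustrative numbers only: td and bu stand in for avg_TD and avg_BU.
td, bu = 0.42, 0.35
per_length = ((td + bu) / 2) - abs(td - bu)  # 0.315; the imbalance term penalizes TD/BU gaps
print(per_length * 500)                      # 157.5 after scaling
```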
Output: `benchmark_results.json`:

```json
{
  "config": { ... },
  "average_rss": 68.984345,
  "average_total_latency": 52.576509,
  "average_synthesis_latency": 52.576509,
  "length_scores": {
    "4": {
      "rss_scores": [123.456, 234.567],
      "total_latency_scores": [12.34, 56.78],
      "synthesis_latency_scores": [1.23, 4.56]
    },
    "5": { ... }
  },
  "content_hash": "sha256:...",
  "model_hash": "sha256:..."
}
```

(`model_hash` is optional and only present for HF models.)
Example:
finesse score --pt-path results/embeddings_byok_mode_finesse-benchmark-database.pt
finesse checksum
Verifies JSON integrity via self-contained hash. Optional model provenance check.
Options:
- `--json-path` (required): Path to `benchmark_results.json`.
- `--model-path`: HF model ID for provenance (e.g., `intfloat/multilingual-e5-base`).
Verification:
- Recomputes `content_hash` (which excludes the hash itself) and compares it.
- With `--model-path`: recomputes `model_hash` from the model files and compares it.
Example:
finesse checksum --json-path results/benchmark_results.json --model-path enzoescipy/sequence-merger-tiny
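Conceptually, the content check recomputes a digest over the JSON with the hash fields excluded and compares it to the stored value. A minimal sketch, assuming a sorted-keys JSON canonicalization (the exact serialization is internal to `finesse checksum`, so treat this as illustrative rather than a drop-in replacement):

```python
import hashlib
import json

def recompute_content_hash(results: dict) -> str:
    # Drop the hash fields themselves, then hash a canonical serialization.
    payload = {k: v for k, v in results.items() if k not in ("content_hash", "model_hash")}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

with open("results/benchmark_results.json", encoding="utf-8") as f:
    results = json.load(f)
print(recompute_content_hash(results) == results["content_hash"])
```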
Output Files Explained
- `.pt` (Raw Embeddings): Binary Torch file with:
  - `config`: Full benchmark config as a dict.
  - `raw_results`: Dict of `{length: {"probe_embeddings": [...], "synthesis_embeddings": [...], "num_synth_steps": int}}`.
  - Used as input to `score`; decouples embedding generation (GPU-heavy) from scoring (CPU-friendly).
- `benchmark_results.json`: Human-readable results with:
  - `length_scores`: Per-sequence-length scores (tests scaling).
  - `rss_scores`: Overall scores (higher is better; roughly -1000 to 1000 range).
  - `total_latency_scores`: Total time (in milliseconds) for the entire embedding generation process across all samples.
  - `synthesis_latency_scores`: Time (in milliseconds) specifically for the synthesis/merging steps across all samples.
  - `content_hash`: SHA-256 of config + scores (for tamper-proofing).
  - `model_hash`: SHA-256 of the model files (if applicable; verifies the model is unchanged).
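A minimal sketch of consuming both artifacts programmatically, using only the keys documented above:

```python
import json
import torch

# Raw embeddings: a Torch-serialized dict with 'config' and 'raw_results'
raw = torch.load("results/embeddings_merger_mode_finesse-benchmark-database.pt")
print(raw["config"]["mode"])
print(list(raw["raw_results"].keys()))  # sequence lengths evaluated

# Scored results: plain JSON
with open("results/benchmark_results.json", encoding="utf-8") as f:
    results = json.load(f)
print(results["average_rss"], results["content_hash"])
```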
Hashes ensure reproducibility: rerun `checksum` on shared results to confirm no alterations. Users can upload their `benchmark_results.json` to the enzoescipy/finesse-benchmark-space for community visibility and leaderboard integration.
Using BYOK Mode (Bring Your Own Keys)
BYOK mode integrates external embedding APIs (e.g., OpenAI, Cohere) via LiteLLM for fair comparison with open models.
Note: BYOK mode is currently broken due to a LiteLLM issue; the feature is temporarily disabled.
Setup
1. Edit `benchmark.yaml`:

   ```yaml
   mode: "byok_mode"
   models:
     byok_embedder:
       provider: "openai"              # 'openai', 'cohere', 'google', etc.
       name: "text-embedding-3-large"  # Provider-specific model
   ```
2. Set environment variables (REQUIRED; never hardcode keys in YAML):
   - OpenAI: `export OPENAI_API_KEY="sk-..."`
   - Cohere: `export COHERE_API_KEY="..."`
   - Google: `export GOOGLE_API_KEY="..."` (or Vertex AI credentials).
   - On Windows (PowerShell): `$env:OPENAI_API_KEY="sk-..."`
   - LiteLLM auto-detects the key based on `provider`; see the LiteLLM docs for the full list.
3. Run as usual: `finesse generate --config byok_config.yaml`
Notes
- Costs: BYOK incurs API fees; start with a small `--samples` value (e.g., 1-5).
- Security: Keys stay in environment variables, so the YAML remains commit-safe.
- Validation: The config validator ensures `byok_embedder` is set for `byok_mode`; other model fields are optional/ignored.
- Example YAML (in the `init` template): Uncomment and customize the BYOK section.
Configuration Deep Dive (benchmark.yaml)
- `mode`: `"merger_mode"` (default; uses merger + base embedder), `"native_mode"` (direct embedder), or `"byok_mode"`.
- `models`: Mode-specific; unused fields default to `None` (no download).
  - `merger`: Sequence-merger path (e.g., `"enzoescipy/sequence-merger-tiny"`).
  - `base_embedder` / `native_embedder`: Embedder path (e.g., `"intfloat/multilingual-e5-base"`).
- `datasets`: A list of HF dataset specifications:
  - `path`: HF path, e.g. `"author/databasename"`.
  - `split`: `"train"` or a specific split of the HF dataset.
  - `data_files`: Specific files such as `"en/en_part_00000.parquet"`; wildcards (`*`) are allowed in the filename, e.g. `"en/en_part_0000*.parquet"`.
  - `data_dir`: Dataset subdirectory such as `"en"`; if present, overrides `data_files`.
  - `text_column`: The column containing the actual corpus text (normally `"text"`).
  - `commit_hash`: Commit hash of the HF dataset revision, e.g. `"a1b2c3..."`.
- `probe_config`:
  - `sequence_length`: `{min: 5, max: 16}` (probe lengths in tokens).
  - `samples_per_length`: 1+ (evaluations per length).
- `advanced`: `{batch_size: 8, device: "cuda"}` (optional).
- `seed`: 42 (reproducibility).
Pydantic ensures type safety; invalid configs raise ValueError on load.
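As a sketch of that validation, the same structure can be expressed as a plain dict and checked programmatically. The nesting below (model entries as objects with a `name` field) is assumed from the snippets elsewhere in this README; `BenchmarkConfig` is the authoritative schema:

```python
from finesse_benchmark.config import BenchmarkConfig

config_dict = {
    "mode": "merger_mode",
    "models": {
        "merger": {"name": "enzoescipy/sequence-merger-tiny"},
        "base_embedder": {"name": "intfloat/multilingual-e5-base"},
    },
    "datasets": [{
        "path": "enzoescipy/finesse-benchmark-database",
        "split": "train",
        "text_column": "text",
    }],
    "probe_config": {
        "sequence_length": {"min": 5, "max": 16},
        "samples_per_length": 1,
    },
    "advanced": {"batch_size": 8, "device": "cuda"},
    "seed": 42,
}

config = BenchmarkConfig.model_validate(config_dict)  # raises on invalid configs
```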
Leaderboard Submission Guide
To ensure fair, reproducible, and standardized evaluations for the official Finesse leaderboard:
- Initialize Official Config: Use the CLI for convenience:

  finesse init --leaderboard

  This copies `benchmark.leaderboard.yaml` to `benchmark.yaml` with fixed settings (seed: 42, sequence_length: {min: 4, max: 32}, samples_per_length: 25). Or manually (for scripts/environments without the CLI):

  cp "benchmark.leaderboard.yaml" "benchmark.yaml"

  Important: Do not modify this config for leaderboard submissions. Use it as-is for comparability.
- Run Evaluation: Generate embeddings:

  finesse generate --config benchmark.yaml --output results

  Then score the results:

  finesse score --pt-path results/embeddings_merger_mode_finesse-benchmark-database.pt --output results
- Verify & Submit: Check integrity:

  finesse checksum --json-path results/benchmark_results.json
  finesse checksum --json-path results/benchmark_results.json --merger-path enzoescipy/sequence-merger-malgeum --base-embedder-path intfloat/multilingual-e5-base
  finesse checksum --json-path results/benchmark_results.json --native-path Snowflake/snowflake-arctic-embed-l-v2.0
  finesse verify --json-path results/benchmark_results.json --pt-path results/embeddings_merger_mode_finesse-benchmark-database.pt

  Submit `benchmark_results.json` to the leaderboard (via HF Spaces, GitHub, etc.). Include `content_hash` and `model_hash` for verification.
This setup guarantees all submissions use identical dataset sampling, probe generation, and randomness, focusing purely on model performance.
Using Custom Local Synthesizers with Finesse Benchmark
The finesse-benchmark package follows a dependency injection design pattern, allowing flexible substitution of core components: the embedder (for generating partial embeddings) and the synthesizer (for merging sequences). This enables users to:
- Use standard Hugging Face models for embedding (e.g., via `HuggingFaceEmbedder`).
- Swap in custom synthesizers for local models, such as trained checkpoints (`.pt` files) from your own training pipeline.
- Ensure seamless integration with the `FinesseEvaluator` for benchmark scoring, without modifying the package core.
This approach promotes modularity: you define your embedder and synthesizer engines, inject them into the evaluator, and run standardized probes to measure Relative Semantic Similarity (RSS) scores. It's tamper-proof, scalable, and leaderboard-ready.
Custom Synthesizer Recipe: LocalCheckpointSynthesizer
Below is the complete code for `LocalCheckpointSynthesizer`. It inherits from `FinesseSynthesizer` and handles dynamic architecture detection (Micro/Macro/Massive/ResidualCorrectionMoe) from checkpoint metadata; the architecture class names shown are placeholders for your own training code, so adjust the import and names as needed.
Place this in a script (e.g., `evaluator_custom.py`) or your project module.
```python
import torch

from finesse_benchmark.interfaces import FinesseSynthesizer


class LocalCheckpointSynthesizer(FinesseSynthesizer):
    def __init__(self, checkpoint_path: str, device: torch.device = None, dtype: torch.dtype = None):
        super().__init__()
        self.checkpoint_path = checkpoint_path
        self.device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.dtype = dtype or (torch.float16 if self.device.type == 'cuda' else torch.float32)
        self.model = self._load_model()

    def _load_model(self):
        # The checkpoint carries its own metadata: 'model_architecture'
        # (class name) and 'model_state_dict' (weights).
        checkpoint = torch.load(self.checkpoint_path, map_location=self.device)
        arch_name = checkpoint.get('model_architecture', 'MicroTransformerSynthesizer')

        # Map architecture names to your own classes; adjust this import to
        # wherever your synthesizer classes live (see Dependencies note below).
        from models.synthesizer import (
            MicroTransformerSynthesizer,
            MacroTransformerSynthesizer,
            MassiveTransformerSynthesizer,
            ResidualCorrectionMoeSynthesizer,
        )
        architectures = {
            cls.__name__: cls
            for cls in (
                MicroTransformerSynthesizer,
                MacroTransformerSynthesizer,
                MassiveTransformerSynthesizer,
                ResidualCorrectionMoeSynthesizer,
            )
        }
        # Fall back to the Micro architecture if the name is unknown.
        model_cls = architectures.get(arch_name, MicroTransformerSynthesizer)

        model = model_cls()
        model.load_state_dict(checkpoint['model_state_dict'])
        model = model.to(device=self.device, dtype=self.dtype)
        return model.eval()

    def synthesize(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Ensure inputs are on the correct device and dtype
        embeddings = embeddings.to(device=self.device, dtype=self.dtype)

        # Handle short/empty sequences: Finesse expects an identity transformation
        if embeddings.shape[1] <= 1:
            return embeddings[0]

        with torch.no_grad():
            outputs = self.model(embeddings)
            if hasattr(outputs, 'pooler_output'):
                synthesized = outputs.pooler_output
            else:
                synthesized = outputs  # Direct output, assuming shape (B, D)

        # L2-normalize the merged embeddings on-device
        return torch.nn.functional.normalize(synthesized, p=2, dim=1)
```
Notes on the Recipe:
- Dynamic Detection: Relies on the `'model_architecture'` key in your `.pt` file's metadata (added during training, e.g., via `torch.save({'model_architecture': 'MicroTransformerSynthesizer', ...})`).
- Error Handling: Falls back to `MicroTransformerSynthesizer` if the architecture is unknown.
- Performance: Uses `.eval()` and `no_grad()` for efficient inference.
- Dependencies: Requires your custom model classes (e.g., in `models/synthesizer.py`). Adjust the imports as needed.
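On the training side, a compatible checkpoint just needs those two keys. A minimal sketch, where `model` is whatever synthesizer you trained:

```python
import torch

# Save the trained synthesizer with the metadata the recipe expects.
torch.save(
    {
        "model_architecture": "MicroTransformerSynthesizer",  # your architecture's class name
        "model_state_dict": model.state_dict(),
    },
    "checkpoints/your_model.pt",
)
```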
Step-by-Step Usage Guide
Follow these steps to integrate your custom synthesizer into a full benchmark evaluation.
1. Setup Environment and Config
- Install finesse-benchmark: `pip install finesse-benchmark`
- Prepare your `.pt` checkpoint file (it must include `'model_architecture'` and `'model_state_dict'`).
- Create a `benchmark.yaml` config file with `finesse init`, or use an existing one (standard Finesse format).
Example benchmark.yaml snippet:
```yaml
probe_config:
  sequence_length:
    min: 4
    max: 16
    step: 4
models:
  base_embedder:
    name: "intfloat/multilingual-e5-base"  # Or your preferred embedder model
```
2. Define Device and dtype
Unify settings to avoid runtime errors (e.g., dtype mismatches).
```python
import torch

# Auto-detect device and set dtype
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == 'cuda' else torch.float32

print(f"Using device: {device}")
print(f"Using dtype: {dtype}")
```
3. Import Modules and Load Config
```python
import yaml
from pathlib import Path

from finesse_benchmark.config import BenchmarkConfig
from finesse_benchmark.implementations import HuggingFaceEmbedder
from finesse_benchmark.evaluator import FinesseEvaluator

# Load and validate the config
with open("benchmark.yaml", 'r', encoding='utf-8') as f:
    config_dict = yaml.safe_load(f)
config = BenchmarkConfig.model_validate(config_dict)
```
4. Instantiate Embedder and Custom Synthesizer
Inject unified device/dtype into both components.
```python
# Your custom synthesizer from evaluator_custom.py (see the recipe above)
from evaluator_custom import LocalCheckpointSynthesizer

# Embedder: standard Hugging Face model
embedder = HuggingFaceEmbedder(
    model_path=config.models.base_embedder.name,  # e.g., "BAAI/bge-base-en-v1.5"
    device=device,
    dtype=dtype
)
print(f"Loaded embedder: {config.models.base_embedder.name}")

# Custom synthesizer: your local .pt checkpoint
checkpoint_path = Path("checkpoints/your_model.pt")  # Full path to the .pt file
synthesizer = LocalCheckpointSynthesizer(
    checkpoint_path=str(checkpoint_path),
    device=device,
    dtype=dtype
)
print(f"Loaded synthesizer: {checkpoint_path.name}")
```
5. Create FinesseEvaluator and Run Benchmark
Inject your custom engines and execute the evaluation.
```python
import traceback

import numpy as np

from finesse_benchmark.scoring import (
    calculate_self_attestation_scores,
    calculate_self_attestation_scores_bottom_up,
)

# Initialize the evaluator with the injected engines
evaluator = FinesseEvaluator(
    embedder_engine=embedder,
    synthesizer_engine=synthesizer,
    config=config
)

# Run the raw evaluation
print("Generating raw embeddings...")
try:
    raw_data = evaluator.raw_run()
    print("Raw embeddings generated successfully")
except Exception as e:
    print(f"❌ Error in raw_run: {e}")
    print(traceback.format_exc())
    raise SystemExit(1)

# Extract length_results for scoring
length_results = raw_data.get('raw_results', {}).get('length_results', {})
if not length_results:
    raise SystemExit("No length results found. Skipping scoring.")

final_scores_per_length = {}
for target_length, raw in length_results.items():
    sample_results = raw.get('sample_results', [])
    if not sample_results:
        final_scores_per_length[target_length] = 0.0
        continue

    sample_scores = []
    for sample_dict in sample_results:
        probe_embeddings = sample_dict.get('chunk_embeddings')
        synthesis_embeddings = sample_dict.get('synthesis_embeddings')
        if probe_embeddings and synthesis_embeddings and len(probe_embeddings) >= 2:
            # Top-Down (contextual coherence) and Bottom-Up (chunk integrity) scores
            td_scores = calculate_self_attestation_scores(probe_embeddings, synthesis_embeddings)
            bu_scores = calculate_self_attestation_scores_bottom_up(probe_embeddings, synthesis_embeddings)
            avg_td = td_scores['contextual_coherence']
            avg_bu = bu_scores['bottom_up_coherence']
            imbalance = abs(avg_td - avg_bu)
            final_score = ((avg_td + avg_bu) / 2) - imbalance
            sample_scores.append(final_score)
        else:
            sample_scores.append(0.0)

    avg_length_score = np.mean(sample_scores) if sample_scores else 0.0
    final_scores_per_length[target_length] = round(avg_length_score * 500, 6)  # Scale by 500

# Calculate the average RSS
avg_rss = round(np.mean(list(final_scores_per_length.values())), 6)
print(f"Average RSS: {avg_rss}")
```
License
Apache 2.0 License. See LICENSE for details.
Acknowledgments
Built on insights from long-context evaluation research. Thanks to Hugging Face Transformers and Pydantic teams.