FoodEval: a benchmark for food-domain text embeddings
FoodEval
The food-domain benchmark for text embedding models.
Mainstream embedding benchmarks (MTEB, BEIR) contain no menu- or dish-level evaluations. The closest thing, NFCorpus, is medical-literature retrieval seeded from nutrition topics. Models that do well on those benchmarks stumble on food text. On our leaderboard, OpenAI's text-embedding-3-large, requested at d=384 through its native dimensions parameter, returns "egg white omelette" as its top hit for "egg free options" and "lassi" for "lactose free". Both queries score an NDCG@10 of 0.0, verifiable in its shipped result file. No model reaches 0.24 NDCG@10 on diet and allergen search, ours included.
On cross-lingual matching, two popular English-focused open-weight models (Nomic embed-text v1.5, Alibaba GTE-large-en v1.5) sit exactly at the all-positive floor, the F1 for calling every pair a match. That part is expected. The sharper result: the three dedicated multilingual models on the board (BAAI/bge-m3, multilingual-e5-large, Cohere multilingual-v3) clear the term-frequency baseline on that task by under 2.5 points of F1, while Voyage 4 Large, the strongest general-purpose model there, clears it by more than 11. bge-m3 prescribes no prefixes, so its margin is clean; the e5 row runs without its card's prefixes (see the protocol note below). FoodEval measures what generic benchmarks miss.
12 tasks. 5,868 evaluation examples. 26 menu classes. Graded relevance judgments over production menu data.
Published by Latimal.
Tasks
| Task | Type | Metric | Examples | What it measures |
|---|---|---|---|---|
| Food Search | Retrieval | NDCG@10 | 178 queries, 974 docs | Ranked retrieval across Indian, global, and beverage menus |
| Concept Search | Retrieval | NDCG@10 | 44 queries, 301 docs | Abstract food concept queries ("warm comfort food", "barbecue dish") |
| Diet/Allergen Search | Retrieval | NDCG@10 | 35 queries, 245 docs | Diet- and allergen-specific queries ("celiac friendly", "shellfish allergy safe") |
| Noisy Search | Retrieval | NDCG@10 | 46 queries, 304 docs | Misspelled and abbreviated queries ("bibimbab", "bbq burgeer") |
| Indian Match | Pair classification | Best F1 | 498 pairs | Same-dish detection for Indian cuisine menu items |
| Global Match | Pair classification | Best F1 | 318 pairs | Same-dish detection across global (non-Indian) cuisines |
| Beverage Match | Pair classification | Best F1 | 295 pairs | Same-drink detection across teas, coffees, juices, smoothies, and cocktails |
| Bakery Match | Pair classification | Best F1 | 336 pairs | Same-item detection for bakery and dessert items |
| Portion Size | Pair classification | Best F1 | 228 pairs | Distinguishing portion/size variants from genuinely different items |
| Noisy Menu Match | Pair classification | Best F1 | 323 pairs | Same-dish detection with markup noise, casing, and symbol artifacts |
| Cross-Lingual Match | Pair classification | Best F1 | 514 pairs | Cross-script, romanized, and bilingual dish matching |
| Cuisine Classification | Classification | Macro F1 | 3,053 items, 26 classes | Linear probe cuisine classification from frozen embeddings |
Quick Start
pip install "foodeval[local]"
Requires Python 3.10+. The [local] extra pulls in sentence-transformers (plus torch and einops). Bare pip install foodeval is enough for the lexical baseline and the REST-based API adapters (Cohere, Voyage, Gemini, Vertex AI); the OpenAI and Bedrock adapters need pip install "foodeval[api]".
Evaluate any sentence-transformers model in a few lines:
from foodeval.evaluate import run_benchmark
from foodeval.adapters.sentence_transformer import SentenceTransformerAdapter
adapter = SentenceTransformerAdapter("sentence-transformers/all-MiniLM-L6-v2", truncate_dim=384)
result = run_benchmark(adapter)
print(result.to_markdown())
Or from the command line:
foodeval run --model sentence-transformers/all-MiniLM-L6-v2 --dim 384
The MiniLM model is a 90 MB download, so this gives a first result table in minutes; swap in any leaderboard model, e.g. BAAI/bge-m3, when you want the real comparison.
Leaderboard
Nine dense models evaluated at d=384, plus a lexical baseline at its native term-frequency dimension. FoodEval Score = unweighted mean across all 12 tasks. The built-in adapters reach more providers than the launch board covers; rows for Gemini, Vertex AI, and Bedrock models come in as community submissions (see Contributing Results). Browse the board interactively at the FoodEval leaderboard Space; merged submissions appear there automatically.
A note on protocol, because the details change how you read the table. OpenAI is requested at 384 through its native dimensions parameter. The Latimal row is truncated to 384 the same way as the open-weight rows; its public API also serves 384 directly through a dimension parameter, which is what the reproduction script uses. Voyage is requested at 512, the narrowest width its API serves above 384, then truncated to 384 and re-normalized; every other dense model, Cohere included, is truncated to its first 384 dimensions and re-normalized. Truncation can understate a model's native-dimension quality, so read each row as that model's d=384 operating point, a width production deployments commonly pay for.
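The truncate-then-re-normalize step described above can be illustrated with a minimal, dependency-free sketch. The function name and the toy 1024-d vector are assumptions made for illustration; this is not the harness's own code.

```python
import math

def truncate_and_renormalize(vec, dim=384):
    """Keep the first `dim` components, then rescale to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

# Toy 1024-d unit vector standing in for a model's native embedding.
native = [1.0 / math.sqrt(1024)] * 1024
short = truncate_and_renormalize(native, dim=384)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), round(short_norm, 6))
```

Re-normalizing matters because a truncated prefix of a unit vector is generally shorter than unit length, and cosine-based scoring assumes comparable norms.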
Cohere and Voyage are called with their retrieval input types (queries as search queries, everything else as documents); Cohere's classification and clustering input types are not used, so every text passes through each API's retrieval treatment rather than a per-task one. Open-weight models are encoded as raw text with no instruction prefixes throughout, including models whose cards recommend or require them (multilingual-e5-large, Nomic embed-text v1.5); both cards prescribe prefixes for classification and clustering too, so those rows can be understated beyond retrieval. Prefixed runs are welcome as separate submissions (the package ships a PromptedAdapter wrapper for exactly this).
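As a rough illustration of what a prefixed run involves, here is a hypothetical wrapper that prepends one fixed string to every text before delegating to an inner adapter. PrefixedAdapter and EchoAdapter are invented names for this sketch and are not the package's PromptedAdapter, whose interface may differ. The "query: " string follows the multilingual-e5 model card's convention; because encode receives no query/document role, this sketch applies a single prefix to everything.

```python
class PrefixedAdapter:
    """Hypothetical wrapper: prepends one fixed prefix to every text."""

    def __init__(self, inner, prefix):
        self.inner = inner
        self.prefix = prefix

    @property
    def name(self):
        return f"{self.inner.name}+prefix"

    @property
    def dimension(self):
        return self.inner.dimension

    def encode(self, texts, batch_size=64, normalize=True):
        prefixed = [self.prefix + t for t in texts]
        return self.inner.encode(prefixed, batch_size=batch_size, normalize=normalize)


class EchoAdapter:
    """Stand-in model that records what it was asked to encode."""
    name = "echo"
    dimension = 2

    def __init__(self):
        self.seen = []

    def encode(self, texts, batch_size=64, normalize=True):
        self.seen.extend(texts)
        return [[1.0, 0.0] for _ in texts]


inner = EchoAdapter()
wrapped = PrefixedAdapter(inner, "query: ")
vectors = wrapped.encode(["egg free options"])
print(inner.seen)
```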
Latimal food-embed-v1 is our model. We publish the benchmark and hold a row on it. The row is measured locally with the same harness, encoded as raw text with no instruction prefixes, exactly like the open-weight rows; the public API is the reproduction route. The sub-tables below show where it loses: Voyage, OpenAI, and Cohere embed-v4 all outscore it on Search; OpenAI and Voyage outscore it on Classification. The no-training rule binds our row too. We screen our training corpora against the benchmark data for overlap, and the removal policy in CONTRIBUTING applies to our row as it does to any submission. Reproduce the row with scripts/run_latimal.py and an API key from latimal.com (self-serve; the free credit grant covers the full suite about ten times over). That script is the custom-adapter route for any model without a built-in adapter; its one caveat is in the reproducer notes below.
Overall
| Rank | Model | Dim | Search | Matching | Classification | FoodEval Score |
|---|---|---|---|---|---|---|
| 1 | Latimal food-embed-v1 | 384 | 0.4783 | 0.8513 | 0.7382 | 0.7176 |
| 2 | OpenAI text-embedding-3-large | 384 | 0.5535 | 0.7580 | 0.8329 | 0.6961 |
| 3 | Voyage 4 Large | 384 | 0.5577 | 0.7409 | 0.7895 | 0.6838 |
| 4 | Cohere embed-v4 | 384 | 0.5173 | 0.7407 | 0.7369 | 0.6659 |
| 5 | Nomic embed-text v1.5 | 384 | 0.4362 | 0.7390 | 0.7103 | 0.6357 |
| 6 | Alibaba GTE-large-en v1.5 | 384 | 0.4741 | 0.6993 | 0.7156 | 0.6256 |
| 7 | BAAI/bge-m3 | 384 | 0.4156 | 0.7182 | 0.7009 | 0.6159 |
| 8 | Lexical (TF) | 4096 | 0.2852 | 0.7278 | 0.6885 | 0.5770 |
| 9 | multilingual-e5-large | 384 | 0.3936 | 0.7038 | 0.3986 | 0.5750 |
| 10 | Cohere multilingual-v3 | 384 | 0.3897 | 0.6823 | 0.5063 | 0.5701 |
multilingual-e5-large and Nomic embed-text v1.5 run without the instruction prefixes their model cards prescribe; see the protocol note above.
Search
NDCG@10, 4 tasks.
| Rank | Model | Food | Concept | Diet | Noisy | Avg |
|---|---|---|---|---|---|---|
| 1 | Voyage 4 Large | 0.6779 | 0.5615 | 0.2378 | 0.7535 | 0.5577 |
| 2 | OpenAI text-embedding-3-large | 0.6905 | 0.5503 | 0.2163 | 0.7569 | 0.5535 |
| 3 | Cohere embed-v4 | 0.6439 | 0.4941 | 0.1829 | 0.7482 | 0.5173 |
| 4 | Latimal food-embed-v1 | 0.6133 | 0.4348 | 0.2006 | 0.6646 | 0.4783 |
| 5 | Alibaba GTE-large-en v1.5 | 0.6020 | 0.4686 | 0.2273 | 0.5986 | 0.4741 |
| 6 | Nomic embed-text v1.5 | 0.5750 | 0.3661 | 0.1577 | 0.6461 | 0.4362 |
| 7 | BAAI/bge-m3 | 0.5529 | 0.3364 | 0.1483 | 0.6247 | 0.4156 |
| 8 | multilingual-e5-large | 0.5364 | 0.3164 | 0.1385 | 0.5831 | 0.3936 |
| 9 | Cohere multilingual-v3 | 0.5134 | 0.3483 | 0.1371 | 0.5602 | 0.3897 |
| 10 | Lexical (TF) | 0.5353 | 0.2011 | 0.0887 | 0.3157 | 0.2852 |
Matching
Best F1, 7 tasks.
| Rank | Model | Indian | Global | Bev | Bakery | Portion | Noisy Menu | X-Lingual | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Latimal food-embed-v1 | 0.8165 | 0.8673 | 0.7461 | 0.7545 | 0.9722 | 0.9164 | 0.8862 | 0.8513 |
| 2 | OpenAI text-embedding-3-large | 0.7454 | 0.8284 | 0.7149 | 0.7349 | 0.8493 | 0.6850 | 0.7484 | 0.7580 |
| 3 | Voyage 4 Large | 0.7178 | 0.7831 | 0.7192 | 0.7154 | 0.7909 | 0.6400 | 0.8196 | 0.7409 |
| 4 | Cohere embed-v4 | 0.7316 | 0.8288 | 0.7095 | 0.6914 | 0.8354 | 0.6667 | 0.7214 | 0.7407 |
| 5 | Nomic embed-text v1.5 | 0.7305 | 0.7320 | 0.7153 | 0.6838 | 0.8548 | 0.7500 | 0.7069 | 0.7390 |
| 6 | Lexical (TF) | 0.6867 | 0.6868 | 0.7061 | 0.6824 | 0.8044 | 0.8215 | 0.7069 | 0.7278 |
| 7 | BAAI/bge-m3 | 0.7107 | 0.7160 | 0.7061 | 0.6837 | 0.8207 | 0.6736 | 0.7169 | 0.7182 |
| 8 | multilingual-e5-large | 0.6805 | 0.7157 | 0.7061 | 0.6879 | 0.7572 | 0.6482 | 0.7310 | 0.7038 |
| 9 | Alibaba GTE-large-en v1.5 | 0.7048 | 0.6950 | 0.7098 | 0.6824 | 0.7246 | 0.6715 | 0.7069 | 0.6993 |
| 10 | Cohere multilingual-v3 | 0.6941 | 0.6711 | 0.7061 | 0.6824 | 0.6687 | 0.6400 | 0.7136 | 0.6823 |
Classification
Macro F1, 1 task.
| Rank | Model | Cuisine | Avg |
|---|---|---|---|
| 1 | OpenAI text-embedding-3-large | 0.8329 | 0.8329 |
| 2 | Voyage 4 Large | 0.7895 | 0.7895 |
| 3 | Latimal food-embed-v1 | 0.7382 | 0.7382 |
| 4 | Cohere embed-v4 | 0.7369 | 0.7369 |
| 5 | Alibaba GTE-large-en v1.5 | 0.7156 | 0.7156 |
| 6 | Nomic embed-text v1.5 | 0.7103 | 0.7103 |
| 7 | BAAI/bge-m3 | 0.7009 | 0.7009 |
| 8 | Lexical (TF) | 0.6885 | 0.6885 |
| 9 | Cohere multilingual-v3 | 0.5063 | 0.5063 |
| 10 | multilingual-e5-large | 0.3986 | 0.3986 |
Reproduce Any Row
The exact command for every row. Run these from a checkout of the GitHub repo after pip install -e ".[all]"; the notes below cover where a rerun can differ from the published numbers. Export OPENAI_API_KEY, COHERE_API_KEY, and VOYAGE_API_KEY first for the API rows (see Running the Benchmark). Local models are deterministic on a given hardware and library stack. API rows are point-in-time: each result file carries its evaluation timestamp, and reproduction depends on the provider serving the same model it served then.
foodeval run --model lexical-tf --output results/bm25.json # legacy filename, kept to match the shipped file
foodeval run --model BAAI/bge-m3 --dim 384 --output results/bge_m3_384.json
foodeval run --model Alibaba-NLP/gte-large-en-v1.5 --dim 384 --output results/gte_large_v15_384.json
foodeval run --model intfloat/multilingual-e5-large --dim 384 --output results/e5_large_384.json
foodeval run --model nomic-ai/nomic-embed-text-v1.5 --dim 384 --output results/nomic_embed_v15_384.json
foodeval run --model openai:text-embedding-3-large --dim 384 --output results/openai_te3_large_384.json
foodeval run --model cohere:embed-v4.0 --dim 384 --output results/cohere_embed_v4_384.json
foodeval run --model cohere:embed-multilingual-v3.0 --dim 384 --output results/cohere_multilingual_v3_384.json
foodeval run --model voyage:voyage-4-large --dim 384 --output results/voyage_4_large_384.json
LATIMAL_API_KEY=... python3 scripts/run_latimal.py --dim 384 --output results/latimal_food_embed_v1_384.json
Three notes for reproducers. The published local rows were measured with sentence-transformers 5.3.0, torch 2.10.0, and scikit-learn 1.8.0 on Apple Silicon (MPS); if your numbers differ in the last decimals, check those versions first. Shipped result files carry curated display names, and their timings reflect cached re-aggregation rather than wall-clock runs; a rerun's model_name will be the adapter's raw name. Only the scores need to match, with one exception. For the Latimal row, the production /embed endpoint applies standard input normalization, so an API rerun scores above the published row wherever inputs carry markup noise. In our verification run the aggregate came back +0.012 (task deltas -0.004 to +0.037, the largest on Bakery Match). The published row was measured on raw text, like every other row.
To submit your model's results, see Contributing Results.
Task Descriptions
Food Search
178 queries across three menu domains (Indian, global, beverage) matched against a shared corpus of 974 items. Each query has graded relevance judgments (0-3), assigned by the Latimal team and audited in multiple passes (see Data Provenance). This is the core search quality task. Category queries like "appetizers and starters" require knowing what counts as one: "arancini," "bruschetta," and "edamame" are all relevant, and general web-text models cluster them only loosely. Even the strongest model on this task leaves headroom (top score 0.69 NDCG@10).
Concept Search
44 abstract concept queries ("barbecue dish," "biryani and rice dish," "warm comfort food," "crispy appetizer") matched against 301 corpus items. Concept search tests whether the model understands food categories or just matches on lexical overlap. A model that pairs "biryani" with "chicken biryani" on substring overlap alone will fail when "warm comfort food" has to map to "mac and cheese" or "rajma."
Diet/Allergen Search
35 queries centered on diet and allergen constraints ("celiac friendly," "shellfish allergy safe," "halal food"), plus a few nutritional-property and meal-type queries ("high protein," "iron rich dishes," "breakfast options"), matched against 245 items. Requires the model to understand dietary properties that are rarely stated explicitly in menu text. "Grilled salmon" is, as typically prepared, gluten-free; the model has to know that on its own, since nothing in the item name says so. Relevance grades reflect typical preparations: the task measures retrieval under a dietary constraint. It does not certify allergen safety.
Noisy Search
46 queries with realistic misspellings and abbreviations ("bbq burgeer," "bibimbab," "paner tikka") matched against 304 corpus items. Real users do not type menu item names correctly. This task measures robustness to the kind of input a production search bar actually receives. The Lexical (TF) baseline manages only 0.316 NDCG@10 here, less than half what the leading dense models score on the same queries; subword-tokenized models handle noise much better.
Indian Match
498 menu-item pairs drawn from Indian-anchored production menus, which also carry the pizza, fast-food, and cafe items real menus do. Tests same-dish detection for pairs like "Chef's Special Biryani" vs. "Biryani" (same dish) and "Chicken Tikka" vs. "Paneer Tikka" (different dishes, confusable names).
Global Match
318 pairs of menu items drawn from non-Indian cuisines. Tests same-dish detection for pairs like "Pad Thai Chicken" vs. "Chicken Pad Thai" (same dish) and "Gyoza" vs. "Shumai" (different dumplings, same category). Global food items have fewer transliteration variants but more cross-cuisine confusables.
Beverage Match
295 pairs of beverage items including teas, coffees, juices, smoothies, and cocktails. Beverages have their own deduplication challenges: "Aam Lassi" vs. "Mango Lassi" (same drink, bilingual naming), "Acai Smoothie" vs. "Acai Bowl" (same base, different product). Modifier sensitivity matters more for beverages than for food.
Bakery Match
336 pairs of bakery and dessert items. Tests same-item detection for pairs like "Almond Croissant" vs. "Croissant Almond" (same item, reordered) and "Apple Crumble" vs. "Apple Cider" (different items, shared prefix). Bakery items have high within-category similarity that produces false merges.
Portion Size
228 pairs testing whether the model can distinguish portion or size variants from genuinely different items. Examples include "Regular Pepsi" vs. "Large Pepsi" (same item, different size) and "Large Coffee" vs. "Large Smoothie" (same size word, different items). Models must learn that size modifiers do not change dish identity, while ingredient and preparation modifiers do.
Noisy Menu Match
323 pairs with markup noise, casing artifacts, and symbol clutter. Tests robustness to real-world menu formatting like ***HOT*** Spicy Ramen (Large) @¥980 🌶 vs. Spicy Ramen and #7 BUTTER CHICKEN 🔥 $14.99 [BESTSELLER] vs. Butter Chicken (same dish behind the decoration). Production menus are messy; models have to see through the noise.
Cross-Lingual Match
514 pairs across three categories: romanized (186 pairs), bilingual (173), and cross-script/CJK (155). Tests whether the model can match dishes across writing systems and transliterations. Examples include Beef Pho vs. Pho Bo (romanized), Aloo Gobi आलू गोभी vs. आलू गोभी (bilingual), and cross-script pairs mixing Latin, Devanagari, CJK, and other scripts. General multilingual models often key on "both are cross-lingual food text" rather than actual dish identity, producing false merges.
Cuisine Classification
3,053 items across 26 menu-taxonomy classes: mostly cuisines (North Indian, Italian, Mexican, Japanese), plus format categories like QSR and Street Food that real menu systems must handle. The evaluation trains a LogisticRegression probe on frozen embeddings with an 80/20 stratified split, repeated across 10 random seeds. Macro F1 is reported. This measures how well the embedding space separates cuisines. Low-frequency classes like SE Asian, Goan, and Ethiopian are deliberately included with only 30 to 35 examples each, because real menus contain these cuisines and a production system has to cope with them.
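To see why macro F1 is the natural metric when low-frequency classes are included on purpose, consider this small pure-Python sketch. It averages per-class F1 over the classes present in the true labels (an assumption of this sketch; library implementations may choose the label set differently), and the toy labels are illustrative, not benchmark data.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: a frequent class predicted perfectly, a rare class always missed.
y_true = ["North Indian"] * 8 + ["Goan"] * 2
y_pred = ["North Indian"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
```

Accuracy stays high when the rare class is ignored, while macro F1 drops sharply, because each class counts equally regardless of its size.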
Data Format
FoodEval stores all data as JSON files in foodeval/data/. Three schemas correspond to the three task types.
Retrieval Tasks (food_search, concept_search, diet_search, noisy_search)
{
  "task": "food_search",
  "version": "0.1.0",
  "description": "...",
  "metric": "ndcg@10",
  "corpus": ["butter chicken", "paneer tikka", "..."],
  "queries": [
    {
      "id": "q001",
      "query": "american burger and fries",
      "domain": "global",
      "relevance": {
        "cheeseburger": 3,
        "hot dog": 2,
        "bbq ribs": 1
      }
    }
  ],
  "metadata": {
    "n_queries": 178,
    "n_corpus": 974,
    "domains": ["beverage", "global", "indian"]
  }
}
Relevance grades: 3 = highly relevant, 2 = relevant, 1 = marginally relevant. Grade-0 items are omitted from the relevance map; anything absent from a query's map scores 0, the standard convention for incompletely judged collections. Suspected missing judgments are handled as versioned data fixes (see Report a data issue in CONTRIBUTING).
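The following sketch shows how a relevance map of this shape can feed NDCG@10 with exponential gains and log2 discounts, treating any item absent from the map as grade 0. Function names and the two toy rankings are assumptions for illustration; the harness's own implementation may differ in details.

```python
import math

def dcg(grades):
    """Exponential gain (2^rel - 1) with a log2 position discount."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(ranked_items, relevance, k=10):
    """ranked_items: corpus items in ranked order; relevance: item -> grade."""
    gains = [relevance.get(item, 0) for item in ranked_items[:k]]  # absent = 0
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

relevance = {"cheeseburger": 3, "hot dog": 2, "bbq ribs": 1}
perfect = ["cheeseburger", "hot dog", "bbq ribs", "butter chicken"]
swapped = ["butter chicken", "hot dog", "cheeseburger", "bbq ribs"]
print(ndcg_at_k(perfect, relevance), ndcg_at_k(swapped, relevance))
```

The ideal DCG is computed from the query's full relevance map, cut at the same k, so a perfect ranking scores 1.0 and any ranking that puts an unjudged item first scores lower.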
Pair Classification Tasks (indian_match, global_match, beverage_match, bakery_match, portion_size, noisy_menu_match, cross_lingual_match)
{
  "task": "indian_match",
  "version": "0.1.0",
  "description": "...",
  "metric": "best_f1",
  "pairs": [
    {
      "id": "p001",
      "text_a": "**BESTSELLER** Butter Chicken",
      "text_b": "Butter Chicken",
      "label": 1
    }
  ],
  "metadata": {
    "n_pairs": 498,
    "n_positive": 234,
    "n_negative": 264
  }
}
Labels: 1 = same dish, 0 = different dish. cross_lingual_match pairs additionally carry a domain field (romanized, bilingual, or cross_script), echoed in its metadata as domains.
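The metadata counts are enough to work out a task's all-positive floor by hand. Predicting every pair positive gives recall 1 and precision p (the positive rate), so F1 = 2p/(1+p). A small sketch using the indian_match counts shown above (the helper name is an assumption):

```python
def all_positive_f1(n_positive, n_pairs):
    """F1 when every pair is predicted positive: 2p / (1 + p)."""
    p = n_positive / n_pairs
    return 2 * p / (1 + p)

metadata = {"n_pairs": 498, "n_positive": 234, "n_negative": 264}

# Sanity check: the counts in the metadata block should add up.
assert metadata["n_positive"] + metadata["n_negative"] == metadata["n_pairs"]

floor = all_positive_f1(metadata["n_positive"], metadata["n_pairs"])
print(round(floor, 4))  # 0.6393
```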
Classification Task (cuisine_classify)
{
"task": "cuisine_classify",
"version": "0.1.0",
"description": "...",
"metric": "macro_f1",
"items": [
{
"id": "i001",
"text": "aaloo gobi",
"label": "North Indian",
"source": "indian"
}
],
"label_names": ["American", "Bengali", "Biryani", "..."],
"metadata": {
"n_items": 3053,
"n_classes": 26
}
}
Running the Benchmark
Local Models (sentence-transformers)
pip install "foodeval[local]"
# Run all tasks
foodeval run --model BAAI/bge-m3 --dim 384
# Run specific tasks
foodeval run --model BAAI/bge-m3 --dim 384 --tasks food_search,indian_match
# Save results to JSON
foodeval run --model BAAI/bge-m3 --dim 384 --output results/bge_m3_384.json
# Local model directory
foodeval run --model ./my-fine-tuned-model --dim 384
Note: the sentence-transformers adapter loads models with trust_remote_code=True (some architectures require it, e.g. GTE v1.5). Point it only at checkpoints you trust.
OpenAI Models
pip install "foodeval[api]"
export OPENAI_API_KEY=sk-...
foodeval run --model openai:text-embedding-3-large --dim 384
AWS Bedrock Models
pip install "foodeval[api]"
foodeval run --model bedrock:cohere.embed-multilingual-v3 --dim 384
foodeval run --model bedrock:amazon.titan-embed-text-v2:0 --dim 384
Uses your standard AWS credential chain (env vars, ~/.aws/credentials, or an instance role). Default region is us-east-1; override with AWS_BEDROCK_REGION.
Cohere and Voyage
Both adapters use the providers' REST APIs directly, so the base install is enough.
export COHERE_API_KEY=...
foodeval run --model cohere:embed-v4.0 --dim 384
export VOYAGE_API_KEY=...
foodeval run --model voyage:voyage-4-large --dim 384
The Voyage adapter paces requests for the free tier (3 requests per minute, about 21 seconds between calls; a full run takes roughly 30 minutes). Set VOYAGE_MIN_INTERVAL=0 for paid keys. The default endpoint is the MongoDB-hosted Voyage service; override with VOYAGE_BASE_URL.
Gemini and Vertex AI
Gemini and Vertex AI also run on the base install, calling Google's REST endpoints directly. Vertex authenticates with Application Default Credentials (gcloud auth application-default login).
export GEMINI_API_KEY=...
foodeval run --model gemini:gemini-embedding-2 --dim 384
export GOOGLE_CLOUD_PROJECT=...
foodeval run --model vertex:gemini-embedding-001 --dim 384
Lexical Baseline (TF)
pip install foodeval
foodeval run --model lexical-tf
The CLI key is lexical-tf (bm25 is accepted as a legacy alias); the implementation is hashed term-frequency vectors with cosine similarity, deliberately minimal (no IDF, no BM25-style length weighting). It answers one question: does plain term overlap solve the task?
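For intuition, a hashed term-frequency baseline of this kind can be sketched in a few lines. The whitespace tokenizer, lowercasing, and CRC32 hash below are assumptions chosen to keep the example deterministic and dependency-free; the shipped baseline may tokenize and hash differently. The 4096 bucket count matches the Dim column of the Lexical (TF) row.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 4096  # matches the Dim column of the Lexical (TF) row

def tf_vector(text):
    """Sparse hashed term-frequency vector: bucket index -> count."""
    tokens = text.lower().split()  # whitespace tokenizer: an assumption
    return Counter(zlib.crc32(tok.encode("utf-8")) % N_BUCKETS for tok in tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(bucket, 0) for bucket, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

same_bag = cosine(tf_vector("Butter Chicken"), tf_vector("chicken butter"))
shared_term = cosine(tf_vector("butter chicken"), tf_vector("butter naan"))
print(same_bag, shared_term)
```

Word order is ignored and a single shared token is enough to produce a non-zero score, which is why this baseline does well on reordered names and poorly on concept queries.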
Other Commands
# List all available tasks
foodeval list
# Inspect a task
foodeval info food_search
# Generate leaderboard from saved result files
foodeval leaderboard results/
# Check a training corpus for exact overlap with benchmark data
foodeval preflight --compare path/to/training_data/
# Plan or execute the baseline run matrix
foodeval matrix --list
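The idea behind an exact-overlap check can be sketched as a set intersection over lightly normalized strings. This is a hypothetical illustration, not the preflight command's implementation: the normalization (lowercasing, whitespace collapsing) and both function names are assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace; a deliberately simple assumption."""
    return " ".join(text.lower().split())

def exact_overlap(training_texts, benchmark_texts):
    """Return normalized strings that appear in both collections."""
    bench = {normalize(t) for t in benchmark_texts}
    return sorted({normalize(t) for t in training_texts} & bench)

# Toy inputs, not benchmark data.
benchmark = ["butter chicken", "paneer tikka", "american burger and fries"]
training = ["Butter  Chicken", "masala dosa", "PANEER TIKKA"]
hits = exact_overlap(training, benchmark)
print(hits)
```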
Custom Adapters
Implement the EmbeddingAdapter protocol to evaluate any model:
import numpy as np

from foodeval.evaluate import run_benchmark


class MyAdapter:
    @property
    def name(self) -> str:
        return "my-model-384d"

    @property
    def dimension(self) -> int:
        return 384

    def encode(
        self,
        texts: list[str],
        batch_size: int = 64,
        normalize: bool = True,
    ) -> np.ndarray:
        # Your encoding logic here
        # Return float32 array of shape (len(texts), self.dimension)
        ...


result = run_benchmark(MyAdapter())
result.to_json("results/my_model.json")
Evaluation Methodology
Metrics
- NDCG@10 for retrieval tasks. Normalized Discounted Cumulative Gain at rank 10, computed from graded relevance judgments with exponential gains (2^rel - 1) and log2 position discounts; the normalizer is the ideal DCG at the same rank-10 cut, taken over the query's complete relevance set (the standard definition), so a perfect ranking scores 1.0 on every query. All corpus items are ranked by cosine similarity against the query embedding. Mean NDCG@10 across all queries is the primary score.
- Best F1 for pair classification tasks. Cosine similarity is computed for each pair. A threshold sweep finds the operating point that maximizes F1 score. The sweep uses the actual observed similarity values as threshold candidates, guaranteeing the true optimal threshold is found. Two properties to keep in mind when reading Best F1. First, because the threshold is chosen on the evaluation pairs themselves, Best F1 is an oracle operating point and an upper bound on deployed F1; choosing the threshold this way is standard practice for pair classification. Second, the metric has a floor: predicting every pair positive yields 2p/(1+p), where p is the positive rate; on these tasks that floor ranges from 0.639 (Indian Match) to 0.707 (Cross-Lingual Match). A score at the floor (several appear in the leaderboard, e.g. 0.7061 on Beverage Match) means no similarity threshold separates same-dish from different-dish pairs better than calling every pair a match.
- Macro F1 for classification. A LogisticRegression probe is trained on frozen embeddings with an 80/20 stratified split. The experiment is repeated across 10 random seeds (42-51). Mean macro F1 across seeds is reported along with standard deviation.
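The Best F1 sweep from the list above can be sketched as follows. The function name, the "predict positive when similarity >= threshold" convention, and the toy similarities are assumptions for illustration; the harness may break ties or orient the comparison differently.

```python
def best_f1(similarities, labels):
    """Sweep observed similarity values as thresholds; predict 1 if sim >= t."""
    best_score, best_threshold = 0.0, None
    for t in sorted(set(similarities)):
        tp = sum(1 for s, y in zip(similarities, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(similarities, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(similarities, labels) if s < t and y == 1)
        denom = 2 * tp + fp + fn
        score = 2 * tp / denom if denom else 0.0
        if score > best_score:
            best_score, best_threshold = score, t
    return best_score, best_threshold

# Toy cosine similarities and same-dish labels, not benchmark data.
sims = [0.91, 0.84, 0.62, 0.58, 0.40]
labels = [1, 1, 0, 1, 0]
score, threshold = best_f1(sims, labels)
print(round(score, 4), threshold)
```

The lowest candidate threshold labels every pair positive, so the sweep can never return less than the all-positive floor 2p/(1+p); a model only beats the floor when some higher threshold separates the classes better.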
Aggregate Score
FoodEval Score is the unweighted mean of all 12 task main scores (1/12 each), stored as aggregate_score in result JSONs. That makes Matching, with 7 of the 12 tasks, the largest contributor to the composite. This is the flat-mean convention the original MTEB leaderboard popularized. It keeps the composite trivially recomputable: take the mean of the 12 main_score values. Hand-averaging the printed category cells can disagree in the last digit due to rounding. Category means for Search (mean NDCG@10 across 4 tasks), Matching (mean Best F1 across 7 tasks), and Classification (Macro F1) are reported alongside for capability-level comparison. Tasks have different scales (Best F1 has an all-positive floor of 0.639 to 0.707 on these tasks, while NDCG@10 on diet and allergen search tops out below 0.24), so the aggregate is best used for relative ranking rather than absolute interpretation.
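As a worked check, the composite for the Latimal row can be recomputed from the published per-task scores in the leaderboard tables, and compared with what hand-averaging the printed category cells yields. The variable names are illustrative.

```python
# Published Latimal food-embed-v1 task scores, copied from the leaderboard tables.
search = [0.6133, 0.4348, 0.2006, 0.6646]
matching = [0.8165, 0.8673, 0.7461, 0.7545, 0.9722, 0.9164, 0.8862]
classification = [0.7382]

task_scores = search + matching + classification
flat_mean = sum(task_scores) / len(task_scores)

# Hand-averaging the printed (already rounded) category cells instead:
# each entry is (printed category mean, number of tasks in the category).
printed = {"Search": (0.4783, 4), "Matching": (0.8513, 7), "Classification": (0.7382, 1)}
from_cells = sum(mean * n for mean, n in printed.values()) / 12

print(round(flat_mean, 4), round(from_cells, 4))  # 0.7176 0.7175
```

The flat mean over the 12 task scores reproduces the published 0.7176, while the category-cell route lands one unit lower in the last digit, which is the rounding disagreement mentioned above.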
Confidence Intervals
For retrieval tasks, 95% confidence intervals come from bootstrap resampling (1,000 iterations) of per-query NDCG@10. For matching tasks, pairs are resampled and F1 is recomputed at the chosen threshold. Classification reports mean and standard deviation across the 10 seeds instead. These appear in the detailed JSON output but are omitted from the leaderboard tables for readability.
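A minimal sketch of a seeded bootstrap over per-query scores follows. It assumes the percentile method and a seed of 0; the source states only that the resampling seed is fixed, so both choices, along with the function name and toy scores, are illustrative.

```python
import random

def bootstrap_ci(per_query_scores, n_iterations=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed keeps repeat runs identical
    n = len(per_query_scores)
    means = []
    for _ in range(n_iterations):
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_iterations)]
    upper = means[int((1 - alpha / 2) * n_iterations) - 1]
    return lower, upper

# Toy per-query NDCG@10 values, not benchmark data.
scores = [0.0, 0.0, 0.12, 0.25, 0.31, 0.4, 0.18, 0.0, 0.52, 0.22]
low, high = bootstrap_ci(scores)
mean = sum(scores) / len(scores)
print(round(low, 3), round(mean, 3), round(high, 3))
```

With only 35 to 178 queries per retrieval task, intervals of this kind can be wide, which is worth keeping in mind when two rows differ by a point or two of NDCG@10.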
Data Provenance
All items are individual menu-item names drawn from production menu data, then curated, relabeled, and deduplicated by the Latimal team. Relevance grades, pair labels, and cuisine labels were produced in-house and audited in multiple passes: label fixes, relevance-grade recalibration, and duplicate removal. The canonical copy of the dataset lives on the Hub at huggingface.co/datasets/latimal/foodeval (gated, agreement auto-approved). Labeling disputes surfaced by the community are handled through versioned data fixes: each task file carries a version field, and any fix large enough to move scores is flagged in the changelog (see Report a data issue).
Reproducibility
All evaluation data is fixed and ships with the package. Every stochastic component is seeded: the classification probe (10 fixed seeds, 42-51) and the bootstrap confidence intervals (fixed resampling seed), so repeat runs are stable given the same model weights and adapter configuration.
FAQ
Why food embeddings?
Food delivery platforms process millions of menu items across dozens of languages and writing systems. The core operations (search, deduplication, classification, recommendation) all depend on text embeddings. General-purpose models trained on web text miss food-specific semantics: that "Dal Makhani" and "Black Lentil Curry" are the same dish, that "warm comfort food" should return "mac and cheese" and "tomato soup," or that a menu item with ***HOT*** prefixed is the same dish as the one without it. These gaps cause real production failures.
Why not just use MTEB?
MTEB is the standard for general embedding evaluation. It contains no menu- or dish-level tasks (the closest, NFCorpus, retrieves medical literature for nutrition-seeded queries). Every model on our leaderboard, our own and the strongest general-purpose API models included, scores below 0.24 NDCG@10 on diet and allergen search, and two open-weight models sit at the all-positive floor on cross-lingual matching. FoodEval is complementary to MTEB: it measures domain-specific capabilities that general benchmarks do not surface.
Can I submit results?
Yes. Run the full benchmark at d=384, save the output JSON, and open a pull request with your result file; the steps are in Contributing Results.
What dimension should I use?
The leaderboard standardizes on d=384 so every model is compared at the same width. This is a practical operating point for production deployment (low memory, fast cosine search). On the board, every dense model runs at 384 regardless of whether it was trained for truncation (the protocol note above the leaderboard spells out how). Outside the leaderboard you can evaluate at any dimension by passing --dim to the CLI, or omit it for the model's native width.
Can I use this data for training?
No. FoodEval is licensed under CC-BY-NC-4.0 with an evaluation-only addendum. Using this data as training data for machine learning models is expressly prohibited. Without that restriction, the scores stop meaning anything.
Contributing Results
- Run the full benchmark (all 12 tasks) at d=384 for leaderboard inclusion.
- Save results: foodeval run --model your-model --dim 384 --output results/your_model_384.json.
- Open a PR to github.com/latimal/foodeval with the result JSON.
- Include the model name, source (Hugging Face ID, API, etc.), and any notable configuration.
- Confirm in the PR that the model was not trained, fine-tuned, or distilled on FoodEval data. Full requirements: CONTRIBUTING.md.
Citation
@misc{foodeval2026,
title = {FoodEval: A Benchmark for Food-Domain Text Embeddings},
author = {Patni, Aditya},
year = {2026},
url = {https://github.com/latimal/foodeval},
version = {0.1.0},
license = {CC-BY-NC-4.0 with evaluation-only addendum}
}
License
CC-BY-NC-4.0 with an evaluation-only addendum. See LICENSE for the license summary and addendum text, and creativecommons.org/licenses/by-nc/4.0/legalcode for the full legal code.
As licensor, Latimal grants two permissions on top of the license, recorded in LICENSE: running FoodEval to evaluate and compare models is permitted in any setting, including inside a company; and the evaluation code and tooling may be used commercially. What is never permitted, in any setting: using the data to train, fine-tune, or distill a model, or selling the data. The data may be redistributed only under these same terms, unmodified; the code carries the commercial-use permission above.