Ecommerce search relevance evaluation tool

These details have not been verified by PyPI

Project links

Project description

veritail

LLM evals framework tailored for ecommerce search.

veritail scores every query-result pair, computes IR metrics from those scores, and runs deterministic quality checks — all in a single command. Run it on every release to track search quality, or compare two configurations side by side to measure the impact of a change before it ships.

Five evaluation layers:

LLM-as-a-Judge scoring — every query-result pair scored 0-3 with structured reasoning, using any cloud or local model
IR metrics — NDCG, MRR, MAP, Precision, and attribute match computed from LLM scores
Deterministic quality checks — low result counts, near-duplicate results, out-of-stock ranking issues, price outliers, and more
Autocorrect evaluation — catches intent-altering or unnecessary query corrections
Autocomplete evaluation — deterministic checks and LLM-based semantic evaluation for type-ahead suggestions

Includes 14 built-in ecommerce verticals for domain-aware judging, with support for custom vertical context and rubrics. Optional Langfuse integration for full observability — every judgment, score, and LLM call traced and grouped by evaluation run.

Search relevance evaluation demo

LLM-as-a-Judge scores every query-result pair, computes NDCG/MRR/MAP/Precision, runs deterministic checks, and evaluates autocorrect behavior.

Quick Start

1. Install

pip install veritail                   # OpenAI + local models (default)
pip install veritail[anthropic]        # + Claude support
pip install veritail[gemini]           # + Gemini support
pip install veritail[cloud]            # all three cloud providers
pip install veritail[cloud,langfuse]   # everything

The base install includes the OpenAI SDK because it doubles as the client for OpenAI-compatible local servers (Ollama, vLLM, LM Studio, etc.) — so pip install veritail works with both cloud and local models out of the box.

2. Bootstrap starter files (recommended)

veritail init

This generates:

adapter.py with a real HTTP request skeleton for both search() and suggest() (endpoint, auth header, timeout, JSON parsing)
queries.csv with example search queries (query types are automatically classified by the LLM during evaluation)
prefixes.csv with example prefixes (prefix types are automatically inferred from character count)

By default, existing files are not overwritten. Use --force to overwrite.

3. Create a query set (manual option)

query
red running shoes
wireless earbuds
nike air max 90

Optional columns: type (navigational, broad, long_tail, attribute) and category. When omitted, type is automatically classified by the LLM judge before evaluation.

4. Generate queries with an LLM (alternative)

If you don't have query logs yet, let an LLM generate a starter set:

# From a built-in vertical
veritail generate-queries --vertical electronics --output queries.csv --llm-model gpt-4o

# From business context
veritail generate-queries --context "B2B industrial fastener distributor" --output queries.csv --llm-model gpt-4o

# Both vertical and context, custom count
veritail generate-queries \
  --vertical foodservice \
  --context "BBQ restaurant equipment supplier" \
  --output queries.csv \
  --count 50 \
  --llm-model gpt-4o

This writes a CSV with query, type, category, and source columns. Review and edit the generated queries before running an evaluation — the file is designed for human-in-the-loop review.

Cost note: Query generation makes a single LLM call (a fraction of a cent with most cloud models).

5. Create an adapter (manual option)

# my_adapter.py
from veritail import SearchResponse, SearchResult


def search(query: str) -> SearchResponse:
    results = my_search_api.query(query)
    items = [
        SearchResult(
            product_id=r["id"],
            title=r["title"],
            description=r["description"],
            category=r["category"],
            price=r["price"],
            position=i,
            in_stock=r.get("in_stock", True),
            attributes=r.get("attributes", {}),
        )
        for i, r in enumerate(results)
    ]
    return SearchResponse(results=items)
    # To report autocorrect / "did you mean" corrections:
    # return SearchResponse(results=items, corrected_query="corrected text")

Adapters can return either SearchResponse or a bare list[SearchResult] (backward compatible). Use SearchResponse when your search engine returns autocorrect information.

6. Run evaluation

export OPENAI_API_KEY=sk-...

veritail run \
  --queries queries.csv \
  --adapter my_adapter.py \
  --llm-model gpt-4o \
  --top-k 10 \
  --open

For a detailed breakdown of API call volume and cost control options, see LLM Usage & Cost.

Outputs are written under:

eval-results/<generated-or-custom-config-name>/

7. Compare two search configurations

veritail run \
  --queries queries.csv \
  --adapter bm25_search_adapter.py --config-name bm25-baseline \
  --adapter semantic_search_adapter.py --config-name semantic-v2 \
  --llm-model gpt-4o

The comparison report shows metric deltas, overlap, rank correlation, and position shifts.

Vertical Guidance

--vertical injects domain-specific scoring guidance into the judge prompt. Each vertical teaches the LLM judge what matters most in a particular ecommerce domain — the hard constraints, industry jargon, certification requirements, and category-specific nuances that generic relevance scoring would miss.

Choose the vertical that best matches the ecommerce site you are evaluating.

Vertical	Description	Example retailers
`automotive`	Aftermarket, OEM, and remanufactured parts for cars, trucks, and light vehicles	RockAuto, AutoZone, FCP Euro
`beauty`	Skincare, cosmetics, haircare, fragrance, and body care	Sephora, Ulta Beauty, Dermstore
`electronics`	Consumer electronics and computer components	Best Buy, Newegg, B&H Photo
`fashion`	Clothing, shoes, and accessories	Nordstrom, ASOS, Zappos
`foodservice`	Commercial kitchen equipment and supplies for restaurants, cafeterias, and catering	WebstaurantStore, Katom, TigerChef
`furniture`	Furniture and home furnishings for residential, commercial, and contract use	Wayfair, Pottery Barn, IKEA
`groceries`	Online grocery retail covering food, beverages, and household essentials	Instacart, Amazon Fresh, FreshDirect
`home-improvement`	Building materials, hardware, plumbing, electrical, and tools for contractors and DIY	Home Depot, Lowe's, Menards
`industrial`	Industrial supply and MRO (Maintenance, Repair, and Operations)	Grainger, McMaster-Carr, Fastenal
`marketplace`	Multi-seller marketplace platforms	Amazon, eBay, Etsy
`medical`	Medical and surgical supplies for hospitals, clinics, and home health	Henry Schein, Medline, McKesson
`office-supplies`	Office products, ink/toner, paper, and workspace equipment	Staples, Office Depot, W.B. Mason
`pet-supplies`	Pet food, treats, toys, health products, and habitat equipment across all species	Chewy, PetSmart, Petco
`sporting-goods`	Athletic equipment, apparel, and accessories across all sports and outdoor activities	Dick's Sporting Goods, REI, Academy Sports

You can also provide a custom vertical as a plain text file with --vertical ./my_vertical.txt. Use the built-in verticals in src/veritail/verticals/ as templates.

Use --context to layer enterprise-specific rules on top of a vertical — things like brand priorities, certification requirements, or domain jargon unique to your store. See Custom Rubrics & Enterprise Context for details.

Examples:

# Built-in vertical
veritail run \
  --queries queries.csv \
  --adapter my_adapter.py \
  --vertical foodservice

# Custom vertical text file
veritail run \
  --queries queries.csv \
  --adapter my_adapter.py \
  --vertical ./my_vertical.txt

# Vertical + enterprise-specific rules
veritail run \
  --queries queries.csv \
  --adapter my_adapter.py \
  --vertical home-improvement \
  --context "Pro contractor supplier. Queries for lumber should always prioritize pressure-treated options."

# Vertical + detailed business context from a file
veritail run \
  --queries queries.csv \
  --adapter my_adapter.py \
  --vertical home-improvement \
  --context context.txt

More Reports

Evaluate autocomplete suggestions

Autocomplete evaluation demo

Deterministic checks (duplicates, prefix coherence, encoding) and LLM-based semantic scoring for suggestion relevance and diversity.

Side-by-side comparison

Side-by-side comparison demo

Two search configurations compared head-to-head: per-query NDCG deltas, win/loss/tie analysis, rank correlation, and result overlap.

Langfuse observability

Langfuse observability demo

Every judgment, score, and LLM call traced and grouped by evaluation run — with full prompt/response visibility.

Documentation

Guide	Description
Evaluation Model	LLM judgment scoring, deterministic checks, and IR metrics
Supported LLM Providers	Cloud providers, local model servers, and model quality guidance
LLM Usage & Cost	API call volume breakdown and cost control strategies
Batch Mode & Resume	50% cost reduction via batch APIs and resuming interrupted runs
Autocorrect Evaluation	Evaluating query correction quality
Autocomplete Evaluation	Type-ahead suggestion evaluation with checks and LLM scoring
Custom Rubrics & Enterprise Context	Custom scoring rubrics and business-specific evaluation rules
Custom Checks	Adding domain-specific deterministic checks
CLI Reference	Complete flag reference for all commands
Backends	File and Langfuse storage backends
Development	Local development setup and running tests

Disclaimer

veritail uses large language models to generate relevance judgments. LLM outputs can be inaccurate, inconsistent, or misleading. All scores, reasoning, and reports produced by this tool should be reviewed by a qualified human before informing production decisions. veritail is an evaluation aid, not a substitute for human judgment. The authors are not liable for any decisions made based on its output or for any API costs incurred by running evaluations. Users are responsible for complying with the terms of service of any LLM provider they use with this tool. Evaluation data is sent to the configured LLM provider for scoring — use a local model if data must stay on-premise. Adapter modules, custom check modules, and custom rubric files are loaded and executed as Python code at runtime — only run files you trust. Evaluation results, including product catalog data, are written to disk in plaintext under the output directory (eval-results/ by default) — ensure this directory is excluded from version control and not stored in shared or publicly accessible locations.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.5.1

Mar 15, 2026

0.5.0

Mar 14, 2026

0.4.2

Mar 4, 2026

0.4.1

Mar 3, 2026

0.4.0

Mar 2, 2026

0.3.1

Feb 28, 2026

0.3.0

Feb 28, 2026

0.2.2

Feb 23, 2026

0.2.1

Feb 23, 2026

0.2.0

Feb 22, 2026

This version

0.1.1

Feb 21, 2026

0.1.0

Feb 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

veritail-0.1.1.tar.gz (44.0 MB view details)

Uploaded Feb 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

veritail-0.1.1-py3-none-any.whl (167.2 kB view details)

Uploaded Feb 21, 2026 Python 3

File details

Details for the file veritail-0.1.1.tar.gz.

File metadata

Download URL: veritail-0.1.1.tar.gz
Upload date: Feb 21, 2026
Size: 44.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for veritail-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`f53e5369d60899d4913af63fa9a0a44835b9700837628c56b5c7deaf031173d3`
MD5	`e8660c15ecb6e6a5443ee7299ba37436`
BLAKE2b-256	`09bee68bc4d6c6c6952a33dfb36c30b54e309f00981e267f0f8a880669d99914`

See more details on using hashes here.

File details

Details for the file veritail-0.1.1-py3-none-any.whl.

File metadata

Download URL: veritail-0.1.1-py3-none-any.whl
Upload date: Feb 21, 2026
Size: 167.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for veritail-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ecd1e30a0eacae5f1f060f257c580070c72216a2e86cd427d121d1df2085fef6`
MD5	`64862cbef38bce119f48874e742a58e7`
BLAKE2b-256	`ff10ea7bb10807ce5426fcde8ca1ab3aa673f675fbea0bcd46869341e7817682`

See more details on using hashes here.

veritail 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

veritail

Quick Start

1. Install

2. Bootstrap starter files (recommended)

3. Create a query set (manual option)

4. Generate queries with an LLM (alternative)

5. Create an adapter (manual option)

6. Run evaluation

7. Compare two search configurations

Vertical Guidance

More Reports

Evaluate autocomplete suggestions

Side-by-side comparison

Langfuse observability

Documentation

Disclaimer

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes