Agnostic-Verification with Embedding Ranking Tiers
A-VERT
A-VERT is a method for comparing LM generations to target responses. It is intended to replace the exact-match and logprob techniques normally used in benchmarks, which make evaluations diverge from real-world scenarios.
This repository is organized as follows:
- ./a_vert: Code for the a_vert library.
- ./lm-eval_tasks: lm-eval compatible tasks that use the a_vert library.
- ./notebooks: IPython notebooks used to produce the A-VERT paper results.
- ./examples: Example deployments using docker-compose for both the LLM and A-VERT models.
Installing
The package is available on pip:
pip install a_vert
Building
We use poetry to manage the package. To install it from source, run:
poetry install
Usage
To use a_vert, you need a deployed embedding or reranker model, with its access details set in the environment variables described below. See the examples folder for more detailed examples of deploying an LLM and an A-VERT model using docker-compose with vLLM.
Required Environment Variables
The following environment variables are required to use A-VERT:
- AVERT_MODEL_ENDPOINT: Endpoint of the embedding or reranker model (e.g., http://127.0.0.1:8000).
- AVERT_ENDPOINT_TYPE: Backend type, either vllm (OpenAI-compatible) or tei.
- AVERT_MODEL_NAME: The name of the served A-VERT model (required for vllm and openai endpoint types).
- AVERT_METHOD: Method to use, either rerank or embedding (required, no default value).
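As a quick pre-flight check, the requirements above can be sketched in Python. This is only an illustration: the helper name check_avert_env is hypothetical and is not part of the a_vert API, and the library may perform its own validation differently.

```python
import os

# Variable names come from the list above; this helper is a hypothetical
# sketch and is not part of the a_vert library.
REQUIRED_VARS = ("AVERT_MODEL_ENDPOINT", "AVERT_ENDPOINT_TYPE", "AVERT_METHOD")
ENDPOINT_TYPES_NEEDING_MODEL_NAME = {"vllm", "openai"}
VALID_METHODS = {"rerank", "embedding"}


def check_avert_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # AVERT_MODEL_NAME is only required for some endpoint types.
    endpoint_type = env.get("AVERT_ENDPOINT_TYPE", "")
    if endpoint_type in ENDPOINT_TYPES_NEEDING_MODEL_NAME and not env.get("AVERT_MODEL_NAME"):
        problems.append(f"AVERT_MODEL_NAME is required for endpoint type {endpoint_type!r}")

    # AVERT_METHOD has no default, so an unknown value is reported too.
    method = env.get("AVERT_METHOD")
    if method and method not in VALID_METHODS:
        problems.append(f"AVERT_METHOD must be one of {sorted(VALID_METHODS)}, got {method!r}")
    return problems
```

Calling check_avert_env() with no arguments inspects the current process environment, so it can be run before starting an evaluation to catch a missing variable early.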
Template Configuration
You can configure A-VERT templates in two ways:
Option 1: Use a predefined template (recommended)
Set AVERT_PROMPT_TEMPLATE to one of the following predefined templates:
- qwen3-reranker: For Qwen3-Reranker-0.6B-seq-cls, Qwen3-Reranker-4B-seq-cls, Qwen3-Reranker-8B-seq-cls
- empty: For gte-reranker-modernbert-base, jina-reranker-v2-base-multilingual, bge-reranker-v2-m3, gte-modernbert-base
- embedding-with-instruction: For Qwen3-Embedding-0.6B, Qwen3-Embedding-4B, Qwen3-Embedding-8B, multilingual-e5-large-instruct
Option 2: Provide custom templates
If you need custom templates, you must define both:
- AVERT_DOCUMENT_TEMPLATE: Custom document template string (use {document} as placeholder)
- AVERT_QUERY_TEMPLATE: Custom query template string (use {query} as placeholder)
Important: When using custom templates, both AVERT_DOCUMENT_TEMPLATE and AVERT_QUERY_TEMPLATE must be set. Setting only one will raise an error.
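The "both or neither" rule for custom templates can be illustrated with a small sketch. The function names below are hypothetical, and the plain string substitution is an assumption about how the placeholders are filled; the library's actual rendering may differ.

```python
def load_custom_templates(env):
    """Return (document_template, query_template), or None if neither is set.

    Hypothetical helper, not part of the a_vert API.
    """
    doc_t = env.get("AVERT_DOCUMENT_TEMPLATE")
    query_t = env.get("AVERT_QUERY_TEMPLATE")
    # Exactly one of the two being set is the error case described above.
    if bool(doc_t) != bool(query_t):
        raise ValueError(
            "AVERT_DOCUMENT_TEMPLATE and AVERT_QUERY_TEMPLATE must be set together"
        )
    if not doc_t:
        return None
    return doc_t, query_t


def render(template, **fields):
    """Fill {name} placeholders by plain substitution (an assumption)."""
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template


env = {
    "AVERT_DOCUMENT_TEMPLATE": "<Document>: {document}",
    "AVERT_QUERY_TEMPLATE": "<Query>: {query}",
}
doc_t, query_t = load_custom_templates(env)
print(render(doc_t, document="Paris"))
print(render(query_t, query="The capital of France is Paris."))
```

Plain replacement is used here instead of str.format so that any other braces in a template are left untouched.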
Dynamic Instruction Injection:
A-VERT supports dynamic, task-aware instruction injection at runtime:
- AVERT_INSTRUCTION_CONFIG_PATH: Path to a JSON file mapping task names to instruction strings (optional)
- AVERT_INSTRUCTION_PROMPT: Default instruction text when no task-specific instruction is found (optional, overrides JSON "default" key)
To use instruction injection:
- Include the {instruction} placeholder in either your document template or query template (but not both).
- Provide instructions via:
  - A JSON file (pointed to by AVERT_INSTRUCTION_CONFIG_PATH) containing task-specific instructions and optionally a "default" key
  - And/or a default instruction via AVERT_INSTRUCTION_PROMPT (takes precedence over JSON "default")
Instruction Precedence:
- Environment variable override: If AVERT_INSTRUCTION_PROMPT is set, it replaces the "default" key from the JSON file
- JSON default: If no env var is set, the "default" key from the JSON file is used
- Task-specific: At runtime, task-specific instructions from the JSON always override the default
Validation:
- At setup time, A-VERT validates that:
  - The {instruction} placeholder appears in at most one template
  - If any template contains {instruction}, a non-empty default instruction exists (from either AVERT_INSTRUCTION_PROMPT or JSON "default")
- At runtime, for each example:
  - If a task-specific instruction exists in the JSON map, it is used
  - Otherwise, the default instruction is used
  - The instruction is injected into the template containing the placeholder
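The precedence and validation rules above can be summarized in a short sketch. The function names and the instruction strings are hypothetical and only mirror the rules as described; the JSON map is shown as an in-memory dict rather than a file read from AVERT_INSTRUCTION_CONFIG_PATH.

```python
PLACEHOLDER = "{instruction}"


def resolve_default(json_map, env_prompt=None):
    """AVERT_INSTRUCTION_PROMPT, when set, replaces the JSON "default" key."""
    return env_prompt if env_prompt else json_map.get("default")


def validate_setup(document_template, query_template, default_instruction):
    """Setup-time checks: one placeholder at most, and a non-empty default."""
    in_doc = PLACEHOLDER in document_template
    in_query = PLACEHOLDER in query_template
    if in_doc and in_query:
        raise ValueError("{instruction} may appear in at most one template")
    if (in_doc or in_query) and not default_instruction:
        raise ValueError("a non-empty default instruction is required")


def instruction_for(task_name, json_map, default_instruction):
    """Runtime lookup: a task-specific entry always wins over the default."""
    task_specific = {k: v for k, v in json_map.items() if k != "default"}
    return task_specific.get(task_name, default_instruction)


# Hypothetical contents of the JSON instruction file.
json_map = {
    "default": "Judge whether the document answers the query.",
    "babi-task_01-single_supporting_fact": "Judge whether both texts name the same location.",
}
default = resolve_default(json_map, env_prompt=None)
validate_setup("{instruction}\n<Document>: {document}", "<Query>: {query}", default)
print(instruction_for("babi-task_01-single_supporting_fact", json_map, default))
print(instruction_for("some-other-task", json_map, default))
```

The first print shows the task-specific instruction and the second falls back to the JSON "default" entry, since no environment override was passed.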
You must either set AVERT_PROMPT_TEMPLATE, or provide custom templates via AVERT_DOCUMENT_TEMPLATE and AVERT_QUERY_TEMPLATE. If neither is set, an error will be raised.
Additional Configuration
Grouping Method:
- AVERT_GROUPING: Method to aggregate distances from multiple candidates (optional, defaults to max)
  - Available static methods: max, mean
  - Available dynamic methods: mean_top_k_<k> where <k> is an integer (e.g., mean_top_k_3, mean_top_k_5)
  - Example:
    export AVERT_GROUPING="mean_top_k_5"
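To make the grouping names concrete, here is a minimal sketch of how such aggregations could be computed over the scores of one candidate group. It assumes that higher values are better and that "top k" means the k largest values; neither assumption is confirmed by this README, and group_scores is a hypothetical name, not the a_vert implementation.

```python
import re


def group_scores(scores, method="max"):
    """Aggregate a non-empty list of per-candidate scores into one value."""
    if not scores:
        raise ValueError("scores must not be empty")
    if method == "max":
        return max(scores)
    if method == "mean":
        return sum(scores) / len(scores)

    # Dynamic method: mean_top_k_<k>, where <k> is a positive integer.
    match = re.fullmatch(r"mean_top_k_(\d+)", method)
    if match:
        k = int(match.group(1))
        if k < 1:
            raise ValueError("k must be at least 1")
        # Assumption: "top" means largest; fewer than k scores averages them all.
        top = sorted(scores, reverse=True)[:k]
        return sum(top) / len(top)
    raise ValueError(f"unknown grouping method: {method!r}")


scores = [0.2, 0.9, 0.4]
print(group_scores(scores, "max"))
print(group_scores(scores, "mean"))
print(group_scores(scores, "mean_top_k_2"))
```

Under these assumptions, max rewards a group for its single best candidate, while mean_top_k_<k> is less sensitive to one outlier match.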
Enhancement:
- AVERT_ENHANCE: Whether to enhance candidate groups, true or false (optional, defaults to true)
  - Example:
    export AVERT_ENHANCE="true"
Logging:
- AVERT_LOG_LEVEL: Control logging verbosity (optional, defaults to WARNING)
  - Available levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  - Example:
    export AVERT_LOG_LEVEL="DEBUG"
Example Configuration
# Required
export AVERT_MODEL_ENDPOINT="http://localhost:8000"
export AVERT_ENDPOINT_TYPE="vllm"
export AVERT_MODEL_NAME="avert-model"
export AVERT_METHOD="rerank" # or "embedding"
# Template (choose one approach)
export AVERT_PROMPT_TEMPLATE="qwen3-reranker"
# OR use custom templates:
# export AVERT_DOCUMENT_TEMPLATE="<Document>: {document}"
# export AVERT_QUERY_TEMPLATE="<Query>: {query}"
# Optional: Task-aware instruction injection
# export AVERT_INSTRUCTION_CONFIG_PATH="./examples/instructions.example.json"
# export AVERT_INSTRUCTION_PROMPT="Determine whether the document is relevant to the query."
# Note: If using {instruction} placeholder in templates, you must provide at least a default instruction
# Additional (optional)
export AVERT_GROUPING="max"
export AVERT_ENHANCE="true"
Running a Test Task
To run a test task (2 examples, fast), just use the lm-eval library and add the provided tasks as additional tasks:
lm_eval \
--model local-chat-completions \
--tasks babi-task_01-single_supporting_fact \
--model_args '{"base_url":"http://localhost:8001/v1/chat/completions","timeout":"600","max_retries":3,"tokenized_requests":false, "model":"llm-model"}' \
--num_fewshot 0 \
--apply_chat_template \
--trust_remote_code \
--include_path ./lm-eval_tasks \
--limit 2
Note: please adjust base_url and model in model_args to point to your LLM endpoint if you are using a custom setup.
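When the endpoint or model name changes often, hand-editing the JSON inside model_args is error-prone. The sketch below builds the same command in Python with json.dumps, so the quoting is always valid. The helper name build_lm_eval_command is hypothetical; every flag and value is copied from the command above.

```python
import json
import shlex


def build_lm_eval_command(base_url, model, task, limit=2):
    """Return the lm_eval invocation shown above as an argument list."""
    model_args = {
        "base_url": base_url,
        "timeout": "600",
        "max_retries": 3,
        "tokenized_requests": False,
        "model": model,
    }
    return [
        "lm_eval",
        "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", json.dumps(model_args),
        "--num_fewshot", "0",
        "--apply_chat_template",
        "--trust_remote_code",
        "--include_path", "./lm-eval_tasks",
        "--limit", str(limit),
    ]


cmd = build_lm_eval_command(
    base_url="http://localhost:8001/v1/chat/completions",
    model="llm-model",
    task="babi-task_01-single_supporting_fact",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the AVERT_* environment variables are exported and both model endpoints are running.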
Paper
The paper is available on arXiv: https://arxiv.org/abs/2510.01469.
In order to reproduce the results, please download the test data from here (~480 MB) and extract it in the root of this repository (approx 3.1 GB of space is needed). The notebooks will look for it at: ./data.
Abstract:
The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response classification relies on methods that are too expensive (i.e. LLM-as-a-Judge) or that are far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
Cite as:
@misc{aguirre2025avertagnosticverificationembedding,
title={A-VERT: Agnostic Verification with Embedding Ranking Targets},
author={Nicolás Aguirre and Ramiro Caso and Ramiro Rodríguez Colmeiro and Mauro Santelli and Joaquín Toranzo Calderón},
year={2025},
eprint={2510.01469},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.01469},
}