Long context evaluations - packaged by NVIDIA NeMo Evaluator
Project description
Competitive analysis instructions
Docker:
```
docker build -f docker/Dockerfile_oai -t ruler .
```
This may already be baked into the container:
```
cd data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh
cd /workspace
```
For NVCF usage:
```
export API_KEY=...
```
If accessing a gated repo, you also need to pass HF_TOKEN:
```
export HF_TOKEN=...
```
Set the model URL (this URL will be used directly in requests), e.g.:
```
URL=https://integrate.api.nvidia.com/v1/chat/completions
```
Set the model name (this name will be sent as `model` in the request body), e.g.:
```
MODEL="meta/llama-3.2-3b-instruct"
```
Set the tokenizer (used for token estimation during data generation) and, optionally, the tokenizer type via `--tokenizer_type` if it is not `hf`, e.g.:
```
TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"
```
Example with NVCF:
```
python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER
```
Run all tasks (already completed tasks will be skipped as long as they are present in the result directory and the model and max_seq_length do not change):
```
python entrypoint.py --url $URL --tasks "all" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER
```
This will go through all 13 RULER tasks and create a summary.csv file at:
```
<result_dir>/<model_path>/synthetic/<max_seq_length>/pred/summary.csv
```
The average of these 13 tasks, with num_samples = 500, is what we typically report for the given max_seq_length.
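If you want that reported number programmatically, here is a minimal sketch that averages the per-task scores from summary.csv. The layout assumed below (task names in the first row, scores in the second) is an assumption; check your generated file:
```
# Hedged sketch: average the per-task scores in summary.csv.
# ASSUMPTION: first row holds task names, second row holds scores;
# adjust the indexing to the file's actual layout.
import csv
from statistics import mean

path = "nvcf-test/<model_path>/synthetic/<max_seq_length>/pred/summary.csv"  # placeholder
with open(path) as f:
    rows = list(csv.reader(f))

scores = [float(x) for x in rows[1] if x.strip()]
print(f"RULER average over {len(scores)} tasks: {mean(scores):.2f}")
```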
Evaluation with num_samples 500 and max_seq_length 4000:
```
python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER --num_samples 500 --max_seq_length 4000
```
Example with vLLM chat: #TODO
Example with vLLM completion: #TODO
📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?
This repository contains code for our paper RULER: What’s the Real Context Size of Your Long-Context Language Models. RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity, and encompasses 13 tasks across 4 categories beyond simple retrieval from context. Here are our main results.
- Despite achieving nearly perfect performance on the vanilla NIAH test, all models exhibit large degradation on RULER tasks as sequence length increases.
- While all models claim a context size of 32K tokens or greater, only four of them can effectively handle sequence lengths of 32K, exceeding a qualitative threshold: Llama2-7B performance at 4K (85.6%).
- Almost all models fall below the threshold before reaching their claimed context lengths.
💡 Requirements
- Docker container: `docker pull cphsieh/ruler:0.1.0`
- Our dependencies are listed in `docker/Dockerfile` and `docker/requirements.txt`. We use the following command to build our container based on NVIDIA's PyTorch container `nvcr.io/nvidia/pytorch:23.08-py3`:
```
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .
```
🔍 Evaluate long-context LMs
1. Download data
- Paul Graham Essays for NIAH are downloaded from the NIAH GitHub repository and the Paul Graham Blog.
- QA datasets are downloaded from SQuAD and HotpotQA.
```
cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh
```
2. Download model
- Find and download the long-context language model from Hugging Face.
- Add the chat template of your downloaded model in `scripts/data/template.py` (a sketch of an entry is shown below).
- (Optional) If you are using TensorRT-LLM, build your model engine based on their example scripts (e.g., Llama) with their Docker container.
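A template entry might look roughly like the sketch below; the dict name, key, and special tokens are illustrative, so follow the existing entries in `scripts/data/template.py` for the real conventions:
```
# Hedged sketch of a chat-template entry for scripts/data/template.py.
# ASSUMPTION: templates map a name to a format string with a
# {task_template} placeholder; names and tokens here are illustrative.
Templates = {
    "my-model-chat": (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        "{task_template}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
}
```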
3. Run evaluation pipeline
- Setup `run.sh`:
```
GPUS="" # number of GPUs for tensor_parallel.
ROOT_DIR="" # the path that stores generated task samples and model predictions.
MODEL_DIR="" # the path that contains individual model folders from Hugging Face.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
```
- Setup `config_models.sh`:
```
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc., defined in scripts/data/template.py
        MODEL_FRAMEWORK="" # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc., defined in scripts/data/template.py
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="" # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY="" # your Gemini API key
        ;;
esac
```
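For instance, a filled-in entry for an HF model served with vLLM might look like this (the model name and folder are examples, not shipped defaults):
```
# Illustrative config_models.sh entry; names are examples only.
llama-3.2-3b-instruct)
    MODEL_PATH=${MODEL_DIR}/Llama-3.2-3B-Instruct
    MODEL_TEMPLATE_TYPE="meta-chat"  # must exist in scripts/data/template.py
    MODEL_FRAMEWORK="vllm"
    ;;
```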
- Start evaluation based on our default `synthetic` benchmark:
```
bash run.sh YOUR_MODEL_NAME synthetic
```
🧠 (Optional) Customize task complexity
All tasks in RULER are selected in `scripts/config_tasks.sh`, and each task's configuration is defined in `scripts/synthetic.yaml`. If you want to customize task complexity, RULER supports tasks across four categories, each of which has multiple configurable arguments, listed in the table below.
| Category | Task name | Configurations |
|---|---|---|
| Retrieval | niah | `type_haystack`: repeat/essay/needle (repeat: repeated noise sentences; essay: Paul Graham Essays; needle: distracting needles)<br>`type_needle_k`: words/numbers/uuids<br>`type_needle_v`: words/numbers/uuids (words: adjective-noun; numbers: 7 digits; uuids: 32 digits)<br>`num_needle_k`: int >= 1 (add multiple needles to the haystack)<br>`num_needle_v`: int >= 1 (retrieve multiple values from a single key)<br>`num_needle_q`: int >= 1 (retrieve multiple values from multiple keys) |
| Multi-hop Tracing | variable_tracking | `num_chains`: int >= 1 (number of variable name-binding chains)<br>`num_hops`: int >= 1 (number of name-binding hops in each chain) |
| Aggregation | common_words_extraction | `freq_cw`: int >= 1 (frequency of common words)<br>`freq_ucw`: int >= 1 (frequency of uncommon words)<br>`num_cw`: int >= 1 (number of common words) |
| Aggregation | freq_words_extraction | `alpha`: float > 1.0 (as alpha approaches 1, the noise words appear less frequently and the frequency gap among the top-frequent words shrinks, increasing complexity; returning more top-frequent words also makes the task harder even for short sequences, so we use 3 in our evaluations) |
| Question Answering | qa | `dataset`: squad or hotpotqa (the short-context QA dataset we use) |
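For concreteness, a task entry in `scripts/synthetic.yaml` might look roughly like the sketch below; check the shipped file for the exact schema and key names:
```
# Hedged sketch of a niah entry in scripts/synthetic.yaml;
# key names follow the table above, not necessarily the shipped schema.
niah_single_1:
  task: niah
  args:
    type_haystack: repeat
    type_needle_k: words
    type_needle_v: numbers
    num_needle_k: 1
    num_needle_v: 1
    num_needle_q: 1
```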
🚀 (Optional) Contribute a new synthetic task
1. Create a Python script for data preparation
- Add basic arguments (required) and complexity configurations in the Python script.
- Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
- Save the script under the folder `scripts/data/synthetic` (see the skeleton sketch after this list).
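A skeleton for such a script might look like the sketch below; the argument names mirror the requirements above and are illustrative rather than the repo's exact interface:
```
# Hedged skeleton for a new data-preparation script under
# scripts/data/synthetic/; argument names are illustrative.
import argparse
import json
import random

from transformers import AutoTokenizer  # used for length estimation


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer_path", required=True)
    parser.add_argument("--max_seq_length", type=int, required=True)
    parser.add_argument("--num_samples", type=int, default=500)
    parser.add_argument("--random_seed", type=int, default=42)
    args = parser.parse_args()

    # Reproducible given a tokenizer, a sequence length, and a seed.
    random.seed(args.random_seed)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)

    samples = []
    for i in range(args.num_samples):
        prompt = f"Example prompt {i}"  # build the real haystack here
        # Keep the tokenized prompt within max_seq_length.
        if len(tokenizer.encode(prompt)) > args.max_seq_length:
            continue  # or trim the haystack until it fits
        samples.append({"index": i, "input": prompt, "outputs": ["gold answer"]})

    with open("my_task.jsonl", "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")


if __name__ == "__main__":
    main()
```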
2. Add task template
- Add `template` and `tokens_to_generate` in `scripts/data/synthetic/constants.py`.
- Add `answer_prefix` to prevent the model from refusing to answer.
3. Add evaluation metric
- Add the automatic metric to evaluate your task in `scripts/eval/synthetic/constants.py`, as sketched below.
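As a hedged sketch of what that can look like (the registry shape and scoring-function signature here are assumptions, not the file's exact contents):
```
# Hedged sketch for scripts/eval/synthetic/constants.py.
# ASSUMPTION: tasks map to a scoring function over predictions and
# per-sample reference lists; the registry shape is illustrative.
def string_match_all(preds, refs):
    """Average fraction of reference strings found in each prediction."""
    scores = [
        sum(r.lower() in p.lower() for r in ref) / len(ref)
        for p, ref in zip(preds, refs)
    ]
    return 100.0 * sum(scores) / len(scores)


TASKS = {
    "my_new_task": {"metric_fn": string_match_all},
}
```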
4. Add required configurations
- Define your task name and complexity configurations in `scripts/synthetic.yaml`.
- Add your task name in `scripts/config_tasks.sh`.
📝 Citation
```
@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  year={2024},
  journal={arXiv preprint arXiv:2404.06654},
}
```
Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.
File details
Details for the file nvidia_long_context_eval-26.1-py3-none-any.whl.
File metadata
- Download URL: nvidia_long_context_eval-26.1-py3-none-any.whl
- Upload date:
- Size: 18.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `70e2b3a69047f174617cde23bf676ced7f66e2f825cff4be60533d373a6ba80b` |
| MD5 | `0c7f092d2af2a3dd82b33ac7f9f8962c` |
| BLAKE2b-256 | `b5c717a1cb6635c678f3bbd323b9f7068af92f4bc6e795fad4b4368047c148ea` |