Long context evaluations - packaged by NVIDIA NeMo Evaluator


Competitive analysis instructions

Docker:

docker build -f docker/Dockerfile_oai -t ruler .

This should probably be baked into the container:

cd data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh
cd /workspace

For NVCF usage:

export API_KEY=...

If accessing a gated repo, you need to pass HF_TOKEN:

export HF_TOKEN=...

Set the model URL (this URL is used directly in requests), e.g.,

URL=https://integrate.api.nvidia.com/v1/chat/completions

Set the model name (this name is sent as model in the request body), e.g.,

MODEL="meta/llama-3.2-3b-instruct"

Set the tokenizer (used for token estimation during data generation) and optionally the tokenizer type (--tokenizer_type, needed if the tokenizer is not an HF one), e.g.,

TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"
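
For a non-HF tokenizer, also set the tokenizer type. A hypothetical example using OpenAI's tiktoken encoding (the values are illustrative, mirroring the OpenAI entry in config_models.sh further below):

TOKENIZER="cl100k_base"
TOKENIZER_TYPE="openai" # passed via --tokenizer_type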

Example with NVCF

 python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER

Run all tasks (already completed tasks are skipped as long as they are present in the result directory and the model and max_seq_length do not change).

 python entrypoint.py --url $URL --tasks "all" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER

This runs through all 13 RULER tasks and creates a summary.csv file. The average over these 13 tasks, with num_samples = 500, is what we typically report for a given max_seq_length. The summary is written to:

<result_dir>/<model_path>/synthetic/<max_seq_length>/pred/summary.csv
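
To get the reported number, average the per-task scores in summary.csv. A minimal sketch, assuming the scores appear as numeric cells in the CSV (adjust the path and the parsing to the actual layout):

python - "<result_dir>/<model_path>/synthetic/<max_seq_length>/pred/summary.csv" <<'EOF'
import csv, statistics, sys

scores = []
with open(sys.argv[1]) as f:
    for row in csv.reader(f):
        for cell in row:
            try:
                scores.append(float(cell))
            except ValueError:
                pass  # skip task names and other non-numeric cells
print(f"average over {len(scores)} scores: {statistics.mean(scores):.2f}")
EOF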

Evaluation with num_samples 500 and max_seq_length 4000

 python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER --num_samples 500 --max_seq_length 4000

Example with vLLM chat #TODO

Example with vLLM completion #TODO

📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains code for our paper RULER: What’s the Real Context Size of Your Long-Context Language Models. RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity, and encompasses 13 tasks across 4 categories beyond simple retrieval from context. Here are our main results.

[Figure: main results]

  • Despite achieving nearly perfect performance on the vanilla NIAH test, all models exhibit large degradation on tasks in RULER as sequence length increases.
  • While all models claim a context size of 32K tokens or greater, only four of them can effectively handle sequence lengths of 32K by exceeding a qualitative threshold: Llama2-7b performance at 4K (85.6%).
  • Almost all models fall below the threshold before reaching the claimed context lengths.

💡 Requirements

  • Docker container: docker pull cphsieh/ruler:0.1.0
  • Our dependencies are listed in docker/Dockerfile and docker/requirements.txt. We use the following command to build our container on top of NVIDIA's PyTorch container nvcr.io/nvidia/pytorch:23.08-py3.
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .

🔍 Evaluate long-context LMs

1. Download data

cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh

2. Download model

  • Find and download the long-context language model from Huggingface.
  • Add the chat template of your downloaded model in scripts/data/template.py (a sketch follows this list).
  • (Optional) If you are using TensorRT-LLM, please build your model engine based on their example scripts (e.g., Llama) with their Docker container.
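
The exact contents of scripts/data/template.py can differ between releases; as a hedged sketch, a chat template is a format string that wraps the generated task prompt, keyed by the name you later use as MODEL_TEMPLATE_TYPE (the my-chat-model entry and its markup below are made-up placeholders):

# Hypothetical entry in scripts/data/template.py
Templates = {
    "base": "{task_template}",
    "my-chat-model": "<|user|>\n{task_template}\n<|assistant|>\n",
}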

3. Run evaluation pipeline

  • Set up run.sh
GPUS="" # number of GPUs for tensor parallelism
ROOT_DIR="" # the path that stores generated task samples and model predictions
MODEL_DIR="" # the path that contains individual model folders from Huggingface
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM
  • Set up config_models.sh
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="" # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="" # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY="" # your Gemini API key
        ;;
  • Start evaluation based on our default synthetic benchmark
bash run.sh YOUR_MODEL_NAME synthetic
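
For concreteness, a hypothetical filled-in HF entry for config_models.sh (the model name, folder, and template type are placeholders, not tested values):

    my-llama-chat)
        MODEL_PATH=${MODEL_DIR}/My-Llama-Chat
        MODEL_TEMPLATE_TYPE="meta-chat" # must be defined in scripts/data/template.py
        MODEL_FRAMEWORK="vllm"
        ;;

which you would then run with: bash run.sh my-llama-chat synthetic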

🧠 (Optional) Customize task complexity

All tasks in RULER are selected in scripts/config_tasks.sh. Each task configuration is defined in scripts/synthetic.yaml. To customize task complexity, RULER supports tasks across four categories, each with multiple configurable arguments, listed below by category and task name (an illustrative override follows the list).

Retrieval (niah)
    type_haystack: repeat / essay / needle
        # repeat: repeated noise sentences
        # essay: Paul Graham essays
        # needle: distractor needles
    type_needle_k: words / numbers / uuids
    type_needle_v: words / numbers / uuids
        # words: adjective-noun pairs
        # numbers: 7 digits
        # uuids: 32 digits
    num_needle_k: int >= 1
        # add multiple needles to the haystack
    num_needle_v: int >= 1
        # retrieve multiple values from a single key
    num_needle_q: int >= 1
        # retrieve multiple values from multiple keys

Multi-hop Tracing (variable_tracking)
    num_chains: int >= 1
        # number of variable name-binding chains
    num_hops: int >= 1
        # number of name-binding hops in each chain

Aggregation (common_words_extraction)
    freq_cw: int >= 1
        # frequency of common words
    freq_ucw: int >= 1
        # frequency of uncommon words
    num_cw: int >= 1
        # number of common words

Aggregation (freq_words_extraction)
    alpha: float > 1.0
        # When alpha is close to 1, the noise word appears less frequently and the difference in frequency among the top-frequent words decreases, thereby increasing the complexity. Increasing the number of top-frequent words to return also increases the difficulty of this task even for short sequences; we thus use 3 in our evaluations.

Question Answering (qa)
    dataset: squad or hotpotqa
        # the short-context QA dataset we use
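
For example, a harder niah variant could be added to scripts/synthetic.yaml roughly as follows; the task name is made up and the task/args layout is an assumption about the file's schema, so check an existing entry first:

niah_multikey_hard:        # hypothetical task name
  task: niah               # base task from the Retrieval category
  args:
    type_haystack: essay   # Paul Graham essays as noise
    type_needle_k: words
    type_needle_v: numbers
    num_needle_k: 4        # more keys in the haystack -> harder retrieval
    num_needle_v: 1
    num_needle_q: 1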

🚀 (Optional) Contribute a new synthetic task

1. Create a Python script for data preparation

  • Add basic arguments (required) and complexity configurations in the Python script.
  • Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
  • Save the script under folder scripts/data/synthetic.

2. Add task template

  • Add template and tokens_to_generate in scripts/data/synthetic/constants.py.
  • Add answer_prefix to prevent the model from refusing to answer.

3. Add evaluation metric

  • Add the automatic metric to evaluate your task in scripts/eval/synthetic/constants.py (a sketch follows these steps).

4. Add required configurations

  • Define your task name and complexity configurations in scripts/synthetic.yaml.
  • Add your task name in scripts/config_tasks.sh
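
As a hedged sketch of step 3 (the real structure of scripts/eval/synthetic/constants.py may differ, and both names below are made up), a simple string-match metric could look like:

# Hypothetical metric: percentage of reference strings found in the prediction.
def my_task_score(prediction: str, references: list) -> float:
    prediction = prediction.lower()
    hits = sum(ref.lower() in prediction for ref in references)
    return 100.0 * hits / max(len(references), 1)

# Register it under the new task's name (the mapping shape is an assumption).
TASK_METRICS = {"my_new_task": my_task_score}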

📝 Citation

@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  journal={arXiv preprint arXiv:2404.06654},
  year={2024},
}

Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.
