Long context evaluations - packaged by NVIDIA NeMo Evaluator


Competitive analysis instructions

Docker:

docker build -f docker/Dockerfile_oai -t ruler .

This should probably be baked into the container:

cd data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh
cd /workspace

For NVCF usage:

export API_KEY=...

If accessing a gated repo, you need to pass HF_TOKEN:

export HF_TOKEN=...

Set the model URL (this URL is used directly in requests), e.g.,

URL=https://integrate.api.nvidia.com/v1/chat/completions

Set the model name (this name is sent as model in the request body), e.g.,

MODEL="meta/llama-3.2-3b-instruct"

Set the tokenizer (used for token estimation during data generation) and optionally the tokenizer type (--tokenizer_type, needed if the tokenizer is not an HF one), e.g.,

TOKENIZER="meta-llama/Llama-3.2-3B-Instruct"
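
For a non-HF tokenizer, also set the tokenizer type. A hypothetical example using OpenAI's tiktoken encoding (the values are illustrative, mirroring the OpenAI entry in config_models.sh further below):

TOKENIZER="cl100k_base"
TOKENIZER_TYPE="openai" # passed via --tokenizer_type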

Example with NVCF

 python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER

Run all tasks (already completed tasks are skipped as long as they are present in the result directory and the model and max_seq_length do not change).

 python entrypoint.py --url $URL --tasks "all" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER

This runs through all 13 RULER tasks and creates a summary.csv file. The average over these 13 tasks, with num_samples = 500, is what we typically report for a given max_seq_length. The summary is written to:

<result_dir>/<model_path>/synthetic/<max_seq_length>/pred/summary.csv
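
To get the reported number, average the per-task scores in summary.csv. A minimal sketch, assuming the scores appear as numeric cells in the CSV (adjust the path and the parsing to the actual layout):

python - "<result_dir>/<model_path>/synthetic/<max_seq_length>/pred/summary.csv" <<'EOF'
import csv, statistics, sys

scores = []
with open(sys.argv[1]) as f:
    for row in csv.reader(f):
        for cell in row:
            try:
                scores.append(float(cell))
            except ValueError:
                pass  # skip task names and other non-numeric cells
print(f"average over {len(scores)} scores: {statistics.mean(scores):.2f}")
EOF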

Evaluation with num_samples 500 and max_seq_length 4000

 python entrypoint.py --url $URL --tasks "niah_single_1,niah_single_2" --result_dir "nvcf-test" --model $MODEL --mode "chat" --tokenizer_path $TOKENIZER --num_samples 500 --max_seq_length 4000

Example with vLLM chat #TODO

Example with vLLM completion #TODO

📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains code for our paper RULER: What’s the Real Context Size of Your Long-Context Language Models. RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity, and encompasses 13 tasks across 4 categories beyond simple retrieval from context. Here are our main results.

[Figure: main results]

  • Despite achieving nearly perfect performance on the vanilla NIAH test, all models exhibit large degradation on tasks in RULER as sequence length increases.
  • While all models claim a context size of 32K tokens or greater, only four of them can effectively handle sequence lengths of 32K by exceeding a qualitative threshold: Llama2-7b performance at 4K (85.6%).
  • Almost all models fall below the threshold before reaching the claimed context lengths.

💡 Requirements

  • Docker container: docker pull cphsieh/ruler:0.1.0
  • Our dependencies are listed in docker/Dockerfile and docker/requirements.txt. We use the following command to build our container on top of NVIDIA's PyTorch container nvcr.io/nvidia/pytorch:23.08-py3.
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .

🔍 Evaluate long-context LMs

1. Download data

cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh

2. Download model

  • Find and download the long-context language model from Huggingface.
  • Add the chat template of your downloaded model in scripts/data/template.py (a sketch follows this list).
  • (Optional) If you are using TensorRT-LLM, please build your model engine based on their example scripts (e.g., Llama) with their Docker container.
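
The exact contents of scripts/data/template.py can differ between releases; as a hedged sketch, a chat template is a format string that wraps the generated task prompt, keyed by the name you later use as MODEL_TEMPLATE_TYPE (the my-chat-model entry and its markup below are made-up placeholders):

# Hypothetical entry in scripts/data/template.py
Templates = {
    "base": "{task_template}",
    "my-chat-model": "<|user|>\n{task_template}\n<|assistant|>\n",
}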

3. Run evaluation pipeline

  • Set up run.sh
GPUS="" # number of GPUs for tensor parallelism
ROOT_DIR="" # the path that stores generated task samples and model predictions
MODEL_DIR="" # the path that contains individual model folders from Huggingface
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM
  • Set up config_models.sh
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="" # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="" # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY="" # your Gemini API key
        ;;
  • Start evaluation based on our default synthetic benchmark
bash run.sh YOUR_MODEL_NAME synthetic
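
For concreteness, a hypothetical filled-in HF entry for config_models.sh (the model name, folder, and template type are placeholders, not tested values):

    my-llama-chat)
        MODEL_PATH=${MODEL_DIR}/My-Llama-Chat
        MODEL_TEMPLATE_TYPE="meta-chat" # must be defined in scripts/data/template.py
        MODEL_FRAMEWORK="vllm"
        ;;

which you would then run with: bash run.sh my-llama-chat synthetic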

🧠 (Optional) Customize task complexity

All tasks in RULER are selected in scripts/config_tasks.sh. Each task configuration is defined in scripts/synthetic.yaml. To customize task complexity, RULER supports tasks across four categories, each with multiple configurable arguments, listed below by category and task name (an illustrative override follows the list).

Retrieval (niah)
    type_haystack: repeat / essay / needle
        # repeat: repeated noise sentences
        # essay: Paul Graham essays
        # needle: distractor needles
    type_needle_k: words / numbers / uuids
    type_needle_v: words / numbers / uuids
        # words: adjective-noun pairs
        # numbers: 7 digits
        # uuids: 32 digits
    num_needle_k: int >= 1
        # add multiple needles to the haystack
    num_needle_v: int >= 1
        # retrieve multiple values from a single key
    num_needle_q: int >= 1
        # retrieve multiple values from multiple keys

Multi-hop Tracing (variable_tracking)
    num_chains: int >= 1
        # number of variable name-binding chains
    num_hops: int >= 1
        # number of name-binding hops in each chain

Aggregation (common_words_extraction)
    freq_cw: int >= 1
        # frequency of common words
    freq_ucw: int >= 1
        # frequency of uncommon words
    num_cw: int >= 1
        # number of common words

Aggregation (freq_words_extraction)
    alpha: float > 1.0
        # When alpha is close to 1, the noise word appears less frequently and the difference in frequency among the top-frequent words decreases, thereby increasing the complexity. Increasing the number of top-frequent words to return also increases the difficulty of this task even for short sequences; we thus use 3 in our evaluations.

Question Answering (qa)
    dataset: squad or hotpotqa
        # the short-context QA dataset we use
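
For example, a harder niah variant could be added to scripts/synthetic.yaml roughly as follows; the task name is made up and the task/args layout is an assumption about the file's schema, so check an existing entry first:

niah_multikey_hard:        # hypothetical task name
  task: niah               # base task from the Retrieval category
  args:
    type_haystack: essay   # Paul Graham essays as noise
    type_needle_k: words
    type_needle_v: numbers
    num_needle_k: 4        # more keys in the haystack -> harder retrieval
    num_needle_v: 1
    num_needle_q: 1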

🚀 (Optional) Contribute a new synthetic task

1. Create a Python script for data preparation

  • Add basic arguments (required) and complexity configurations in the Python script.
  • Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
  • Save the script under folder scripts/data/synthetic.

2. Add task template

  • Add template and tokens_to_generate in scripts/data/synthetic/constants.py.
  • Add answer_prefix to prevent the model from refusing to answer.

3. Add evaluation metric

  • Add the automatic metric to evaluate your task in scripts/eval/synthetic/constants.py (a sketch follows these steps).

4. Add required configurations

  • Define your task name and complexity configurations in scripts/synthetic.yaml.
  • Add your task name in scripts/config_tasks.sh
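
As a hedged sketch of step 3 (the real structure of scripts/eval/synthetic/constants.py may differ, and both names below are made up), a simple string-match metric could look like:

# Hypothetical metric: percentage of reference strings found in the prediction.
def my_task_score(prediction: str, references: list) -> float:
    prediction = prediction.lower()
    hits = sum(ref.lower() in prediction for ref in references)
    return 100.0 * hits / max(len(references), 1)

# Register it under the new task's name (the mapping shape is an assumption).
TASK_METRICS = {"my_new_task": my_task_score}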

📝 Citation

@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  journal={arXiv preprint arXiv:2404.06654},
  year={2024},
}

Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.
