A plug-and-play, comprehensive evaluation framework for long-context models


LOOM-Scope: LOng-cOntext Model evaluation framework


📣 Latest News !!

  • [2025/04] We release LOOM-Scope, a convenient and comprehensive framework for long-context model evaluation.
  • [2025/05] We have updated the LLM leaderboard to reflect the latest advancements in large language model performance.

🔍 Overview

Key Features:

  • Single-line command is all you need

    • Single-line command automatically detects datasets and models, seamlessly handling download, implementation, and evaluation.
  • Comprehensive Long-context Benchmarks

    • 15 standard long-context benchmarks for LLMs, with hundreds of subtasks and variants implemented.
  • Efficient Inference

    • Support for fast and memory-efficient inference with vLLM.
    • 12 computationally efficient inference (acceleration) methods, such as CakeKV and FlexPrefill.
  • Model Compatibility

    • Support for Transformers, vLLM, and RWKV backends via the --server option.
  • Reproducibility

    • Publicly available runtime metrics and results across different platforms (including 24GB 3090, 40GB A100, and 96GB H20 GPUs).
    • Publicly available prompts for different benchmarks.
    • WebUI (Gradio) available for users.

🪜 LLM leaderboard

| Rank | Model | Avg Score | L_CiteEval | LEval | LooGLE | RULER (0-128k) | longbench | babilong (0-128k) | Counting-Stars | LongIns | LVEval | longbench_v2 | NIAH | NThread | InfiniteBench | LongWriter | LIBRA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-30B-A3B | 46.08 | 37.96 | 40.61 | 11.61 | 78.32 | 43.24 | 60.31 | 48.96 | 41.30 | 22.82 | 28.42 | 100.00 | 24.12 | 14.14 | 83.24 | 56.09 |
| 2 | Qwen3-14B | 45.97 | 35.64 | 43.84 | 11.79 | 74.94 | 45.47 | 59.15 | 56.41 | 31.95 | 21.26 | 29.85 | 100.00 | 27.35 | 10.24 | 85.75 | 55.87 |
| 3 | Meta-Llama-3.1-8B-Instruct | 41.37 | 25.79 | 39.70 | 11.81 | 86.79 | 37.94 | 57.42 | 37.68 | 25.40 | 25.66 | 30.40 | 91.00 | 20.06 | 33.64 | 45.96 | 51.24 |
| 4 | Qwen3-8B | 40.18 | 33.18 | 41.15 | 11.67 | 67.68 | 38.62 | 55.28 | 52.32 | 32.61 | 15.15 | 27.25 | 64.00 | 21.92 | 8.06 | 81.99 | 51.78 |
| 5 | Qwen3-4B | 38.70 | 24.55 | 39.03 | 11.69 | 70.29 | 39.32 | 55.01 | 42.06 | 33.66 | 18.24 | 32.52 | 62.00 | 17.95 | 13.05 | 74.25 | 46.92 |
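The Avg Score column appears to be the unweighted mean of the 15 per-benchmark scores. A quick check against the top row (values copied from the table above):

```python
# Per-benchmark scores for Qwen3-30B-A3B, copied from the leaderboard row above.
scores = [37.96, 40.61, 11.61, 78.32, 43.24, 60.31, 48.96, 41.30,
          22.82, 28.42, 100.00, 24.12, 14.14, 83.24, 56.09]

avg = sum(scores) / len(scores)
print(round(avg, 2))  # 46.08, matching the Avg Score column
```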

💻 Environment & Installation

To install the loom package from source, run:

git clone https://gitlab.com/ZetangForward1/LOOM-Scope.git
cd LOOM-Scope
conda create -n loom python=3.10 -y
conda activate loom
pip install -e .
# Install flash attention: download the wheel matching your environment from
# https://github.com/Dao-AILab/flash-attention/releases, then:
pip install <path_to_flash_attn_whl_file>

📊 Benchmark Dashboard

> Overview

Benchmarks are organized by Benchmark Features.


‼️ The configuration files for each benchmark are located in benchmarks/{Ability}/{benchmark_name}/configs

> Computational Cost & Evaluation Results

LOOM-Scope's computational costs for each benchmark (across implementations) are documented in View Computational Costs.

Example: L-Eval Benchmark

1. Computational Cost (Inference Time)

| Benchmark | Hardware Configuration | Inference Time (all tasks) |
|---|---|---|
| LEval | 3090 (24GB) × 8 | 6:09:10 |
| LEval | A100 (40GB) × 4 | 3:58:46 |
| LEval | H20 (96GB) × 2 | 3:41:35 |

2. Evaluation Results

| Benchmark | Subtask | Metrics | Model | Results | Official Results (Reported in Paper) |
|---|---|---|---|---|---|
| LEval | TOEFL | exam | Meta-Llama-3.1-8B-Instruct | 81.40 | 82.89 |
| LEval | QuALITY | exam | Meta-Llama-3.1-8B-Instruct | 63.37 | 64.85 |
| LEval | Coursera | exam | Meta-Llama-3.1-8B-Instruct | 54.94 | 53.77 |
| LEval | SFiction | exact_match | Meta-Llama-3.1-8B-Instruct | 78.90 | 69.53 |
| LEval | GSM | exam | Meta-Llama-3.1-8B-Instruct | 78.00 | 79.00 |
| LEval | CodeU | exam | Meta-Llama-3.1-8B-Instruct | 6.60 | 2.20 |

🚀 Quick Start

> Automated Evaluation Command

Taking the L_CiteEval benchmark and the Meta-Llama-3.1-8B-Instruct model as an example, the following command runs the full workflow: downloading the benchmark and model → deploying the model → running model prediction on the benchmark → evaluating the generated results with the specified metrics.

loom-scope.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --device 0 1 \
    --gp_num 2 \
    --eval \
    --save_tag L-CiteEval_Data 

> Data Download Instructions

If your server has no network access, you can download the datasets elsewhere and upload them to the server. NIAH, RULER, and NoLiMa data should be placed in the corresponding ./benchmark/{ability}/{benchmark_name}/tmp_Rawdata folder; download all other data to the ./benchmark/{ability}/{benchmark_name}/data folder.

> Command-line Interface

| Category | Parameter | Type/Format | Description | Required/Default |
|---|---|---|---|---|
| Core | --model_path | str | Model path/service provider (e.g., Meta-Llama-3-70B-Instruct) | Required |
| Core | --cfg_path | str (YAML) | Benchmark configuration file (e.g., benchmarks/General/LongBench.yaml) | Required |
| Core | --device | List[int] | GPU indices (e.g., 0 1 3) | Default: all available GPUs |
| Core | --torch_dtype | str \| torch.dtype | Precision (e.g., float16, bfloat16) | Default: torch.bfloat16 |
| Core | --max_length | int | Maximum input context length (e.g., 128000) | Default: 128000 |
| Core | --gp_num | int \| str | GPUs per model (e.g., 2) or models per GPU (e.g., '1/2') | Default: 1 |
| Optimization | --acceleration | str | Experimental methods (duo_attn, snapkv, streaming_llm, ...) | Optional |
| Optimization | --server | str | Options: transformers, vllm, rwkv | Default: transformers |
| Optimization | --max_model_len | int | [vLLM] Max sequence length | Default: 128000 |
| Optimization | --gpu_memory_utilization | float | [vLLM] VRAM ratio (0.0-1.0) | Default: 0.95 |
| Evaluation | --eval | bool | Toggle evaluation phase | Default: False |
| Extensions | --limit | Optional[int] | Evaluate only the first N samples of each task | Default: None |
| Extensions | --rag_method | str | Retrieval-augmented generation (openai, bm25, llama_index, contriever) | Optional |
| Extensions | --adapter_path | str | Path to PEFT/LoRA weights | Optional |
| Extensions | --enable_thinking | bool | Inject step-by-step reasoning prompts into the template (for supported models) | Default: False |
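The --gp_num flag overloads one parameter for two placement schemes. The sketch below is an illustration of the documented convention, not the framework's actual parser:

```python
def parse_gp_num(gp_num):
    """Illustrative parser for the --gp_num convention: an int-like value N
    means each model replica spans N GPUs; a string 'a/b' means b model
    replicas share each GPU."""
    s = str(gp_num)
    if "/" in s:
        a, b = s.split("/")
        return {"gpus_per_model": int(a), "models_per_gpu": int(b)}
    return {"gpus_per_model": int(s), "models_per_gpu": 1}

print(parse_gp_num(2))      # {'gpus_per_model': 2, 'models_per_gpu': 1}
print(parse_gp_num("1/2"))  # {'gpus_per_model': 1, 'models_per_gpu': 2}
```

With --gp_num 2 and --device 0 1, a single replica is sharded across both GPUs; with --gp_num '1/2', two replicas would share each GPU.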

> Customize Your Own Prompt

  1. Implement Your Function:

    • Add your custom function in ./models/utils/build_chat.py.
    • Example (default chatglm implementation):
      def chatglm(tokenizer, input, **kwargs):
          input = tokenizer.build_prompt(input)
          return input
      
  2. Update Configuration:

    • Set chat_model: {your_function_name} in the benchmark’s configuration file.
    • Example (a single configuration setting)
      benchmark_name: xxxx
      task_names: xxxx
      chat_model: Null  # choose from ./models/utils/build_chat.py
      
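Following the chatglm example above, a custom function only needs to accept the tokenizer and raw input and return the formatted prompt string. A hypothetical variant (the function name and bracket template below are illustrative, not part of the framework):

```python
def my_instruct_template(tokenizer, input, **kwargs):
    """Hypothetical custom chat function: wraps the raw input in a simple
    instruction template. The tokenizer is unused here, but other templates
    may call e.g. tokenizer.apply_chat_template instead."""
    return f"[INST] {input} [/INST]"

print(my_instruct_template(None, "Summarize the document."))
# [INST] Summarize the document. [/INST]
```

It would then be enabled with chat_model: my_instruct_template in the benchmark's configuration file.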

> Manual Evaluation of Generated Outputs

To evaluate generated model outputs (useful when you already have model predictions and only need scoring), use the --folder_name flag:

loom-scope.eval \
  --folder_name <artifact_directory> \    # Path to generation outputs
  --model_name <registered_model_id> \   # Model identifier (e.g., Meta-Llama-3-8B-Instruct)
  --cfg_path <task_config>               # Original benchmark config path

Example Usage

loom-scope.eval \
    --folder_name Counting_Stars \
    --model_name Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Reasoning/Counting_Stars/Counting_Stars.yaml  

WebUI Implementation

python WebUI/app.py

🚄 RAG and Efficient Inference Options

This section provides a unified approach to optimizing inference efficiency through configurable retrieval engines, flexible inference frameworks, and advanced acceleration techniques. Key features include customizable RAG parameters across multiple engines; framework selection tailored for throughput (vLLM), compatibility (Transformers), or memory efficiency (RWKV); and accelerated inference via token eviction, quantization, and context optimization methods compatible with mainstream GPUs (A100, 3090, H20).

> RAG Implementation Guide

1. Install

pip install -r ./models/rag/requirements.txt

2. Quick Start

Example: Benchmark Evaluation with Configurable Retrieval To assess the Meta-Llama-3.1-8B-Instruct model on the L_CiteEval benchmark using the BM25 retrieval method, run:

loom-scope.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --device 0 \
    --eval \
    --save_tag L-CiteEval_Data \
    --rag_method BM25

Select a retrieval engine using the --rag_method flag; supported options are BM25, llamaindex, openai, and contriever.

3. Configuration Guide

To customize the RAG settings, edit the RAG configuration file. Below are the key parameters you can adjust:

# === Core Retrieval Parameters ===
chunk_size: 512          # Number of tokens/characters per chunk
num_chunks: 15           # Total number of chunks generated from the input text
num_processes: 4         # Number of parallel processes for task execution

# === llamaindex only ===
chunk_overlap: 128       # Overlapping tokens/characters between consecutive chunks

# === openai only ===
embedding_model: text-embedding-3-small  # OpenAI embedding model for generating text embeddings
openai_api: ""           # API key for OpenAI authentication
base_url: "https://api.openai.com/v1"    # Base URL for the API endpoint
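To make the chunk parameters concrete, here is a minimal sliding-window chunker. It is an illustration of what chunk_size, chunk_overlap, and num_chunks control, not LOOM-Scope's internal implementation:

```python
def chunk_text(tokens, chunk_size=512, chunk_overlap=128, num_chunks=15):
    """Split a token list into overlapping windows: consecutive chunks share
    `chunk_overlap` tokens, and at most `num_chunks` chunks are returned."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if len(chunks) == num_chunks or start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_text(list(range(1000)), chunk_size=512, chunk_overlap=128)
print(len(chunks), len(chunks[0]))  # 3 512  (third chunk holds the remainder)
```

The retriever then scores these chunks against the query and keeps the top-ranked ones as the model's context.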

4. ⚠Limitations

The problems in the following tasks are unsuitable for text retrieval, so direct retrieval yields poor results. It is recommended to apply RAG to these tasks with discretion.

| Benchmark | Subtasks Unsuitable for Retrieval |
|---|---|
| L-CiteEval | qmsum, multi_news, gov_report |
| LEval | tv_show_summ, patent_summ, paper_assistant, news_summ, gov_report_summ, codeU |
| LongBench | gov_report, vcsum, multi_news, samsum, passage_count, passage_retrieval_en, passage_retrieval_zh, lcc, repobench-p, trec, lsht |
| LooGLE | summarization |
| RULER | cwe, fwe, vt |
| LongWriter | ---- |
| InfiniteBench | longbook_sum_eng, math_find, math_calc, code_run, code_debug, longdialogue_qa_eng |
| LongIns | ---- |
| LongHealth | ---- |
| Ada_LEval | ---- |
| BAMBOO | altqa, senhallu, abshallu, meetingpred, showspred, reportsumsort, showssort, private_eval |

Note: ---- indicates that all subtasks are unsuitable for retrieval.

> Inference Framework Selection

When performing inference tasks, the choice of inference framework is crucial, as it directly impacts performance, compatibility, and resource utilization. Different frameworks offer distinct advantages; select the most suitable one using the --server option, choosing from vllm, transformers (default), and rwkv. Here are the inference frameworks we've adapted:

1. vLLM Optimized Configuration (High Throughput)

loom-scope.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --server vllm \
    --max_model_len 128000 \
    --gpu_memory_utilization 0.95 \
    --eval

2. Transformers Default Configuration (Compatibility)

loom-scope.run \
    --server transformers \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --model_path local_llama_model \
    --eval

3. RWKV Configuration (Memory Efficiency)

loom-scope.run \
    --server rwkv \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --model_path RWKV-x070-World-2.9B-v3-20250211-ctx4096 \
    --device 0 \
    --eval

> Acceleration

1. Overview

The Acceleration Toolkit currently supports the following acceleration methods:

| Acceleration Type | Method | Remark |
|---|---|---|
| Token Eviction | H2O | Attention-based selection |
| Token Eviction | StreamingLLM | Retains the first few (attention sink) tokens |
| Token Eviction | SnapKV | Attention pooling before selection |
| Token Eviction | L2Compress | Uses the L2 norm as an importance metric instead of attention |
| Layer-wise | PyramidKV | Layer-wise budget allocation |
| Layer-wise | CakeKV | Layer-specific preference score |
| Quantization | KIVI | Asymmetric 2-bit quantization |
| Quantization + Eviction | ThinK | Thinner key cache via query-driven pruning |
| Token Merge | CaM | Cache merging for memory-efficient LLM inference |
| Sparse Attention | FlexPrefill | Dynamic, context-aware sparse attention |
| Sparse Attention | XAttention | Dynamic, context-aware sparse attention via strided antidiagonal scoring |

All supported acceleration methods, except for L2Compress, are capable of performing single-GPU inference for 128K tasks on 40GB A100 GPUs. These methods are fully optimized for multi-GPU inference and are compatible with a variety of high-performance GPUs, including NVIDIA 3090, A-series, and H-series cards.
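As a concrete illustration of the token-eviction family, a StreamingLLM-style policy keeps the first few "attention sink" tokens plus a recent window and evicts everything in between. A minimal sketch of which KV-cache positions such a policy retains (indices only; not the actual implementation):

```python
def streaming_llm_keep(seq_len, num_sink=4, window=8):
    """Return the KV-cache positions a StreamingLLM-style policy retains:
    the first `num_sink` (attention sink) tokens plus the last `window`
    tokens. Everything in between is evicted."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))  # short sequence: keep everything
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))

print(streaming_llm_keep(20))  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Because the retained set has constant size, cache memory stays flat as the sequence grows, which is what makes 128K-token inference feasible on a single 40GB GPU.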

2. Environment & Installation

Please note that these acceleration frameworks have specific requirements for environment setup and model compatibility. Refer to each framework's official documentation (or the README files) to configure your environment and ensure your model is compatible.

3. Command-line Interface

You can easily enable any of these methods by specifying the --acceleration flag followed by the method name, e.g., --acceleration L2Compress.

| Category | Parameter | Type/Format | Description | Required/Default |
|---|---|---|---|---|
| Core | --model_path | str | Model path/service provider (e.g., Meta-Llama-3-70B-Instruct) | Required |
| Core | --cfg_path | str (YAML) | Benchmark configuration file (e.g., benchmarks/General/LongBench.yaml) | Required |
| Core | --device | List[int] | GPU indices (e.g., 0 1 3) | Default: all available GPUs |
| Core | --torch_dtype | str \| torch.dtype | Precision (e.g., float16, bfloat16) | Default: torch.bfloat16 |
| GPU Allocation | --gp_num | int \| str | GPUs per model (e.g., 2) or models per GPU (e.g., '1/2') | Default: 1 |
| Optimization | --acceleration | str | Choices: H2O, StreamingLLM, SnapKV, L2Compress, FlexPrefill, PyramidKV, CakeKV, KIVI, ThinK, CaM | Optional |

For example, if you want to use the L2Compress method, please use the following command:

loom-scope.run \
    --model_path Meta-Llama/Meta-Llama-3.1-8B-Instruct \
    --cfg_path ./benchmarks/Faithfulness/L_CiteEval/test.yaml \
    --device 0 1 \
    --gp_num 2 \
    --acceleration L2Compress

‼️ After properly setting up the environment, uncomment the relevant code in ./models/__init__.py to enable the acceleration frameworks.
