"RepoQA for Evaluating Long-Context Code Understanding"
Project description
# RepoQA: Evaluating Long-Context Code Understanding

🏠 Homepage: https://evalplus.github.io/repoqa.html
## 🚀 Installation

```bash
# without vLLM (can run the openai, anthropic, and huggingface backends)
pip install --upgrade repoqa

# to enable vLLM
pip install --upgrade "repoqa[vllm]"
```
<details><summary>⏬ Install nightly version :: click to expand ::</summary>

```bash
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                  # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main"  # with vLLM
```

</details>
<details><summary>⏬ Using RepoQA as a local repo? :: click to expand ::</summary>

```bash
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

</details>
## 🏁 Search Needle Function (SNF)

Search Needle Function is the first and foundational RepoQA task; it exercises LLMs' ability to understand and retrieve code from long contexts. Its real-life counterpart is precise code search given a natural-language function description.
<details><summary>🔎 More dataset details :: click to expand ::</summary>

> [!Note]
>
> SNF includes 500 tests (5 programming languages x 10 repos x 10 needle functions) where an LLM is given:
>
> - A large code context sorted in file-dependency order
> - An NL description of the needle function that avoids revealing keywords such as the function name
> - An instruction to retrieve the described function
>
> The evaluator passes a test if, among all functions in the repository (systematically parsed by `tree-sitter`), the retrieved function is syntactically closest to the ground truth and its similarity exceeds a user-defined threshold (0.8 by default).

</details>
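The pass criterion can be sketched as follows. This is a minimal illustration, not RepoQA's implementation: `difflib.SequenceMatcher` stands in for the actual syntactic-similarity metric, and the helper names are hypothetical.

```python
# Sketch of the SNF pass criterion: the retrieved text must be closest to the
# ground-truth needle function among all candidates, AND clear the threshold.
from difflib import SequenceMatcher


def best_match(candidates: list, retrieved: str):
    """Return (index, similarity) of the candidate closest to `retrieved`."""
    scores = [SequenceMatcher(None, c, retrieved).ratio() for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def passes(candidates, retrieved, needle_index, threshold=0.8):
    # Pass only if the closest candidate is the ground-truth needle
    # and the similarity is above the (default 0.8) threshold.
    idx, score = best_match(candidates, retrieved)
    return idx == needle_index and score >= threshold
```

In the real evaluator, candidates come from parsing the repository with `tree-sitter` rather than from a plain string list.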
You can run the SNF evaluation using various backends:
### OpenAI Compatible Servers

```bash
repoqa.search_needle_function --model "gpt-4-turbo" --backend openai

# 💡 If you use an OpenAI-API-compatible server such as a vLLM server:
# repoqa.search_needle_function --base-url "http://localhost:[PORT]/v1" \
#     --model "Qwen/CodeQwen1.5-7B-Chat" --backend openai
```
### Anthropic Compatible Servers

```bash
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
```
### vLLM

```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
```
<details><summary>🔎 Context extension for small-ctx models :: click to expand ::</summary>

There are two ways to unlock a model's context window at inference time:

- **Direct Extension:** Edit `max_position_embeddings` in the model's `config.json` (e.g., `hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json`) to something like `22528`.
- **Dynamic RoPE Scaling:** To extend `Meta-Llama-3-8B-Instruct` from 8k to 32k (4x), add the following to its `config.json`:

```json
"rope_scaling": {"type": "dynamic", "factor": 4.0}
```

Note: manual `rope_scaling` edits work for vLLM `<0.4.3` and HuggingFace transformers. RepoQA will automatically configure dynamic RoPE for vLLM `>=0.4.3`.

</details>
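Both edits are plain JSON surgery on a local `config.json`. A minimal sketch, assuming a local snapshot of the model config; the helper name is hypothetical, not a RepoQA or transformers API:

```python
# Sketch: enable dynamic RoPE scaling by editing a local config.json in place.
# The 4.0 factor mirrors the 8k -> 32k Llama-3 example above.
import json
from pathlib import Path


def enable_dynamic_rope(config_path: str, factor: float = 4.0) -> dict:
    """Add a dynamic rope_scaling entry to a HF model config and save it."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["rope_scaling"] = {"type": "dynamic", "factor": factor}
    path.write_text(json.dumps(config, indent=2))
    return config
```

Remember to restore (or re-download) the original config when you are done benchmarking, since the edit persists in the local HF cache.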
> [!Note]
>
> Reference evaluation time:
>
> - Llama3-8B-Instruct: 45 minutes on 2xA6000 (PCIe NVLink)
> - Llama3-70B-Instruct: 100 minutes on 4xA100 (PCIe NVLink)
### HuggingFace transformers

```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
```
> [!Tip]
>
> Installing `flash-attn` and additionally setting `--attn-implementation "flash_attention_2"` can largely lower the memory requirement.
<details><summary>🔨 Having trouble installing `flash-attn`? :: click to expand ::</summary>

If you have trouble with `pip install flash-attn --no-build-isolation`, you can try to directly use pre-built wheels:

```bash
export FLASH_ATTN_VER=2.5.8  # check the latest version at https://github.com/Dao-AILab/flash-attention/releases
export CUDA_VER="cu122"      # check available ones at https://github.com/Dao-AILab/flash-attention/releases
export TORCH_VER=$(python -c "import torch; print('.'.join(torch.__version__.split('.')[:2]))")
export PY_VER=$(python -c "import platform; print(''.join(platform.python_version().split('.')[:2]))")
export OS_ARCH=$(python -c "import platform; print(f'{platform.system().lower()}_{platform.machine()}')")
export WHEEL=flash_attn-${FLASH_ATTN_VER}+${CUDA_VER}torch${TORCH_VER}cxx11abiFALSE-cp${PY_VER}-cp${PY_VER}-${OS_ARCH}.whl
wget https://github.com/Dao-AILab/flash-attention/releases/download/v${FLASH_ATTN_VER}/${WHEEL}
pip install ${WHEEL}
```

</details>
### Google Generative AI API (Gemini)

```bash
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
```
### CLI Usage

- Input:
  - `--model`: Hugging Face model ID, such as `ise-uiuc/Magicoder-S-DS-6.7B`
  - `--backend`: `vllm` (default) or `openai`
  - `--base-url`: OpenAI API base URL
  - `--code-context-size` (default: 16384): #tokens of repository context (counted with the DeepSeekCoder tokenizer)
  - `--caching` (default: True): accelerate subsequent runs by caching preprocessing; use `--nocaching` to disable
  - `--max-new-tokens` (default: 1024): maximum #new tokens to generate
  - `--system-message` (default: None): system message (note: some models do not support it)
  - `--tensor-parallel-size`: #GPUs for tensor parallelism (vLLM only)
  - `--languages` (default: None): list of languages to evaluate (None means all)
  - `--result-dir` (default: "results"): directory to save model outputs and evaluation results
  - `--ignore-comments` (default: False): ignore comments in both the ground truth and the model output during evaluation
  - `--trust-remote-code` (default: False): allow remote code (for HuggingFace transformers and vLLM)
  - `--attn-implementation` (default: None): use `"flash_attention_2"` if your HF backend hits OOM
- Output:
  - `results/ntoken_{code-context-size}/{model}.jsonl`: model-generated outputs
  - `results/ntoken_{code-context-size}/{model}-SCORE.json`: evaluation results
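The output layout above can be reproduced programmatically when post-processing results. A hedged sketch: the helper is hypothetical, and it assumes the org prefix of a slashed HF model ID (e.g. `Qwen/`) is dropped when building the filename — verify against your own `results/` directory.

```python
# Sketch: build the expected output paths for a given model and context size,
# following the results/ntoken_{size}/{model}[.jsonl|-SCORE.json] layout.
from pathlib import Path


def output_paths(model: str, code_context_size: int = 16384,
                 result_dir: str = "results"):
    """Return (outputs_path, score_path) for a run, per the documented layout."""
    name = model.split("/")[-1]  # assumption: org prefix dropped for filenames
    base = Path(result_dir) / f"ntoken_{code_context_size}"
    return base / f"{name}.jsonl", base / f"{name}-SCORE.json"
```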
### Compute Scores

By default, the `repoqa.search_needle_function` command evaluates model outputs and computes scores right after text generation. However, you can also compute scores separately with:

```bash
repoqa.compute_score --model-output-path={model-output}.jsonl
```
> [!Tip]
>
> - Input: path to the model-generated outputs.
> - Output: the evaluation scores are stored in `{model-output}-SCORES.json`
## 📚 Read More