"RepoQA for Evaluating Long-Context Code Understanding"
Project description
RepoQA: Evaluating Long-Context Code Understanding
🏠 Homepage: https://evalplus.github.io/repoqa.html
🚀 Installation
# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"
⏬ Install nightly version
pip install --upgrade "git+https://github.com/evalplus/repoqa.git" # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
⏬ Using RepoQA as a local repo?
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
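When working from a local checkout, the installed console command may not be available; a plausible alternative, assuming the module is runnable as a script, is the python -m form below:
# Sketch: run from the local checkout (assumes the module supports `python -m` invocation)
python -m repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm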
🏁 Search Needle Function (SNF)
You can run the SNF evaluation using various backends.
[!Note]
All evaluations can be performed in just one command.
As a reference, evaluating a 7B model takes about 30 minutes on two A6000s.
OpenAI Compatible Servers
repoqa.search_needle_function --model "gpt4-turbo" --backend openai
# 💡 If you use customized server such vLLM:
# repoqa.search_needle_function --base-url "http://url.to.vllm.server/v1" \
# --model "gpt4-turbo" --backend openai
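The OpenAI backend needs credentials before the command above will run. A minimal sketch, assuming the backend reads the standard OPENAI_API_KEY environment variable used by the OpenAI SDK (the variable name is an assumption, not documented here):
# Assumed: the backend picks up the OpenAI SDK's standard environment variable
export OPENAI_API_KEY="sk-..."  # placeholder; substitute your real key
repoqa.search_needle_function --model "gpt-4-turbo" --backend openai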
Anthropic Compatible Servers
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
vLLM
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
HuggingFace transformers
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
Google Generative AI API (Gemini)
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
[!Tip]
To evaluate models whose context size is smaller than the prompt, you can edit the model's `config.json` in the HuggingFace cache directory to increase `max_position_embeddings`.
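For example, patching the cached config could look like the sketch below; the snapshot hash and the target value 65536 are placeholders, and the path follows the standard HuggingFace hub cache layout:
# ⚠️ Illustrative only: replace <hash> with your snapshot hash and pick a size that fits your prompts
python -c "
import json, pathlib
cfg = pathlib.Path.home() / '.cache/huggingface/hub/models--Qwen--CodeQwen1.5-7B-Chat/snapshots/<hash>/config.json'
data = json.loads(cfg.read_text())
data['max_position_embeddings'] = 65536  # placeholder target size
cfg.write_text(json.dumps(data, indent=2))
"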
Usage
[!Tip]
- Input:
  - `--model`: Hugging Face model ID, such as `ise-uiuc/Magicoder-S-DS-6.7B`
  - `--backend`: `vllm` (default) or `openai`
  - `--base-url`: OpenAI API base URL
  - `--code-context-size` (default: 16384): #tokens (by the DeepSeekCoder tokenizer) of repository context
  - `--caching` (default: True): accelerate subsequent runs by caching preprocessing; use `--nocaching` to disable
  - `--max-new-tokens` (default: 1024): maximum #new tokens to generate
  - `--system-message` (default: None): system message (note: some models do not support it)
  - `--tensor-parallel-size`: #GPUs for tensor parallelism (vLLM only)
  - `--languages` (default: None): list of languages to evaluate (None means all)
  - `--result-dir` (default: "results"): directory to save model outputs and evaluation results
  - `--trust-remote-code` (default: False): allow remote code (for HuggingFace transformers and vLLM)
- Output (see the example run after this list):
  - `results/ntoken_{code-context-size}/{model}.jsonl`: model-generated outputs
  - `results/ntoken_{code-context-size}/{model}-SCORES.json`: evaluation results
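Putting the documented options together, a typical end-to-end invocation might look like this sketch (the model and values are examples, not recommendations):
# Example run: vLLM backend, default 16K context, outputs under results/
repoqa.search_needle_function \
  --model "Qwen/CodeQwen1.5-7B-Chat" \
  --backend vllm \
  --code-context-size 16384 \
  --max-new-tokens 1024 \
  --result-dir results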
Compute Scores
By default, the `repoqa.search_needle_function` command evaluates model outputs and computes scores immediately after generation. You can also compute scores separately with the following command:
repoqa.compute_score --model-output-path={model-output}.jsonl
[!Tip]
- Input: path to the model-generated outputs (`.jsonl`).
- Output: the evaluation scores are stored in `{model-output}-SCORES.json`.
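For example, given the output layout above, re-scoring a finished run might look like this (the file name is illustrative; use the `.jsonl` actually produced in your result directory):
# Sketch: re-score a previous run; writes the -SCORES.json next to the input file
repoqa.compute_score --model-output-path=results/ntoken_16384/CodeQwen1.5-7B-Chat.jsonl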
📚 Read More