
RepoQA: Evaluating Long-Context Code Understanding

🏠 Homepage: https://evalplus.github.io/repoqa.html

🚀 Installation

# without vLLM (the openai, anthropic, google, and huggingface backends still work)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"
⏬ Install nightly version
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                 # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
⏬ Using RepoQA as a local repo?
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

🏁 Search Needle Function (SNF)

Search Needle Function is the first RepoQA task. It exercises LLMs' ability to understand and retrieve code from long contexts. Its real-world counterpart is precise code search driven by user intent rather than simple keyword matching.

[!Important]

SNF includes 500 tests (5 programming languages × 10 repositories × 10 needle functions) where an LLM is given:

  1. A large code context sorted by file dependency
  2. A natural-language description of the needle function that avoids revealing keywords such as the function name
  3. An instruction to retrieve the described function

The evaluator passes a test if the retrieved function is syntactically closest to the ground truth among all functions in the repository (parsed with tree-sitter) and the similarity exceeds a user-defined threshold (0.8 by default).
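The pass criterion can be sketched in a few lines of Python. This is an illustrative approximation only: it uses a character-level `difflib` ratio as a stand-in for RepoQA's actual tree-sitter-based comparison, and the function and variable names here are hypothetical:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Character-level similarity ratio in [0.0, 1.0]; a stand-in for
    # RepoQA's real metric over tree-sitter-parsed functions.
    return difflib.SequenceMatcher(None, a, b).ratio()

def passes(model_answer: str, functions: dict[str, str], needle: str,
           threshold: float = 0.8) -> bool:
    # The test passes iff the repository function most similar to the
    # model's answer is the ground-truth needle AND the similarity
    # clears the user-defined threshold (0.8 by default).
    best = max(functions, key=lambda name: similarity(model_answer, functions[name]))
    return best == needle and similarity(model_answer, functions[needle]) >= threshold
```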

You can run the SNF evaluation using various backends:

OpenAI Compatible Servers

repoqa.search_needle_function --model "gpt-4-turbo" --backend openai
# 💡 If you use openai API compatible server such as vLLM servers:
# repoqa.search_needle_function --base-url "http://localhost:[PORT]/v1" \
#                               --model "Qwen/CodeQwen1.5-7B-Chat" --backend openai

Anthropic Compatible Servers

repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic

vLLM

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm

[!Tip]

You can unlock the model's context window using dynamic RoPE scaling. For example, Meta-Llama-3-8B-Instruct has an 8k context, but running the default 16k test requires more (approximately 20k tokens).

To extend the context to 32k, in its config file (hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json) set:

"rope_scaling": {"type": "dynamic", "factor": 4.0}

Note: This manual edit is needed for vLLM < 0.4.3 and for HuggingFace transformers. For vLLM >= 0.4.3, RepoQA configures dynamic RoPE automatically.
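If you prefer to patch the config programmatically rather than by hand, a minimal sketch (the helper name is illustrative and not part of RepoQA; point it at your snapshot's config.json):

```python
import json
from pathlib import Path

def enable_dynamic_rope(config_path: str, factor: float = 4.0) -> None:
    # Load the HF model config, add dynamic RoPE scaling, write it back.
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    cfg["rope_scaling"] = {"type": "dynamic", "factor": factor}
    path.write_text(json.dumps(cfg, indent=2))
```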

[!Note]

Reference evaluation time:

  • Llama3-8B-Instruct: 45 minutes on 2xA6000 (PCIe NVLink)
  • Llama3-70B-Instruct: 100 minutes on 4xA100 (PCIe NVLink)

HuggingFace transformers

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code

Google Generative AI API (Gemini)

repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google

CLI Usage

  • Input:
    • --model: Hugging Face model ID, such as ise-uiuc/Magicoder-S-DS-6.7B
    • --backend: vllm (default), openai, anthropic, hf, or google
    • --base-url: OpenAI API base URL
    • --code-context-size (default: 16384): number of tokens (counted with the DeepSeekCoder tokenizer) in the repository context
    • --caching (default: True): accelerate subsequent runs by caching preprocessing; --nocaching to disable
    • --max-new-tokens (default: 1024): Maximum #new tokens to generate
    • --system-message (default: None): system message (note it's not supported by some models)
    • --tensor-parallel-size: number of GPUs for tensor parallelism (vLLM only)
    • --languages (default: None): List of languages to evaluate (None means all)
    • --result-dir (default: "results"): Directory to save the model outputs and evaluation results
    • --ignore-comments (default: False): during evaluation, ignore comments in both the ground-truth and model-generated functions
    • --trust-remote-code (default: False): allow remote code (for HuggingFace transformers and vLLM)
    • --attn-implementation (default: None): use "flash_attention_2" if HuggingFace transformers runs out of memory
  • Output:
    • results/ntoken_{code-context-size}/{model}.jsonl: Model generated outputs
    • results/ntoken_{code-context-size}/{model}-SCORE.json: Evaluation results
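The model outputs are JSONL (one JSON record per line), so they are easy to inspect with a few lines of Python. This is a generic sketch: the path shown follows the documented results layout but uses a hypothetical model name, and no particular record fields are assumed:

```python
import json
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    # Parse one JSON record per non-empty line.
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (hypothetical file in the documented results layout):
# records = load_jsonl("results/ntoken_16384/CodeQwen1.5-7B-Chat.jsonl")
# print(len(records), "generations")
```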

Compute Scores

By default, the repoqa.search_needle_function command evaluates model outputs and computes scores immediately after text generation. You can also compute scores separately with the following command:

repoqa.compute_score --model-output-path={model-output}.jsonl

[!Tip]

  • Input: Path to the model generated outputs.
  • Output: The evaluation scores are stored in {model-output}-SCORES.json

📚 Read More
