
RepoQA: Evaluating Long-Context Code Understanding

🏠 Homepage: https://evalplus.github.io/repoqa.html

🚀 Installation

# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"
⏬ Install nightly version
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                 # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
⏬ Using RepoQA as a local repo?
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

🏁 Search Needle Function (SNF)

Search Needle Function (SNF) is the first and foundational RepoQA task; it exercises LLMs' ability to understand and retrieve code over long contexts. Its real-life counterpart is precise code search from a natural-language function description.

🔎 More dataset details

[!Note]

SNF includes 500 tests (5 programming languages x 10 repos x 10 needle functions) where an LLM is given:

  1. A large code context sorted by file dependency
  2. A natural-language description of the needle function that avoids revealing keywords such as the function name
  3. An instruction to retrieve the described function

The evaluator passes a test if the retrieved function is syntactically closer to the ground truth than any other function in the repository (functions are parsed with tree-sitter) and the similarity exceeds a user-defined threshold (0.8 by default).
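
A minimal sketch of this pass criterion in Python. The similarity metric below (difflib sequence matching) and all function names are illustrative stand-ins, not RepoQA's exact implementation, which parses functions with tree-sitter:

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in syntactic similarity between two function bodies.
    return SequenceMatcher(None, a, b).ratio()

def passes(retrieved: str, ground_truth: str, repo_functions: list[str],
           threshold: float = 0.8) -> bool:
    score = similarity(retrieved, ground_truth)
    # The ground truth must be the closest repository function to the
    # retrieved snippet ...
    is_closest = all(score >= similarity(retrieved, other)
                     for other in repo_functions if other != ground_truth)
    # ... and the similarity must also exceed the user-defined threshold.
    return is_closest and score > threshold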

You can run the SNF evaluation using various backends:

OpenAI Compatible Servers

repoqa.search_needle_function --model "gpt4-turbo" --backend openai
# 💡 If you use openai API compatible server such as vLLM servers:
# repoqa.search_needle_function --base-url "http://localhost:[PORT]/v1" \
#                               --model "Qwen/CodeQwen1.5-7B-Chat" --backend openai
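
For context, the --base-url flag corresponds to the base_url that an OpenAI-compatible client would use to reach such a server. A minimal sketch (the port and dummy API key below are placeholders, not values RepoQA requires):

from openai import OpenAI

# Point a standard OpenAI client at a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/CodeQwen1.5-7B-Chat",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)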

Anthropic Compatible Servers

repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic

vLLM

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
🔎 Context extension for small-ctx models

There are two ways to unlock a model's context at inference time:

  1. Direct Extension: Edit max_position_embeddings in the model's config.json (e.g., hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json) to something like 22528.
  2. Dynamic RoPE Scaling: To extend Meta-Llama-3-8B-Instruct from 8k to 32k (4x), edit the config.json:

"rope_scaling": {"type": "dynamic", "factor": 4.0}

Note: this manual edit works for vLLM < 0.4.3 and HuggingFace transformers. RepoQA will automatically configure dynamic RoPE for vLLM >= 0.4.3.
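
If you prefer scripting this over editing the file by hand, a minimal sketch that patches the snapshot's config.json in place (the path below reuses the placeholder from above, not a literal value):

import json

# Placeholder path; substitute the actual snapshot hash on your machine.
path = "hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json"
with open(path) as f:
    config = json.load(f)
config["rope_scaling"] = {"type": "dynamic", "factor": 4.0}  # 8k -> 32k (4x)
with open(path, "w") as f:
    json.dump(config, f, indent=2)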

[!Note]

Reference evaluation time:

  • Llama3-8B-Instruct: 45 minutes on 2xA6000 (PCIe NVLink)
  • Llama3-70B-Instruct: 100 minutes on 4xA100 (PCIe NVLink)

HuggingFace transformers

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code

[!Tip]

Installing flash-attn and additionally setting --attn-implementation "flash_attention_2" can largely lower the memory requirement.

🔨 Having trouble installing `flash-attn`?

If pip install flash-attn --no-build-isolation gives you trouble, you can install a pre-built wheel directly:

export FLASH_ATTN_VER=2.5.8 # check latest version at https://github.com/Dao-AILab/flash-attention/releases
export CUDA_VER="cu122"     # check available ones at https://github.com/Dao-AILab/flash-attention/releases
export TORCH_VER=$(python -c "import torch; print('.'.join(torch.__version__.split('.')[:2]))")
export PY_VER=$(python -c "import platform; print(''.join(platform.python_version().split('.')[:2]))")
export OS_ARCH=$(python -c "import platform; print(f'{platform.system().lower()}_{platform.machine()}')")

export WHEEL=flash_attn-${FLASH_ATTN_VER}+${CUDA_VER}torch${TORCH_VER}cxx11abiFALSE-cp${PY_VER}-cp${PY_VER}-${OS_ARCH}.whl
wget https://github.com/Dao-AILab/flash-attention/releases/download/v${FLASH_ATTN_VER}/${WHEEL}
pip install ${WHEEL}
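
After installing the wheel, a quick sanity check (assumes a CUDA-enabled PyTorch environment):

import flash_attn
import torch

# The version should match FLASH_ATTN_VER and CUDA should be visible.
print(flash_attn.__version__)
print(torch.cuda.is_available())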

Google Generative AI API (Gemini)

repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google

CLI Usage

  • Input:
    • --model: HuggingFace model ID, such as ise-uiuc/Magicoder-S-DS-6.7B
    • --backend: vllm (default), openai, anthropic, hf, or google
    • --base-url: OpenAI API base URL
    • --code-context-size (default: 16384): #tokens of repository context, counted with the DeepSeekCoder tokenizer (see the token-counting sketch after this list)
    • --caching (default: True): accelerate subsequent runs by caching preprocessing; use --nocaching to disable
    • --max-new-tokens (default: 1024): maximum #new tokens to generate
    • --system-message (default: None): system message (note that some models do not support system messages)
    • --tensor-parallel-size: #GPUs used for tensor parallelism (vLLM only)
    • --languages (default: None): list of languages to evaluate (None means all)
    • --result-dir (default: "results"): directory to save the model outputs and evaluation results
    • --ignore-comments (default: False): during evaluation, ignore comments in both the ground-truth and model-generated functions
    • --trust-remote-code (default: False): allow remote code (for HuggingFace transformers and vLLM)
    • --attn-implementation (default: None): set to "flash_attention_2" if the HuggingFace backend runs out of memory
  • Output:
    • results/ntoken_{code-context-size}/{model}.jsonl: Model generated outputs
    • results/ntoken_{code-context-size}/{model}-SCORES.json: Evaluation results
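
As noted above, --code-context-size is measured in DeepSeekCoder tokens. A minimal token-counting sketch (the exact tokenizer checkpoint RepoQA pins is an assumption here):

from transformers import AutoTokenizer

# Assumed checkpoint; RepoQA may pin a different DeepSeekCoder tokenizer.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
repo_context = open("repo_context.txt").read()  # placeholder input file
print(len(tokenizer.encode(repo_context)))      # compare against --code-context-size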

Compute Scores

By default, the repoqa.search_needle_function command will evaluate model outputs and compute scores after text generation. However, you can also separately compute scores using the following command:

repoqa.compute_score --model-output-path={model-output}.jsonl

[!Tip]

  • Input: Path to the model-generated outputs ({model-output}.jsonl).
  • Output: The evaluation scores will be stored in {model-output}-SCORES.json.

📚 Read More
