
RepoQA: Evaluating Long-Context Code Understanding

🏠 Homepage: https://evalplus.github.io/repoqa.html

🚀 Installation

# without vLLM (can run openai, anthropic, google, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"
⏬ Install nightly version
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                 # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
⏬ Using RepoQA as a local repo?
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
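
With the repository on your PYTHONPATH, a quick sanity check is to print the CLI help. Note this assumes the module can be run with -m (i.e., it has a main entry point), which may differ from the actual package layout:

python -m repoqa.search_needle_function --help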

🏁 Search Needle Function (SNF)

You can run the SNF evaluation using various backends.

[!Note]

All evaluations can be performed in just one command.

For reference, evaluating a 7B model takes about 30 minutes on two A6000 GPUs.

OpenAI Compatible Servers

repoqa.search_needle_function --model "gpt4-turbo" --backend openai
# 💡 If you use customized server such vLLM:
# repoqa.search_needle_function --base-url "http://url.to.vllm.server/v1" \
#                               --model "gpt4-turbo" --backend openai
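
The openai backend authenticates via the environment; OPENAI_API_KEY is the OpenAI SDK's standard variable, and it is an assumption here that RepoQA picks it up:

export OPENAI_API_KEY="your-key-here"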

Anthropic Compatible Servers

repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic

vLLM

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm

HuggingFace transformers

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code

Google Generative AI API (Gemini)

repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google

[!Tip]

To evaluate models whose context size is smaller than the prompt, you can edit the model's config.json in the HuggingFace cache directory and modify max_position_embeddings.
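
A minimal sketch of that edit, assuming a standard HuggingFace cache layout and jq installed (the org/model and snapshot hash below are placeholders):

# raise the advertised context length in the cached config
CFG=~/.cache/huggingface/hub/models--<org>--<model>/snapshots/<snapshot-hash>/config.json
jq '.max_position_embeddings = 65536' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"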

Usage

[!Tip]

  • Input (see the example run after this list):
    • --model: Hugging-Face model ID, such as ise-uiuc/Magicoder-S-DS-6.7B
    • --backend: vllm (default), openai, anthropic, google, or hf
    • --base-url: OpenAI API base URL
    • --code-context-size (default: 16384): number of tokens (counted with the DeepSeekCoder tokenizer) in the repository context
    • --caching (default: True): accelerate subsequent runs by caching preprocessing; --nocaching to disable
    • --max-new-tokens (default: 1024): maximum number of new tokens to generate
    • --system-message (default: None): system message (note: some models do not support it)
    • --tensor-parallel-size: number of GPUs for tensor parallelism (vLLM only)
    • --languages (default: None): List of languages to evaluate (None means all)
    • --result-dir (default: "results"): Directory to save the model outputs and evaluation results
    • --trust-remote-code (default: False): allow remote code (for HuggingFace transformers and vLLM)
  • Output:
    • results/ntoken_{code-context-size}/{model}.jsonl: Model generated outputs
    • results/ntoken_{code-context-size}/{model}-SCORES.json: Evaluation results
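
For example, a run that exercises several of these flags (the values are illustrative):

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm \
                              --code-context-size 16384 --result-dir results
# outputs then land under results/ntoken_16384/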

Compute Scores

By default, the repoqa.search_needle_function command evaluates model outputs and computes scores right after text generation. However, you can also compute scores separately using the following command:

repoqa.compute_score --model-output-path={model-output}.jsonl

[!Tip]

  • Input: Path to the model-generated outputs.
  • Output: The evaluation scores are stored in {model-output}-SCORES.json
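
For instance, to score the outputs of a 16K-context run (the file name follows the output paths documented above; the model file name is a placeholder):

repoqa.compute_score --model-output-path=results/ntoken_16384/<model>.jsonl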

📚 Read More
