LLM-as-an-Interviewer🎤📄

This is the official GitHub repository for LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation.

LLM-as-an-Interviewer is an evaluation framework that assesses the capabilities of LLMs through an interview-style process. In this approach, the LLM acting as the interviewer evaluates other LLMs by providing feedback and asking follow-up questions, enabling a more comprehensive assessment of their capabilities.

Our framework includes a flexible pipeline that can be easily adapted to various tasks by incorporating a customized evaluation rubric.

All you need to do is edit the config.yaml file! (Refer to 6. ⚙️ Configuration)

Table of Contents

  1. 🚀 Quick Start
  2. 📦 Installation
  3. 🌟 Features
  4. 🖥️ Working Process
  5. 🛠️ Basic Usage
  6. ⚙️ Configuration
  7. 🔑 Guideline for Customizing the YAML File
  8. 🔄 Question Decontamination
  9. Citation

🚀 Quick Start

  • Clone the repository
  • Install the requirements:
"openai>=1.55.2",
"python-dotenv>=1.0.1",
"pyyaml>=6.0.2",
"rich>=13.9.4",
"click"
  • Set up the API keys in a .env file:
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
  • Note: To test local models, you can use vllm serve to launch an OpenAI-compatible server.

  • Run the following command to start the interview

python libs/interview_eval/main.py --config examples/configs/math_problem_solving.yaml

📦 Installation

pip install interview-eval

🌟 Features

  • AI-powered interviewer and interviewee agents
  • Configurable interview parameters and evaluation rubrics
  • Real-time conversation display with rich formatting
  • Detailed scoring and feedback system
  • Progress tracking and maximum question limits
  • Customizable OpenAI client configuration

🖥️ Working Process

  • The interviewer and interviewee engage in a sequential conversation.
  • The feedback agent's evaluation of the interviewee's responses is also displayed (this feedback is not disclosed to the interviewee).

Interview Process:

  1. The first question is the Seed Question (question1).
  2. Follow-up questions (question2, ...) are provided thereafter.
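
The flow above can be pictured as a simple loop. The following is an illustrative outline only, not the library's implementation: run_interview and the answer, grade, and ask_followup callables are hypothetical stand-ins for the interviewee, the feedback agent, and the interviewer.

```python
from typing import Callable, Dict, List


def run_interview(
    seed_question: str,
    answer: Callable[[str], str],
    grade: Callable[[str, str], str],
    ask_followup: Callable[[List[Dict[str, str]]], str],
    max_questions: int = 3,
) -> List[Dict[str, str]]:
    """Ask a seed question, then follow-ups; every callable is a hypothetical stand-in."""
    transcript: List[Dict[str, str]] = []
    question = seed_question  # question1 is always the seed question
    for turn in range(max_questions):
        response = answer(question)  # the interviewee sees only the question
        feedback = grade(question, response)  # stored for display, never passed to `answer`
        transcript.append({"question": question, "response": response, "feedback": feedback})
        if turn + 1 < max_questions:
            question = ask_followup(transcript)  # question2, question3, ...
    return transcript


# Toy agents so the sketch runs without any API calls.
demo = run_interview(
    seed_question="What is 15% of 200?",
    answer=lambda q: f"answer to: {q}",
    grade=lambda q, r: "ok",
    ask_followup=lambda t: f"follow-up #{len(t)}",
    max_questions=3,
)
```

Keeping the feedback in the transcript but out of the arguments to answer mirrors the point that evaluation is shown to the user without being disclosed to the interviewee.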

[Figure: Terminal Display Example]

📄 Interview Report

A report containing scores and a comprehensive summary of the interview is saved after the session.

The report includes:

  • Detailed evaluation scores.
  • A summary of the interviewee's performance.
  • Examples of interview logs for each score range.

This report can be configured to be saved automatically in the specified directory.

🛠️ Basic Usage

  • All you need is the code snippet below and your customized config.yaml file!
from interview_eval import InterviewRunner, Interviewer, Interviewee, InterviewReportManager
from interview_eval.utils import console, load_config, setup_logging

# Load configuration
config_data = load_config("config.yaml")

# Set up logging; verbose=True is assumed here to enable detailed output
verbose = True
logger, log_file_path = setup_logging(config_data, verbose)

# Initialize agents
interviewer = Interviewer(config=config_data, name="Interviewer")
interviewee = Interviewee(config=config_data, name="Student")
report_manager = InterviewReportManager(config=config_data)

# Create and run the interview
runner = InterviewRunner(interviewer, interviewee, config_data, logger, log_file_path, console, report_manager)
results = runner.run()

# Generate the interview report
report_manager.generate_report(interviewer)

⚙️ Configuration

Create a YAML configuration file with the following structure:

interviewer:
  name: "Technical Interviewer"
  model: "model name"
  client:
    api_key: ${OPENAI_API_KEY}
  instructions: "Your interview guidelines..."
  rubric: "Evaluation criteria..."
  strategy:
    policy: [...]
    follow_up_rules: [...]
  seed_question: "First question..."

interviewee:
  name: "Candidate"
  model: "model name"
  client:
    api_key: ${OPENAI_API_KEY}
  instructions: "Interviewee behavior guidelines..."

session:
  initial_message: "Welcome to the interview..."
  max_questions: 10
  max_retries: 2
  initial_context: {}

logging:
  save_to_file: true
  output_dir: "logs"
  filename_template: "session_{timestamp}.log"

report:
  save_to_file: true
  output_dir: "reports"
  filename_template: "report_{timestamp}.txt"

📄 Example Configurations

Refer to the example configurations under examples/configs (for instance, examples/configs/math_problem_solving.yaml) when creating your own configuration.

🔑 Guideline for Customizing the YAML File

This guide explains how to customize the YAML file based on your needs to create and configure an effective interview session between an interviewer and an interviewee.

1. Interview Type

Define the type of interview:

interview_type: <type>

2. Interviewer Configuration

The interviewer section customizes the behavior and attributes of the interviewer.

Basic Settings

interviewer:
  name: "<Interviewer Name>"
  model: "<Interviewer AI Model>"
  client:
    api_key: ${<API_KEY_VARIABLE>}
  • name: Provide a name for the interviewer (e.g., "Teacher").
  • model: Specify the AI model to use (e.g., gpt-4o-mini).
  • api_key: Reference the environment variable that holds the API key (e.g., ${OPENAI_API_KEY}).
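
This README does not describe how ${...} placeholders are resolved. As a rough sketch of one way such substitution could work, the hypothetical expand_env helper below (not part of interview-eval) replaces each ${NAME} with the matching environment variable and fails loudly when it is missing.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ['NAME']; raise KeyError if NAME is unset."""
    def _sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return _PLACEHOLDER.sub(_sub, value)


os.environ["DEMO_API_KEY"] = "sk-demo"  # placeholder value for this example only
resolved = expand_env("${DEMO_API_KEY}")
```

Raising on a missing variable, rather than substituting an empty string, surfaces a forgotten .env entry before any API request is made.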

Instructions

Define the behavior and goals of the interviewer:

  instructions: |
    <Interviewer instructions here>
  • Example: "You are a science interviewer assessing high school knowledge. Topics to cover: Biology, Physics."

Hint Strategy

This is the prompt used when the interviewer provides hints (feedback) to the interviewee. Customize how the interviewer provides hints:

  hint_prompt_template: |
    "<Hint strategy description>"

If not specified, it will default to the following:

  • Default: "Given the following, you have to give a hint to the interviewee to help them answer the question correctly. \nIf the {interviewee_name} makes repeated mistakes, give more hints to fix the mistake.\n"
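
The default template contains an {interviewee_name} placeholder. Assuming placeholders are filled in str.format style (the README does not confirm the mechanism), a custom template can be checked before use with a small hypothetical helper such as render_hint_prompt:

```python
# Default hint prompt quoted from this README; {interviewee_name} is its only placeholder.
DEFAULT_HINT_TEMPLATE = (
    "Given the following, you have to give a hint to the interviewee "
    "to help them answer the question correctly. \n"
    "If the {interviewee_name} makes repeated mistakes, "
    "give more hints to fix the mistake.\n"
)


def render_hint_prompt(interviewee_name: str, template: str = DEFAULT_HINT_TEMPLATE) -> str:
    """Fill the interviewee's name into a hint template (hypothetical helper)."""
    return template.format(interviewee_name=interviewee_name)


hint_prompt = render_hint_prompt("Student")
```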

Questioning Strategy For Follow-Up Questions

This is the instruction used when the interviewer provides follow-up questions to the interviewee.

Define the approach to questioning:

  strategy:
    max_questions: <Number>
    policy:
      - "<Questioning rule 1>"
      - "<Questioning rule 2>"
    follow_up_rules:
      - "<Follow-up rule 1>"
      - "<Follow-up rule 2>"
  • max_questions: Set the maximum number of questions allowed.
  • policy: Define the overall questioning flow (e.g., increasing difficulty, no duplicates).
  • follow_up_rules: Specify how to probe deeper or handle incomplete answers.

Seed Question

Provide the initial question to kickstart the interview:

  seed_question: "<First question>"
  • If you are not using the existing benchmark dataset and prefer to define your own scenario (e.g., a café interview scenario), you can set your custom seed question here.
  • If you are using a benchmark dataset, the seed question can be dynamically assigned as follows:
interviewer = Interviewer(config=config_data, name="Interviewer")
interviewer.seed_question = question['question']
interviewer.seed_question_answer = question['solution']
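
When iterating over a whole benchmark, the same two assignments can be applied per row. The sketch below assumes each row is a dict with "question" and "solution" keys, as in the snippet above; with_seed_questions is a hypothetical helper, and SimpleNamespace stands in for the real Interviewer so the example runs offline.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, Iterator


def with_seed_questions(
    make_interviewer: Callable[[], object],
    rows: Iterable[Dict[str, str]],
) -> Iterator[object]:
    """Yield one freshly built interviewer per benchmark row, seeded from that row."""
    for row in rows:
        interviewer = make_interviewer()
        interviewer.seed_question = row["question"]
        interviewer.seed_question_answer = row["solution"]
        yield interviewer


# Stand-in factory; with the library installed you would instead pass something like
# lambda: Interviewer(config=config_data, name="Interviewer")
rows = [
    {"question": "What is 15% of 200?", "solution": "30"},
    {"question": "What is 2 + 2?", "solution": "4"},
]
seeded = list(with_seed_questions(SimpleNamespace, rows))
```

Building a new interviewer per row keeps state from one interview from leaking into the next.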

Grading Rubric

Specify the grading criteria:

  rubric: |
    <Grading criteria>
  • Example: "Score from 0-10 based on problem-solving accuracy and explanation depth."

3. Interviewee Configuration

The interviewee section defines the simulated student or participant.

Basic Settings

interviewee:
  name: "<Interviewee Name>"
  model: "<Interviewee AI Model>"
  client:
    api_key: ${<API_KEY_VARIABLE>}
  • name: Name of the participant (e.g., "Student").
  • model: Specify the AI model (e.g., openai/gpt-4o-mini-2024-07-18).

Instructions

Define the participant’s role, strengths, and challenges:

  instructions: |
    <Interviewee instructions here>
  • Example: "You are a high school student who excels in Geometry but struggles with Algebra."

4. Session Configuration

Control session parameters, such as retries and initial messages:

session:
  max_questions: <Number>
  max_retries: <Number>
  initial_message: "<Message>"
  initial_context:
    interview_complete: <true/false>
    current_topic: "<Topic>"
    questions_asked: <Number>
    assessment_notes: []
  • max_questions: Limit the session length.
  • max_retries: Set how many retries are allowed.
  • initial_message: Customize the opening message.
  • initial_context: Define the starting context, such as the topic and initial notes.
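
Before starting a run, it can help to sanity-check the session section. The following is a minimal sketch, not the library's own validation: SessionSettings and parse_session are hypothetical, and the defaults simply mirror the example configuration shown earlier.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SessionSettings:
    """Hypothetical container mirroring the keys of the `session` section."""
    max_questions: int = 10  # taken from the example config, not a confirmed default
    max_retries: int = 2
    initial_message: str = ""
    initial_context: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.max_questions < 1:
            raise ValueError("max_questions must be at least 1")
        if self.max_retries < 0:
            raise ValueError("max_retries must not be negative")


def parse_session(section: Dict[str, Any]) -> SessionSettings:
    """Build SessionSettings from a dict, rejecting keys the section does not define."""
    unknown = set(section) - set(SessionSettings.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown session keys: {sorted(unknown)}")
    return SessionSettings(**section)


settings = parse_session({
    "max_questions": 5,
    "max_retries": 2,
    "initial_message": "Welcome to the interview...",
    "initial_context": {"interview_complete": False, "questions_asked": 0, "assessment_notes": []},
})
```

Rejecting unknown keys catches typos such as max_question that would otherwise be silently ignored.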

5. Logging Configuration

Enable or disable logging and customize log file details:

logging:
  save_to_file: <true/false>
  output_dir: "<Directory>"
  filename_template: "<Filename format>"
  • save_to_file: Set to true to save logs.
  • output_dir: Define the directory for logs.
  • filename_template: Use placeholders like {timestamp} for dynamic filenames.
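
As an illustration of how a {timestamp} placeholder could expand, the sketch below fills the template with str.format. The build_output_path helper and the timestamp format are assumptions; the format the library actually uses is not stated here.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(output_dir: str, filename_template: str, now: datetime) -> Path:
    """Fill {timestamp} in the template and join the result to the output directory."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")  # assumed format, not confirmed by the library
    return Path(output_dir) / filename_template.format(timestamp=timestamp)


log_path = build_output_path("logs", "session_{timestamp}.log", datetime(2024, 12, 13, 9, 30, 0))
```

The same helper would apply to the report section, since it uses the same three keys.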

6. Interview Report Configuration

Control how the interview report is saved for the session:

report:
  save_to_file: <true/false>
  output_dir: "<Directory>"
  filename_template: "<Filename format>"
  • Similar to the logging configuration, but for reports.

Customization Tips

  • Focus on Roles: Clearly define interviewer and interviewee roles for clarity in behavior.
  • Adapt Strategies: Tailor hint and questioning strategies to align with the interview's goals.
  • Contextual Seed Questions: Use a relevant seed question to set the tone.
  • Test Configuration: Validate settings in a test environment to ensure smooth performance.
  • Dynamic Variables: Leverage placeholders (e.g., ${OPENAI_API_KEY}, {timestamp}) for flexibility.

🔄 Question Decontamination

For users conducting benchmark-based interviews (e.g., GSM8K or MMLU), interview-eval provides functions that help prevent test set contamination through three transformation strategies:

from interview_eval import decontaminate_question

# Choose from three decontamination methods
question = decontaminate_question(
    question="What is 15% of 200?",
    reference_answer="30",
    method="modifying"  # or "unclarifying" or "paraphrasing"
)
  1. Unclarifying (method="unclarifying")

    • Removes key information while maintaining grammar
    • Forces interviewee to ask clarifying questions
    • Evaluates information-gathering skills
  2. Paraphrasing (method="paraphrasing")

    • Preserves exact meaning with different wording
    • Changes sentence structure
    • Maintains problem complexity
  3. Modifying (method="modifying")

    • Creates new but related questions
    • Keeps similar domain and difficulty
    • Tests same knowledge areas
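
To make the difference between the three methods concrete, here is a rough sketch of how a method name might map to a rewriting instruction. It makes no model call, build_decontamination_prompt is hypothetical, and the instruction text only paraphrases the bullets above; the prompts used by interview-eval may differ.

```python
# Instruction text paraphrases the bullet points above; the library's real prompts may differ.
METHOD_INSTRUCTIONS = {
    "unclarifying": "Remove key information from the question while keeping it grammatical.",
    "paraphrasing": "Reword and restructure the question without changing its meaning or difficulty.",
    "modifying": "Write a new, related question in the same domain with similar difficulty.",
}


def build_decontamination_prompt(question: str, reference_answer: str, method: str) -> str:
    """Compose a rewriting prompt for one method (hypothetical helper, no model call)."""
    if method not in METHOD_INSTRUCTIONS:
        raise ValueError(f"method must be one of {sorted(METHOD_INSTRUCTIONS)}, got {method!r}")
    return (
        f"{METHOD_INSTRUCTIONS[method]}\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}"
    )


rewrite_prompt = build_decontamination_prompt("What is 15% of 200?", "30", "paraphrasing")
```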

Batch processing of questions is also supported:

from interview_eval import batch_decontaminate

questions = [
    {"question": "Q1...", "reference_answer": "A1..."},
    {"question": "Q2...", "reference_answer": "A2..."}
]

decontaminated = batch_decontaminate(
    questions,
    method="modifying",
    model="gpt-4"
)

Citation

@misc{kim2024llmasaninterviewer,
      title={LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation}, 
      author={Eunsu Kim and Juyoung Suk and Seungone Kim and Niklas Muennighoff and Dongkwan Kim and Alice Oh},
      year={2024},
      eprint={2412.10424},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.10424}, 
}
