No project description provided

Project description

LLM-as-an-Interviewer🎤📄

This is the official GitHub repository for LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation.

LLM-as-an-Interviewer is an evaluation framework that assesses the capabilities of LLMs through an interview-style process. In this approach, the LLM acting as the interviewer evaluates other LLMs by providing feedback and asking follow-up questions, enabling a more comprehensive assessment of their capabilities.

Our framework includes a flexible pipeline that can be easily adapted to various tasks by incorporating a customized evaluation rubric.

All you need to do is fix the configs.yaml file! (Refer to 6. ⚙️ Configuration)

🚀 Quick Start
📦 Installation
🌟 Features
🖥️ Working Process
🛠️ Basic Usage
⚙️ Configuration
- 📄 Example Configurations
🔑 Guideline for Customizing the YAML File
🔄 Question Decontamination
Citation

🚀 Quick Start

Git Clone
Install requirements

"openai>=1.55.2",
"python-dotenv>=1.0.1",
"pyyaml>=6.0.2",
"rich>=13.9.4",
"click"

Setup API key in .env file

OPENAI_API_KEY=YOUR_OPENAI_API_KEY
OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY

Note: To test local models, you can use vLLM serve to launch the OpenAI compatible server.
Run the following command to start the interview

python libs/interview_eval/main.py --config examples/configs/math_problem_solving.yaml

📦 Installation

pip install interview-eval

🌟 Features

AI-powered interviewer and interviewee agents
Configurable interview parameters and evaluation rubrics
Real-time conversation display with rich formatting
Detailed scoring and feedback system
Progress tracking and maximum question limits
Customizable OpenAI client configuration

🖥️ Working Process

The interviewer and interviewee engage in a sequential conversation.
The evaluation of the interviewee’s responses(as a feedback agent) is also displayed (feedback is not disclosed to the interviewee).

Interview Process:

The first question is the Seed Question (question1).
Follow-up questions (question2, ...) are provided thereafter.

Terminal Display Example:

Terminal Display Example

📄 Interview Report

A report containing scores and a comprehensive summary of the interview is saved after the session.

The report includes:

Detailed evaluation scores.
A summary of the interviewee's performance.
Examples of interview logs for each score range.

This report can be configured to be saved automatically in the specified directory.

🛠️ Basic Usage

All you need is the below code snippet and your customized config.yaml file!

from interview_eval import InterviewRunner, Interviewer, Interviewee, InterviewReportManager
from interview_eval.utils import console, load_config, setup_logging
import logging
import yaml

# Load configuration
config = load_config("config.yaml")

# Setup logging and console
logger, log_file_path = setup_logging(config_data, verbose)

# Initialize agents
interviewer = Interviewer(config=config_data, name="Interviewer")
interviewee = Interviewee(config=config_data, name="Student")
report_manager = InterviewReportManager(config=config_data)
# Create and run interview
runner = InterviewRunner(interviewer, student, config_data, logger, log_file_path, console, report_manager)
results = runner.run()
report_manager.generate_report(interviewer)

⚙️ Configuration

Create a YAML configuration file with the following structure:

interviewer:
  name: "Technical Interviewer"
  model: "model name"
  client:
    api_key: ${OPENAI_API_KEY}
  instructions: "Your interview guidelines..."
  rubric: "Evaluation criteria..."
  
  strategy:
    policy: [...]
    follow_up_rules: [...]
  seed_question: [...]
  rubric: [...]

interviewee:
  name: "Candidate"
  model: "model name"
  client:
    api_key: ${OPENAI_API_KEY}
  instructions: "Interviewee behavior guidelines..."

session:
  initial_message: "Welcome to the interview..."
  max_questions: 10
  max_retries: 2
  initial_context: {}

logging:
  save_to_file: true
  output_dir: "logs"
  filename_template: "session_{timestamp}.log"

report:
  save_to_file: true
  output_dir: "reports"
  filename_template: "report_{timestamp}.txt"

📄 Example Configurations

Refer to the following examples for creating your own configuration:

🔑 Guideline for Customizing the YAML File

This guide explains how to customize the YAML file based on your needs to create and configure an effective interview session between an interviewer and an interviewee.

Table of Contents for Customization

Interview Type
Interviewer Configuration
Interviewee Configuration
- Basic Settings
- Instructions
Session Configuration
Logging Configuration
Interview Report Configuration
Customization Tips

1. Interview Type

Define the type of interview:

interview_type: <type>

2. Interviewer Configuration

The interviewer section customizes the behavior and attributes of the interviewer.

Basic Settings

interviewer:
  name: "<Interviewer Name>"
  model: "<Interviewer AI Model>"
  client:
    api_key: ${<API_KEY_VARIABLE>}

name: Provide a name for the interviewer (e.g., "Teacher").
model: Specify the AI model to use (e.g., gpt-4o-mini).
api_key: Add the API key environment variable. (e.g., ${OPENAI_API_KEY}).

Instructions

Define the behavior and goals of the interviewer:

  instructions: |
    <Interviewer instructions here>

Example: "You are a science interviewer assessing high school knowledge. Topics to cover: Biology, Physics."

Hint Strategy

This is the prompt used when the interviewer provides hints (feedback) to the interviewee. Customize how the interviewer provides hints:

  hint_prompt_template: |
    "<Hint strategy description>"

If not specified, it will default to the following:

Default: "Given the following, you have to give a hint to the interviewee to help them answer the question correctly. \nIf the {interviewee_name} makes repeated mistakes, give more hints to fix the mistake.\n"

Questioning Strategy For Follow-Up Questions

This is the instruction used when the interviewer provides follow-up questions to the interviewee.

Define the approach to questioning:

  strategy:
    max_questions: <Number>
    policy:
      - "<Questioning rule 1>"
      - "<Questioning rule 2>"
    follow_up_rules:
      - "<Follow-up rule 1>"
      - "<Follow-up rule 2>"

max_questions: Set the maximum number of questions allowed.
policy: Define the overall questioning flow (e.g., increasing difficulty, no duplicates).
follow_up_rules: Specify how to probe deeper or handle incomplete answers.

Seed Question

Provide the initial question to kickstart the interview:

  seed_question: "<First question>"

If you are not using the existing benchmark dataset and prefer to define your own scenario (e.g., a café interview scenario), you can set your custom seed question here.
If you are using a benchmark dataset, the seed question can be dynamically assigned as follows:

interviewer = Interviewer(config=config_data, name="Interviewer")
interviewer.seed_question = question['question']
interviewer.seed_question_answer = question['solution']

Grading Rubric

Specify the grading criteria:

  rubric: |
    <Grading criteria>

Example: "Score from 0-10 based on problem-solving accuracy and explanation depth."

3. Interviewee Configuration

The interviewee section defines the simulated student or participant.

Basic Settings

interviewee:
  name: "<Interviewee Name>"
  model: "<Interviewee AI Model>"
  client:
    api_key: ${<API_KEY_VARIABLE>}

name: Name of the participant (e.g., "Student").
model: Specify the AI model (e.g., openai/gpt-4o-mini-2024-07-18).

Instructions

Define the participant’s role, strengths, and challenges:

  instructions: |
    <Interviewee instructions here>

Example: "You are a high school student who excels in Geometry but struggles with Algebra."

4. Session Configuration

Control session parameters, such as retries and initial messages:

session:
  max_questions: <Number>
  max_retries: <Number>
  initial_message: "<Message>"
  initial_context:
    interview_complete: <true/false>
    current_topic: "<Topic>"
    questions_asked: <Number>
    assessment_notes: []

max_questions: Limit the session length.
max_retries: Set how many retries are allowed.
initial_message: Customize the opening message.
initial_context: Define the starting context, such as the topic and initial notes.

5. Logging Configuration

Enable or disable logging and customize log file details:

logging:
  save_to_file: <true/false>
  output_dir: "<Directory>"
  filename_template: "<Filename format>"

save_to_file: Set to true to save logs.
output_dir: Define the directory for logs.
filename_template: Use placeholders like {timestamp} for dynamic filenames.

6. Interview Report Configuration

Control Interview Report Saving options for the session:

report:
  save_to_file: <true/false>
  output_dir: "<Directory>"
  filename_template: "<Filename format>"

Similar to the logging configuration, but for reports.

Customization Tips

Focus on Roles: Clearly define interviewer and interviewee roles for clarity in behavior.
Adapt Strategies: Tailor hint and questioning strategies to align with the interview's goals.
Contextual Seed Questions: Use a relevant seed question to set the tone.
Test Configuration: Validate settings in a test environment to ensure smooth performance.
Dynamic Variables: Leverage placeholders (e.g., ${OPENAI_API_KEY}, {timestamp}) for flexibility.

🔄 Question Decontamination

For users conducting benchmark-based interview (like GSM8K, MMLU, etc.), interview-eval provides functions to prevent test set contamination through three transformation strategies:

from interview_eval import decontaminate_question

# Choose from three decontamination methods
question = decontaminate_question(
    question="What is 15% of 200?",
    reference_answer="30",
    method="modifying"  # or "unclarifying" or "paraphrasing"
)

Unclarifying (method="unclarifying")
- Removes key information while maintaining grammar
- Forces interviewee to ask clarifying questions
- Evaluates information-gathering skills
Paraphrasing (method="paraphrasing")
- Preserves exact meaning with different wording
- Changes sentence structure
- Maintains problem complexity
Modifying (method="modifying")
- Creates new but related questions
- Keeps similar domain and difficulty
- Tests same knowledge areas

Batch processing of questions is also supported:

from interview_eval import batch_decontaminate

questions = [
    {"question": "Q1...", "reference_answer": "A1..."},
    {"question": "Q2...", "reference_answer": "A2..."}
]

decontaminated = batch_decontaminate(
    questions,
    method="modifying",
    model="gpt-4"
)

Citation

@misc{kim2024llmasaninterviewer,
      title={LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation}, 
      author={Eunsu Kim and Juyoung Suk and Seungone Kim and Niklas Muennighoff and Dongkwan Kim and Alice Oh},
      year={2024},
      eprint={2412.10424},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.10424}, 
}

Project details

Release history Release notifications | RSS feed

This version

0.1.3

Jan 6, 2025

0.1.2

Dec 14, 2024

0.1.1

Dec 14, 2024

0.1.0

Nov 28, 2024

0.0.1

Nov 28, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

interview_eval-0.1.3.tar.gz (20.6 kB view details)

Uploaded Jan 6, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

interview_eval-0.1.3-py3-none-any.whl (17.4 kB view details)

Uploaded Jan 6, 2025 Python 3

File details

Details for the file interview_eval-0.1.3.tar.gz.

File metadata

Download URL: interview_eval-0.1.3.tar.gz
Upload date: Jan 6, 2025
Size: 20.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.2

File hashes

Hashes for interview_eval-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`8bf39a23c1ff9293964a5005308bd9d12b48479b5515bac2a680a804c8297ce5`
MD5	`7d0af87d42df0580b44c0db38e1bd546`
BLAKE2b-256	`c616c00845f591a01c93ecc6a0d81aac5c520ab49f1a9d6ff2052358b81bd24b`

See more details on using hashes here.

File details

Details for the file interview_eval-0.1.3-py3-none-any.whl.

File metadata

Download URL: interview_eval-0.1.3-py3-none-any.whl
Upload date: Jan 6, 2025
Size: 17.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.2

File hashes

Hashes for interview_eval-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e2b3bc4e3c90a2059800e257c610269fc5962412db23ab92e62f27a35545d5cb`
MD5	`e899833b039dc14d3be02501c98fb038`
BLAKE2b-256	`11f274227c5facbbb1045cd4df9a56a634b7eed9a457ab7ba92853862bb379af`

See more details on using hashes here.

interview-eval 0.1.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

LLM-as-an-Interviewer🎤📄

Table of Contents

🚀 Quick Start

📦 Installation

🌟 Features

🖥️ Working Process

Interview Process:

Terminal Display Example:

📄 Interview Report

🛠️ Basic Usage

⚙️ Configuration

📄 Example Configurations

🔑 Guideline for Customizing the YAML File

Table of Contents for Customization

1. Interview Type

2. Interviewer Configuration

Basic Settings

Instructions

Hint Strategy

Questioning Strategy For Follow-Up Questions

Seed Question

Grading Rubric

3. Interviewee Configuration

Basic Settings

Instructions

4. Session Configuration

5. Logging Configuration

6. Interview Report Configuration

Customization Tips

🔄 Question Decontamination

Citation

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes