InceptBench

Comprehensive benchmark and evaluation framework for educational AI question generation

A CLI tool for evaluating educational questions with comprehensive AI-powered assessment. Questions are evaluated locally using multiple modules, including compliance_math_evaluator, answer_verification, reading_question_qc, and the EduBench tasks.


Repository: https://github.com/trilogy-group/inceptbench

Features

🎯 Comprehensive Evaluation

  • Internal Evaluator - Scaffolding quality and DI compliance scoring (0-1 scale)
  • Answer Verification - GPT-4o-powered correctness checking
  • Reading Question QC - MCQ distractor and question quality checks
  • EduBench Tasks - Educational benchmarks (QA, EC, IP, AG, QG, TMG) (0-10 scale)

📊 Flexible Output

  • Simplified mode (default) for quick score viewing - ~95% smaller output
  • Full mode (--full) with all detailed metrics, issues, strengths, and reasoning
  • Append mode (-a) for collecting multiple evaluations
  • JSON output for easy integration

🚀 Easy to Use

  • Simple CLI interface
  • Runs locally with OpenAI and Anthropic API integrations
  • Batch processing support
  • High-throughput benchmark mode for parallel evaluation
  • Only evaluates requested modules (configurable via submodules_to_run)

Installation

pip install inceptbench

# Or upgrade to latest version
pip install inceptbench --upgrade --no-cache-dir

Quick Start

1. Set up API Keys

Create a .env file in your working directory:

OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional for EduBench tasks

2. Generate Sample File

inceptbench example

This creates qs.json with a complete example question including the submodules_to_run configuration.

3. Evaluate

# Simplified output (default)
inceptbench evaluate qs.json

# With progress messages
inceptbench evaluate qs.json --verbose

# Full detailed output
inceptbench evaluate qs.json --full --verbose

Usage

Commands

evaluate - Evaluate questions from JSON file

# Basic evaluation (simplified scores - default)
inceptbench evaluate questions.json

# Verbose output with progress messages
inceptbench evaluate questions.json --verbose

# Full detailed evaluation results
inceptbench evaluate questions.json --full

# Save results to file (overwrite)
inceptbench evaluate questions.json -o results.json

# Append results to file (creates if not exists)
inceptbench evaluate questions.json -a all_evaluations.json --verbose

# Full detailed results to file
inceptbench evaluate questions.json --full -o detailed_results.json --verbose

example - Generate sample input file

# Generate qs.json (default)
inceptbench example

# Save to custom filename
inceptbench example -o sample.json

benchmark - High-throughput parallel evaluation

Process many questions in parallel for maximum throughput. Perfect for evaluating large datasets.

# Basic benchmark (100 parallel workers by default)
inceptbench benchmark questions.json

# Custom worker count
inceptbench benchmark questions.json --workers 50

# Save results with verbose output
inceptbench benchmark questions.json -o results.json --verbose

# With custom settings
inceptbench benchmark questions.json --workers 200 -o benchmark_results.json --verbose

Benchmark Output:

{
  "request_id": "uuid",
  "total_questions": 100,
  "successful": 98,
  "failed": 2,
  "scores": [
    {
      "id": "q1",
      "final_score": 0.91,
      "scores": {
        "compliance_math_evaluator": {"overall": 0.93},
        "answer_verification": {"is_correct": true},
        "reading_question_qc": {"overall_score": 0.8}
      }
    }
  ],
  "failed_ids": ["q42", "q87"],
  "evaluation_time_seconds": 45.3,
  "avg_score": 0.89
}
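
Because the benchmark report includes failed_ids, a retry pass is easy to script. Below is a minimal Python sketch (benchmark_results.json, questions.json, and retry.json are placeholder names) that reads a saved report, prints a summary, and writes a follow-up input file containing only the failed questions:

import json

# Load the saved benchmark report and the original input file.
with open("benchmark_results.json") as f:
    report = json.load(f)
with open("questions.json") as f:
    source = json.load(f)

print(f"{report['successful']}/{report['total_questions']} succeeded, "
      f"avg score {report['avg_score']:.2f}, "
      f"{report['evaluation_time_seconds']:.1f}s total")

# Keep only the questions whose ids appear in failed_ids, preserving
# the original submodules_to_run configuration for the retry pass.
failed = set(report["failed_ids"])
retry = {
    "submodules_to_run": source["submodules_to_run"],
    "generated_questions": [
        q for q in source["generated_questions"] if q["id"] in failed
    ],
}
with open("retry.json", "w") as f:
    json.dump(retry, f, ensure_ascii=False, indent=2)

Run inceptbench benchmark retry.json to re-evaluate just those questions.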

help - Show detailed help

inceptbench help

Input Format

The input JSON file must contain:

  • submodules_to_run: List of evaluation modules to run
  • generated_questions: Array of questions to evaluate

Available Modules:

  • compliance_math_evaluator - Internal evaluator (scaffolding + DI compliance)
  • answer_verification - GPT-4o answer correctness checking
  • reading_question_qc - MCQ distractor quality checks
  • directionai_edubench - EduBench educational tasks (QA, EC, IP, etc.)

Example:

{
  "submodules_to_run": [
    "compliance_math_evaluator",
    "answer_verification",
    "reading_question_qc"
  ],
  "generated_questions": [
    {
      "id": "q1",
      "type": "mcq",
      "question": "إذا كان ثمن 2 قلم هو 14 ريالًا، فما ثمن 5 أقلام بنفس المعدل؟",
      "answer": "35 ريالًا",
      "answer_explanation": "الخطوة 1: تحليل المسألة — لدينا ثمن 2 قلم وهو 14 ريالًا. نحتاج إلى معرفة ثمن 5 أقلام بنفس المعدل. يجب التفكير في العلاقة بين عدد الأقلام والسعر وكيفية تحويل عدد الأقلام بمعدل ثابت.\nالخطوة 2: تطوير الاستراتيجية — يمكننا أولًا إيجاد ثمن قلم واحد بقسمة 14 ÷ 2 = 7 ريال، ثم ضربه في 5 لإيجاد ثمن 5 أقلام: 7 × 5 = 35 ريالًا.\nالخطوة 3: التطبيق والتحقق — نتحقق من منطقية الإجابة بمقارنة السعر بعدد الأقلام. السعر يتناسب طرديًا مع العدد، وبالتالي 35 ريالًا هي الإجابة الصحيحة والمنطقية.",
      "answer_options": {
        "A": "28 ريالًا",
        "B": "70 ريالًا",
        "C": "30 ريالًا",
        "D": "35 ريالًا"
      },
      "skill": {
        "title": "Grade 6 Mid-Year Comprehensive Assessment",
        "grade": "6",
        "subject": "mathematics",
        "difficulty": "medium",
        "description": "Apply proportional reasoning, rational number operations, algebraic thinking, geometric measurement, and statistical analysis to solve multi-step real-world problems",
        "language": "ar"
      },
      "image_url": null,
      "additional_details": "🔹 **Question generation logic:**\nThis question targets proportional reasoning for Grade 6 students, testing their ability to apply ratios and unit rates to real-world problems. It follows a classic proportionality structure — starting with a known ratio (2 items for 14 riyals) and scaling it up to 5 items. The stepwise reasoning develops algebraic thinking and promotes estimation checks to confirm logical correctness.\n\n🔹 **Personalized insight examples:**\n- Choosing 28 ريالًا shows a misunderstanding by doubling instead of proportionally scaling.\n- Choosing 7 ريالًا indicates the learner found the unit rate but didn't scale it up to 5.\n- Choosing 14 ريالًا confuses the given 2-item cost with the required 5-item cost.\n\n🔹 **Instructional design & DI integration:**\nThe question aligns with *Percent, Ratio, and Probability* learning targets. In DI format 15.7, it models how equivalent fractions and proportional relationships can predict outcomes across different scales. This builds foundational understanding for probability and proportional reasoning. By using a simple, relatable context (price of pens), it connects mathematical ratios to practical real-world applications, supporting concept transfer and cognitive engagement."
    }
  ]
}

Use inceptbench example to generate this file automatically.
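
If you generate input files programmatically, the same structure is straightforward to assemble. A minimal Python sketch mirroring the fields of the example above (my_questions.json is a placeholder name; optional fields such as additional_details are omitted here):

import json

# Minimal input payload following the structure shown above.
payload = {
    "submodules_to_run": ["compliance_math_evaluator", "answer_verification"],
    "generated_questions": [
        {
            "id": "q1",
            "type": "mcq",
            "question": "What is 7 x 5?",
            "answer": "35",
            "answer_explanation": "Multiply 7 by 5 to get 35.",
            "answer_options": {"A": "30", "B": "35", "C": "40", "D": "45"},
            "skill": {
                "title": "Multiplication facts",
                "grade": "3",
                "subject": "mathematics",
                "difficulty": "easy",
                "description": "Recall single-digit multiplication facts",
                "language": "en",
            },
            "image_url": None,
        }
    ],
}

with open("my_questions.json", "w") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)

Then evaluate it with inceptbench evaluate my_questions.json.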

Authentication

Required API Keys:

The tool integrates with OpenAI and Anthropic APIs for running evaluations. Create a .env file in your working directory:

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional, for EduBench tasks

The tool will automatically load these from the .env file when you run evaluations.

Output Format

Simplified Mode (default)

Returns only essential scores - ~95% smaller output:

{
  "request_id": "c7bce978-66e9-4f8f-ac52-5468340fde8f",
  "evaluations": {
    "q1": {
      "compliance_math_evaluator": {
        "overall": 0.9333333333333333
      },
      "answer_verification": {
        "is_correct": true
      },
      "reading_question_qc": {
        "overall_score": 0.8
      },
      "final_score": 0.9111111111111111
    }
  },
  "evaluation_time_seconds": 12.151433229446411
}

Note: Only the modules requested in submodules_to_run are included in the output; unrequested modules will not appear.
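
Because the simplified output is plain JSON, it is easy to post-process. A short sketch, assuming results were saved with -o results.json:

import json

# results.json is whatever path you passed to -o.
with open("results.json") as f:
    result = json.load(f)

for qid, modules in result["evaluations"].items():
    print(f"{qid}: final_score={modules['final_score']:.2f}")

print(f"Evaluated in {result['evaluation_time_seconds']:.1f}s "
      f"(request {result['request_id']})")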

Full Mode (--full flag)

Returns complete evaluation details including all scores, issues, strengths, reasoning, and recommendations:

{
  "request_id": "uuid",
  "evaluations": {
    "q1": {
      "compliance_math_evaluator": {
        "overall": 0.95,
        "scores": {
          "correctness": 1.0,
          "grade_alignment": 0.9,
          "difficulty_alignment": 0.9,
          "language_quality": 0.8,
          "pedagogical_value": 0.9,
          "explanation_quality": 0.9,
          "instruction_adherence": 0.9,
          "format_compliance": 1.0,
          "query_relevance": 1.0,
          "di_compliance": 0.9
        },
        "issues": [],
        "strengths": ["Clear explanation", "Good grade alignment"],
        "recommendation": "accept",
        "suggested_improvements": [...],
        "di_scores": {...},
        "section_evaluations": {...}
      },
      "answer_verification": {
        "is_correct": true,
        "correct_answer": "35 ريالًا",
        "confidence": 10,
        "reasoning": "The answer is correct because..."
      },
      "reading_question_qc": {
        "overall_score": 0.85,
        "distractor_checks": {...},
        "question_checks": {...},
        "passed": true
      },
      "final_score": 0.91
    }
  },
  "evaluation_time_seconds": 45.2
}
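
The extra detail in full mode lends itself to triage. A sketch, assuming results were saved with --full -o detailed_results.json, that flags questions the internal evaluator did not recommend accepting:

import json

with open("detailed_results.json") as f:  # path passed to -o
    result = json.load(f)

# Print the recommendation and reported issues for any question
# whose compliance_math_evaluator recommendation is not "accept".
for qid, modules in result["evaluations"].items():
    cme = modules.get("compliance_math_evaluator", {})
    if cme.get("recommendation") != "accept":
        print(qid, cme.get("recommendation"), cme.get("issues", []))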

Command Reference

Command      Description
evaluate     Evaluate questions from JSON file
benchmark    High-throughput parallel evaluation for large datasets
example      Generate sample input file
help         Show detailed help and usage examples

Evaluate Options

Option           Short   Description
--output PATH    -o      Save results to file (overwrites)
--append PATH    -a      Append results to file (creates if not exists)
--full           -f      Return full detailed evaluation results (default: simplified scores only)
--verbose        -v      Show progress messages
--timeout SECS   -t      Request timeout in seconds (default: 600)

Benchmark Options

Option           Short   Description
--output PATH    -o      Save results to file
--workers NUM    -w      Number of parallel workers (default: 100)
--verbose        -v      Show progress messages

Examples

Basic Evaluation

# Evaluate with default settings (simplified scores)
inceptbench evaluate questions.json

# With progress messages
inceptbench evaluate questions.json --verbose

Full Detailed Evaluation

# Get complete evaluation with all details
inceptbench evaluate questions.json --full --verbose

# Save full results to file
inceptbench evaluate questions.json --full -o detailed_results.json

Collecting Multiple Evaluations

# Append multiple evaluations to one file
inceptbench evaluate test1.json -a all_results.json --verbose
inceptbench evaluate test2.json -a all_results.json --verbose
inceptbench evaluate test3.json -a all_results.json --verbose

# Result: all_results.json contains an array of all 3 evaluations
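
Since append mode accumulates a JSON array, aggregating across runs takes only a few lines. A sketch, assuming each array element has the shape shown under Output Format:

import json

with open("all_results.json") as f:
    runs = json.load(f)  # append mode stores a JSON array

# Collect every per-question final_score across all appended runs.
scores = [
    modules["final_score"]
    for run in runs
    for modules in run["evaluations"].values()
]
print(f"{len(scores)} questions across {len(runs)} runs, "
      f"mean final_score {sum(scores) / len(scores):.3f}")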

Batch Processing

# Evaluate all files and append to one results file
for file in questions/*.json; do
  inceptbench evaluate "$file" -a batch_results.json --verbose
done

Benchmark Mode (High-Throughput Parallel Processing)

For large-scale evaluations, use benchmark mode to process hundreds of questions in parallel:

# Evaluate 100 questions with 100 parallel workers
inceptbench benchmark large_dataset.json --verbose

# Process 1000 questions with 200 workers, save results
inceptbench benchmark dataset_1000.json --workers 200 -o benchmark_results.json --verbose

# Results include: success rate, avg score, timing, and failed question IDs

When to use benchmark mode:

  • Large datasets (100+ questions)
  • Need for maximum throughput
  • Want simplified scores only (no detailed output)
  • Need to identify failed questions quickly

Output includes:

  • Total questions processed
  • Success/failure counts
  • Failed question IDs for easy debugging
  • Average score across all questions
  • Total evaluation time
  • One simplified score per question

Evaluation Modules

compliance_math_evaluator (Internal Evaluator)

  • Scaffolding quality assessment (answer_explanation structure)
  • Direct Instruction (DI) compliance checking
  • Pedagogical structure validation
  • Language quality scoring
  • Grade and difficulty alignment
  • Returns scores on 0-1 scale

answer_verification

  • GPT-4o-powered correctness checking
  • Mathematical accuracy validation
  • Confidence scoring (0-10)
  • Reasoning explanation

reading_question_qc

  • MCQ distractor quality checks
  • Question clarity validation
  • Overall quality scoring

directionai_edubench

  • QA: Question Answering - Can the model answer the question?
  • EC: Error Correction - Can the model identify and correct errors?
  • IP: Instructional Planning - Can the model provide step-by-step solutions?
  • AG: Answer Generation - Can the model generate correct answers?
  • QG: Question Generation - Question quality assessment
  • TMG: Test Making Generation - Test design quality
  • Returns scores on 0-10 scale

All modules are optional and configurable via submodules_to_run in the input JSON.
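
To restrict a run to a subset of modules, edit submodules_to_run in the input file. A minimal sketch that rewrites qs.json to run only answer_verification:

import json

with open("qs.json") as f:
    payload = json.load(f)

# Run only the answer-correctness check on the next evaluation.
payload["submodules_to_run"] = ["answer_verification"]

with open("qs.json", "w") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)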

Requirements

  • Python >= 3.11
  • OpenAI API key
  • Anthropic API key
  • Hugging Face token (optional, for EduBench tasks)

License

MIT License - see LICENSE file for details.


Made by the Incept Team
