Comprehensive benchmark and evaluation framework for educational AI question generation

Incept Eval

Local CLI tool for evaluating educational questions with comprehensive AI-powered assessment. Runs evaluation locally using compliance_math_evaluator, answer_verification, and reading_question_qc modules, plus EduBench tasks.

Features

🎯 Comprehensive Evaluation

Internal Evaluator - Scaffolding quality and Direct Instruction (DI) compliance scoring
Answer Verification - GPT-4o powered correctness checking
Reading Question QC - MCQ distractor and question quality checks
EduBench Tasks - Educational benchmarks (QA, EC, IP, AG, QG, TMG)

📊 Flexible Output

Pretty mode for quick score viewing
Full detailed results with all metrics
Append mode for collecting multiple evaluations
JSON output for easy integration

🚀 Easy to Use

Simple CLI interface
Runs the evaluation modules locally (model-backed checks still call the OpenAI and Anthropic APIs)
Requires OpenAI and Anthropic API keys in .env file
Batch processing support

Installation

pip install inceptbench

Quick Start

1. Install

pip install inceptbench

2. Set up API Keys

Create a .env file in your project directory with:

OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional for EduBench

3. Generate Sample File

inceptbench example

This creates qs.json with a complete example question.

4. Evaluate

inceptbench evaluate qs.json --verbose

Usage

Commands

`evaluate` - Evaluate questions from JSON file

# Basic evaluation (pretty mode by default)
inceptbench evaluate questions.json

# Verbose output with progress messages
inceptbench evaluate questions.json --verbose

# Save results to file (overwrite)
inceptbench evaluate questions.json -o results.json

# Append results to file (creates the file if it does not exist)
inceptbench evaluate questions.json -a all_evaluations.json --verbose

# Use local API server
inceptbench evaluate questions.json --api-url http://localhost:8000

# Full results without pretty formatting
inceptbench evaluate questions.json --no-pretty

`example` - Generate sample input file

# Generate qs.json (default)
inceptbench example

# Save to custom filename
inceptbench example -o sample.json

`configure` - Save API key

inceptbench configure YOUR_API_KEY

`help` - Show detailed help

inceptbench help

Input Format

The input JSON file must contain:

request: Question generation request metadata (grade, subject, instructions, etc.)
questions: Array of 1-5 questions to evaluate

Example:

{
  "request": {
    "grade": 3,
    "count": 2,
    "subject": "mathematics",
    "instructions": "Generate multiplication word problems that involve equal groups.",
    "language": "arabic"
  },
  "questions": [
    {
      "type": "mcq",
      "question": "إذا كان لديك 4 علب من القلم وكل علبة تحتوي على 7 أقلام، كم عدد الأقلام لديك إجمالاً؟",
      "answer": "28",
      "difficulty": "medium",
      "explanation": "استخدام ضرب لحساب مجموع الأقلام في جميع العلب.",
      "options": {
        "A": "21",
        "B": "32",
        "C": "35",
        "D": "28"
      },
      "answer_choice": "D",
      "detailed_explanation": { ... },
      "voiceover_script": { ... },
      "skill": null,
      "image_url": null,
      "di_formats_used": [ ... ]
    }
  ]
}

Use inceptbench example to see a complete example with all fields.
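Before running an evaluation, it can help to check that a file has the shape described above. The following is a minimal sketch in Python, not part of inceptbench itself: `validate_input` is a hypothetical helper, and the rule that the option selected by `answer_choice` must equal `answer` is an assumption inferred from the example (where `"D"` maps to `"28"`).

```python
def validate_input(payload: dict) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    if not isinstance(payload.get("request"), dict):
        problems.append("missing 'request' object")
    questions = payload.get("questions")
    if not isinstance(questions, list):
        return problems + ["missing 'questions' array"]
    if not 1 <= len(questions) <= 5:
        problems.append(f"expected 1-5 questions, got {len(questions)}")
    for i, q in enumerate(questions):
        # Only MCQ entries carry options and answer_choice in the example.
        if not isinstance(q, dict) or q.get("type") != "mcq":
            continue
        options = q.get("options") or {}
        choice = q.get("answer_choice")
        if choice not in options:
            problems.append(f"question {i}: answer_choice {choice!r} not in options")
        elif options[choice] != q.get("answer"):
            # Assumption: the chosen option's text should equal "answer".
            problems.append(f"question {i}: option {choice} does not match answer")
    return problems


if __name__ == "__main__":
    sample = {
        "request": {"grade": 3, "subject": "mathematics"},
        "questions": [
            {
                "type": "mcq",
                "answer": "28",
                "options": {"A": "21", "B": "32", "C": "35", "D": "28"},
                "answer_choice": "D",
            }
        ],
    }
    print(validate_input(sample) or "input looks well-formed")
```

To check a real file, load it with `json.load` and pass the resulting dict to `validate_input`.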

Authentication

Required API Keys:

The tool requires API keys from OpenAI and Anthropic for running evaluations. Create a .env file in your working directory:

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional, for EduBench tasks

The tool will automatically load these from the .env file when you run evaluations.
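A quick preflight check can catch a missing key before an evaluation starts. The sketch below is an independent Python check, not how inceptbench loads the file; `parse_env` and `missing_keys` are hypothetical helpers, and the parser only handles simple `KEY=value` lines like the ones shown above.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "# Optional, for EduBench tasks".
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
    print("missing required keys:", missing_keys(env) or "none")
```

`HUGGINGFACE_TOKEN` is left out of `REQUIRED_KEYS` because it is marked optional above.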

Output Format

Pretty Mode (default)

Shows only the scores:

{
  "overall_scores": {
    "total_questions": 1.0,
    "v3_average": 0.9555555555555555,
    "answer_correctness_rate": 1.0,
    "total_edubench_tasks": 3.0
  },
  "v3_scores": [
    {
      "correctness": 1.0,
      "grade_alignment": 1.0,
      "difficulty_alignment": 1.0,
      "language_quality": 0.9,
      "pedagogical_value": 0.9,
      "explanation_quality": 0.8,
      "instruction_adherence": 1.0,
      "format_compliance": 1.0,
      "query_relevance": 1.0,
      "di_compliance": 0.9,
      "overall": 0.9555555555555555,
      "recommendation": "accept"
    }
  ],
  "answer_verification": [
    {
      "is_correct": true,
      "confidence": 10
    }
  ]
}
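Because pretty mode is plain JSON, the scores are easy to post-process. As a minimal sketch, the hypothetical helper below lists every per-question metric under a chosen threshold; the 0.9 default is an arbitrary illustration, not a cutoff used by inceptbench.

```python
def low_metrics(result: dict, threshold: float = 0.9) -> list[tuple[int, str, float]]:
    """Return (question_index, metric, score) for each per-question metric below threshold."""
    flagged = []
    for index, scores in enumerate(result.get("v3_scores", [])):
        for name, value in scores.items():
            # Skip the aggregate "overall" and non-numeric fields such as "recommendation".
            if name == "overall" or isinstance(value, bool):
                continue
            if not isinstance(value, (int, float)):
                continue
            if value < threshold:
                flagged.append((index, name, value))
    return flagged


if __name__ == "__main__":
    # A subset of the pretty-mode output shown above.
    result = {
        "v3_scores": [
            {
                "correctness": 1.0,
                "language_quality": 0.9,
                "explanation_quality": 0.8,
                "di_compliance": 0.9,
                "overall": 0.9555555555555555,
                "recommendation": "accept",
            }
        ]
    }
    print(low_metrics(result))  # [(0, 'explanation_quality', 0.8)]
```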

Full Mode (`--no-pretty`)

Includes all evaluation details:

overall_scores: Aggregate metrics
v3_scores: Per-question scaffolding scores
answer_verification: Answer correctness checks
edubench_results: Full task evaluation responses
summary: Evaluation metadata and timing

Command Reference

Command	Description
`evaluate`	Evaluate questions from JSON file
`example`	Generate sample input file
`configure`	Save API key to config file
`help`	Show detailed help and usage examples

Evaluate Options

Option	Short	Description
`--output PATH`	`-o`	Save results to file (overwrites)
`--append PATH`	`-a`	Append results to file (creates the file if it does not exist)
`--api-key KEY`	`-k`	API key (or use INCEPT_API_KEY env var)
`--api-url URL`		API endpoint (default: production)
`--pretty`		Show only scores (default: true)
`--no-pretty`		Show full results including EduBench details
`--verbose`	`-v`	Show progress messages

Examples

Basic Evaluation

# Evaluate with default settings (pretty mode)
inceptbench evaluate questions.json --verbose

Collecting Multiple Evaluations

# Append multiple evaluations to one file
inceptbench evaluate test1.json -a all_results.json
inceptbench evaluate test2.json -a all_results.json
inceptbench evaluate test3.json -a all_results.json

# Result: all_results.json contains an array of all 3 evaluations
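Once several evaluations have been appended, the array can be summarized with a few lines of Python. This sketch assumes each array element has the pretty-mode shape shown under Output Format (an `overall_scores` object with a `v3_average` field); `load_evaluations` and `summarize` are hypothetical helpers, not part of the CLI.

```python
import json
from pathlib import Path
from statistics import mean


def load_evaluations(path: Path) -> list[dict]:
    """Read an append-mode results file, which is expected to hold a JSON array."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return data


def summarize(evaluations: list[dict]) -> dict:
    """Count the evaluations and average their v3_average scores."""
    averages = [
        e["overall_scores"]["v3_average"]
        for e in evaluations
        if "v3_average" in e.get("overall_scores", {})
    ]
    return {
        "evaluations": len(evaluations),
        # None signals that no entry carried a v3_average.
        "mean_v3_average": mean(averages) if averages else None,
    }
```

For example, `summarize(load_evaluations(Path("all_results.json")))` would report how many evaluations the file holds and their mean `v3_average`.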

Batch Processing

# Evaluate all files and append to one results file
for file in questions/*.json; do
  inceptbench evaluate "$file" -a batch_results.json --verbose
done
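The same loop can be driven from Python when the batch needs error tracking. This is a sketch under one stated assumption: that the CLI exits with a non-zero status when an evaluation fails, which this page does not document. `build_command` and `run_batch` are hypothetical helpers, and only the `evaluate`, `-a`, and `--verbose` arguments listed above are used.

```python
import subprocess
from pathlib import Path


def build_command(input_file: Path, results_file: Path, verbose: bool = True) -> list[str]:
    """Build the argument list for one append-mode evaluation."""
    cmd = ["inceptbench", "evaluate", str(input_file), "-a", str(results_file)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_batch(folder: Path, results_file: Path) -> list[Path]:
    """Evaluate every *.json file in folder and return the inputs whose run failed."""
    failed = []
    for input_file in sorted(folder.glob("*.json")):
        completed = subprocess.run(build_command(input_file, results_file))
        # Assumption: a non-zero exit status means the evaluation failed.
        if completed.returncode != 0:
            failed.append(input_file)
    return failed
```

Calling `run_batch(Path("questions"), Path("batch_results.json"))` mirrors the shell loop above and returns the list of files to retry.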

Local Development

# Test against local API server
inceptbench evaluate test.json --api-url http://localhost:8000 --verbose

Full Results

# Get complete evaluation with EduBench details
inceptbench evaluate questions.json --no-pretty -o full_results.json

Evaluation Modules

The tool evaluates questions using three main modules:

V3 Evaluation

Scaffolding quality assessment (detailed_explanation steps)
Direct Instruction (DI) compliance checking
Pedagogical structure validation
Language quality scoring
Grade and difficulty alignment

Answer Verification

GPT-4o powered correctness checking
Mathematical accuracy validation
Confidence scoring (0-10)

EduBench Tasks

QA: Question Answering - Can the model answer the question?
EC: Error Correction - Can the model identify and correct errors?
IP: Instructional Planning - Can the model provide step-by-step solutions?

All modules run by default. Future versions will support configurable module selection.

Requirements

Python >= 3.11
Incept API key

Support

Issues: GitHub Issues
Help: Run inceptbench help for detailed documentation

License

MIT License - see LICENSE file for details.

Made by the Incept Team

Release history

2.4.0

Apr 16, 2026

2.3.10

Apr 8, 2026

2.3.9 yanked

Apr 8, 2026

Reason this release was yanked:

Gemini SDK needs >=1.70 version to work with this.

2.3.8

Apr 2, 2026

2.3.7 yanked

Mar 30, 2026

Reason this release was yanked:

Regressed evaluations due to incorrect code merges.

2.3.6

Mar 17, 2026

2.3.5

Mar 10, 2026

2.3.4

Mar 3, 2026

2.3.3

Feb 10, 2026

2.3.2

Feb 6, 2026

2.3.1

Jan 29, 2026

2.3.0

Jan 10, 2026

2.2.1

Jan 7, 2026

2.2.0

Jan 2, 2026

2.1.0

Dec 16, 2025

2.0.0

Nov 27, 2025

1.5.5

Nov 24, 2025

1.5.4

Nov 20, 2025

1.5.3

Nov 19, 2025

1.5.2

Nov 18, 2025

1.5.0

Nov 18, 2025

1.4.5

Nov 17, 2025

1.4.4

Nov 12, 2025

1.4.3

Nov 10, 2025

1.4.2

Nov 10, 2025

1.4.1

Oct 30, 2025

1.4.0

Oct 28, 2025

1.3.5

Oct 27, 2025

1.3.4

Oct 27, 2025

1.3.3

Oct 27, 2025

1.3.2

Oct 24, 2025

1.3.1

Oct 23, 2025

1.3.0

Oct 23, 2025

1.2.4

Oct 22, 2025

1.2.3

Oct 22, 2025

1.2.2

Oct 22, 2025

1.2.1

Oct 22, 2025

1.2.0

Oct 21, 2025

1.1.8

Oct 21, 2025

1.1.7

Oct 20, 2025

1.1.6

Oct 20, 2025

1.1.5

Oct 20, 2025

1.1.4

Oct 20, 2025

1.1.3

Oct 20, 2025

1.1.2

Oct 20, 2025

1.1.1

Oct 20, 2025

1.1.0

Oct 20, 2025

1.0.9

Oct 20, 2025

1.0.8

Oct 20, 2025

1.0.7

Oct 17, 2025

1.0.6

Oct 17, 2025

1.0.5

Oct 17, 2025

1.0.4

Oct 17, 2025

1.0.3

Oct 17, 2025

1.0.2

Oct 17, 2025

This version

1.0.1

Oct 17, 2025

1.0.0

Oct 17, 2025

Download files

Source Distribution

inceptbench-1.0.1.tar.gz (62.6 MB)

Uploaded Oct 17, 2025 Source

Built Distribution

inceptbench-1.0.1-py3-none-any.whl (63.3 MB)

Uploaded Oct 17, 2025 Python 3

File details

Details for the file inceptbench-1.0.1.tar.gz.

File metadata

Download URL: inceptbench-1.0.1.tar.gz
Upload date: Oct 17, 2025
Size: 62.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.13.7 Darwin/24.3.0

File hashes

Hashes for inceptbench-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`f04c8edfbea661924ee9600db3b00fa2eaec4e1f44e6ffeb04f3c594dcc6bf67`
MD5	`8aad518e8772df00754653ecac943537`
BLAKE2b-256	`68e3b68aa74ac8d3e0229802d18ce0d4d17a1f9463bb299ba856ec35c4c133a3`

File details

Details for the file inceptbench-1.0.1-py3-none-any.whl.

File metadata

Download URL: inceptbench-1.0.1-py3-none-any.whl
Upload date: Oct 17, 2025
Size: 63.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.13.7 Darwin/24.3.0

File hashes

Hashes for inceptbench-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eb10b90a5f1d6a0757b24b07e4f9f2a7363f221299dd2b1cf457362f96d17cb0`
MD5	`4c095160ff08a406ef1316c7b9b93989`
BLAKE2b-256	`fcb13a43a7a70cdbf910263450db6c4175ad509e2165761b6ff4188b5c4eec4c`
