
Comprehensive benchmark and evaluation framework for educational AI question generation


InceptBench


Educational content evaluation framework with multiple AI-powered assessment modules.

📖 Documentation

Official Sites

Website • Benchmarks • Glossary • Docs • API Endpoint • API Docs


🚀 Quick Start

# Install from PyPI (latest published release)
pip install inceptbench

# Or install from source (current repo snapshot)
git clone https://github.com/incept-ai/inceptbench.git
cd inceptbench
python3 -m venv venv && source venv/bin/activate
pip install -e .

# Create .env file (optional - for API-based evaluation)
echo "OPENAI_API_KEY=your_key" >> .env
echo "ANTHROPIC_API_KEY=your_key" >> .env

# Generate example
inceptbench example

# Run evaluation via CLI
inceptbench evaluate qs.json --full

# Or call the CLI module directly (no install needed)
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench.cli evaluate qs.json --full

✨ Features

  • 6 Specialized Evaluators - Quality assessment across multiple dimensions
  • Automatic Image Evaluation - Context-aware DI rubric scoring
  • Parallel Processing - 47+ tasks running concurrently
  • Multi-language Support - Evaluate content in any language
  • Dual Content Types - Questions (MCQ/fill-in) and text content (passages/explanations)
  • Production-Ready - Full demo in qs.json (~3-4 minutes)

📊 Evaluators

Evaluator                    Type                              Auto
ti_question_qa               Question quality (10 dimensions)  Yes
answer_verification          Answer correctness                Yes
reading_question_qc          MCQ distractor analysis           Yes
math_content_evaluator       Content quality (9 criteria)      Yes
text_content_evaluator       Pedagogical text assessment       Yes
image_quality_di_evaluator   DI rubric image quality           Auto
external_edubench            Educational benchmark (6 tasks)   No

See EVALUATORS.md for details.

📦 Architecture

inceptbench/
├── src/inceptbench/           # Unified package (src/ layout)
│   ├── orchestrator.py        # Main evaluation orchestrator
│   ├── cli.py                 # Command-line interface
│   ├── core/                  # Core evaluators and utilities
│   ├── agents/                # Agent-based evaluators
│   ├── qc/                    # Quality control modules
│   ├── evaluation/            # Evaluation templates
│   └── image/                 # Image quality evaluation
├── submodules/                # External dependencies
│   ├── reading-question-qc/
│   ├── EduBench/
│   ├── agentic-incept-reasoning/
│   └── image_generation_package/
└── pyproject.toml             # Package configuration

🎯 Demo

The qs.json file demonstrates all capabilities:

  • 8 questions (MCQ/fill-in, Arabic/English)
  • 4 text content items
  • 7 images (auto-evaluated)
  • All 6 evaluators active
  • ~3-4 minute runtime

✅ Local Smoke Test

Use the bundled demo file to validate your environment before making changes:

# Using CLI (recommended)
inceptbench evaluate qs.json --full

# Or run locally without installing the package
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench.cli evaluate qs.json --full

# Or using Python API
python - <<'EOF'
from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest
import json

data = json.load(open("qs.json"))
request = UniversalEvaluationRequest(**data)
result = universal_unified_benchmark(request)
print(result.model_dump_json(indent=2))
EOF

These commands exercise every evaluator (including localization and DI image checks) and report per-item scores plus the combined inceptbench_version. The sample data leaves some image_url fields set to null, so the DI image checker will log FileNotFoundError: 'null' entries; these are expected for the placeholders and can be ignored during the smoke test.
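If those placeholder warnings are noisy, the null image_url fields can be dropped before evaluation. The sketch below is illustrative, not part of inceptbench; the field names simply follow the input schema documented further down:

```python
import json

def drop_null_image_urls(data: dict) -> dict:
    """Remove image_url keys whose value is None so the DI image
    checker skips those items instead of logging FileNotFoundError."""
    for section in ("generated_questions", "generated_content"):
        for item in data.get(section) or []:
            if item.get("image_url") is None:
                item.pop("image_url", None)
    return data

payload = {
    "generated_questions": [
        {"id": "q1", "question": "How many apples?", "image_url": None},
        {"id": "q2", "question": "What is 2+2?", "image_url": "https://example.com/x.png"},
    ]
}
cleaned = drop_null_image_urls(payload)
print(json.dumps(cleaned, indent=2))
```

Run this against a copy of qs.json rather than the bundled file so the original demo data stays intact.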

๐ŸŒ Locale-Aware Localization

UniversalEvaluationRequest now accepts a locale such as ar-AE, en-AE, or en-IN. The format is:

  • First segment (ar, en, etc.): language of the text
  • Second segment (AE, IN, etc.): cultural/regional guardrails to apply

When locale is provided, all localization checks use the corresponding language + cultural context. If it is omitted, we fall back to the legacy language field and heuristics (auto-detecting non-ASCII text when necessary).

๐Ÿ“ Example Usage

CLI

inceptbench evaluate qs.json --full
inceptbench evaluate qs.json -o results.json

Python API

from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],
    generated_questions=[{
        "id": "q1",
        "type": "mcq",
        "question": "What is 2+2?",
        "answer": "4",
        "answer_options": {"A": "3", "B": "4", "C": "5"},
        "answer_explanation": "2+2 equals 4",
        "skill": {
            "title": "Basic Addition",
            "grade": "1",
            "subject": "mathematics",
            "difficulty": "easy"
        }
    }]
)

response = universal_unified_benchmark(request)
print(response.evaluations["q1"].score)

See USAGE.md for complete examples.

๐Ÿ–ผ๏ธ Image Evaluation

Add image_url to any question or content:

{
  "id": "q1",
  "question": "How many apples?",
  "image_url": "https://example.com/apples.png"
}

The image_quality_di_evaluator runs automatically with:

  • Context-aware evaluation (accompaniment vs standalone)
  • DI rubric scoring (0-100, normalized to 0-1)
  • Hard-fail gates (answer leakage, wrong representations)
  • Canonical DI representation checks

📥 Input Format

Questions:

{
  "submodules_to_run": ["ti_question_qa"],
  "generated_questions": [{
    "id": "q1",
    "type": "mcq",
    "question": "...",
    "answer": "...",
    "image_url": "..."  // Optional
  }]
}

Text Content:

{
  "submodules_to_run": ["text_content_evaluator"],
  "generated_content": [{
    "id": "text1",
    "type": "text",
    "content": "...",
    "image_url": "..."  // Optional
  }]
}

See INPUT_OUTPUT.md for complete schema.
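Before submitting a payload, a quick sanity check can catch the most common shape mistakes. This is a minimal sketch; the required-field choices here are illustrative, and INPUT_OUTPUT.md remains the authoritative schema:

```python
def validate_request(payload: dict) -> list[str]:
    """Return a list of problems with a request payload; empty means OK."""
    problems = []
    items = (payload.get("generated_questions") or []) + \
            (payload.get("generated_content") or [])
    if not items:
        problems.append("payload has neither generated_questions nor generated_content")
    for item in items:
        # id and type are required on every item in the examples above
        for key in ("id", "type"):
            if key not in item:
                problems.append(f"item {item.get('id', '?')} missing '{key}'")
    return problems

bad = {"generated_questions": [{"question": "What is 2+2?"}]}
print(validate_request(bad))
```

An empty returned list means the payload at least has the basic shape; it does not replace the package's own Pydantic validation.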

📤 Output Format

Simplified (default):

{
  "evaluations": {
    "q1": {"score": 0.89}
  }
}

Full (verbose=True):

{
  "evaluations": {
    "q1": {
      "ti_question_qa": {
        "overall": 0.95,
        "scores": {...},
        "issues": [...],
        "strengths": [...]
      },
      "score": 0.89
    }
  }
}

🔄 Module Selection

Automatic (if submodules_to_run not specified):

  • Questions → ti_question_qa, answer_verification, math_content_evaluator, reading_question_qc
  • Text → text_content_evaluator, math_content_evaluator
  • Images → image_quality_di_evaluator (auto-added)
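The routing above can be sketched as a pure function. This is a reading of the documented rules, not the package's actual selector:

```python
def auto_select(questions: list, content: list, any_images: bool) -> list[str]:
    """Pick evaluator modules the way the automatic routing is described:
    questions and text get their respective module sets, and the image
    evaluator is appended whenever any item carries an image."""
    modules: list[str] = []
    if questions:
        modules += ["ti_question_qa", "answer_verification",
                    "math_content_evaluator", "reading_question_qc"]
    if content:
        modules += ["text_content_evaluator", "math_content_evaluator"]
    if any_images:
        modules.append("image_quality_di_evaluator")
    # de-duplicate while preserving order (math_content_evaluator can
    # appear in both the question and text sets)
    return list(dict.fromkeys(modules))

print(auto_select([{"id": "q1"}], [], any_images=False))
```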

Manual:

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],  # Only these
    generated_questions=[...]
)

📜 License

Proprietary - Copyright Trilogy Education Services

