
Comprehensive benchmark and evaluation framework for educational AI question generation


InceptBench


Educational content evaluation framework with multiple AI-powered assessment modules.

⚠️ DEPRECATION NOTICE - Action Required

The legacy evaluator (v1.5.x) is deprecated and will be removed on December 6, 2025.

After this date, the --new flag behavior will become the default.

Migration: Update your integrations now by adding --new to your commands:

# Before (legacy - deprecated)
inceptbench evaluate qs.json

# After (v2.0 - recommended)
inceptbench evaluate qs.json --new

See Migration Guide for details.

📖 Documentation

Official Sites

Website · Benchmarks · Glossary · Docs · API Endpoint · API Docs

User Guides

Developer Guides

Resources

🚀 Quick Start

# Install from PyPI (latest published release)
pip install inceptbench

# Or install from source (current repo snapshot)
git clone https://github.com/incept-ai/inceptbench.git
cd inceptbench
python3 -m venv venv && source venv/bin/activate
pip install -e .

# Create .env file (required for evaluation)
echo "OPENAI_API_KEY=your_key" >> .env
echo "ANTHROPIC_API_KEY=your_key" >> .env

# Generate example
inceptbench example

# Run evaluation - Legacy system (v1.5.5)
inceptbench evaluate qs.json --full

# Run evaluation - NEW system (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new

# Advanced mode - Evaluate raw files directly
inceptbench evaluate article.md --new --advanced

# Or call the CLI module directly (no install needed)
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench evaluate qs.json --new

🆕 Two Evaluation Systems

InceptBench offers two evaluation systems:

Legacy System (v1.5.5) - DEPRECATED

⚠️ The legacy evaluator will be removed in a future release.

# Legacy evaluation (default, no flags)
inceptbench evaluate qs.json

New System (v2.0.0) - RECOMMENDED

🚀 Enhanced hierarchical evaluation with detailed reasoning.

# Standard mode: Structured JSON input
inceptbench evaluate qs.json --new

# Advanced mode: Raw file/folder input
inceptbench evaluate article.md --new --advanced
inceptbench evaluate ./lessons/ --new --advanced

Benefits of v2.0.0:

  • Hierarchical Evaluation - Questions, quizzes, articles with nested content
  • Detailed Reasoning - See why content received each score
  • Actionable Suggestions - Specific improvements for each metric
  • Better Error Handling - Individual failures don't crash entire batch
  • Advanced Mode - Evaluate raw files without JSON structuring

Migration Guide: Simply add --new to your existing commands:

# Before (legacy)
inceptbench evaluate qs.json

# After (new system)
inceptbench evaluate qs.json --new

✨ Features

  • 6 Specialized Evaluators - Quality assessment across multiple dimensions
  • Automatic Image Evaluation - Context-aware DI rubric scoring
  • Parallel Processing - 47+ tasks running concurrently
  • Multi-language Support - Evaluate content in any language
  • Hierarchical Content - Evaluate nested structures (articles with quizzes/questions)
  • Raw File Support - Advanced mode for direct file/folder evaluation
  • Production-Ready - Full demo in qs.json (~3-4 minutes)
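
The parallel fan-out described above can be sketched with a thread pool. This is a minimal sketch, not the package's implementation: evaluate_item is a hypothetical stand-in for the real evaluators (which call LLM APIs inside the orchestrator).

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_item(item):
    # Hypothetical stand-in for a real evaluator call (which would hit an LLM API).
    return {"id": item["id"], "score": 0.9}

def evaluate_batch(items, max_threads=10):
    # Fan per-item evaluation tasks out across a thread pool, so one slow
    # item does not serialize the whole batch (mirrors the --max-threads flag).
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(evaluate_item, items))
```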

📊 Evaluators

Evaluator                    What it checks                     Auto-selected
ti_question_qa               Question quality (10 dimensions)   Yes
answer_verification          Answer correctness                 Yes
reading_question_qc          MCQ distractor analysis            Yes
math_content_evaluator       Content quality (9 criteria)       Yes
text_content_evaluator       Pedagogical text assessment        Yes
image_quality_di_evaluator   DI rubric image quality            When image_url is present
external_edubench            Educational benchmark (6 tasks)    No

See EVALUATORS.md for details.

📦 Architecture

inceptbench/
├── src/inceptbench/          # Unified package (src/ layout)
│   ├── orchestrator.py        # Main evaluation orchestrator
│   ├── cli.py                 # Command-line interface
│   ├── core/                  # Core evaluators and utilities
│   ├── agents/                # Agent-based evaluators
│   ├── qc/                    # Quality control modules
│   ├── evaluation/            # Evaluation templates
│   └── image/                 # Image quality evaluation
├── submodules/                # External dependencies
│   ├── reading-question-qc/
│   ├── EduBench/
│   ├── agentic-incept-reasoning/
│   └── image_generation_package/
└── pyproject.toml             # Package configuration

🎯 Demo

The qs.json file demonstrates all capabilities:

  • 8 questions (MCQ/fill-in, Arabic/English)
  • 4 text content items
  • 7 images (auto-evaluated)
  • All 6 evaluators active
  • ~3-4 minute runtime

✅ Local Smoke Test

Use the bundled demo file to validate your environment before making changes:

# Using new evaluator (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new

# Using legacy evaluator (v1.5.5)
inceptbench evaluate qs.json --full

# Or run locally without installing the package
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench evaluate qs.json --new

# Or using Python API (legacy)
python -c "from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest; import json; data = json.load(open('qs.json')); request = UniversalEvaluationRequest(**data); result = universal_unified_benchmark(request); print(result.model_dump_json(indent=2))"

These commands exercise the evaluator and report per-item scores plus the inceptbench_version (1.5.5 for legacy, 2.0.0 for new). Sample data leaves some image_url fields set to null, so the DI image checker will log FileNotFoundError: 'null' entries—those are expected for the placeholders and can be ignored during the smoke test.

🌐 Locale-Aware Localization

UniversalEvaluationRequest now accepts a locale such as ar-AE, en-AE, or en-IN. The format is:

  • First segment (ar, en, etc.): language of the text
  • Second segment (AE, IN, etc.): cultural/regional guardrails to apply

When locale is provided, all localization checks use the corresponding language + cultural context. If it is omitted, we fall back to the legacy language field and heuristics (auto-detecting non-ASCII text when necessary).

Localization now runs for every item (including English) so cultural guardrails are always enforced; locale/language metadata simply control which prompts fire. Localized prompts run through a dedicated localization_evaluator, making cultural QA a first-class signal rather than a side-effect of other evaluators. Technical checks (schema fidelity, grammar, etc.) live in other modules—this evaluator focuses only on cultural neutrality and regional appropriateness.
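
The fallback order above might look like the following. This is a hedged sketch only; resolve_locale is a hypothetical helper, not part of the public API.

```python
def resolve_locale(locale=None, language=None, text=""):
    """Pick (language, region): an explicit locale wins, then the legacy
    `language` field, then a non-ASCII heuristic on the text itself."""
    if locale:
        lang, _, region = locale.partition("-")
        return lang, region or None
    if language:
        return language, None
    # Heuristic fallback: treat all-ASCII text as English, otherwise
    # defer to auto-detection.
    is_ascii = all(ord(ch) < 128 for ch in text)
    return ("en" if is_ascii else "auto"), None
```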

Rule-based regionalization checks (ITD guidance):

  • Familiarity & relevance: keep contexts understandable for the target region/grade (no “filing taxes” for Grade 3, no hyper-local fruit for remote regions).
  • Regional reference limit: at most one explicit local prop—multiple props often create caricatures.
  • Instruction-aligned language: only switch languages when the student’s classroom instruction uses that language (respect bilingual/international settings).
  • Respectful tone & content: references must not mock, stereotype, or oversimplify cultures; neutral fallbacks beat risky flair.
  • Rule-first transparency: every failure cites the violated rule, favoring deterministic guardrails over fuzzy similarity scores.

All localization guardrails live in src/inceptbench/agents/localization_guidelines.json, so future tweaks are data-only—add new cultural rules/prompts in JSON and the evaluator automatically picks them up without code changes.

Each rule is scored via its own compact prompt that returns 0 (fail) or 1 (pass); section and overall scores are simply the percentage of guardrail rules satisfied, so localization quality is now a transparent, deterministic checklist.
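
That scoring rule reduces to a simple average of pass/fail bits (sketch only; the function name is illustrative):

```python
def checklist_score(rule_results):
    # Each guardrail prompt returns 0 (fail) or 1 (pass); the section score
    # is simply the fraction of rules satisfied.
    if not rule_results:
        return 1.0  # no applicable rules means nothing was violated
    return sum(rule_results) / len(rule_results)
```

So a section whose rules return [1, 1, 0, 1] scores 0.75.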

📝 Example Usage

CLI - Standard Mode

# New evaluator (v2.0.0) - RECOMMENDED
inceptbench evaluate qs.json --new
inceptbench evaluate qs.json --new -o results.json
inceptbench evaluate qs.json --new --max-threads 20

# Legacy evaluator (v1.5.5) - DEPRECATED
inceptbench evaluate qs.json --full
inceptbench evaluate qs.json -o results.json

CLI - Advanced Mode (Raw Files)

# Evaluate a single file
inceptbench evaluate article.md --new --advanced
inceptbench evaluate lesson.txt --new --advanced -o result.json

# Evaluate all files in a folder
inceptbench evaluate ./lessons/ --new --advanced
inceptbench evaluate ./content/ --new --advanced --max-threads 5 -o batch.json

Advanced Mode Features:

  • No JSON structuring required - just pass raw text files
  • Supports markdown, text, HTML, or any text-based format
  • Automatic content type detection
  • Batch processing for folders
  • Output keyed by filename

Example Output (Advanced Mode):

{
  "request_id": "abc123...",
  "evaluations": {
    "article.md": {
      "inceptbench_new_evaluation": {
        "content_type": "article",
        "overall": {
          "score": 0.85,
          "reasoning": "Well-structured with clear explanations...",
          "suggested_improvements": "Add more practice problems..."
        },
        "factual_accuracy": { ... },
        // ... all metrics
      },
      "score": 0.85
    }
  },
  "evaluation_time_seconds": 45.3,
  "inceptbench_version": "2.0.0"
}

Python API

from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],
    generated_questions=[{
        "id": "q1",
        "type": "mcq",
        "question": "What is 2+2?",
        "answer": "4",
        "answer_options": {"A": "3", "B": "4", "C": "5"},
        "answer_explanation": "2+2 equals 4",
        "skill": {
            "title": "Basic Addition",
            "grade": "1",
            "subject": "mathematics",
            "difficulty": "easy"
        }
    }]
)

response = universal_unified_benchmark(request)
print(response.evaluations["q1"].score)

See USAGE.md for complete examples.

🖼️ Image Evaluation

Add image_url to any question or content:

{
  "id": "q1",
  "question": "How many apples?",
  "image_url": "https://example.com/apples.png"
}

The image_quality_di_evaluator runs automatically with:

  • Context-aware evaluation (accompaniment vs standalone)
  • DI rubric scoring (0-100, normalized to 0-1)
  • Hard-fail gates (answer leakage, wrong representations)
  • Canonical DI representation checks
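
The normalization and hard-fail behavior can be sketched as follows (an assumed shape; the actual rubric implementation may differ):

```python
def normalize_di_score(raw_score, hard_fail=False):
    # Hard-fail gates (answer leakage, wrong representation) zero the score
    # outright; otherwise the 0-100 rubric total maps linearly onto 0.0-1.0.
    if hard_fail:
        return 0.0
    return max(0.0, min(raw_score, 100)) / 100.0
```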

📥 Input Format

Questions:

{
  "submodules_to_run": ["ti_question_qa"],
  "generated_questions": [{
    "id": "q1",
    "type": "mcq",
    "question": "...",
    "answer": "...",
    "image_url": "..."  // Optional
  }]
}

Text Content:

{
  "submodules_to_run": ["text_content_evaluator"],
  "generated_content": [{
    "id": "text1",
    "type": "text",
    "content": "...",
    "image_url": "..."  // Optional
  }]
}

See INPUT_OUTPUT.md for complete schema.

📤 Output Format

Legacy System (v1.5.5)

Simplified (default):

{
  "evaluations": {
    "q1": {"score": 0.89}
  },
  "inceptbench_version": "1.5.5"
}

Full (verbose=True):

{
  "evaluations": {
    "q1": {
      "ti_question_qa": {
        "overall": 0.95,
        "scores": {...},
        "issues": [...],
        "strengths": [...]
      },
      "score": 0.89
    }
  },
  "inceptbench_version": "1.5.5"
}

New System (v2.0.0)

Response Structure:

Every evaluation in v2.0.0 follows this consistent structure:

  1. Universal Metrics (all content types):

    • overall - Holistic quality assessment
    • factual_accuracy - Correctness of all facts and information
    • educational_accuracy - Alignment with learning objectives
  2. Content-Specific Metrics (varies by type):

    • Questions: clarity_precision, difficulty_appropriateness, distractor_quality, answer_explanation_quality, curriculum_alignment, stimulus_quality, mastery_learning_alignment
    • Quizzes: Same as questions (evaluated as a collection)
    • Articles: curriculum_alignment, teaching_quality, worked_examples, practice_problems, follows_direct_instruction, stimulus_quality, diction_and_sentence_structure
    • Readings: Reading-specific metrics
  3. Hierarchical Evaluation:

    • subcontent_evaluations - Array of evaluations for nested content (e.g., questions within quizzes/articles)
    • null if no nested content exists
  4. Metric Format (all metrics follow this pattern):

    {
      "score": 0.85,  // 0.0 to 1.0 (binary metrics: 0.0 or 1.0)
      "reasoning": "Clear explanation of why this score was given...",
      "suggested_improvements": "Specific actionable suggestions..."  // null if score is 1.0
    }
    

Standard Mode Example:

{
  "evaluations": {
    "q1": {
      "inceptbench_new_evaluation": {
        "content_type": "question",
        "overall": {
          "score": 0.92,
          "reasoning": "High-quality MCQ with clear stem...",
          "suggested_improvements": "Consider adding..."
        },
        "factual_accuracy": {
          "score": 1.0,
          "reasoning": "All facts are correct...",
          "suggested_improvements": null
        },
        "educational_accuracy": {
          "score": 1.0,
          "reasoning": "Aligns perfectly with grade-level objectives...",
          "suggested_improvements": null
        },
        "clarity_precision": {
          "score": 0.9,
          "reasoning": "Question is clear but could be more concise...",
          "suggested_improvements": "Remove redundant phrase in stem..."
        },
        // ... 6 more content-specific metrics
        "subcontent_evaluations": null
      },
      "score": 0.92
    }
  },
  "inceptbench_version": "2.0.0"
}

Advanced Mode Example (Hierarchical Content):

{
  "request_id": "def456...",
  "evaluations": {
    "article.md": {
      "inceptbench_new_evaluation": {
        "content_type": "article",
        "overall": {
          "score": 0.85,
          "reasoning": "Well-structured article with good pedagogical flow...",
          "suggested_improvements": "Add more worked examples before practice problems..."
        },
        "factual_accuracy": {
          "score": 1.0,
          "reasoning": "All mathematical concepts are accurate...",
          "suggested_improvements": null
        },
        "educational_accuracy": {
          "score": 0.9,
          "reasoning": "Aligns well with grade 6 standards...",
          "suggested_improvements": "Add explicit connection to 6.RP.A.2..."
        },
        "curriculum_alignment": { "score": 1.0, ... },
        "teaching_quality": { "score": 0.8, ... },
        "worked_examples": { "score": 0.7, ... },
        // ... 4 more article-specific metrics
        "subcontent_evaluations": [
          {
            "content_type": "question",
            "overall": {
              "score": 0.88,
              "reasoning": "Strong practice question...",
              "suggested_improvements": "Add one more distractor..."
            },
            "factual_accuracy": { "score": 1.0, ... },
            "educational_accuracy": { "score": 1.0, ... },
            // ... 7 more question-specific metrics
            "subcontent_evaluations": null
          },
          // ... more embedded questions
        ]
      },
      "score": 0.85
    }
  },
  "evaluation_time_seconds": 67.8,
  "inceptbench_version": "2.0.0"
}

Key Points:

  • Consistency: All content types use the same metric structure (score, reasoning, suggestions)
  • Transparency: Every score includes detailed reasoning
  • Actionable: Suggestions only appear when score < 1.0
  • Hierarchical: Nested content (questions in quizzes/articles) fully evaluated
  • Comprehensive: 10 metrics per content type (3 universal + 7 content-specific)

Metrics by Content Type:

Every content type gets the three universal metrics (overall, factual_accuracy, educational_accuracy) plus seven content-specific metrics:

  • Question: clarity_precision, difficulty_appropriateness, distractor_quality, answer_explanation_quality, curriculum_alignment, stimulus_quality, mastery_learning_alignment
  • Quiz: same metrics as Question, evaluated as a collection
  • Article: curriculum_alignment, teaching_quality, worked_examples, practice_problems, follows_direct_instruction, stimulus_quality, diction_and_sentence_structure
  • Reading (Fiction/Nonfiction): clarity_precision, difficulty_appropriateness, engagement_quality, comprehension_support, stimulus_quality, diction_and_sentence_structure, length_appropriateness

Score Types:

  • Binary (0.0 or 1.0): curriculum_alignment, follows_direct_instruction, and others where pass/fail is appropriate
  • Continuous (0.0-1.0): Most metrics that assess quality on a spectrum

Hierarchical Evaluation Structure:

Article (with embedded quiz)
├── overall: {score, reasoning, suggestions}
├── factual_accuracy: {score, reasoning, suggestions}
├── educational_accuracy: {score, reasoning, suggestions}
├── [7 article-specific metrics]
└── subcontent_evaluations:
    └── Quiz
        ├── overall: {score, reasoning, suggestions}
        ├── factual_accuracy: {score, reasoning, suggestions}
        ├── educational_accuracy: {score, reasoning, suggestions}
        ├── [7 quiz-specific metrics]
        └── subcontent_evaluations:
            ├── Question 1
            │   ├── overall: {score, reasoning, suggestions}
            │   ├── factual_accuracy: {score, reasoning, suggestions}
            │   ├── educational_accuracy: {score, reasoning, suggestions}
            │   ├── [7 question-specific metrics]
            │   └── subcontent_evaluations: null
            └── Question 2
                ├── [same structure]
                └── subcontent_evaluations: null
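
A consumer can walk this tree recursively; for example, flattening every overall score into (path, score) pairs. This is a sketch against the documented response shape, not a shipped utility.

```python
def collect_scores(evaluation, path=""):
    """Flatten a nested evaluation into (path, overall score) pairs by
    recursing through subcontent_evaluations."""
    label = f"{path}/{evaluation['content_type']}" if path else evaluation["content_type"]
    pairs = [(label, evaluation["overall"]["score"])]
    # subcontent_evaluations is null (None) for leaf content, a list otherwise.
    for child in evaluation.get("subcontent_evaluations") or []:
        pairs.extend(collect_scores(child, label))
    return pairs
```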

🔄 Module Selection

Automatic (if submodules_to_run not specified):

  • Questions → ti_question_qa, answer_verification, math_content_evaluator, reading_question_qc
  • Text → text_content_evaluator, math_content_evaluator
  • Images → image_quality_di_evaluator (auto-added)
  • Localization → localization_evaluator (auto for all languages; uses locale/language metadata to pick prompts)
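
The automatic routing above can be sketched as follows. This is illustrative only; the real selection logic lives inside the orchestrator and may differ, and the "kind" key is an assumption of this sketch.

```python
def select_modules(item):
    # Route an item to evaluators based on its kind, per the rules above.
    modules = []
    if item.get("kind") == "question":
        modules += ["ti_question_qa", "answer_verification",
                    "math_content_evaluator", "reading_question_qc"]
    elif item.get("kind") == "text":
        modules += ["text_content_evaluator", "math_content_evaluator"]
    if item.get("image_url"):
        modules.append("image_quality_di_evaluator")  # auto-added
    modules.append("localization_evaluator")  # runs for every item
    return modules
```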

Manual:

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],  # Only these
    generated_questions=[...]
)

🎛️ CLI Flags Reference

Core Flags

  • --new - Use new evaluator (v2.0.0) instead of legacy (v1.5.5)
  • --advanced - Advanced mode for raw file/folder input (requires --new)
  • --max-threads N - Maximum parallel evaluation threads (default: 10)
  • -o, --output FILE - Save results to file
  • -v, --verbose - Show progress messages
  • --full - Return full detailed results (legacy system only)

Legacy System Only

  • --subject TEXT - Subject area for routing (math, ela, science, etc.)
  • --grade TEXT - Grade level (K, 3, 6-8, 9-12, etc.)
  • --type TEXT - Content type (mcq, fill-in, passage, article, etc.)

Examples

# New evaluator - standard mode
inceptbench evaluate qs.json --new

# New evaluator - advanced mode (raw file)
inceptbench evaluate article.md --new --advanced

# New evaluator - batch processing
inceptbench evaluate ./lessons/ --new --advanced --max-threads 20

# Legacy evaluator
inceptbench evaluate qs.json --subject math --grade 6

Version Detection

The inceptbench_version field in the output indicates which system was used:

  • "1.5.5" - Legacy evaluator
  • "2.0.0" - New evaluator
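
For example, a script consuming saved results can branch on that field (a hypothetical helper, not part of the CLI):

```python
import json

def detect_system(results_path):
    # Read saved output and report which evaluator system produced it.
    with open(results_path) as f:
        version = json.load(f)["inceptbench_version"]
    return "new" if version.startswith("2.") else "legacy"
```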

📚 Additional Documentation

📜 License

Proprietary - Copyright Trilogy Education Services
