Comprehensive benchmark and evaluation framework for educational AI question generation

InceptBench

Educational content evaluation framework with multiple AI-powered assessment modules.

📖 Documentation

Official Sites

Website • Benchmarks • Glossary • Docs • API Endpoint • API Docs

User Guides

USAGE.md - Installation, configuration, CLI & Python API
INPUT_OUTPUT.md - Input schemas and output formats
EVALUATORS.md - Complete evaluator reference

Developer Guides

WIKI.md - Documentation hub and workflows
MAINTAINERS.md - Submodule maintainer guide
PUBLISHING.md - Package publishing workflow
VERSION_LOCATIONS.md - Version file reference

Resources

Google Drive - Test data and examples
GitHub Repo - Source code

🚀 Quick Start

# Install from PyPI (latest published release)
pip install inceptbench

# Or install from source (current repo snapshot)
git clone https://github.com/incept-ai/inceptbench.git
cd inceptbench
python3 -m venv venv && source venv/bin/activate
pip install -e .

# Create .env file (optional - for API-based evaluation)
echo "OPENAI_API_KEY=your_key" >> .env
echo "ANTHROPIC_API_KEY=your_key" >> .env

# Generate example
inceptbench example

# Run evaluation via CLI
inceptbench evaluate qs.json --full

# Or call the CLI module directly (no install needed)
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench.cli evaluate qs.json --full

✨ Features

6 Specialized Evaluators - Quality assessment across multiple dimensions
Automatic Image Evaluation - Context-aware DI rubric scoring
Parallel Processing - 47+ tasks running concurrently
Multi-language Support - Evaluate content in any language
Dual Content Types - Questions (MCQ/fill-in) and text content (passages/explanations)
Production-Ready - Full demo in qs.json (~3-4 minutes)

📊 Evaluators

Evaluator	Description	Runs automatically
ti_question_qa	Question quality (10 dimensions)	Yes
answer_verification	Answer correctness	Yes
reading_question_qc	MCQ distractor analysis	Yes
math_content_evaluator	Content quality (9 criteria)	Yes
text_content_evaluator	Pedagogical text assessment	Yes
image_quality_di_evaluator	DI rubric image quality	Yes (when image_url is present)
external_edubench	Educational benchmark (6 tasks)	No

See EVALUATORS.md for details.

📦 Architecture

inceptbench/
├── src/inceptbench/          # Unified package (src/ layout)
│   ├── orchestrator.py        # Main evaluation orchestrator
│   ├── cli.py                 # Command-line interface
│   ├── core/                  # Core evaluators and utilities
│   ├── agents/                # Agent-based evaluators
│   ├── qc/                    # Quality control modules
│   ├── evaluation/            # Evaluation templates
│   └── image/                 # Image quality evaluation
├── submodules/                # External dependencies
│   ├── reading-question-qc/
│   ├── EduBench/
│   ├── agentic-incept-reasoning/
│   └── image_generation_package/
└── pyproject.toml             # Package configuration

🎯 Demo

The qs.json file demonstrates all capabilities:

8 questions (MCQ/fill-in, Arabic/English)
4 text content items
7 images (auto-evaluated)
All 6 evaluators active
~3-4 minute runtime

✅ Local Smoke Test

Use the bundled demo file to validate your environment before making changes:

# Using CLI (recommended)
inceptbench evaluate qs.json --full

# Or run locally without installing the package
PYTHONPATH="$(pwd)/src:$PYTHONPATH" python -m inceptbench.cli evaluate qs.json --full

# Or using Python API
python -c "from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest; import json; data = json.load(open('qs.json')); request = UniversalEvaluationRequest(**data); result = universal_unified_benchmark(request); print(result.model_dump_json(indent=2))"

These commands exercise every evaluator (including the localization and DI image checks) and report per-item scores along with the combined inceptbench_version. The sample data leaves some image_url fields set to null, so the DI image checker logs FileNotFoundError: 'null' entries. These are expected for the placeholders and can be ignored during the smoke test.

🌐 Locale-Aware Localization

UniversalEvaluationRequest now accepts a locale such as ar-AE, en-AE, or en-IN. The format is:

First segment (ar, en, etc.): language of the text
Second segment (AE, IN, etc.): cultural/regional guardrails to apply

When locale is provided, all localization checks use the corresponding language + cultural context. If it is omitted, we fall back to the legacy language field and heuristics (auto-detecting non-ASCII text when necessary).
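To make the two-segment format concrete, here is a minimal sketch of splitting a locale string into its language and region parts. The `parse_locale` helper and the `ParsedLocale` type are hypothetical names used for illustration; they are not part of the inceptbench API, and the package may handle locales differently internally.

```python
from typing import NamedTuple, Optional


class ParsedLocale(NamedTuple):
    language: str          # first segment, e.g. "ar" or "en"
    region: Optional[str]  # second segment, e.g. "AE" or "IN"; None if absent


def parse_locale(locale: str) -> ParsedLocale:
    """Split a locale such as 'ar-AE' into its language and region segments.

    Hypothetical helper for illustration only; not an inceptbench function.
    """
    language, _, region = locale.strip().partition("-")
    if not language:
        raise ValueError(f"locale has no language segment: {locale!r}")
    # An empty region string becomes None so callers can detect a missing segment.
    return ParsedLocale(language.lower(), region.upper() or None)


print(parse_locale("ar-AE"))
print(parse_locale("en"))
```

A locale without a second segment yields `region=None`, which a caller could treat as "no regional guardrails specified".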

Localization now runs for every item (including English), so cultural guardrails are always enforced; the locale and language metadata only control which prompts fire. Localized prompts run through a dedicated localization_evaluator, making cultural QA a first-class signal rather than a side effect of other evaluators. Technical checks (schema fidelity, grammar, etc.) live in other modules; this evaluator focuses only on cultural neutrality and regional appropriateness.

Rule-based regionalization checks (ITD guidance):

Familiarity & relevance: keep contexts understandable for the target region/grade (no “filing taxes” for Grade 3, no hyper-local fruit for remote regions).
Regional reference limit: at most one explicit local prop, since multiple props often create caricatures.
Instruction-aligned language: only switch languages when the student’s classroom instruction uses that language (respect bilingual/international settings).
Respectful tone & content: references must not mock, stereotype, or oversimplify cultures; neutral fallbacks beat risky flair.
Rule-first transparency: every failure cites the violated rule, favoring deterministic guardrails over fuzzy similarity scores.

All localization guardrails live in src/inceptbench/agents/localization_guidelines.json, so future tweaks are data-only: add new cultural rules or prompts to the JSON file and the evaluator picks them up without code changes.

Each rule is scored via its own compact prompt that returns 0 (fail) or 1 (pass); section and overall scores are simply the percentage of guardrail rules satisfied, so localization quality is now a transparent, deterministic checklist.
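The checklist arithmetic can be sketched as follows. The section names, rule names, and the `percent_passed` helper are hypothetical, and the sketch assumes the overall score is computed over all rules pooled together rather than as an average of section scores; the README does not specify which.

```python
from typing import Dict


def percent_passed(rule_results: Dict[str, int]) -> float:
    """Return the percentage of rules that passed (1 = pass, 0 = fail)."""
    if not rule_results:
        raise ValueError("no rule results to score")
    return 100.0 * sum(rule_results.values()) / len(rule_results)


# Hypothetical sections and rules, used only for illustration.
sections = {
    "familiarity": {"grade_appropriate_context": 1, "region_familiar_context": 1},
    "respect": {"no_stereotypes": 1, "single_local_reference": 0},
}

# One score per section: the share of that section's rules that passed.
section_scores = {name: percent_passed(rules) for name, rules in sections.items()}

# Assumption: the overall score pools every rule from every section.
all_rules = {
    f"{section}.{rule}": result
    for section, rules in sections.items()
    for rule, result in rules.items()
}
overall = percent_passed(all_rules)

print(section_scores)
print(overall)
```

Because every rule contributes a plain 0 or 1, a failing score can always be traced back to the specific rules that returned 0.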

📝 Example Usage

CLI

inceptbench evaluate qs.json --full
inceptbench evaluate qs.json -o results.json

Python API

from inceptbench import universal_unified_benchmark, UniversalEvaluationRequest

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],
    generated_questions=[{
        "id": "q1",
        "type": "mcq",
        "question": "What is 2+2?",
        "answer": "4",
        "answer_options": {"A": "3", "B": "4", "C": "5"},
        "answer_explanation": "2+2 equals 4",
        "skill": {
            "title": "Basic Addition",
            "grade": "1",
            "subject": "mathematics",
            "difficulty": "easy"
        }
    }]
)

response = universal_unified_benchmark(request)
print(response.evaluations["q1"].score)

See USAGE.md for complete examples.

🖼️ Image Evaluation

Add image_url to any question or content:

{
  "id": "q1",
  "question": "How many apples?",
  "image_url": "https://example.com/apples.png"
}

The image_quality_di_evaluator runs automatically with:

Context-aware evaluation (accompaniment vs standalone)
DI rubric scoring (0-100, normalized to 0-1)
Hard-fail gates (answer leakage, wrong representations)
Canonical DI representation checks
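As a rough illustration of the score normalization, the sketch below maps a 0-100 rubric score onto the 0-1 range. The `normalize_di_score` function is hypothetical, and the treatment of hard-fail gates (forcing the score to 0.0) is an assumption; the README does not state how a hard fail affects the final score.

```python
from typing import Iterable


def normalize_di_score(rubric_score: float, hard_fails: Iterable[str] = ()) -> float:
    """Map a 0-100 rubric score to 0-1.

    Assumption for illustration: any triggered hard-fail gate forces 0.0.
    """
    if not 0 <= rubric_score <= 100:
        raise ValueError(f"rubric score out of range: {rubric_score}")
    if list(hard_fails):
        return 0.0
    return rubric_score / 100.0


print(normalize_di_score(85))
print(normalize_di_score(85, hard_fails=["answer_leakage"]))
```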

📥 Input Format

Questions (image_url is optional):

{
  "submodules_to_run": ["ti_question_qa"],
  "generated_questions": [{
    "id": "q1",
    "type": "mcq",
    "question": "...",
    "answer": "...",
    "image_url": "..."
  }]
}

Text Content (image_url is optional):

{
  "submodules_to_run": ["text_content_evaluator"],
  "generated_content": [{
    "id": "text1",
    "type": "text",
    "content": "...",
    "image_url": "..."
  }]
}

See INPUT_OUTPUT.md for complete schema.
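When building request files programmatically, a quick pre-flight check can catch incomplete items before an evaluation run. The sketch below uses only the standard library. The list of required fields is an assumption inferred from the question examples in this README, not the authoritative schema, and `missing_fields` is a hypothetical helper.

```python
import json
from typing import Dict, List

# Assumption: fields that appear in every question example in this README.
REQUIRED_QUESTION_FIELDS = ("id", "type", "question", "answer")


def missing_fields(item: Dict[str, object]) -> List[str]:
    """Return the assumed-required fields that are absent from one question."""
    return [name for name in REQUIRED_QUESTION_FIELDS if name not in item]


payload = {
    "submodules_to_run": ["ti_question_qa"],
    "generated_questions": [
        {"id": "q1", "type": "mcq", "question": "What is 2+2?", "answer": "4"},
        {"id": "q2", "type": "mcq", "question": "What is 3+3?"},  # answer omitted on purpose
    ],
}

# Map each incomplete item's id to the fields it lacks.
problems = {
    item.get("id", "<no id>"): missing_fields(item)
    for item in payload["generated_questions"]
    if missing_fields(item)
}

print(problems)
print(json.dumps(payload, indent=2))
```

Serializing with `json.dumps` also guarantees the file is strict JSON, which does not allow comments.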

📤 Output Format

Simplified (default):

{
  "evaluations": {
    "q1": {"score": 0.89}
  }
}

Full (verbose=True):

{
  "evaluations": {
    "q1": {
      "ti_question_qa": {
        "overall": 0.95,
        "scores": {...},
        "issues": [...],
        "strengths": [...]
      },
      "score": 0.89
    }
  }
}
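Both output shapes keep the combined score under each item's "score" key, so a saved result can be post-processed with the standard library alone. The sketch below parses an inline sample; the q2 entry, the threshold, and the helper names are hypothetical and only illustrate the idea.

```python
import json
from typing import Dict, List

# Inline sample in the documented shape; q2 is a made-up second item.
raw = """
{
  "evaluations": {
    "q1": {"ti_question_qa": {"overall": 0.95}, "score": 0.89},
    "q2": {"score": 0.42}
  }
}
"""


def scores_by_item(result: dict) -> Dict[str, float]:
    """Collect the combined score for every evaluated item id."""
    return {item_id: entry["score"] for item_id, entry in result["evaluations"].items()}


def below_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids whose combined score is under the threshold, sorted."""
    return sorted(item_id for item_id, score in scores.items() if score < threshold)


scores = scores_by_item(json.loads(raw))
print(scores)
print(below_threshold(scores, 0.5))
```

To apply the same helpers to a file written with `-o results.json`, replace `json.loads(raw)` with `json.load` on the opened file.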

🔄 Module Selection

Automatic (if submodules_to_run not specified):

Questions → ti_question_qa, answer_verification, math_content_evaluator, reading_question_qc
Text → text_content_evaluator, math_content_evaluator
Images → image_quality_di_evaluator (auto-added)
Localization → localization_evaluator (auto for all languages; uses locale/language metadata to pick prompts)
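The automatic defaults listed above can be sketched as a small selection function. This is an illustration of the documented mapping, not the orchestrator's actual code; `default_modules` and its boolean parameters are hypothetical.

```python
from typing import List


def default_modules(has_questions: bool, has_text: bool, has_images: bool) -> List[str]:
    """Return the evaluators that would run by default for a request.

    Hypothetical sketch of the documented defaults; not inceptbench code.
    """
    modules: List[str] = []
    if has_questions:
        modules += [
            "ti_question_qa",
            "answer_verification",
            "math_content_evaluator",
            "reading_question_qc",
        ]
    if has_text:
        modules += ["text_content_evaluator", "math_content_evaluator"]
    if has_images:
        modules.append("image_quality_di_evaluator")
    # Localization is listed as automatic for all languages.
    modules.append("localization_evaluator")
    # Drop duplicates (math_content_evaluator can appear twice) while keeping order.
    return list(dict.fromkeys(modules))


print(default_modules(has_questions=True, has_text=True, has_images=False))
```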

Manual:

request = UniversalEvaluationRequest(
    submodules_to_run=["ti_question_qa", "answer_verification"],  # Only these
    generated_questions=[...]
)

📜 License

Proprietary - Copyright Trilogy Education Services

Release history

2.4.0

Apr 16, 2026

2.3.10

Apr 8, 2026

2.3.9 yanked

Apr 8, 2026

Reason this release was yanked:

Gemini SDK needs >=1.70 version to work with this.

2.3.8

Apr 2, 2026

2.3.7 yanked

Mar 30, 2026

Reason this release was yanked:

Regressed evaluations due to incorrect code merges.

2.3.6

Mar 17, 2026

2.3.5

Mar 10, 2026

2.3.4

Mar 3, 2026

2.3.3

Feb 10, 2026

2.3.2

Feb 6, 2026

2.3.1

Jan 29, 2026

2.3.0

Jan 10, 2026

2.2.1

Jan 7, 2026

2.2.0

Jan 2, 2026

2.1.0

Dec 16, 2025

2.0.0

Nov 27, 2025

1.5.5

Nov 24, 2025

This version

1.5.4

Nov 20, 2025

1.5.3

Nov 19, 2025

1.5.2

Nov 18, 2025

1.5.0

Nov 18, 2025

1.4.5

Nov 17, 2025

1.4.4

Nov 12, 2025

1.4.3

Nov 10, 2025

1.4.2

Nov 10, 2025

1.4.1

Oct 30, 2025

1.4.0

Oct 28, 2025

1.3.5

Oct 27, 2025

1.3.4

Oct 27, 2025

1.3.3

Oct 27, 2025

1.3.2

Oct 24, 2025

1.3.1

Oct 23, 2025

1.3.0

Oct 23, 2025

1.2.4

Oct 22, 2025

1.2.3

Oct 22, 2025

1.2.2

Oct 22, 2025

1.2.1

Oct 22, 2025

1.2.0

Oct 21, 2025

1.1.8

Oct 21, 2025

1.1.7

Oct 20, 2025

1.1.6

Oct 20, 2025

1.1.5

Oct 20, 2025

1.1.4

Oct 20, 2025

1.1.3

Oct 20, 2025

1.1.2

Oct 20, 2025

1.1.1

Oct 20, 2025

1.1.0

Oct 20, 2025

1.0.9

Oct 20, 2025

1.0.8

Oct 20, 2025

1.0.7

Oct 17, 2025

1.0.6

Oct 17, 2025

1.0.5

Oct 17, 2025

1.0.4

Oct 17, 2025

1.0.3

Oct 17, 2025

1.0.2

Oct 17, 2025

1.0.1

Oct 17, 2025

1.0.0

Oct 17, 2025

Download files

Source Distribution

inceptbench-1.5.4.tar.gz (191.7 kB view details)

Uploaded Nov 20, 2025 Source

Built Distribution

inceptbench-1.5.4-py3-none-any.whl (210.8 kB view details)

Uploaded Nov 20, 2025 Python 3

File details

Details for the file inceptbench-1.5.4.tar.gz.

File metadata

Download URL: inceptbench-1.5.4.tar.gz
Upload date: Nov 20, 2025
Size: 191.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.13.7 Darwin/24.3.0

File hashes

Hashes for inceptbench-1.5.4.tar.gz
Algorithm	Hash digest
SHA256	`9dcc256419f2288b5b50f33dd2f06e9bfe10a16b082ebb838773dd399a9c995f`
MD5	`a0075815e35517398eace8ae56ba7569`
BLAKE2b-256	`f31933b2cc1c52c44bcf1ce0756ea74a21552b8fa06b2443b983ad2a9ceea3e7`

File details

Details for the file inceptbench-1.5.4-py3-none-any.whl.

File metadata

Download URL: inceptbench-1.5.4-py3-none-any.whl
Upload date: Nov 20, 2025
Size: 210.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.13.7 Darwin/24.3.0

File hashes

Hashes for inceptbench-1.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63eb9f1dbdd72fd754c617aa706691fb1aed6dc33b38af3bc45bf9a865f662dc`
MD5	`f59cc6671c647a086e0aa0d730b55f79`
BLAKE2b-256	`559986504dd7c9c4caf1a4cc895595175f9ebdb19bd81b28132d0c7ce134164a`
