pdf2mcq

Convert PDF files (text PDFs, scanned books, and mixed documents) into high-quality multiple-choice questions (MCQs) using AI.

Built on html2mcq's PDF pipeline and extracted as a standalone library focused purely on PDF-to-MCQ generation.


Features

  • Smart PDF detection — automatically detects text PDFs, scanned PDFs, and mixed documents
  • Text PDFs — fast extraction via PyMuPDF with chunking at sentence boundaries
  • Scanned PDFs — renders pages as images → vision API OCR (or pytesseract fallback)
  • Mixed PDFs — text pages via PyMuPDF + scanned pages via OCR, combined intelligently
  • Multiple AI providers: OpenRouter, Anthropic, OpenAI, Ollama
  • Auto model failover for MCQ generation
  • CLI & Python API
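
To make the detection and chunking ideas above concrete, here is a minimal sketch using only the standard library. It assumes the text of each page has already been extracted (for example with PyMuPDF) and treats a page with very little extractable text as scanned. The function names, the 50-character threshold, and the sentence regex are illustrative assumptions, not pdf2mcq's actual implementation.

```python
import re


def classify_document(page_texts, min_chars=50):
    """Return 'text', 'scanned', or 'mixed' from per-page extracted text.

    Assumption: a page with fewer than min_chars characters of extractable
    text is treated as scanned. The threshold is illustrative only.
    """
    if not page_texts:
        raise ValueError("document has no pages")
    labels = {
        "text" if len(text.strip()) >= min_chars else "scanned"
        for text in page_texts
    }
    if labels == {"text"}:
        return "text"
    if labels == {"scanned"}:
        return "scanned"
    return "mixed"


def chunk_at_sentences(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after ., ! or ? followed by whitespace (a deliberately simple rule).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence ends keeps each chunk readable on its own, which matters when a chunk is later handed to a model as the sole context for a question.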

Quick Start

CLI

# Single PDF
pdf2mcq --pdf-path textbook.pdf -n 10

# Multiple PDF URLs
pdf2mcq --pdf-url https://example.com/chapter1.pdf --pdf-url https://example.com/chapter2.pdf

# Scan a folder of PDFs
pdf2mcq --pdf-folder ./textbooks/

# Output as JSON
pdf2mcq --pdf-path notes.pdf -o questions.json --format json

Python API

from pdf2mcq import PDFMCQGenerator

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    provider="openrouter",
    mcq_model="google/gemini-2.5-flash-lite",
)

# From local PDF
mcq = gen.from_pdf_paths("textbook.pdf", n=5)
print(mcq.to_pretty_str())

# From URL
mcq = gen.from_pdf_urls("https://example.com/notes.pdf", n=3)
print(mcq.to_json())

# Multiple PDFs
mcq = gen.from_pdf_paths(["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"])

Custom Instructions

mcq = gen.from_pdf_paths(
    "lecture-notes.pdf",
    n=10,
    difficulty_mix="50% easy, 50% hard",
    focus_topics=["machine learning", "neural networks"],
    custom_instructions="Focus on mathematical derivations",
)

Auto Model Selection

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    mcq_model="auto",
    mcq_model_list=[
        "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
        "google/gemma-4-31b-it:free",
    ],
)
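
The failover behaviour implied by mcq_model="auto" can be pictured as trying each model in priority order until one succeeds. The sketch below shows that general pattern only; generate_with_failover and the generate callable are hypothetical names, and the library's real retry and error-handling rules may differ.

```python
from typing import Callable, Sequence, Tuple


def generate_with_failover(
    models: Sequence[str],
    generate: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model, result) for the first success.

    `generate` is a hypothetical callable that sends the request to one
    model and raises an exception on failure.
    """
    errors = []
    for model in models:
        try:
            return model, generate(model)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Collecting the per-model errors before raising makes it easier to see why every candidate in the list was rejected.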

Environment Variables

Variable              Purpose
OPENROUTER_API_KEY    Default API key for OpenRouter
ANTHROPIC_API_KEY     API key for Anthropic
OPENAI_API_KEY        API key for OpenAI
PDF2MCQ_MCQ_MODELS    Comma-separated MCQ model priority list for mcq_model="auto"
PDF2MCQ_OCR_MODELS    Comma-separated OCR model priority list for scanned PDFs
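
As an illustration of the comma-separated format, the following sketch turns such a variable into an ordered list of model names. The helper models_from_env is hypothetical, and the whitespace and empty-entry handling shown here is an assumption rather than documented pdf2mcq behaviour.

```python
import os


def models_from_env(var_name, default=None):
    """Read a comma-separated model list from an environment variable.

    Blank entries and surrounding whitespace are dropped; if nothing
    usable is set, a copy of `default` (or an empty list) is returned.
    """
    raw = os.environ.get(var_name, "")
    models = [name.strip() for name in raw.split(",") if name.strip()]
    return models or list(default or [])
```

For example, with PDF2MCQ_MCQ_MODELS set to "model-a, model-b", the call models_from_env("PDF2MCQ_MCQ_MODELS") yields the two names in priority order.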

Output Format

# Pretty-print
print(mcq.to_pretty_str())

# JSON
print(mcq.to_json())
# {
#   "total_exam_time": 20,
#   "questions": [
#     {
#       "question_html": "What is gradient descent?",
#       "options": ["...", "...", "...", "..."],
#       "answers": [0],
#       "multi": false,
#       "marks": 1.0,
#       "negative_marks": 0.25,
#       "difficulty": "easy",
#       "explaination": "..."
#     }
#   ]
# }
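
The JSON structure above is enough to grade a set of responses. The sketch below assumes one plausible marking rule that the source does not state: an exact match with "answers" earns "marks", any other non-empty response loses "negative_marks", and a skipped question scores zero. The function score_exam is a hypothetical helper, not part of pdf2mcq.

```python
import json


def score_exam(exam, responses):
    """Score responses against an exam dict shaped like the JSON above.

    `responses` holds one list of chosen option indices per question,
    in question order; an empty list means the question was skipped.
    Marking rule is an assumption: exact match or penalty.
    """
    total = 0.0
    for question, chosen in zip(exam["questions"], responses):
        if not chosen:
            continue  # skipped: no marks gained or lost
        if set(chosen) == set(question["answers"]):
            total += question["marks"]
        else:
            total -= question["negative_marks"]
    return total


# Minimal exam using only the fields the scorer needs.
exam = json.loads("""
{
  "total_exam_time": 20,
  "questions": [
    {"answers": [0], "marks": 1.0, "negative_marks": 0.25},
    {"answers": [2], "marks": 1.0, "negative_marks": 0.25}
  ]
}
""")
```

Comparing sets rather than lists means the order in which options were selected does not matter, which also covers questions where "multi" is true.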

Installation

pip install pdf2mcq

Requires PyMuPDF (fitz) — installed automatically as a dependency.

For scanned PDF OCR with the pytesseract fallback, also install Tesseract.
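
pytesseract is a wrapper around the Tesseract executable, so the binary has to be reachable on PATH. A quick way to check this before processing scanned PDFs is sketched below; tesseract_available is a hypothetical helper, and whether pdf2mcq performs a similar check itself is not stated here.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("Tesseract not found on PATH; local OCR will not be usable.")
```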


License

MIT
