Convert PDF files (text, scanned, mixed) into MCQ questions using AI

pdf2mcq

Convert PDF files (text PDFs, scanned books, and mixed documents) into high-quality multiple-choice questions (MCQs) using AI.

Built on top of html2mcq's PDF pipeline, extracted as a standalone library focused purely on PDF-to-MCQ generation.


Features

  • Smart PDF detection — automatically detects text PDFs, scanned PDFs, and mixed documents
  • Text PDFs — fast extraction via PyMuPDF with chunking at sentence boundaries
  • Scanned PDFs — renders pages as images → vision API OCR (or pytesseract fallback)
  • Mixed PDFs — text pages via PyMuPDF + scanned pages via OCR, combined intelligently
  • Multiple AI providers: OpenRouter, Anthropic, OpenAI, Ollama
  • Auto model failover for MCQ generation
  • CLI & Python API
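
The detection and chunking features listed above can be pictured with a small pure-Python sketch. This is not pdf2mcq's actual implementation: the function names `classify_pdf` and `chunk_sentences`, the `min_chars` threshold, and the regex-based sentence splitter are all assumptions made for illustration. The per-page strings would typically come from PyMuPDF's `page.get_text()`.

```python
import re


def classify_pdf(page_texts, min_chars=50):
    """Label a document 'text', 'scanned', or 'mixed' from per-page extracted text.

    page_texts: one string per page (for example from PyMuPDF's page.get_text()).
    min_chars: hypothetical threshold; a page with fewer non-whitespace
    characters is treated as a scanned image that needs OCR.
    """
    if not page_texts:
        raise ValueError("document has no pages")
    has_text = [len("".join(t.split())) >= min_chars for t in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


def chunk_sentences(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters.

    Breaks happen only between sentences; a single sentence longer than
    max_chars is kept whole as its own chunk.
    """
    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A document whose pages all pass the threshold would go straight to text extraction, while a "mixed" result would route only the low-text pages to OCR.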

Quick Start

CLI

# Single PDF
pdf2mcq --pdf-path textbook.pdf -n 10

# Multiple PDF URLs
pdf2mcq --pdf-url https://example.com/chapter1.pdf --pdf-url https://example.com/chapter2.pdf

# Scan a folder of PDFs
pdf2mcq --pdf-folder ./textbooks/

# Output as JSON
pdf2mcq --pdf-path notes.pdf -o questions.json --format json

Python API

from pdf2mcq import PDFMCQGenerator

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    provider="openrouter",
    mcq_model="google/gemini-2.5-flash-lite",
)

# From local PDF
mcq = gen.from_pdf_paths("textbook.pdf", n=5)
print(mcq.to_pretty_str())

# From URL
mcq = gen.from_pdf_urls("https://example.com/notes.pdf", n=3)
print(mcq.to_json())

# Multiple PDFs
mcq = gen.from_pdf_paths(["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"])

Custom Instructions

mcq = gen.from_pdf_paths(
    "lecture-notes.pdf",
    n=10,
    difficulty_mix="50% easy, 50% hard",
    focus_topics=["machine learning", "neural networks"],
    custom_instructions="Focus on mathematical derivations",
)

Auto Model Selection

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    mcq_model="auto",
    mcq_model_list=[
        "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
        "google/gemma-4-31b-it:free",
    ],
)
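
Conceptually, failover over a priority list means trying each model in order and returning the first successful result. The sketch below illustrates that idea only; `generate_with_failover` and `fake_call` are hypothetical names, and the library's own retry and error handling may differ.

```python
def generate_with_failover(models, call_model):
    """Try each model in priority order; return (model, result) from the first success.

    call_model is any callable that takes a model name and either returns a
    result or raises an exception.
    """
    errors = {}
    for model in models:
        try:
            return model, call_model(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors[model] = exc
    raise RuntimeError(f"All models failed: {errors}")


def fake_call(model):
    """Stand-in for an API call: the first model in the list is simulated as down."""
    if model.startswith("nvidia/"):
        raise TimeoutError("simulated outage")
    return f"questions from {model}"


used_model, result = generate_with_failover(
    [
        "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
        "google/gemma-4-31b-it:free",
    ],
    fake_call,
)
print(used_model)
```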

Environment Variables

Variable             Purpose
OPENROUTER_API_KEY   Default API key for OpenRouter
ANTHROPIC_API_KEY    API key for Anthropic
OPENAI_API_KEY       API key for OpenAI
PDF2MCQ_MCQ_MODELS   Comma-separated MCQ model priority list for mcq_model="auto"
PDF2MCQ_OCR_MODELS   Comma-separated OCR model priority list for scanned PDFs
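
As a rough illustration of how a comma-separated priority list could be read from the environment, here is a minimal sketch. The helper name `model_list_from_env` and its whitespace handling are assumptions; the library's own parsing is not documented here.

```python
import os


def model_list_from_env(var_name, default=()):
    """Parse a comma-separated environment variable into an ordered list of model names.

    Surrounding whitespace and empty entries are dropped; if the variable is
    unset or empty, the default list is returned instead.
    """
    raw = os.environ.get(var_name, "")
    models = [name.strip() for name in raw.split(",") if name.strip()]
    return models or list(default)
```

The order of the names is preserved, so the first entry would be the highest-priority model.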

Output Format

# Pretty-print
print(mcq.to_pretty_str())

# JSON
print(mcq.to_json())
# {
#   "total_exam_time": 20,
#   "questions": [
#     {
#       "question_html": "What is gradient descent?",
#       "options": ["...", "...", "...", "..."],
#       "answers": [0],
#       "multi": false,
#       "marks": 1.0,
#       "negative_marks": 0.25,
#       "difficulty": "easy",
#       "explaination": "..."
#     }
#   ]
# }
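
The JSON shape above is straightforward to consume downstream. The sketch below scores a set of responses against it; the scoring rule is an assumption (full `marks` for an exact match, minus `negative_marks` for a wrong answer, zero for an unanswered question), and `score_exam` is a hypothetical helper, not part of pdf2mcq.

```python
import json


def score_exam(mcq_json, responses):
    """Score responses against a JSON string in the format shown above.

    responses maps a question index to the list of chosen option indices.
    Assumed rule: exact match earns `marks`, any other non-empty choice
    costs `negative_marks`, and unanswered questions score zero.
    """
    data = json.loads(mcq_json)
    total = 0.0
    for index, question in enumerate(data["questions"]):
        chosen = responses.get(index)
        if not chosen:
            continue  # unanswered
        if sorted(chosen) == sorted(question["answers"]):
            total += question["marks"]
        else:
            total -= question["negative_marks"]
    return total


sample = json.dumps({
    "total_exam_time": 20,
    "questions": [
        {
            "question_html": "What is gradient descent?",
            "options": ["...", "...", "...", "..."],
            "answers": [0],
            "multi": False,
            "marks": 1.0,
            "negative_marks": 0.25,
            "difficulty": "easy",
        }
    ],
})
print(score_exam(sample, {0: [0]}))
```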

Installation

pip install pdf2mcq

Requires PyMuPDF (fitz) — installed automatically as a dependency.

For the pytesseract OCR fallback on scanned PDFs, also install the Tesseract OCR engine.


License

MIT
