Convert PDF files (text, scanned, mixed) into MCQ questions using AI

pdf2mcq

Convert PDF files (text PDFs, scanned books, and mixed documents) into high-quality multiple-choice questions (MCQs) using AI.

Built on html2mcq's PDF pipeline, which has been extracted into a standalone library focused purely on PDF-to-MCQ generation.


Features

  • Smart PDF detection: automatically distinguishes text PDFs, scanned PDFs, and mixed documents
  • Text PDFs: fast extraction via PyMuPDF, with text chunked at sentence boundaries
  • Scanned PDFs: pages are rendered as images and passed to a vision API for OCR (with pytesseract as a fallback)
  • Mixed PDFs: text pages are extracted via PyMuPDF, scanned pages go through OCR, and the results are combined
  • Multiple AI providers: OpenRouter, Anthropic, OpenAI, Ollama
  • Auto model failover for MCQ generation
  • CLI & Python API
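
As a rough illustration of the detection and chunking ideas in the list above, the sketch below classifies a PDF from per-page character counts and splits extracted text at sentence boundaries. It is not pdf2mcq's actual implementation: the function names, the 50-character threshold, and the 1000-character chunk size are assumptions made for this example. Only fitz.open and page.get_text are real PyMuPDF calls.

```python
import re


def page_char_counts(pdf_path):
    """Count extractable characters per page (needs PyMuPDF installed)."""
    import fitz  # PyMuPDF, imported lazily so the helpers below work without it

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def classify_pdf(char_counts, min_chars=50):
    """Return 'text', 'scanned', or 'mixed'.

    Assumption: a page with fewer than `min_chars` extractable characters
    is treated as a scanned image that needs OCR.
    """
    if not char_counts:
        raise ValueError("PDF has no pages")
    labels = {"text" if n >= min_chars else "scanned" for n in char_counts}
    if len(labels) == 2:
        return "mixed"
    return labels.pop()


# A sentence ends at '.', '!' or '?' followed by whitespace (a simple heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text, max_chars=1000):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact in its own chunk.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With these helpers, classify_pdf(page_char_counts("textbook.pdf")) would label a file before deciding whether any pages need OCR. Splitting only at sentence ends keeps each chunk readable on its own, which matters when a chunk is sent to a model as standalone context.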

Quick Start

CLI

# Single PDF
pdf2mcq --pdf-path textbook.pdf -n 10

# Multiple PDF URLs
pdf2mcq --pdf-url https://example.com/chapter1.pdf --pdf-url https://example.com/chapter2.pdf

# Process every PDF in a folder
pdf2mcq --pdf-folder ./textbooks/

# Output as JSON
pdf2mcq --pdf-path notes.pdf -o questions.json --format json
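
To drive the CLI from a script, the commands above can be assembled as an argument list. This is a minimal sketch: build_command and run are hypothetical helpers, and combining -n and -o with repeated --pdf-url flags in one invocation is an assumption, since the examples above only show those options with --pdf-path.

```python
import subprocess


def build_command(pdf_urls, n=None, output=None):
    """Build a pdf2mcq argument list using only flags shown in the Quick Start."""
    cmd = ["pdf2mcq"]
    for url in pdf_urls:
        cmd += ["--pdf-url", url]  # one flag per URL, as in the example above
    if n is not None:
        cmd += ["-n", str(n)]
    if output is not None:
        cmd += ["-o", output, "--format", "json"]
    return cmd


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when URLs or file names contain spaces or special characters.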

Python API

from pdf2mcq import PDFMCQGenerator

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    provider="openrouter",
    mcq_model="google/gemini-2.5-flash-lite",
)

# From local PDF
mcq = gen.from_pdf_paths("textbook.pdf", n=5)
print(mcq.to_pretty_str())

# From URL
mcq = gen.from_pdf_urls("https://example.com/notes.pdf", n=3)
print(mcq.to_json())

# Multiple PDFs
mcq = gen.from_pdf_paths(["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"])

Custom Instructions

mcq = gen.from_pdf_paths(
    "lecture-notes.pdf",
    n=10,
    difficulty_mix="50% easy, 50% hard",
    focus_topics=["machine learning", "neural networks"],
    custom_instructions="Focus on mathematical derivations",
)

Auto Model Selection

gen = PDFMCQGenerator(
    api_key="sk-or-v1-...",
    mcq_model="auto",
    mcq_model_list=[
        "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
        "google/gemma-4-31b-it:free",
    ],
)
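
The general idea behind a priority list with failover can be sketched as follows. This illustrates the pattern only, not how pdf2mcq implements it: generate_with_failover and the generate callable are hypothetical names introduced for this example.

```python
def generate_with_failover(models, generate):
    """Try each model in priority order and return the first success.

    `generate` is any callable taking a model name and returning a result;
    it signals failure by raising an exception.
    """
    errors = {}
    for model in models:
        try:
            return model, generate(model)
        except Exception as exc:  # remember the failure, move to the next model
            errors[model] = exc
    raise RuntimeError(f"all models failed: {errors}")
```

The function returns both the model that succeeded and its result, so the caller can log which entry in the list was actually used.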

Environment Variables

Variable             Purpose
OPENROUTER_API_KEY   Default API key for OpenRouter
ANTHROPIC_API_KEY    API key for Anthropic
OPENAI_API_KEY       API key for OpenAI
PDF2MCQ_MCQ_MODELS   Comma-separated MCQ model priority list for mcq_model="auto"
PDF2MCQ_OCR_MODELS   Comma-separated OCR model priority list for scanned PDFs
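
A comma-separated priority list such as PDF2MCQ_MCQ_MODELS can be turned into a Python list along these lines. This is a sketch of one reasonable parsing rule (trim whitespace, drop empty entries), not necessarily the rule the library applies; model_list_from_env is a hypothetical helper.

```python
import os


def model_list_from_env(name, environ=None):
    """Parse a comma-separated environment variable into an ordered list."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "")
    # Keep the order (it is the priority), trim spaces, skip empty items.
    return [item.strip() for item in raw.split(",") if item.strip()]
```

For example, the result of model_list_from_env("PDF2MCQ_MCQ_MODELS") could be passed as mcq_model_list, and os.environ.get("OPENROUTER_API_KEY") as api_key, instead of hard-coding either value in source files.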

Output Format

# Pretty-print
print(mcq.to_pretty_str())

# JSON
print(mcq.to_json())
# {
#   "total_exam_time": 20,
#   "questions": [
#     {
#       "question_html": "What is gradient descent?",
#       "options": ["...", "...", "...", "..."],
#       "answers": [0],
#       "multi": false,
#       "marks": 1.0,
#       "negative_marks": 0.25,
#       "difficulty": "easy",
#       "explaination": "..."
#     }
#   ]
# }
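
The JSON structure above is straightforward to consume downstream. The sketch below scores a set of responses against it. The scoring rule is an assumption made for illustration (full marks for an exact match, negative_marks deducted for a wrong answer, nothing for a skipped question), and score_attempt is a hypothetical helper, not part of pdf2mcq.

```python
import json


def score_attempt(exam, responses):
    """Score responses against an exam dict shaped like the JSON above.

    `responses` holds one entry per question: a list of chosen option
    indexes, or None if the question was skipped.
    """
    total = 0.0
    for question, chosen in zip(exam["questions"], responses):
        if chosen is None:
            continue  # skipped: no marks gained or lost
        if sorted(chosen) == sorted(question["answers"]):
            total += question["marks"]
        else:
            total -= question["negative_marks"]
    return total


# A two-question exam using only fields shown in the output above.
exam = json.loads("""
{
  "total_exam_time": 20,
  "questions": [
    {"question_html": "Q1", "options": ["a", "b", "c", "d"], "answers": [0],
     "multi": false, "marks": 1.0, "negative_marks": 0.25},
    {"question_html": "Q2", "options": ["a", "b", "c", "d"], "answers": [1],
     "multi": false, "marks": 1.0, "negative_marks": 0.25}
  ]
}
""")
```

Here score_attempt(exam, [[0], [2]]) gives 0.75: one correct answer worth 1.0, minus 0.25 for the wrong one.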

Installation

pip install pdf2mcq

Requires PyMuPDF (imported as fitz), which is installed automatically as a dependency.

To use the pytesseract OCR fallback for scanned PDFs, also install the Tesseract OCR engine on your system.


License

MIT
