QuizGPT

QuizGPT is a production-grade Python package for generating high-quality quizzes from raw text, web pages, or YouTube video transcripts using OpenAI models. It supports scalable generation of 10 to 10,000 questions with intelligent batching, semantic deduplication, and dimension-based organization.

🌟 Key Features

Core Functionality

  • Content Extraction: Automatically extracts text from URLs, web pages, YouTube videos, or plain text
  • Multiple Choice Quiz Generation: Creates MCQ quizzes with configurable difficulty levels (easy/medium/hard)
  • Content Summarization: Generates concise summaries of source content
  • Flexible Interface: Python API and command-line interface

Advanced Features (v3.4.0+)

  • Scalable Generation: Generate 10 to 10,000 questions in a single request
  • Intelligent Batching: Automatically splits large question requests into LLM-friendly batches (10 questions per request)
  • Thread-Safe Concurrency: Uses ThreadPoolExecutor for efficient parallel processing (configurable workers)
  • Semantic Deduplication: FAISS-based embeddings remove semantic duplicates while preserving variety
  • Dimension-Based Organization: Automatically extracts and organizes questions by topic:
    • Flat: For 10-20 questions
    • Dimensional: For 20-100 questions (organizes by main topics)
    • Hierarchical: For 100-10,000 questions (organizes by topics and subtopics)
  • Automatic Retry Logic: Intelligently retries if deduplication reduces count below target
  • Over-Generation Strategy: Generates 20% extra questions to account for duplicate removal
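The three organization modes map directly to question-count thresholds. A minimal sketch of that selection logic, using the boundaries stated above (the function name is illustrative, not part of the package's API):

```python
def pick_organization_mode(n_questions: int) -> str:
    """Map a target question count to the organization mode used."""
    if n_questions <= 20:
        return "flat"          # single batch, no topic grouping
    if n_questions <= 100:
        return "dimensional"   # grouped by main topics
    return "hierarchical"      # grouped by topics and subtopics
```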

📦 Installation

pip install quizgpt

Required Dependencies

All required dependencies (including faiss-cpu) are installed automatically with pip install quizgpt.

Optional Dependencies for Advanced Features

# FAISS is included in dependencies, but for GPU acceleration:
pip install faiss-gpu  # instead of faiss-cpu

🚀 Quick Start

Generate a Simple Quiz (10 Questions)

from quizgpt import QuizGPT

# Initialize generator
generator = QuizGPT(api_key="sk-...")

# Generate quiz from text
result = generator.generate_quiz(
    input_text="The mitochondria is the powerhouse of the cell...",
    difficulty="medium"
)

print(f"Questions: {len(result['quiz'])}")
print(f"Summary: {result['short']}")

for question in result['quiz'][:3]:
    print(f"\n{question['question']}")
    print(f"   Options: {question['options']}")
    print(f"   Answer: {question['answer']} (#{question['answer_number']})")

Generate from a URL

result = generator.generate_quiz(
    input_text="https://en.wikipedia.org/wiki/Python_(programming_language)"
)

Generate from YouTube

result = generator.generate_quiz(
    input_text="https://www.youtube.com/watch?v=dQw4w9WgXcQ"
)

🔧 Advanced Usage

Generate Large-Scale Quizzes (100+ Questions)

from quizgpt import QuizGPT

# Generate 500 questions from content
generator = QuizGPT(
    api_key="sk-...",
    n_questions=500,  # Will auto-organize into dimensions
    max_workers=4     # Use 4 concurrent threads
)

result = generator.generate_quiz(
    input_text="Long article content...",
    difficulty="hard",
    enable_deduplication=True  # Remove semantic duplicates
)

print(f"Generated {len(result['quiz'])} unique questions")

Using Dimension Extractor Directly

from quizgpt import DimensionExtractor

extractor = DimensionExtractor(api_key="sk-...")

# Extract main topics from content
dimensions = extractor.extract_dimensions(
    content="Your long article...",
    n_dimensions=5
)
print(f"Main topics: {dimensions}")

# Extract subtopics within a dimension
subtopics = extractor.extract_subdimensions(
    content="Your long article...",
    dimension="History",
    n_subdimensions=10
)
print(f"Subtopics in History: {subtopics}")

Using Deduplicator Directly

from quizgpt import QuestionDeduplicator

deduplicator = QuestionDeduplicator(
    api_key="sk-...",
    similarity_threshold=0.90  # 0-1; pairs scoring above this count as duplicates (lower = more aggressive dedup)
)

# Remove duplicates from questions
unique_questions = deduplicator.deduplicate(
    questions=your_questions_list,
    target_count=100  # Try to keep exactly 100 questions
)

print(f"Removed {len(your_questions_list) - len(unique_questions)} duplicates")

CLI Usage

# Basic usage
quizgpt --input "Your text content" --api-key sk-...

# Advanced options
quizgpt \
  --input "https://example.com/article" \
  --api-key sk-... \
  --difficulty hard \
  --question-type mcq \
  --output results.json

# Generate 100 questions and save to file
quizgpt \
  --input "Long article..." \
  --api-key sk-... \
  --output quiz_100_questions.json

📋 API Reference

QuizGPT Class

from quizgpt import QuizGPT

generator = QuizGPT(
    api_key=None,           # OpenAI API key (defaults to env var)
    model="gpt-4o-mini",    # LLM model to use
    max_tokens=1200,        # Max tokens per response
    n_questions=10,         # Target number of questions (10-10,000)
    max_workers=4           # Number of concurrent threads
)

.generate_quiz()

result = generator.generate_quiz(
    input_text,              # URL, YouTube URL, or text content (required)
    difficulty="medium",     # "easy", "medium", or "hard"
    question_type="mcq",     # "mcq" for multiple choice
    enable_deduplication=True # Use FAISS deduplication
)

# Returns: {"quiz": [...], "short": "summary"}
# quiz: List of question dicts with keys: question, options, answer, answer_number, description
# short: Concise summary of content

QuizParser Class

from quizgpt import QuizParser

# Parse LLM response into quiz structure
parsed = QuizParser.parse_response(llm_response_text)

# Extract JSON from text
json_str = QuizParser.extract_json(text)

# Find answer index in options list
idx = QuizParser.find_answer_index("Paris", ["London", "Paris", "Rome"])
# Returns: 1

DimensionExtractor Class

from quizgpt import DimensionExtractor, extract_dimensions, extract_subdimensions

extractor = DimensionExtractor(api_key="sk-...")

# Extract main dimensions/topics from content
dimensions = extractor.extract_dimensions(content, n_dimensions=5)

# Extract sub-dimensions within a dimension
subdimensions = extractor.extract_subdimensions(
    content, 
    dimension="Science",
    n_subdimensions=10
)

# Get full hierarchical structure (auto-detects based on question count)
structure = extractor.get_dimension_structure(
    content,
    n_questions=500
)

QuestionDeduplicator Class

from quizgpt import QuestionDeduplicator

deduplicator = QuestionDeduplicator(
    api_key=None,
    embed_dim=1536,
    similarity_threshold=0.90,
    embed_model="text-embedding-3-small"
)

# Remove semantic duplicates
unique = deduplicator.deduplicate(
    questions,
    target_count=100
)

📊 Question Structure

Each generated question follows this format:

{
    "question": "What is the capital of France?",
    "options": ["London", "Paris", "Berlin", "Madrid"],
    "answer": "Paris",
    "answer_number": 1,  # 0-based index into options
    "description": "The largest city and capital of France"
}
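A small consistency check for this structure is handy when post-processing generated quizzes. This helper is illustrative only, not part of the package:

```python
def validate_question(q: dict) -> bool:
    """Return True if a question dict is internally consistent."""
    required = {"question", "options", "answer", "answer_number", "description"}
    if not required.issubset(q):
        return False
    idx = q["answer_number"]
    # answer_number must be a valid 0-based index that agrees with answer
    return (isinstance(idx, int)
            and 0 <= idx < len(q["options"])
            and q["options"][idx] == q["answer"])
```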

🏗️ Architecture & Algorithms

Generation Pipeline

Input Content
    ↓
Extract Text
    ↓
Plan Distribution (how to organize questions)
    ├─ n_questions ≤ 20: Flat (single batch)
    ├─ 20 < n_questions ≤ 100: Dimensional (by topic)
    └─ n_questions > 100: Hierarchical (by topic + subtopic)
    ↓
Generate Batches (ThreadPoolExecutor, max 10 Q per request)
    ↓
Collect Questions
    ↓
Deduplicate (FAISS + cosine similarity)
    ↓
Ensure Exact Count (retry if needed)
    ↓
Generate Summary
    ↓
Return Results
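The planning and batching steps above can be sketched as follows. The batch size of 10 and the 20% over-generation buffer follow the numbers in this README; the function itself is an illustration, not the package's internal implementation:

```python
import math

def plan_batches(n_questions: int, batch_size: int = 10,
                 overgen_pct: int = 20) -> list[int]:
    """Split a target count into LLM-friendly request sizes, adding
    ~20% extra so deduplication can discard near-duplicates."""
    extra = math.ceil(n_questions * overgen_pct / 100)
    total = n_questions + extra
    full, remainder = divmod(total, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])
```

For a 100-question target this yields twelve batches of 10: 120 questions generated, with roughly 100 expected to survive deduplication.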

Deduplication Algorithm

  1. Generate Embeddings: Use OpenAI's text-embedding-3-small (1536-dim)
  2. Build FAISS Index: Create IndexFlatIP for cosine similarity
  3. Similarity Search: For each question, find similar questions
  4. Remove Duplicates: Mark questions similar above threshold as duplicates
  5. Keep First Occurrence: Preserve diversity
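With unit-normalized embeddings, the inner product that FAISS's IndexFlatIP computes is exactly cosine similarity, so the keep-first policy can be illustrated without FAISS. This sketch takes precomputed embedding vectors (in the package they come from text-embedding-3-small) and returns the indices of questions to keep:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greedy_dedup(embeddings, threshold=0.90):
    """Keep the first occurrence; drop any later vector whose cosine
    similarity to an already-kept vector exceeds the threshold."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

This pure-Python version is O(n²); the FAISS index is what keeps the similarity search fast for thousands of questions.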

Over-Generation Strategy

  • Target: 100 questions
  • Generate: 120 questions (20% extra)
  • After Dedup: ~100 unique questions
  • Result: the requested count is delivered; if deduplication removes more than the 20% buffer, retry logic generates replacements

⚡ Performance

Benchmarks (Typical Hardware)

Questions   Time        Threads   Cost (est.)
10          5-10s       1         $0.02
50          15-20s      4         $0.10
100         25-35s      4         $0.20
500         90-120s     4         $1.00
1000        180-240s    4         $2.00

Optimization Tips

  1. Increase max_workers for more parallelism (up to 10)
  2. Enable deduplication to ensure quality (slight time cost)
  3. Use shorter content for faster processing
  4. Batch multiple requests instead of single large request
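Tips 3 and 4 both amount to chunking: feed focused excerpts rather than one huge document. A minimal paragraph-boundary splitter (illustrative, not part of quizgpt):

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars.
    A single paragraph longer than max_chars becomes its own chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to generate_quiz as a separate, smaller request.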

🧪 Examples

Example 1: Educational Quiz from Article

from quizgpt import QuizGPT

content = """
Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data without explicit programming.
Types include supervised learning, unsupervised learning, and reinforcement learning.
"""

generator = QuizGPT(api_key="sk-...", n_questions=20)
result = generator.generate_quiz(content, difficulty="easy")

for q in result["quiz"]:
    print(f"Q: {q['question']}")
    print(f"A: {q['answer']} - {q['description']}")
    print()

Example 2: Large-Scale Assessment

# Generate 1000 questions for an assessment bank
generator = QuizGPT(api_key="sk-...", n_questions=1000, max_workers=8)

result = generator.generate_quiz(
    input_text="https://en.wikipedia.org/wiki/Biology",
    difficulty="hard",
    enable_deduplication=True
)

# Save to file
import json
with open("biology_questions.json", "w") as f:
    json.dump(result["quiz"], f, indent=2)

print(f"Generated {len(result['quiz'])} unique questions")
print(f"Summary: {result['short']}")

Example 3: Multi-Difficulty Quiz Set

from quizgpt import QuizGPT
import json

with open("textbook_chapter.txt") as f:
    content = f.read()

generator = QuizGPT(api_key="sk-...", n_questions=50)

for difficulty in ["easy", "medium", "hard"]:
    result = generator.generate_quiz(content, difficulty=difficulty)

    with open(f"quiz_{difficulty}.json", "w") as f:
        json.dump(result["quiz"], f)

🔐 Security

  • API Keys: Never hardcode API keys. Use environment variables:

    export OPENAI_API_KEY="sk-..."
    python your_script.py
    

    Or configure in .env:

    OPENAI_API_KEY=sk-...
    
  • Rate Limiting: OpenAI API has rate limits. Use exponential backoff for retries.
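A generic retry wrapper with exponential backoff might look like the sketch below. It retries on any exception for brevity; in real code you would catch only rate-limit errors from the OpenAI client:

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Call `call()` and retry on failure, doubling the delay each
    attempt and adding jitter so parallel workers don't retry in
    lockstep. Re-raises the last error when retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

Usage with the generator from the examples above: result = with_backoff(lambda: generator.generate_quiz(content)).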

🆘 Troubleshooting

Issue: ModuleNotFoundError: No module named 'faiss'

Solution: Install FAISS with pip install faiss-cpu (or pip install faiss-gpu on CUDA systems)

Issue: OPENAI_API_KEY not found

Solution: Set environment variable: export OPENAI_API_KEY="sk-..."

Issue: Low quality questions

Solution:

  • Reduce n_questions (fewer questions stay more focused)
  • Use richer, better-structured source content
  • Lower similarity_threshold in the deduplicator so more near-duplicates are removed

Issue: Too slow for large requests

Solution:

  • Increase max_workers (up to 10)
  • Use shorter content chunks
  • Consider caching for repeated content

📚 Requirements

  • Python 3.9 or higher
  • OpenAI API key (gpt-4o-mini or gpt-4 compatible)
  • FAISS (CPU or GPU)
  • scikit-learn
  • numpy
  • beautifulsoup4
  • requests
  • validators
  • youtube-transcript-api

🛠️ Development

Setup Development Environment

git clone https://github.com/your-org/quizgpt.git
cd quizgpt
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

Run Tests

python -m pytest tests/ -v

Build & Publish

python -m build
python -m twine upload dist/*

📄 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests
  4. Submit a pull request

📞 Support

🎉 Changelog

v3.4.0 (Latest)

  • Scalable Generation: Support for 10-10,000 questions
  • Intelligent Batching: Automatic LLM-friendly batch sizes
  • FAISS Deduplication: Semantic duplicate removal
  • Dimension Extraction: Topic-based organization
  • Thread-Safe Concurrency: ThreadPoolExecutor-based parallelism
  • 🔧 Enhanced Prompt Contextualization: Dimension-aware prompts
  • 🐛 Bug Fixes: Parser improvements

v3.3.1

  • 📖 README improvements
  • 🔧 Minor fixes

v3.3.0

  • 🔄 OpenAI API v1.0+ migration

Made with ❤️ for educators and learners worldwide
