A Python package for generating unique quizzes (up to 10,000 questions) from text, URLs, or YouTube transcripts using OpenAI.

QuizGPT

QuizGPT is a production-grade Python package for generating high-quality quizzes from raw text, web pages, or YouTube video transcripts using OpenAI models. It supports scalable generation of 10 to 10,000 questions with intelligent batching, semantic deduplication, and dimension-based organization.

🌟 Key Features

Core Functionality

  • Content Extraction: Automatically extracts text from URLs, web pages, YouTube videos, or plain text
  • Multiple Choice Quiz Generation: Creates MCQ quizzes with configurable difficulty levels (easy/medium/hard)
  • Content Summarization: Generates concise summaries of source content
  • Flexible Interface: Python API and command-line interface

Advanced Features (v3.4.0+)

  • Scalable Generation: Generate 10 to 10,000 questions in a single request
  • Intelligent Batching: Automatically splits large question requests into LLM-friendly batches (10 questions per request)
  • Thread-Safe Concurrency: Uses ThreadPoolExecutor for efficient parallel processing (configurable workers)
  • Semantic Deduplication: FAISS-based embeddings remove semantic duplicates while preserving variety
  • Dimension-Based Organization: Automatically extracts and organizes questions by topic:
    • Flat: For 10-20 questions
    • Dimensional: For 20-100 questions (organizes by main topics)
    • Hierarchical: For 100-10,000 questions (organizes by topics and subtopics)
  • Automatic Retry Logic: Intelligently retries if deduplication reduces count below target
  • Over-Generation Strategy: Generates 20% extra questions to account for duplicate removal
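The batching and concurrency described above can be sketched with a minimal illustration (the function names and structure here are hypothetical, not the package's internals): a target of N questions is split into batches of at most 10, and each batch is generated on its own worker thread.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_batches(n_questions: int, batch_size: int = 10) -> list[int]:
    """Split a question target into LLM-friendly batch sizes."""
    full, rest = divmod(n_questions, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

def generate_all(n_questions: int, generate_batch, max_workers: int = 4) -> list:
    """Run one generate_batch(size) call per batch, in parallel."""
    batches = plan_batches(n_questions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(generate_batch, batches)
    # Flatten the per-batch lists into one question list
    return [q for batch in results for q in batch]
```

Because each batch call is independent, `ThreadPoolExecutor.map` preserves batch order while overlapping the network-bound LLM requests.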

📦 Installation

pip install quizgpt

Required Dependencies

Optional Dependencies for Advanced Features

# FAISS is included in dependencies, but for GPU acceleration:
pip install faiss-gpu  # instead of faiss-cpu

🚀 Quick Start

Generate a Simple Quiz (10 Questions)

from quizgpt import QuizGPT

# Initialize generator
generator = QuizGPT(api_key="sk-...")

# Generate quiz from text
result = generator.generate_quiz(
    input_text="The mitochondria is the powerhouse of the cell...",
    difficulty="medium"
)

print(f"Questions: {len(result['quiz'])}")
print(f"Summary: {result['short']}")

for question in result['quiz'][:3]:
    print(f"\n{question['question']}")
    print(f"   Options: {question['options']}")
    print(f"   Answer: {question['answer']} (#{question['answer_number']})")

Generate from a URL

result = generator.generate_quiz(
    input_text="https://en.wikipedia.org/wiki/Python_(programming_language)"
)

Generate from YouTube

result = generator.generate_quiz(
    input_text="https://www.youtube.com/watch?v=dQw4w9WgXcQ"
)

🔧 Advanced Usage

Generate Large-Scale Quizzes (100+ Questions)

from quizgpt import QuizGPT

# Generate 500 questions from content
generator = QuizGPT(
    api_key="sk-...",
    n_questions=500,  # Will auto-organize into dimensions
    max_workers=4     # Use 4 concurrent threads
)

result = generator.generate_quiz(
    input_text="Long article content...",
    difficulty="hard",
    enable_deduplication=True  # Remove semantic duplicates
)

print(f"Generated {len(result['quiz'])} unique questions")

Using Dimension Extractor Directly

from quizgpt import DimensionExtractor

extractor = DimensionExtractor(api_key="sk-...")

# Extract main topics from content
dimensions = extractor.extract_dimensions(
    content="Your long article...",
    n_dimensions=5
)
print(f"Main topics: {dimensions}")

# Extract subtopics within a dimension
subtopics = extractor.extract_subdimensions(
    content="Your long article...",
    dimension="History",
    n_subdimensions=10
)
print(f"Subtopics in History: {subtopics}")

Using Deduplicator Directly

from quizgpt import QuestionDeduplicator

deduplicator = QuestionDeduplicator(
    api_key="sk-...",
    similarity_threshold=0.90  # 0-1; pairs above this count as duplicates, so lower = stricter dedup
)

# Remove duplicates from questions
unique_questions = deduplicator.deduplicate(
    questions=your_questions_list,
    target_count=100  # Try to keep exactly 100 questions
)

print(f"Removed {len(your_questions_list) - len(unique_questions)} duplicates")

CLI Usage

# Basic usage
quizgpt --input "Your text content" --api-key sk-...

# Advanced options
quizgpt \
  --input "https://example.com/article" \
  --api-key sk-... \
  --difficulty hard \
  --question-type mcq \
  --output results.json

# Generate 100 questions and save to file
quizgpt \
  --input "Long article..." \
  --api-key sk-... \
  --output quiz_100_questions.json

📋 API Reference

QuizGPT Class

from quizgpt import QuizGPT

generator = QuizGPT(
    api_key=None,           # OpenAI API key (defaults to env var)
    model="gpt-4o-mini",    # LLM model to use
    max_tokens=1200,        # Max tokens per response
    n_questions=10,         # Target number of questions (10-10,000)
    max_workers=4           # Number of concurrent threads
)

.generate_quiz()

result = generator.generate_quiz(
    input_text,              # URL, YouTube URL, or text content (required)
    difficulty="medium",     # "easy", "medium", or "hard"
    question_type="mcq",     # "mcq" for multiple choice
    enable_deduplication=True # Use FAISS deduplication
)

# Returns: {"quiz": [...], "short": "summary"}
# quiz: List of question dicts with keys: question, options, answer, answer_number, description
# short: Concise summary of content

QuizParser Class

from quizgpt import QuizParser

# Parse LLM response into quiz structure
parsed = QuizParser.parse_response(llm_response_text)

# Extract JSON from text
json_str = QuizParser.extract_json(text)

# Find answer index in options list
idx = QuizParser.find_answer_index("Paris", ["London", "Paris", "Rome"])
# Returns: 1

Note: Options are automatically shuffled during parsing to ensure the correct answer appears in random positions across different quiz generations.
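The shuffling behavior can be illustrated with a small sketch (a hypothetical helper, not the package's actual code): shuffle a copy of the options, then recompute the answer index against the shuffled list.

```python
import random

def shuffle_options(options: list[str], answer: str, rng=random) -> tuple[list[str], int]:
    """Shuffle options and return (shuffled_options, new_answer_index)."""
    shuffled = options[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled, shuffled.index(answer)
```

Whatever order results, `shuffled[new_answer_index]` is always the correct answer, which is the invariant `answer_number` relies on.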

DimensionExtractor Class

from quizgpt import DimensionExtractor, extract_dimensions, extract_subdimensions

extractor = DimensionExtractor(api_key="sk-...")

# Extract main dimensions/topics from content
dimensions = extractor.extract_dimensions(content, n_dimensions=5)

# Extract sub-dimensions within a dimension
subdimensions = extractor.extract_subdimensions(
    content, 
    dimension="Science",
    n_subdimensions=10
)

# Get full hierarchical structure (auto-detects based on question count)
structure = extractor.get_dimension_structure(
    content,
    n_questions=500
)

QuestionDeduplicator Class

from quizgpt import QuestionDeduplicator

deduplicator = QuestionDeduplicator(
    api_key=None,
    embed_dim=1536,
    similarity_threshold=0.90,
    embed_model="text-embedding-3-small"
)

# Remove semantic duplicates
unique = deduplicator.deduplicate(
    questions,
    target_count=100
)

📊 Question Structure

Each generated question follows this format:

{
    "question": "What is the capital of France?",
    "options": ["London", "Paris", "Berlin", "Madrid"],  # Randomly shuffled order
    "answer": "Paris",
    "answer_number": 1,  # 0-based index into shuffled options
    "description": "The largest city and capital of France",
    "dimension": "Geography",        # Optional: topic category (for large quizzes)
    "subdimension": "European Capitals"  # Optional: sub-topic (for hierarchical quizzes)
}

Note: Options are shuffled randomly during processing, so the correct answer (answer) appears at different positions (answer_number) in each quiz generation for unpredictability.

Dimension fields: When generating large quizzes (100+ questions), questions are organized by topics (dimension) and subtopics (subdimension) for better content coverage and organization.
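A small consistency check (an illustrative helper, not part of the package) can verify the invariants this format promises, in particular that `answer_number` points at `answer` inside `options`:

```python
def validate_question(q: dict) -> bool:
    """Check the invariants of the question format."""
    required = {"question", "options", "answer", "answer_number", "description"}
    if not required <= q.keys():
        return False
    idx = q["answer_number"]
    # 0-based index must be in range and point at the correct answer
    return 0 <= idx < len(q["options"]) and q["options"][idx] == q["answer"]
```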

🏗️ Architecture & Algorithms

Generation Pipeline

Input Content
    ↓
Extract Text
    ↓
Plan Distribution (how to organize questions)
    ├─ n_questions ≤ 20: Flat (single batch)
    ├─ 20 < n_questions ≤ 100: Dimensional (by topic)
    └─ n_questions > 100: Hierarchical (by topic + subtopic)
    ↓
Generate Batches (ThreadPoolExecutor, max 10 Q per request)
    ↓
Collect Questions
    ↓
Deduplicate (FAISS + cosine similarity)
    ↓
Ensure Exact Count (retry if needed)
    ↓
Generate Summary
    ↓
Return Results
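The distribution-planning step above reduces to a simple threshold rule; as a sketch (the function name is illustrative, not the package's API):

```python
def plan_mode(n_questions: int) -> str:
    """Pick the organization strategy from the question count."""
    if n_questions <= 20:
        return "flat"            # single batch, no dimensions
    if n_questions <= 100:
        return "dimensional"     # organized by main topics
    return "hierarchical"        # organized by topics + subtopics
```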

Deduplication Algorithm

  1. Generate Embeddings: Use OpenAI's text-embedding-3-small (1536-dim)
  2. Build FAISS Index: Create IndexFlatIP for cosine similarity
  3. Similarity Search: For each question, find similar questions
  4. Remove Duplicates: Mark questions similar above threshold as duplicates
  5. Keep First Occurrence: Preserve diversity
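The steps above can be sketched in plain NumPy (omitting the OpenAI embedding call and the FAISS index itself): inner product on L2-normalized vectors equals cosine similarity, which is exactly what `IndexFlatIP` computes on normalized embeddings. This is an illustration of the algorithm, not the package's implementation.

```python
import numpy as np

def deduplicate_indices(embeddings: np.ndarray, threshold: float = 0.90) -> list[int]:
    """Return indices of questions to keep, first occurrence wins."""
    # Normalize rows so that dot product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept: list[int] = []
    for i, vec in enumerate(unit):
        # Keep this question only if it is below the similarity
        # threshold against every question already kept
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```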

Over-Generation Strategy

  • Target: 100 questions
  • Generate: 120 questions (20% extra)
  • After Dedup: ~100 unique questions
  • Result: Deliver the requested count after deduplication, retrying if the unique total falls short
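The over-generation target is a simple ceiling computation; a sketch (illustrative helper, not the package's API):

```python
import math

def overgenerate_target(n_questions: int, surplus: float = 0.20) -> int:
    """Number of questions to request before deduplication."""
    return math.ceil(n_questions * (1 + surplus))
```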

⚡ Performance

Benchmarks (Typical Hardware)

Questions   Time        Threads   Cost (est.)
10          5-10s       1         $0.02
50          15-20s      4         $0.10
100         25-35s      4         $0.20
500         90-120s     4         $1.00
1000        180-240s    4         $2.00

Optimization Tips

  1. Increase max_workers for more parallelism (up to 10)
  2. Enable deduplication to ensure quality (slight time cost)
  3. Use shorter content for faster processing
  4. Batch multiple requests instead of single large request

🧪 Examples

Example 1: Educational Quiz from Article

from quizgpt import QuizGPT

content = """
Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data without explicit programming.
Types include supervised learning, unsupervised learning, and reinforcement learning.
"""

generator = QuizGPT(api_key="sk-...", n_questions=20)
result = generator.generate_quiz(content, difficulty="easy")

for q in result["quiz"]:
    print(f"Q: {q['question']}")
    print(f"A: {q['answer']} - {q['description']}")
    print()

Example 2: Large-Scale Assessment

# Generate 1000 questions for an assessment bank
generator = QuizGPT(api_key="sk-...", n_questions=1000, max_workers=8)

result = generator.generate_quiz(
    input_text="https://en.wikipedia.org/wiki/Biology",
    difficulty="hard",
    enable_deduplication=True
)

# Save to file
import json
with open("biology_questions.json", "w") as f:
    json.dump(result["quiz"], f, indent=2)

print(f"Generated {len(result['quiz'])} unique questions")
print(f"Summary: {result['short']}")

Example 3: Multi-Difficulty Quiz Set

from quizgpt import QuizGPT
import json

with open("textbook_chapter.txt") as f:
    content = f.read()

for difficulty in ["easy", "medium", "hard"]:
    generator = QuizGPT(api_key="sk-...", n_questions=50)
    result = generator.generate_quiz(content, difficulty=difficulty)
    
    with open(f"quiz_{difficulty}.json", "w") as f:
        json.dump(result["quiz"], f)

🔐 Security

  • API Keys: Never hardcode API keys. Use environment variables:

    export OPENAI_API_KEY="sk-..."
    python your_script.py
    

    Or configure in .env:

    OPENAI_API_KEY=sk-...
    
  • Rate Limiting: OpenAI API has rate limits. Use exponential backoff for retries.
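A minimal exponential-backoff wrapper looks like the sketch below (illustrative; the package's own retry behavior may differ):

```python
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry call() with exponential backoff on exceptions."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrap any rate-limited API call, e.g. `with_backoff(lambda: generator.generate_quiz(text))`.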

🆘 Troubleshooting

Issue: ModuleNotFoundError: No module named 'faiss'

Solution: Install FAISS with pip install faiss-cpu (or faiss-gpu for CUDA systems)

Issue: OPENAI_API_KEY not found

Solution: Set environment variable: export OPENAI_API_KEY="sk-..."

Issue: Low quality questions

Solution:

  • Reduce n_questions (fewer = more focused)
  • Use better source content
  • Lower similarity_threshold in the deduplicator so more near-duplicate questions are removed

Issue: Too slow for large requests

Solution:

  • Increase max_workers (up to 10)
  • Use shorter content chunks
  • Consider caching for repeated content

📚 Requirements

  • Python 3.9 or higher
  • OpenAI API key (gpt-4o-mini or gpt-4 compatible)
  • FAISS (CPU or GPU)
  • scikit-learn
  • numpy
  • beautifulsoup4
  • requests
  • validators
  • youtube-transcript-api

🛠️ Development

Setup Development Environment

git clone https://github.com/your-org/quizgpt.git
cd quizgpt
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"

Run Tests

python -m pytest tests/ -v

Build & Publish

python -m build
python -m twine upload dist/*

📄 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests
  4. Submit a pull request

📞 Support

🎉 Changelog

v3.4.0 (Latest)

  • Scalable Generation: Support for 10-10,000 questions
  • Intelligent Batching: Automatic LLM-friendly batch sizes
  • FAISS Deduplication: Semantic duplicate removal
  • Dimension Extraction: Topic-based organization
  • Thread-Safe Concurrency: ThreadPoolExecutor-based parallelism
  • 🔧 Enhanced Prompt Contextualization: Dimension-aware prompts
  • 🐛 Bug Fixes: Parser improvements

v3.3.1

  • 📖 README improvements
  • 🔧 Minor fixes

v3.3.0

  • 🔄 OpenAI API v1.0+ migration

Made with ❤️ for educators and learners worldwide
