A Python package for generating unique quizzes (up to 10,000 questions) from text, URLs, or YouTube transcripts using OpenAI.
QuizGPT
QuizGPT is a production-grade Python package for generating high-quality quizzes from raw text, web pages, or YouTube video transcripts using OpenAI models. It supports scalable generation of 10 to 10,000 questions with intelligent batching, semantic deduplication, and dimension-based organization.
🌟 Key Features
Core Functionality
- Content Extraction: Automatically extracts text from URLs, web pages, YouTube videos, or plain text
- Multiple Choice Quiz Generation: Creates MCQ quizzes with configurable difficulty levels (easy/medium/hard)
- Content Summarization: Generates concise summaries of source content
- Flexible Interface: Python API and command-line interface
Advanced Features (v3.4.0+)
- Scalable Generation: Generate 10 to 10,000 questions in a single generate_quiz() call
- Intelligent Batching: Automatically splits large question counts into LLM-friendly batches (10 questions per LLM request)
- Thread-Safe Concurrency: Uses ThreadPoolExecutor for efficient parallel processing (configurable workers)
- Semantic Deduplication: FAISS-based embeddings remove semantic duplicates while preserving variety
- Dimension-Based Organization: Automatically extracts and organizes questions by topic:
- Flat: For 10-20 questions
- Dimensional: For 21-100 questions (organizes by main topics)
- Hierarchical: For 101-10,000 questions (organizes by topics and subtopics)
- Automatic Retry Logic: Intelligently retries if deduplication reduces count below target
- Over-Generation Strategy: Generates 20% extra questions to account for duplicate removal
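The batching and concurrency bullets above can be illustrated with a small standalone sketch. It assumes the batch size of 10 stated in the list; `split_into_batches`, `fake_generate`, and `generate_all` are hypothetical names used for illustration only and are not part of the QuizGPT API. The LLM call is replaced by a stub so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 10  # per the feature list: 10 questions per LLM request


def split_into_batches(total, batch_size=BATCH_SIZE):
    """Split a question count into per-request sizes, e.g. 25 -> [10, 10, 5]."""
    full, remainder = divmod(total, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def fake_generate(batch_index, size):
    """Stand-in for one LLM request (hypothetical); returns placeholder questions."""
    return [f"batch {batch_index} question {i}" for i in range(size)]


def generate_all(total, max_workers=4):
    """Run all batches on a thread pool and flatten the results."""
    sizes = split_into_batches(total)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in batch order even though requests run concurrently.
        results = pool.map(fake_generate, range(len(sizes)), sizes)
        return [question for batch in results for question in batch]


questions = generate_all(25)
print(len(questions))  # 25
```

Threads suit this workload because each batch spends most of its time waiting on network I/O rather than computing.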
📦 Installation
pip install quizgpt
Required Dependencies
- Python 3.9+
- OpenAI API key (get one at https://platform.openai.com/api-keys)
Optional Dependencies for Advanced Features
# FAISS is included in dependencies, but for GPU acceleration:
pip install faiss-gpu # instead of faiss-cpu
🚀 Quick Start
Generate a Simple Quiz (10 Questions)
from quizgpt import QuizGPT
# Initialize generator
generator = QuizGPT(api_key="sk-...")
# Generate quiz from text
result = generator.generate_quiz(
    input_text="The mitochondria is the powerhouse of the cell...",
    difficulty="medium"
)
print(f"Questions: {len(result['quiz'])}")
print(f"Summary: {result['short']}")
for question in result['quiz'][:3]:
    print(f"\n❓ {question['question']}")
    print(f" Options: {question['options']}")
    print(f" Answer: {question['answer']} (#{question['answer_number']})")
Generate from a URL
result = generator.generate_quiz(
input_text="https://en.wikipedia.org/wiki/Python_(programming_language)"
)
Generate from YouTube
result = generator.generate_quiz(
input_text="https://www.youtube.com/watch?v=dQw4w9WgXcQ"
)
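The three Quick Start calls pass plain text, a web URL, and a YouTube URL through the same `input_text` parameter. This README does not say how the package tells them apart, so the following is only a sketch of one plausible approach using the standard library; `classify_input` and `YOUTUBE_HOSTS` are hypothetical names, not QuizGPT internals.

```python
from urllib.parse import urlparse

# Assumed set of hostnames that identify a YouTube link.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(input_text):
    """Return 'youtube', 'url', or 'text' for a given input (hypothetical helper)."""
    candidate = input_text.strip()
    parsed = urlparse(candidate)
    # Treat the input as a URL only if it is a single token with an http(s) scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.hostname and " " not in candidate:
        return "youtube" if parsed.hostname in YOUTUBE_HOSTS else "url"
    return "text"


print(classify_input("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # youtube
print(classify_input("The mitochondria is the powerhouse of the cell..."))  # text
```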
🔧 Advanced Usage
Generate Large-Scale Quizzes (100+ Questions)
from quizgpt import QuizGPT
# Generate 500 questions from content
generator = QuizGPT(
api_key="sk-...",
n_questions=500, # Will auto-organize into dimensions
max_workers=4 # Use 4 concurrent threads
)
result = generator.generate_quiz(
input_text="Long article content...",
difficulty="hard",
enable_deduplication=True # Remove semantic duplicates
)
print(f"Generated {len(result['quiz'])} unique questions")
Using Dimension Extractor Directly
from quizgpt import DimensionExtractor
extractor = DimensionExtractor(api_key="sk-...")
# Extract main topics from content
dimensions = extractor.extract_dimensions(
content="Your long article...",
n_dimensions=5
)
print(f"Main topics: {dimensions}")
# Extract subtopics within a dimension
subtopics = extractor.extract_subdimensions(
content="Your long article...",
dimension="History",
n_subdimensions=10
)
print(f"Subtopics in History: {subtopics}")
Using Deduplicator Directly
from quizgpt import QuestionDeduplicator
deduplicator = QuestionDeduplicator(
api_key="sk-...",
similarity_threshold=0.90 # 0-1; questions with similarity above this count as duplicates
)
# Remove duplicates from questions
unique_questions = deduplicator.deduplicate(
questions=your_questions_list,
target_count=100 # Try to keep exactly 100 questions
)
print(f"Removed {len(your_questions_list) - len(unique_questions)} duplicates")
CLI Usage
# Basic usage
quizgpt --input "Your text content" --api-key sk-...
# Advanced options
quizgpt \
--input "https://example.com/article" \
--api-key sk-... \
--difficulty hard \
--question-type mcq \
--output results.json
# Generate 100 questions and save to file
quizgpt \
--input "Long article..." \
--api-key sk-... \
--output quiz_100_questions.json
📋 API Reference
QuizGPT Class
from quizgpt import QuizGPT
generator = QuizGPT(
api_key=None, # OpenAI API key (defaults to env var)
model="gpt-4o-mini", # LLM model to use
max_tokens=1200, # Max tokens per response
n_questions=10, # Target number of questions (10-10,000)
max_workers=4 # Number of concurrent threads
)
.generate_quiz()
result = generator.generate_quiz(
input_text, # URL, YouTube URL, or text content (required)
difficulty="medium", # "easy", "medium", or "hard"
question_type="mcq", # "mcq" for multiple choice
enable_deduplication=True # Use FAISS deduplication
)
# Returns: {"quiz": [...], "short": "summary"}
# quiz: List of question dicts with keys: question, options, answer, answer_number, description
# short: Concise summary of content
QuizParser Class
from quizgpt import QuizParser
# Parse LLM response into quiz structure
parsed = QuizParser.parse_response(llm_response_text)
# Extract JSON from text
json_str = QuizParser.extract_json(text)
# Find answer index in options list
idx = QuizParser.find_answer_index("Paris", ["London", "Paris", "Rome"])
# Returns: 1
Note: Options are automatically shuffled during parsing to ensure the correct answer appears in random positions across different quiz generations.
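The shuffling described in the note can be pictured with a short sketch. It is not the parser's actual code: `shuffle_options` is a hypothetical helper, and it assumes the question dict uses the `options`, `answer`, and `answer_number` keys listed in the API reference.

```python
import random


def shuffle_options(question, rng=None):
    """Return a copy of the question with shuffled options and a matching answer_number."""
    rng = rng or random.Random()
    options = list(question["options"])  # copy so the input is left untouched
    rng.shuffle(options)
    shuffled = dict(question)
    shuffled["options"] = options
    # Recompute the 0-based index so it still points at the correct answer.
    shuffled["answer_number"] = options.index(question["answer"])
    return shuffled


original = {
    "question": "What is the capital of France?",
    "options": ["London", "Paris", "Berlin", "Madrid"],
    "answer": "Paris",
}
shuffled = shuffle_options(original)
print(shuffled["options"][shuffled["answer_number"]])  # Paris
```

Passing a seeded `random.Random` instance makes the order reproducible, which is convenient in tests.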
DimensionExtractor Class
from quizgpt import DimensionExtractor, extract_dimensions, extract_subdimensions
extractor = DimensionExtractor(api_key="sk-...")
# Extract main dimensions/topics from content
dimensions = extractor.extract_dimensions(content, n_dimensions=5)
# Extract sub-dimensions within a dimension
subdimensions = extractor.extract_subdimensions(
content,
dimension="Science",
n_subdimensions=10
)
# Get full hierarchical structure (auto-detects based on question count)
structure = extractor.get_dimension_structure(
content,
n_questions=500
)
QuestionDeduplicator Class
from quizgpt import QuestionDeduplicator
deduplicator = QuestionDeduplicator(
api_key=None,
embed_dim=1536,
similarity_threshold=0.90,
embed_model="text-embedding-3-small"
)
# Remove semantic duplicates
unique = deduplicator.deduplicate(
questions,
target_count=100
)
📊 Question Structure
Each generated question follows this format:
{
    "question": "What is the capital of France?",
    "options": ["London", "Paris", "Berlin", "Madrid"], # Randomly shuffled order
    "answer": "Paris",
    "answer_number": 1, # 0-based index into shuffled options
    "description": "The largest city and capital of France",
    "dimension": "Geography", # Optional: topic category (for large quizzes)
    "subdimension": "European Capitals" # Optional: sub-topic (for hierarchical quizzes)
}
Note: Options are shuffled randomly during processing, so the correct answer (answer) appears at different positions (answer_number) in each quiz generation for unpredictability.
Dimension fields: When generating large quizzes (100+ questions), questions are organized by topics (dimension) and subtopics (subdimension) for better content coverage and organization.
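When consuming generated quizzes, it can help to check that each question matches the structure above before using it. The following is a minimal sketch, assuming the five keys listed in the API reference are always present and that `dimension` and `subdimension` are optional; `validate_question` is a hypothetical helper and is not shipped with the package.

```python
REQUIRED_KEYS = {"question", "options", "answer", "answer_number", "description"}


def validate_question(q):
    """Return a list of problems found in a question dict (empty if it looks well-formed)."""
    missing = REQUIRED_KEYS - set(q)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    idx = q["answer_number"]
    if not isinstance(idx, int) or not 0 <= idx < len(q["options"]):
        problems.append("answer_number is out of range")
    elif q["options"][idx] != q["answer"]:
        # answer_number is a 0-based index into the shuffled options.
        problems.append("answer_number does not point at answer")
    return problems


sample = {
    "question": "What is the capital of France?",
    "options": ["London", "Paris", "Berlin", "Madrid"],
    "answer": "Paris",
    "answer_number": 1,
    "description": "The largest city and capital of France",
}
print(validate_question(sample))  # []
```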
🏗️ Architecture & Algorithms
Generation Pipeline
Input Content
↓
Extract Text
↓
Plan Distribution (how to organize questions)
├─ n_questions ≤ 20: Flat (single batch)
├─ 20 < n_questions ≤ 100: Dimensional (by topic)
└─ n_questions > 100: Hierarchical (by topic + subtopic)
↓
Generate Batches (ThreadPoolExecutor, max 10 Q per request)
↓
Collect Questions
↓
Deduplicate (FAISS + cosine similarity)
↓
Ensure Exact Count (retry if needed)
↓
Generate Summary
↓
Return Results
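The "Plan Distribution" step can be sketched as a threshold check followed by an allocation of questions to topics. The thresholds below come from the diagram; the even-split policy in `distribute` is an assumption made for illustration, and both function names are hypothetical rather than part of the package.

```python
def choose_organization(n_questions):
    """Map a question count to the organization mode shown in the pipeline diagram."""
    if n_questions <= 20:
        return "flat"
    if n_questions <= 100:
        return "dimensional"
    return "hierarchical"


def distribute(n_questions, topics):
    """Spread n_questions across topics as evenly as possible (assumed policy)."""
    base, extra = divmod(n_questions, len(topics))
    # The first `extra` topics receive one additional question each.
    return {topic: base + (1 if i < extra else 0) for i, topic in enumerate(topics)}


print(choose_organization(50))  # dimensional
print(distribute(50, ["History", "Syntax", "Libraries"]))
# {'History': 17, 'Syntax': 17, 'Libraries': 16}
```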
Deduplication Algorithm
- Generate Embeddings: Use OpenAI's text-embedding-3-small (1536-dim)
- Build FAISS Index: Create IndexFlatIP for cosine similarity
- Similarity Search: For each question, find similar questions
- Remove Duplicates: Mark questions similar above threshold as duplicates
- Keep First Occurrence: Preserve diversity
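The steps above can be sketched without any external services. This example swaps in plain Python for FAISS and hand-written toy vectors for OpenAI embeddings, so it shows only the logic: normalize each vector so the inner product equals cosine similarity, then keep the first occurrence and drop later items that score at or above the threshold. The function names are illustrative, not the package's implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def deduplicate(embeddings, threshold=0.90):
    """Return indices to keep, dropping later rows too similar to an already kept row."""
    unit = [normalize(v) for v in embeddings]
    kept = []
    for i, vec in enumerate(unit):
        is_duplicate = any(
            sum(a * b for a, b in zip(vec, unit[j])) >= threshold for j in kept
        )
        if not is_duplicate:
            kept.append(i)  # first occurrence wins
    return kept


# Rows 0 and 1 point in almost the same direction; row 2 is orthogonal to row 0.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(deduplicate(toy))  # [0, 2]
```

This pairwise loop is quadratic in the number of questions; an inner-product index such as FAISS IndexFlatIP performs the same comparison far more efficiently at larger scale.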
Over-Generation Strategy
- Target: 100 questions
- Generate: 120 questions (20% extra)
- After Dedup: ~100 unique questions
- Result: Always deliver exactly the requested count
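The over-generation and retry behaviour can be sketched as a loop that requests the shortfall plus 20%, deduplicates, and repeats until the target is met. Everything here is an illustrative assumption: `generate_exact`, the `max_rounds` cap, and the stubbed generator and deduplicator are hypothetical stand-ins, not the package's code.

```python
import itertools


def generate_exact(target, generate, deduplicate, over_generation_pct=20, max_rounds=5):
    """Collect up to `target` unique items, over-generating and retrying on shortfall."""
    unique = []
    for _ in range(max_rounds):
        shortfall = target - len(unique)
        if shortfall <= 0:
            break
        # Integer ceiling of shortfall * (1 + pct/100), e.g. 100 -> 120 at 20%.
        request = shortfall + (shortfall * over_generation_pct + 99) // 100
        unique = deduplicate(unique + generate(request))
    return unique[:target]


counter = itertools.count()


def fake_generate(n):
    """Stub generator: roughly every fifth item repeats the previous one."""
    indices = [next(counter) for _ in range(n)]
    return [f"q{i - i // 5}" for i in indices]


def exact_dedup(items):
    """Stub deduplicator: order-preserving exact-match removal."""
    return list(dict.fromkeys(items))


result = generate_exact(100, fake_generate, exact_dedup)
print(len(result), len(set(result)))
```

If the round limit is reached before enough unique items exist, this sketch returns fewer than the target rather than looping indefinitely.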
⚡ Performance
Benchmarks (Typical Hardware)
| Questions | Time | Threads | Cost (est.) |
|---|---|---|---|
| 10 | 5-10s | 1 | $0.02 |
| 50 | 15-20s | 4 | $0.10 |
| 100 | 25-35s | 4 | $0.20 |
| 500 | 90-120s | 4 | $1.00 |
| 1000 | 180-240s | 4 | $2.00 |
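Every row of the table works out to about $0.002 per question, so a rough budget can be estimated linearly. This is only arithmetic on the figures above; actual cost depends on the model, content length, and current pricing, and `estimate_cost` is a hypothetical helper.

```python
COST_PER_QUESTION_USD = 0.002  # derived from the table: $0.02 / 10 questions


def estimate_cost(n_questions):
    """Linear cost estimate in USD, rounded to cents (rough guide only)."""
    return round(n_questions * COST_PER_QUESTION_USD, 2)


print(estimate_cost(250))  # 0.5
```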
Optimization Tips
- Increase max_workers for more parallelism (up to 10)
- Enable deduplication to ensure quality (slight time cost)
- Use shorter content for faster processing
- Batch multiple requests instead of a single large request
🧪 Examples
Example 1: Educational Quiz from Article
from quizgpt import QuizGPT
content = """
Machine learning is a subset of artificial intelligence.
It enables computers to learn from data without explicit programming.
Types include supervised learning, unsupervised learning, and reinforcement learning.
"""
generator = QuizGPT(api_key="sk-...", n_questions=20)
result = generator.generate_quiz(content, difficulty="easy")
for q in result["quiz"]:
    print(f"Q: {q['question']}")
    print(f"A: {q['answer']} - {q['description']}")
    print()
Example 2: Large-Scale Assessment
# Generate 1000 questions for an assessment bank
generator = QuizGPT(api_key="sk-...", n_questions=1000, max_workers=8)
result = generator.generate_quiz(
input_text="https://en.wikipedia.org/wiki/Biology",
difficulty="hard",
enable_deduplication=True
)
# Save to file
import json
with open("biology_questions.json", "w") as f:
    json.dump(result["quiz"], f, indent=2)
print(f"Generated {len(result['quiz'])} unique questions")
print(f"Summary: {result['short']}")
Example 3: Multi-Difficulty Quiz Set
import json
from quizgpt import QuizGPT
# Read the source text once and close the file promptly
with open("textbook_chapter.txt") as f:
    content = f.read()
for difficulty in ["easy", "medium", "hard"]:
    generator = QuizGPT(api_key="sk-...", n_questions=50)
    result = generator.generate_quiz(content, difficulty=difficulty)
    # Write one JSON file per difficulty level
    with open(f"quiz_{difficulty}.json", "w") as f:
        json.dump(result["quiz"], f)
🔐 Security
- API Keys: Never hardcode API keys. Use environment variables:
export OPENAI_API_KEY="sk-..."
python your_script.py
Or configure in .env:
OPENAI_API_KEY=sk-...
- Rate Limiting: The OpenAI API enforces rate limits. Use exponential backoff for retries.
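Both recommendations can be sketched in a few lines: read the key from the environment instead of the source code, and wrap API calls in a retry helper that doubles its delay after each failure. `get_api_key` and `with_backoff` are hypothetical helpers, not part of QuizGPT, and the broad `except Exception` is a simplification; in real code you would catch only the client's rate-limit error.

```python
import os
import random
import time


def get_api_key(env=None):
    """Read the key from the environment instead of hardcoding it."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on failure, doubling the delay each time and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays of 1s, 2s, 4s, ... plus up to 25% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

A call such as `with_backoff(lambda: generator.generate_quiz(input_text=text))` would then retry transient failures, with `generator` and `text` defined as in the Quick Start.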
🆘 Troubleshooting
Issue: ModuleNotFoundError: No module named 'faiss'
Solution: Install FAISS with pip install faiss-cpu, or install the required system dependencies
Issue: OPENAI_API_KEY not found
Solution: Set environment variable: export OPENAI_API_KEY="sk-..."
Issue: Low quality questions
Solution:
- Reduce n_questions (fewer = more focused)
- Use better source content
- Adjust similarity_threshold in the deduplicator (questions scoring above it are treated as duplicates, so a lower value removes more near-duplicates)
Issue: Too slow for large requests
Solution:
- Increase max_workers (up to 10)
- Use shorter content chunks
- Consider caching for repeated content
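The caching suggestion can be sketched as an in-memory store keyed by a hash of the content and generation settings. `QuizCache` and `cache_key` are hypothetical and not part of the package; a real setup might persist results to disk instead.

```python
import hashlib
import json


def cache_key(content, difficulty, n_questions):
    """Derive a stable key from the inputs that determine the quiz."""
    payload = json.dumps([content, difficulty, n_questions])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class QuizCache:
    """Minimal in-memory cache (hypothetical helper)."""

    def __init__(self):
        self._store = {}

    def get_or_generate(self, content, difficulty, n_questions, generate):
        key = cache_key(content, difficulty, n_questions)
        if key not in self._store:
            # Only call the expensive generator on a cache miss.
            self._store[key] = generate(content, difficulty, n_questions)
        return self._store[key]
```

Here `generate` is any callable that takes the same three arguments, for example a small wrapper around `QuizGPT(...).generate_quiz(...)`.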
📚 Requirements
- Python 3.9 or higher
- OpenAI API key (gpt-4o-mini or gpt-4 compatible)
- FAISS (CPU or GPU)
- scikit-learn
- numpy
- beautifulsoup4
- requests
- validators
- youtube-transcript-api
🛠️ Development
Setup Development Environment
git clone https://github.com/your-org/quizgpt.git
cd quizgpt
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e ".[dev]"
Run Tests
python -m pytest tests/ -v
Build & Publish
python -m build
python -m twine upload dist/*
📄 License
MIT License - see LICENSE file for details
🤝 Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests
- Submit a pull request
📞 Support
- Issues: Report on GitHub
- Discussions: GitHub Discussions for Q&A
🎉 Changelog
v3.4.0 (Latest)
- ✨ Scalable Generation: Support for 10-10,000 questions
- ✨ Intelligent Batching: Automatic LLM-friendly batch sizes
- ✨ FAISS Deduplication: Semantic duplicate removal
- ✨ Dimension Extraction: Topic-based organization
- ✨ Thread-Safe Concurrency: ThreadPoolExecutor-based parallelism
- 🔧 Enhanced Prompt Contextualization: Dimension-aware prompts
- 🐛 Bug Fixes: Parser improvements
v3.3.1
- 📖 README improvements
- 🔧 Minor fixes
v3.3.0
- 🔄 OpenAI API v1.0+ migration
Made with ❤️ for educators and learners worldwide