quiz-gen

Python 3.10+ | License: MIT

AI-powered quiz generator for regulatory, certification, and educational documentation. Extract structured content from complex legal and technical documents to create comprehensive learning materials.

Features

  • EUR-Lex Document Parser: Parse and structure European Union legal documents with full table of contents extraction
  • Hierarchical Document Analysis: Automatically identify document structure including chapters, sections, articles, and recitals
  • Intelligent Chunking: Extract meaningful content chunks at appropriate granularity levels (articles and recitals)
  • Table of Contents Generation: Build a complete document navigation structure with a 3- to 4-level hierarchy
  • Regulatory Document Support: Specialized parsing for aviation regulations, directives, and other technical documentation

Installation

pip install quiz-gen

Quick Start

Parsing EUR-Lex Documents

from quiz_gen import EURLexParser

# Parse a regulation document
url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"
parser = EURLexParser(url=url)
chunks, toc = parser.parse()

# Access structured content
print(f"Extracted {len(chunks)} content chunks")
print(f"Document has {len(toc['sections'])} major sections")

# Save results
parser.save_chunks('output_chunks.json')
parser.save_toc('output_toc.json')

Document Structure

The parser extracts documents into a multi-level hierarchy:

Level 1: Major Sections

  • Preamble
  • Enacting Terms

Level 2/3: Structural Divisions

  • Chapters
  • Sections

Level 1/2/3/4: Content Elements

  • Title
  • Citation
  • Recitals
  • Articles
  • Concluding formulas
  • Annex
  • Appendix
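
To make the levels concrete, the sketch below models a small slice of such a hierarchy as nested dictionaries and reports the level of each element. The "title" and "children" keys and the walk helper are illustrative assumptions, not the library's actual TOC schema.

```python
# Hypothetical nested structure; the "title" and "children" keys are assumptions.
document = {
    "title": "Enacting Terms",  # Level 1: major section
    "children": [
        {
            "title": "CHAPTER I",  # Level 2: structural division
            "children": [
                {
                    "title": "SECTION 1",  # Level 3: structural division
                    "children": [
                        {"title": "Article 1", "children": []},  # Level 4
                    ],
                },
                {"title": "Article 2", "children": []},  # Level 3
            ],
        },
    ],
}


def walk(node, level=1):
    """Yield (level, title) pairs in document order."""
    yield level, node["title"]
    for child in node["children"]:
        yield from walk(child, level + 1)


levels = list(walk(document))
for level, title in levels:
    print(f"{'  ' * (level - 1)}Level {level}: {title}")
```

An article can sit at level 3 (directly under a chapter) or level 4 (under a section within a chapter), which is why content elements span several levels.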

Working with Chunks

# Iterate through extracted chunks
for chunk in chunks:
    print(f"{chunk.title}")
    print(f"Type: {chunk.section_type.value}")
    print(f"Number: {chunk.number}")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Hierarchy: {' > '.join(chunk.hierarchy_path)}")
    print()
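
A quick way to sanity-check a parse is to count chunks per section type. The sketch below runs on stand-in objects so it works without network access; the Chunk dataclass, the sample data, and the enum string values are assumptions that only mirror the attribute names used above.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class SectionType(Enum):  # stand-in for quiz_gen.SectionType; values are assumed
    RECITAL = "recital"
    ARTICLE = "article"


@dataclass
class Chunk:  # stand-in exposing the attributes used in the loop above
    section_type: SectionType
    number: str
    title: str
    content: str
    hierarchy_path: list = field(default_factory=list)


chapter = "CHAPTER I - PRINCIPLES"
chunks = [
    Chunk(SectionType.RECITAL, "1", "Recital 1", "Sample recital text"),
    Chunk(SectionType.ARTICLE, "1", "Article 1", "Sample text", [chapter, "Article 1"]),
    Chunk(SectionType.ARTICLE, "2", "Article 2", "Sample text", [chapter, "Article 2"]),
]

# Tally chunks by the string value of their section type.
counts = Counter(chunk.section_type.value for chunk in chunks)
for section_type, count in sorted(counts.items()):
    print(f"{section_type}: {count}")
```

With real data, replace the stand-in list with the chunks returned by parser.parse().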

Displaying Table of Contents

# Print formatted TOC
parser.print_toc()

# Output:
# PREAMBLE
#   Citation 
#   Recital 1
#   Recital 2
#   ...
# 
# ENACTING TERMS
#   CHAPTER I - PRINCIPLES
#     Article 1 - Subject matter and objectives
#     Article 2 - Scope
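
A similar indented outline can be rebuilt from hierarchy paths alone. This is a minimal sketch, not how print_toc() is implemented; the render_outline helper and the sample paths (including the top-level entries) are hypothetical.

```python
def render_outline(paths):
    """Return indented outline lines, emitting each shared prefix only once."""
    lines = []
    previous = []
    for path in paths:
        # Count how many leading entries this path shares with the previous one.
        shared = 0
        while (
            shared < len(path)
            and shared < len(previous)
            and path[shared] == previous[shared]
        ):
            shared += 1
        # Emit only the entries that are new at this point in the outline.
        for depth in range(shared, len(path)):
            lines.append("  " * depth + path[depth])
        previous = path
    return lines


paths = [
    ["PREAMBLE", "Citation"],
    ["PREAMBLE", "Recital 1"],
    ["ENACTING TERMS", "CHAPTER I - PRINCIPLES", "Article 1 - Subject matter and objectives"],
    ["ENACTING TERMS", "CHAPTER I - PRINCIPLES", "Article 2 - Scope"],
]
outline = render_outline(paths)
print("\n".join(outline))
```

The approach relies on paths arriving in document order, so that siblings under the same parent are adjacent.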

Use Cases

Compliance and Legal

  • Analyze regulatory requirements systematically
  • Track changes across document versions
  • Build searchable knowledge bases from legal texts

Documentation Processing

  • Convert unstructured documents into structured data
  • Build citation networks and cross-references
  • Support automated document analysis workflows
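
As an illustration of the cross-reference use case, the sketch below scans article texts for mentions of other articles and collects them as edges. It is not part of quiz-gen; the regular expression is deliberately simplified and the sample texts are invented for the example.

```python
import re
from collections import defaultdict

# Simplified pattern: matches "Article 5" but not ranges such as "Articles 5 to 7".
ARTICLE_REF = re.compile(r"\bArticle\s+(\d+)\b")

# Hypothetical article texts keyed by article number.
articles = {
    "1": "This Regulation lays down the rules referred to in Article 3.",
    "2": "Without prejudice to Article 1 and Article 3, this Article applies to operators.",
    "3": "Definitions.",
}

references = defaultdict(set)
for number, text in articles.items():
    for target in ARTICLE_REF.findall(text):
        if target != number:  # ignore self-references
            references[number].add(target)

for number in sorted(references):
    print(f"Article {number} -> {sorted(references[number])}")
```

With parsed data, the dictionary could be filled from chunk.number and chunk.content for chunks whose section type is ARTICLE.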

Education and Training

  • Generate study materials from regulatory documents
  • Create structured learning paths for certification programs
  • Extract key concepts for examination preparation

Supported Document Types

Currently supports:

  • EUR-Lex HTML Documents: European Union regulations, directives, decisions
  • Legislative Acts: Structured legal documents with formal hierarchies

Document Format Requirements

  • Documents must use EUR-Lex HTML format
  • Must contain eli-subdivision elements for proper structure identification
  • Supports multi-level hierarchies with chapters, sections, and articles
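
Before parsing, it can help to check whether a page contains the expected markers at all. The sketch below uses only the standard library and assumes that eli-subdivision appears as a CSS class on elements; the has_eli_subdivisions helper is hypothetical and not part of quiz-gen.

```python
from html.parser import HTMLParser


class SubdivisionCounter(HTMLParser):
    """Count start tags whose class attribute includes 'eli-subdivision'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if "eli-subdivision" in classes:
            self.count += 1


def has_eli_subdivisions(html: str) -> bool:
    """Return True if at least one element carries the eli-subdivision class."""
    counter = SubdivisionCounter()
    counter.feed(html)
    return counter.count > 0


sample = '<div class="eli-subdivision" id="art_1"><p>Article 1</p></div>'
print(has_eli_subdivisions(sample))
print(has_eli_subdivisions("<p>plain</p>"))
```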

Advanced Usage

Custom Parsing Workflows

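from quiz_gen import EURLexParser

# Any EUR-Lex HTML URL; this one is reused from the Quick Start
document_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"
parser = EURLexParser(url=document_url)

# Parse specific sections
# Note: underscore-prefixed methods are private by convention and may change between releases
parser._parse_preamble()  # Extract citations and recitals
parser._parse_enacting_terms()  # Extract chapters and articles
parser._parse_annexes()  # Extract annexes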

# Access intermediate results
toc = parser.toc  # Full table of contents
chunks = parser.chunks  # Content chunks only

Filtering Chunks by Type

from quiz_gen import SectionType

# Get only recitals
recitals = [c for c in chunks if c.section_type == SectionType.RECITAL]

# Get only articles
articles = [c for c in chunks if c.section_type == SectionType.ARTICLE]

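# Filter by chapter (compare the chapter label exactly, so that
# 'CHAPTER I' does not also match 'CHAPTER II' or 'CHAPTER IV')
chapter_1_articles = [
    c for c in articles
    if any(part.split(' - ')[0] == 'CHAPTER I' for part in c.hierarchy_path)
]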

Accessing Metadata

for chunk in chunks:
    # Access structured metadata
    print(chunk.metadata)  # {'id': 'art_1', 'subtitle': '...'}
    
    # Navigate hierarchy
    print(chunk.hierarchy_path)  # ['CHAPTER I - PRINCIPLES', 'Article 1']
    
    # Identify parent sections
    print(chunk.parent_section)
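
When chunks need to be looked up repeatedly, an index keyed by the metadata id avoids rescanning the list. The sketch below uses stand-in objects and assumes that most, but not necessarily all, chunks carry an 'id' key in their metadata.

```python
from types import SimpleNamespace

# Stand-in chunks exposing only the attributes used here;
# real chunks would come from parser.parse().
chunks = [
    SimpleNamespace(title="Article 1 - Subject matter and objectives", metadata={"id": "art_1"}),
    SimpleNamespace(title="Article 2 - Scope", metadata={"id": "art_2"}),
    SimpleNamespace(title="Citation", metadata={}),
]

# Skip chunks without an 'id' instead of raising KeyError.
by_id = {c.metadata["id"]: c for c in chunks if "id" in c.metadata}

print(by_id["art_2"].title)
```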

Project Structure

quiz-gen/
├── src/
│   └── quiz_gen/
│       ├── parsers/
│       │   └── html/
│       │       └── eu_lex_parser.py
│       ├── models/
│       │   ├── chunk.py
│       │   ├── document.py
│       │   └── quiz.py
│       └── utils/
├── examples/
│   └── eu_lex_toc_chunks.py
├── tests/
├── data/
│   ├── processed/
│   └── raw/
└── docs/

Development

Setting up Development Environment

# Clone the repository
git clone https://github.com/yauheniya-ai/quiz-gen.git
cd quiz-gen

# Install with development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .
black .

Contributing

Contributions are welcome! Please ensure:

  1. Code follows PEP 8 style guidelines
  2. All tests pass
  3. New features include appropriate tests
  4. Documentation is updated

API Reference

EURLexParser

Main parser class for EUR-Lex documents.

Methods:

  • parse() -> tuple[List[RegulationChunk], Dict]: Parse document and return chunks and TOC
  • fetch() -> str: Fetch HTML content from URL
  • save_chunks(filepath: str): Save chunks to JSON file
  • save_toc(filepath: str): Save table of contents to JSON file
  • print_toc(): Display formatted table of contents
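
For readers who want to post-process chunks outside the library, the sketch below shows one way chunk-like objects could be serialized to JSON and read back. It is not the implementation of save_chunks(); the Chunk dataclass, the enum value, and the chunks_to_json helper are stand-ins, and the file format written by the library may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class SectionType(Enum):  # stand-in; the real enum values may differ
    ARTICLE = "article"


@dataclass
class Chunk:  # stand-in mirroring the documented RegulationChunk attributes
    section_type: SectionType
    number: str
    title: str
    content: str
    hierarchy_path: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def chunks_to_json(chunks) -> str:
    """Serialize chunks to a JSON array of objects."""
    records = []
    for chunk in chunks:
        record = asdict(chunk)
        # Enum members are not JSON-serializable, so store their value instead.
        record["section_type"] = chunk.section_type.value
        records.append(record)
    return json.dumps(records, ensure_ascii=False, indent=2)


example = Chunk(
    SectionType.ARTICLE,
    "2",
    "Article 2 - Scope",
    "Sample text",
    ["CHAPTER I - PRINCIPLES", "Article 2"],
    {"id": "art_2"},
)
payload = chunks_to_json([example])
restored = json.loads(payload)
print(restored[0]["section_type"], restored[0]["number"])
```

Passing ensure_ascii=False keeps non-ASCII characters readable in the output, which matters for legal texts that contain accented names or typographic list markers.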

RegulationChunk

Represents a parsed content chunk (for example, an article or a recital).

Attributes:

  • section_type: Type of section (ARTICLE, RECITAL, etc.)
  • number: Section number (e.g., "1", "42")
  • title: Full title including subtitle
  • content: Text content
  • hierarchy_path: List of parent sections
  • metadata: Additional structured data

SectionType

Enumeration of document section types.

Values:

  • PREAMBLE: Preamble section
  • ENACTING_TERMS: Main regulatory content
  • CITATION: Citation in preamble
  • RECITAL: Recital in preamble
  • CHAPTER: Chapter division
  • SECTION: Section within chapter
  • ARTICLE: Article (main content unit)
  • ANNEX: Annex section

Roadmap

Future enhancements planned:

  • AI-powered quiz generation from extracted content
  • Support for additional document formats (PDF, DOCX, PPTX)
  • Multi-language support
  • Question validation and quality metrics
  • Integration with learning management systems
  • Version comparison and diff analysis

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this software in academic work, please cite:

Varabyova, Y. (2026). Quiz Gen AI: AI-powered quiz generator for regulatory documentation.
GitHub repository: https://github.com/yauheniya-ai/quiz-gen

Acknowledgments

Built with:

  • BeautifulSoup4 for HTML parsing
  • lxml for XML processing
  • EUR-Lex for providing structured legal documents

Changelog

Version 0.1.0 (2026-01-17)

Initial release:

  • EUR-Lex document parser
  • Hierarchical document structure extraction
  • Table of contents generation
  • JSON export for chunks and TOC

Version 0.1.1 (2026-01-18)

Parser enhancements:

  • Added regulation title extraction and chunking
  • Support for flexible 3-4 level hierarchy with sections within chapters
  • Complete annexes extraction including table-based content
  • Combined citations into single chunk matching EUR-Lex structure
  • Added concluding formulas parsing

Version 0.1.2 (2026-01-18)

Text formatting and tooling:

  • Implemented smart text cleaning for proper list formatting (removes extra newlines after list markers)
  • Fixed numbered paragraph spacing
  • Added professional command-line interface (CLI)
  • Created comprehensive documentation with MkDocs and Material theme

Version 0.1.3 (2026-01-19)

Parser robustness improvements:

  • Fixed parsing of articles directly under enacting terms (without chapter hierarchy)
  • Enhanced article content extraction to handle table-based list items (e.g., (a), (b), (c) in table cells)
  • Added proper appendix detection and parsing (distinguishes appendices from annexes)
  • Improved title extraction for multi-paragraph appendix titles

Version 0.1.4 (2026-01-19)

Annex parsing improvements:

  • Added intelligent detection and parsing of parts within annexes (PART 1, PART 2, etc.)
  • Improved part titles to include annex identifier (e.g., "ANNEX 1 - PART 1" instead of "ANNEX - PART 1")
  • Removed arbitrary content truncation in annexes and appendices - all content now preserved in full
  • Enhanced content collection for parts with proper boundary detection between sections

Version 0.1.5 (2026-01-19)

Bug fixes:

  • Fixed annex TOC title to display with identifier (e.g., "ANNEX 1" instead of "ANNEX")
  • Fixed empty content in annex parts by switching from sibling navigation to descendants iteration

Version 0.1.6 (2026-01-19)

Content extraction improvements:

  • Enhanced part content extraction to include all paragraph types (titles, headings, body text)
  • Fixed missing section titles and numbered headings in annex parts
  • Lowered text length threshold to capture short titles (5 chars instead of 10)
  • Added smart filtering to skip only PART headers while collecting all other content

Version 0.1.7 (2026-01-19)

List structure preservation:

  • Added detection and proper handling of list-item tables (numbered and lettered items)
  • Fixed extraction of nested list structures by processing direct content only
  • Preserved list markers like (8), (a), (b), (—) with their corresponding text
  • Separated handling of list tables vs data tables for appropriate formatting
