quiz-gen

AI-powered quiz generator for regulatory, certification, and educational documentation. Extract structured content from complex legal and technical documents to create comprehensive learning materials.

Features

EUR-Lex Document Parser: Parse and structure European Union legal documents with full table of contents extraction
Hierarchical Document Analysis: Automatically identify document structure including chapters, sections, articles, and recitals
Intelligent Chunking: Extract meaningful content chunks at appropriate granularity levels (articles and recitals)
Table of Contents Generation: Build complete document navigation structure with a 3- to 4-level hierarchy
Regulatory Document Support: Specialized parsing for aviation regulations, directives, and other technical documentation

Installation

pip install quiz-gen

Quick Start

Parsing EUR-Lex Documents

from quiz_gen import EURLexParser

# Parse a regulation document
url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"
parser = EURLexParser(url=url)
chunks, toc = parser.parse()

# Access structured content
print(f"Extracted {len(chunks)} content chunks")
print(f"Document has {len(toc['sections'])} major sections")

# Save results
parser.save_chunks('output_chunks.json')
parser.save_toc('output_toc.json')
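
The saved files can be read back with Python's standard json module. The sketch below is a minimal illustration and not part of quiz-gen: summarize_json is a hypothetical helper, and because the layout of the exported JSON is not described here, it only reports the top-level type and size. The sample file it writes is placeholder data, not a real export.

```python
import json
import tempfile
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of top-level items) for a JSON file.

    Hypothetical helper; it makes no assumption about the export schema.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


# Demo with placeholder data standing in for a real export.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_chunks.json"
    sample.write_text(
        json.dumps([{"title": "Article 1"}, {"title": "Article 2"}]),
        encoding="utf-8",
    )
    kind, size = summarize_json(sample)
    print(kind, size)
```

Pointing summarize_json at output_chunks.json or output_toc.json is a quick way to inspect what was written before building on it.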

Document Structure

The parser extracts documents into a multi-level hierarchy:

Level 1: Major Sections

Preamble
Enacting Terms

Level 2/3: Structural Divisions

Chapters
Sections

Level 1/2/3/4: Content Elements

Title
Citation
Recitals
Articles
Concluding formulas
Annex
Appendix

Working with Chunks

# Iterate through extracted chunks
for chunk in chunks:
    print(f"{chunk.title}")
    print(f"Type: {chunk.section_type.value}")
    print(f"Number: {chunk.number}")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Hierarchy: {' > '.join(chunk.hierarchy_path)}")
    print()
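
A quick way to sanity-check a parse is to count chunks per section type. The following is a sketch under stated assumptions: SectionType and Chunk are stand-ins that mirror only the attributes used here so the example runs without quiz-gen installed, the enum string values are guesses, and count_by_type is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SectionType(Enum):  # stand-in; the string values are assumptions
    RECITAL = "recital"
    ARTICLE = "article"


@dataclass
class Chunk:  # stand-in with a subset of the documented attributes
    section_type: SectionType
    number: str
    title: str


def count_by_type(chunks):
    """Count chunks per section type name."""
    return Counter(chunk.section_type.name for chunk in chunks)


sample_chunks = [
    Chunk(SectionType.RECITAL, "1", "Recital 1"),
    Chunk(SectionType.RECITAL, "2", "Recital 2"),
    Chunk(SectionType.ARTICLE, "1", "Article 1 - Subject matter and objectives"),
]
counts = count_by_type(sample_chunks)
print(dict(counts))
```

With real parser output, the same function should work on the chunks list as long as each chunk exposes section_type as an enum member.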

Displaying Table of Contents

# Print formatted TOC
parser.print_toc()

# Output:
# PREAMBLE
#   Citation 
#   Recital 1
#   Recital 2
#   ...
# 
# ENACTING TERMS
#   CHAPTER I - PRINCIPLES
#     Article 1 - Subject matter and objectives
#     Article 2 - Scope
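
For custom rendering, an indented listing like the one above can be produced by a short recursive walk. This is only a sketch: the nested 'title' / 'children' layout and the render_toc helper are assumptions made for illustration, and the structure returned by the parser may differ.

```python
def render_toc(nodes, depth=0):
    """Return indented lines for a nested list of {'title', 'children'} nodes."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["title"])
        lines.extend(render_toc(node.get("children", []), depth + 1))
    return lines


# Hypothetical nested structure; the real TOC layout may differ.
sample_toc = [
    {"title": "PREAMBLE", "children": [{"title": "Citation"}, {"title": "Recital 1"}]},
    {
        "title": "ENACTING TERMS",
        "children": [
            {
                "title": "CHAPTER I - PRINCIPLES",
                "children": [{"title": "Article 1 - Subject matter and objectives"}],
            }
        ],
    },
]
toc_lines = render_toc(sample_toc)
print("\n".join(toc_lines))
```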

Use Cases

Compliance and Legal

Analyze regulatory requirements systematically
Track changes across document versions
Build searchable knowledge bases from legal texts

Documentation Processing

Convert unstructured documents into structured data
Build citation networks and cross-references
Support automated document analysis workflows

Education and Training

Generate study materials from regulatory documents
Create structured learning paths for certification programs
Extract key concepts for examination preparation

Supported Document Types

Currently supports:

EUR-Lex HTML Documents: European Union regulations, directives, decisions
Legislative Acts: Structured legal documents with formal hierarchies

Document Format Requirements

Documents must use EUR-Lex HTML format
Documents must contain eli-subdivision elements for proper structure identification
Multi-level hierarchies with chapters, sections, and articles are supported
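
Before parsing, it can be useful to check whether a page meets the requirements above. The sketch below uses only the standard library and assumes that eli-subdivision appears as a value of the class attribute. SubdivisionCounter and count_subdivisions are hypothetical names, and the sample HTML is a hand-written placeholder rather than real EUR-Lex markup.

```python
from html.parser import HTMLParser


class SubdivisionCounter(HTMLParser):
    """Count elements whose class attribute includes 'eli-subdivision'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if "eli-subdivision" in classes:
            self.count += 1


def count_subdivisions(html):
    """Return how many eli-subdivision elements the HTML string contains."""
    counter = SubdivisionCounter()
    counter.feed(html)
    return counter.count


# Placeholder markup for illustration only.
sample_html = (
    '<div class="eli-subdivision" id="art_1"><p>Article 1</p></div>'
    '<div class="eli-subdivision" id="art_2"><p>Article 2</p></div>'
    '<div class="other"><p>Footer</p></div>'
)
subdivision_count = count_subdivisions(sample_html)
print(subdivision_count)
```

A count of zero suggests the document is not in the expected format and is unlikely to parse into a useful structure.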

Advanced Usage

Custom Parsing Workflows

from quiz_gen import EURLexParser

document_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689"
parser = EURLexParser(url=document_url)

# Parse specific sections
# Note: underscore-prefixed methods are private by convention and may change between releases
parser._parse_preamble()  # Extract citations and recitals
parser._parse_enacting_terms()  # Extract chapters and articles
parser._parse_annexes()  # Extract annexes

# Access intermediate results
toc = parser.toc  # Full table of contents
chunks = parser.chunks  # Content chunks only

Filtering Chunks by Type

from quiz_gen import SectionType

# Get only recitals
recitals = [c for c in chunks if c.section_type == SectionType.RECITAL]

# Get only articles
articles = [c for c in chunks if c.section_type == SectionType.ARTICLE]

# Filter by chapter
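# Match the chapter label exactly so that 'CHAPTER II', 'CHAPTER IV', etc. are excluded
chapter_1_articles = [
    c for c in articles
    if any(
        part == 'CHAPTER I' or part.startswith('CHAPTER I ')
        for part in c.hierarchy_path
    )
]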

Accessing Metadata

for chunk in chunks:
    # Access structured metadata
    print(chunk.metadata)  # {'id': 'art_1', 'subtitle': '...'}
    
    # Navigate hierarchy
    print(chunk.hierarchy_path)  # ['CHAPTER I - PRINCIPLES', 'Article 1']
    
    # Identify parent sections
    print(chunk.parent_section)
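
When chunks need to be looked up repeatedly, an index keyed by the metadata id can help. This is a sketch under assumptions: Chunk is a stand-in carrying only the attributes used here, index_by_id is a hypothetical helper, and it assumes the id key shown in the metadata comment above is present on most chunks.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:  # stand-in with a subset of the documented attributes
    title: str
    hierarchy_path: list
    metadata: dict = field(default_factory=dict)


def index_by_id(chunks):
    """Map metadata['id'] to its chunk, skipping chunks without an id."""
    return {c.metadata["id"]: c for c in chunks if "id" in c.metadata}


sample_chunks = [
    Chunk("Article 1", ["CHAPTER I - PRINCIPLES", "Article 1"], {"id": "art_1"}),
    Chunk("Article 2", ["CHAPTER I - PRINCIPLES", "Article 2"], {"id": "art_2"}),
    Chunk("Citation", ["PREAMBLE"], {}),
]
by_id = index_by_id(sample_chunks)
print(by_id["art_1"].title)
```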

Project Structure

quiz-gen/
├── src/
│   └── quiz_gen/
│       ├── parsers/
│       │   └── html/
│       │       └── eu_lex_parser.py
│       ├── models/
│       │   ├── chunk.py
│       │   ├── document.py
│       │   └── quiz.py
│       └── utils/
├── examples/
│   └── eu_lex_toc_chunks.py
├── tests/
├── data/
│   ├── processed/
│   └── raw/
└── docs/

Development

Setting up Development Environment

# Clone the repository
git clone https://github.com/yauheniya-ai/quiz-gen.git
cd quiz-gen

# Install with development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting and formatting
ruff check .
black .

Contributing

Contributions are welcome! Please ensure:

Code follows PEP 8 style guidelines
All tests pass
New features include appropriate tests
Documentation is updated

API Reference

EURLexParser

Main parser class for EUR-Lex documents.

Methods:

parse() -> tuple[List[RegulationChunk], Dict]: Parse document and return chunks and TOC
fetch() -> str: Fetch HTML content from URL
save_chunks(filepath: str): Save chunks to JSON file
save_toc(filepath: str): Save table of contents to JSON file
print_toc(): Display formatted table of contents

RegulationChunk

Represents a parsed content chunk (for example, an article or recital).

Attributes:

section_type: Type of section (ARTICLE, RECITAL, etc.)
number: Section number (e.g., "1", "42")
title: Full title including subtitle
content: Text content
hierarchy_path: List of parent sections
metadata: Additional structured data

SectionType

Enumeration of document section types.

Values:

PREAMBLE: Preamble section
ENACTING_TERMS: Main regulatory content
CITATION: Citation in preamble
RECITAL: Recital in preamble
CHAPTER: Chapter division
SECTION: Section within chapter
ARTICLE: Article (main content unit)
ANNEX: Annex section
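
For orientation, the two types documented above could be modeled roughly as follows. This is an illustrative sketch and not the package source: member and attribute names follow the lists above, while the enum string values, the type annotations, and the defaults are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SectionType(Enum):
    # Member names follow the documented list; the string values are assumptions.
    PREAMBLE = "preamble"
    ENACTING_TERMS = "enacting_terms"
    CITATION = "citation"
    RECITAL = "recital"
    CHAPTER = "chapter"
    SECTION = "section"
    ARTICLE = "article"
    ANNEX = "annex"


@dataclass
class RegulationChunk:
    # Attribute names follow the documented list; types and defaults are assumptions.
    section_type: SectionType
    number: str
    title: str
    content: str
    hierarchy_path: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


example = RegulationChunk(
    section_type=SectionType.ARTICLE,
    number="2",
    title="Article 2 - Scope",
    content="(placeholder text)",
    hierarchy_path=["CHAPTER I - PRINCIPLES", "Article 2"],
    metadata={"id": "art_2"},
)
print(example.section_type.name, example.number)
```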

Roadmap

Future enhancements planned:

AI-powered quiz generation from extracted content
Support for additional document formats (PDF, DOCX, PPTX)
Multi-language support
Question validation and quality metrics
Integration with learning management systems
Version comparison and diff analysis

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this software in academic work, please cite:

Varabyova, Y. (2026). Quiz Gen AI: AI-powered quiz generator for regulatory documentation.
GitHub repository: https://github.com/yauheniya-ai/quiz-gen

Support

Documentation: https://quiz-gen.readthedocs.io
Issue Tracker: https://github.com/yauheniya-ai/quiz-gen/issues

Acknowledgments

Built with:

BeautifulSoup4 for HTML parsing
lxml for XML processing
EUR-Lex for providing structured legal documents

Changelog

Version 0.1.0 (2026-01-17)

Initial release:

EUR-Lex document parser
Hierarchical document structure extraction
Table of contents generation
JSON export for chunks and TOC

Version 0.1.1 (2026-01-18)

Parser enhancements:

Added regulation title extraction and chunking
Support for flexible 3-4 level hierarchy with sections within chapters
Complete annexes extraction including table-based content
Combined citations into single chunk matching EUR-Lex structure
Added concluding formulas parsing

Version 0.1.2 (2026-01-18)

Text formatting and tooling:

Implemented smart text cleaning for proper list formatting (removes extra newlines after list markers)
Fixed numbered paragraph spacing
Added professional command-line interface (CLI)
Created comprehensive documentation with MkDocs and Material theme
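
The list-formatting cleanup mentioned above can be pictured as joining a list marker with the text that follows it. The following is a minimal sketch of that idea using a regular expression. It is not the package's implementation: join_list_markers is a hypothetical helper, and it only handles numeric and lowercase-letter markers in parentheses.

```python
import re

# A marker such as (8) or (a) alone on a line, followed by one or more line breaks.
MARKER_ON_OWN_LINE = re.compile(
    r"^([ \t]*\((?:\d+|[a-z]+)\))[ \t]*\n\s*",
    re.MULTILINE,
)


def join_list_markers(text):
    """Put each stand-alone list marker on the same line as its text."""
    return MARKER_ON_OWN_LINE.sub(r"\1 ", text)


raw = "(a)\n\nthe first item;\n(b)\nthe second item;\nNo marker here."
cleaned = join_list_markers(raw)
print(cleaned)
```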

Version 0.1.3 (2026-01-19)

Parser robustness improvements:

Fixed parsing of articles directly under enacting terms (without chapter hierarchy)
Enhanced article content extraction to handle table-based list items (e.g., (a), (b), (c) in table cells)
Added proper appendix detection and parsing (distinguishes appendices from annexes)
Improved title extraction for multi-paragraph appendix titles

Version 0.1.4 (2026-01-19)

Annex parsing improvements:

Added intelligent detection and parsing of parts within annexes (PART 1, PART 2, etc.)
Improved part titles to include annex identifier (e.g., "ANNEX 1 - PART 1" instead of "ANNEX - PART 1")
Removed arbitrary content truncation in annexes and appendices - all content now preserved in full
Enhanced content collection for parts with proper boundary detection between sections

Version 0.1.5 (2026-01-19)

Bug fixes:

Fixed annex TOC title to display with identifier (e.g., "ANNEX 1" instead of "ANNEX")
Fixed empty content in annex parts by switching from sibling navigation to descendants iteration

Version 0.1.6 (2026-01-19)

Content extraction improvements:

Enhanced part content extraction to include all paragraph types (titles, headings, body text)
Fixed missing section titles and numbered headings in annex parts
Lowered text length threshold to capture short titles (5 chars instead of 10)
Added smart filtering to skip only PART headers while collecting all other content

Version 0.1.7 (2026-01-19)

List structure preservation:

Added detection and proper handling of list-item tables (numbered and lettered items)
Fixed extraction of nested list structures by processing direct content only
Preserved list markers like (8), (a), (b), (—) with their corresponding text
Separated handling of list tables vs data tables for appropriate formatting

Version 0.1.8 (2026-01-19)

Complete text extraction:

Simplified part content extraction to use natural text flow from HTML structure
Fixed content duplication caused by nested table processing
Fixed missing content (e.g., item (8)) by extracting all sibling elements between PART headers
Switched from selective element processing to comprehensive text extraction using get_text()
Ensures complete and accurate extraction without repetition for legal document compliance

Version 0.1.9 (2026-01-20)

Annex section parsing enhancements:

Added support for detecting and extracting annex sections (Section A, Section B, etc.) in addition to parts
Fixed line break formatting in numbered lists within annex sections to keep numbers and content on same line
Fixed content extraction for annex sections by searching for tables within container elements rather than direct table siblings
Enhanced section pattern matching to support both "PART" and "Section" patterns with letter/number identifiers
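
Header detection of the kind described above can be sketched with a single regular expression. The pattern and the parse_header helper below are assumptions for illustration, not the package's actual matching rules. The sketch accepts one uppercase letter or a decimal number as identifier and does not handle Roman numerals.

```python
import re

# Hypothetical pattern for headers such as "PART 1", "PART A", "Section B".
HEADER = re.compile(r"^(PART|Section)\s+([A-Z]|\d+)\b")


def parse_header(line):
    """Return (kind, identifier) for a PART/Section header line, else None."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


for candidate in ["PART 1", "  Section B - General", "Article 5"]:
    print(repr(candidate), parse_header(candidate))
```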

