A Python library for parsing Confluence Storage Format content into structured data

Confluence Content Parser

Important: This is an early-stage release. The API may change and using it in production carries risk. Pin versions and evaluate carefully before deployment.

A Python library that parses Confluence Storage Format content into structured, validated data models built on Pydantic.

Features

Comprehensive Coverage: Supports 40+ Confluence Storage Format elements and macros
🚀 High Performance: Built with lxml for fast XML parsing
🏗️ Structured Data: Uses Pydantic models for type-safe, validated data structures
📝 Modern Python: Built for Python 3.12+ with full type hints
🔧 Extensible: Clean architecture makes it easy to add new element types

Installation

# Using uv (recommended)
uv add confluence-content-parser

# Using pip
pip install confluence-content-parser

Quick Start

from confluence_content_parser import ConfluenceParser

# Initialize the parser
parser = ConfluenceParser()

# Parse Confluence Storage Format content
content = """
<ac:layout>
    <ac:layout-section ac:type="fixed-width">
        <ac:layout-cell>
            <h2>My Document</h2>
            <p>This is a <strong>bold</strong> paragraph.</p>
            <ac:structured-macro ac:name="info">
                <ac:rich-text-body>
                    <p>This is an info panel.</p>
                </ac:rich-text-body>
            </ac:structured-macro>
        </ac:layout-cell>
    </ac:layout-section>
</ac:layout>
"""

# Parse the content
document = parser.parse(content)

# Access the structured data
print(f"Document text: {document.text}")

# Find all nodes of specific types
from confluence_content_parser import HeadingElement, PanelMacro

headings = document.find_all(HeadingElement)
panels = document.find_all(PanelMacro)

print(f"Found {len(headings)} headings and {len(panels)} panels")

# Navigate the structure
for node in document.walk():
    print(f"Node type: {type(node).__name__}")
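
To make the traversal calls above more concrete, here is a minimal stdlib-only sketch of how `walk()` and `find_all()` style methods can behave on a node tree. The `Node`, `Heading`, and `Panel` classes are hypothetical stand-ins, not the library's own models, and the pre-order, depth-first visiting order is an assumption rather than documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, TypeVar

T = TypeVar("T", bound="Node")


@dataclass
class Node:
    """Hypothetical tree node; the real library models are not shown here."""

    children: list[Node] = field(default_factory=list)

    def walk(self) -> Iterator[Node]:
        # Yield this node first, then each subtree (pre-order, depth-first).
        yield self
        for child in self.children:
            yield from child.walk()

    def find_all(self, node_type: type[T]) -> list[T]:
        # Keep only the nodes that are instances of the requested class.
        return [node for node in self.walk() if isinstance(node, node_type)]


@dataclass
class Heading(Node):
    title: str = ""


@dataclass
class Panel(Node):
    pass


tree = Node(children=[
    Heading(title="My Document"),
    Panel(children=[Heading(title="Inside panel")]),
])

visited = [type(node).__name__ for node in tree.walk()]
titles = [heading.title for heading in tree.find_all(Heading)]
print(visited)
print(titles)
```

With this shape, a type-based search is just a filter over the full walk, which is why nested headings inside a panel are found as well.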

Examples

  • examples/basic_usage.py: Basic parsing, text extraction, and element traversal
  • examples/advanced_usage.py: Complex layouts, macros, nested content analysis
  • examples/diagnostics_usage.py: Error handling, unknown elements, and parsing diagnostics

Supported Elements & Macros

Text Elements

Element              Node Class         Description
<p>                  TextBreakElement   Paragraph with text and formatting
<h1>-<h6>            HeadingElement     Heading levels 1-6
<strong>, <em>, <u>  TextEffectElement  Bold, italic, underline
<sub>, <sup>, <del>  TextEffectElement  Subscript, superscript, strikethrough
<blockquote>         TextEffectElement  Block quotations
<span>               TextEffectElement  Inline text with styling
<code>               TextEffectElement  Inline code formatting
Text content         Text               Plain text nodes

Lists & Structure

Element         Node Class        Description
<ul>, <ol>      ListElement       Unordered and ordered lists
<li>            ListItem          List items (regular and tasks)
<ac:task-list>  ListElement       Task lists
<ac:task>       ListItem          Individual task items
<table>         Table             Tables with headers and data
<tr>            TableRow          Table rows
<td>, <th>      TableCell         Table cells
<hr>            TextBreakElement  Horizontal dividers
<br>            TextBreakElement  Line breaks

Layout Elements

Element              Node Class     Description
<ac:layout>          LayoutElement  Page layout container
<ac:layout-section>  LayoutSection  Layout section with columns
<ac:layout-cell>     LayoutCell     Individual layout cell

Media Elements

Element     Node Class  Description
<ac:image>  Image       Images with attachments or URLs

Interactive Elements

Element           Node Class          Description
<ac:link>         LinkElement         Links to pages, users, attachments
<a>               LinkElement         External links and mailto
<ac:emoticon>     Emoticon            Confluence emoticons and emojis
<ac:placeholder>  PlaceholderElement  Dynamic content placeholders
<time>            Time                Date and time elements
<ri:*>            ResourceIdentifier  Resource identifiers (pages, attachments, etc.)

Macros

Macro                     Node Class           Description
info, warning, note, tip  PanelMacro           Notification panels
panel                     PanelMacro           Custom styled panels
code                      CodeMacro            Syntax-highlighted code blocks
status                    StatusMacro          Status indicators
jira                      JiraMacro            JIRA issue integration
expand                    ExpandMacro          Expandable content sections
details                   DetailsMacro         Collapsible content sections
toc                       TocMacro             Auto-generated table of contents
view-file                 ViewFileMacro        File preview macro
viewpdf                   ViewPdfMacro         PDF viewer macro
excerpt                   ExcerptMacro         Content excerpts
excerpt-include           ExcerptIncludeMacro  Include content excerpts
include                   IncludeMacro         Include other pages
attachments               AttachmentsMacro     List page attachments
profile                   ProfileMacro         User profile display
anchor                    AnchorMacro          Page anchors
tasks-report-macro        TasksReportMacro     Task reports
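
The macro names in this table correspond to the `ac:name` attribute on `<ac:structured-macro>`, as in the Quick Start sample. The following stdlib-only sketch lists the macro names used in a fragment and flags any that are missing from the table. It does not use or mirror the library's parser: the namespace URIs are placeholders chosen for this example, `macro_names` and `unlisted_macros` are hypothetical helpers, and fragments containing HTML entities such as `&nbsp;` would need extra handling.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumption): Storage Format fragments use the ac: and ri:
# prefixes without declaring them, so the wrapper element declares them here.
AC_NS = "urn:example:ac"
RI_NS = "urn:example:ri"

# Macro names copied from the table above.
LISTED_MACROS = {
    "info", "warning", "note", "tip", "panel", "code", "status", "jira",
    "expand", "details", "toc", "view-file", "viewpdf", "excerpt",
    "excerpt-include", "include", "attachments", "profile", "anchor",
    "tasks-report-macro",
}


def macro_names(fragment: str) -> list[str]:
    """Return the ac:name of every structured macro, in document order."""
    wrapped = f'<root xmlns:ac="{AC_NS}" xmlns:ri="{RI_NS}">{fragment}</root>'
    root = ET.fromstring(wrapped)
    return [
        element.get(f"{{{AC_NS}}}name", "")
        for element in root.iter(f"{{{AC_NS}}}structured-macro")
    ]


def unlisted_macros(fragment: str) -> list[str]:
    """Return macro names in the fragment that the table does not list."""
    return [name for name in macro_names(fragment) if name not in LISTED_MACROS]


sample = (
    '<ac:structured-macro ac:name="info">'
    "<ac:rich-text-body><p>Note</p></ac:rich-text-body>"
    "</ac:structured-macro>"
    '<ac:structured-macro ac:name="unknown-macro"/>'
)
print(macro_names(sample))
print(unlisted_macros(sample))
```

A pre-check like this can help estimate how many diagnostics to expect before running a page through the parser.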

Advanced Elements

Element             Node Class                Description
<ac:adf-extension>  PanelMacro, DecisionList  ADF panel and decision list extensions
Decision lists      DecisionList              Decision tracking lists
Decision items      DecisionListItem          Individual decision items
Fragment            Fragment                  Container for multiple top-level nodes

Advanced Usage

Working with Structured Data

from confluence_content_parser import ConfluenceParser, Image, ListElement, ListType

parser = ConfluenceParser()
# confluence_content is a Storage Format string, like `content` in Quick Start
document = parser.parse(confluence_content)

# Find all images in the document
images = document.find_all(Image)
for image in images:
    print(f"Image: {image.alt or 'No alt text'} ({image.width}x{image.height})")

# Find all task lists
all_lists = document.find_all(ListElement)
task_lists = [lst for lst in all_lists if lst.type == ListType.TASK]
for task_list in task_lists:
    print(f"Task list with {len(task_list.children)} tasks")

# Walk through all nodes in the document
for node in document.walk():
    if hasattr(node, 'text') and node.text:
        print(f"Text node: {node.text[:50]}...")

Custom Processing

from confluence_content_parser import ConfluenceParser, Text

parser = ConfluenceParser()
# content is a Storage Format string, as defined in Quick Start
document = parser.parse(content)

# Extract all text content (built-in method)
full_text = document.text
print(f"Document text: {full_text}")

# Or manually collect text nodes
text_nodes = document.find_all(Text)
all_text = " ".join(node.text for node in text_nodes)
print(f"All text: {all_text}")

# Custom traversal
def find_nodes_with_condition(document, condition_func):
    """Find all nodes matching a custom condition."""
    matching_nodes = []
    for node in document.walk():
        if condition_func(node):
            matching_nodes.append(node)
    return matching_nodes

# Example: find all nodes whose text contains "API"
nodes_with_api = find_nodes_with_condition(
    document,
    # `or ''` also covers nodes whose text attribute exists but is None
    lambda node: 'API' in (getattr(node, 'text', None) or ''),
)
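
Because `document.walk()` yields every node, a quick way to get an overview of a page is to count nodes by class name, using the same `type(node).__name__` idiom as the Quick Start. The sketch below is a hypothetical helper, not part of the library. It accepts any iterable of objects, so it runs here with stand-in classes and could be called as `count_node_types(document.walk())` with a parsed document.

```python
from collections import Counter
from collections.abc import Iterable


def count_node_types(nodes: Iterable[object]) -> Counter[str]:
    """Count objects by class name."""
    return Counter(type(node).__name__ for node in nodes)


# Stand-in classes so the sketch runs without the library installed.
class Heading:
    pass


class Paragraph:
    pass


counts = count_node_types([Heading(), Paragraph(), Paragraph()])
for name, count in counts.most_common():
    print(f"{name}: {count}")
```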

Error Handling

import xml.etree.ElementTree as ET

from confluence_content_parser import ConfluenceParser, ParsingError

# Default behavior: collect diagnostics without raising errors
parser = ConfluenceParser(raise_on_finish=False)

try:
    document = parser.parse(malformed_content)
    # Check diagnostics for any issues
    diagnostics = document.metadata.get("diagnostics", [])
    if diagnostics:
        print(f"Parsing issues found: {diagnostics}")
except ET.ParseError as e:
    print(f"XML parsing error: {e}")

# Strict parsing: raise errors for unknown elements
strict_parser = ConfluenceParser(raise_on_finish=True)
try:
    document = strict_parser.parse(content_with_unknown_elements)
except ParsingError as e:
    print(f"Parsing failed with diagnostics: {e.diagnostics}")

Diagnostics

The parser collects non-fatal parsing notes (e.g., unknown macros) in document.metadata["diagnostics"].

from confluence_content_parser import ConfluenceParser

parser = ConfluenceParser(raise_on_finish=False)
doc = parser.parse('<ac:structured-macro ac:name="unknown-macro"/>')
diagnostics = doc.metadata.get("diagnostics", [])
for diagnostic in diagnostics:
    print(diagnostic)  # Outputs: unknown_macro:unknown-macro
# See examples/diagnostics_usage.py for a complete example
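
The sample output above, `unknown_macro:unknown-macro`, suggests a `kind:detail` shape. Assuming every diagnostic is a string in that form (an assumption based only on this one example), a small hypothetical helper can group entries by kind for reporting. It relies on the standard library alone.

```python
from collections import defaultdict


def group_diagnostics(diagnostics: list[str]) -> dict[str, list[str]]:
    """Group 'kind:detail' strings by kind, keeping details in input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in diagnostics:
        # Split on the first colon only; an entry without a colon
        # becomes a kind with an empty detail.
        kind, _, detail = entry.partition(":")
        grouped[kind].append(detail)
    return dict(grouped)


# Sample entries in the assumed format; the second one is made up.
sample_diagnostics = ["unknown_macro:unknown-macro", "unknown_macro:my-custom-macro"]
summary = group_diagnostics(sample_diagnostics)
for kind, details in summary.items():
    print(f"{kind}: {', '.join(details)}")
```

With a parsed document, the input would be `doc.metadata.get("diagnostics", [])` as shown above.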

Development

Setup

# Clone the repository
git clone https://github.com/Unificon/confluence-content-parser.git
cd confluence-content-parser

# Install dependencies with uv
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=confluence_content_parser --cov-report=html

Project Structure

src/confluence_content_parser/
├── __init__.py           # Main exports
├── parser.py            # Core parser implementation
├── document.py          # ConfluenceDocument model
└── nodes.py             # All node types and models

Running Tests

# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=confluence_content_parser --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_parser.py

# Run with verbose output
uv run pytest -v

Code Quality

# Format code
uv run black src/ tests/

# Lint code  
uv run ruff check src/ tests/

# Type checking
uv run mypy src/

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes with tests
  4. Ensure all tests pass: uv run pytest
  5. Submit a pull request

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

Acknowledgments

  • Built with lxml for robust XML parsing
  • Uses Pydantic for data validation and serialization
  • Uses types-lxml for lxml type annotations
  • Inspired by the Confluence Storage Format specification
