A Python library for parsing Confluence Storage Format content into structured data

Confluence Content Parser

Important: This is an early-stage release. The API may change and using it in production carries risk. Pin versions and evaluate carefully before deployment.
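
As a rough illustration of the pinning advice, an application could check at startup that the installed release is the one it was validated against. This is only a sketch: `installed_version_matches` is a hypothetical helper, not part of this library, and it relies solely on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_matches(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # A package that is not installed at all counts as a mismatch.
        return False
```

A caller would pass the distribution name `confluence-content-parser` together with whichever release string it has pinned, and refuse to start (or log a warning) when the function returns `False`.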

A Python library that parses Confluence Storage Format content into structured, validated data models built on Pydantic.

Features

  • Comprehensive Coverage: Supports 40+ Confluence Storage Format elements and macros
  • 🚀 High Performance: Built with lxml for fast XML parsing
  • 🏗️ Structured Data: Uses Pydantic models for type-safe, validated data structures
  • 📝 Modern Python: Built for Python 3.12+ with full type hints
  • 🔧 Extensible: Clean architecture makes it easy to add new element types
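
Storage Format fragments use `ac:` and `ri:` prefixes without declaring them, so a plain XML parser rejects them with an unbound-prefix error. One common workaround, sketched here with the standard library, is to wrap the fragment in a root element that declares the prefixes. This is not necessarily how the library handles it internally, and the namespace URIs and function names below are placeholders chosen for the sketch.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, chosen for this sketch only.
AC_NS = "urn:example:ac"
RI_NS = "urn:example:ri"


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap a storage-format fragment in a root that declares ac: and ri:."""
    wrapped = f'<root xmlns:ac="{AC_NS}" xmlns:ri="{RI_NS}">{fragment}</root>'
    return ET.fromstring(wrapped)


def macro_names(fragment: str) -> list[str]:
    """List the ac:name of every ac:structured-macro, in document order."""
    root = parse_fragment(fragment)
    return [
        macro.get(f"{{{AC_NS}}}name", "")
        for macro in root.iter(f"{{{AC_NS}}}structured-macro")
    ]


fragment = (
    "<p>Intro</p>"
    '<ac:structured-macro ac:name="info">'
    "<ac:rich-text-body><p>Note</p></ac:rich-text-body>"
    "</ac:structured-macro>"
)
print(macro_names(fragment))  # ['info']
```

Passing the same fragment straight to `ET.fromstring` raises `ET.ParseError`, which is why the wrapper is needed.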

Installation

# Using uv (recommended)
uv add confluence-content-parser

# Using pip
pip install confluence-content-parser

Quick Start

from confluence_content_parser import ConfluenceParser

# Initialize the parser
parser = ConfluenceParser()

# Parse Confluence Storage Format content
content = """
<ac:layout>
    <ac:layout-section ac:type="fixed-width">
        <ac:layout-cell>
            <h2>My Document</h2>
            <p>This is a <strong>bold</strong> paragraph.</p>
            <ac:structured-macro ac:name="info">
                <ac:rich-text-body>
                    <p>This is an info panel.</p>
                </ac:rich-text-body>
            </ac:structured-macro>
        </ac:layout-cell>
    </ac:layout-section>
</ac:layout>
"""

# Parse the content
document = parser.parse(content)

# Access the structured data
print(f"Document text: {document.text}")

# Find all nodes of specific types
from confluence_content_parser import HeadingElement, PanelMacro

headings = document.find_all(HeadingElement)
panels = document.find_all(PanelMacro)

# Or find multiple types at once
headings, panels = document.find_all(HeadingElement, PanelMacro)

print(f"Found {len(headings)} headings and {len(panels)} panels")

# Navigate the structure
for node in document.walk():
    print(f"Node type: {type(node).__name__}")
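
To make the calling convention above concrete, here is a toy model of a node tree with `walk()` and `find_all()`. It is a sketch of the behaviour shown in the Quick Start (one type returns a list, several types return one list per type), not the library's implementation. The `Node`, `Heading`, and `Panel` classes are stand-ins, and the assumption that `walk()` is depth-first and yields the starting node is mine.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: list["Node"] = field(default_factory=list)

    def walk(self):
        """Yield this node, then every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find_all(self, *types):
        """One type -> a list; several types -> a tuple of lists."""
        buckets = tuple([n for n in self.walk() if isinstance(n, t)] for t in types)
        return buckets[0] if len(types) == 1 else buckets


@dataclass
class Heading(Node):
    """Stand-in for a heading node."""


@dataclass
class Panel(Node):
    """Stand-in for a panel macro node."""


tree = Node(children=[Heading(), Panel(children=[Heading()])])
headings, panels = tree.find_all(Heading, Panel)
print(len(headings), len(panels))  # 2 1
```

Each requested type in this toy version triggers its own traversal, which keeps the code short. A single pass that sorts nodes into buckets would scale better on large documents.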

Examples

  • examples/basic_usage.py: Basic parsing, text extraction, and element traversal
  • examples/advanced_usage.py: Complex layouts, macros, nested content analysis
  • examples/diagnostics_usage.py: Error handling, unknown elements, and parsing diagnostics

Supported Elements & Macros

Text Elements

| Element | Node Class | Description |
| --- | --- | --- |
| <p> | TextBreakElement | Paragraph with text and formatting |
| <h1>-<h6> | HeadingElement | Heading levels 1-6 |
| <strong>, <em>, <u> | TextEffectElement | Bold, italic, underline |
| <sub>, <sup>, <del> | TextEffectElement | Subscript, superscript, strikethrough |
| <blockquote> | TextEffectElement | Block quotations |
| <span> | TextEffectElement | Inline text with styling |
| <code> | TextEffectElement | Inline code formatting |
| Text content | Text | Plain text nodes |

Lists & Structure

| Element | Node Class | Description |
| --- | --- | --- |
| <ul>, <ol> | ListElement | Unordered and ordered lists |
| <li> | ListItem | List items (regular and tasks) |
| <ac:task-list> | ListElement | Task lists |
| <ac:task> | ListItem | Individual task items |
| <table> | Table | Tables with headers and data |
| <tr> | TableRow | Table rows |
| <td>, <th> | TableCell | Table cells |
| <hr> | TextBreakElement | Horizontal dividers |
| <br> | TextBreakElement | Line breaks |

Layout Elements

| Element | Node Class | Description |
| --- | --- | --- |
| <ac:layout> | LayoutElement | Page layout container |
| <ac:layout-section> | LayoutSection | Layout section with columns |
| <ac:layout-cell> | LayoutCell | Individual layout cell |

Media Elements

| Element | Node Class | Description |
| --- | --- | --- |
| <ac:image> | Image | Images with attachments or URLs |

Interactive Elements

| Element | Node Class | Description |
| --- | --- | --- |
| <ac:link> | LinkElement | Links to pages, users, attachments |
| <a> | LinkElement | External links and mailto |
| <ac:emoticon> | Emoticon | Confluence emoticons and emojis |
| <ac:placeholder> | PlaceholderElement | Dynamic content placeholders |
| <time> | Time | Date and time elements |
| <ri:*> | ResourceIdentifier | Resource identifiers (pages, attachments, etc.) |

Macros

| Macro | Node Class | Description |
| --- | --- | --- |
| info, warning, note, tip | PanelMacro | Notification panels |
| panel | PanelMacro | Custom styled panels |
| code | CodeMacro | Syntax-highlighted code blocks |
| status | StatusMacro | Status indicators |
| jira | JiraMacro | JIRA issue integration |
| expand | ExpandMacro | Expandable content sections |
| details | DetailsMacro | Collapsible content sections |
| toc | TocMacro | Auto-generated table of contents |
| view-file | ViewFileMacro | File preview macro |
| viewpdf | ViewPdfMacro | PDF viewer macro |
| excerpt | ExcerptMacro | Content excerpts |
| excerpt-include | ExcerptIncludeMacro | Include content excerpts |
| include | IncludeMacro | Include other pages |
| attachments | AttachmentsMacro | List page attachments |
| profile | ProfileMacro | User profile display |
| anchor | AnchorMacro | Page anchors |
| tasks-report-macro | TasksReportMacro | Task reports |
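
The macro names in the table above can double as a quick pre-flight check: before parsing a page, scan it for macros that the table does not list. The sketch below uses a regular expression instead of the parser, so it is only a rough filter. `SUPPORTED_MACROS` and `unsupported_macros` are hypothetical names introduced here, and the set is copied by hand from the table, not read from the library.

```python
import re

# Macro names copied by hand from the table above.
SUPPORTED_MACROS = frozenset({
    "info", "warning", "note", "tip", "panel", "code", "status", "jira",
    "expand", "details", "toc", "view-file", "viewpdf", "excerpt",
    "excerpt-include", "include", "attachments", "profile", "anchor",
    "tasks-report-macro",
})

# Matches the double-quoted ac:name attribute on an opening ac:structured-macro tag.
_MACRO_NAME = re.compile(r'<ac:structured-macro\b[^>]*?\bac:name="([^"]+)"')


def unsupported_macros(content: str) -> list[str]:
    """Return macro names used in `content` that the table does not list."""
    seen = dict.fromkeys(_MACRO_NAME.findall(content))  # de-duplicate, keep order
    return [name for name in seen if name not in SUPPORTED_MACROS]


page = (
    '<ac:structured-macro ac:name="info"/>'
    '<ac:structured-macro ac:name="unknown-macro"/>'
)
print(unsupported_macros(page))  # ['unknown-macro']
```

Because the scan only recognises double-quoted `ac:name` attributes, the parser's own diagnostics remain the authoritative source for what was not understood.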

Advanced Elements

| Element | Node Class | Description |
| --- | --- | --- |
| <ac:adf-extension> | PanelMacro, DecisionList | ADF panel and decision list extensions |
| Decision lists | DecisionList | Decision tracking lists |
| Decision items | DecisionListItem | Individual decision items |
| Fragment | Fragment | Container for multiple top-level nodes |

Advanced Usage

Working with Structured Data

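from confluence_content_parser import (
    ConfluenceParser,
    Image,
    LinkElement,
    ListElement,
    ListType,
    Table,
)

parser = ConfluenceParser()
# confluence_content is a Storage Format string, such as `content` in Quick Start
document = parser.parse(confluence_content)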

# Find all images in the document
images = document.find_all(Image)
for image in images:
    print(f"Image: {image.alt or 'No alt text'} ({image.width}x{image.height})")

# Find all task lists
all_lists = document.find_all(ListElement)
task_lists = [lst for lst in all_lists if lst.type == ListType.TASK]
for task_list in task_lists:
    print(f"Task list with {len(task_list.children)} tasks")

# Analyze content structure efficiently
images, tables, links = document.find_all(Image, Table, LinkElement)
print(f"Document contains: {len(images)} images, {len(tables)} tables, {len(links)} links")

# Walk through all nodes in the document
for node in document.walk():
    if hasattr(node, 'text') and node.text:
        print(f"Text node: {node.text[:50]}...")
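
For a quick structural overview, the nodes yielded by a traversal can be tallied by class name. The helper below is a small sketch that accepts any iterable of objects, so it would work with `document.walk()` as used above. `count_node_types` is a hypothetical name, and the `Image` and `Table` classes in the demo are empty stand-ins, not the library's models.

```python
from collections import Counter
from collections.abc import Iterable


def count_node_types(nodes: Iterable[object]) -> Counter[str]:
    """Tally nodes by class name."""
    return Counter(type(node).__name__ for node in nodes)


# Empty stand-in classes for demonstration only.
class Image:
    pass


class Table:
    pass


counts = count_node_types([Image(), Table(), Image()])
print(counts.most_common())  # [('Image', 2), ('Table', 1)]
```

With a parsed document, the call would be `count_node_types(document.walk())`.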

Custom Processing

from confluence_content_parser import ConfluenceParser, Text

parser = ConfluenceParser()
document = parser.parse(content)

# Extract all text content (built-in method)
full_text = document.text
print(f"Document text: {full_text}")

# Or manually collect text nodes
text_nodes = document.find_all(Text)
all_text = " ".join(node.text for node in text_nodes)
print(f"All text: {all_text}")

# Custom traversal
def find_nodes_with_condition(document, condition_func):
    """Find all nodes matching a custom condition."""
    matching_nodes = []
    for node in document.walk():
        if condition_func(node):
            matching_nodes.append(node)
    return matching_nodes

# Example: Find all nodes that contain specific text
# getattr(...) or '' also covers nodes whose text attribute is None
nodes_with_api = find_nodes_with_condition(
    document,
    lambda node: 'API' in (getattr(node, 'text', None) or '')
)
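
Conditions like the one above are easier to reuse when they are built by a small factory that tolerates nodes with no `text` attribute or with `text` set to `None`. The sketch below is one way to do that. `has_text` is a hypothetical helper, and the demo uses `SimpleNamespace` objects as stand-ins for parsed nodes.

```python
from collections.abc import Callable
from types import SimpleNamespace


def has_text(needle: str) -> Callable[[object], bool]:
    """Build a predicate that is true for nodes whose text contains `needle`."""
    def predicate(node: object) -> bool:
        text = getattr(node, "text", None)
        return isinstance(text, str) and needle in text
    return predicate


# Stand-in nodes for demonstration; real nodes would come from document.walk().
nodes = [SimpleNamespace(text="REST API"), SimpleNamespace(text=None), object()]
matches = [n for n in nodes if has_text("API")(n)]
print(len(matches))  # 1
```

The returned predicate can be passed directly as `condition_func` to `find_nodes_with_condition`.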

Error Handling

from confluence_content_parser import ConfluenceParser, ParsingError
import xml.etree.ElementTree as ET

# Default behavior: collect diagnostics without raising errors
parser = ConfluenceParser(raise_on_finish=False)

try:
    document = parser.parse(malformed_content)
    # Check diagnostics for any issues
    diagnostics = document.metadata.get("diagnostics", [])
    if diagnostics:
        print(f"Parsing issues found: {diagnostics}")
except ET.ParseError as e:
    print(f"XML parsing error: {e}")

# Strict parsing: raise errors for unknown elements
strict_parser = ConfluenceParser(raise_on_finish=True)
try:
    document = strict_parser.parse(content_with_unknown_elements)
except ParsingError as e:
    print(f"Parsing failed with diagnostics: {e.diagnostics}")
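
The difference between the two modes can be pictured with a toy collector: issues are recorded as parsing proceeds, and only at the end does the strict mode turn them into an exception that carries the full list. This is a sketch of the pattern implied by the `raise_on_finish` flag, not the library's code. `ToyCollector` and `ToyParsingError` are stand-ins, and raising only when at least one issue was recorded is an assumption.

```python
class ToyParsingError(Exception):
    """Stand-in for the library's ParsingError; carries the collected notes."""

    def __init__(self, diagnostics: list[str]) -> None:
        super().__init__(f"{len(diagnostics)} parsing issue(s)")
        self.diagnostics = diagnostics


class ToyCollector:
    """Records non-fatal issues and decides at the end whether to raise."""

    def __init__(self, raise_on_finish: bool = False) -> None:
        self.raise_on_finish = raise_on_finish
        self.diagnostics: list[str] = []

    def note(self, message: str) -> None:
        """Record a non-fatal issue and keep going."""
        self.diagnostics.append(message)

    def finish(self) -> list[str]:
        """Raise if strict and anything was noted; otherwise return the notes."""
        if self.raise_on_finish and self.diagnostics:
            raise ToyParsingError(self.diagnostics)
        return self.diagnostics
```

Deferring the exception until the end means a strict run still reports every problem in the input at once, not just the first one encountered.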

Diagnostics

The parser collects non-fatal parsing notes (e.g., unknown macros) in document.metadata["diagnostics"].

from confluence_content_parser import ConfluenceParser

parser = ConfluenceParser(raise_on_finish=False)
doc = parser.parse('<ac:structured-macro ac:name="unknown-macro"/>')
diagnostics = doc.metadata.get("diagnostics", [])
for diagnostic in diagnostics:
    print(diagnostic)  # Outputs: unknown_macro:unknown-macro
# See examples/diagnostics_usage.py for a complete example
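
When a page produces many diagnostics, grouping them by kind makes the output easier to scan. The sketch below assumes every entry is a string shaped like the `unknown_macro:unknown-macro` example above, that is `kind:detail`. That shape is inferred from the single sample shown and may not hold for every diagnostic, so entries without a colon are kept under `other`. `group_diagnostics` is a hypothetical helper.

```python
from collections import defaultdict


def group_diagnostics(diagnostics: list[str]) -> dict[str, list[str]]:
    """Group 'kind:detail' strings by kind; entries without ':' go under 'other'."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in diagnostics:
        kind, sep, detail = entry.partition(":")
        if sep:
            grouped[kind].append(detail)
        else:
            grouped["other"].append(entry)
    return dict(grouped)


sample = ["unknown_macro:unknown-macro", "unknown_macro:foo", "weird"]
print(group_diagnostics(sample))
# {'unknown_macro': ['unknown-macro', 'foo'], 'other': ['weird']}
```

With a parsed document, the input would be `doc.metadata.get("diagnostics", [])`.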

Development

Setup

# Clone the repository
git clone https://github.com/Unificon/confluence-content-parser.git
cd confluence-content-parser

# Install dependencies with uv
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=confluence_content_parser --cov-report=html

Project Structure

src/confluence_content_parser/
├── __init__.py           # Main exports
├── parser.py            # Core parser implementation
├── document.py          # ConfluenceDocument model
└── nodes.py             # All node types and models

Running Tests

# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=confluence_content_parser --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_parser.py

# Run with verbose output
uv run pytest -v

Code Quality

# Format code
uv run black src/ tests/

# Lint code  
uv run ruff check src/ tests/

# Type checking
uv run mypy src/

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes with tests
  4. Ensure all tests pass: uv run pytest
  5. Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

  • Built with lxml for robust XML parsing
  • Uses Pydantic for data validation and serialization
  • Uses types-lxml for lxml type annotations
  • Inspired by the Confluence Storage Format specification
