A Python library for parsing Confluence Storage Format content into structured data

These details have not been verified by PyPI

Project description

Confluence Content Parser

Important: This is an early-stage release. The API may change and using it in production carries risk. Pin versions and evaluate carefully before deployment.

A Python library that parses Confluence Storage Format content into structured, validated data models built on Pydantic.

Features

✨ Comprehensive Coverage: Supports 40+ Confluence Storage Format elements and macros
🚀 High Performance: Built with lxml for fast XML parsing
🏗️ Structured Data: Uses Pydantic models for type-safe, validated data structures
📝 Modern Python: Built for Python 3.12+ with full type hints
🔧 Extensible: Clean architecture makes it easy to add new element types

Installation

# Using uv (recommended)
uv add confluence-content-parser

# Using pip
pip install confluence-content-parser
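
Because the package targets Python 3.12+ and the API may change between releases, a start-up guard can fail early on an unsupported interpreter or report which release is installed. The sketch below uses only the standard library. The helper names `meets_minimum` and `installed_version` are illustrative and are not part of the package.

```python
from __future__ import annotations

import sys
from importlib.metadata import PackageNotFoundError, version

MIN_PYTHON = (3, 12)  # taken from the "Python 3.12+" requirement in Features


def meets_minimum(version_info: tuple[int, ...], minimum: tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", meets_minimum(sys.version_info))
    print("Installed release:", installed_version("confluence-content-parser"))
```

Comparing the reported release against the version pinned in your dependency file is one way to notice an accidental upgrade.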

Quick Start

from confluence_content_parser import ConfluenceParser

# Initialize the parser
parser = ConfluenceParser()

# Parse Confluence Storage Format content
content = """
<ac:layout>
    <ac:layout-section ac:type="fixed-width">
        <ac:layout-cell>
            <h2>My Document</h2>
            <p>This is a <strong>bold</strong> paragraph.</p>
            <ac:structured-macro ac:name="info">
                <ac:rich-text-body>
                    <p>This is an info panel.</p>
                </ac:rich-text-body>
            </ac:structured-macro>
        </ac:layout-cell>
    </ac:layout-section>
</ac:layout>
"""

# Parse the content
document = parser.parse(content)

# Access the structured data
print(f"Document contains {len(document.content)} top-level elements")

# Navigate the structure
layout = document.content[0]
section = layout.children[0] 
cell_content = section.layout_section.cells[0].content

for element in cell_content:
    print(f"Element type: {element.type}")
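
Storage Format fragments such as the one above use the `ac:` prefix (and often `ri:`) without declaring it, so a strict XML parser rejects them as written. One common workaround is to wrap the fragment in a root element that declares the prefixes. The sketch below shows that idea with the standard library's ElementTree; it is not the library's implementation, `parse_fragment` is a hypothetical helper, and the namespace URIs are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): any URI works as long as the
# same value is used when looking elements up afterwards.
AC_NS = "urn:example:ac"
RI_NS = "urn:example:ri"


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap a Storage Format fragment in a root that declares ac:/ri:, then parse it."""
    wrapped = f'<root xmlns:ac="{AC_NS}" xmlns:ri="{RI_NS}">{fragment}</root>'
    return ET.fromstring(wrapped)


fragment = (
    '<ac:structured-macro ac:name="info">'
    "<ac:rich-text-body><p>This is an info panel.</p></ac:rich-text-body>"
    "</ac:structured-macro>"
)

root = parse_fragment(fragment)
macro = root[0]  # the <ac:structured-macro> element
macro_name = macro.get(f"{{{AC_NS}}}name")  # namespaced attribute lookup
body_text = macro.find(f"{{{AC_NS}}}rich-text-body/p").text

print(macro_name, "->", body_text)
```

ElementTree addresses namespaced tags and attributes as `{uri}local-name`, which is why the lookups interpolate `AC_NS`.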

Examples

examples/basic_usage.py: minimal parsing and traversal
examples/advanced_usage.py: ids, paths, kinds, scopes, canonical URIs, table cells, helpers
examples/diagnostics_usage.py: reading document.metadata["diagnostics"] and link normalization

Supported Elements & Macros

Text Elements

Element	Type	Description
`<p>`	paragraph	Paragraph with text and formatting
`<h1>`-`<h6>`	heading	Heading levels 1-6
`<strong>`, `<em>`, `<u>`	text formatting	Bold, italic, underline
`<sub>`, `<sup>`, `<del>`	text formatting	Subscript, superscript, strikethrough
`<blockquote>`	quote	Block quotations
`<span>`	text span	Inline text with styling

Lists & Structure

Element	Type	Description
`<ul>`, `<ol>`, `<li>`	lists	Unordered and ordered lists
`<table>`, `<tr>`, `<td>`, `<th>`	table	Tables with headers and data
`<hr>`	horizontal rule	Horizontal dividers
`<br>`	line break	Line breaks

Layout Elements

Element	Type	Description
`<ac:layout>`	layout	Page layout container
`<ac:layout-section>`	layout section	Layout section with columns
`<ac:layout-cell>`	layout cell	Individual layout cell

Media Elements

Element	Type	Description
`<ac:image>`	image	Images with attachments or URLs

Interactive Elements

Element	Type	Description
`<ac:link>`	link	Links to pages, users, attachments
`<ac:task>`	task	Individual task elements
`<ac:task-list>`	task list	Task list containers
`<ac:emoticon>`	emoticon	Confluence emoticons and emojis
`<ac:placeholder>`	placeholder	Dynamic content placeholders
`<ac:inline-comment-marker>`	comment	Inline comment markers
`<time>`	date	Date and time elements

Macros

Macro	Type	Description
`info`, `warning`, `note`, `tip`	notification	Notification panels
`panel`	panel	Custom styled panels
`code`	code block	Syntax-highlighted code blocks
`status`	status	Status indicators
`jira`	jira	JIRA issue integration
`expand`	expand	Expandable content sections
`toc`	table of contents	Auto-generated table of contents
`view-file`	file viewer	File preview macro
`page-properties`, `page-properties-report`	page properties	Metadata tables and reports
`excerpt`, `excerpt-include`	excerpt	Reusable content snippets
`children-display`	children	List child pages
`attachments`	attachments	List page attachments
`gadget`	gadget	JIRA gadgets and widgets
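
The macro table above can be read as a lookup from macro name to a type label. The sketch below encodes it as a plain dictionary. The labels are copied from the Type column and may not match the `type` strings the parser actually emits; `macro_type` is a hypothetical helper, not part of the package.

```python
# Macro name -> type label, transcribed from the table above.
# Assumption: these labels are descriptive only.
MACRO_TYPES: dict[str, str] = {
    "info": "notification",
    "warning": "notification",
    "note": "notification",
    "tip": "notification",
    "panel": "panel",
    "code": "code block",
    "status": "status",
    "jira": "jira",
    "expand": "expand",
    "toc": "table of contents",
    "view-file": "file viewer",
    "page-properties": "page properties",
    "page-properties-report": "page properties",
    "excerpt": "excerpt",
    "excerpt-include": "excerpt",
    "children-display": "children",
    "attachments": "attachments",
    "gadget": "gadget",
}


def macro_type(name: str) -> str:
    """Return the type label for a macro name, or "unknown" if it is not listed."""
    return MACRO_TYPES.get(name, "unknown")


print(macro_type("warning"))
print(macro_type("xyz"))
```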

Advanced Elements

Element	Type	Description
`<ac:adf-extension>`	ADF extension	Atlassian Document Format extensions
`<ac:adf-node>`	ADF node	ADF node structures
`<at:i18n>`	internationalization	I18n elements

Advanced Usage

Working with Structured Data

from confluence_content_parser import ConfluenceParser
from confluence_content_parser.models import ContentElement

parser = ConfluenceParser()
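# Sample input; replace with the Storage Format body of your own page
confluence_content = "<h2>My Document</h2><p>This is a <strong>bold</strong> paragraph.</p>"
document = parser.parse(confluence_content)

def find_elements_by_type(elements: list[ContentElement], element_type: str) -> list[ContentElement]: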
    """Recursively find all elements of a specific type."""
    found = []
    for element in elements:
        if element.type == element_type:
            found.append(element)
        if hasattr(element, 'children') and element.children:
            found.extend(find_elements_by_type(element.children, element_type))
    return found

# Find all images in the document
images = find_elements_by_type(document.content, "image")
for image in images:
    print(f"Image: {image.image.alt} ({image.image.width}x{image.image.height})")

# Find all task lists
task_lists = find_elements_by_type(document.content, "task_list_container")
for task_list in task_lists:
    print(f"Task list with {len(task_list.task_list_container.tasks)} tasks")
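
For deeply nested pages, the same search can be written as a generator with an explicit stack, which avoids Python's recursion limit and lets callers stop early. The sketch below is self-contained: `Node` is a stand-in that models only the `type` and `children` attributes used above, not the library's `ContentElement`, and `iter_elements` is a hypothetical helper.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a parsed element; only `type` and `children` are modeled."""
    type: str
    children: list[Node] = field(default_factory=list)


def iter_elements(elements: list[Node]) -> Iterator[Node]:
    """Yield every element in document (pre-order) order without recursion."""
    stack = list(reversed(elements))
    while stack:
        node = stack.pop()
        yield node
        # Push children in reverse so the first child is visited next
        stack.extend(reversed(getattr(node, "children", None) or []))


tree = [
    Node("layout", [
        Node("heading"),
        Node("paragraph", [Node("image")]),
        Node("image"),
    ])
]

types = [node.type for node in iter_elements(tree)]
image_count = sum(1 for node in iter_elements(tree) if node.type == "image")
print(types, image_count)
```

Filtering then becomes a comprehension over `iter_elements(...)` instead of a dedicated function per element type.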

Custom Processing

from confluence_content_parser import ConfluenceParser

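def extract_text_content(element):
    """Extract plain text from an element and all of its descendants."""
    text_parts = []

    # Not every element carries text, so read the attribute defensively
    text = getattr(element, 'text', None)
    if text:
        text_parts.append(text)

    # Recurse into child elements, if any
    for child in getattr(element, 'children', None) or []:
        text_parts.append(extract_text_content(child))

    # Drop empty strings before joining
    return ' '.join(filter(None, text_parts))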

parser = ConfluenceParser()
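# Sample input; replace with the Storage Format body of your own page
content = "<p>This is a <strong>bold</strong> paragraph.</p>"
document = parser.parse(content)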

# Extract all text content
full_text = ' '.join(extract_text_content(elem) for elem in document.content)
print(f"Document text: {full_text}")
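
Text gathered this way can still carry indentation and line breaks from the source markup. A small post-processing step, sketched below, collapses every run of whitespace into a single space. `normalize_whitespace` is a hypothetical helper, not part of the package.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces and trim the ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


raw = "  This is a   bold \n    paragraph.  "
print(normalize_whitespace(raw))
```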

Error Handling

from confluence_content_parser import ConfluenceParser
from lxml.etree import XMLSyntaxError

parser = ConfluenceParser()

# Sample malformed input: the <p> tag is never closed
malformed_content = "<p>Unclosed paragraph"

try:
    document = parser.parse(malformed_content)
except XMLSyntaxError as e:
    print(f"XML parsing error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
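
When many pages are processed in a batch, it can be convenient to turn parse failures into values instead of letting one page abort the run. The sketch below wraps any parse callable and returns a `(result, error)` pair. `try_parse` is a hypothetical helper; the demonstration uses the standard library's `ET.fromstring` and `ET.ParseError` so it runs on its own, and with this package you would presumably pass `parser.parse` and `(XMLSyntaxError,)` as in the example above.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def try_parse(
    parse: Callable[[str], T],
    content: str,
    expected: tuple[type[BaseException], ...],
) -> tuple[T | None, str | None]:
    """Call parse(content); return (result, None) or (None, error message)."""
    try:
        return parse(content), None
    except expected as exc:
        # Only the listed exception types are converted; anything else propagates
        return None, f"{type(exc).__name__}: {exc}"


good, good_error = try_parse(ET.fromstring, "<p>ok</p>", (ET.ParseError,))
bad, bad_error = try_parse(ET.fromstring, "<p>Unclosed paragraph", (ET.ParseError,))

print(good.tag, good_error)
print(bad, bad_error)
```

Listing the expected exception types explicitly keeps genuine bugs visible instead of hiding them behind a blanket `except Exception`.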

Diagnostics

The parser collects non-fatal parsing notes (e.g., unknown macros) in document.metadata["diagnostics"].

from confluence_content_parser import ConfluenceParser

parser = ConfluenceParser()
doc = parser.parse('<ac:structured-macro ac:name="xyz"/>')
diagnostics = doc.metadata.get("diagnostics") or []
for d in diagnostics:
    print(d)
# See examples/diagnostics_usage.py for a complete example
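
To make the idea of a non-fatal note concrete, the sketch below scans a fragment with the standard library and records macro names it does not recognize, without raising. It is an illustration only: the note format, the `KNOWN_MACROS` subset, the placeholder namespace URI, and the `unknown_macros` helper are assumptions, and the entries the library stores under `document.metadata["diagnostics"]` may look different.

```python
import xml.etree.ElementTree as ET

AC_NS = "urn:example:ac"  # placeholder namespace URI (assumption)

# Subset of the macro names listed under "Macros" above
KNOWN_MACROS = {"info", "warning", "note", "tip", "panel", "code", "status", "expand", "toc"}


def unknown_macros(fragment: str) -> list[str]:
    """Return one note per structured macro whose name is not in KNOWN_MACROS."""
    root = ET.fromstring(f'<root xmlns:ac="{AC_NS}">{fragment}</root>')
    notes = []
    for macro in root.iter(f"{{{AC_NS}}}structured-macro"):
        name = macro.get(f"{{{AC_NS}}}name")
        if name not in KNOWN_MACROS:
            notes.append(f"unknown macro: {name}")
    return notes


sample = '<ac:structured-macro ac:name="xyz"/><ac:structured-macro ac:name="info"/>'
notes = unknown_macros(sample)
print(notes)
```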

Development

Setup

# Clone the repository
git clone https://github.com/your-repo/confluence-content-parser.git
cd confluence-content-parser

# Install dependencies with uv
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=confluence_content_parser --cov-report=html

Project Structure

src/confluence_content_parser/
├── __init__.py           # Main exports
├── parser.py            # Core parser implementation  
└── models/              # Pydantic data models
    ├── __init__.py      # Model exports
    ├── base.py          # Core ContentElement model
    ├── extensions.py    # Extension models (Panel, Task, etc.)
    ├── layout.py        # Layout models
    ├── links.py         # Link models  
    ├── macros.py        # Macro models
    ├── media.py         # Media models (Image)
    ├── metadata.py      # Metadata models
    ├── misc.py          # Miscellaneous models
    ├── tables.py        # Table models
    └── tasks.py         # Task models

Running Tests

# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=confluence_content_parser --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_parser.py

# Run with verbose output
uv run pytest -v

Code Quality

# Format code
uv run black src/ tests/

# Lint code  
uv run ruff check src/ tests/

# Type checking
uv run mypy src/

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

1. Fork the repository
2. Create a feature branch: git checkout -b feature/amazing-feature
3. Make your changes with tests
4. Ensure all tests pass: uv run pytest
5. Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Built with lxml for robust XML parsing
Uses Pydantic for data validation and serialization
Inspired by the Confluence Storage Format specification

Project details

Release history

0.2.1: Sep 23, 2025
0.2.0: Sep 16, 2025
0.1.0: Sep 9, 2025 (this version)

Download files

Download the file for your platform.

Source Distribution

confluence_content_parser-0.1.0.tar.gz (15.7 kB)

Uploaded Sep 9, 2025 Source

Built Distribution

confluence_content_parser-0.1.0-py3-none-any.whl (21.3 kB)

Uploaded Sep 9, 2025 Python 3

File details

Details for the file confluence_content_parser-0.1.0.tar.gz.

File metadata

Download URL: confluence_content_parser-0.1.0.tar.gz
Upload date: Sep 9, 2025
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for confluence_content_parser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0a8eb684d2d2df0d4b975e5ac5e018593c0332d85624ec2f47f7b6c984a9e6f2`
MD5	`0c2134ccfe194eb6ce8d41aa045b562c`
BLAKE2b-256	`c794517dad7849efdbc7ec79b55dc6a2c6f295243ede9d86d033771faa824ed7`
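
A downloaded file can be checked against the SHA256 digest in the table above before it is installed. The sketch below uses `hashlib` from the standard library; `sha256_of_stream` and `sha256_of_file` are hypothetical helper names, and the in-memory demo hashes the bytes `abc` only to show that the function runs.

```python
import hashlib
import io
from pathlib import Path
from typing import BinaryIO

# SHA256 digest listed above for confluence_content_parser-0.1.0.tar.gz
EXPECTED_SDIST_SHA256 = "0a8eb684d2d2df0d4b975e5ac5e018593c0332d85624ec2f47f7b6c984a9e6f2"


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    with path.open("rb") as handle:
        return sha256_of_stream(handle)


# In-memory demo so the sketch runs without a downloaded file
demo_digest = sha256_of_stream(io.BytesIO(b"abc"))
print(demo_digest)

if __name__ == "__main__":
    sdist = Path("confluence_content_parser-0.1.0.tar.gz")
    if sdist.exists():
        print("digest matches:", sha256_of_file(sdist) == EXPECTED_SDIST_SHA256)
```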

File details

Details for the file confluence_content_parser-0.1.0-py3-none-any.whl.

File metadata

Download URL: confluence_content_parser-0.1.0-py3-none-any.whl
Upload date: Sep 9, 2025
Size: 21.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for confluence_content_parser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4396a42b55a7bd4bdbd77f265a02ae5050c8eb5db7b590d12c1cf47a9e5a4d14`
MD5	`7579a6ab6317df28f30e7457b0fb70e4`
BLAKE2b-256	`6246efb2d9e8ec6d9937b78bfd48c68983e2faf3d40052d2ead6552a6796e289`
