A Python library for parsing Confluence Storage Format content into structured data
Confluence Content Parser
Important: This is an early-stage release. The API may change and using it in production carries risk. Pin versions and evaluate carefully before deployment.
A Python library that parses Confluence Storage Format content into structured, validated data models built on Pydantic.
Features
✨ Comprehensive Coverage: Supports 40+ Confluence Storage Format elements and macros
🚀 High Performance: Built with lxml for fast XML parsing
🏗️ Structured Data: Uses Pydantic models for type-safe, validated data structures
📝 Modern Python: Built for Python 3.12+ with full type hints
🔧 Extensible: Clean architecture makes it easy to add new element types
Installation
# Using uv (recommended)
uv add confluence-content-parser
# Using pip
pip install confluence-content-parser
Quick Start
from confluence_content_parser import ConfluenceParser
# Initialize the parser
parser = ConfluenceParser()
# Parse Confluence Storage Format content
content = """
<ac:layout>
<ac:layout-section ac:type="fixed-width">
<ac:layout-cell>
<h2>My Document</h2>
<p>This is a <strong>bold</strong> paragraph.</p>
<ac:structured-macro ac:name="info">
<ac:rich-text-body>
<p>This is an info panel.</p>
</ac:rich-text-body>
</ac:structured-macro>
</ac:layout-cell>
</ac:layout-section>
</ac:layout>
"""
# Parse the content
document = parser.parse(content)
# Access the structured data
print(f"Document contains {len(document.content)} top-level elements")
# Navigate the structure
layout = document.content[0]
section = layout.children[0]
cell_content = section.layout_section.cells[0].content
for element in cell_content:
    print(f"Element type: {element.type}")
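Storage-format fragments such as the one above use the ac: prefix without declaring it, so a namespace-aware XML parser rejects them as-is. The sketch below is not taken from this library and may not reflect how ConfluenceParser works internally; it only illustrates, using the standard library's xml.etree.ElementTree, one way to make such a fragment parseable by wrapping it in a root element that declares the prefixes. The namespace URIs and the parse_fragment helper are placeholders invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, chosen only for this sketch (assumption).
NAMESPACES = {"ac": "urn:example:ac", "ri": "urn:example:ri"}


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap a fragment in a root element that declares the ac:/ri: prefixes."""
    declarations = " ".join(
        f'xmlns:{prefix}="{uri}"' for prefix, uri in NAMESPACES.items()
    )
    return ET.fromstring(f"<root {declarations}>{fragment}</root>")


fragment = (
    '<ac:structured-macro ac:name="info">'
    "<ac:rich-text-body><p>This is an info panel.</p></ac:rich-text-body>"
    "</ac:structured-macro>"
)

# Parsing the raw fragment fails because the ac: prefix is unbound.
try:
    ET.fromstring(fragment)
    raw_parse_ok = True
except ET.ParseError:
    raw_parse_ok = False

# Parsing the wrapped fragment succeeds; tags and attributes are namespaced.
root = parse_fragment(fragment)
ac = NAMESPACES["ac"]
macro_names = [
    macro.get(f"{{{ac}}}name") for macro in root.iter(f"{{{ac}}}structured-macro")
]
print(raw_parse_ok, macro_names)
```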
Examples
- examples/basic_usage.py: minimal parsing and traversal
- examples/advanced_usage.py: ids, paths, kinds, scopes, canonical URIs, table cells, helpers
- examples/diagnostics_usage.py: reading document.metadata["diagnostics"] and link normalization
Supported Elements & Macros
Text Elements
| Element | Type | Description |
|---|---|---|
| <p> | paragraph | Paragraph with text and formatting |
| <h1>-<h6> | heading | Heading levels 1-6 |
| <strong>, <em>, <u> | text formatting | Bold, italic, underline |
| <sub>, <sup>, <del> | text formatting | Subscript, superscript, strikethrough |
| <blockquote> | quote | Block quotations |
| <span> | text span | Inline text with styling |
Lists & Structure
| Element | Type | Description |
|---|---|---|
| <ul>, <ol>, <li> | lists | Unordered and ordered lists |
| <table>, <tr>, <td>, <th> | table | Tables with headers and data |
| <hr> | horizontal rule | Horizontal dividers |
| <br> | line break | Line breaks |
Layout Elements
| Element | Type | Description |
|---|---|---|
| <ac:layout> | layout | Page layout container |
| <ac:layout-section> | layout section | Layout section with columns |
| <ac:layout-cell> | layout cell | Individual layout cell |
Media Elements
| Element | Type | Description |
|---|---|---|
| <ac:image> | image | Images with attachments or URLs |
Interactive Elements
| Element | Type | Description |
|---|---|---|
| <ac:link> | link | Links to pages, users, attachments |
| <ac:task> | task | Individual task elements |
| <ac:task-list> | task list | Task list containers |
| <ac:emoticon> | emoticon | Confluence emoticons and emojis |
| <ac:placeholder> | placeholder | Dynamic content placeholders |
| <ac:inline-comment-marker> | comment | Inline comment markers |
| <time> | date | Date and time elements |
Macros
| Macro | Type | Description |
|---|---|---|
| info, warning, note, tip | notification | Notification panels |
| panel | panel | Custom styled panels |
| code | code block | Syntax-highlighted code blocks |
| status | status | Status indicators |
| jira | jira | JIRA issue integration |
| expand | expand | Expandable content sections |
| toc | table of contents | Auto-generated table of contents |
| view-file | file viewer | File preview macro |
| page-properties, page-properties-report | page properties | Metadata tables and reports |
| excerpt, excerpt-include | excerpt | Reusable content snippets |
| children-display | children | List child pages |
| attachments | attachments | List page attachments |
| gadget | gadget | JIRA gadgets and widgets |
Advanced Elements
| Element | Type | Description |
|---|---|---|
| <ac:adf-extension> | ADF extension | Atlassian Document Format extensions |
| <ac:adf-node> | ADF node | ADF node structures |
| <at:i18n> | internationalization | I18n elements |
Advanced Usage
Working with Structured Data
from confluence_content_parser import ConfluenceParser
from confluence_content_parser.models import ContentElement
parser = ConfluenceParser()
document = parser.parse(confluence_content)
def find_elements_by_type(elements: list[ContentElement], element_type: str):
    """Recursively find all elements of a specific type."""
    found = []
    for element in elements:
        if element.type == element_type:
            found.append(element)
        if hasattr(element, 'children') and element.children:
            found.extend(find_elements_by_type(element.children, element_type))
    return found
# Find all images in the document
images = find_elements_by_type(document.content, "image")
for image in images:
    print(f"Image: {image.image.alt} ({image.image.width}x{image.image.height})")
# Find all task lists
task_lists = find_elements_by_type(document.content, "task_list_container")
for task_list in task_lists:
    print(f"Task list with {len(task_list.task_list_container.tasks)} tasks")
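The recursive search above can also be written as a generator that yields every element depth-first, which lets callers filter by any condition without building intermediate lists. The following is a minimal sketch that runs without the library installed: Node is a hypothetical stand-in that only mimics the type and children attributes used above, and walk is an illustrative helper, not part of the package.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed element: a type tag plus children."""

    type: str
    children: list["Node"] = field(default_factory=list)


def walk(elements: list[Node]) -> Iterator[Node]:
    """Yield every element depth-first, parents before their children."""
    for element in elements:
        yield element
        yield from walk(element.children)


# A small hand-built tree standing in for document.content.
tree = [
    Node(
        "layout",
        [Node("layout_section", [Node("image"), Node("paragraph", [Node("image")])])],
    ),
    Node("image"),
]

image_count = sum(1 for node in walk(tree) if node.type == "image")
types_in_order = [node.type for node in walk(tree)]
print(image_count)
print(types_in_order)
```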
Custom Processing
from confluence_content_parser import ConfluenceParser
def extract_text_content(element):
    """Extract plain text from any element."""
    text_parts = []
    if element.text:
        text_parts.append(element.text)
    if hasattr(element, 'children') and element.children:
        for child in element.children:
            text_parts.append(extract_text_content(child))
    return ' '.join(filter(None, text_parts))
parser = ConfluenceParser()
document = parser.parse(content)
# Extract all text content
full_text = ' '.join(extract_text_content(elem) for elem in document.content)
print(f"Document text: {full_text}")
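Joining text parts with single spaces can leave doubled spaces or stray newlines when the source text already carries its own whitespace. A small post-processing step, sketched here with a hypothetical normalize_whitespace helper that is not part of the library, collapses every run of whitespace into one space.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading and trailing whitespace.
    return " ".join(text.split())


raw_text = "This is a  bold \n paragraph.  "
clean_text = normalize_whitespace(raw_text)
print(clean_text)
```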
Error Handling
from confluence_content_parser import ConfluenceParser
from lxml.etree import XMLSyntaxError
parser = ConfluenceParser()
try:
    document = parser.parse(malformed_content)
except XMLSyntaxError as e:
    print(f"XML parsing error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Diagnostics
The parser collects non-fatal parsing notes (e.g., unknown macros) in document.metadata["diagnostics"].
from confluence_content_parser import ConfluenceParser
parser = ConfluenceParser()
doc = parser.parse('<ac:structured-macro ac:name="xyz"/>')
diagnostics = doc.metadata.get("diagnostics") or []
for d in diagnostics:
    print(d)
# See examples/diagnostics_usage.py for a complete example
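For larger documents, repeated diagnostics are easier to read as counts than as a flat list. The sketch below assumes each diagnostic entry has a meaningful string form, which the source does not specify; summarize_diagnostics is a hypothetical helper, and the sample messages are invented for illustration rather than real parser output.

```python
from collections import Counter


def summarize_diagnostics(diagnostics: list[object]) -> Counter[str]:
    """Count identical diagnostics by their string form."""
    return Counter(str(entry) for entry in diagnostics)


# Invented sample messages; real entries come from doc.metadata["diagnostics"].
sample = ["unknown macro: xyz", "unknown macro: xyz", "unknown macro: abc"]
summary = summarize_diagnostics(sample)
for message, count in summary.most_common():
    print(f"{count}x {message}")
```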
Development
Setup
# Clone the repository
git clone https://github.com/your-repo/confluence-content-parser.git
cd confluence-content-parser
# Install dependencies with uv
uv sync --dev
# Run tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=confluence_content_parser --cov-report=html
Project Structure
src/confluence_content_parser/
├── __init__.py # Main exports
├── parser.py # Core parser implementation
└── models/ # Pydantic data models
    ├── __init__.py # Model exports
    ├── base.py # Core ContentElement model
    ├── extensions.py # Extension models (Panel, Task, etc.)
    ├── layout.py # Layout models
    ├── links.py # Link models
    ├── macros.py # Macro models
    ├── media.py # Media models (Image)
    ├── metadata.py # Metadata models
    ├── misc.py # Miscellaneous models
    ├── tables.py # Table models
    └── tasks.py # Task models
Running Tests
# Run all tests
uv run pytest
# Run with coverage report
uv run pytest --cov=confluence_content_parser --cov-report=term-missing
# Run specific test file
uv run pytest tests/test_parser.py
# Run with verbose output
uv run pytest -v
Code Quality
# Format code
uv run black src/ tests/
# Lint code
uv run ruff check src/ tests/
# Type checking
uv run mypy src/
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Workflow
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes with tests
- Ensure all tests pass: uv run pytest
- Submit a pull request
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.