A Python library for parsing Confluence Storage Format content into structured data
Confluence Content Parser
Important: This is an early-stage release. The API may change and using it in production carries risk. Pin versions and evaluate carefully before deployment.
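One way to act on the pinning advice is to check at startup that the installed release is the one you evaluated. The sketch below uses only the standard library's importlib.metadata; the helper names are hypothetical, and 0.2.0 is just a placeholder pin.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def matches_pin(dist_name: str, pinned: str) -> bool:
    """True only when the distribution is installed at exactly `pinned`."""
    return installed_version(dist_name) == pinned


if __name__ == "__main__":
    # Placeholder pin; replace with the release you have actually evaluated.
    print(matches_pin("confluence-content-parser", "0.2.0"))
```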
A Python library that parses Confluence Storage Format content into structured, Pydantic-based data models.
Features
✨ Comprehensive Coverage: Supports 40+ Confluence Storage Format elements and macros
🚀 High Performance: Built with lxml for fast XML parsing
🏗️ Structured Data: Uses Pydantic models for type-safe, validated data structures
📝 Modern Python: Built for Python 3.12+ with full type hints
🔧 Extensible: Clean architecture makes it easy to add new element types
Installation
# Using uv (recommended)
uv add confluence-content-parser
# Using pip
pip install confluence-content-parser
Quick Start
from confluence_content_parser import ConfluenceParser
# Initialize the parser
parser = ConfluenceParser()
# Parse Confluence Storage Format content
content = """
<ac:layout>
<ac:layout-section ac:type="fixed-width">
<ac:layout-cell>
<h2>My Document</h2>
<p>This is a <strong>bold</strong> paragraph.</p>
<ac:structured-macro ac:name="info">
<ac:rich-text-body>
<p>This is an info panel.</p>
</ac:rich-text-body>
</ac:structured-macro>
</ac:layout-cell>
</ac:layout-section>
</ac:layout>
"""
# Parse the content
document = parser.parse(content)
# Access the structured data
print(f"Document text: {document.text}")
# Find all nodes of specific types
from confluence_content_parser import HeadingElement, PanelMacro
headings = document.find_all(HeadingElement)
panels = document.find_all(PanelMacro)
print(f"Found {len(headings)} headings and {len(panels)} panels")
# Navigate the structure
for node in document.walk():
    print(f"Node type: {type(node).__name__}")
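Storage Format fragments like the one above use the ac: prefix without declaring it, so a generic XML parser rejects them as-is. How ConfluenceParser deals with this internally is not described here. The sketch below shows one common workaround using the standard library's xml.etree.ElementTree (not lxml): wrap the fragment in a root element that declares the prefixes. The namespace URIs and the parse_fragment helper are placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions for this sketch, not official values.
AC_NS = "urn:example:confluence:ac"
RI_NS = "urn:example:confluence:ri"


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap a Storage Format fragment in a root that declares ac: and ri:."""
    wrapped = f'<root xmlns:ac="{AC_NS}" xmlns:ri="{RI_NS}">{fragment}</root>'
    return ET.fromstring(wrapped)


fragment = (
    "<h2>My Document</h2>"
    '<ac:structured-macro ac:name="info">'
    "<ac:rich-text-body><p>This is an info panel.</p></ac:rich-text-body>"
    "</ac:structured-macro>"
)

root = parse_fragment(fragment)

# Namespaced tags and attributes are addressed as {uri}local-name.
macros = list(root.iter(f"{{{AC_NS}}}structured-macro"))
macro_names = [macro.get(f"{{{AC_NS}}}name") for macro in macros]
print(macro_names)  # ['info']
```

Named HTML entities such as `&nbsp;` are not covered by this wrapper and would need separate handling.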
Examples
- examples/basic_usage.py: Basic parsing, text extraction, and element traversal
- examples/advanced_usage.py: Complex layouts, macros, nested content analysis
- examples/diagnostics_usage.py: Error handling, unknown elements, and parsing diagnostics
Supported Elements & Macros
Text Elements
| Element | Node Class | Description |
|---|---|---|
| <p> | TextBreakElement | Paragraph with text and formatting |
| <h1>-<h6> | HeadingElement | Heading levels 1-6 |
| <strong>, <em>, <u> | TextEffectElement | Bold, italic, underline |
| <sub>, <sup>, <del> | TextEffectElement | Subscript, superscript, strikethrough |
| <blockquote> | TextEffectElement | Block quotations |
| <span> | TextEffectElement | Inline text with styling |
| <code> | TextEffectElement | Inline code formatting |
| Text content | Text | Plain text nodes |
Lists & Structure
| Element | Node Class | Description |
|---|---|---|
| <ul>, <ol> | ListElement | Unordered and ordered lists |
| <li> | ListItem | List items (regular and tasks) |
| <ac:task-list> | ListElement | Task lists |
| <ac:task> | ListItem | Individual task items |
| <table> | Table | Tables with headers and data |
| <tr> | TableRow | Table rows |
| <td>, <th> | TableCell | Table cells |
| <hr> | TextBreakElement | Horizontal dividers |
| <br> | TextBreakElement | Line breaks |
Layout Elements
| Element | Node Class | Description |
|---|---|---|
| <ac:layout> | LayoutElement | Page layout container |
| <ac:layout-section> | LayoutSection | Layout section with columns |
| <ac:layout-cell> | LayoutCell | Individual layout cell |
Media Elements
| Element | Node Class | Description |
|---|---|---|
| <ac:image> | Image | Images with attachments or URLs |
Interactive Elements
| Element | Node Class | Description |
|---|---|---|
| <ac:link> | LinkElement | Links to pages, users, attachments |
| <a> | LinkElement | External links and mailto |
| <ac:emoticon> | Emoticon | Confluence emoticons and emojis |
| <ac:placeholder> | PlaceholderElement | Dynamic content placeholders |
| <time> | Time | Date and time elements |
| <ri:*> | ResourceIdentifier | Resource identifiers (pages, attachments, etc.) |
Macros
| Macro | Node Class | Description |
|---|---|---|
| info, warning, note, tip | PanelMacro | Notification panels |
| panel | PanelMacro | Custom styled panels |
| code | CodeMacro | Syntax-highlighted code blocks |
| status | StatusMacro | Status indicators |
| jira | JiraMacro | JIRA issue integration |
| expand | ExpandMacro | Expandable content sections |
| details | DetailsMacro | Collapsible content sections |
| toc | TocMacro | Auto-generated table of contents |
| view-file | ViewFileMacro | File preview macro |
| viewpdf | ViewPdfMacro | PDF viewer macro |
| excerpt | ExcerptMacro | Content excerpts |
| excerpt-include | ExcerptIncludeMacro | Include content excerpts |
| include | IncludeMacro | Include other pages |
| attachments | AttachmentsMacro | List page attachments |
| profile | ProfileMacro | User profile display |
| anchor | AnchorMacro | Page anchors |
| tasks-report-macro | TasksReportMacro | Task reports |
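The macro table amounts to a lookup from a macro's ac:name value to a node class. A minimal sketch of that lookup follows. It stores the class names from the table as strings rather than importing the real classes, and the resolve_macro helper is hypothetical; it does not claim to mirror the library's implementation.

```python
from typing import Optional

# Macro name -> node class name, copied from the table above.
MACRO_NODE_CLASSES: dict[str, str] = {
    "info": "PanelMacro",
    "warning": "PanelMacro",
    "note": "PanelMacro",
    "tip": "PanelMacro",
    "panel": "PanelMacro",
    "code": "CodeMacro",
    "status": "StatusMacro",
    "jira": "JiraMacro",
    "expand": "ExpandMacro",
    "details": "DetailsMacro",
    "toc": "TocMacro",
    "view-file": "ViewFileMacro",
    "viewpdf": "ViewPdfMacro",
    "excerpt": "ExcerptMacro",
    "excerpt-include": "ExcerptIncludeMacro",
    "include": "IncludeMacro",
    "attachments": "AttachmentsMacro",
    "profile": "ProfileMacro",
    "anchor": "AnchorMacro",
    "tasks-report-macro": "TasksReportMacro",
}


def resolve_macro(name: str) -> Optional[str]:
    """Return the node class name for a macro, or None if it is not listed."""
    return MACRO_NODE_CLASSES.get(name)


print(resolve_macro("tip"))            # PanelMacro
print(resolve_macro("unknown-macro"))  # None
```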
Advanced Elements
| Element | Node Class | Description |
|---|---|---|
| <ac:adf-extension> | PanelMacro, DecisionList | ADF panel and decision list extensions |
| Decision lists | DecisionList | Decision tracking lists |
| Decision items | DecisionListItem | Individual decision items |
| Fragment | Fragment | Container for multiple top-level nodes |
Advanced Usage
Working with Structured Data
from confluence_content_parser import ConfluenceParser, Image, ListElement, ListType
parser = ConfluenceParser()
document = parser.parse(confluence_content)
# Find all images in the document
images = document.find_all(Image)
for image in images:
    print(f"Image: {image.alt or 'No alt text'} ({image.width}x{image.height})")
# Find all task lists
all_lists = document.find_all(ListElement)
task_lists = [lst for lst in all_lists if lst.type == ListType.TASK]
for task_list in task_lists:
    print(f"Task list with {len(task_list.children)} tasks")
# Walk through all nodes in the document
for node in document.walk():
    if hasattr(node, 'text') and node.text:
        print(f"Text node: {node.text[:50]}...")
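To make the traversal calls above concrete, here is a minimal sketch of how walk() and find_all() could behave on a node tree: a depth-first, pre-order generator plus an isinstance filter. The Sketch* classes are stand-ins built with dataclasses, not the library's Pydantic models, and the real traversal order and return types may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, Type, TypeVar

T = TypeVar("T", bound="SketchNode")


@dataclass
class SketchNode:
    """Stand-in for a parsed node; only holds child nodes."""
    children: list["SketchNode"] = field(default_factory=list)

    def walk(self) -> Iterator["SketchNode"]:
        """Yield this node, then every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def find_all(self, node_type: Type[T]) -> list[T]:
        """Collect every node in the subtree that is an instance of node_type."""
        return [node for node in self.walk() if isinstance(node, node_type)]


@dataclass
class SketchHeading(SketchNode):
    level: int = 1


@dataclass
class SketchText(SketchNode):
    text: str = ""


tree = SketchNode(children=[
    SketchHeading(level=2, children=[SketchText(text="My Document")]),
    SketchText(text="body"),
])

print([type(node).__name__ for node in tree.walk()])
print([node.text for node in tree.find_all(SketchText)])  # ['My Document', 'body']
```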
Custom Processing
from confluence_content_parser import ConfluenceParser, Text
parser = ConfluenceParser()
document = parser.parse(content)
# Extract all text content (built-in method)
full_text = document.text
print(f"Document text: {full_text}")
# Or manually collect text nodes
text_nodes = document.find_all(Text)
all_text = " ".join(node.text for node in text_nodes)
print(f"All text: {all_text}")
# Custom traversal
def find_nodes_with_condition(document, condition_func):
    """Find all nodes matching a custom condition."""
    matching_nodes = []
    for node in document.walk():
        if condition_func(node):
            matching_nodes.append(node)
    return matching_nodes
# Example: Find all nodes that contain specific text
nodes_with_api = find_nodes_with_condition(
    document,
    # "or ''" guards against a missing or None text attribute
    lambda node: 'API' in (getattr(node, 'text', None) or '')
)
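Another simple form of custom processing is a histogram of node types, which gives a quick picture of what a page contains. The sketch below works on any iterable of objects, so in principle it could be fed the result of document.walk(); the Heading and Text classes here are local stand-ins for demonstration, not the library's classes.

```python
from collections import Counter
from typing import Iterable


def count_node_types(nodes: Iterable[object]) -> Counter:
    """Count nodes by class name."""
    return Counter(type(node).__name__ for node in nodes)


# Stand-in classes so the example runs without the library installed.
class Heading:
    pass


class Text:
    pass


counts = count_node_types([Heading(), Text(), Text()])
print(counts.most_common())  # [('Text', 2), ('Heading', 1)]
```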
Error Handling
from confluence_content_parser import ConfluenceParser, ParsingError
import xml.etree.ElementTree as ET
# Default behavior: collect diagnostics without raising errors
parser = ConfluenceParser(raise_on_finish=False)
try:
    document = parser.parse(malformed_content)
    # Check diagnostics for any issues
    diagnostics = document.metadata.get("diagnostics", [])
    if diagnostics:
        print(f"Parsing issues found: {diagnostics}")
except ET.ParseError as e:
    print(f"XML parsing error: {e}")
# Strict parsing: raise errors for unknown elements
strict_parser = ConfluenceParser(raise_on_finish=True)
try:
    document = strict_parser.parse(content_with_unknown_elements)
except ParsingError as e:
    print(f"Parsing failed with diagnostics: {e.diagnostics}")
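The two modes above follow a common pattern: collect issues during a full pass, then either return them or raise once at the end. The sketch below illustrates that pattern in isolation. SketchParsingError, KNOWN_MACROS, and check_macros are hypothetical names, and the behaviour is inferred from the raise_on_finish flag and the e.diagnostics attribute shown above rather than taken from the library's source.

```python
class SketchParsingError(Exception):
    """Raised after a full pass when strict mode is on and issues were found."""

    def __init__(self, diagnostics: list[str]) -> None:
        super().__init__(f"{len(diagnostics)} parsing issue(s)")
        self.diagnostics = diagnostics


# Tiny stand-in for the set of macros a parser understands.
KNOWN_MACROS = {"info", "code"}


def check_macros(names: list[str], raise_on_finish: bool = False) -> list[str]:
    """Collect a diagnostic for every unknown macro, then optionally raise."""
    diagnostics = [
        f"unknown_macro:{name}" for name in names if name not in KNOWN_MACROS
    ]
    # The whole input is processed first; strict mode only raises afterwards.
    if raise_on_finish and diagnostics:
        raise SketchParsingError(diagnostics)
    return diagnostics


print(check_macros(["info", "bogus"]))  # ['unknown_macro:bogus']

try:
    check_macros(["bogus"], raise_on_finish=True)
except SketchParsingError as exc:
    print(exc.diagnostics)  # ['unknown_macro:bogus']
```

Raising only at the end means a strict run still reports every issue in one go instead of stopping at the first one.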
Diagnostics
The parser collects non-fatal parsing notes (e.g., unknown macros) in document.metadata["diagnostics"].
from confluence_content_parser import ConfluenceParser
parser = ConfluenceParser(raise_on_finish=False)
doc = parser.parse('<ac:structured-macro ac:name="unknown-macro"/>')
diagnostics = doc.metadata.get("diagnostics", [])
for diagnostic in diagnostics:
    print(diagnostic)  # Outputs: unknown_macro:unknown-macro
# See examples/diagnostics_usage.py for a complete example
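When a page produces many diagnostics, grouping them by kind makes them easier to review. The sketch below assumes each diagnostic is a string of the form kind:detail, which is inferred only from the sample output above (unknown_macro:unknown-macro); group_diagnostics is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_diagnostics(diagnostics: list[str]) -> dict[str, list[str]]:
    """Group 'kind:detail' strings by kind, preserving input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in diagnostics:
        # partition splits on the first colon only; detail is '' if none exists.
        kind, _, detail = entry.partition(":")
        grouped[kind].append(detail)
    return dict(grouped)


sample = ["unknown_macro:unknown-macro", "unknown_macro:my-custom-macro"]
print(group_diagnostics(sample))
# {'unknown_macro': ['unknown-macro', 'my-custom-macro']}
```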
Development
Setup
# Clone the repository
git clone https://github.com/Unificon/confluence-content-parser.git
cd confluence-content-parser
# Install dependencies with uv
uv sync --dev
# Run tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=confluence_content_parser --cov-report=html
Project Structure
src/confluence_content_parser/
├── __init__.py # Main exports
├── parser.py # Core parser implementation
├── document.py # ConfluenceDocument model
└── nodes.py # All node types and models
Running Tests
# Run all tests
uv run pytest
# Run with coverage report
uv run pytest --cov=confluence_content_parser --cov-report=term-missing
# Run specific test file
uv run pytest tests/test_parser.py
# Run with verbose output
uv run pytest -v
Code Quality
# Format code
uv run black src/ tests/
# Lint code
uv run ruff check src/ tests/
# Type checking
uv run mypy src/
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Workflow
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes with tests
- Ensure all tests pass: uv run pytest
- Submit a pull request
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
- Built with lxml for robust XML parsing
- Uses Pydantic for data validation and serialization
- Uses types-lxml for lxml type annotations
- Inspired by the Confluence Storage Format specification
File details
Details for the file confluence_content_parser-0.2.0.tar.gz.
File metadata
- Download URL: confluence_content_parser-0.2.0.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2eb9756d5b6b7ca732c447403d52e12ef53305bc8784592eaf35a98e803ed41d |
| MD5 | 87b686709f0f334b040be205b631e56b |
| BLAKE2b-256 | 8dc47daed87da8a9c8ee681c2cc86688d39e01b7abd719992d17f86f07c93c8c |
File details
Details for the file confluence_content_parser-0.2.0-py3-none-any.whl.
File metadata
- Download URL: confluence_content_parser-0.2.0-py3-none-any.whl
- Upload date:
- Size: 18.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf62fb5b40fcdbe95975325cb6b9e971758c771828c114ca722fe8661428a62d |
| MD5 | 389c7f697f2de3fcd6eb4aeb22dde922 |
| BLAKE2b-256 | 79263b75f01acbbb0987c58e01959aae3a61e81b73a820d565c194ee477cef57 |