Convert OOXML SmartArt diagrams to Markdown

smartart2md

Convert OOXML SmartArt diagrams to Markdown lists. Supports .pptx, .xlsx, and .docx files with no external dependencies.

Installation

pip install smartart2md

Quick Start

from smartart2md import convert_smartart, load_smartart_parts

for root, ctx in load_smartart_parts("presentation.pptx"):
    md, images = convert_smartart(root, ctx)
    print(md)

Output:

- Root item
  - Child item
  - Child item
- Root item
  - Child item

CLI

smartart2md input.pptx                  # print all SmartArt to stdout
smartart2md input.pptx -o output.md     # save to file
smartart2md diagram.xml                 # parse a dataModel XML directly

When saving to a file with -o output.md, images embedded in SmartArt nodes are automatically saved to an output_assets/ directory next to the output file, and placeholders in the Markdown are replaced with image links.

API

load_smartart_parts(path)

Scans an OOXML file and returns a list of (root, ctx) pairs, one per SmartArt diagram. root is an ET.Element (the dgm:dataModel XML root) and ctx is a ZipContext that the converter uses to access embedded images.

For .pptx files, slide order is preserved. For .xlsx and .docx, diagrams are returned in filename sort order.

from smartart2md import load_smartart_parts, convert_smartart

for root, ctx in load_smartart_parts("presentation.pptx"):
    md, images = convert_smartart(root, ctx)
    print(md)

convert_smartart(root, ctx)

Converts a SmartArt dgm:dataModel XML root element to a Markdown list.
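
To make the idea concrete, here is a minimal sketch of how a hierarchy can be read out of a dgm:dataModel element. It is not the library's implementation. The sample XML, the outline function, and the handling of the "doc" point and "parOf" connections are assumptions based on the OOXML diagram schema; real files also carry presentation points and transition points that this sketch ignores.

```python
import xml.etree.ElementTree as ET

DGM = "http://schemas.openxmlformats.org/drawingml/2006/diagram"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Hand-written sample, much smaller than what Office actually produces.
SAMPLE = f"""
<dgm:dataModel xmlns:dgm="{DGM}" xmlns:a="{A}">
  <dgm:ptLst>
    <dgm:pt modelId="0" type="doc"/>
    <dgm:pt modelId="1"><dgm:t><a:p><a:r><a:t>Root item</a:t></a:r></a:p></dgm:t></dgm:pt>
    <dgm:pt modelId="2"><dgm:t><a:p><a:r><a:t>Child item</a:t></a:r></a:p></dgm:t></dgm:pt>
  </dgm:ptLst>
  <dgm:cxnLst>
    <dgm:cxn modelId="10" srcId="0" destId="1" srcOrd="0" destOrd="0"/>
    <dgm:cxn modelId="11" srcId="1" destId="2" srcOrd="0" destOrd="0"/>
  </dgm:cxnLst>
</dgm:dataModel>
"""


def outline(root):
    """Hypothetical helper: render parent/child connections as a bullet list."""
    texts = {}
    doc_id = None
    for pt in root.iter(f"{{{DGM}}}pt"):
        model_id = pt.get("modelId")
        if pt.get("type") == "doc":
            doc_id = model_id  # the invisible top-level point
        texts[model_id] = "".join(t.text or "" for t in pt.iter(f"{{{A}}}t"))

    # Keep only parent-of connections (the default when type is absent).
    children = {}
    for cxn in root.iter(f"{{{DGM}}}cxn"):
        if cxn.get("type", "parOf") != "parOf":
            continue
        order = int(cxn.get("srcOrd", "0"))
        children.setdefault(cxn.get("srcId"), []).append((order, cxn.get("destId")))

    lines = []

    def walk(node_id, depth):
        for _, child_id in sorted(children.get(node_id, [])):
            lines.append("  " * depth + "- " + texts.get(child_id, ""))
            walk(child_id, depth + 1)

    walk(doc_id, 0)
    return "\n".join(lines)


print(outline(ET.fromstring(SAMPLE)))
```

For real diagrams, convert_smartart remains the right entry point; the sketch only shows why a flat list of points and connections maps naturally onto an indented list.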

Parameter  Type               Description
root       ET.Element         dgm:dataModel root returned by load_smartart_parts or resolved from a slide
ctx        ZipContext | None  Context object for the archive. Pass None to skip image extraction.

Returns a (markdown_str, images) tuple:

  • markdown_str — indented bullet list reflecting the diagram hierarchy
  • images — list of (bytes, ext) tuples for images embedded in diagram nodes. Their positions in the Markdown string are marked with @@IMG:0@@, @@IMG:1@@, etc.
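
When calling the API directly, the placeholders have to be resolved by the caller. The following sketch mimics what the CLI is described as doing with -o: it writes each image next to the output file and swaps the @@IMG:n@@ markers for image links. The function name write_with_assets, the img0.png style file names, and the empty alt text are assumptions for illustration, not the library's behaviour.

```python
import re
from pathlib import Path

PLACEHOLDER = re.compile(r"@@IMG:(\d+)@@")


def write_with_assets(md, images, output_path):
    """Hypothetical helper: save images and replace placeholders with links."""
    output_path = Path(output_path)
    # output.md -> output_assets/, in the same directory as the output file.
    assets_dir = output_path.with_name(output_path.stem + "_assets")
    if images:
        assets_dir.mkdir(parents=True, exist_ok=True)

    def link_for(match):
        index = int(match.group(1))
        if index >= len(images):
            return match.group(0)  # leave unknown placeholders untouched
        data, ext = images[index]
        # Accept extensions with or without a leading dot.
        name = f"img{index}.{ext.lstrip('.')}"
        (assets_dir / name).write_bytes(data)
        return f"![]({assets_dir.name}/{name})"

    output_path.write_text(PLACEHOLDER.sub(link_for, md), encoding="utf-8")
```

A call such as write_with_assets(md, images, "output.md") would then leave output.md and, if any images exist, an output_assets/ directory side by side.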

ZipContext

OOXML files (.pptx, .xlsx, .docx) are ZIP archives containing many XML parts. ZipContext pairs an open zipfile.ZipFile with the path of one XML part within the archive, so the converter can extract images embedded in SmartArt nodes.

When you use load_smartart_parts(), ZipContext objects are created and returned automatically. You only need to construct one manually when building a custom pipeline (see below).

import zipfile
from smartart2md import ZipContext

zf = zipfile.ZipFile("presentation.pptx")
ctx = ZipContext(zf, "ppt/diagrams/data1.xml")
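
Before constructing a ZipContext by hand, it can help to see which data parts an archive actually holds. The sketch below lists them with the standard zipfile module. The helper name and the path pattern (ppt/, xl/ or word/ followed by diagrams/dataN.xml) are assumptions drawn from the usual OOXML layout; the library may discover parts differently.

```python
import io
import re
import zipfile

# Assumed naming convention for SmartArt data parts (not taken from the library).
DATA_PART = re.compile(r"^(ppt|xl|word)/diagrams/data\d+\.xml$")


def list_diagram_data_parts(zf):
    """Hypothetical helper: names of parts that look like SmartArt data models."""
    return sorted(name for name in zf.namelist() if DATA_PART.match(name))


# Build a tiny in-memory archive so the sketch runs without a real .pptx.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ppt/presentation.xml", "<p/>")
    zf.writestr("ppt/diagrams/data1.xml", "<d/>")
    zf.writestr("ppt/diagrams/layout1.xml", "<l/>")

with zipfile.ZipFile(buf) as zf:
    parts = list_diagram_data_parts(zf)

print(parts)
```

With a real file, replace the in-memory buffer with zipfile.ZipFile("presentation.pptx") and pass any listed name as the second argument to ZipContext.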

Advanced: Full Pipeline Integration

load_smartart_parts() is convenient but returns diagrams without slide context. When you need to convert an entire PPTX in slide order, iterate the slides manually:

import posixpath
import zipfile
import xml.etree.ElementTree as ET
from smartart2md import convert_smartart, ZipContext

PML_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
DML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def _read_rels(zf, xml_path):
    """Read the .rels file for a given XML part and return {rId: resolved_path}."""
    directory = posixpath.dirname(xml_path)
    filename = posixpath.basename(xml_path)
    rels_path = posixpath.join(directory, "_rels", filename + ".rels")
    result = {}
    try:
        for rel in ET.fromstring(zf.read(rels_path)):
            tag = rel.tag.split("}")[-1] if "}" in rel.tag else rel.tag
            if tag != "Relationship":
                continue
            rid = rel.get("Id", "")
            target = rel.get("Target", "")
            if rel.get("TargetMode") == "External" or not rid:
                continue
            if target.startswith("/"):
                resolved = target.lstrip("/")
            else:
                resolved = posixpath.normpath(
                    posixpath.join(directory, target)
                ).lstrip("/")
            result[rid] = resolved
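    except KeyError:
        # zf.read() raises KeyError when the .rels part does not exist;
        # treat that as "no relationships" and return an empty mapping.
        pass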
    return result


with zipfile.ZipFile("presentation.pptx") as zf:
    # 1. Read slide order from presentation.xml
    prs = ET.fromstring(zf.read("ppt/presentation.xml"))
    prs_rels = _read_rels(zf, "ppt/presentation.xml")

    for sld_id_el in prs.findall(f".//{{{PML_NS}}}sldIdLst/{{{PML_NS}}}sldId"):
        rid = sld_id_el.get(f"{{{REL_NS}}}id")
        slide_path = prs_rels.get(rid or "")
        if not slide_path:
            continue

        slide = ET.fromstring(zf.read(slide_path))
        slide_rels = _read_rels(zf, slide_path)

        # 2. Find graphicFrame shapes that contain SmartArt
        for gf in slide.iter():
            if gf.tag.split("}")[-1] != "graphicFrame":
                continue

            graphic = gf.find(f".//{{{DML_NS}}}graphic")
            if graphic is None:
                continue
            graphic_data = graphic.find(f"{{{DML_NS}}}graphicData")
            if graphic_data is None:
                continue

            # 3. SmartArt is identified by "diagram" or "smartart" in the uri
            #    (compared case-insensitively)
            uri = graphic_data.get("uri", "").lower()
            if "diagram" not in uri and "smartart" not in uri:
                continue

            # 4. Find dgm:relIds element and extract the r:dm attribute
            #    r:dm points to the dataModel file that contains the diagram content
            dm_rid = None
            for child in graphic_data.iter():
                if child.tag.split("}")[-1] == "relIds":
                    for attr, val in child.attrib.items():
                        if attr.endswith("}dm"):
                            dm_rid = val
                            break
                    if dm_rid:
                        break
            if not dm_rid:
                continue

            # 5. Resolve the dataModel file path and convert
            data_path = slide_rels.get(dm_rid)
            if not data_path:
                continue

            data_root = ET.fromstring(zf.read(data_path))
            ctx = ZipContext(zf, data_path)
            md, images = convert_smartart(data_root, ctx)
            print(md)
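
The path handling inside _read_rels is the part most likely to trip up a custom pipeline, so here it is in isolation. This sketch repeats the same posixpath logic as two small functions; the names rels_path_for and resolve_target are introduced here for illustration only.

```python
import posixpath


def rels_path_for(part_path):
    """Location of the relationships file that belongs to part_path."""
    directory, filename = posixpath.split(part_path)
    return posixpath.join(directory, "_rels", filename + ".rels")


def resolve_target(part_path, target):
    """Turn a relationship Target into a path usable with ZipFile.read()."""
    if target.startswith("/"):
        # Absolute targets are relative to the archive root.
        return target.lstrip("/")
    # Relative targets are resolved against the directory of the source part.
    base = posixpath.dirname(part_path)
    return posixpath.normpath(posixpath.join(base, target))


print(rels_path_for("ppt/slides/slide1.xml"))
print(resolve_target("ppt/slides/slide1.xml", "../diagrams/data1.xml"))
```

posixpath is used instead of os.path because ZIP member names always use forward slashes, regardless of the host operating system.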

Supported Input Formats

  • .pptx, .xlsx, .docx — automatically scans for SmartArt data XML inside the archive
  • .xml — parsed directly as a dgm:dataModel root
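
A caller that accepts both kinds of input could dispatch on the file extension, roughly as follows. This is a sketch under assumptions: convert_any is a hypothetical wrapper, the case-insensitive suffix check is a choice made here, and images are discarded for brevity. Only load_smartart_parts, convert_smartart, and passing None as ctx come from the API described above.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

ARCHIVE_SUFFIXES = {".pptx", ".xlsx", ".docx"}


def convert_any(path):
    """Hypothetical wrapper: return one Markdown string per diagram at path."""
    suffix = Path(path).suffix.lower()
    if suffix not in ARCHIVE_SUFFIXES and suffix != ".xml":
        raise ValueError(f"unsupported input: {path}")

    # Imported lazily so the suffix check works without the package installed.
    from smartart2md import convert_smartart, load_smartart_parts

    if suffix == ".xml":
        # A standalone dataModel file has no archive, so no images to extract.
        root = ET.parse(path).getroot()
        md, _images = convert_smartart(root, None)
        return [md]

    return [
        convert_smartart(root, ctx)[0]
        for root, ctx in load_smartart_parts(str(path))
    ]
```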

License

Apache 2.0 — Copyright 2026 INSEONG LEE
