Convert OOXML SmartArt diagrams to Markdown

smartart2md

Convert OOXML SmartArt diagrams to Markdown lists. Supports .pptx, .xlsx, and .docx files with no external dependencies.

Installation

pip install smartart2md

Quick Start

from smartart2md import convert_smartart, load_smartart_parts

for root, ctx in load_smartart_parts("presentation.pptx"):
    md, images = convert_smartart(root, ctx)
    print(md)

Output:

- Root item
  - Child item
  - Child item
- Root item
  - Child item

CLI

smartart2md input.pptx                  # print all SmartArt to stdout
smartart2md input.pptx -o output.md     # save to file
smartart2md diagram.xml                 # parse a dataModel XML directly

API

load_smartart_parts(path)

Scans an OOXML file and returns a list of (root, ctx) pairs, one per SmartArt diagram. root is an ET.Element (the dgm:dataModel XML root) and ctx is a ZipContext that the converter uses to access embedded images.

For .pptx files, slide order is preserved. For .xlsx and .docx, diagrams are returned in filename sort order.

from smartart2md import load_smartart_parts, convert_smartart

for root, ctx in load_smartart_parts("presentation.pptx"):
    md, images = convert_smartart(root, ctx)
    print(md)
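The note on ordering does not say whether "filename sort order" means a plain string sort or a numeric-aware one, and the difference only shows up once an archive holds ten or more diagrams. The sketch below compares the two on hypothetical part names. The helper natural_key is an illustration only and is not part of smartart2md.

```python
import re

# Hypothetical part names; real archives may use different paths.
names = [
    "xl/diagrams/data10.xml",
    "xl/diagrams/data2.xml",
    "xl/diagrams/data1.xml",
]


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# Plain string sort: "data10" lands before "data2".
lexicographic = sorted(names)
# Numeric-aware sort: "data2" lands before "data10".
natural = sorted(names, key=natural_key)

print(lexicographic)
print(natural)
```

If diagram order matters for .xlsx or .docx input, it may be worth checking the returned order against a file with many diagrams.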

convert_smartart(root, ctx)

Converts a SmartArt dgm:dataModel XML root element to a Markdown list.

Parameter  Type               Description
root       ET.Element         dgm:dataModel root returned by load_smartart_parts or resolved from a slide
ctx        ZipContext | None  Context object for the archive. Pass None to skip image extraction

Returns a (markdown_str, images) tuple:

  • markdown_str — indented bullet list reflecting the diagram hierarchy
  • images — list of (bytes, ext) tuples for images embedded in diagram nodes. Their positions in the Markdown string are marked with @@IMG:0@@, @@IMG:1@@, etc.
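One way to use the placeholders is to write each image to disk and replace its marker with a Markdown image link. The sketch below is a minimal illustration, not part of smartart2md: embed_images, its stem parameter, and the output file naming are all hypothetical, and it assumes ext may arrive with or without a leading dot. It runs on sample data, so the library is not needed to try it.

```python
import re
import tempfile
from pathlib import Path


def embed_images(markdown, images, out_dir, stem="smartart"):
    """Write each image to out_dir and swap its @@IMG:n@@ marker for a link."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for index, (data, ext) in enumerate(images):
        # Assumption: ext may or may not carry a leading dot.
        path = out / f"{stem}_{index}.{ext.lstrip('.')}"
        path.write_bytes(data)
        paths.append(path)

    def link(match):
        index = int(match.group(1))
        if index >= len(paths):
            return match.group(0)  # leave unknown markers untouched
        return f"![]({paths[index].as_posix()})"

    return re.sub(r"@@IMG:(\d+)@@", link, markdown)


# Sample data standing in for the (markdown_str, images) return value.
with tempfile.TemporaryDirectory() as tmp:
    result = embed_images("- Team\n  - @@IMG:0@@ Alice", [(b"\x89PNG", "png")], tmp)
    print(result)
```

Markers whose index has no matching image are left in place, so a mismatch stays visible in the output.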

ZipContext

OOXML files (.pptx, .xlsx, .docx) are ZIP archives containing many XML files. ZipContext pairs an open zipfile.ZipFile with the path of a specific XML file within the archive, so the converter can extract images embedded in SmartArt nodes.

When you use load_smartart_parts(), ZipContext objects are created and returned automatically. You only need to construct one manually when building a custom pipeline (see below).

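import zipfile
from smartart2md import ZipContext

# Keep the archive open for as long as ctx is in use.
with zipfile.ZipFile("presentation.pptx") as zf:
    ctx = ZipContext(zf, "ppt/diagrams/data1.xml")
    # ... pass ctx to convert_smartart() here, while zf is still open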
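Because an OOXML file is an ordinary ZIP archive, the standard library is enough to see which members look like SmartArt data parts before building a ZipContext by hand. The sketch below assumes data parts are named like ".../diagrams/data<N>.xml", as in the path above. That naming is a convention rather than a guarantee, and find_data_parts is a hypothetical helper, not a smartart2md function.

```python
import io
import re
import zipfile

# Assumption: data parts are named like ".../diagrams/data<N>.xml".
DATA_PART = re.compile(r"(^|/)diagrams/data\d*\.xml$")


def find_data_parts(zf):
    """Return archive member names that look like SmartArt data parts."""
    return [name for name in zf.namelist() if DATA_PART.search(name)]


# Build a tiny in-memory archive so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml", "<sld/>")
    zf.writestr("ppt/diagrams/data1.xml", "<dataModel/>")
    zf.writestr("ppt/diagrams/layout1.xml", "<layoutDef/>")

with zipfile.ZipFile(buffer) as zf:
    parts = find_data_parts(zf)

print(parts)
```

Matching on names is only a quick inspection aid. Following relationship files, as in the full pipeline, is the more reliable way to locate a diagram's data part.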

Advanced: Full Pipeline Integration

load_smartart_parts() is convenient, but it returns diagrams without slide context. When you need to convert an entire PPTX in slide order, with each diagram tied to the slide it appears on, iterate the slides manually:

import posixpath
import zipfile
import xml.etree.ElementTree as ET
from smartart2md import convert_smartart, ZipContext

PML_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
DML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def _read_rels(zf, xml_path):
    """Read the .rels file for a given XML part and return {rId: resolved_path}."""
    directory = posixpath.dirname(xml_path)
    filename = posixpath.basename(xml_path)
    rels_path = posixpath.join(directory, "_rels", filename + ".rels")
    result = {}
    try:
        for rel in ET.fromstring(zf.read(rels_path)):
            tag = rel.tag.split("}")[-1] if "}" in rel.tag else rel.tag
            if tag != "Relationship":
                continue
            rid = rel.get("Id", "")
            target = rel.get("Target", "")
            if rel.get("TargetMode") == "External" or not rid:
                continue
            if target.startswith("/"):
                resolved = target.lstrip("/")
            else:
                resolved = posixpath.normpath(
                    posixpath.join(directory, target)
                ).lstrip("/")
            result[rid] = resolved
    except KeyError:
        pass
    return result


with zipfile.ZipFile("presentation.pptx") as zf:
    # 1. Read slide order from presentation.xml
    prs = ET.fromstring(zf.read("ppt/presentation.xml"))
    prs_rels = _read_rels(zf, "ppt/presentation.xml")

    for sld_id_el in prs.findall(f".//{{{PML_NS}}}sldIdLst/{{{PML_NS}}}sldId"):
        rid = sld_id_el.get(f"{{{REL_NS}}}id")
        slide_path = prs_rels.get(rid or "")
        if not slide_path:
            continue

        slide = ET.fromstring(zf.read(slide_path))
        slide_rels = _read_rels(zf, slide_path)

        # 2. Find graphicFrame shapes that contain SmartArt
        for gf in slide.iter():
            if gf.tag.split("}")[-1] != "graphicFrame":
                continue

            graphic = gf.find(f".//{{{DML_NS}}}graphic")
            if graphic is None:
                continue
            graphic_data = graphic.find(f"{{{DML_NS}}}graphicData")
            if graphic_data is None:
                continue

            # 3. SmartArt is identified by "diagram" or "smartart" in the uri
            #    (compared case-insensitively)
            uri = graphic_data.get("uri", "").lower()
            if "diagram" not in uri and "smartart" not in uri:
                continue

            # 4. Find dgm:relIds element and extract the r:dm attribute
            #    r:dm points to the dataModel file that contains the diagram content
            dm_rid = None
            for child in graphic_data.iter():
                if child.tag.split("}")[-1] == "relIds":
                    for attr, val in child.attrib.items():
                        if attr.endswith("}dm"):
                            dm_rid = val
                            break
                    if dm_rid:
                        break
            if not dm_rid:
                continue

            # 5. Resolve the dataModel file path and convert
            data_path = slide_rels.get(dm_rid)
            if not data_path:
                continue

            data_root = ET.fromstring(zf.read(data_path))
            ctx = ZipContext(zf, data_path)
            md, images = convert_smartart(data_root, ctx)
            print(md)
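Steps 4 and 5 of the pipeline above can be tried in isolation on a hand-written fragment. The sketch below parses a trimmed-down graphicFrame, reads the r:dm relationship id, and resolves a sample relationship target the same way _read_rels does. The XML snippet, the slide path, and the Target value are made-up sample data, not taken from a real file.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# A trimmed-down graphicFrame, written by hand for illustration.
SLIDE_SNIPPET = """
<p:graphicFrame
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:dgm="http://schemas.openxmlformats.org/drawingml/2006/diagram">
  <a:graphic>
    <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/diagram">
      <dgm:relIds r:dm="rId2" r:lo="rId3" r:qs="rId4" r:cs="rId5"/>
    </a:graphicData>
  </a:graphic>
</p:graphicFrame>
"""

# Step 4: find dgm:relIds and read its r:dm attribute.
frame = ET.fromstring(SLIDE_SNIPPET)
rel_ids = next(el for el in frame.iter() if el.tag.endswith("}relIds"))
dm_rid = rel_ids.get(f"{{{REL_NS}}}dm")

# Step 5: relationship targets are relative to the source part's directory.
slide_path = "ppt/slides/slide1.xml"
target = "../diagrams/data1.xml"  # sample Target value for rId2
directory = posixpath.dirname(slide_path)

rels_path = posixpath.join(
    directory, "_rels", posixpath.basename(slide_path) + ".rels"
)
data_path = posixpath.normpath(posixpath.join(directory, target))

print(dm_rid)
print(rels_path)
print(data_path)
```

Using posixpath rather than os.path keeps the forward slashes that ZIP member names require, regardless of the host platform.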

Supported Input Formats

  • .pptx, .xlsx, .docx — automatically scans for SmartArt data XML inside the archive
  • .xml — parsed directly as a dgm:dataModel root
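Before handing a standalone .xml file to the converter, it can help to confirm that its root really is a dgm:dataModel element. The sketch below is a minimal check, assuming the standard DrawingML diagram namespace. is_data_model is a hypothetical helper and not part of smartart2md.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard DrawingML diagram namespace.
DGM_NS = "http://schemas.openxmlformats.org/drawingml/2006/diagram"


def is_data_model(root):
    """Return True when the element is a dgm:dataModel root."""
    return root.tag == f"{{{DGM_NS}}}dataModel"


# Hand-written sample documents for illustration.
sample = f'<dgm:dataModel xmlns:dgm="{DGM_NS}"><dgm:ptLst/></dgm:dataModel>'
good = is_data_model(ET.fromstring(sample))
bad = is_data_model(ET.fromstring("<slide/>"))

print(good, bad)
```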

License

Apache 2.0 — Copyright 2026 INSEONG LEE
