Convert PPTX files into LLM-friendly Markdown (VLM-first image understanding).

Project description

pptx-md

A Python library that converts PPTX files into LLM-friendly Markdown.

It supports VLM (Vision Language Model) based image understanding and masking of personal information.


Installation

Core (text conversion only)

pip install pptx-md

With VLM support (image description generation)

pip install "pptx-md[vlm]"

The VLM extras also install the anthropic and openai SDKs.


Quick Start

Basic conversion (core-only)

from pptx_md import convert

md = convert("deck.pptx")
print(md)

convert() runs the parse → image classification → Markdown assembly pipeline and returns a Markdown string.

With options

from pptx_md import convert, ConvertOptions

opts = ConvertOptions(validate=True)
md = convert("deck.pptx", options=opts)

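VLM image descriptions

With a VLM configured, natural-language descriptions are generated automatically for image slides. API keys must be supplied through environment variables (NFR-05).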

import os
from pptx_md import convert, ConvertOptions, get_describer

describer = get_describer("anthropic", api_key=os.environ["ANTHROPIC_API_KEY"])
opts = ConvertOptions(describer=describer)
md = convert("deck.pptx", options=opts)

To use OpenAI:

import os
from pptx_md import convert, ConvertOptions, get_describer

describer = get_describer("openai", api_key=os.environ["OPENAI_API_KEY"])
opts = ConvertOptions(describer=describer)
md = convert("deck.pptx", options=opts)
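
Indexing os.environ directly, as in the examples above, raises a bare KeyError when the variable is unset. The following is a minimal sketch of a friendlier lookup; require_env is a hypothetical helper for illustration and is not part of pptx-md.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message.

    Hypothetical helper for illustration; it is not part of pptx-md.
    """
    # Treat an unset variable and an empty/whitespace-only one the same way.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value
```

The result could then be passed as api_key=require_env("ANTHROPIC_API_KEY") in the get_describer call shown above.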

PII masking (opt-in)

Replaces PII such as email addresses and phone numbers with [REDACTED]. Masking is disabled by default.

Enabling the default patterns

from pptx_md import convert, ConvertOptions, MaskingOptions

opts = ConvertOptions(masking=MaskingOptions(enabled=True))
md = convert("deck.pptx", options=opts)

Adding custom patterns

import re
from pptx_md import convert, ConvertOptions, MaskingOptions

custom_masking = MaskingOptions(
    enabled=True,
    patterns=[
        re.compile(r"\d{6}-\d{7}"),   # resident registration number
        re.compile(r"사번\s*:\s*\d+"), # employee ID ("사번")
    ],
)
opts = ConvertOptions(masking=custom_masking)
md = convert("deck.pptx", options=opts)
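
To illustrate the idea behind pattern-based masking, here is a standalone sketch that uses only the standard library. It is not pptx-md's implementation: mask_text, EMAIL_PATTERN, and RRN_PATTERN are hypothetical names, and the email regex is deliberately simple.

```python
import re

# Hypothetical, deliberately simple email pattern for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Same shape as the resident registration number pattern shown above.
RRN_PATTERN = re.compile(r"\d{6}-\d{7}")


def mask_text(
    text: str,
    patterns: list[re.Pattern[str]],
    token: str = "[REDACTED]",
) -> str:
    """Replace every match of every pattern with `token`."""
    for pattern in patterns:
        text = pattern.sub(token, text)
    return text


masked = mask_text(
    "Contact: kim@example.com, ID 900101-1234567",
    [EMAIL_PATTERN, RRN_PATTERN],
)
print(masked)  # Contact: [REDACTED], ID [REDACTED]
```

Patterns are applied in order, so an earlier pattern's replacement token is visible to later ones; in practice that only matters if a custom pattern could match the token itself.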

Markdown validation

from pptx_md import convert, validate_markdown

md = convert("deck.pptx")
result = validate_markdown(md)

if not result.valid:
    print("Validation failed:", result.warnings)
elif result.warnings:
    print("Warnings:", result.warnings)

With convert(validate=True), validation results are written to the log as part of conversion (the return value is always a str).


Custom VLM providers (plugins)

Any VLM provider can be used as a plugin by implementing the ImageDescriber protocol.

from pptx_md import convert, ConvertOptions, ImageDescriber


class MyDescriber:
    def describe(
        self,
        image_bytes: bytes,
        image_ext: str,
        shape_hint: str | None,
    ) -> str:
        return "A description of the image"


opts = ConvertOptions(describer=MyDescriber())
md = convert("deck.pptx", options=opts)
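
Because the protocol is a single describe method, describers can also wrap one another. The sketch below caches results for byte-identical images so that, for example, a logo repeated on every slide costs one VLM call. DescriberLike is a local stand-in that mirrors the signature shown above (an assumption, so the example runs without pptx-md), and CachingDescriber and CountingDescriber are hypothetical names.

```python
from __future__ import annotations

import hashlib
from typing import Protocol


class DescriberLike(Protocol):
    """Local stand-in mirroring the describe() signature shown above (assumption)."""

    def describe(
        self, image_bytes: bytes, image_ext: str, shape_hint: str | None
    ) -> str: ...


class CachingDescriber:
    """Wrap another describer and reuse its result for byte-identical images."""

    def __init__(self, inner: DescriberLike) -> None:
        self._inner = inner
        self._cache: dict[str, str] = {}

    def describe(
        self, image_bytes: bytes, image_ext: str, shape_hint: str | None
    ) -> str:
        # Key on image content only; shape_hint is ignored for cache lookups.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._inner.describe(image_bytes, image_ext, shape_hint)
        return self._cache[key]


class CountingDescriber:
    """Fake describer that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def describe(
        self, image_bytes: bytes, image_ext: str, shape_hint: str | None
    ) -> str:
        self.calls += 1
        return f"{image_ext} image, {len(image_bytes)} bytes"


inner = CountingDescriber()
cached = CachingDescriber(inner)
cached.describe(b"\x89PNG", "png", None)
cached.describe(b"\x89PNG", "png", "logo")  # same bytes: served from the cache
print(inner.calls)  # 1
```

Assuming ConvertOptions accepts any object with this method, as the MyDescriber example suggests, CachingDescriber(MyDescriber()) could be passed as the describer. Keying on content alone is a design choice: if descriptions should vary with shape_hint, include it in the cache key.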

Exception handling

from pptx_md import convert, ParseError, DescribeError

try:
    md = convert("deck.pptx")
except ParseError as e:
    print(f"Could not read the PPTX file: {e}")

Full API reference

docs/api.md contains the full reference for all public symbols.

For a detailed usage guide, see docs/usage.md.


Requirements

  • Python 3.11+
  • Core: python-pptx, Pillow
  • VLM support: pip install "pptx-md[vlm]" (anthropic or openai SDK)

License

MIT

Project details


Download files

Source Distribution

pptx_md-0.1.0.tar.gz (117.0 kB)

Uploaded Source

Built Distribution

pptx_md-0.1.0-py3-none-any.whl (37.7 kB)

Uploaded Python 3

File details

Details for the file pptx_md-0.1.0.tar.gz.

File metadata

  • Download URL: pptx_md-0.1.0.tar.gz
  • Upload date:
  • Size: 117.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pptx_md-0.1.0.tar.gz
Algorithm Hash digest
SHA256 5bf25554a3c6dbb867544cbad295b0bb420ccef0abcff41f5a5e37cf3545aa0d
MD5 bd8fb828332396595ab41fe707fd228d
BLAKE2b-256 41d3882f1a4276762d7d290a5045eae7b3dd92781c64b1df18af1955d5dfd5c1
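
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch; sha256_of and verify_download are hypothetical helper names, and the file is assumed to be in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest matches `expected_hex`."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, this would be called as verify_download(Path("pptx_md-0.1.0.tar.gz"), "5bf25554a3c6dbb867544cbad295b0bb420ccef0abcff41f5a5e37cf3545aa0d").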

Provenance

The following attestation bundles were made for pptx_md-0.1.0.tar.gz:

Publisher: release.yml on ms9648/pptx-md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pptx_md-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pptx_md-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 37.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pptx_md-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 724730e3a8713f9f7bbe9c8eab75d0b3eb9da4237ae6e5baf54e9c22f289e875
MD5 787fe0ab7f3355f7f1ca3f8d6148dbbf
BLAKE2b-256 74efb617c9e8d1f0c7d4a0166cf12c126846ef32dceffa01da9474c9595a01d9

Provenance

The following attestation bundles were made for pptx_md-0.1.0-py3-none-any.whl:

Publisher: release.yml on ms9648/pptx-md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
