Universal Document Format — parse, transform, and render HWP/HWPX/DOCX/PDF/MD documents through a unified Document Model

Maintainers

Hoon_Kim

udfp — Universal Document Format Protocol

Parse, transform, and render HWP/HWPX/DOCX/PDF/MD documents through a unified Document Model.

UDF (Universal Document Format) is the format — a unified document model that normalizes heterogeneous file formats into a common block tree. UDFP (Universal Document Format Protocol) is the protocol layer — an MCP server that lets AI agents read, edit, and generate documents through UDF.

The udfp distribution ships two components; the MCP server additionally needs the mcp extra:

udf — Core library. Parsers, renderers, Document Model, validation, CLI.
udfp — MCP server. Exposes udf to Claude and other LLM agents via the Model Context Protocol.

pip install udfp        →  import udf       (library)
pip install udfp[mcp]   →  udfp             (MCP server)

Features

Multi-format parsing: HWP (binary), HWPX (OOXML-like ZIP), DOCX, PDF, Markdown, HTML, XML
Lossless round-trip: same-format HWP/HWPX/DOCX conversions preserve content via a verbatim layer
Cross-format conversion: convert between supported format pairs (e.g., HWP → DOCX, PDF → MD)
Programmatic editing: add, modify, or remove blocks and inlines via the UdfDocument API
Two generation modes: Seed Patch (modify in place) and From Scratch (full regeneration)
Structural validation: R-rules for HWP (R1–R4), HX-rules for HWPX (HX1–HX4), and D-rules for DOCX (D1–D3), all implemented
MCP server: Claude/LLM integration for reading, editing, and generating documents

Installation

pip install udfp

With MCP server:

pip install udfp[mcp]

For development:

pip install udfp[dev]

`udf` — Core Library

Parse a document

import udf

doc = udf.parse("report.hwp")
print(f"{len(doc.blocks)} blocks parsed")

Convert between formats

import udf

udf.convert("input.hwp", "output.docx")
udf.convert("paper.pdf", "paper.md")

Programmatic editing

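import udf
from udf.schema.blocks import ParagraphBlock
from udf.schema.inlines import TextInline

# Parse the template into a UdfDocument
doc = udf.parse("template.hwp")

# Replace placeholder text in the parsed document
doc.replace_text("PLACEHOLDER", "Actual Value")

# Build a new paragraph that holds a single text inline
new_block = ParagraphBlock(
    type="paragraph",
    id="new-1",
    inlines=[TextInline(type="text", text="New content")],
)
doc.add_block(new_block)

# Render back to HWP; the added block has no verbatim_ref,
# so the renderer falls back to From Scratch mode
udf.render(doc, "hwp", output_path="filled.hwp")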
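
To make the editing model concrete, here is a minimal, self-contained sketch of a block tree with text replacement. It does not use udf: the `Sketch*` classes are hypothetical stand-ins that only mirror the field names visible in the snippet above (`type`, `id`, `inlines`, `text`), and the real `replace_text` may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class SketchInline:
    text: str
    type: str = "text"


@dataclass
class SketchParagraph:
    id: str
    inlines: list[SketchInline] = field(default_factory=list)
    type: str = "paragraph"


@dataclass
class SketchDocument:
    blocks: list[SketchParagraph] = field(default_factory=list)

    def replace_text(self, old: str, new: str) -> int:
        """Replace `old` with `new` in every inline; return the number of hits."""
        replaced = 0
        for block in self.blocks:
            for inline in block.inlines:
                hits = inline.text.count(old)
                if hits:
                    inline.text = inline.text.replace(old, new)
                    replaced += hits
        return replaced

    def add_block(self, block: SketchParagraph) -> None:
        # Assumption for this sketch: new blocks are appended at the end.
        self.blocks.append(block)


doc_sketch = SketchDocument(
    [SketchParagraph(id="p1", inlines=[SketchInline("Dear PLACEHOLDER,")])]
)
replaced = doc_sketch.replace_text("PLACEHOLDER", "Actual Value")
doc_sketch.add_block(SketchParagraph(id="new-1", inlines=[SketchInline("New content")]))
print(replaced, [i.text for b in doc_sketch.blocks for i in b.inlines])
```

One limitation of this sketch: a placeholder that is split across two inlines (for example by a formatting change in the middle of the word) is not matched, because each inline is searched on its own.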

CLI

udf convert input.hwp -o output.docx
udf inspect document.hwp
udf validate document.hwp
udf diff original.hwp modified.hwp

`udfp` — MCP Server

The MCP server lets LLMs read, edit, and generate documents through tool calls.

Start the server

udfp                                          # stdio (default)
udfp --transport streamable-http --port 8000  # HTTP

Available tools

| Tool | Description |
| --- | --- |
| `read(path)` | Parse a document into simplified JSON with block IDs |
| `edit(path, edits)` | Modify text/formatting at specific block+inline positions |
| `render(path, format)` | Convert a document to another format |
| `create(blocks, format)` | Build a new document from a block array |
| `insert_blocks(path, blocks)` | Add blocks to an existing document |
| `remove_blocks(path, block_ids)` | Delete blocks by ID |
| `set_page(path, ...)` | Change page layout (paper size, margins, columns) |
| `export_md(path)` | Export document as editable Markdown with block IDs |
| `import_md(path, edited_md)` | Apply edited Markdown back, preserving original formatting |
| `describe(topic)` | Get schema documentation (start with `describe('overview')`) |
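
MCP clients invoke tools through the JSON-RPC 2.0 `tools/call` method. The sketch below builds such requests by hand for `read` and `remove_blocks`, assuming the arguments are passed under the parameter names listed above. The `tool_call` helper and the block ID `"b-3"` are made up for illustration; in practice an MCP client library constructs these messages.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


read_request = tool_call("read", {"path": "report.hwp"})
remove_request = tool_call(
    "remove_blocks",
    {"path": "report.hwp", "block_ids": ["b-3"]},  # "b-3" is a made-up block ID
)
print(json.dumps(read_request, indent=2))
```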

Claude Desktop config

{
  "mcpServers": {
    "udfp": {
      "command": "udfp"
    }
  }
}
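
If a Claude Desktop config already lists other servers, the `udfp` entry has to be merged in rather than pasted over them. A minimal sketch of that merge on an in-memory dict follows; `add_udfp_server` and the `other-server` entry are hypothetical, and reading or writing the actual config file is left out.

```python
import json


def add_udfp_server(config: dict, command: str = "udfp") -> dict:
    """Return a copy of the config with a `udfp` entry under `mcpServers`."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is untouched
    servers["udfp"] = {"command": command}
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "other"}}}  # hypothetical entry
updated = add_udfp_server(existing)
print(json.dumps(updated, indent=2))
```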

Document Model

All formats are normalized into a common block tree:

| Block Type | Description |
| --- | --- |
| `ParagraphBlock` | Text with inline formatting |
| `HeadingBlock` | Heading levels 1–6 |
| `TableBlock` | Rows, cells, merged spans |
| `ImageBlock` | Embedded or referenced images |
| `ListBlock` | Ordered/unordered lists |
| `EquationBlock` | Mathematical equations |
| `CodeBlock` | Source code blocks |
| `QuoteBlock` | Block quotations |
| `PageBreakBlock` | Explicit page breaks |
| `HorizontalRuleBlock` | Horizontal rules |
| `DrawingBlock` | Vector shapes |
| `TextBoxBlock` | Floating text containers |
| `FootnoteBlock` / `EndnoteBlock` | Notes |
| `HeaderBlock` / `FooterBlock` | Page header/footer content |
| `FieldBlock` | Form fields, hyperlinks, bookmarks |
| `BookmarkBlock` | Named bookmarks |
| `CommentBlock` | Review comments |
| `ChartBlock` | Embedded charts |
| `TextArtBlock` | Decorative text (WordArt) |
| `UnknownBlock` | Unrecognized format-specific content |

Generation Modes

Seed Patch (default when original exists)

Preserves the original binary/ZIP, replacing only modified streams. Guarantees bit-perfect preservation of unmodified regions.

Best for: Form filling, text replacement, content updates without structural changes.
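
The idea of replacing only the modified streams can be illustrated for ZIP-based containers with the standard `zipfile` module. This is a sketch of the general technique, not udf's Seed Patch implementation: it preserves the decompressed bytes of untouched entries but rewrites the container, so it does not reproduce the bit-perfect guarantee described above. The entry names are hypothetical.

```python
import io
import zipfile


def patch_zip(original: bytes, replacements: dict[str, bytes]) -> bytes:
    """Copy every entry of a ZIP archive, swapping in replacement bytes by name."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(original)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename in replacements:
                data = replacements[info.filename]
            else:
                data = src.read(info.filename)  # untouched entry, content copied as-is
            dst.writestr(info, data)  # reuses the entry's name and compression type
    return out.getvalue()


def build_zip(entries: dict[str, bytes]) -> bytes:
    """Helper for the demo: pack a name -> bytes mapping into a ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Hypothetical entry names; real HWPX/DOCX archives have their own layout.
original = build_zip({"header.xml": b"<head/>", "section0.xml": b"<p>PLACEHOLDER</p>"})
patched = patch_zip(original, {"section0.xml": b"<p>Actual Value</p>"})

with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    contents = {name: zf.read(name) for name in zf.namelist()}
print(contents)
```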

From Scratch (automatic fallback)

Regenerates the entire output file from the Document Model. Required when blocks are added, removed, or restructured.

Automatic detection: If any block lacks a verbatim_ref (i.e., was programmatically added), the renderer automatically falls back to From Scratch mode.
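
The detection rule above can be sketched in a few lines. `SketchBlock`, `choose_mode`, and the mode strings are hypothetical names for illustration; only the `verbatim_ref` attribute comes from the description. The sketch covers the stated rule (a block without a `verbatim_ref` forces the fallback) and does not attempt to detect removed or reordered blocks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchBlock:
    id: str
    verbatim_ref: Optional[str] = None  # None marks a block with no original bytes


def choose_mode(blocks: list[SketchBlock], has_original: bool) -> str:
    """Pick Seed Patch only when an original exists and every block maps back to it."""
    if has_original and all(b.verbatim_ref is not None for b in blocks):
        return "seed_patch"
    return "from_scratch"


parsed = [SketchBlock("p1", "ref-1"), SketchBlock("p2", "ref-2")]
edited = parsed + [SketchBlock("new-1")]  # programmatically added, no verbatim_ref
print(choose_mode(parsed, has_original=True), choose_mode(edited, has_original=True))
```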

Supported Formats

| Format | Parse | Render | Same-format Round-trip |
| --- | --- | --- | --- |
| HWP | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| HWPX | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| DOCX | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| PDF | Full | — | Parse only |
| Markdown | Full | Full | Text-level |
| HTML | Full | Full | Text-level |
| XML | Full | — | Parse only |

Cross-format Conversion Matrix

| From \ To | HWP | HWPX | DOCX | MD | HTML |
| --- | --- | --- | --- | --- | --- |
| HWP | Lossless | Semantic | Semantic | Text-level | Text-level |
| HWPX | Semantic | Lossless | Semantic | Text-level | Text-level |
| DOCX | Semantic | Semantic | Lossless | Text-level | Text-level |
| PDF | — | — | — | Text-level | Text-level |
| MD | From Scratch | — | — | — | Full |
| HTML | From Scratch | — | — | Full | — |

Lossless: Verbatim layer preserves all binary content (Seed Patch mode)
Semantic: Block structure and text preserved; format-specific styling may differ (From Scratch mode)
Text-level: Text content preserved; formatting, page layout, images lost
From Scratch: Generates a new binary from the Document Model; results are best when an original file is available
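
For scripts that need to check a conversion before attempting it, the matrix can be transcribed into a small lookup. This is a plain copy of the table as data, not an API that udf exposes; `fidelity` is a hypothetical helper, and unsupported pairs (shown as a dash in the table) return `None`.

```python
from typing import Optional

TARGETS = ("HWP", "HWPX", "DOCX", "MD", "HTML")

# One row per source format, in the column order of TARGETS.
MATRIX = {
    "HWP": ("Lossless", "Semantic", "Semantic", "Text-level", "Text-level"),
    "HWPX": ("Semantic", "Lossless", "Semantic", "Text-level", "Text-level"),
    "DOCX": ("Semantic", "Semantic", "Lossless", "Text-level", "Text-level"),
    "PDF": (None, None, None, "Text-level", "Text-level"),
    "MD": ("From Scratch", None, None, None, "Full"),
    "HTML": ("From Scratch", None, None, "Full", None),
}


def fidelity(source: str, target: str) -> Optional[str]:
    """Return the fidelity level listed for source -> target, or None if unsupported."""
    row = MATRIX.get(source.upper())
    if row is None or target.upper() not in TARGETS:
        return None
    return row[TARGETS.index(target.upper())]


print(fidelity("hwp", "docx"), fidelity("pdf", "hwp"))
```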

Known Limitations

From Scratch mode (used for cross-format and structural edits):

DrawingBlock, ChartBlock, TextArtBlock cannot be regenerated without the original file — reported as FORMAT_LIMIT loss
Complex table structures (merged cells, nested tables) may not fully survive HWPX/DOCX → HWP conversion

Validation rules:

HWP: R1–R4 structural rules + I1–I3 integrity checks — fully implemented with auto-fixers
HWPX: HX1–HX4 structural rules — fully implemented
DOCX: D1–D3 structural rules — fully implemented
PDF: format-specific rules planned (not needed until PDF rendering is added)

Text-level formats (MD, HTML):

Formatting (fonts, colors, margins), images, and page layout are not preserved
Useful for text extraction and content editing, not visual fidelity

Architecture

Input File ──▶ Parser ──▶ UdfDocument ──▶ Renderer ──▶ Output File
                              │
                              ▼
                     Document Model (blocks/inlines)
                              +
                     Verbatim Layer (binary preservation)
                              +
                     Loss Report (what was dropped)
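
As a rough illustration of how a renderer and a loss report fit together, the sketch below renders a tiny block list to Markdown-style text and records every block it cannot express. All names here (`SketchBlock`, `LossReport`, `render_text`) are hypothetical and unrelated to udf's actual classes, and headings are always emitted at level 1 to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class SketchBlock:
    id: str
    type: str
    text: str = ""


@dataclass
class LossReport:
    # (block id, block type) for every block the renderer had to drop
    dropped: list[tuple[str, str]] = field(default_factory=list)


def render_text(blocks: list[SketchBlock]) -> tuple[str, LossReport]:
    """Render headings and paragraphs; report everything else as lost."""
    report = LossReport()
    parts = []
    for block in blocks:
        if block.type == "heading":
            parts.append("# " + block.text)
        elif block.type == "paragraph":
            parts.append(block.text)
        else:
            report.dropped.append((block.id, block.type))
    return "\n\n".join(parts), report


blocks = [
    SketchBlock("b1", "heading", "Title"),
    SketchBlock("b2", "paragraph", "Body"),
    SketchBlock("b3", "image"),
]
text, report = render_text(blocks)
print(text)
print(report.dropped)
```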

Development

pytest                        # all tests
pytest tests/roundtrip/       # round-trip tests
pytest tests/validation/      # R-rule validation
ruff check . && ruff format . # lint + format
mypy udf/                     # type check

License

Business Source License 1.1 (BUSL-1.1) — see LICENSE and NOTICE.

Non-commercial, academic, and personal use is free. For commercial or production use, contact h000000nkim@gmail.com.

This version

1.0.3

Jun 6, 2026

Download files

Source Distribution

udfp-1.0.3.tar.gz (650.9 kB)

Uploaded Jun 6, 2026 Source

Built Distribution

udfp-1.0.3-py3-none-any.whl (662.2 kB)

Uploaded Jun 6, 2026 Python 3

File details

Details for the file udfp-1.0.3.tar.gz.

File metadata

Download URL: udfp-1.0.3.tar.gz
Upload date: Jun 6, 2026
Size: 650.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for udfp-1.0.3.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `7cfd3f696c368722477a80a356528c872c9b47f3eaed307ab7171f02c6efdf01` |
| MD5 | `e87597dff857be6ac04bf1c62da3b433` |
| BLAKE2b-256 | `5babdce59dbf13763779ee8a89c60c469f116f74befcb7e485a28797366914b1` |
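
A downloaded archive can be checked against the published SHA256 digest with the standard `hashlib` module. The sketch below hashes any binary stream in chunks; `sha256_of` and `matches` are helper names chosen for this example, and the demo uses an in-memory stand-in rather than the real file.

```python
import hashlib
import io
from typing import BinaryIO

# SHA256 digest published for udfp-1.0.3.tar.gz
EXPECTED_SHA256 = "7cfd3f696c368722477a80a356528c872c9b47f3eaed307ab7171f02c6efdf01"


def sha256_of(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the stream's SHA256 digest with an expected hex string."""
    return sha256_of(stream) == expected_hex.lower()


# In-memory stand-in for a download; these bytes are not the real archive.
print(matches(io.BytesIO(b"not the real archive"), EXPECTED_SHA256))
```

For a real download, open the file in binary mode and pass the handle, for example `with open("udfp-1.0.3.tar.gz", "rb") as fh: matches(fh, EXPECTED_SHA256)`.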

Provenance

The following attestation bundles were made for udfp-1.0.3.tar.gz:

Publisher: publish.yml on h000000nkim/udfp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: udfp-1.0.3.tar.gz
- Subject digest: 7cfd3f696c368722477a80a356528c872c9b47f3eaed307ab7171f02c6efdf01
- Sigstore transparency entry: 1738433759
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: h000000nkim/udfp@83ca7f541091777108e3024c68381ab78d76e66c
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/h000000nkim
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@83ca7f541091777108e3024c68381ab78d76e66c
- Trigger Event: release

File details

Details for the file udfp-1.0.3-py3-none-any.whl.

File metadata

Download URL: udfp-1.0.3-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 662.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for udfp-1.0.3-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `3d3599b63a7877e55294a8c9fc79e1f8138468de80ce81b9360a541740c9f5e2` |
| MD5 | `a0d856202814e69a8970f3f171afacc5` |
| BLAKE2b-256 | `4837a8ca433d37c407c57e1f4fce3664d17d1ce6a0437164df45a7add3fc13e3` |

Provenance

The following attestation bundles were made for udfp-1.0.3-py3-none-any.whl:

Publisher: publish.yml on h000000nkim/udfp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: udfp-1.0.3-py3-none-any.whl
- Subject digest: 3d3599b63a7877e55294a8c9fc79e1f8138468de80ce81b9360a541740c9f5e2
- Sigstore transparency entry: 1738433775
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: h000000nkim/udfp@83ca7f541091777108e3024c68381ab78d76e66c
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/h000000nkim
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@83ca7f541091777108e3024c68381ab78d76e66c
- Trigger Event: release
