Universal Document Format — parse, transform, and render HWP/HWPX/DOCX/PDF/MD documents through a unified Document Model
udfp — Universal Document Format Protocol
Parse, transform, and render HWP/HWPX/DOCX/PDF/MD documents through a unified Document Model.
UDF (Universal Document Format) is the format — a unified document model that normalizes heterogeneous file formats into a common block tree. UDFP (Universal Document Format Protocol) is the protocol layer — an MCP server that lets AI agents read, edit, and generate documents through UDF.
The udfp package provides two components:
- udf: Core library. Parsers, renderers, Document Model, validation, CLI.
- udfp: MCP server. Exposes udf to Claude and other LLM agents via the Model Context Protocol.
pip install udfp → import udf (library)
pip install udfp[mcp] → udfp (MCP server)
Features
- Multi-format parsing: HWP (binary), HWPX (OOXML-like ZIP), DOCX, PDF, Markdown, HTML, XML
- Lossless round-trip: HWP/HWPX/DOCX same-format conversions preserve content via the verbatim layer
- Cross-format conversion: Convert between supported format pairs (e.g., HWP → DOCX, PDF → MD)
- Programmatic editing: Add, modify, or remove blocks/inlines via the UdfDocument API
- Two generation modes: Seed Patch (modify in-place) and From Scratch (full regeneration)
- Structural validation: R-rules for HWP (R1–R4), HX-rules for HWPX (HX1–HX4), D-rules for DOCX (D1–D3), all implemented
- MCP server: Claude/LLM integration for reading, editing, and generating documents
Installation
pip install udfp
With MCP server:
pip install udfp[mcp]
For development:
pip install udfp[dev]
udf — Core Library
Parse a document
import udf
doc = udf.parse("report.hwp")
print(f"{len(doc.blocks)} blocks parsed")
Convert between formats
import udf
udf.convert("input.hwp", "output.docx")
udf.convert("paper.pdf", "paper.md")
Programmatic editing
import udf
from udf.schema.blocks import ParagraphBlock
from udf.schema.inlines import TextInline
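# Parse a template and fill in a placeholder
doc = udf.parse("template.hwp")
doc.replace_text("PLACEHOLDER", "Actual Value")
# Append a new paragraph; a programmatically added block has no verbatim_ref,
# so rendering falls back to From Scratch mode
new_block = ParagraphBlock(
    type="paragraph",
    id="new-1",
    inlines=[TextInline(type="text", text="New content")],
)
doc.add_block(new_block)
# Render the edited document back to HWP
udf.render(doc, "hwp", output_path="filled.hwp")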
CLI
udf convert input.hwp -o output.docx
udf inspect document.hwp
udf validate document.hwp
udf diff original.hwp modified.hwp
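For batch jobs, the convert command can be driven from a script. The following is a minimal sketch that only builds the argument lists, using the `udf convert <input> -o <output>` form shown above; the helper name `convert_commands` and the file names are made up for illustration.

```python
import shlex
from pathlib import Path


def convert_commands(sources: list[str], target_ext: str) -> list[list[str]]:
    """Build one `udf convert` argv per source file, swapping the extension."""
    commands = []
    for src in sources:
        out = Path(src).with_suffix(target_ext).as_posix()
        commands.append(["udf", "convert", src, "-o", out])
    return commands


# Hypothetical input files; printing keeps the sketch free of side effects.
for argv in convert_commands(["a.hwp", "reports/b.hwp"], ".docx"):
    print(shlex.join(argv))
```

To actually run the conversions, each argv could be passed to `subprocess.run(argv, check=True)`.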
udfp — MCP Server
The MCP server lets LLMs read, edit, and generate documents through tool calls.
Start the server
udfp # stdio (default)
udfp --transport streamable-http --port 8000 # HTTP
Available tools
| Tool | Description |
|---|---|
| read(path) | Parse a document into simplified JSON with block IDs |
| edit(path, edits) | Modify text/formatting at specific block+inline positions |
| render(path, format) | Convert a document to another format |
| create(blocks, format) | Build a new document from a block array |
| insert_blocks(path, blocks) | Add blocks to an existing document |
| remove_blocks(path, block_ids) | Delete blocks by ID |
| set_page(path, ...) | Change page layout (paper size, margins, columns) |
| export_md(path) | Export document as editable Markdown with block IDs |
| import_md(path, edited_md) | Apply edited Markdown back, preserving original formatting |
| describe(topic) | Get schema documentation (start with describe('overview')) |
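MCP clients invoke tools like these through JSON-RPC 2.0 `tools/call` requests. As a rough sketch of what a client might send for read(path): the argument name `path` is taken from the tool list above, while the request id and file name are placeholders. In practice an MCP client library builds this message for you.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Placeholder id and path; only the `path` argument name comes from the tool list.
request = tool_call_request(1, "read", {"path": "report.hwp"})
print(request)
```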
Claude Desktop config
{
  "mcpServers": {
    "udfp": {
      "command": "udfp"
    }
  }
}
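If the Claude Desktop config already lists other servers, the udfp entry has to be merged in rather than pasted over the file. A small sketch of that merge, operating on an already-loaded dict; the helper name `add_udfp_server` and the "other" server entry are hypothetical.

```python
import json


def add_udfp_server(config: dict) -> dict:
    """Return a copy of a Claude Desktop config with the udfp entry merged in."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["udfp"] = {"command": "udfp"}
    merged["mcpServers"] = servers
    return merged


# "other" stands in for whatever servers are already configured.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
updated = add_udfp_server(existing)
print(json.dumps(updated, indent=2))
```

The function copies the input instead of mutating it, so the original config can still be written back unchanged if something goes wrong.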
Document Model
All formats are normalized into a common block tree:
| Block Type | Description |
|---|---|
| ParagraphBlock | Text with inline formatting |
| HeadingBlock | Heading levels 1–6 |
| TableBlock | Rows, cells, merged spans |
| ImageBlock | Embedded or referenced images |
| ListBlock | Ordered/unordered lists |
| EquationBlock | Mathematical equations |
| CodeBlock | Source code blocks |
| QuoteBlock | Block quotations |
| PageBreakBlock | Explicit page breaks |
| HorizontalRuleBlock | Horizontal rules |
| DrawingBlock | Vector shapes |
| TextBoxBlock | Floating text containers |
| FootnoteBlock / EndnoteBlock | Notes |
| HeaderBlock / FooterBlock | Page header/footer content |
| FieldBlock | Form fields, hyperlinks, bookmarks |
| BookmarkBlock | Named bookmarks |
| CommentBlock | Review comments |
| ChartBlock | Embedded charts |
| TextArtBlock | Decorative text (WordArt) |
| UnknownBlock | Unrecognized format-specific content |
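To make the idea of a common block tree concrete, here is a toy sketch with stand-in dataclasses. These are not the classes in udf.schema: the names `Text`, `Paragraph`, and `Heading` and the `level` field are assumptions, and only the `type`, `id`, `inlines`, and `text` fields mirror the editing example earlier in this README.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    """Stand-in for an inline text run."""
    text: str
    type: str = "text"


@dataclass
class Paragraph:
    """Stand-in for a paragraph block holding inline runs."""
    id: str
    inlines: list[Text] = field(default_factory=list)
    type: str = "paragraph"


@dataclass
class Heading:
    """Stand-in for a heading block; `level` is an assumed field."""
    id: str
    level: int
    inlines: list[Text] = field(default_factory=list)
    type: str = "heading"


def plain_text(blocks) -> str:
    """Join the inline text of every block, one block per line."""
    return "\n".join("".join(i.text for i in b.inlines) for b in blocks)


blocks = [
    Heading(id="h-1", level=1, inlines=[Text("Report")]),
    Paragraph(id="p-1", inlines=[Text("Hello, "), Text("world.")]),
]
print(plain_text(blocks))
```

Because every source format ends up in the same shape, a helper like `plain_text` does not need to know whether the blocks came from HWP, DOCX, or Markdown.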
Generation Modes
Seed Patch (default when original exists)
Preserves the original binary/ZIP, replacing only modified streams. Guarantees bit-perfect preservation of unmodified regions.
Best for: Form filling, text replacement, content updates without structural changes.
From Scratch (automatic fallback)
Regenerates the entire output file from the Document Model. Required when blocks are added, removed, or restructured.
Automatic detection: If any block lacks a verbatim_ref (i.e., was programmatically added), the renderer automatically falls back to From Scratch mode.
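The detection rule can be sketched in a few lines. This is an illustration of the rule as described here, not the renderer's actual code: the `Block` stand-in, the `choose_mode` helper, and the mode strings are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Stand-in block; verbatim_ref is None for programmatically added blocks."""
    id: str
    verbatim_ref: Optional[str] = None


def choose_mode(blocks: list[Block], has_original: bool) -> str:
    """Pick Seed Patch only when an original exists and every block maps back to it."""
    if not has_original:
        return "from_scratch"
    if any(b.verbatim_ref is None for b in blocks):
        return "from_scratch"
    return "seed_patch"


parsed = [Block("p-1", "ref-1"), Block("p-2", "ref-2")]
edited = parsed + [Block("new-1")]  # added block carries no verbatim_ref
print(choose_mode(parsed, has_original=True))
print(choose_mode(edited, has_original=True))
```

This check only covers added blocks; removed or reordered blocks would need a separate comparison against the original, which this sketch leaves out.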
Supported Formats
| Format | Parse | Render | Same-format Round-trip |
|---|---|---|---|
| HWP | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| HWPX | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| DOCX | Full | Full (Seed Patch + From Scratch) | Lossless (verbatim) |
| PDF | Full | — | Parse only |
| Markdown | Full | Full | Text-level |
| HTML | Full | Full | Text-level |
| XML | Full | — | Parse only |
Cross-format Conversion Matrix
| From \ To | HWP | HWPX | DOCX | MD | HTML |
|---|---|---|---|---|---|
| HWP | Lossless | Semantic | Semantic | Text-level | Text-level |
| HWPX | Semantic | Lossless | Semantic | Text-level | Text-level |
| DOCX | Semantic | Semantic | Lossless | Text-level | Text-level |
| PDF | — | — | — | Text-level | Text-level |
| MD | From Scratch | — | — | — | Full |
| HTML | From Scratch | — | — | Full | — |
- Lossless: Verbatim layer preserves all binary content (Seed Patch mode)
- Semantic: Block structure and text preserved; format-specific styling may differ (From Scratch mode)
- Text-level: Text content preserved; formatting, page layout, images lost
- From Scratch: Generates new binary from Document Model; requires original for best results
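When scripting conversions, it can help to check the expected fidelity before converting. The sketch below simply transcribes the matrix above into a lookup table; the `fidelity` helper is hypothetical and is not part of the udf API.

```python
from typing import Optional

TARGETS = ("HWP", "HWPX", "DOCX", "MD", "HTML")

# Rows copied from the conversion matrix; None marks an unsupported pair.
# PDF is parse-only, so it appears only as a source.
MATRIX = {
    "HWP": ("Lossless", "Semantic", "Semantic", "Text-level", "Text-level"),
    "HWPX": ("Semantic", "Lossless", "Semantic", "Text-level", "Text-level"),
    "DOCX": ("Semantic", "Semantic", "Lossless", "Text-level", "Text-level"),
    "PDF": (None, None, None, "Text-level", "Text-level"),
    "MD": ("From Scratch", None, None, None, "Full"),
    "HTML": ("From Scratch", None, None, "Full", None),
}


def fidelity(source: str, target: str) -> Optional[str]:
    """Return the matrix entry for a source/target pair, or None if unsupported."""
    row = MATRIX[source.upper()]
    return row[TARGETS.index(target.upper())]


print(fidelity("HWP", "DOCX"))
print(fidelity("PDF", "MD"))
print(fidelity("MD", "DOCX"))
```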
Known Limitations
From Scratch mode (used for cross-format and structural edits):
- DrawingBlock, ChartBlock, TextArtBlock cannot be regenerated without the original file and are reported as FORMAT_LIMIT loss
- Complex table structures (merged cells, nested tables) may not fully survive HWPX/DOCX → HWP conversion
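A pre-flight check for the first limitation might look like the following. It is a sketch under assumptions: the block type names and the FORMAT_LIMIT label come from this README, while the `format_limit_losses` helper and the shape of its records are invented for illustration and do not describe the library's actual loss report.

```python
# Block types that, per the limitation above, need the original file.
NEEDS_ORIGINAL = {"DrawingBlock", "ChartBlock", "TextArtBlock"}


def format_limit_losses(block_types: list[str], has_original: bool) -> list[dict]:
    """List the blocks that From Scratch rendering would drop without an original."""
    if has_original:
        return []
    return [
        {"index": i, "block": name, "loss": "FORMAT_LIMIT"}
        for i, name in enumerate(block_types)
        if name in NEEDS_ORIGINAL
    ]


doc_blocks = ["HeadingBlock", "ChartBlock", "ParagraphBlock", "DrawingBlock"]
for entry in format_limit_losses(doc_blocks, has_original=False):
    print(entry)
```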
Validation rules:
- HWP: R1–R4 structural rules + I1–I3 integrity checks — fully implemented with auto-fixers
- HWPX: HX1–HX4 structural rules — fully implemented
- DOCX: D1–D3 structural rules — fully implemented
- PDF: format-specific rules planned (not needed until PDF rendering is added)
Text-level formats (MD, HTML):
- Formatting (fonts, colors, margins), images, and page layout are not preserved
- Useful for text extraction and content editing, not visual fidelity
Architecture
Input File ──▶ Parser ──▶ UdfDocument ──▶ Renderer ──▶ Output File
                               │
                               ▼
                Document Model (blocks/inlines)
                               +
             Verbatim Layer (binary preservation)
                               +
                Loss Report (what was dropped)
Development
pytest # all tests
pytest tests/roundtrip/ # round-trip tests
pytest tests/validation/ # R-rule validation
ruff check . && ruff format . # lint + format
mypy udf/ # type check
License
Business Source License 1.1 (BUSL-1.1) — see LICENSE and NOTICE.
Non-commercial, academic, and personal use is free. For commercial or production use, contact h000000nkim@gmail.com.