Pack and unpack text files into machine-readable XML
mdbox
Pack and unpack text files into structured, plain-text archives with embedded XML. Perfectly formatted for LLM context windows.
Overview
The mdbox utility bundles text-based file directories into a single plain-text archive structured with an embedded XML block. It provides a zipfile-like Python API and a tar-style CLI for creating, extracting, modifying, and deleting entries.
Every archive embeds a visual directory tree and stores file contents safely in CDATA sections. The result is an archive that is completely Git-friendly and easily parseable by any standard XML tool.
🤖 Built for LLM Pipelines
Standard archive formats (.zip, .tar.gz) are binary, so a Large Language Model cannot read their contents directly. mdbox solves this by creating self-contained, text-only bundles.
By utilizing the preamble and epilogue features (which allow you to attach arbitrary text before and after the XML data), a single mdbox archive becomes a complete prompt + context bundle. Just attach your system instructions in the preamble, pack the relevant codebase, and feed the single file directly into your LLM pipeline.
Key Features
Core Capabilities
- Human-readable XML: Archives are plain text, Git-friendly, and inspectable in any text editor.
- Embedded Directory Tree: Every archive includes a visual, tree-style directory listing at a glance.
- Preamble & Epilogue: Prepend and append arbitrary text (like Markdown prompts or system instructions) around the XML block.
- Secure Extraction: Sandboxed unpacking prevents malicious path traversal (e.g., ../ attacks).
- Strict Content Validation: Built-in UTF-8 enforcement and XML 1.0 compatibility checks with precise error localization.
Developer Experience
- zipfile-compatible Python API: Familiar methods like open(), write(), readstr(), extractall(), and namelist().
- tar-style CLI: Quick and familiar command-line interface with bundled flags (-cvf).
- Async Extraction Pipeline: Concurrent file writing powered by asyncio and aiofile for maximum I/O performance.
- Transactional Safety: Atomic repacking for add/delete operations ensures no partial writes corrupt your archive if an exception occurs.
- Lazy File Reading: Disk sources are only read when content is explicitly accessed or the archive is flushed.
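To illustrate what lazy reading means in practice, here is a minimal sketch using only the standard library. The `LazyEntry` class is hypothetical and is not part of the mdbox API; it only shows the general pattern of remembering a path and deferring the disk read until the content is first requested.

```python
from functools import cached_property
from pathlib import Path


class LazyEntry:
    """Hypothetical entry that stores a disk path and defers reading it."""

    def __init__(self, arcname: str, source: Path) -> None:
        self.arcname = arcname
        self.source = source

    @cached_property
    def content(self) -> str:
        # The first access reads the file; later accesses reuse the cached string.
        return self.source.read_text(encoding="utf-8")
```

With this pattern, registering many files costs nothing up front, and a file that changes between registration and first access is read in its latest state.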
Installation
Requires Python 3.12+.
# Standard pip
pip install mdbox
# With uv (Recommended)
uv add mdbox
Quick Start
1. Create an Archive
Pack a directory and a specific file into a single .xml bundle.
CLI:
mdbox -cvf backup.xml src/ README.md
Python:
import mdbox
with mdbox.open("backup.xml", mode="w") as qf:
    qf.write("src")
    qf.write("README.md")
2. Extract an Archive
Unpack the bundle back to your local disk.
CLI:
mdbox -xf backup.xml output/
Python:
with mdbox.open("backup.xml", mode="r") as qf:
    qf.extractall("output")
The Archive Format
An mdbox archive consists of three distinct sections:
- Preamble: Arbitrary text (Markdown, system prompts, prose).
- <archive> XML block: The structured file data and directory tree.
- Epilogue: Arbitrary trailing text (metadata, formatting closures).
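Because file contents are stored in <![CDATA[...]]> blocks, markup characters such as <, >, and & are preserved verbatim without entity encoding. The one sequence a CDATA section cannot hold literally is ]]>, which XML requires to be split across two adjacent sections.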
# Project Snapshot
> System Prompt: Review the following codebase for security vulnerabilities.
<archive version="1.0">
<directory_tree><![CDATA[
.
├── src/
│   ├── main.py
│   └── utils/
│       └── helpers.py
└── README.md
]]></directory_tree>
<file path="README.md">
<content><![CDATA[# My Project
A sample project.
]]></content>
</file>
<file path="src/main.py">
<content><![CDATA[print("hello")
]]></content>
</file>
</archive>
---
*End of context bundle.*
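The layout above can be written and read back with nothing but the Python standard library. The following is a sketch under stated assumptions, not mdbox's own implementation: `cdata`, `build_archive`, and `split_archive` are hypothetical helpers, the directory tree element is omitted for brevity, and the section splitting is deliberately naive (it assumes the preamble does not contain the text `<archive` and the epilogue does not contain `</archive>`).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any literal ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_archive(files: dict[str, str], preamble: str = "", epilogue: str = "") -> str:
    """Assemble preamble + <archive> block + epilogue as one string."""
    parts = ['<archive version="1.0">']
    for path, content in sorted(files.items()):
        parts.append(f"<file path={quoteattr(path)}>")
        parts.append(f"<content>{cdata(content)}</content>")
        parts.append("</file>")
    parts.append("</archive>")
    return preamble + "\n".join(parts) + epilogue


def split_archive(text: str) -> tuple[str, dict[str, str], str]:
    """Return (preamble, {path: content}, epilogue) from an archive string."""
    start = text.index("<archive")
    end = text.rindex("</archive>") + len("</archive>")
    root = ET.fromstring(text[start:end])
    files = {f.attrib["path"]: f.findtext("content", default="") for f in root.iter("file")}
    return text[:start], files, text[end:]
```

A round trip such as `split_archive(build_archive({"a.py": "x = 1\n"}, preamble="# Prompt\n"))` returns the original preamble and file contents. Note that XML parsers normalize `\r\n` line endings to `\n`, so content with Windows line endings will not round-trip byte for byte through a generic parser.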
CLI Reference
The mdbox utility supports standard tar-style bundled flags. Note: The -f flag must always come last in a bundle.
Create (-c)
# Basic creation
mdbox -cvf archive.xml src/ docs/ README.md
# Creation with a preamble and epilogue (literal text or a file path)
mdbox -cvf archive.xml --preamble "Build 2024-01-15" --epilogue license.txt src/
Extract (-x)
# Extracts to default (.) or specified output directory
mdbox -xf archive.xml output/
Add / Upsert (-a)
Creates the archive if it doesn't exist, or safely merges new entries into an existing one via atomic replacement.
mdbox -avf archive.xml new_module.py
Delete (--delete)
Removes files or whole directory prefixes. Uses atomic repacking to prevent corruption.
mdbox --delete -f archive.xml old_module.py src/deprecated/
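The atomic replacement mentioned for add and delete operations usually follows a write-then-rename pattern. The sketch below shows that general technique with the standard library; `atomic_write_text` is a hypothetical helper and is not taken from the mdbox source.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write text to a temp file next to target, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        # On POSIX the rename is atomic when both paths are on the same filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        # Any failure leaves the original file untouched; only the temp file is removed.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Because the temporary file lives in the same directory as the target, readers see either the complete old archive or the complete new one, never a half-written file.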
Global Options
| Flag | Description |
|---|---|
| -c | Create a new archive |
| -x | Extract an archive |
| -a | Add/upsert files into an archive |
| --delete | Remove files from an archive |
| -f <file> | Archive file path (required) |
| -v | Verbose output |
| --debug | Structured debug logging |
| --preamble <text\|file> | Text or file content to prepend before XML |
| --epilogue <text\|file> | Text or file content to append after XML |
Python API Reference
Opening & Iterating
import mdbox
# Write mode (creates or overwrites)
with mdbox.open("archive.xml", mode="w") as qf:
    qf.write("src")
# Read mode (parses existing archive)
with mdbox.open("archive.xml", mode="r") as qf:
    for info in qf:
        print(f"File: {info.name}, Size: {info.length} bytes")
Advanced Writing
with mdbox.open("archive.xml", mode="w") as qf:
    qf.write("main.py")                          # Add single file
    qf.write("src")                              # Add entire directory
    qf.write("build/out.js", arcname="dist.js")  # Override internal path
    qf.writestr("virtual.txt", "hello world")    # Write straight from memory
Advanced Reading
import io
with mdbox.open("archive.xml", mode="r") as qf:
    names = qf.namelist()                     # ['src/main.py', ...]
    text_content = qf.readstr("src/main.py")  # Returns decoded string
    raw_bytes = qf.read("src/main.py")        # Returns raw bytes
    # Access injected LLM prompts
    print("Prompt:", qf.preamble)
    print("Trailing:", qf.epilogue)
# mdbox fully supports in-memory file-like objects
with io.BytesIO() as buffer:
    with mdbox.open(buffer, mode="w") as qf:
        qf.writestr("test.txt", "hello")
Safe Extraction
with mdbox.open("archive.xml", mode="r") as qf:
    # Extract everything
    qf.extractall("output/")
    # Extract conditionally
    python_files = [info for info in qf if info.name.endswith(".py")]
    qf.extractall("src_only/", members=python_files)
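For readers curious what "safe" means at the path level, the following sketch shows one common containment check built on `Path.resolve()` and `Path.relative_to()`. It is an illustration only: `safe_target` and `UnsafePathError` are hypothetical names, and mdbox's real checks may differ in detail.

```python
from pathlib import Path, PurePosixPath


class UnsafePathError(ValueError):
    """Hypothetical stand-in for an error such as PathTraversalError."""


def safe_target(dest: Path, member: str) -> Path:
    """Resolve member under dest, refusing absolute paths and '..' escapes."""
    if PurePosixPath(member).is_absolute() or Path(member).is_absolute():
        raise UnsafePathError(f"absolute path not allowed: {member}")
    root = dest.resolve()
    candidate = (root / member).resolve()
    try:
        # relative_to() raises ValueError when candidate is not inside root.
        candidate.relative_to(root)
    except ValueError:
        raise UnsafePathError(f"path escapes destination: {member}") from None
    return candidate
```

Resolving before comparing matters: a member such as `src/../../evil.txt` looks harmless as a string prefix but lands outside the destination once `..` segments are collapsed.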
Exception Handling
The mdbox library provides strict validation. Malformed inputs or malicious extraction paths raise explicit exceptions:
from mdbox import BinaryFileError, PathTraversalError
try:
    with mdbox.open("archive.xml", mode="w") as qf:
        qf.write("image.png")
except BinaryFileError as e:
    print(f"Rejected: {e}")  # Triggers if file fails UTF-8 checks
try:
    with mdbox.open("archive.xml", mode="r") as qf:
        qf.extractall()
except PathTraversalError as e:
    print(f"Blocked malicious path: {e}")  # Triggers on absolute paths or ../
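The kind of content check that leads to a rejection like the one above can be sketched with the standard library. This is an assumption-laden illustration, not mdbox's validator: `validate_text` and `ContentError` are hypothetical names, and the character class follows the `Char` production of the XML 1.0 specification, under which C1 controls are legal but discouraged (a stricter policy would add that range to the rejected set).

```python
import re

# Anything outside the XML 1.0 Char production (tab, LF and CR are allowed).
_XML10_ILLEGAL = re.compile("[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")


class ContentError(ValueError):
    """Hypothetical error type that reports where validation failed."""


def validate_text(data: bytes) -> str:
    """Decode as UTF-8 and reject characters XML 1.0 cannot represent."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ContentError(f"invalid UTF-8 at byte offset {exc.start}") from exc
    match = _XML10_ILLEGAL.search(text)
    if match:
        pos = match.start()
        line = text.count("\n", 0, pos) + 1
        column = pos - (text.rfind("\n", 0, pos) + 1) + 1  # 1-based column
        raise ContentError(
            f"illegal character U+{ord(match.group()):04X} at line {line}, column {column}"
        )
    return text
```

Reporting a byte offset or a line and column, rather than a bare failure, is what makes it practical to find the offending character in a large file.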
Architecture & Design
- Security: Extraction paths are strictly validated using Path.relative_to(). Absolute paths and .. escape attempts are blocked outright.
- Data Validation: Files must pass UTF-8 decoding, and content is scanned to ensure XML 1.0 compatibility (blocking NULL and C0/C1 control characters) to guarantee parseability.
- Performance:
  - In read mode, data is extracted via memoryview slicing directly from raw bytes to skip redundant parsing overhead.
  - extractall() leverages a bounded async queue and concurrent workers.
- Transactional Safety: If a with block encounters an exception, __exit__ safely aborts without writing, avoiding corrupt output states.
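The bounded queue with concurrent workers mentioned under Performance is a standard asyncio pattern. The sketch below shows the idea with the standard library only; it substitutes `asyncio.to_thread` for aiofile, omits error recovery, and uses the hypothetical name `extract_all`, so it should be read as an illustration of the pattern rather than mdbox's pipeline.

```python
import asyncio
from pathlib import Path


async def extract_all(
    files: dict[str, str], dest: Path, workers: int = 4, max_pending: int = 16
) -> None:
    """Write files concurrently; the bounded queue applies backpressure."""
    queue: asyncio.Queue[tuple[str, str] | None] = asyncio.Queue(maxsize=max_pending)

    def write_one(name: str, content: str) -> None:
        target = dest / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    async def worker() -> None:
        while (item := await queue.get()) is not None:
            # Blocking file I/O runs in a thread so the event loop stays responsive.
            await asyncio.to_thread(write_one, *item)

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for item in files.items():
        await queue.put(item)  # suspends while the queue is full
    for _ in tasks:
        await queue.put(None)  # one stop sentinel per worker
    await asyncio.gather(*tasks)
```

The `maxsize` bound keeps memory use flat for large archives: the producer can only run `max_pending` entries ahead of the writers.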
Development
# Clone and sync dependencies
git clone https://github.com/chgroeling/mdbox.git
cd mdbox
uv sync --all-extras
# Run full quality gate (format, lint, type-check, test)
uv run ruff format src/ tests/ && \
uv run ruff check src/ tests/ && \
uv run mypy src/ && \
uv run pytest
# Check coverage
uv run pytest --cov=mdbox --cov-report=html
License
MIT — see LICENSE.