Pack and unpack text files into machine-readable XML

mdbox

Python 3.12+ · License: MIT · Status: Alpha

Pack and unpack text files into structured, plain-text archives with embedded XML. Designed to fit into LLM context windows.

Overview

The mdbox utility bundles text-based file directories into a single plain-text archive structured with an embedded XML block. It provides a zipfile-like Python API and a tar-style CLI for creating, extracting, modifying, and deleting entries.

Every archive embeds a visual directory tree and stores file contents in CDATA sections. The result is plain text that diffs cleanly in Git, and the embedded XML block can be parsed by standard XML tools.

🤖 Built for LLM Pipelines

Standard archive formats (.zip, .tar.gz) are binary, so a Large Language Model cannot read their contents directly. mdbox addresses this by creating self-contained, text-only bundles.

With the preamble and epilogue features (which attach arbitrary text before and after the XML data), a single mdbox archive becomes a complete prompt + context bundle. Put your system instructions in the preamble, pack the relevant codebase, and feed the single file directly into your LLM pipeline.


Key Features

Core Capabilities

  • Human-readable XML: Archives are plain text, Git-friendly, and inspectable in any text editor.
  • Embedded Directory Tree: Every archive includes a tree-style directory listing, so its contents can be seen at a glance.
  • Preamble & Epilogue: Prepend and append arbitrary text (like Markdown prompts or system instructions) around the XML block.
  • Secure Extraction: Sandboxed unpacking prevents malicious path traversal (e.g., ../ attacks).
  • Strict Content Validation: Built-in UTF-8 enforcement and XML 1.0 compatibility checks with precise error localization.

Developer Experience

  • zipfile-compatible Python API: Familiar methods like open(), write(), readstr(), extractall(), and namelist().
  • tar-style CLI: Quick and familiar command-line interface with bundled flags (-cvf).
  • Async Extraction Pipeline: Concurrent file writing powered by asyncio and aiofile for maximum I/O performance.
  • Transactional Safety: Atomic repacking for add/delete operations ensures no partial writes corrupt your archive if an exception occurs.
  • Lazy File Reading: Disk sources are only read when content is explicitly accessed or the archive is flushed.

Installation

Requires Python 3.12+.

# Standard pip
pip install mdbox

# With uv (Recommended)
uv add mdbox

Quick Start

1. Create an Archive

Pack a directory and a specific file into a single .xml bundle.

CLI:

mdbox -cvf backup.xml src/ README.md

Python:

import mdbox

with mdbox.open("backup.xml", mode="w") as qf:
    qf.write("src")
    qf.write("README.md")

2. Extract an Archive

Unpack the bundle back to your local disk.

CLI:

mdbox -xf backup.xml output/

Python:

with mdbox.open("backup.xml", mode="r") as qf:
    qf.extractall("output")

The Archive Format

An mdbox archive consists of three distinct sections:

  1. Preamble: Arbitrary text (Markdown, system prompts, prose).
  2. <archive> XML block: The structured file data and directory tree.
  3. Epilogue: Arbitrary trailing text (metadata, formatting closures).

Because file contents are stored in <![CDATA[...]]> sections, markup characters such as <, >, and & are preserved verbatim without entity encoding.

# Project Snapshot
> System Prompt: Review the following codebase for security vulnerabilities.

<archive version="1.0">
  <directory_tree><![CDATA[
.
├── src/
│   ├── main.py
│   └── utils/
│       └── helpers.py
└── README.md
]]></directory_tree>
  <file path="README.md">
    <content><![CDATA[# My Project
A sample project.
]]></content>
  </file>
  <file path="src/main.py">
    <content><![CDATA[print("hello")
]]></content>
  </file>
</archive>

---
*End of context bundle.*
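
Because the XML block is ordinary XML, it can also be read without mdbox. The following is a minimal sketch using only the Python standard library. The helper names split_archive and read_files are hypothetical, and it assumes the preamble and epilogue do not themselves contain the strings "<archive" or "</archive>". It is not how mdbox parses archives internally.

```python
import xml.etree.ElementTree as ET

# A small archive in the layout shown above (shortened directory tree).
SAMPLE = '''# Project Snapshot
> System Prompt: Review the following codebase.

<archive version="1.0">
  <directory_tree><![CDATA[
.
├── src/
│   └── main.py
└── README.md
]]></directory_tree>
  <file path="README.md">
    <content><![CDATA[# My Project
A sample project.
]]></content>
  </file>
  <file path="src/main.py">
    <content><![CDATA[print("hello")
]]></content>
  </file>
</archive>

---
*End of context bundle.*
'''


def split_archive(text):
    """Split an archive into (preamble, xml_block, epilogue)."""
    end_tag = "</archive>"
    start = text.find("<archive")   # first opening tag
    end = text.rfind(end_tag)       # last closing tag
    if start == -1 or end == -1:
        raise ValueError("no <archive> block found")
    end += len(end_tag)
    return text[:start], text[start:end], text[end:]


def read_files(xml_block):
    """Map each <file path="..."> to the text of its <content> child."""
    root = ET.fromstring(xml_block)
    return {
        node.get("path"): node.findtext("content", default="")
        for node in root.iter("file")
    }


preamble, xml_block, epilogue = split_archive(SAMPLE)
files = read_files(xml_block)
print(sorted(files))  # ['README.md', 'src/main.py']
```

The XML parser returns CDATA content as plain text, so files["src/main.py"] holds the original source line including its trailing newline.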

CLI Reference

The mdbox utility supports standard tar-style bundled flags. Note: The -f flag must always come last in a bundle.

Create (-c)

# Basic creation
mdbox -cvf archive.xml src/ docs/ README.md

# Creation with a preamble and epilogue (literal text or a file)
mdbox -cvf archive.xml --preamble "Build 2024-01-15" --epilogue license.txt src/

Extract (-x)

# Extracts to default (.) or specified output directory
mdbox -xf archive.xml output/

Add / Upsert (-a)

Creates the archive if it doesn't exist, or safely merges new entries into an existing one via atomic replacement.

mdbox -avf archive.xml new_module.py

Delete (--delete)

Removes files or whole directory prefixes. Uses atomic repacking to prevent corruption.

mdbox --delete -f archive.xml old_module.py src/deprecated/
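
The atomic replacement used by add and delete can be pictured as "write the new archive to a temporary file, then swap it into place". The sketch below shows that general pattern with the standard library. The function name atomic_write_text is hypothetical, and this is an illustration of the technique rather than mdbox's actual implementation.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Replace `path` with `text` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)     # swap the finished file into place
    except BaseException:
        # On any failure, remove the temporary file and leave the original untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the write fails partway through, the existing archive is never opened for writing, which is why a failed repack cannot leave it half-written.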

Global Options

Flag                      Description
-c                        Create a new archive
-x                        Extract an archive
-a                        Add/upsert files into an archive
--delete                  Remove files from an archive
-f <file>                 Archive file path (required)
-v                        Verbose output
--debug                   Structured debug logging
--preamble <text|file>    Text or file content to prepend before XML
--epilogue <text|file>    Text or file content to append after XML

Python API Reference

Opening & Iterating

import mdbox

# Write mode (creates or overwrites)
with mdbox.open("archive.xml", mode="w") as qf:
    qf.write("src")

# Read mode (parses existing archive)
with mdbox.open("archive.xml", mode="r") as qf:
    for info in qf:
        print(f"File: {info.name}, Size: {info.length} bytes")

Advanced Writing

with mdbox.open("archive.xml", mode="w") as qf:
    qf.write("main.py")                       # Add single file
    qf.write("src")                           # Add entire directory
    qf.write("build/out.js", arcname="dist.js") # Override internal path
    qf.writestr("virtual.txt", "hello world")   # Write straight from memory

Advanced Reading

import io

with mdbox.open("archive.xml", mode="r") as qf:
    names = qf.namelist()                  # ['src/main.py', ...]
    text_content = qf.readstr("src/main.py") # Returns decoded string
    raw_bytes = qf.read("src/main.py")       # Returns raw bytes
    
    # Access the preamble and epilogue text
    print("Prompt:", qf.preamble)
    print("Trailing:", qf.epilogue)

# mdbox fully supports in-memory file-like objects
with io.BytesIO() as buffer:
    with mdbox.open(buffer, mode="w") as qf:
        qf.writestr("test.txt", "hello")

Safe Extraction

with mdbox.open("archive.xml", mode="r") as qf:
    # Extract everything
    qf.extractall("output/")

    # Extract conditionally
    python_files = [info for info in qf if info.name.endswith(".py")]
    qf.extractall("src_only/", members=python_files)
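
To illustrate what a sandboxed extraction check can look like, here is a minimal sketch built on Path.relative_to(). The function name safe_target is hypothetical and it raises a plain ValueError; mdbox's own check and its PathTraversalError may differ in detail.

```python
from pathlib import Path


def safe_target(dest_root, member_name):
    """Return the resolved output path for a member, or raise if it escapes dest_root."""
    root = Path(dest_root).resolve()
    # Joining with an absolute member name discards `root`, and resolve()
    # collapses any ".." segments, so both escapes are caught below.
    candidate = (root / member_name).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"unsafe archive path: {member_name!r}") from None
    return candidate


print(safe_target("output", "src/main.py").name)  # main.py
```

Calling safe_target("output", "../evil.txt") or safe_target("output", "/etc/passwd") raises ValueError, because neither resolved path lies under the output directory.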

Exception Handling

The mdbox library validates strictly. Malformed inputs or malicious extraction paths raise explicit errors:

import mdbox
from mdbox import BinaryFileError, PathTraversalError

try:
    with mdbox.open("archive.xml", mode="w") as qf:
        qf.write("image.png") 
except BinaryFileError as e:
    print(f"Rejected: {e}") # Triggers if file fails UTF-8 checks

try:
    with mdbox.open("archive.xml", mode="r") as qf:
        qf.extractall()
except PathTraversalError as e:
    print(f"Blocked malicious path: {e}") # Triggers on absolute paths or ../
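
For intuition about the content checks, the sketch below decodes bytes as UTF-8 and then scans for characters outside the XML 1.0 Char production, reporting the line and column of the first offender. The function name check_text is hypothetical and it raises ValueError rather than BinaryFileError. It follows the XML 1.0 specification only, so unlike mdbox as described in this README it does not reject C1 control characters.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHAR = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_text(data):
    """Decode `data` as UTF-8 and verify it can be stored in an XML 1.0 document."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8 at byte offset {exc.start}") from exc

    match = _INVALID_XML_CHAR.search(text)
    if match:
        pos = match.start()
        line = text.count("\n", 0, pos) + 1
        column = pos - (text.rfind("\n", 0, pos) + 1) + 1
        raise ValueError(
            f"character U+{ord(match.group()):04X} is not allowed in XML 1.0 "
            f"(line {line}, column {column})"
        )
    return text
```

A PNG file, for example, typically fails at the UTF-8 step already, since its header begins with the byte 0x89, which is not a valid UTF-8 start byte.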

Architecture & Design

  • Security: Extraction paths are strictly validated using Path.relative_to(). Absolute paths and .. escape attempts are blocked outright.
  • Data Validation: Files must pass UTF-8 decoding, and content is scanned to ensure XML 1.0 compatibility (blocking NULL and C0/C1 control characters) to guarantee parseability.
  • Performance:
    • In read mode, data is extracted via memoryview slicing directly from raw bytes to skip redundant parsing overhead.
    • extractall() leverages a bounded async queue and concurrent workers.
  • Transactional Safety: If a with block encounters an exception, __exit__ safely aborts without writing, avoiding corrupt output states.
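
The bounded-queue design mentioned under Performance can be sketched with asyncio alone. In this illustration the name extract_all, the worker count, and the queue size are all assumptions, and asyncio.to_thread stands in for aiofile so the example needs no third-party packages. It omits path validation and error recovery, so treat it as a picture of the pattern, not as mdbox's pipeline.

```python
import asyncio
from pathlib import Path


async def extract_all(files, dest, workers=4, queue_size=16):
    """Write a {relative_path: text} mapping below `dest` using concurrent workers."""
    root = Path(dest)
    queue = asyncio.Queue(maxsize=queue_size)  # bounded: put() waits when full

    def write_one(rel_path, text):
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    async def worker():
        while True:
            item = await queue.get()
            try:
                if item is None:  # sentinel: no more work for this worker
                    return
                # Run the blocking write in a thread so the event loop stays free.
                await asyncio.to_thread(write_one, *item)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for item in files.items():
        await queue.put(item)   # backpressure: waits while the queue is full
    for _ in tasks:
        await queue.put(None)   # one sentinel per worker
    await asyncio.gather(*tasks)
```

A call such as asyncio.run(extract_all({"src/main.py": "print('hello')\n"}, "output")) writes the file under output/. The bounded queue keeps only a limited number of pending entries in flight at once, which caps memory use on large archives.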

Development

# Clone and sync dependencies
git clone https://github.com/chgroeling/mdbox.git
cd mdbox
uv sync --all-extras

# Run full quality gate (format, lint, type-check, test)
uv run ruff format src/ tests/ && \
uv run ruff check src/ tests/ && \
uv run mypy src/ && \
uv run pytest

# Check coverage
uv run pytest --cov=mdbox --cov-report=html

License

MIT — see LICENSE.
