Generate contextual XML briefs for Git Commits and PRs.

git2xml

A zero-dependency CLI that generates structured XML briefs of your Git commits and pull requests - ready to paste into Claude, ChatGPT, or any LLM that benefits from clean context.

Why this exists

LLMs work better with structured context than with raw blobs of text. When you ask Claude to write a PR description, pasting git diff output is workable but lossy - it strips staging information, mixes binary and text files into a mess, and doesn't separate file contents from their diffs. git2xml solves this by formatting your git state into XML that LLMs can parse cleanly, with explicit file paths, statuses, diffs, and content sections.

The result: better-quality output from your AI assistant with less prompt engineering on your end.

Features

  • AI-ready output: Produces XML structured specifically for LLM consumption, with explicit file paths, change statuses, diffs, and content sections that models parse reliably.
  • One command per use case: git2xml commit for current changes, git2xml pr for branch-vs-base - no flag-juggling for common workflows.
  • Zero dependencies: Built entirely on the Python standard library. No supply-chain surface beyond Python itself.
  • Robust binary detection: Automatically excludes binary files using BOM detection and statistical character analysis to prevent XML corruption.
  • Smart XML escaping: Safely wraps code containing CDATA terminators using dynamic Markdown fencing.
  • Staging-aware: Differentiates between staged, unstaged, and untracked files for accurate commit briefs.
  • Context-budget controls: Per-file content and diff size caps (--max-size, --max-diff-size) keep oversized files and runaway diffs out of your prompt while still recording that the change happened.
  • Usable as a library: A small typed Python API returns the brief as a string (sync or async) for use inside scripts, agents, and LLM pipelines - not just the CLI.
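
The binary-detection bullet above names two signals, a byte-order mark and statistical character analysis. A minimal sketch of how such a heuristic could look follows; the function name looks_binary, the 30% threshold, and the exact byte classes are assumptions made for illustration, not git2xml's actual implementation.

```python
import codecs

# Hypothetical heuristic for illustration only; git2xml's real rules may differ.
# UTF-32 BOMs are listed before UTF-16 because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_TEXT_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def looks_binary(sample: bytes, threshold: float = 0.30) -> bool:
    """Guess whether a leading sample of a file is binary."""
    if not sample:
        return False  # an empty file is treated as text
    if sample.startswith(_TEXT_BOMS):
        return False  # a BOM marks the data as encoded text
    if b"\x00" in sample:
        return True  # NUL bytes without a BOM are a strong binary signal
    # Count control bytes other than tab, line feed, and carriage return.
    suspicious = sum(1 for b in sample if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    return suspicious / len(sample) > threshold
```

Checking the BOM first matters because UTF-16 text is full of NUL bytes and would otherwise be misclassified.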

Requirements

Python 3.9 or higher. No other dependencies.

Installation

pip install git2xml

Usage

Run from inside any local Git repository, or target one with --repo PATH.

Generate a commit brief

Summarizes your currently modified files (or staged files if using the --staged flag) against HEAD.

git2xml commit

Outputs to commit_brief.xml by default, written to the directory you ran the command from.

Generate a pull request brief

Summarizes all changes on your current branch against a base branch (defaults to main).

git2xml pr --base main --output my_pr_summary.xml

Content-control flags

Five optional flags let you shape what ends up in the brief:

| Flag | Description |
| --- | --- |
| --no-untracked | Exclude untracked files (new files not yet added with git add) from a commit brief. No-op when --staged is set (staged mode already excludes them) or in PR mode (no untracked files exist there). |
| --max-size N | Override the per-file content size threshold (in bytes), above which file content is omitted and replaced with a reason string. Does not apply to diffs - that is --max-diff-size (see "Size limits: content vs. diffs"). The file's <diff> is still emitted, so the change stays visible. Defaults to 5 MiB (5242880). Must be a positive integer; --max-size 0 or a negative value exits with an error. |
| --max-diff-size N | Override the per-file diff size threshold (in bytes, UTF-8). A diff larger than this is dropped from the output - its <diff> slot renders status="omitted" with a reason while the <content> stays. Unlike --max-size, this is output-shaping, not a pre-fetch guard (a diff has no size git can report before computing it). Defaults to 1 MiB (1048576). Must be >= 0; --max-diff-size 0 disables the cap (diffs are always included in full). |
| --no-content | Produce a diff-only brief - all <content> bodies are suppressed and every file is represented by its <diff>. For newly added and untracked files (which have no prior version to diff against), the diff is the full file content shown as added (+) lines - so a diff-only brief still captures new files completely. |
| --strict-xml | Generate strict XML 1.0 output - escape control characters and split CDATA terminators. When the flag is not set (default), exact file fidelity is prioritized, falling back to Markdown fencing when a CDATA terminator is present. See the XML Compliance vs. File Fidelity section below for more details. |

These flags compose freely with each other and with --staged:

git2xml commit --no-untracked              # omit untracked files
git2xml commit --max-size 102400           # cap content at 100 KiB
git2xml commit --max-diff-size 262144      # drop any single diff over 256 KiB
git2xml commit --no-content                # diffs only, no file bodies
git2xml commit --no-untracked --no-content # combine: drop untracked, diffs only

Note - new files under --no-content: Normally a brand-new file's change is carried by its <content>. Because --no-content suppresses content, git2xml instead emits the file's add-diff (every line shown as an added + line), so the file's contents are still present in the brief - just rendered as a diff rather than a content block. Untracked files (not yet git added) are diffed against an empty file to produce the same result. This applies only under --no-content; in the default mode new files render as normal <content>.
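
To make the "add-diff" idea concrete, here is a minimal sketch that diffs a new file against an empty one using Python's difflib. It only approximates what git prints (git adds its own header lines), and the helper name add_diff is an assumption for illustration, not part of git2xml.

```python
import difflib


def add_diff(path: str, text: str) -> str:
    """Render a brand-new file as a unified diff against an empty file.

    Hypothetical helper: git2xml obtains its diffs from git itself, so the
    exact headers in a real brief will differ from difflib's output.
    """
    new_lines = text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        [],                      # the "before" side is empty
        new_lines,               # the "after" side is the whole file
        fromfile="/dev/null",
        tofile=f"b/{path}",
    )
    return "".join(diff)


print(add_diff("src/new_module.py", "import os\nprint(os.getcwd())\n"))
```

Every body line after the hunk header starts with +, so the full file content is still recoverable from the diff alone.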

Size limits: content vs. diffs

git2xml caps two things independently - file content (--max-size) and a single file's diff (--max-diff-size). Same unit (bytes), different mechanics.

--max-size caps file content. Content size is read from git's metadata (ls-tree / cat-file) or the filesystem before the file is loaded, so an oversized file is detected and skipped without ever being read into memory - the guard prevents the work. The file's <file> element and <diff> are still emitted, so the change stays visible.
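
The pre-fetch guard described above can be sketched for the filesystem case as follows. This is an illustration under stated assumptions: read_if_small is a hypothetical name, and only the working-tree path is shown (for blobs, git2xml reads the size from git metadata instead).

```python
import os

DEFAULT_MAX_SIZE = 5 * 1024 * 1024  # 5 MiB, the documented default


def read_if_small(path, max_size=DEFAULT_MAX_SIZE):
    """Return (content, reason); content is None when the file is too large.

    Hypothetical helper: the size comes from file metadata, so an oversized
    file is never opened or read into memory.
    """
    size = os.path.getsize(path)  # metadata lookup only, no file read
    if size > max_size:
        return None, f"omitted - file exceeds {max_size} bytes"
    with open(path, "rb") as fh:
        return fh.read(), None
```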

--max-diff-size caps a single file's diff. Unlike content, a diff has no size git can report in advance - it exists only once git computes it - so the cap can't prevent the work the way --max-size does. Instead the diff is streamed and abandoned once it crosses the limit (git2xml stops reading rather than buffering the whole thing), then dropped from the output: its <diff> slot renders status="omitted" with a reason while the <content> stays. This keeps a runaway diff - a big generated or vendored file, or a large deleted file whose only payload is its diff - out of your context budget. Defaults to 1 MiB; pass --max-diff-size 0 to disable it and always include diffs in full.
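
The "stream and abandon" behavior can be sketched as a loop that stops consuming as soon as the running total crosses the cap. The name read_capped and the chunk-iterator interface are assumptions for illustration; the real tool reads from a git subprocess.

```python
from typing import Iterable, Optional


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> Optional[bytes]:
    """Join chunks into one diff, or return None once the cap is exceeded.

    Hypothetical helper. max_bytes == 0 disables the cap, mirroring the
    documented meaning of --max-diff-size 0.
    """
    kept = []
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if max_bytes and total > max_bytes:
            return None  # stop reading; the rest of the stream is never pulled
        kept.append(chunk)
    return b"".join(kept)
```

Because the function returns from inside the loop, a lazy source (a generator or a pipe) is not drained past the point where the limit is crossed.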

Execution options

| Flag | Description |
| --- | --- |
| --git-timeout N | Per-git-command timeout in seconds. Raise it for very large repos where a single diff/log can take a while. Default: 30. |
| --diff-semaphore-limit N | Max number of diffs fetched concurrently. Default: 20. Lower it to reduce load; raise it for more parallelism on fast disks. |
| --verbose, -v | Verbose logging. Logs per-file and per-commit progress, as well as debug log messages. |
| --hide-repo-path | Emit only the repository's directory name in the root <{commit,pr}_brief repo="..."> attribute instead of its absolute local path. Use when sharing briefs externally. Individual file path attributes are always repo-relative and unaffected. Default: off (the absolute path is emitted). |
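
How a concurrency limit and a per-command timeout typically combine in asyncio code can be sketched like this. The names fetch_all and fetch_one are hypothetical, and this is a generic pattern rather than git2xml's internal code.

```python
import asyncio


async def fetch_all(paths, fetch_one, limit=20, timeout=30.0):
    """Run fetch_one(path) for every path with bounded concurrency.

    Hypothetical sketch: `limit` plays the role of --diff-semaphore-limit
    and `timeout` the role of --git-timeout.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(path):
        async with sem:  # at most `limit` fetches are in flight at once
            # wait_for raises asyncio.TimeoutError if one fetch takes too long
            return await asyncio.wait_for(fetch_one(path), timeout)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(p) for p in paths))
```

Lowering the limit trades throughput for fewer simultaneous git processes; raising the timeout only affects how long a single slow command is tolerated.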

Output location

The brief is written to the directory you ran the command from, using the name from --output (or the {command}_brief.xml default). A relative --output is resolved against your current directory; an absolute path is honored as given. Note this is independent of --repo: pointing --repo at another repository still writes the brief to where you invoked the command, not into that repository.
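
The resolution rule above can be expressed in a few lines. This is a sketch of the documented behavior, not the tool's source; resolve_output is a hypothetical name.

```python
from pathlib import Path


def resolve_output(command, output=None, cwd=None):
    """Return the path a brief would be written to.

    Hypothetical helper. `command` is "commit" or "pr"; `cwd` stands in for
    the directory the command was invoked from (never the --repo path).
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    name = Path(output) if output else Path(f"{command}_brief.xml")
    # Absolute paths are honored as given; relative ones resolve against cwd.
    return name if name.is_absolute() else base / name
```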

Use as a Python library

Beyond the CLI, git2xml exposes a small programmatic API that returns the brief as a string (nothing is written to disk), so you can feed it straight into an LLM call, an agent pipeline, or any tool that assembles context.

import git2xml
from git2xml import Git2xmlConfig

# Synchronous - for plain scripts
xml = git2xml.generate_commit_brief_sync(Git2xmlConfig(repo="/path/to/repo"))

# A PR brief against a base branch
xml = git2xml.generate_pr_brief_sync(Git2xmlConfig(repo=".", base="develop"))

The engine is asyncio-based, so async callers (agents, web handlers) can await the native coroutines directly instead of blocking their event loop:

import asyncio
import git2xml
from git2xml import Git2xmlConfig

async def main():
    xml = await git2xml.generate_commit_brief(Git2xmlConfig(repo=".", staged=True))
    # ... hand `xml` to your model / agent ...

asyncio.run(main())

Windows note: the async functions spawn git via asyncio subprocesses, which on Windows require the ProactorEventLoop. asyncio.run(...) (above) selects it for you, so the normal case needs no action. Only if you supply your own event loop on Windows must it be a ProactorEventLoop - the SelectorEventLoop cannot create subprocesses and the call will fail. The sync wrappers and the CLI are unaffected.

All options live on the typed Git2xmlConfig object - the same settings the CLI flags map to (repo, base, staged, strict_xml, no_untracked, max_size, max_diff_size, no_content, git_timeout, diff_semaphore_limit, hide_repo_path). The function name selects the mode, so you never set command yourself:

config = Git2xmlConfig(repo=".", base="main", strict_xml=True, max_size=100_000)
xml = git2xml.generate_pr_brief_sync(config)

API reference

| Function | Sync/Async | Returns |
| --- | --- | --- |
| generate_commit_brief(config) | async | XML string |
| generate_pr_brief(config) | async | XML string |
| generate_commit_brief_sync(config) | sync | XML string |
| generate_pr_brief_sync(config) | sync | XML string |
  • Each returns the brief as a string, or an empty string "" when there is nothing to summarize (a clean working tree, or no commits between the branch and its base).
  • Failures raise git2xml.Git2xmlError, or a more specific subclass: NotAGitRepositoryError, GitNotInstalledError, GitCommandError.
  • The *_sync helpers cannot be called from inside a running event loop (e.g. a Jupyter cell or an async handler); they raise a clear RuntimeError if misused. Use the async variants there instead.
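
The restriction on the *_sync helpers follows from how a synchronous wrapper around a coroutine usually works. A generic sketch of that pattern is below; run_sync is a hypothetical name, and this is not claimed to be git2xml's own wrapper.

```python
import asyncio


def run_sync(coro_factory):
    """Run an async function to completion from synchronous code.

    Hypothetical wrapper. It takes a zero-argument callable that returns a
    coroutine, so no coroutine is created when the call has to be refused.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so starting one is safe.
        return asyncio.run(coro_factory())
    raise RuntimeError(
        "called from a running event loop; await the async variant instead"
    )
```

asyncio.run cannot be nested inside an already running loop, which is why a wrapper of this shape has to refuse rather than block.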

Example output

A commit brief covering one new file, one modified file, a newly added symlink, and a modified binary file looks like this. Content and diffs are wrapped in CDATA so source is embedded verbatim; the repository name is emitted as a <name> element; added files carry their full contents as <content> (no diff is needed since the content is the whole change); and binary files are listed with a reason but no body:

<commit_brief repo="/Users/dev/myapp">
  <name>myapp</name>
  <file path="src/tests/test_auth.py" status="added">
    <content format="cdata"><![CDATA[# New file contents
]]></content>
  </file>
  <file path="src/auth.py" status="modified">
    <content format="cdata"><![CDATA[def verify_token(token):
    return token in VALID_TOKENS and not is_expired(token)
]]></content>
    <diff format="cdata"><![CDATA[@@ -1,2 +1,2 @@
 def verify_token(token):
-    return token in VALID_TOKENS
+    return token in VALID_TOKENS and not is_expired(token)]]></diff>
  </file>
  <file path="src/config_loader.py" status="added">
    <content format="cdata"><![CDATA[Symlink pointing to: ../shared/config_loader.py]]></content>
  </file>
  <file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
</commit_brief>

The repo attribute shows the absolute path by default; run with --hide-repo-path to emit just the directory name (repo="myapp") when sharing the brief externally.

A file whose content is omitted by --max-size still carries its <diff> and an explanatory reason:

  <file path="data/big.csv" status="modified" reason="omitted - file exceeds 5242880 bytes">
    <diff format="cdata"><![CDATA[@@ ... @@]]></diff>
  </file>

A file whose diff is dropped by --max-diff-size keeps its <content> and marks the omission on the diff slot:

  <file path="vendor/bundle.js" status="modified">
    <content format="cdata"><![CDATA[/* ... file contents ... */]]></content>
    <diff status="omitted" reason="diff exceeds 1048576 bytes" />
  </file>

PR mode wraps the same <file> elements and additionally emits a <commit_log> of the branch's commits.
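
If you want to post-process a brief rather than paste it, the element layout shown above is easy to walk with the standard library. The sketch below parses a small hand-written sample in the same shape; summarize is a hypothetical helper, and for real briefs a strict parser is only safe on output generated with --strict-xml.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mirroring the documented layout (not real tool output).
SAMPLE = """<commit_brief repo="myapp">
  <name>myapp</name>
  <file path="src/auth.py" status="modified">
    <content format="cdata"><![CDATA[def verify_token(token):
    return True
]]></content>
    <diff format="cdata"><![CDATA[@@ -1,2 +1,2 @@]]></diff>
  </file>
  <file path="vendor/bundle.js" status="modified">
    <content format="cdata"><![CDATA[/* ... */]]></content>
    <diff status="omitted" reason="diff exceeds 1048576 bytes" />
  </file>
  <file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
</commit_brief>"""


def summarize(xml_text):
    """Return (path, status, has_content, has_diff) for each <file>."""
    root = ET.fromstring(xml_text)
    rows = []
    for f in root.iter("file"):
        diff = f.find("diff")
        # A diff slot marked status="omitted" carries no diff body.
        has_diff = diff is not None and diff.get("status") != "omitted"
        rows.append((f.get("path"), f.get("status"), f.find("content") is not None, has_diff))
    return rows


for row in summarize(SAMPLE):
    print(row)
```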

XML Compliance vs. File Fidelity

By default, git2xml prioritizes exact file fidelity over strict XML 1.0 compliance. AI models (like Claude) read raw token streams and do not use strict XML parsers.

  • Control Characters: Literal control bytes (e.g., 0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F) in your source code are passed through exactly as they appear in <content> and <diff> bodies. This also applies to control bytes inside attribute values (a file path or commit author): in default mode they pass through unescaped, so a control byte there (such as a stray newline in a path) can break attribute well-formedness. --strict-xml escapes control characters in attributes too.
  • CDATA Terminators: If a file contains the literal string ]]>, git2xml avoids splitting the CDATA section (which alters the raw text the LLM sees) and instead falls back to dynamic Markdown fencing (format="fenced").
  • Invalid UTF-8: Text is decoded as UTF-8 on a best-effort basis. Bytes that aren't valid UTF-8 are replaced with the Unicode replacement character (U+FFFD, ) rather than causing an error. Files git detects as binary are omitted entirely, so this affects only text files containing occasional malformed bytes.

If you are piping this output into a strict automated XML parser (like xml.etree or a CI/CD pipeline) rather than an LLM, you can use the --strict-xml flag. This will force strict XML 1.0 compliance by replacing control characters with their string representations (e.g., \x1b) and safely splitting CDATA terminators (]]]]><![CDATA[>).
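
The two encodings contrasted in this section can be sketched side by side. Both helpers below are illustrative assumptions (the names strict_cdata and fenced are hypothetical, and the exact fence rule is a guess at what "dynamic" means here); they show the technique, not git2xml's source.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow in character data.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strict_cdata(text):
    """Strict mode sketch: escape control chars, split CDATA terminators."""
    text = _CONTROL.sub(lambda m: "\\x%02x" % ord(m.group()), text)
    # "]]>" cannot appear inside CDATA, so end the section after "]]" and
    # reopen a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def fenced(text):
    """Fidelity mode sketch: pick a backtick fence longer than any run in text."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}\n{text}\n{fence}"


raw = "a ]]> b \x1b[0m"
elem = ET.fromstring("<content>" + strict_cdata(raw) + "</content>")
print(elem.text)
```

A strict parser reassembles the split sections into the original text, with the escape byte shown in its \x1b string form. The fenced variant leaves the text byte-for-byte intact but is not something an XML parser understands.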

Symlinks and file content

git2xml mirrors git's own behavior for symbolic links: it emits the link's target path as the content (e.g. Symlink pointing to: ../shared/config.py), never the contents of the file the link points to. It does not follow or dereference symlinks, so a link pointing outside the repository is recorded as a path, not read.
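
For a working-tree file, the "record the target, never follow it" rule can be sketched with os.readlink. The helper name describe_symlink is hypothetical; the message format is taken from the example above.

```python
import os


def describe_symlink(path):
    """Return the brief content for a symlink, or None for a regular path.

    Hypothetical helper: os.readlink returns the stored target string and
    never opens the file the link points to.
    """
    if os.path.islink(path):
        return f"Symlink pointing to: {os.readlink(path)}"
    return None
```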

More broadly, git2xml includes file contents and diffs verbatim, exactly as git sees them. It does not scan, filter, or redact content for secrets or sensitive data - that is deliberately out of scope for a zero-dependency git formatter. Review generated briefs before pasting them into any external tool, and use .gitignore (git2xml respects it for untracked files) or --no-untracked to keep files out of the output.

One path is included by default: the root element's repo attribute carries the absolute local path of the repository (e.g. repo="/Users/dev/myapp"), so a brief records where on your machine it was generated. The per-file path attributes are always repo-relative and never absolute. If you are pasting briefs into a third-party tool and would rather not disclose your local path (which can reveal a username or directory layout), pass --hide-repo-path to emit only the repository's directory name (repo="myapp") instead. The repository name is also always available separately in the <name> element.

Security: run against repositories you trust

git2xml works by invoking your local git to read a repository's diffs, blobs, and status. It therefore inherits git's normal behavior of running repository-defined programs during otherwise read-only operations - for example a textconv or external-diff driver referenced from .gitattributes and defined in the repository's .git/config, or an fsmonitor command. A repository you don't control - especially one delivered as an archive that ships its own .git directory, rather than a fresh clone - can therefore cause code to run on your machine when you point git2xml at it. This is a property of git itself, not specific to git2xml.

One caveat to the "read-only" framing: in --staged mode, git2xml runs git write-tree to read staged-file metadata in a single batch. This writes a tree object into .git/objects, but it leaves your index, working tree, and HEAD untouched, and the unreferenced object is reclaimed by git's normal gc.

The practical guidance is the same as for running any git command: only run git2xml against repositories you trust. To inspect an untrusted repository, do it in a throwaway sandbox (a container or VM) rather than on your primary machine.

Why XML (not JSON)?

XML was chosen because LLMs - Claude in particular - parse structured XML tags more reliably than nested JSON when the content includes code with embedded quotes, brackets, and special characters. CDATA sections let you embed raw source code verbatim without escaping, which matters when you're feeding diffs and file contents into a prompt.

If you have a strong reason to want JSON output, open an issue - it's a reasonable addition.

Origin

I built git2xml while working on HiveTrail Mesh - a desktop app that assembles structured LLM context from multiple sources (Notion, GitHub Issues, local files, git repos). The git-handling component turned out to be useful as a standalone CLI, so I extracted it and released it under the MIT license.

If you find this tool helpful and want the same approach applied across your full developer context - not just git - check out HiveTrail Mesh.

Contributing

Issues and PRs welcome. This is a small utility, so expect light maintenance - but reasonable bug reports and improvements will be reviewed and merged.

License

MIT - see LICENSE for details.
