Generate contextual XML briefs for Git Commits and PRs.
git2xml
A zero-dependency CLI that generates structured XML briefs of your Git commits and pull requests - ready to paste into Claude, ChatGPT, or any LLM that benefits from clean context.
Why this exists
LLMs work better with structured context than with raw blobs of text. When you ask Claude to write a PR description, pasting git diff output is workable but lossy - it strips staging information, mixes binary and text files into a mess, and doesn't separate file contents from their diffs. git2xml solves this by formatting your git state into XML that LLMs can parse cleanly, with explicit file paths, statuses, diffs, and content sections.
The result: better-quality output from your AI assistant with less prompt engineering on your end.
Features
- AI-ready output: Produces XML structured specifically for LLM consumption, with explicit file paths, change statuses, diffs, and content sections that models parse reliably.
- One command per use case: git2xml commit for current changes, git2xml pr for branch-vs-base - no flag-juggling for common workflows.
- Zero dependencies: Built entirely on the Python standard library. No supply-chain surface beyond Python itself.
- Robust binary detection: Automatically excludes binary files using BOM detection and statistical character analysis to prevent XML corruption.
- Smart XML escaping: Safely wraps code containing CDATA terminators using dynamic Markdown fencing.
- Staging-aware: Differentiates between staged, unstaged, and untracked files for accurate commit briefs.
- Context-budget controls: Per-file content and diff size caps (--max-size, --max-diff-size) keep oversized files and runaway diffs out of your prompt while still recording that the change happened.
- Usable as a library: A small typed Python API returns the brief as a string (sync or async) for use inside scripts, agents, and LLM pipelines - not just the CLI.
Requirements
Python 3.9 or higher. No other dependencies.
Installation
pip install git2xml
Usage
Run from inside any local Git repository, or target one with --repo PATH.
Generate a commit brief
Summarizes your currently modified files (or staged files if using the --staged flag) against HEAD.
git2xml commit
Outputs to commit_brief.xml by default, written to the directory you ran the command from.
Generate a pull request brief
Summarizes all changes on your current branch against a base branch (defaults to main).
git2xml pr --base main --output my_pr_summary.xml
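If you drive the CLI from a script, the commands above can be assembled as an argument list. This is a minimal sketch: the helper name build_git2xml_argv is hypothetical (it is not part of git2xml), and it only uses flags shown in this README (--staged, --base, --output).

```python
from typing import List, Optional


def build_git2xml_argv(
    command: str,
    base: Optional[str] = None,
    output: Optional[str] = None,
    staged: bool = False,
) -> List[str]:
    """Assemble a git2xml command line (hypothetical helper, not part of git2xml)."""
    if command not in ("commit", "pr"):
        raise ValueError("command must be 'commit' or 'pr'")
    argv = ["git2xml", command]
    if staged:
        argv.append("--staged")
    if base is not None:
        argv += ["--base", base]
    if output is not None:
        argv += ["--output", output]
    return argv


# With git2xml installed, the list can be handed to subprocess.run(argv, check=True).
```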
Content-control flags
Five optional flags let you shape what ends up in the brief:
| Flag | Description |
|---|---|
| --no-untracked | Exclude untracked (new, not yet git added) files from a commit brief. No-op when --staged is set (staged mode already excludes them) or in PR mode (no untracked files exist there). |
| --max-size N | Override the per-file content size threshold (in bytes), above which file content is omitted and replaced with a reason string. Does not apply to diffs - that is --max-diff-size (see "Size limits: content vs. diffs"). The file's <diff> is still emitted, so the change stays visible. Defaults to 5 MiB (5242880). Must be a positive integer; --max-size 0 or a negative value exits with an error. |
| --max-diff-size N | Override the per-file diff size threshold (in bytes, UTF-8). A diff larger than this is dropped from the output - its <diff> slot renders status="omitted" with a reason while the <content> stays. Unlike --max-size, this is output-shaping, not a pre-fetch guard (a diff has no size git can report before computing it). Defaults to 1 MiB (1048576). Must be >= 0; --max-diff-size 0 disables the cap (diffs are always included in full). |
| --no-content | Produce a diff-only brief - all <content> bodies are suppressed and every file is represented by its <diff>. For newly added and untracked files (which have no prior version to diff against), the diff is the full file content shown as added (+) lines - so a diff-only brief still captures new files completely. |
| --strict-xml | Generate strict XML 1.0 output: escape control characters and split CDATA terminators. When the flag is omitted (the default), git2xml prioritizes exact file fidelity and falls back to Markdown fencing when a CDATA terminator is present. See the XML Compliance vs. File Fidelity section below for more details. |
These flags compose freely with each other and with --staged:
git2xml commit --no-untracked # omit untracked files
git2xml commit --max-size 102400 # cap content at 100 KiB
git2xml commit --max-diff-size 262144 # drop any single diff over 256 KiB
git2xml commit --no-content # diffs only, no file bodies
git2xml commit --no-untracked --no-content # combine: drop untracked, diffs only
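Both size flags take raw byte counts, so the values in the examples above are plain binary multiples. A small sketch of the arithmetic follows; the helper names kib and mib are illustrative only and not part of git2xml.

```python
KIB = 1024          # bytes in one kibibyte
MIB = 1024 * KIB    # bytes in one mebibyte


def kib(n: int) -> int:
    """Convert kibibytes to the byte count the size flags expect."""
    return n * KIB


def mib(n: int) -> int:
    """Convert mebibytes to the byte count the size flags expect."""
    return n * MIB


# Values used in this README:
#   --max-size 102400        -> kib(100)
#   --max-diff-size 262144   -> kib(256)
#   default --max-size       -> mib(5)
#   default --max-diff-size  -> mib(1)
```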
Note - new files under --no-content: Normally a brand-new file's change is carried by its <content>. Because --no-content suppresses content, git2xml instead emits the file's add-diff (every line shown as an added + line), so the file's contents are still present in the brief - just rendered as a diff rather than a content block. Untracked files (not yet git added) are diffed against an empty file to produce the same result. This applies only under --no-content; in the default mode new files render as normal <content>.
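To illustrate what an add-diff looks like, here is a minimal sketch that renders a file's text as all-added lines under a standard unified-diff hunk header. It is not git2xml's implementation (git2xml gets its diffs from git itself), and as_add_diff is a hypothetical name.

```python
def as_add_diff(content: str) -> str:
    """Render text as a unified-diff hunk in which every line is an addition."""
    lines = content.splitlines()
    if not lines:
        return ""
    # "-0,0" means the old side is empty; "+1,N" means N lines were added.
    # (git abbreviates "+1,1" to "+1" for one-line files; this sketch does not.)
    header = f"@@ -0,0 +1,{len(lines)} @@"
    return "\n".join([header] + ["+" + line for line in lines]) + "\n"
```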
Size limits: content vs. diffs
git2xml caps two things independently - file content (--max-size) and a
single file's diff (--max-diff-size). Same unit (bytes), different mechanics.
--max-size caps file content. Content size is read from git's metadata
(ls-tree / cat-file) or the filesystem before the file is loaded, so an
oversized file is detected and skipped without ever being read into memory - the
guard prevents the work. The file's <file> element and <diff> are still
emitted, so the change stays visible.
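The "check the size first, read only if it fits" idea can be sketched for the filesystem case as below. This is an illustration under assumptions, not git2xml's code: read_if_small is a hypothetical name, and the reason string simply mirrors the one shown in the example output later in this README.

```python
import os
from typing import Optional, Tuple

DEFAULT_MAX_SIZE = 5 * 1024 * 1024  # 5 MiB, the documented default


def read_if_small(
    path: str, max_size: int = DEFAULT_MAX_SIZE
) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (content, None), or (None, reason) when the file is over the cap."""
    size = os.path.getsize(path)  # metadata lookup only; the file is not read
    if size > max_size:
        return None, f"omitted - file exceeds {max_size} bytes"
    with open(path, "rb") as fh:
        return fh.read(), None
```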
--max-diff-size caps a single file's diff. Unlike content, a diff has no
size git can report in advance - it exists only once git computes it - so the cap
can't prevent the work the way --max-size does. Instead the diff is streamed and
abandoned once it crosses the limit (git2xml stops reading rather than buffering
the whole thing), then dropped from the output: its <diff> slot renders
status="omitted" with a reason while the <content> stays. This keeps a runaway
diff - a big generated or vendored file, or a large deleted file whose only
payload is its diff - out of your context budget. Defaults to 1 MiB; pass
--max-diff-size 0 to disable it and always include diffs in full.
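The stream-and-abandon behavior described above can be sketched as follows. The function works on any iterable of byte chunks and stops consuming it as soon as the running total crosses the cap. collect_capped is a hypothetical name; git2xml's internal reader may look quite different.

```python
from typing import Iterable, List, Optional


def collect_capped(chunks: Iterable[bytes], max_diff_size: int) -> Optional[bytes]:
    """Join chunks, or return None as soon as the total exceeds max_diff_size.

    A cap of 0 disables the limit, matching the documented --max-diff-size 0.
    """
    parts: List[bytes] = []
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if max_diff_size and total > max_diff_size:
            return None  # abandon: remaining chunks are never pulled
        parts.append(chunk)
    return b"".join(parts)
```

A None result corresponds to the status="omitted" diff slot; the caller still keeps the file's content.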
Execution options
| Flag | Description |
|---|---|
| --git-timeout N | Per-git-command timeout in seconds. Raise it for very large repos where a single diff/log can take a while. Default: 30. |
| --diff-semaphore-limit N | Max number of diffs fetched concurrently. Default: 20. Lower it to reduce load; raise it for more parallelism on fast disks. |
| --verbose / -v | Verbose logging. Logs per-file and per-commit progress, as well as debug log messages. |
| --hide-repo-path | Emit only the repository's directory name in the root <{commit,pr}_brief repo="..."> attribute instead of its absolute local path. Use when sharing briefs externally. Individual file path attributes are always repo-relative and unaffected. Default: off (the absolute path is emitted). |
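For readers unfamiliar with the pattern behind --diff-semaphore-limit and --git-timeout, here is a generic asyncio sketch of bounded concurrency with a per-job timeout. It shows the general technique only. run_bounded and its parameters are hypothetical names, and nothing here is taken from git2xml's source.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_bounded(
    jobs: Sequence[Callable[[], Awaitable[str]]],
    limit: int = 20,
    timeout: float = 30.0,
) -> List[str]:
    """Run zero-argument coroutine functions, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with sem:  # waits here while `limit` jobs are already running
            # Raises asyncio.TimeoutError if one job takes longer than `timeout`.
            return await asyncio.wait_for(job(), timeout)

    # gather preserves the order of `jobs` in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```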
Output location
The brief is written to the directory you ran the command from, using the name from
--output (or the {command}_brief.xml default). A relative --output is resolved
against your current directory; an absolute path is honored as given. Note this is
independent of --repo: pointing --repo at another repository still writes the
brief to where you invoked the command, not into that repository.
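The resolution rule above can be written out in a few lines of pathlib. This is a sketch of the documented behavior, not git2xml's code, and resolve_output is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_output(command: str, output: Optional[str], cwd: Path) -> Path:
    """Resolve where a brief is written, following the rules described above."""
    name = output if output is not None else f"{command}_brief.xml"
    path = Path(name)
    # Absolute paths are honored as given; relative ones are joined to the
    # directory the command was invoked from (never to the --repo directory).
    return path if path.is_absolute() else cwd / path
```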
Use as a Python library
Beyond the CLI, git2xml exposes a small programmatic API that returns the brief
as a string (nothing is written to disk), so you can feed it straight into an LLM
call, an agent pipeline, or any tool that assembles context.
import git2xml
from git2xml import Git2xmlConfig
# Synchronous - for plain scripts
xml = git2xml.generate_commit_brief_sync(Git2xmlConfig(repo="/path/to/repo"))
# A PR brief against a base branch
xml = git2xml.generate_pr_brief_sync(Git2xmlConfig(repo=".", base="develop"))
The engine is asyncio-based, so async callers (agents, web handlers) can await the
native coroutines directly instead of blocking their event loop:
import asyncio
import git2xml
from git2xml import Git2xmlConfig

async def main():
    xml = await git2xml.generate_commit_brief(Git2xmlConfig(repo=".", staged=True))
    # ... hand `xml` to your model / agent ...

asyncio.run(main())
Windows note: the async functions spawn git via asyncio subprocesses, which on Windows require the ProactorEventLoop. asyncio.run(...) (above) selects it for you, so the normal case needs no action. Only if you supply your own event loop on Windows must it be a ProactorEventLoop - the SelectorEventLoop cannot create subprocesses and the call will fail. The sync wrappers and the CLI are unaffected.
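If you do manage the event loop yourself, a sketch along these lines picks a subprocess-capable loop on each platform. new_subprocess_capable_loop is a hypothetical helper, and the probe coroutine stands in for a real call such as git2xml.generate_commit_brief(...).

```python
import asyncio
import sys


def new_subprocess_capable_loop() -> asyncio.AbstractEventLoop:
    """Create an event loop that can spawn subprocesses on this platform."""
    if sys.platform == "win32":
        # ProactorEventLoop only exists on Windows and supports subprocesses.
        return asyncio.ProactorEventLoop()
    return asyncio.new_event_loop()


async def probe() -> str:
    # Placeholder for: await git2xml.generate_commit_brief(config)
    return "ok"


loop = new_subprocess_capable_loop()
try:
    result = loop.run_until_complete(probe())
finally:
    loop.close()
```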
All options live on the typed Git2xmlConfig object - the same settings the CLI
flags map to (repo, base, staged, strict_xml, no_untracked, max_size,
max_diff_size, no_content, git_timeout, diff_semaphore_limit, hide_repo_path). The function name selects the
mode, so you never set command yourself:
config = Git2xmlConfig(repo=".", base="main", strict_xml=True, max_size=100_000)
xml = git2xml.generate_pr_brief_sync(config)
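Judging from the field list above, each config field appears to be its CLI flag with the leading dashes dropped and hyphens turned into underscores. That is an observation about the names in this README rather than a documented guarantee, and flag_to_field is a hypothetical helper.

```python
def flag_to_field(flag: str) -> str:
    """Map a CLI flag such as '--max-diff-size' to 'max_diff_size'."""
    return flag.lstrip("-").replace("-", "_")


# Flags mentioned in this README and the config fields they appear to map to.
FLAGS = [
    "--repo", "--base", "--staged", "--strict-xml", "--no-untracked",
    "--max-size", "--max-diff-size", "--no-content", "--git-timeout",
    "--diff-semaphore-limit", "--hide-repo-path",
]
FIELDS = [flag_to_field(flag) for flag in FLAGS]
```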
API reference
| Function | Sync/Async | Returns |
|---|---|---|
| generate_commit_brief(config) | async | XML string |
| generate_pr_brief(config) | async | XML string |
| generate_commit_brief_sync(config) | sync | XML string |
| generate_pr_brief_sync(config) | sync | XML string |
- Each returns the brief as a string, or an empty string "" when there is nothing to summarize (a clean working tree, or no commits between the branch and its base).
- Failures raise git2xml.Git2xmlError, or a more specific subclass: NotAGitRepositoryError, GitNotInstalledError, GitCommandError.
- The *_sync helpers cannot be called from inside a running event loop (e.g. a Jupyter cell or an async handler) and raise a clear RuntimeError if they are; use the async variants there.
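If the same code may run both in plain scripts and inside an event loop, you can check which situation you are in before choosing a sync or async variant. A minimal sketch using only the standard library follows; inside_running_loop is a hypothetical helper name.

```python
import asyncio


def inside_running_loop() -> bool:
    """Return True when called from code running under an asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


# False here means a *_sync helper is safe to call;
# True means you should await the async variant instead.
```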
Example output
A commit brief covering an added file, a modified file, a symlink, and a binary
file looks like this. Content and diffs are wrapped in CDATA so source is embedded
verbatim; the repository name is emitted as a <name> element, and added files carry
their full contents as <content> (no diff is needed since the content is the whole
change):
<commit_brief repo="/Users/dev/myapp">
<name>myapp</name>
<file path="src/tests/test_auth.py" status="added">
<content format="cdata"><![CDATA[# New file contents
]]></content>
</file>
<file path="src/auth.py" status="modified">
<content format="cdata"><![CDATA[def verify_token(token):
    return token in VALID_TOKENS and not is_expired(token)
]]></content>
<diff format="cdata"><![CDATA[@@ -1,2 +1,2 @@
 def verify_token(token):
-    return token in VALID_TOKENS
+    return token in VALID_TOKENS and not is_expired(token)]]></diff>
</file>
<file path="src/config_loader.py" status="added">
<content format="cdata"><![CDATA[Symlink pointing to: ../shared/config_loader.py]]></content>
</file>
<file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
</commit_brief>
The repo attribute shows the absolute path by default; run with --hide-repo-path to emit just the directory name (repo="myapp") when sharing the brief externally.
A file whose content is omitted by --max-size still carries its <diff> and an
explanatory reason:
<file path="data/big.csv" status="modified" reason="omitted - file exceeds 5242880 bytes">
<diff format="cdata"><![CDATA[@@ ... @@]]></diff>
</file>
A file whose diff is dropped by --max-diff-size keeps its <content> and marks
the omission on the diff slot:
<file path="vendor/bundle.js" status="modified">
<content format="cdata"><![CDATA[/* ... file contents ... */]]></content>
<diff status="omitted" reason="diff exceeds 1048576 bytes" />
</file>
PR mode wraps the same <file> elements and additionally emits a <commit_log> of the
branch's commits.
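If you want to post-process a brief programmatically, the structure shown above is easy to walk with the standard library. The sketch below assumes a well-formed brief (generating it with --strict-xml is the safer choice when a real parser is involved); SAMPLE is a trimmed, made-up brief in the documented shape, and summarize is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SAMPLE = """<commit_brief repo="myapp">
<name>myapp</name>
<file path="src/auth.py" status="modified">
<content format="cdata"><![CDATA[def verify_token(token):
    return True
]]></content>
<diff status="omitted" reason="diff exceeds 1048576 bytes" />
</file>
<file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
</commit_brief>"""


def summarize(xml_text: str) -> List[Dict[str, object]]:
    """List each <file> with its status and whether content or diff was omitted."""
    rows = []
    for node in ET.fromstring(xml_text).iter("file"):
        diff = node.find("diff")
        rows.append({
            "path": node.get("path"),
            "status": node.get("status"),
            "has_content": node.find("content") is not None,
            "diff_omitted": diff is not None and diff.get("status") == "omitted",
            "reason": node.get("reason"),  # None unless the file carries a reason
        })
    return rows
```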
XML Compliance vs. File Fidelity
By default, git2xml prioritizes exact file fidelity over strict XML 1.0 compliance. AI models (like Claude) read raw token streams and do not use strict XML parsers.
- Control Characters: Literal control bytes (e.g., 0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F) in your source code are passed through exactly as they appear in <content> and <diff> bodies. This also applies to control bytes inside attribute values (a file path or commit author): in default mode they pass through unescaped, so a control byte there (such as a stray newline in a path) can break attribute well-formedness. --strict-xml escapes control characters in attributes too.
- CDATA Terminators: If a file contains the literal string ]]>, git2xml avoids splitting the tag (which alters the raw text the LLM sees) and instead falls back to dynamic Markdown fencing (format="fenced").
- Invalid UTF-8: Text is decoded as UTF-8 on a best-effort basis. Bytes that aren't valid UTF-8 are replaced with the Unicode replacement character (U+FFFD, �) rather than causing an error. Files git detects as binary are omitted entirely, so this affects only text files containing occasional malformed bytes.
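Two of the default-mode behaviors above can be illustrated with the standard library. The decoding call is ordinary Python. The fence-length rule is an assumption: this README does not say how git2xml sizes its fence, so the sketch uses the common Markdown convention of picking a fence longer than any backtick run in the text. Both function names are hypothetical.

```python
import re


def decode_best_effort(raw: bytes) -> str:
    """Decode as UTF-8, replacing malformed bytes with U+FFFD instead of failing."""
    return raw.decode("utf-8", errors="replace")


def pick_fence(text: str, minimum: int = 3) -> str:
    """Choose a backtick fence longer than any backtick run inside `text`.

    Assumed rule (common Markdown practice), not confirmed git2xml behavior.
    """
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    return "`" * max(minimum, longest + 1)
```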
If you are piping this output into a strict automated XML parser (like xml.etree or a CI/CD pipeline) rather than an LLM, you can use the --strict-xml flag. This will force strict XML 1.0 compliance by replacing control characters with their string representations (e.g., \x1b) and safely splitting CDATA terminators (]]]]><![CDATA[>).
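The two strict-mode transformations named above can be sketched as follows: control characters become their \xNN text form, and each CDATA terminator is split across two CDATA sections so a parser reassembles the original text. This is an illustration of the technique, not git2xml's source, and the function names are hypothetical.

```python
import re

# Control characters that XML 1.0 forbids (tab, LF, and CR are allowed).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_controls(text: str) -> str:
    """Replace each forbidden control character with a literal \\xNN string."""
    return _CONTROL.sub(lambda m: "\\x%02x" % ord(m.group()), text)


def strict_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any ']]>' so the section stays well-formed."""
    safe = escape_controls(text).replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

A strict parser such as xml.etree.ElementTree joins the adjacent CDATA sections back into the original ]]> sequence when it reads the element's text.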
Symlinks and file content
git2xml mirrors git's own behavior for symbolic links: it emits the link's
target path as the content (e.g. Symlink pointing to: ../shared/config.py),
never the contents of the file the link points to. It does not follow or
dereference symlinks, so a link pointing outside the repository is recorded as
a path, not read.
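Recording a link's target without following it is what os.readlink does. A minimal sketch follows, using the "Symlink pointing to:" wording from the example above; symlink_content is a hypothetical name, and git2xml itself may obtain the target from git rather than the filesystem.

```python
import os


def symlink_content(path: str) -> str:
    """Describe a symlink by its target path; the target itself is never opened."""
    # os.readlink returns the stored target string, even if it is dangling
    # or points outside the repository.
    return f"Symlink pointing to: {os.readlink(path)}"
```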
More broadly, git2xml includes file contents and diffs verbatim, exactly as
git sees them. It does not scan, filter, or redact content for secrets or
sensitive data - that is deliberately out of scope for a zero-dependency git
formatter. Review generated briefs before pasting them into any external tool,
and use .gitignore (git2xml respects it for untracked files) or --no-untracked
to keep files out of the output.
One path is included by default: the root element's repo attribute carries the
absolute local path of the repository (e.g. repo="/Users/dev/myapp"), so a
brief records where on your machine it was generated. The per-file path
attributes are always repo-relative and never absolute. If you are pasting briefs
into a third-party tool and would rather not disclose your local path (which can
reveal a username or directory layout), pass --hide-repo-path to emit only the
repository's directory name (repo="myapp") instead. The repository name is also
always available separately in the <name> element.
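The effect of --hide-repo-path on the root attribute amounts to keeping only the last path component. A sketch of that rule for POSIX-style paths follows; repo_attribute is a hypothetical name, not a git2xml function.

```python
from pathlib import PurePosixPath


def repo_attribute(repo_path: str, hide_repo_path: bool = False) -> str:
    """Return the value for the root element's repo attribute."""
    if hide_repo_path:
        # Keep only the directory name, e.g. "/Users/dev/myapp" -> "myapp".
        return PurePosixPath(repo_path).name
    return repo_path
```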
Security: run against repositories you trust
git2xml works by invoking your local git to read a repository's diffs, blobs,
and status. It therefore inherits git's normal behavior of running
repository-defined programs during otherwise read-only operations - for example a
textconv or external-diff driver referenced from .gitattributes and defined in
the repository's .git/config, or an fsmonitor command. A repository you don't
control - especially one delivered as an archive that ships its own .git
directory, rather than a fresh clone - can therefore cause code to run on your
machine when you point git2xml at it. This is a property of git itself, not
specific to git2xml.
One caveat to the "read-only" framing: in --staged mode, git2xml runs
git write-tree to read staged-file metadata in a single batch. This writes a
tree object into .git/objects, but it leaves your index, working tree, and
HEAD untouched, and the unreferenced object is reclaimed by git's normal gc.
The practical guidance is the same as for running any git command: only run
git2xml against repositories you trust. To inspect an untrusted repository, do
it in a throwaway sandbox (a container or VM) rather than on your primary machine.
Why XML (not JSON)?
XML was chosen because LLMs - Claude in particular - parse structured XML tags more reliably than nested JSON when the content includes code with embedded quotes, brackets, and special characters. CDATA sections let you embed raw source code verbatim without escaping, which matters when you're feeding diffs and file contents into a prompt.
If you have a strong reason to want JSON output, open an issue - it's a reasonable addition.
Origin
I built git2xml while working on HiveTrail Mesh - a desktop app that assembles structured LLM context from multiple sources (Notion, GitHub Issues, local files, git repos). The git-handling component turned out to be useful as a standalone CLI, so I extracted it under MIT license.
If you find this tool helpful and want the same approach applied across your full developer context - not just git - check out HiveTrail Mesh.
Contributing
Issues and PRs welcome. This is a small utility, so expect light maintenance - but reasonable bug reports and improvements will be reviewed and merged.
License
MIT - see LICENSE for details.