Normalize PE files for reproducible MSVC++ builds

msvcpp-normalize-pe - Normalize PE Files for Reproducible MSVC++ Builds

⚠️ AI-Assisted Development Notice: This project was developed as an experiment in AI-assisted "vibe coding" using Claude Code. While the code has comprehensive tests and linting, it was primarily generated through AI assistance. The implementation is based on reverse-engineering PE file formats and may have edge cases or behaviors that haven't been thoroughly tested with all possible MSVC configurations. Use with caution in production environments and verify results with your specific toolchain.

A Python tool to patch Windows PE (Portable Executable) files to make MSVC builds reproducible by normalizing timestamps, GUIDs, and other non-deterministic debug metadata.

The Problem

When compiling Windows executables with Microsoft Visual C++ (MSVC), even with the /Brepro flag enabled, builds are not fully reproducible. The same source code compiled twice produces different binaries due to non-deterministic debug information:

  • COFF Header TimeDateStamp: Build timestamp in PE header
  • Debug Directory Timestamps: 4 separate timestamps in debug entries (CODEVIEW, VC_FEATURE, POGO, REPRO)
  • CODEVIEW GUID: Random GUID linking .exe to .pdb file
  • CODEVIEW Age: Incremental counter that varies between builds
  • REPRO Hash: Composite hash containing the GUID and timestamps

This makes binary verification in CI impossible - you can't verify that committed binaries match the source code because every rebuild produces different bytes, even though the executable code is identical.
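
To see where two builds of the same source actually differ, a byte-level comparison is usually enough. The following is a minimal sketch, not part of this tool; the function name differing_ranges is hypothetical, and it assumes both builds have the same file size.

```python
from __future__ import annotations


def differing_ranges(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return half-open (start, end) byte ranges where two buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    ranges = []
    start = None
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = offset                    # a differing run begins here
        elif x == y and start is not None:
            ranges.append((start, offset))    # the run ended just before offset
            start = None
    if start is not None:
        ranges.append((start, len(a)))        # run extends to the end of the data
    return ranges
```

Called on the raw bytes of two builds (for example, read with Path.read_bytes()), the returned ranges should cluster around the COFF header and the debug directory if the fields listed above are the only source of variation.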

The Solution

This tool patches all non-deterministic fields in PE files to fixed, deterministic values:

  • All timestamps → 0x00000001 (January 1, 1970 + 1 second)
  • CODEVIEW GUID → 00000000-0000-0000-0000-000000000000
  • CODEVIEW Age → 1
  • REPRO Hash → All zeros

After patching, identical source code produces byte-for-byte identical binaries, enabling reproducible builds and CI verification.
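
As a quick illustration of what these fixed values look like on disk, here is a small sketch using only the standard library. The constant names are hypothetical and are not taken from the tool's source; PE fields are stored little-endian, which is why the timestamp appears as 01 00 00 00.

```python
import struct
import uuid
from datetime import datetime, timezone

# Hypothetical names for the fixed values described above.
FIXED_TIMESTAMP = 0x00000001
FIXED_GUID = uuid.UUID(int=0)
FIXED_AGE = 1

# A PE TimeDateStamp is a 32-bit little-endian count of seconds since the Unix epoch.
timestamp_bytes = struct.pack("<I", FIXED_TIMESTAMP)
as_datetime = datetime.fromtimestamp(FIXED_TIMESTAMP, tz=timezone.utc)

print(timestamp_bytes.hex())     # 01000000
print(as_datetime.isoformat())   # 1970-01-01T00:00:01+00:00
print(FIXED_GUID)                # 00000000-0000-0000-0000-000000000000
```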

What Gets Patched

Fields Patched (8 total)

  1. PE COFF Header TimeDateStamp (offset varies, typically 0xC0-0x100)
  2. Debug CODEVIEW Entry Timestamp
  3. Debug CODEVIEW GUID (16 bytes)
  4. Debug CODEVIEW Age (4 bytes)
  5. Debug VC_FEATURE Entry Timestamp
  6. Debug POGO Entry Timestamp
  7. Debug REPRO Entry Timestamp
  8. Debug REPRO Hash (36 bytes)
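
To make items 3 and 4 concrete, the sketch below normalizes a CodeView RSDS record held in memory. It is an illustration under stated assumptions rather than the tool's actual implementation: the function name normalize_rsds is hypothetical, and the record is assumed to have been located already. The layout used is the standard one: a 4-byte "RSDS" signature, a 16-byte GUID, a 4-byte Age, then the null-terminated PDB path.

```python
import struct

RSDS_SIGNATURE = b"RSDS"
RSDS_FIXED_SIZE = 24  # signature (4) + GUID (16) + Age (4)


def normalize_rsds(record: bytearray, age: int = 1) -> None:
    """Zero the GUID and set the Age of a CodeView RSDS record, in place."""
    if len(record) < RSDS_FIXED_SIZE or record[:4] != RSDS_SIGNATURE:
        raise ValueError("not an RSDS CodeView record")
    record[4:20] = bytes(16)                   # GUID: 16 bytes at offset +4
    struct.pack_into("<I", record, 20, age)    # Age: little-endian uint32 at +20
```

The PDB path that follows the Age field is left untouched, so the record keeps its original length.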

What Doesn't Change

  • All executable code (.text section)
  • All program data (.data, .rdata sections)
  • Import/Export tables
  • Section headers
  • Relocations

The binary behaves identically at runtime - only metadata used for debugging is normalized.

Installation

From PyPI (Recommended)

pip install msvcpp-normalize-pe

From Source

git clone https://github.com/mithro/msvcpp-normalize-pe.git
cd msvcpp-normalize-pe
pip install .

Using uv

uv pip install msvcpp-normalize-pe

Usage

Command Line

After installation, the msvcpp-normalize-pe command is available:

# Basic usage
msvcpp-normalize-pe program.exe

# Custom timestamp
msvcpp-normalize-pe program.exe 1234567890

# Verbose output
msvcpp-normalize-pe --verbose program.exe

# See all options
msvcpp-normalize-pe --help

Python API

You can also use msvcpp-normalize-pe as a library in your Python code:

from pathlib import Path
from msvcpp_normalize_pe import patch_pe_file

result = patch_pe_file(Path("program.exe"), timestamp=1, verbose=True)
if result.success:
    print(f"Patched {result.patches_applied} fields")
else:
    print(f"Errors: {result.errors}")

Example Output

[1/1] COFF header: 0x829692a8 -> 0x00000001
  [2/?] Debug CODEVIEW timestamp: 0x829692a8 -> 0x00000001
  [3/?] Debug CODEVIEW GUID: e97b6ac706ea9b2dd577392d2bf08df7 -> 00000000000000000000000000000000
  [4/?] Debug CODEVIEW Age: 7 -> 1
  [5/?] Debug VC_FEATURE timestamp: 0x829692a8 -> 0x00000001
  [6/?] Debug POGO timestamp: 0x829692a8 -> 0x00000001
  [7/?] Debug REPRO timestamp: 0x829692a8 -> 0x00000001
  [8/?] Debug REPRO hash: 20000000e97b6ac7... -> 000000000000000000...
  Total: 8 timestamp(s) patched in program.exe

Integration with Build Systems

Makefile Integration (Native MSVC)

# Native MSVC builds
ifeq ($(USE_NATIVE_MSVC),1)
  program.exe: program.cpp
	cl.exe /O2 /Zi program.cpp /link /DEBUG:FULL /Brepro
	msvcpp-normalize-pe program.exe 1
endif

CI/CD Verification Workflow

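name: Verify Binary Reproducibility

on: [push, pull_request]

jobs:
  verify:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install msvcpp-normalize-pe
        run: pip install msvcpp-normalize-pe

      # Assumes cl.exe is already on PATH (an MSVC developer environment is set up)
      - name: Build from source
        shell: cmd
        run: |
          cl.exe /O2 program.cpp /link /DEBUG:FULL /Brepro
          msvcpp-normalize-pe program.exe 1

      # shell: cmd is needed because "fc" is an alias for Format-Custom in PowerShell
      - name: Compare with committed binary
        shell: cmd
        run: |
          fc /b program.exe committed\program.exe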
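
If the comparison step needs to run on a non-Windows runner, or you would rather not depend on fc, the same check can be written in Python. This is a minimal sketch; the helper names sha256_of and files_match are hypothetical and not part of this package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(built: Path, committed: Path) -> bool:
    """Return True when both files have identical contents."""
    return sha256_of(built) == sha256_of(committed)
```

In a CI step, a script could call files_match(Path("program.exe"), Path("committed/program.exe")) and exit with a non-zero status when it returns False.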

Requirements

  • Python 3.9+ (type hints, dataclasses)
  • Target files: Windows PE executables (.exe) or DLLs (.dll)
  • Architecture: Works with both 32-bit (PE32) and 64-bit (PE32+) binaries

No runtime dependencies - uses only Python standard library (struct, sys, pathlib, dataclasses).

Limitations and Known Issues

What This Tool Fixes

  • ✅ Makes PE executables reproducible (timestamps, GUIDs)
  • ✅ Works with native MSVC (cl.exe + link.exe)
  • ✅ Preserves debugging capability (PDB files still work)

What This Tool Cannot Fix

  • PDB files remain non-deterministic (~11% of PDB content varies)
    • PDB files contain thousands of small differences (padding, internal offsets, GUIDs)
    • Microsoft's PDB format has fundamental non-determinism issues
    • Industry solution: Use clang-cl + lld-link instead of native MSVC
  • Does not work with stripped binaries (no debug directory to patch)

Alternative: Use clang-cl + lld-link

For fully reproducible builds including PDB files, use LLVM's Windows toolchain:

clang-cl /O2 /std:c++17 program.cpp /link /DEBUG:FULL /Brepro /TIMESTAMP:1

The /TIMESTAMP: flag is only supported by lld-link, not native MSVC link.exe.

Technical Details

PE File Structure

The tool parses the PE file structure to locate and patch:

  1. DOS Header (offset 0x3C) → PE signature offset
  2. PE Signature (offset varies) → Verify "PE\0\0"
  3. COFF Header (after PE sig) → TimeDateStamp at +4
  4. Optional Header (after COFF) → Contains Data Directories
  5. Data Directory #6 → Debug Directory (RVA + Size)
  6. Debug Directory Entries → 28-byte structures with timestamps
  7. CODEVIEW RSDS Structure → GUID at +4, Age at +20
  8. REPRO Hash → Full hash data
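
Steps 1 to 5 can be sketched with the struct module alone. This is an illustrative walk through the headers and not the tool's own parser: the function name locate_pe_fields and the returned dictionary keys are hypothetical. It stops at the debug directory's RVA and size; mapping that RVA to a file offset requires reading the section table, which is omitted here.

```python
import struct


def locate_pe_fields(data: bytes) -> dict:
    """Walk DOS header -> PE signature -> COFF header -> debug data directory."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS header")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)      # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4
    timestamp_offset = coff + 4                               # TimeDateStamp
    (timestamp,) = struct.unpack_from("<I", data, timestamp_offset)
    optional = coff + 20                                      # COFF header is 20 bytes
    (magic,) = struct.unpack_from("<H", data, optional)
    if magic == 0x10B:                                        # PE32
        directories = optional + 96
    elif magic == 0x20B:                                      # PE32+
        directories = optional + 112
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    # Each data directory is 8 bytes (RVA, Size); the debug directory is index 6.
    debug_rva, debug_size = struct.unpack_from("<II", data, directories + 6 * 8)
    return {
        "timestamp_offset": timestamp_offset,
        "timestamp": timestamp,
        "debug_rva": debug_rva,
        "debug_entries": debug_size // 28,                    # 28-byte entries
    }
```

The only difference between PE32 and PE32+ that matters for this walk is where the data directories start inside the optional header, which is why the magic value is checked.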

Why /Brepro Isn't Enough

MSVC's /Brepro flag:

  • ✅ Removes some non-determinism
  • ✅ Uses hash-based timestamps instead of wall clock time
  • ❌ Still produces different hashes for each build
  • ❌ GUID remains random
  • ❌ Age field increments

This is because /Brepro computes a hash of build inputs, but includes random/variable data in that hash.
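
The effect is easy to reproduce with a toy model. The sketch below is purely illustrative and does not reflect MSVC's actual hashing scheme: toy_build_hash is a hypothetical function that shows how mixing any per-build variable data into a hash makes the result differ, even when the source input is identical.

```python
import hashlib
import os


def toy_build_hash(inputs: bytes, variable_data: bytes) -> str:
    """Hash the build inputs together with some per-build variable data."""
    return hashlib.sha256(inputs + variable_data).hexdigest()


source = b"int main() { return 0; }"

# Same source, fresh random data each time: the hashes differ between "builds".
first = toy_build_hash(source, os.urandom(16))
second = toy_build_hash(source, os.urandom(16))

# Same source, fixed data: the hash is stable across "builds".
fixed = toy_build_hash(source, bytes(16))
```

Replacing the variable part with a constant is, in spirit, what patching the GUID, Age, and hash fields does after the fact.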

Comparison with Alternatives

vs. ducible

ducible is an older tool with similar goals:

  • Unmaintained (last update 2018)
  • ❌ Only patches COFF header timestamp
  • ❌ Does not patch Debug Directory timestamps
  • ❌ Does not patch GUIDs or Age fields

vs. clang-cl + lld-link

Using LLVM's toolchain:

  • Fully reproducible (including PDB files)
  • ✅ Supports /TIMESTAMP: flag
  • ❌ Not always possible (may need native MSVC for compatibility)

This tool fills the gap when you must use native MSVC but still want reproducible .exe files.

Research and References

The non-determinism of MSVC builds with debug symbols is well-documented:

  • Microsoft PDB Repository Issue #9: PDB non-determinism issues (GUIDs, padding, uninitialized buffers)
  • Chromium Project: Uses clang-cl + lld-link specifically for reproducible builds
  • Bazel Team: Marked /experimental:deterministic as "not planned" because "PDBs are not deterministic"
  • Reproducible Builds Mailing List (Dec 2024): "there is no way to really solve this issue" with MSVC
  • Stack Overflow (Nov 2024): "No complete solution currently exists for achieving fully reproducible MSVC builds with debug symbols"

License

Apache License 2.0 - See LICENSE file

Contributing

Contributions welcome! Please test thoroughly with your build system before submitting PRs.

Credits

Developed as part of the ghidra-optimized-stdvector-decompiler project to enable CI verification of demo binaries compiled with multiple MSVC versions.
