HAR capture and PII sanitization library for network traffic analysis

har-capture

Capture and sanitize HAR (HTTP Archive) files. HAR files record browser HTTP activity and are commonly used for debugging, diagnostics, and test fixtures.

Quick Start

Windows

  1. Install Python from the Microsoft Store or python.org
  2. Open PowerShell and run:

pip install "har-capture[full]"
python -m har_capture get https://example.com

macOS / Linux

pip install "har-capture[full]"
har-capture get https://example.com

Already have a HAR file?

pip install har-capture
har-capture sanitize myfile.har

Python API

import json

from har_capture.sanitization import sanitize_har

# Load the raw HAR (HAR files are JSON documents)
with open("input.har", encoding="utf-8") as f:
    har_data = json.load(f)

sanitized = sanitize_har(har_data)

Why har-capture?

Chrome DevTools v130+ now sanitizes cookies and auth headers by default when exporting HAR files. That's a good start, but HAR files contain much more sensitive data:

  • IP addresses, MAC addresses, email addresses
  • Passwords and credentials in form bodies
  • Serial numbers, device names, session tokens

har-capture provides deep sanitization and CLI automation:

har-capture get <TARGET>     # Capture → sanitize → compress (all automatic)

Comparison with Existing Tools

Tools compared: har-capture, DevTools, Google, Cloudflare, Edgio

Sanitization
  • Cookies/auth headers
  • IPs, MACs, emails
  • Passwords in forms
  • JWT smart redaction
  • Correlation-preserving

Usability
  • No installation needed
  • Data stays local
  • CLI/scriptable
  • Preview before redact

Extras
  • Integrated capture
  • Custom patterns
  • Validation

Target Use Cases

  • Support diagnostics: Users submit sanitized HAR files without exposing credentials
  • Web development: Capture and analyze HTTP traffic for debugging
  • Test fixtures: Generate reproducible traffic captures for testing
  • Security review: Validate HAR files for PII leaks before sharing

Features

  • Zero Dependencies Core: Core sanitization uses only Python stdlib
  • HAR Capture: Browser-based capture using Playwright (optional)
  • PII Sanitization: Remove sensitive data from HTML and HAR files
  • Correlation-Preserving Redaction: Salted hashes maintain value relationships
  • Custom Patterns: External JSON files for easy pattern updates
  • Validation: Check HAR files for PII leaks before committing
  • CLI Interface: Easy-to-use command line tools

Installation

# Core only (zero dependencies)
pip install har-capture

# With browser capture (quote the extras so shells such as zsh do not expand the brackets)
pip install "har-capture[capture]"
playwright install chromium  # Install browser

# With CLI
pip install "har-capture[cli]"

# Full installation
pip install "har-capture[full]"

Quick Start

Python API

from har_capture.sanitization import sanitize_har_file, sanitize_html

raw_html = "<td>AA:BB:CC:DD:EE:FF</td>"  # Example input containing a MAC address

# Sanitize HTML (correlation-preserving by default)
clean_html = sanitize_html(raw_html)

# Sanitize with consistent salt (correlate across files)
clean_html = sanitize_html(raw_html, salt="my-secret-key")

# Use static placeholders (legacy mode)
clean_html = sanitize_html(raw_html, salt=None)

# Sanitize HAR file
sanitize_har_file("capture.har")  # Creates capture.sanitized.har
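
To sanitize a whole folder of captures, a small wrapper along these lines may help. This is a sketch, not part of har-capture: sanitize_directory is a hypothetical helper, and the sanitizer is passed in as a callable on the assumption that it accepts a single path string, as sanitize_har_file does in the example above.

```python
from pathlib import Path
from typing import Callable, List


def sanitize_directory(directory: str, sanitize_file: Callable[[str], object]) -> List[Path]:
    """Run sanitize_file on every raw .har file in directory (hypothetical helper)."""
    processed = []
    for path in sorted(Path(directory).glob("*.har")):
        # Skip output from earlier runs so it is not sanitized a second time.
        if path.name.endswith(".sanitized.har"):
            continue
        sanitize_file(str(path))
        processed.append(path)
    return processed
```

With the package installed, usage would presumably be sanitize_directory("captures", sanitize_har_file).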

CLI

# Capture HTTP traffic
har-capture get <TARGET>

# Sanitize a HAR file (uses random salt by default)
har-capture sanitize capture.har

# Sanitize with consistent salt
har-capture sanitize capture.har --salt my-key

# Sanitize with static placeholders
har-capture sanitize capture.har --no-salt

# Use custom patterns
har-capture sanitize capture.har --patterns custom.json

# Validate for PII leaks
har-capture validate capture.har

Correlation-Preserving Redaction

By default, har-capture uses format-preserving salted hashes for redaction:

  • Same value → same hash (within a session)
  • Different values → different hashes
  • Output remains valid format (parseable by analysis tools)
  • Uses reserved/documentation ranges that won't collide with real data

Example:

Before:
  MAC: AA:BB:CC:DD:EE:FF (appears 3 times)
  MAC: 11:22:33:44:55:66 (appears 2 times)

With salted hash (default):
  MAC: 02:a1:b2:c3:d4:e5 (appears 3 times - same device, valid MAC format)
  MAC: 02:7f:8e:9d:2c:01 (appears 2 times - different device)

With static placeholders (--no-salt):
  MAC: XX:XX:XX:XX:XX:XX (appears 5 times - correlation lost)
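
The behaviour in the example above can be approximated with a salted hash. The sketch below is illustrative only and makes no claim about how har-capture computes its hashes: pseudonymize_mac is a hypothetical name, and SHA-256 is an assumed choice of hash function.

```python
import hashlib


def pseudonymize_mac(mac: str, salt: str) -> str:
    """Map a MAC address to a stable stand-in in the 02:xx:xx:xx:xx:xx range."""
    # Normalise case and separators so the same device always hashes identically.
    normalized = mac.lower().replace("-", ":")
    digest = hashlib.sha256(f"{salt}:{normalized}".encode("utf-8")).digest()
    # First octet 02 sets the locally administered bit; the rest comes from the hash.
    return ":".join(["02"] + [f"{byte:02x}" for byte in digest[:5]])


salt = "example-session-salt"  # assumption: any string works as a salt
first = pseudonymize_mac("AA:BB:CC:DD:EE:FF", salt)
again = pseudonymize_mac("aa:bb:cc:dd:ee:ff", salt)
other = pseudonymize_mac("11:22:33:44:55:66", salt)
```

Here first and again are identical because they describe the same device, while other differs, so an analyst can still count distinct devices without seeing the real addresses.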

Format-preserving ranges used:

Type        Range                      Standard
MAC         02:xx:xx:xx:xx:xx          Locally administered bit
Private IP  10.255.x.x                 RFC 1918
Public IP   192.0.2.x                  RFC 5737 TEST-NET-1
IPv6        2001:db8::                 RFC 3849 documentation
Email       user_xxx@redacted.invalid  RFC 2606 .invalid TLD
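
As a rough illustration of how values could be mapped into these ranges, the sketch below derives replacement IPv4 addresses and email addresses from a salted hash. The function names and the use of SHA-256 are assumptions made for the example, not har-capture's actual implementation.

```python
import hashlib
import ipaddress


def _digest(value: str, salt: str) -> bytes:
    """Salted SHA-256 digest shared by the helpers below."""
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()


def pseudonymize_ipv4(address: str, salt: str) -> str:
    """Map private addresses into 10.255.x.x and all others into 192.0.2.x."""
    ip = ipaddress.IPv4Address(address)
    digest = _digest(str(ip), salt)
    if ip.is_private:
        return f"10.255.{digest[0]}.{digest[1]}"
    return f"192.0.2.{digest[0]}"


def pseudonymize_email(email: str, salt: str) -> str:
    """Replace an email address with user_<8 hex chars>@redacted.invalid."""
    token = _digest(email.lower(), salt).hex()[:8]
    return f"user_{token}@redacted.invalid"
```

Note that a /24 such as 192.0.2.x has only 256 addresses, so in a sketch like this two distinct public IPs can occasionally map to the same replacement.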

Salt options:

  • --salt auto (default): Random salt per session
  • --salt my-key: Consistent hashing across runs
  • --no-salt: Static placeholders (legacy mode)
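
The practical difference between the three modes can be sketched as follows. resolve_salt and redact_token are hypothetical helpers written for this illustration, and TOKEN_REDACTED is an invented static placeholder; only the TOKEN_ prefix with eight hex characters follows the output format shown in this README.

```python
import hashlib
import secrets
from typing import Optional


def resolve_salt(option: Optional[str]) -> Optional[str]:
    """Turn a salt option into the salt used for hashing (None means no hashing)."""
    if option is None:
        return None  # static placeholders
    if option == "auto":
        return secrets.token_hex(16)  # fresh random salt for each session
    return option  # caller-supplied key, stable across runs


def redact_token(value: str, salt: Optional[str]) -> str:
    """Redact a value with a salted hash, or a static placeholder when salt is None."""
    if salt is None:
        return "TOKEN_REDACTED"  # hypothetical placeholder; correlation is lost
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"TOKEN_{digest[:8]}"
```

A fixed key makes redacted values comparable between captures taken at different times, at the cost of having to keep that key secret. A random per-session salt avoids that burden but limits correlation to a single run.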

Custom Patterns

Patterns are stored in external JSON files for easy customization:

src/har_capture/patterns/
├── pii.json          # PII detection patterns
├── sensitive.json    # Sensitive headers/fields
└── allowlist.json    # Safe placeholder values

Add custom patterns via CLI:

har-capture sanitize capture.har --patterns my_patterns.json
har-capture validate capture.har --patterns my_patterns.json

Add custom patterns via Python:

from har_capture.sanitization import sanitize_html

clean = sanitize_html(html, custom_patterns="my_patterns.json")

Example custom patterns file:

{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}
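
To make the role of each field concrete, here is a minimal sketch of how a file in this shape could be applied to text. It assumes that replacement_prefix is joined to a short salted hash (mirroring outputs such as PASS_a1b2c3d4 elsewhere in this README); apply_patterns is a hypothetical function, not the library's API.

```python
import hashlib
import json
import re

PATTERNS_JSON = """
{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}
"""


def apply_patterns(text: str, patterns: dict, salt: str) -> str:
    """Replace every match of every pattern with <prefix>_<8 hex chars>."""
    for spec in patterns["patterns"].values():
        prefix = spec["replacement_prefix"]

        def replace(match, prefix=prefix):
            digest = hashlib.sha256(f"{salt}:{match.group(0)}".encode("utf-8")).hexdigest()
            return f"{prefix}_{digest[:8]}"

        text = re.sub(spec["regex"], replace, text)
    return text


patterns = json.loads(PATTERNS_JSON)
result = apply_patterns("order for CUST-AB12CD34 and again CUST-AB12CD34", patterns, salt="demo")
```

Both occurrences of the customer ID end up with the same CUSTID_ replacement, so the relationship between them survives redaction.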

PII Categories Removed

The sanitization removes the following types of PII:

  • MAC Addresses: AA:BB:CC:DD:EE:FF → 02:a1:b2:c3:d4:e5
  • Private IPs: 192.168.1.100 → 10.255.42.17
  • Public IPs: 8.8.8.8 → 192.0.2.42
  • IPv6 Addresses: fe80::1 → 2001:db8::a1b2:c3d4
  • Email Addresses: user@example.com → user_a1b2c3d4@redacted.invalid
  • Passwords/Credentials: In forms, headers, and JavaScript → PASS_a1b2c3d4
  • Session Tokens: In cookies and headers → TOKEN_a1b2c3d4
  • Serial Numbers: → SERIAL_a1b2c3d4
  • WiFi Credentials: In JavaScript variables
  • Device Names: In network device lists

CLI Commands

get

Capture HTTP traffic using a browser. By default, the output is sanitized and compressed, so you get a single .sanitized.har.gz file ready to share.

har-capture get <TARGET>                  # Outputs: <target>.sanitized.har.gz
har-capture get <TARGET> --output out.har # Outputs: out.sanitized.har.gz
har-capture get <TARGET> --keep-raw       # Also keeps the unsanitized .har file
har-capture get <TARGET> --no-sanitize    # Skip sanitization (not recommended)
har-capture get <TARGET> --no-compress    # Skip compression

Default behavior:

  1. Captures all HTTP traffic to a raw .har file
  2. Sanitizes PII → creates .sanitized.har
  3. Compresses → creates .sanitized.har.gz
  4. Deletes intermediate files (raw and uncompressed sanitized)

Use --keep-raw to preserve the original unsanitized file for debugging.
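
Because the default output is gzip-compressed JSON, it can be read back with the standard library alone. The sketch below assumes only the standard HAR layout (log → entries → request → url); load_har and request_urls are hypothetical helper names.

```python
import gzip
import json
from pathlib import Path


def load_har(path):
    """Load a HAR file, transparently handling .gz compression."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def request_urls(har: dict) -> list:
    """Return the URL of every request recorded in the HAR."""
    return [entry["request"]["url"] for entry in har["log"]["entries"]]
```

For example, request_urls(load_har("example.sanitized.har.gz")) would list the captured URLs, where the file name stands in for whatever the get command produced.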

sanitize

Remove PII from HAR files.

har-capture sanitize capture.har
har-capture sanitize capture.har --output clean.har --compress
har-capture sanitize capture.har --salt my-key      # Consistent hash
har-capture sanitize capture.har --no-salt          # Static placeholders
har-capture sanitize capture.har --patterns custom.json
har-capture sanitize capture.har --max-size 500     # Allow up to 500MB
har-capture sanitize capture.har --compression-level 6  # Faster compression

validate

Check for PII leaks.

har-capture validate capture.har
har-capture validate --dir ./captures --recursive
har-capture validate capture.har --strict
har-capture validate capture.har --patterns custom.json
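
Conceptually, a leak check walks every string in the HAR and reports values that still look like real PII. The sketch below illustrates the idea with two simple regular expressions. It is an assumption-laden stand-in, not the validate command's logic: the patterns, the placeholder test, and the function names are all invented for this example.

```python
import re

# Deliberately simple detectors; a real validator would cover many more types.
LEAK_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
}


def is_placeholder(value: str) -> bool:
    """Treat values in the documented replacement ranges as already sanitized."""
    return value.endswith("@redacted.invalid") or value.lower().startswith("02:")


def iter_strings(node, path="$"):
    """Yield (json_path, string) for every string value in a nested structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")
    elif isinstance(node, str):
        yield path, node


def find_leaks(har: dict) -> list:
    """Return (json_path, kind, value) for each suspected PII leak."""
    findings = []
    for path, text in iter_strings(har):
        for kind, pattern in LEAK_PATTERNS.items():
            for match in pattern.findall(text):
                if not is_placeholder(match):
                    findings.append((path, kind, match))
    return findings
```

An empty result from find_leaks means only that these two patterns found nothing, not that the file is free of sensitive data.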

Platform Support

Component     Windows  macOS  Linux
Sanitization  Yes      Yes    Yes
Validation    Yes      Yes    Yes
CLI           Yes      Yes    Yes
Capture       Yes      Yes    Yes

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .

# Type checking
mypy src/har_capture

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
