Skip to main content

HAR capture and PII sanitization library for network traffic analysis

Project description

har-capture

PyPI version Downloads codecov License: MIT AI Assisted

Capture and sanitize HAR (HTTP Archive) files for network traffic analysis. HAR files record browser network activity and are commonly used for debugging, diagnostics, and test fixtures.

Quick Start

pip install har-capture[full]
har-capture capture 192.168.100.1
Already have a HAR file?
pip install har-capture
har-capture sanitize myfile.har
Python API
from har_capture.sanitization import sanitize_har

with open("input.har") as f:
    har_data = json.load(f)

sanitized = sanitize_har(har_data)

Why har-capture?

Existing HAR sanitization tools require a manual, multi-step workflow:

  1. Open browser DevTools
  2. Record network traffic
  3. Export HAR file
  4. Find a sanitizer tool
  5. Upload, process, download

har-capture provides an integrated, CLI-first approach:

har-capture capture <DEVICE_IP>     # Capture + sanitize in one step

Comparison with Existing Tools

Feature har-capture Google Cloudflare Edgio
Automated browser capture Yes No No No
CLI-first design Yes No (Flask API) No (Web UI) No (Web UI)
Integrated capture+sanitize Yes No No No
Correlation-preserving redaction Yes No No No
Device-specific PII patterns Yes Generic JWT-focused Generic
Zero-dependency core Yes No No No
Custom pattern support Yes No No No
Cross-platform CLI Yes No No No

Target Use Cases

  • Support diagnostics: Users submit sanitized HAR files without exposing credentials
  • Parser development: Capture device web interfaces for building integrations
  • Test fixtures: Generate reproducible traffic captures for testing
  • Security review: Validate HAR files for PII leaks before sharing

Features

  • Zero Dependencies Core: Core sanitization uses only Python stdlib
  • HAR Capture: Browser-based capture using Playwright (optional)
  • PII Sanitization: Remove sensitive data from HTML and HAR files
  • Correlation-Preserving Redaction: Salted hashes maintain value relationships
  • Custom Patterns: External JSON files for easy pattern updates
  • Validation: Check HAR files for PII leaks before committing
  • CLI Interface: Easy-to-use command line tools

Installation

# Core only (zero dependencies)
pip install har-capture

# With browser capture
pip install har-capture[capture]
playwright install chromium  # Install browser

# With CLI
pip install har-capture[cli]

# Full installation
pip install har-capture[full]

Quick Start

Python API

from har_capture.sanitization import sanitize_html, sanitize_har

# Sanitize HTML (correlation-preserving by default)
clean_html = sanitize_html(raw_html)

# Sanitize with consistent salt (correlate across files)
clean_html = sanitize_html(raw_html, salt="my-secret-key")

# Use static placeholders (legacy mode)
clean_html = sanitize_html(raw_html, salt=None)

# Sanitize HAR file
from har_capture.sanitization import sanitize_har_file
sanitize_har_file("device.har")  # Creates device.sanitized.har

CLI

# Capture device traffic
har-capture capture <DEVICE_IP>

# Sanitize a HAR file (uses random salt by default)
har-capture sanitize device.har

# Sanitize with consistent salt
har-capture sanitize device.har --salt my-key

# Sanitize with static placeholders
har-capture sanitize device.har --no-salt

# Use custom patterns
har-capture sanitize device.har --patterns custom.json

# Validate for PII leaks
har-capture validate device.har

Correlation-Preserving Redaction

By default, har-capture uses format-preserving salted hashes for redaction:

  • Same value → same hash (within a session)
  • Different values → different hashes
  • Output remains valid format (parseable by analysis tools)
  • Uses reserved/documentation ranges that won't collide with real data

Example:

Before:
  MAC: AA:BB:CC:DD:EE:FF (appears 3 times)
  MAC: 11:22:33:44:55:66 (appears 2 times)

With salted hash (default):
  MAC: 02:a1:b2:c3:d4:e5 (appears 3 times - same device, valid MAC format)
  MAC: 02:7f:8e:9d:2c:01 (appears 2 times - different device)

With static placeholders (--no-salt):
  MAC: XX:XX:XX:XX:XX:XX (appears 5 times - correlation lost)

Format-preserving ranges used:

Type Range Standard
MAC 02:xx:xx:xx:xx:xx Locally administered bit
Private IP 10.255.x.x RFC 1918
Public IP 192.0.2.x RFC 5737 TEST-NET-1
IPv6 2001:db8:: RFC 3849 documentation
Email user_xxx@redacted.invalid RFC 2606 .invalid TLD

Salt options:

  • --salt auto (default): Random salt per session
  • --salt my-key: Consistent hashing across runs
  • --no-salt: Static placeholders (legacy mode)

Custom Patterns

Patterns are stored in external JSON files for easy customization:

src/har_capture/patterns/
├── pii.json          # PII detection patterns
├── sensitive.json    # Sensitive headers/fields
└── allowlist.json    # Safe placeholder values

Add custom patterns via CLI:

har-capture sanitize device.har --patterns my_patterns.json
har-capture validate device.har --patterns my_patterns.json

Add custom patterns via Python:

from har_capture.sanitization import sanitize_html

clean = sanitize_html(html, custom_patterns="my_patterns.json")

Example custom patterns file:

{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}

PII Categories Removed

The sanitization removes the following types of PII:

  • MAC Addresses: AA:BB:CC:DD:EE:FF02:a1:b2:c3:d4:e5
  • Private IPs: 192.168.1.10010.255.42.17
  • Public IPs: 8.8.8.8192.0.2.42
  • IPv6 Addresses: fe80::12001:db8::a1b2:c3d4
  • Email Addresses: user@example.comuser_a1b2c3d4@redacted.invalid
  • Passwords/Credentials: In forms, headers, and JavaScript → PASS_a1b2c3d4
  • Session Tokens: In cookies and headers → TOKEN_a1b2c3d4
  • Serial Numbers: → SERIAL_a1b2c3d4
  • WiFi Credentials: In JavaScript variables
  • Device Names: In network device lists

Modules

sanitization

Core PII removal with zero external dependencies.

from har_capture.sanitization import (
    sanitize_html,      # Remove PII from HTML
    sanitize_har,       # Remove PII from HAR data
    sanitize_har_file,  # Sanitize HAR file on disk
    check_for_pii,      # Detect potential PII
)

# All support salt and custom_patterns options
clean = sanitize_html(html, salt="auto", custom_patterns=None)

patterns

Pattern loading and hashing utilities.

from har_capture.patterns import (
    Hasher,                  # Salted hash generator
    load_pii_patterns,       # Load PII regex patterns
    load_sensitive_patterns, # Load sensitive field names
    load_allowlist,          # Load safe placeholders
)

# Create a hasher for manual use
hasher = Hasher.create(salt="my-key")
hashed_mac = hasher.hash_mac("AA:BB:CC:DD:EE:FF")  # "02:a1:b2:c3:d4:e5"

capture

Browser-based HAR capture using Playwright.

from har_capture.capture import capture_device_har

result = capture_device_har(
    ip="router.local",  # or IP address like "10.0.0.1"
    output="device.har",
    sanitize=True,
    compress=True,
)
print(result.har_path)
print(result.sanitized_path)

validation

Check HAR files for PII leaks.

from har_capture.validation import validate_har, Finding

findings = validate_har("device.har", custom_patterns="my_patterns.json")
for finding in findings:
    print(f"{finding.severity}: {finding.reason}")
    print(f"  Location: {finding.location}")
    print(f"  Value: {finding.value}")

CLI Commands

capture

Capture device traffic using a browser.

har-capture capture <DEVICE_IP>
har-capture capture <DEVICE_IP> --output device.har
har-capture capture <DEVICE_IP> --no-sanitize

sanitize

Remove PII from HAR files.

har-capture sanitize device.har
har-capture sanitize device.har --output clean.har --compress
har-capture sanitize device.har --salt my-key      # Consistent hash
har-capture sanitize device.har --no-salt          # Static placeholders
har-capture sanitize device.har --patterns custom.json
har-capture sanitize device.har --max-size 500     # Allow up to 500MB
har-capture sanitize device.har --compression-level 6  # Faster compression

validate

Check for PII leaks.

har-capture validate device.har
har-capture validate --dir ./captures --recursive
har-capture validate device.har --strict
har-capture validate device.har --patterns custom.json

Platform Support

Component Windows macOS Linux
Sanitization Yes Yes Yes
Validation Yes Yes Yes
CLI Yes Yes Yes
Capture Yes Yes Yes

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .

# Type checking
mypy src/har_capture

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

har_capture-0.1.2.tar.gz (60.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

har_capture-0.1.2-py3-none-any.whl (46.6 kB view details)

Uploaded Python 3

File details

Details for the file har_capture-0.1.2.tar.gz.

File metadata

  • Download URL: har_capture-0.1.2.tar.gz
  • Upload date:
  • Size: 60.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for har_capture-0.1.2.tar.gz
Algorithm Hash digest
SHA256 ea47a5047404a5bd14dda34c2a818858aa98ef4858b9e8ef1b0c39eda9b570ac
MD5 fe85828cf37ba31effee17848304fbb3
BLAKE2b-256 8b6dcf20640f65f1c844a398ad7ce9db520b597d001ff96f042a5d77dc893bce

See more details on using hashes here.

Provenance

The following attestation bundles were made for har_capture-0.1.2.tar.gz:

Publisher: publish.yml on solentlabs/har-capture

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file har_capture-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: har_capture-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 46.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for har_capture-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 dfdb4b5dc3aa513ed3b12cf0f1f6fe14337f7a447e985266c657dcaab509617a
MD5 94217b0f68ade28881ab4ad5ccc84295
BLAKE2b-256 6b0b9129e1e3147fbfe2d03a332e5f00f6176d833866020aefedd2ff9633b7d7

See more details on using hashes here.

Provenance

The following attestation bundles were made for har_capture-0.1.2-py3-none-any.whl:

Publisher: publish.yml on solentlabs/har-capture

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page