HAR capture and PII sanitization library for network traffic analysis

har-capture

Capture and sanitize HAR (HTTP Archive) files for network traffic analysis. HAR files record browser network activity and are commonly used for debugging, diagnostics, and test fixtures.
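
For readers new to the format, the sketch below shows the rough shape of a HAR 1.2 document as a Python dictionary. It is trimmed to a few fields, and every value in it (the URL, the cookie, the creator name) is illustrative rather than taken from a real capture.

```python
import json

# A trimmed HAR 1.2 document: one log containing one entry.
# All values are made up for illustration.
minimal_har = {
    "log": {
        "version": "1.2",
        "creator": {"name": "example", "version": "0.0"},
        "entries": [
            {
                "startedDateTime": "2024-01-01T00:00:00.000Z",
                "time": 12.5,
                "request": {
                    "method": "GET",
                    "url": "http://192.0.2.1/status.html",
                    "headers": [{"name": "Cookie", "value": "session=abc123"}],
                },
                "response": {
                    "status": 200,
                    "headers": [],
                    "content": {"mimeType": "text/html", "text": "<html>...</html>"},
                },
            }
        ],
    }
}

# Sensitive values typically sit in URLs, headers, cookies, and bodies.
for entry in minimal_har["log"]["entries"]:
    print(entry["request"]["method"], entry["request"]["url"])

serialized = json.dumps(minimal_har, indent=2)
```

Because a HAR file is plain JSON, it can be loaded, inspected, and rewritten with the standard library alone.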

Quick Start

pip install har-capture[full]
har-capture capture --ip 192.168.100.1
Already have a HAR file?
pip install har-capture
har-capture sanitize myfile.har
Python API
import json

from har_capture.sanitization import sanitize_har

with open("input.har") as f:
    har_data = json.load(f)

sanitized = sanitize_har(har_data)

Why har-capture?

Existing HAR sanitization tools require a manual, multi-step workflow:

  1. Open browser DevTools
  2. Record network traffic
  3. Export HAR file
  4. Find a sanitizer tool
  5. Upload, process, download

har-capture provides an integrated, CLI-first approach:

har-capture capture <DEVICE_IP>     # Capture + sanitize in one step

Comparison with Existing Tools

| Feature | har-capture | Google | Cloudflare | Edgio |
| --- | --- | --- | --- | --- |
| Automated browser capture | Yes | No | No | No |
| CLI-first design | Yes | No (Flask API) | No (Web UI) | No (Web UI) |
| Integrated capture+sanitize | Yes | No | No | No |
| Correlation-preserving redaction | Yes | No | No | No |
| Device-specific PII patterns | Yes | Generic | JWT-focused | Generic |
| Zero-dependency core | Yes | No | No | No |
| Custom pattern support | Yes | No | No | No |
| Cross-platform CLI | Yes | No | No | No |

Target Use Cases

  • Support diagnostics: Users submit sanitized HAR files without exposing credentials
  • Parser development: Capture device web interfaces for building integrations
  • Test fixtures: Generate reproducible traffic captures for testing
  • Security review: Validate HAR files for PII leaks before sharing

Features

  • Zero Dependencies Core: Core sanitization uses only Python stdlib
  • HAR Capture: Browser-based capture using Playwright (optional)
  • PII Sanitization: Remove sensitive data from HTML and HAR files
  • Correlation-Preserving Redaction: Salted hashes maintain value relationships
  • Custom Patterns: External JSON files for easy pattern updates
  • Validation: Check HAR files for PII leaks before committing
  • CLI Interface: Easy-to-use command line tools

Installation

# Core only (zero dependencies)
pip install har-capture

# With browser capture
pip install har-capture[capture]
playwright install chromium  # Install browser

# With CLI
pip install har-capture[cli]

# Full installation
pip install har-capture[full]

Quick Start

Python API

from har_capture.sanitization import sanitize_html, sanitize_har

# Sanitize HTML (correlation-preserving by default)
clean_html = sanitize_html(raw_html)

# Sanitize with consistent salt (correlate across files)
clean_html = sanitize_html(raw_html, salt="my-secret-key")

# Use static placeholders (legacy mode)
clean_html = sanitize_html(raw_html, salt=None)

# Sanitize HAR file
from har_capture.sanitization import sanitize_har_file
sanitize_har_file("device.har")  # Creates device.sanitized.har

CLI

# Capture device traffic
har-capture capture <DEVICE_IP>

# Sanitize a HAR file (uses random salt by default)
har-capture sanitize device.har

# Sanitize with consistent salt
har-capture sanitize device.har --salt my-key

# Sanitize with static placeholders
har-capture sanitize device.har --no-salt

# Use custom patterns
har-capture sanitize device.har --patterns custom.json

# Validate for PII leaks
har-capture validate device.har

Correlation-Preserving Redaction

By default, har-capture uses format-preserving salted hashes for redaction:

  • Same value → same hash (within a session)
  • Different values → different hashes
  • Output remains valid format (parseable by analysis tools)
  • Uses reserved/documentation ranges that are unlikely to be mistaken for real data

Example:

Before:
  MAC: AA:BB:CC:DD:EE:FF (appears 3 times)
  MAC: 11:22:33:44:55:66 (appears 2 times)

With salted hash (default):
  MAC: 02:a1:b2:c3:d4:e5 (appears 3 times - same device, valid MAC format)
  MAC: 02:7f:8e:9d:2c:01 (appears 2 times - different device)

With static placeholders (--no-salt):
  MAC: XX:XX:XX:XX:XX:XX (appears 5 times - correlation lost)
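
The mapping in the example above can be sketched with a keyed hash from the standard library. This is a minimal illustration of the idea, not har-capture's actual implementation: the function name `pseudonymize_mac`, the use of HMAC-SHA256, and the normalization step are all assumptions.

```python
import hashlib
import hmac
import secrets


def pseudonymize_mac(mac: str, salt: bytes) -> str:
    """Map a MAC address to a stable pseudonym in the 02: range (hypothetical helper)."""
    # Normalize so "AA-BB-..." and "aa:bb:..." hash to the same value.
    normalized = mac.lower().replace("-", ":")
    digest = hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).digest()
    # A first octet of 02 sets the locally administered bit; the other
    # five octets are taken from the keyed hash.
    return "02:" + ":".join(f"{b:02x}" for b in digest[:5])


salt = secrets.token_bytes(16)  # fresh random salt, as for a single session
first = pseudonymize_mac("AA:BB:CC:DD:EE:FF", salt)
again = pseudonymize_mac("aa:bb:cc:dd:ee:ff", salt)
other = pseudonymize_mac("11:22:33:44:55:66", salt)
```

With the same salt, `first` and `again` are identical while `other` differs, so a reader can still tell which requests involved the same device. Changing the salt changes every pseudonym, which is why a fixed salt is needed to correlate values across files.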

Format-preserving ranges used:

| Type | Range | Standard |
| --- | --- | --- |
| MAC | 02:xx:xx:xx:xx:xx | Locally administered bit |
| Private IP | 10.255.x.x | RFC 1918 |
| Public IP | 192.0.2.x | RFC 5737 TEST-NET-1 |
| IPv6 | 2001:db8:: | RFC 3849 documentation |
| Email | user_xxx@redacted.invalid | RFC 2606 .invalid TLD |
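
As a rough sketch of how values might be steered into these ranges, the following uses only the standard library. The helper names, the HMAC-SHA256 digest, and the choice of which digest bytes fill which octets are assumptions made for illustration; the library may do this differently.

```python
import hashlib
import hmac
import ipaddress


def _digest(value: str, salt: bytes) -> bytes:
    # Keyed hash so pseudonyms cannot be recomputed without the salt.
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).digest()


def pseudonymize_ipv4(ip: str, salt: bytes) -> str:
    """Map an IPv4 address into 10.255.0.0/16 or 192.0.2.0/24 (hypothetical helper)."""
    addr = ipaddress.IPv4Address(ip)
    d = _digest(str(addr), salt)
    if addr.is_private:
        return f"10.255.{d[0]}.{d[1]}"
    # TEST-NET-1 has only 256 addresses, so distinct inputs can collide here.
    return f"192.0.2.{d[0]}"


def pseudonymize_email(email: str, salt: bytes) -> str:
    """Map an email address to user_<8 hex chars>@redacted.invalid (hypothetical helper)."""
    d = _digest(email.lower(), salt)
    return f"user_{d[:4].hex()}@redacted.invalid"


private = pseudonymize_ipv4("192.168.1.100", b"demo-salt")
public = pseudonymize_ipv4("8.8.8.8", b"demo-salt")
email = pseudonymize_email("user@example.com", b"demo-salt")
```

The outputs still parse as valid addresses, so downstream tools that expect an IP or email keep working on the sanitized file.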

Salt options:

  • --salt auto (default): Random salt per session
  • --salt my-key: Consistent hashing across runs
  • --no-salt: Static placeholders (legacy mode)

Custom Patterns

Patterns are stored in external JSON files for easy customization:

src/har_capture/patterns/
├── pii.json          # PII detection patterns
├── sensitive.json    # Sensitive headers/fields
└── allowlist.json    # Safe placeholder values

Add custom patterns via CLI:

har-capture sanitize device.har --patterns my_patterns.json
har-capture validate device.har --patterns my_patterns.json

Add custom patterns via Python:

from har_capture.sanitization import sanitize_html

clean = sanitize_html(html, custom_patterns="my_patterns.json")

Example custom patterns file:

{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}
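
To make the file format concrete, here is a small sketch of how a pattern entry like the one above could be applied to text. It assumes the replacement takes the form `<replacement_prefix>_<8 hex chars>`, mirroring tokens such as `SERIAL_a1b2c3d4` elsewhere in this README; `apply_patterns` is a hypothetical helper, not part of har-capture.

```python
import hashlib
import hmac
import json
import re

PATTERNS_JSON = """
{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}
"""


def apply_patterns(text: str, patterns: dict, salt: bytes) -> str:
    """Replace every match of each pattern with a prefixed salted hash."""
    for spec in patterns["patterns"].values():
        prefix = spec["replacement_prefix"]

        def replace(match: re.Match, prefix: str = prefix) -> str:
            digest = hmac.new(salt, match.group(0).encode("utf-8"), hashlib.sha256)
            return f"{prefix}_{digest.hexdigest()[:8]}"

        text = re.sub(spec["regex"], replace, text)
    return text


patterns = json.loads(PATTERNS_JSON)
redacted = apply_patterns("id=CUST-AB12CD34; again CUST-AB12CD34", patterns, b"demo-salt")
print(redacted)
```

Both occurrences of the customer ID receive the same token, so the relationship between them survives redaction.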

PII Categories Removed

The sanitization removes the following types of PII:

  • MAC Addresses: AA:BB:CC:DD:EE:FF → 02:a1:b2:c3:d4:e5
  • Private IPs: 192.168.1.100 → 10.255.42.17
  • Public IPs: 8.8.8.8 → 192.0.2.42
  • IPv6 Addresses: fe80::1 → 2001:db8::a1b2:c3d4
  • Email Addresses: user@example.com → user_a1b2c3d4@redacted.invalid
  • Passwords/Credentials: In forms, headers, and JavaScript → PASS_a1b2c3d4
  • Session Tokens: In cookies and headers → TOKEN_a1b2c3d4
  • Serial Numbers: → SERIAL_a1b2c3d4
  • WiFi Credentials: In JavaScript variables
  • Device Names: In network device lists

Modules

sanitization

Core PII removal with zero external dependencies.

from har_capture.sanitization import (
    sanitize_html,      # Remove PII from HTML
    sanitize_har,       # Remove PII from HAR data
    sanitize_har_file,  # Sanitize HAR file on disk
    check_for_pii,      # Detect potential PII
)

# All support salt and custom_patterns options
clean = sanitize_html(html, salt="auto", custom_patterns=None)

patterns

Pattern loading and hashing utilities.

from har_capture.patterns import (
    Hasher,                  # Salted hash generator
    load_pii_patterns,       # Load PII regex patterns
    load_sensitive_patterns, # Load sensitive field names
    load_allowlist,          # Load safe placeholders
)

# Create a hasher for manual use
hasher = Hasher.create(salt="my-key")
hashed_mac = hasher.hash_mac("AA:BB:CC:DD:EE:FF")  # "02:a1:b2:c3:d4:e5"

capture

Browser-based HAR capture using Playwright.

from har_capture.capture import capture_device_har

result = capture_device_har(
    ip="router.local",  # or IP address like "10.0.0.1"
    output="device.har",
    sanitize=True,
    compress=True,
)
print(result.har_path)
print(result.sanitized_path)

validation

Check HAR files for PII leaks.

from har_capture.validation import validate_har, Finding

findings = validate_har("device.har", custom_patterns="my_patterns.json")
for finding in findings:
    print(f"{finding.severity}: {finding.reason}")
    print(f"  Location: {finding.location}")
    print(f"  Value: {finding.value}")
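
Conceptually, a leak check walks the HAR structure and reports strings that still look like PII. The sketch below does this for MAC addresses only, using the standard library; it is an illustration of the idea rather than how `validate_har` works internally. The `find_macs` helper, the path notation, and the rule that `02:`-prefixed values count as already sanitized are all assumptions.

```python
import re
from typing import Any, Iterator, Tuple

MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def find_macs(node: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (location, value) for each MAC-like string in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_macs(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_macs(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for match in MAC_RE.finditer(node):
            # Assumption: pseudonyms in the 02: range are treated as safe.
            if not match.group(0).lower().startswith("02:"):
                yield path, match.group(0)


har = {
    "log": {
        "entries": [
            {"response": {"content": {"text": "mac AA:BB:CC:DD:EE:FF and 02:a1:b2:c3:d4:e5"}}}
        ]
    }
}
leaks = list(find_macs(har))
for location, value in leaks:
    print(f"{location}: {value}")
```

Only the unsanitized address is reported, along with a path showing where in the HAR it was found.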

CLI Commands

capture

Capture device traffic using a browser.

har-capture capture <DEVICE_IP>
har-capture capture <DEVICE_IP> --output device.har
har-capture capture <DEVICE_IP> --no-sanitize

sanitize

Remove PII from HAR files.

har-capture sanitize device.har
har-capture sanitize device.har --output clean.har --compress
har-capture sanitize device.har --salt my-key      # Consistent hash
har-capture sanitize device.har --no-salt          # Static placeholders
har-capture sanitize device.har --patterns custom.json
har-capture sanitize device.har --max-size 500     # Allow up to 500MB
har-capture sanitize device.har --compression-level 6  # Faster compression

validate

Check for PII leaks.

har-capture validate device.har
har-capture validate --dir ./captures --recursive
har-capture validate device.har --strict
har-capture validate device.har --patterns custom.json

Platform Support

| Component | Windows | macOS | Linux |
| --- | --- | --- | --- |
| Sanitization | Yes | Yes | Yes |
| Validation | Yes | Yes | Yes |
| CLI | Yes | Yes | Yes |
| Capture | Yes | Yes | Yes |

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check .

# Type checking
mypy src/har_capture

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

