HAR capture and PII sanitization library for network traffic analysis
har-capture
Capture and sanitize HAR (HTTP Archive) files. HAR files record browser HTTP activity and are commonly used for debugging, diagnostics, and test fixtures.
Quick Start
Windows
- Install Python from the Microsoft Store or python.org
- Open PowerShell and run:
pip install har-capture[full]
python -m har_capture get https://example.com
macOS / Linux
pip install har-capture[full]
har-capture get https://example.com
Already have a HAR file?
pip install har-capture
har-capture sanitize myfile.har
Python API
import json
from har_capture.sanitization import sanitize_har
# Load the raw HAR as a dict, then sanitize it in memory
with open("input.har", encoding="utf-8") as f:
    har_data = json.load(f)
sanitized = sanitize_har(har_data)
Why har-capture?
Chrome DevTools v130+ now sanitizes cookies and auth headers by default when exporting HAR files. That's a good start, but HAR files contain much more sensitive data:
- IP addresses, MAC addresses, email addresses
- Passwords and credentials in form bodies
- Serial numbers, device names, session tokens
har-capture provides deep sanitization and CLI automation:
har-capture get <TARGET> # Capture → sanitize → compress (all automatic)
Comparison with Existing Tools
| Feature | har-capture | DevTools | Cloudflare | Edgio | |
|---|---|---|---|---|---|
| Sanitization | |||||
| Cookies/auth headers | ✅ | ✅ | ✅ | ✅ | ✅ |
| IPs, MACs, emails | ✅ | ❌ | ❌ | ❌ | ❌ |
| Passwords in forms | ✅ | ❌ | ✅ | ❌ | ✅ |
| JWT smart redaction | ❌ | ❌ | ❌ | ✅ | ❌ |
| Correlation-preserving | ✅ | ❌ | ❌ | ❌ | ❌ |
| Usability | |||||
| No installation needed | ❌ | ✅ | ❌ | ✅ | ✅ |
| Data stays local | ✅ | ✅ | ❌ | ✅ | ✅ |
| CLI/scriptable | ✅ | ❌ | ✅ | ❌ | ✅ |
| Preview before redact | ✅ | ❌ | ✅ | ❌ | ❌ |
| Extras | |||||
| Integrated capture | ✅ | ✅ | ❌ | ❌ | ❌ |
| Custom patterns | ✅ | ❌ | ✅ | ❌ | ❌ |
| Validation | ✅ | ❌ | ❌ | ❌ | ❌ |
Target Use Cases
- Support diagnostics: Users submit sanitized HAR files without exposing credentials
- Web development: Capture and analyze HTTP traffic for debugging
- Test fixtures: Generate reproducible traffic captures for testing
- Security review: Validate HAR files for PII leaks before sharing
Features
- Zero Dependencies Core: Core sanitization uses only Python stdlib
- HAR Capture: Browser-based capture using Playwright (optional)
- PII Sanitization: Remove sensitive data from HTML and HAR files
- Correlation-Preserving Redaction: Salted hashes maintain value relationships
- Custom Patterns: External JSON files for easy pattern updates
- Validation: Check HAR files for PII leaks before committing
- CLI Interface: Easy-to-use command line tools
Installation
# Core only (zero dependencies)
pip install har-capture
# With browser capture
pip install har-capture[capture]
playwright install chromium # Install browser
# With CLI
pip install har-capture[cli]
# Full installation
pip install har-capture[full]
Quick Start
Python API
from har_capture.sanitization import sanitize_har_file, sanitize_html
raw_html = "<td>AA:BB:CC:DD:EE:FF</td>"  # example input
# Sanitize HTML (correlation-preserving by default)
clean_html = sanitize_html(raw_html)
# Sanitize with consistent salt (correlate across files)
clean_html = sanitize_html(raw_html, salt="my-secret-key")
# Use static placeholders (legacy mode)
clean_html = sanitize_html(raw_html, salt=None)
# Sanitize HAR file
sanitize_har_file("capture.har") # Creates capture.sanitized.har
CLI
# Capture HTTP traffic
har-capture get <TARGET>
# Sanitize a HAR file (uses random salt by default)
har-capture sanitize capture.har
# Sanitize with consistent salt
har-capture sanitize capture.har --salt my-key
# Sanitize with static placeholders
har-capture sanitize capture.har --no-salt
# Use custom patterns
har-capture sanitize capture.har --patterns custom.json
# Validate for PII leaks
har-capture validate capture.har
Correlation-Preserving Redaction
By default, har-capture uses format-preserving salted hashes for redaction:
- Same value → same hash (within a session)
- Different values → different hashes
- Output remains valid format (parseable by analysis tools)
- Uses reserved/documentation ranges that won't collide with real data
Example:
Before:
MAC: AA:BB:CC:DD:EE:FF (appears 3 times)
MAC: 11:22:33:44:55:66 (appears 2 times)
With salted hash (default):
MAC: 02:a1:b2:c3:d4:e5 (appears 3 times - same device, valid MAC format)
MAC: 02:7f:8e:9d:2c:01 (appears 2 times - different device)
With static placeholders (--no-salt):
MAC: XX:XX:XX:XX:XX:XX (appears 5 times - correlation lost)
Format-preserving ranges used:
| Type | Range | Standard |
|---|---|---|
| MAC | 02:xx:xx:xx:xx:xx | Locally administered bit |
| Private IP | 10.255.x.x | RFC 1918 |
| Public IP | 192.0.2.x | RFC 5737 TEST-NET-1 |
| IPv6 | 2001:db8:: | RFC 3849 documentation |
| Email | user_xxx@redacted.invalid | RFC 2606 .invalid TLD |
Salt options:
- --salt auto (default): Random salt per session
- --salt my-key: Consistent hashing across runs
- --no-salt: Static placeholders (legacy mode)
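The salted-hash idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not har-capture's actual implementation: it assumes HMAC-SHA256 keyed by the salt, and `pseudonymize_mac` is a hypothetical name. It shows why the same input maps to the same output within a session while the result still looks like a valid, locally administered MAC.

```python
import hashlib
import hmac
import secrets


def pseudonymize_mac(mac: str, salt: bytes) -> str:
    """Hypothetical helper: map a MAC to a stable 02:xx:xx:xx:xx:xx pseudonym."""
    # Normalize so "AA-BB-..." and "aa:bb:..." hash to the same value.
    normalized = mac.lower().replace("-", ":")
    digest = hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).digest()
    # Fix the first octet to 02 (locally administered) and take the other
    # five octets from the keyed digest.
    return ":".join(["02"] + [f"{byte:02x}" for byte in digest[:5]])


# A random salt per session, comparable to the default mode described above.
session_salt = secrets.token_bytes(16)
first = pseudonymize_mac("AA:BB:CC:DD:EE:FF", session_salt)
second = pseudonymize_mac("11:22:33:44:55:66", session_salt)

# A fixed salt gives the same pseudonym on every run.
fixed = pseudonymize_mac("AA:BB:CC:DD:EE:FF", b"my-key")
```

With a fixed salt the mapping is reproducible across runs, so the salt should be treated as a secret: anyone who has it can test guesses against the redacted values.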
Custom Patterns
Patterns are stored in external JSON files for easy customization:
src/har_capture/patterns/
├── pii.json # PII detection patterns
├── sensitive.json # Sensitive headers/fields
└── allowlist.json # Safe placeholder values
Add custom patterns via CLI:
har-capture sanitize capture.har --patterns my_patterns.json
har-capture validate capture.har --patterns my_patterns.json
Add custom patterns via Python:
from har_capture.sanitization import sanitize_html
clean = sanitize_html(html, custom_patterns="my_patterns.json")
Example custom patterns file:
{
"patterns": {
"my_custom_id": {
"regex": "CUST-[A-Z0-9]{8}",
"replacement_prefix": "CUSTID",
"description": "Customer ID pattern"
}
}
}
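To make the file format above concrete, here is a rough sketch of how such a pattern entry could be applied to text. It assumes each match is replaced by the `replacement_prefix` plus eight hex characters of a salted hash, mirroring placeholders such as `PASS_a1b2c3d4` shown elsewhere in this README; `apply_custom_patterns` is a hypothetical helper, not part of the har-capture API.

```python
import hashlib
import json
import re

# Same shape as the example file above, inlined to keep the sketch self-contained.
PATTERNS_JSON = """
{
  "patterns": {
    "my_custom_id": {
      "regex": "CUST-[A-Z0-9]{8}",
      "replacement_prefix": "CUSTID",
      "description": "Customer ID pattern"
    }
  }
}
"""


def apply_custom_patterns(text: str, patterns: dict, salt: str) -> str:
    """Hypothetical helper: replace each match with PREFIX_<8 hex chars>."""
    for spec in patterns["patterns"].values():
        prefix = spec["replacement_prefix"]

        def redact(match: re.Match, prefix: str = prefix) -> str:
            # Salted hash of the matched value, so equal values stay equal.
            digest = hashlib.sha256((salt + match.group(0)).encode("utf-8"))
            return f"{prefix}_{digest.hexdigest()[:8]}"

        text = re.sub(spec["regex"], redact, text)
    return text


patterns = json.loads(PATTERNS_JSON)
redacted = apply_custom_patterns(
    "order for CUST-ABCD1234, again CUST-ABCD1234", patterns, salt="demo"
)
```

Both occurrences of the customer ID end up with the same `CUSTID_...` placeholder, so the correlation between them survives redaction.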
PII Categories Removed
The sanitization removes the following types of PII:
- MAC Addresses: AA:BB:CC:DD:EE:FF → 02:a1:b2:c3:d4:e5
- Private IPs: 192.168.1.100 → 10.255.42.17
- Public IPs: 8.8.8.8 → 192.0.2.42
- IPv6 Addresses: fe80::1 → 2001:db8::a1b2:c3d4
- Email Addresses: user@example.com → user_a1b2c3d4@redacted.invalid
- Passwords/Credentials: In forms, headers, and JavaScript → PASS_a1b2c3d4
- Session Tokens: In cookies and headers → TOKEN_a1b2c3d4
- Serial Numbers: → SERIAL_a1b2c3d4
- WiFi Credentials: In JavaScript variables
- Device Names: In network device lists
CLI Commands
get
Capture HTTP traffic using a browser. By default, the output is sanitized and compressed, so you get a single .sanitized.har.gz file ready to share.
har-capture get <TARGET> # Outputs: <target>.sanitized.har.gz
har-capture get <TARGET> --output out.har # Outputs: out.sanitized.har.gz
har-capture get <TARGET> --keep-raw # Also keeps the unsanitized .har file
har-capture get <TARGET> --no-sanitize # Skip sanitization (not recommended)
har-capture get <TARGET> --no-compress # Skip compression
Default behavior:
- Captures all HTTP traffic to a raw .har file
- Sanitizes PII → creates .sanitized.har
- Compresses → creates .sanitized.har.gz
- Deletes intermediate files (raw and uncompressed sanitized)
Use --keep-raw to preserve the original unsanitized file for debugging.
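The sanitize, compress, and clean-up steps described above can be sketched with the standard library alone. This is an illustration of the file flow, not the tool's real code: `finalize_capture` is a hypothetical name, the `sanitize` argument stands in for whatever sanitizer is used, and the default gzip level of 9 is an assumption.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path


def finalize_capture(raw_path: Path, sanitize, keep_raw: bool = False,
                     compression_level: int = 9) -> Path:
    """Hypothetical pipeline: raw .har -> .sanitized.har -> .sanitized.har.gz."""
    sanitized_path = raw_path.with_name(raw_path.stem + ".sanitized.har")
    gz_path = sanitized_path.with_name(sanitized_path.name + ".gz")

    # Sanitize the parsed HAR and write the intermediate file.
    har = json.loads(raw_path.read_text(encoding="utf-8"))
    sanitized_path.write_text(json.dumps(sanitize(har)), encoding="utf-8")

    # Compress the sanitized file.
    with open(sanitized_path, "rb") as src, gzip.open(
        gz_path, "wb", compresslevel=compression_level
    ) as dst:
        shutil.copyfileobj(src, dst)

    # Delete intermediates; keep the raw capture only if asked to.
    sanitized_path.unlink()
    if not keep_raw:
        raw_path.unlink()
    return gz_path


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "capture.har"
    raw.write_text(json.dumps({"log": {"entries": []}}), encoding="utf-8")
    # Identity function as a stand-in sanitizer for the demo.
    out = finalize_capture(raw, sanitize=lambda har: har)
    out_name = out.name
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    with gzip.open(out, "rt", encoding="utf-8") as fh:
        roundtrip = json.load(fh)
```

Deleting the raw file by default means the only artifact left on disk is the one that is safe to share.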
sanitize
Remove PII from HAR files.
har-capture sanitize capture.har
har-capture sanitize capture.har --output clean.har --compress
har-capture sanitize capture.har --salt my-key # Consistent hash
har-capture sanitize capture.har --no-salt # Static placeholders
har-capture sanitize capture.har --patterns custom.json
har-capture sanitize capture.har --max-size 500 # Allow up to 500MB
har-capture sanitize capture.har --compression-level 6 # Faster compression
validate
Check for PII leaks.
har-capture validate capture.har
har-capture validate --dir ./captures --recursive
har-capture validate capture.har --strict
har-capture validate capture.har --patterns custom.json
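As a rough picture of what a PII leak check involves, the sketch below walks every string in a parsed HAR document and reports email and MAC addresses that do not look like sanitizer placeholders. The regexes, the placeholder rule, and the names `find_leaks` and `is_placeholder` are all assumptions made for illustration; the real validate command may work quite differently.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def is_placeholder(value: str) -> bool:
    """Assumed allowlist: redacted emails and locally administered 02: MACs."""
    lowered = value.lower()
    return lowered.endswith("@redacted.invalid") or lowered.startswith("02:")


def iter_strings(node):
    """Yield every string value in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_leaks(har: dict) -> list[str]:
    """Hypothetical check: collect matches that are not known placeholders."""
    leaks = []
    for text in iter_strings(har):
        for pattern in (EMAIL_RE, MAC_RE):
            for match in pattern.findall(text):
                if not is_placeholder(match):
                    leaks.append(match)
    return leaks


sample = {"log": {"entries": [{"request": {
    "url": "http://192.0.2.10/login",
    "postData": {"text": "email=alice@example.com&mac=AA:BB:CC:DD:EE:FF"},
    "headers": [
        {"name": "X-Device", "value": "02:a1:b2:c3:d4:e5"},
        {"name": "X-User", "value": "user_a1b2c3d4@redacted.invalid"},
    ],
}}]}}
leaks = find_leaks(sample)
```

Here the form body still contains a real email and MAC, so both are reported, while the already-redacted header values pass.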
Platform Support
| Component | Windows | macOS | Linux |
|---|---|---|---|
| Sanitization | Yes | Yes | Yes |
| Validation | Yes | Yes | Yes |
| CLI | Yes | Yes | Yes |
| Capture | Yes | Yes | Yes |
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
# Type checking
mypy src/har_capture
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.