HTTP Response Fuzzy Hashing
HRFH - HTTP Response Fuzzy Hashing
A Python library for generating fuzzy hashes of HTTP responses, useful for identifying similar web content, detecting CDN configurations, and analyzing web infrastructure.
Features
- Fast Processing: Efficient HTTP response parsing and hashing
- Fuzzy Hashing: Generate consistent hashes for similar content
- Content Masking: Intelligent masking of dynamic content (timestamps, IDs, etc.)
- Multiple Formats: Support for raw HTTP responses and JSON data
- Python 3.7+: Compatible with modern Python versions
- Easy Integration: Simple API for embedding in your projects
Installation
From PyPI (Recommended)
pip install hrfh
From Source
git clone https://github.com/yourusername/hrfh.git
cd hrfh
uv sync
Quick Start
Basic Usage
from hrfh.utils.parser import create_http_response_from_bytes
# Parse HTTP response from bytes
response = create_http_response_from_bytes(
b"""HTTP/1.0 200 OK\r\nServer: nginx\r\nServer: apache\r\nETag: ea67ba7f802fb5c6cfa13a6b6d27adc6\r\n\r\n"""
)
# Get basic response info
print(response)
# Output: <HTTPResponse 1.1.1.1:80 200 OK>
# Get masked content (with dynamic parts masked)
print(response.masked)
# Output: HTTP/1.0 200 OK
# ETag: [MASK]
# Server: apache
# Server: nginx
# Generate fuzzy hash for similarity detection
print(response.fuzzy_hash())
# Output: ba15cc1f9ad3ef632d0ce7798f7fa44718f1e7fcc2c0f94c1a702f647b79923b
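To make the idea behind the masked output above concrete, here is a minimal standalone sketch of mask-then-hash using only the standard library. It is not hrfh's actual implementation: the `HEX_TOKEN` pattern, the `mask_headers` and `sketch_hash` names, and the choice of SHA-256 are all assumptions made for illustration, so the digests it produces will not match the library's.

```python
import hashlib
import re

# Hypothetical rule for "dynamic" values: any run of 16 or more hex digits.
# The library's real masking rules may differ.
HEX_TOKEN = re.compile(r"\b[0-9a-fA-F]{16,}\b")


def mask_headers(raw: bytes) -> str:
    """Return the status line plus sorted headers, with long hex tokens masked."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    status_line, *header_lines = head.split("\r\n")
    masked = []
    for line in header_lines:
        name, _, value = line.partition(":")
        value = HEX_TOKEN.sub("[MASK]", value.strip())
        masked.append(f"{name.strip()}: {value}")
    # Sorting makes the result independent of the order headers arrived in.
    return "\n".join([status_line] + sorted(masked))


def sketch_hash(raw: bytes) -> str:
    """Hash the masked form, so responses differing only in masked values collide."""
    return hashlib.sha256(mask_headers(raw).encode("utf-8")).hexdigest()


SAMPLE = (
    b"HTTP/1.0 200 OK\r\nServer: nginx\r\nServer: apache\r\n"
    b"ETag: ea67ba7f802fb5c6cfa13a6b6d27adc6\r\n\r\n"
)
print(mask_headers(SAMPLE))
```

Two responses that differ only in their ETag value produce the same digest under this sketch, which is the property that makes such a hash useful for grouping similar responses.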
Interactive Example
>>> from hrfh.utils.parser import create_http_response_from_bytes
>>> response = create_http_response_from_bytes(b"""HTTP/1.0 200 OK\r\nServer: nginx\r\nServer: apache\r\nETag: ea67ba7f802fb5c6cfa13a6b6d27adc6\r\n\r\n""")
>>> print(response)
<HTTPResponse 1.1.1.1:80 200 OK>
>>> print(response.masked)
HTTP/1.0 200 OK
ETag: [MASK]
Server: apache
Server: nginx
>>> print(response.fuzzy_hash())
ba15cc1f9ad3ef632d0ce7798f7fa44718f1e7fcc2c0f94c1a702f647b79923b
API Reference
Core Classes
HTTPResponse
Main class for representing HTTP responses with fuzzy hashing capabilities.
from hrfh.models import HTTPResponse
response = HTTPResponse(
ip="1.2.3.4",
port=80,
version="HTTP/1.1",
status_code=200,
status_reason="OK",
headers=[("Server", "nginx"), ("Content-Type", "text/html")],
body=b"<html>Hello World</html>"
)
Key Methods:
- fuzzy_hash(): Generate fuzzy hash for similarity detection
- masked: Get masked content with dynamic parts hidden
- dump(): Get formatted HTTP response string
HTTPRequest
Class for representing HTTP requests.
from hrfh.models import HTTPRequest
request = HTTPRequest(
ip="1.2.3.4",
port=80,
method="GET",
version="HTTP/1.1",
headers=[("Host", "example.com")],
body=b""
)
Utility Functions
Parsing Functions
from hrfh.utils.parser import (
create_http_response_from_bytes,
create_http_response_from_json,
create_http_request_from_json
)
# Parse from raw HTTP response bytes
response = create_http_response_from_bytes(http_bytes)
# Parse from JSON data
response = create_http_response_from_json(json_data)
request = create_http_request_from_json(json_data)
Advanced Usage
Working with JSON Data
import json
from hrfh.utils.parser import create_http_response_from_json
# Load HTTP response data from JSON file
with open('response_data.json', 'r') as f:
    data = json.load(f)
response = create_http_response_from_json(data)
hash_value = response.fuzzy_hash()
Example JSON format:
{
"ip": "104.103.147.116",
"timestamp": 1717146116,
"status_code": 400,
"status_reason": "Bad Request",
"headers": {
"Server": "AkamaiGHost",
"Content-Type": "text/html",
"Content-Length": "312"
},
"body": "<HTML><HEAD><TITLE>Invalid URL</TITLE></HEAD><BODY>...</BODY></HTML>"
}
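If records are stored in the JSON shape above but a raw byte stream is needed, a record can be serialized back into HTTP response bytes. The following is a rough sketch under stated assumptions: `record_to_raw` is a hypothetical helper, not part of hrfh, and because the example record carries no protocol version, the `HTTP/1.1` default is a guess.

```python
def record_to_raw(record: dict, version: str = "HTTP/1.1") -> bytes:
    """Serialize a JSON-style record into raw HTTP response bytes.

    `version` is an assumed default; the example record has no version field.
    """
    status_line = f"{version} {record['status_code']} {record['status_reason']}"
    header_lines = [
        f"{name}: {value}" for name, value in record.get("headers", {}).items()
    ]
    head = "\r\n".join([status_line, *header_lines])
    body = record.get("body", "")
    # A blank line (CRLF CRLF) separates the header block from the body.
    return head.encode("latin-1") + b"\r\n\r\n" + body.encode("utf-8")


record = {
    "status_code": 400,
    "status_reason": "Bad Request",
    "headers": {"Server": "AkamaiGHost", "Content-Type": "text/html"},
    "body": "<HTML></HTML>",
}
print(record_to_raw(record))
```

Fields such as `ip` and `timestamp` have no place in the raw form, so they are dropped by this conversion.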
Batch Processing
import json
import os
from hrfh.utils.parser import create_http_response_from_json

def process_responses(data_dir):
    """Hash every JSON response found under data_dir/<cdn>/*.json."""
    results = {}
    for cdn_dir in os.listdir(data_dir):
        cdn_path = os.path.join(data_dir, cdn_dir)
        if os.path.isdir(cdn_path):
            for json_file in os.listdir(cdn_path):
                if json_file.endswith('.json'):
                    file_path = os.path.join(cdn_path, json_file)
                    with open(file_path, 'r') as f:
                        data = json.load(f)
                    response = create_http_response_from_json(data)
                    hash_value = response.fuzzy_hash()
                    # Responses sharing a hash overwrite each other here
                    results[hash_value] = response
    return results

# Usage
results = process_responses('data/')
for hash_val, response in results.items():
    print(f"{hash_val[:16]} {response}")
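The batch example keeps one response per hash, so later responses with the same hash replace earlier ones. When the goal is to see which responses cluster together, collecting labels per hash may be more useful. This is a small sketch, not part of hrfh: `group_by_hash` is a hypothetical helper, and a plain SHA-256 stands in for `fuzzy_hash()` so the example runs without the library.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def group_by_hash(
    items: Iterable[Tuple[str, bytes]],
    hash_fn: Callable[[bytes], str],
) -> Dict[str, List[str]]:
    """Map each hash to the labels of all payloads that produced it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label, payload in items:
        groups[hash_fn(payload)].append(label)
    return dict(groups)


# Stand-in hash; with hrfh the key would come from response.fuzzy_hash().
def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


samples = [("host-a", b"same"), ("host-b", b"same"), ("host-c", b"different")]
for digest, labels in group_by_hash(samples, sha256_hex).items():
    print(digest[:16], labels)
```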
Development
Setting Up Development Environment
- Clone the repository
git clone https://github.com/yourusername/hrfh.git
cd hrfh
- Install dependencies
uv sync
- Run tests
uv run pytest
- Type checking
uv run mypy hrfh/
Project Structure
hrfh/
├── hrfh/ # Main package
│ ├── models/ # Data models (HTTPRequest, HTTPResponse)
│ ├── utils/ # Utility functions
│ │ ├── parser.py # HTTP parsing utilities
│ │ ├── masker.py # Content masking logic
│ │ ├── hasher.py # Hashing algorithms
│ │ └── tokenizer.py # HTML tokenization
│ └── __main__.py # CLI entry point
├── tests/ # Test suite
├── data/ # Sample data for testing
├── pyproject.toml # Project configuration
└── README.md # This file
Running the CLI Tool
# Install the package in development mode
uv sync
# Run the CLI tool
uv run hrfh --help
# Process a specific file
uv run hrfh data/akamai/104.103.147.116.json
# Process from stdin
cat data/akamai/104.103.147.116.json | uv run hrfh -
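The `-` argument follows the common convention of reading from standard input instead of a file. The sketch below shows one way a command-line tool might implement that convention with `argparse`. It is not hrfh's actual CLI code: `load_record`, `build_parser`, and the `hrfh-sketch` program name are hypothetical.

```python
import argparse
import json
import sys
from typing import Optional, TextIO


def load_record(path: str, stdin: Optional[TextIO] = None) -> dict:
    """Load one JSON record from a file path, or from stdin when path is "-"."""
    if path == "-":
        return json.load(stdin if stdin is not None else sys.stdin)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hrfh-sketch")
    # argparse treats a lone "-" as a positional value, not as an option.
    parser.add_argument("path", help='JSON file to read, or "-" for stdin')
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(load_record(args.path))
```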
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=hrfh
# Run specific test file
uv run pytest tests/test_http_response.py
Examples
CDN Analysis
from hrfh.utils.parser import create_http_response_from_bytes
# Analyze responses from different CDNs
akamai_response = create_http_response_from_bytes(akamai_bytes)
cloudflare_response = create_http_response_from_bytes(cloudflare_bytes)
# Compare hashes to detect similar content
if akamai_response.fuzzy_hash() == cloudflare_response.fuzzy_hash():
    print("Same content served from different CDNs")
Content Change Detection
# Monitor for content changes
old_hash = response.fuzzy_hash()
# After some time...
new_response = create_http_response_from_bytes(new_bytes)
new_hash = new_response.fuzzy_hash()
if old_hash != new_hash:
    print("Content has changed!")
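When many endpoints are monitored, the old-versus-new comparison above needs somewhere to keep the previous hash per endpoint. Here is a minimal in-memory sketch of that bookkeeping. `ChangeTracker` is a hypothetical helper, not part of hrfh, and the hash strings are placeholders for values that would come from `fuzzy_hash()`.

```python
from typing import Dict


class ChangeTracker:
    """Remember the last hash seen per key and report when it changes."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def update(self, key: str, new_hash: str) -> bool:
        """Store new_hash for key; return True if it differs from the previous one.

        The first observation of a key is not counted as a change.
        """
        old = self._last.get(key)
        self._last[key] = new_hash
        return old is not None and old != new_hash


tracker = ChangeTracker()
tracker.update("example.com:80", "hash-1")       # first sighting
if tracker.update("example.com:80", "hash-2"):   # hash differs from last time
    print("Content has changed!")
```

Treating the first sighting as "no change" avoids a burst of alerts when monitoring starts; a persistent store would be needed to keep state across runs.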
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: GitHub Wiki
- Discussions: GitHub Discussions
Acknowledgments
- Built with BeautifulSoup for HTML parsing
- Uses NLTK for natural language processing
- Inspired by fuzzy hashing techniques for digital forensics