HTTP Response Fuzzy Hashing

HRFH - HTTP Response Fuzzy Hashing

A Python library for generating fuzzy hashes of HTTP responses, useful for identifying similar web content, detecting CDN configurations, and analyzing web infrastructure.

Features

  • Fast Processing: Efficient HTTP response parsing and hashing
  • Fuzzy Hashing: Generate consistent hashes for similar content
  • Content Masking: Intelligent masking of dynamic content (timestamps, IDs, etc.)
  • Multiple Formats: Support for raw HTTP responses and JSON data
  • Python 3.7+: Compatible with modern Python versions
  • Easy Integration: Simple API for embedding in your projects

Installation

From PyPI (Recommended)

pip install hrfh

From Source

git clone https://github.com/yourusername/hrfh.git
cd hrfh
uv sync

Quick Start

Basic Usage

from hrfh.utils.parser import create_http_response_from_bytes

# Parse HTTP response from bytes
response = create_http_response_from_bytes(
    b"""HTTP/1.0 200 OK\r\nServer: nginx\r\nServer: apache\r\nETag: ea67ba7f802fb5c6cfa13a6b6d27adc6\r\n\r\n"""
)

# Get basic response info
print(response)
# Output: <HTTPResponse 1.1.1.1:80 200 OK>

# Get masked content (with dynamic parts masked)
print(response.masked)
# Output: HTTP/1.0 200 OK
#         ETag: [MASK]
#         Server: apache
#         Server: nginx

# Generate fuzzy hash for similarity detection
print(response.fuzzy_hash())
# Output: ba15cc1f9ad3ef632d0ce7798f7fa44718f1e7fcc2c0f94c1a702f647b79923b

Interactive Example

>>> from hrfh.utils.parser import create_http_response_from_bytes
>>> response = create_http_response_from_bytes(b"""HTTP/1.0 200 OK\r\nServer: nginx\r\nServer: apache\r\nETag: ea67ba7f802fb5c6cfa13a6b6d27adc6\r\n\r\n""")
>>> print(response)
<HTTPResponse 1.1.1.1:80 200 OK>
>>> print(response.masked)
HTTP/1.0 200 OK
ETag: [MASK]
Server: apache
Server: nginx
>>> print(response.fuzzy_hash())
ba15cc1f9ad3ef632d0ce7798f7fa44718f1e7fcc2c0f94c1a702f647b79923b
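
The idea behind masking before hashing can be illustrated with a small standard-library sketch. This is not hrfh's actual algorithm: the `HEX_TOKEN` pattern and the `mask_headers` and `toy_fuzzy_hash` helpers are hypothetical names invented for this illustration, and the rule "long hex strings are dynamic" is an assumption.

```python
import hashlib
import re

# Assumption: long hex strings (ETags, session IDs) count as dynamic values.
HEX_TOKEN = re.compile(r"\b[0-9a-fA-F]{16,}\b")


def mask_headers(headers):
    """Return sorted 'Name: value' lines with long hex tokens replaced by [MASK]."""
    lines = [f"{name}: {HEX_TOKEN.sub('[MASK]', value)}" for name, value in headers]
    return sorted(lines)


def toy_fuzzy_hash(status_line, headers):
    """SHA-256 over the status line plus the masked, sorted header lines."""
    text = "\n".join([status_line, *mask_headers(headers)])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two responses that differ only in ETag value and header order.
first = toy_fuzzy_hash(
    "HTTP/1.0 200 OK",
    [("Server", "nginx"), ("ETag", "ea67ba7f802fb5c6cfa13a6b6d27adc6")],
)
second = toy_fuzzy_hash(
    "HTTP/1.0 200 OK",
    [("ETag", "0123456789abcdef0123456789abcdef"), ("Server", "nginx")],
)
print(first == second)
```

Because the ETag value is masked and the header lines are sorted, both inputs reduce to the same text and therefore to the same digest.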

API Reference

Core Classes

HTTPResponse

Main class for representing HTTP responses with fuzzy hashing capabilities.

from hrfh.models import HTTPResponse

response = HTTPResponse(
    ip="1.2.3.4",
    port=80,
    version="HTTP/1.1",
    status_code=200,
    status_reason="OK",
    headers=[("Server", "nginx"), ("Content-Type", "text/html")],
    body=b"<html>Hello World</html>"
)

Key methods and properties:

  • fuzzy_hash(): Generate fuzzy hash for similarity detection
  • masked (property): Get masked content with dynamic parts hidden
  • dump(): Get formatted HTTP response string

HTTPRequest

Class for representing HTTP requests.

from hrfh.models import HTTPRequest

request = HTTPRequest(
    ip="1.2.3.4",
    port=80,
    method="GET",
    version="HTTP/1.1",
    headers=[("Host", "example.com")],
    body=b""
)

Utility Functions

Parsing Functions

from hrfh.utils.parser import (
    create_http_response_from_bytes,
    create_http_response_from_json,
    create_http_request_from_json
)

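# Parse from raw HTTP response bytes
# (http_bytes is a placeholder for a bytes object holding a raw response)
response = create_http_response_from_bytes(http_bytes)

# Parse from JSON data
# (json_data is a placeholder for an already-decoded JSON object;
# use response data for the first call and request data for the second)
response = create_http_response_from_json(json_data)
request = create_http_request_from_json(json_data)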

Advanced Usage

Working with JSON Data

import json
from hrfh.utils.parser import create_http_response_from_json

# Load HTTP response data from JSON file
with open('response_data.json', 'r') as f:
    data = json.load(f)

response = create_http_response_from_json(data)
hash_value = response.fuzzy_hash()

Example JSON format:

{
  "ip": "104.103.147.116",
  "timestamp": 1717146116,
  "status_code": 400,
  "status_reason": "Bad Request",
  "headers": {
    "Server": "AkamaiGHost",
    "Content-Type": "text/html",
    "Content-Length": "312"
  },
  "body": "<HTML><HEAD><TITLE>Invalid URL</TITLE></HEAD><BODY>...</BODY></HTML>"
}
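
Before passing a record to the parser, it may help to check that it has the fields shown in the example. The sketch below uses only the standard library. `EXPECTED_KEYS` and `missing_fields` are hypothetical names, and the assumption that all six keys are required is taken from the example record, not from hrfh itself.

```python
import json

# Assumption: these keys mirror the example record; hrfh may accept more or fewer.
EXPECTED_KEYS = ("ip", "timestamp", "status_code", "status_reason", "headers", "body")


def missing_fields(record):
    """Return the expected keys that are absent from the record, in order."""
    return [key for key in EXPECTED_KEYS if key not in record]


raw = """
{
  "ip": "104.103.147.116",
  "timestamp": 1717146116,
  "status_code": 400,
  "status_reason": "Bad Request",
  "headers": {"Server": "AkamaiGHost", "Content-Type": "text/html"},
  "body": "<HTML>...</HTML>"
}
"""
record = json.loads(raw)
print(missing_fields(record))
```

An empty list means the record has the same shape as the example and can be handed to create_http_response_from_json.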

Batch Processing

import json
import os
from hrfh.utils.parser import create_http_response_from_json

def process_responses(data_dir):
    results = {}

    for cdn_dir in os.listdir(data_dir):
        cdn_path = os.path.join(data_dir, cdn_dir)
        if os.path.isdir(cdn_path):
            for json_file in os.listdir(cdn_path):
                if json_file.endswith('.json'):
                    file_path = os.path.join(cdn_path, json_file)
                    with open(file_path, 'r') as f:
                        data = json.load(f)

                    response = create_http_response_from_json(data)
                    hash_value = response.fuzzy_hash()
                    results[hash_value] = response

    return results

# Usage
results = process_responses('data/')
for hash_val, response in results.items():
    print(f"{hash_val[:16]} {response}")
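
Keying `results` by hash keeps only the last response seen for each hash. When the goal is to see which files share a hash, grouping may be more useful. The following sketch shows the grouping step on its own with made-up labels and hash strings; `group_by_hash` is a hypothetical helper, not part of hrfh.

```python
from collections import defaultdict


def group_by_hash(pairs):
    """Group labels by hash value; pairs is an iterable of (label, hash_value)."""
    groups = defaultdict(list)
    for label, hash_value in pairs:
        groups[hash_value].append(label)
    return dict(groups)


# Placeholder data: in practice each hash would come from response.fuzzy_hash().
sample = [
    ("akamai/a.json", "aaa111"),
    ("akamai/b.json", "bbb222"),
    ("fastly/c.json", "aaa111"),
]
clusters = group_by_hash(sample)
for hash_value, labels in clusters.items():
    print(hash_value, len(labels), labels)
```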

Development

Setting Up Development Environment

  1. Clone the repository

    git clone https://github.com/yourusername/hrfh.git
    cd hrfh
    
  2. Install dependencies

    uv sync
    
  3. Run tests

    uv run pytest
    
  4. Type checking

    uv run mypy hrfh/
    

Project Structure

hrfh/
├── hrfh/                    # Main package
│   ├── models/             # Data models (HTTPRequest, HTTPResponse)
│   ├── utils/              # Utility functions
│   │   ├── parser.py       # HTTP parsing utilities
│   │   ├── masker.py       # Content masking logic
│   │   ├── hasher.py       # Hashing algorithms
│   │   └── tokenizer.py    # HTML tokenization
│   └── __main__.py         # CLI entry point
├── tests/                   # Test suite
├── data/                    # Sample data for testing
├── pyproject.toml          # Project configuration
└── README.md               # This file

Running the CLI Tool

# Install the package in development mode
uv sync

# Run the CLI tool
uv run hrfh --help

# Process a specific file
uv run hrfh data/akamai/104.103.147.116.json

# Process from stdin
cat data/akamai/104.103.147.116.json | uv run hrfh -

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=hrfh

# Run specific test file
uv run pytest tests/test_http_response.py

Examples

CDN Analysis

from hrfh.utils.parser import create_http_response_from_bytes

# Analyze responses from different CDNs
# (akamai_bytes and cloudflare_bytes are placeholders for raw HTTP
# responses, as bytes, captured from each CDN)
akamai_response = create_http_response_from_bytes(akamai_bytes)
cloudflare_response = create_http_response_from_bytes(cloudflare_bytes)

# Compare hashes to detect similar content
if akamai_response.fuzzy_hash() == cloudflare_response.fuzzy_hash():
    print("Same content served from different CDNs")

Content Change Detection

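from hrfh.utils.parser import create_http_response_from_bytes

# Monitor for content changes
# (response is a previously parsed HTTPResponse; new_bytes is a placeholder
# for a freshly captured raw response from the same endpoint)
old_hash = response.fuzzy_hash()

# After some time...
new_response = create_http_response_from_bytes(new_bytes)
new_hash = new_response.fuzzy_hash()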

if old_hash != new_hash:
    print("Content has changed!")
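
When several endpoints are monitored, the last hash seen for each one has to be stored somewhere. A minimal in-memory sketch follows. `HashTracker` is a hypothetical class written for this illustration, and the hash strings are placeholders for values returned by fuzzy_hash().

```python
class HashTracker:
    """Remember the last hash seen per target and report changes."""

    def __init__(self):
        self._last_seen = {}

    def update(self, target, hash_value):
        """Store hash_value; return True if it differs from the previous one.

        The first observation of a target is not reported as a change.
        """
        previous = self._last_seen.get(target)
        self._last_seen[target] = hash_value
        return previous is not None and previous != hash_value


tracker = HashTracker()
observations = [
    tracker.update("example.com", "aaa111"),  # first sighting
    tracker.update("example.com", "aaa111"),  # unchanged
    tracker.update("example.com", "bbb222"),  # changed
]
print(observations)
```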

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with BeautifulSoup for HTML parsing
  • Uses NLTK for natural language processing
  • Inspired by fuzzy hashing techniques for digital forensics
