
Hyper-fast HTTP Scraping Tool


HTTPZ Web Scanner

A high-performance concurrent HTTP recon tool. HTTPZ checks domains for HTTP/HTTPS services and pulls back status codes, titles, body previews, response headers, favicon hashes, TLS certificate info, and resolved IPs — all configurable per scan.

Designed to run as a library inside distributed workers scanning hundreds of millions of domains.

Requirements

Python 3. Dependencies (including aiohttp and aiodns) are installed automatically by pip, or from requirements.txt when running from source.

Installation

Via pip (recommended)

pip install httpz_scanner
httpz --help

From source

git clone https://github.com/acidvegas/httpz
cd httpz
pip install -r requirements.txt

CLI usage

Basic:

python -m httpz_scanner domains.txt

All fields, JSONL output to stdout and a file:

python -m httpz_scanner domains.txt -all -c 100 -j -o results.jsonl

Read from stdin:

cat domains.txt | python -m httpz_scanner - -all
echo example.com | python -m httpz_scanner - -all

Filter by status code:

python -m httpz_scanner domains.txt -mc 200,301-399 -ec 404,500
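The CODES value accepts both single codes and ranges, as in 200,301-399 above. A hypothetical parser for that shape, to make the semantics concrete (the real CLI's parsing may differ):

```python
def parse_codes(spec: str) -> set[int]:
    """Expand a spec like '200,301-399' into the set of matching status codes."""
    codes = set()
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            codes.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            codes.add(int(part))
    return codes

match = parse_codes('200,301-399')
print(200 in match, 301 in match, 399 in match, 400 in match)
```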

Specific fields with custom timeout and resolvers:

python -m httpz_scanner domains.txt -sc -ti -i -tls -to 10 -r resolvers.txt

Distributed scanning

Built-in shard mode splits a file across N workers (line-modulo):

# Machine 1
httpz domains.txt --shard 1/3
# Machine 2
httpz domains.txt --shard 2/3
# Machine 3
httpz domains.txt --shard 3/3

Workers can also handle their own line offsetting and feed domains directly to the library — see below.
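Line-modulo sharding means worker N of T takes every input line whose zero-based line number satisfies line % T == N - 1 (assuming 1-indexed shards, as in the example above). A sketch of the assignment, not the library's code:

```python
def shard_lines(lines, index: int, total: int):
    """Yield the lines belonging to shard `index` of `total` (1-indexed)."""
    for i, line in enumerate(lines):
        if i % total == index - 1:
            yield line

# Seven domains split across three shards: no overlap, nothing skipped.
domains = [f'host{i}.example' for i in range(7)]
print(list(shard_lines(domains, 1, 3)))  # lines 0, 3, 6
print(list(shard_lines(domains, 2, 3)))  # lines 1, 4
print(list(shard_lines(domains, 3, 3)))  # lines 2, 5
```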

Library usage

import asyncio
from httpz_scanner import HTTPZScanner

async def domain_source():
    # Any of: list, async generator, sync generator, file path string, '-'
    for d in ['example.com', 'github.com', 'cloudflare.com']:
        yield d

async def main():
    scanner = HTTPZScanner(
        concurrent_limit = 100,
        timeout          = 5,
        retries          = 1,
        retry_backoff    = 0.5,
        follow_redirects = True,
        max_body_size    = 1024 * 1024,
        favicon_max_size = 256 * 1024,

        # Feature toggles — all default OFF
        fetch_headers        = True,
        fetch_content_type   = True,
        fetch_content_length = True,
        fetch_title          = True,
        fetch_body           = True,
        fetch_favicon        = True,
        fetch_tls            = True,
        fetch_ips            = True,

        # Optional filters
        match_codes   = None,        # e.g. {200, 301, 302}
        exclude_codes = None,        # e.g. {404, 500}

        # Optional knobs
        custom_headers = None,       # {'X-Foo': 'bar'}
        post_data      = None,
        shard          = None,       # (index, total) — workers usually do this themselves
        resolvers      = None,       # ['1.1.1.1', '8.8.8.8'] for A/AAAA lookups
        dns_timeout    = 2.0,
    )

    async for result in scanner.scan(domain_source()):
        print(result['domain'], result['status'])

asyncio.run(main())

The scanner accepts:

  • a file path (string)
  • '-' for stdin
  • a list/tuple of domains
  • a sync iterator/generator
  • an async generator

Graceful shutdown

Workers receiving SIGTERM (or any orchestrator signal) can drain cleanly:

async def supervisor(scanner, scan_iterator):
    async for result in scan_iterator:
        ...

async def main():
    scanner = HTTPZScanner(...)
    scan_task = asyncio.create_task(supervisor(scanner, scanner.scan(domains)))

    # Later, on shutdown signal:
    await scanner.stop()    # drops queued domains, lets in-flight finish, exits
    await scan_task

stop() is idempotent and async-safe.
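One way to wire stop() to a SIGTERM from an orchestrator is asyncio's signal support. The sketch below stubs HTTPZScanner with a minimal stand-in (FakeScanner is not part of the library) so the wiring can be shown end to end:

```python
import asyncio
import os
import signal

class FakeScanner:
    """Stand-in for HTTPZScanner, used only to illustrate the shutdown wiring."""
    def __init__(self):
        self.stopped = asyncio.Event()

    async def scan(self, domains):
        for d in domains:
            if self.stopped.is_set():
                return                      # drain: stop pulling new domains
            await asyncio.sleep(0.01)       # pretend to perform a check
            yield {'domain': d, 'status': 200}

    async def stop(self):
        self.stopped.set()

async def main():
    scanner = FakeScanner()
    loop = asyncio.get_running_loop()
    # On SIGTERM, schedule stop() on the loop (safe to call more than once).
    loop.add_signal_handler(signal.SIGTERM,
                            lambda: asyncio.ensure_future(scanner.stop()))

    seen = []
    async def run():
        async for result in scanner.scan(['a.example'] * 1000):
            seen.append(result)

    task = asyncio.create_task(run())
    await asyncio.sleep(0.05)
    os.kill(os.getpid(), signal.SIGTERM)    # simulate orchestrator shutdown
    await task                              # generator drains and exits cleanly
    return len(seen)

processed = asyncio.run(main())
print(f'processed {processed} of 1000 before shutdown')
```

The handler schedules stop() rather than calling it directly because signal handlers must not await; ensure_future hands the coroutine to the running loop.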

Result schema

Each yielded result is a dict. Fields appear only when their feature toggle is on and data is available.

{
  "domain":      "example.com",
  "url":         "https://example.com/",
  "status":      200,                          // -1 on error
  "protocol":    "https",                      // or "http"

  // -- toggleable fields --
  "response_headers": {"Server": "...", ...},  // fetch_headers
  "content_type":     "text/html; charset=utf-8",
  "content_length":   1234,
  "redirect_chain":   ["https://example.com", "https://www.example.com/"],
  "title":            "Example Domain",        // single line, max 1024 chars
  "body_preview":     "<!doctype html>...",    // first 1024 raw bytes, normalized
  "body_clean":       "Example Domain ...",    // HTML-stripped, max 1024 chars
  "favicon_hash":     "1014476666658474844",   // mmh3 64-bit, capped read
  "ips":              ["93.184.216.34", "..."],
  "tls": {
    "fingerprint": "<sha256 hex>",
    "subject":     "*.example.com",
    "issuer":      "DigiCert TLS RSA SHA256 2020 CA1",
    "email":       null,
    "alt_names":   ["*.example.com", "example.com"],
    "not_before":  "2026-01-15T00:00:00",
    "not_after":   "2027-02-14T23:59:59"
  },

  // -- only on failure --
  "error":      "Connection timed out",
  "error_type": "TIMEOUT"   // CONN | SSL | CERT | TIMEOUT | HTTP | UNKNOWN | PROCESS | TASK | NO_RESPONSE
}
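Because toggled-off or unavailable fields are simply absent, consumers should read them defensively rather than index directly. A small example against an illustrative result dict (values are made up, shaped like the schema above):

```python
# A sample result shaped like the schema above; only some toggles were on.
result = {
    'domain':   'example.com',
    'url':      'https://example.com/',
    'status':   200,
    'protocol': 'https',
    'title':    'Example Domain',
}

# Optional fields may be missing entirely, so use .get() with a default.
title  = result.get('title', '')
ips    = result.get('ips', [])
tls    = result.get('tls') or {}
issuer = tls.get('issuer')          # None when TLS info was not captured

# 'error' appears only on failure; status is -1 on error.
ok = result['status'] != -1 and 'error' not in result
print(ok, title, issuer)
```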

Protocol fallback

  • https://x → tries https, falls back to http on connection failure
  • http://x → tries http, falls back to https on connection failure
  • x (no scheme) → tries https, falls back to http

Any HTTP response (including 4xx/5xx) is accepted — only connection-level errors trigger fallback.
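The rule can be sketched in a few lines: any completed HTTP response counts as success, and only connection-level errors move on to the other protocol. This is an illustration of the behavior described above, not the library's internals:

```python
def attempt_order(target: str) -> list[str]:
    """Protocol order implied by the input's scheme, per the fallback rules."""
    if target.startswith('https://'):
        host = target[len('https://'):]
        return [f'https://{host}', f'http://{host}']
    if target.startswith('http://'):
        host = target[len('http://'):]
        return [f'http://{host}', f'https://{host}']
    return [f'https://{target}', f'http://{target}']   # no scheme: https first

def check(target, fetch):
    """fetch(url) returns a status int or raises ConnectionError."""
    last_err = None
    for url in attempt_order(target):
        try:
            return url, fetch(url)   # 4xx/5xx still returned: no fallback
        except ConnectionError as e:
            last_err = e             # connection-level error: try next protocol
    raise last_err

# An http-only host: the https attempt fails at connection level,
# then http answers with a 404, which is accepted.
def fake_fetch(url):
    if url.startswith('https://'):
        raise ConnectionError('connection refused')
    return 404

url, status = check('example.com', fake_fetch)
print(url, status)
```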

Retries

retries is per protocol and applies only to transient error types (TIMEOUT, CONN, HTTP). Cert errors, DNS failures, and completed HTTP responses (any status code) never retry. Backoff is linear: retry_backoff * (attempt + 1).
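With the defaults (retries=1, retry_backoff=0.5), the linear formula gives the sleep times below. A worked check of the formula, not library code:

```python
def backoff_delays(retries: int, retry_backoff: float) -> list[float]:
    # Delay before retry `attempt` (0-indexed): retry_backoff * (attempt + 1)
    return [retry_backoff * (attempt + 1) for attempt in range(retries)]

print(backoff_delays(1, 0.5))   # default: a single retry after 0.5 s
print(backoff_delays(3, 0.5))   # 0.5 s, then 1.0 s, then 1.5 s
```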

Performance notes for distributed use

  • force_close=True on the connector — keep-alive is disabled (you're scanning unique hosts).
  • TLS certs are captured from the original request's connection via a connector subclass, so no second handshake is needed per HTTPS domain.
  • DNS uses aiodns + 5-minute in-process cache.
  • Bounded internal queue (concurrent_limit * 2) keeps memory flat regardless of input size.
  • Ensure your worker's ulimit -n is high enough for concurrent_limit * 2 sockets.

CLI arguments

Argument     Long form               Description
file                                 Domain file (one per line) or - for stdin
-c N         --concurrent N          Concurrent in-flight checks (default 100)
-to N        --timeout N             Request timeout in seconds (default 5)
-rt N        --retries N             Retry attempts per protocol (default 1)
-rb N        --retry-backoff N       Linear backoff base seconds (default 0.5)
-mb N        --max-body-size N       Max body bytes to read (default 1 MB)
-fm N        --favicon-max-size N    Max favicon bytes (default 256 KB)
-dt N        --dns-timeout N         DNS query timeout (default 2.0)
-fr          --follow-redirects      Follow redirects (max 10)
-r FILE      --resolvers FILE        DNS resolver IP list for IP lookups
-hd "k: v"   --headers "k: v,..."    Custom request headers
-pd DATA     --post-data DATA        Send POST with this body
-sh N/T      --shard N/T             Shard N of T (line-modulo)
-mc CODES    --match-codes CODES     Only show these status codes
-ec CODES    --exclude-codes CODES   Exclude these status codes
-o FILE      --output FILE           Append-write JSONL to file
-j           --jsonl                 Print JSONL to stdout
-p           --progress              Show numeric counter alongside output
-d           --debug                 Show error states and debug logs
-all         --all-flags             Enable every output field
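The -hd/--headers value packs several headers into one string. A hypothetical parser for that "k: v,..." shape, to pin down the format (the real CLI's parsing may differ, e.g. for values containing commas):

```python
def parse_headers(spec: str) -> dict[str, str]:
    """Split a spec like 'User-Agent: foo,X-Api-Key: bar' into a header dict."""
    headers = {}
    for pair in spec.split(','):
        if ':' not in pair:
            continue                         # skip malformed fragments
        key, value = pair.split(':', 1)      # values may contain ':' themselves
        headers[key.strip()] = value.strip()
    return headers

hdrs = parse_headers('User-Agent: httpz,X-Scan-Id: abc123')
print(hdrs)
```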

Field flags

Flag     Long form           Description
-sc      --status-code       Status code
-ct      --content-type      Content-Type header
-cl      --content-length    Content-Length header
-ti      --title             Page title (≤1024 chars)
-b       --body              body_preview + body_clean
-i       --ip                A/AAAA records
-f       --favicon           mmh3 favicon hash
-hr      --show-headers      Full response headers
-tls     --tls-info          TLS certificate fields

Mirrors: SuperNETs, GitHub, GitLab, Codeberg
