Hyper-fast HTTP Scraping Tool
Project description
HTTPZ Web Scanner
A high-performance concurrent HTTP recon tool. HTTPZ checks domains for HTTP/HTTPS services and pulls back status codes, titles, body previews, response headers, favicon hashes, TLS certificate info, and resolved IPs — all configurable per scan.
Designed to run as a library inside distributed workers scanning hundreds of millions of domains.
Requirements
Installation
Via pip (recommended)
pip install httpz_scanner
httpz --help
From source
git clone https://github.com/acidvegas/httpz
cd httpz
pip install -r requirements.txt
CLI usage
Basic:
python -m httpz_scanner domains.txt
All fields, JSONL output to stdout and a file:
python -m httpz_scanner domains.txt -all -c 100 -j -o results.jsonl
Read from stdin:
cat domains.txt | python -m httpz_scanner - -all
echo example.com | python -m httpz_scanner - -all
Filter by status code:
python -m httpz_scanner domains.txt -mc 200,301-399 -ec 404,500
Specific fields with custom timeout and resolvers:
python -m httpz_scanner domains.txt -sc -ti -i -tls -to 10 -r resolvers.txt
Distributed scanning
Built-in shard mode splits a file across N workers (line-modulo):
# Machine 1
httpz domains.txt --shard 1/3
# Machine 2
httpz domains.txt --shard 2/3
# Machine 3
httpz domains.txt --shard 3/3
Workers can also handle their own line offsetting and feed domains directly to the library — see below.
Library usage
import asyncio
from httpz_scanner import HTTPZScanner
async def domain_source():
# Any of: list, async generator, sync generator, file path string, '-'
for d in ['example.com', 'github.com', 'cloudflare.com']:
yield d
async def main():
scanner = HTTPZScanner(
concurrent_limit = 100,
timeout = 5,
retries = 1,
retry_backoff = 0.5,
follow_redirects = True,
# Feature toggles — all default OFF
fetch_headers = True,
fetch_content_type = True,
fetch_content_length = True,
fetch_title = True,
fetch_body = True,
fetch_favicon = True,
fetch_tls = True,
fetch_ips = True,
fetch_cname = True, # follow CNAME chain (max 3) and scan the final hop
# Optional filters
match_codes = None, # e.g. {200, 301, 302}
exclude_codes = None, # e.g. {404, 500}
# Optional knobs
custom_headers = None, # {'X-Foo': 'bar'}
post_data = None,
shard = None, # (index, total) — workers usually do this themselves
resolvers = None, # ['1.1.1.1', '8.8.8.8'] for A/AAAA lookups
dns_timeout = 2.0,
)
async for result in scanner.scan(domain_source()):
print(result['domain'], result['status'])
asyncio.run(main())
The scanner accepts:
- a file path (string)
'-'for stdin- a list/tuple of domains
- a sync iterator/generator
- an async generator
Graceful shutdown
Workers receiving SIGTERM (or any orchestrator signal) can drain cleanly:
async def supervisor(scanner, scan_iterator):
async for result in scan_iterator:
...
scanner = HTTPZScanner(...)
scan_task = asyncio.create_task(supervisor(scanner, scanner.scan(domains)))
# Later, on shutdown signal:
await scanner.stop() # drops queued domains, lets in-flight finish, exits
await scan_task
stop() is idempotent and async-safe.
Result schema
Each yielded result is a dict. Fields appear only when their feature toggle is on and data is available.
{
"domain": "example.com",
"url": "https://example.com/",
"status": 200, // -1 on error
"protocol": "https", // or "http"
// -- toggleable fields --
"response_headers": {"Server": "...", ...}, // fetch_headers
"content_type": "text/html; charset=utf-8",
"content_length": 1234,
"redirect_chain": ["https://example.com", "https://www.example.com/"],
"cname_chain": ["example.com", "edge.example.net", "akamai.net"], // up to 3 entries
"title": "Example Domain", // single line, max 1024 chars
"body_preview": "<!doctype html>...", // first 1024 raw bytes, normalized
"body_clean": "Example Domain ...", // HTML-stripped, max 1024 chars
"favicon_hash": "1014476666658474844", // mmh3 64-bit, capped at 256 KB
"ips": ["93.184.216.34", "..."],
"tls": {
"fingerprint": "<sha256 hex>",
"subject": "*.example.com",
"issuer": "DigiCert TLS RSA SHA256 2020 CA1",
"email": null,
"alt_names": ["*.example.com", "example.com"],
"not_before": "2026-01-15T00:00:00",
"not_after": "2027-02-14T23:59:59"
},
// -- only on failure --
"error": "Connection timed out",
"error_type": "TIMEOUT" // CONN | SSL | CERT | TIMEOUT | HTTP | UNKNOWN | PROCESS | TASK | NO_RESPONSE
}
Protocol fallback
https://x→ tries https, falls back to http on connection failurehttp://x→ tries http, falls back to https on connection failurex(no scheme) → tries https, falls back to http
Any HTTP response (including 4xx/5xx) is accepted — only connection-level errors trigger fallback.
Retries
retries is per protocol, applied only to transient errors (TIMEOUT, CONN, HTTP). Cert errors, DNS failures, and HTTP responses do not retry. Backoff is linear: retry_backoff * (attempt + 1).
Performance notes for distributed use
force_close=Trueon the connector — keep-alive is disabled (you're scanning unique hosts).- TLS cert is captured from the original request's connection via a connector subclass, no second handshake per https domain.
- DNS uses
aiodns+ 5-minute in-process cache. - Bounded internal queue (
concurrent_limit * 2) keeps memory flat regardless of input size. - Ensure your worker's
ulimit -nis high enough forconcurrent_limit * 2sockets.
CLI arguments
| Argument | Long form | Description |
|---|---|---|
file |
Domain file (one per line) or - for stdin |
|
-c N |
--concurrent N |
Concurrent in-flight checks (default 100) |
-to N |
--timeout N |
Request timeout in seconds (default 5) |
-rt N |
--retries N |
Retry attempts per protocol (default 1) |
-rb N |
--retry-backoff N |
Linear backoff base seconds (default 0.5) |
-dt N |
--dns-timeout N |
DNS query timeout (default 2.0) |
-fr |
--follow-redirects |
Follow redirects (max 10) |
-r FILE |
--resolvers FILE |
DNS resolver IP list for IP lookups |
-hd "k: v,..." |
--headers "k: v,..." |
Custom request headers |
-pd DATA |
--post-data DATA |
Send POST with this body |
-sh N/T |
--shard N/T |
Shard N of T (line-modulo) |
-mc CODES |
--match-codes CODES |
Only show these status codes |
-ec CODES |
--exclude-codes CODES |
Exclude these status codes |
-o FILE |
--output FILE |
Append-write JSONL to file |
-j |
--jsonl |
Print JSONL to stdout |
-p |
--progress |
Show numeric counter alongside output |
-d |
--debug |
Show error states and debug logs |
-all |
--all-flags |
Enable every output field |
Field flags
| Flag | Long form | Description |
|---|---|---|
-sc |
--status-code |
Status code |
-ct |
--content-type |
Content-Type header |
-cl |
--content-length |
Content-Length header |
-ti |
--title |
Page title (≤1024 chars) |
-b |
--body |
body_preview + body_clean |
-i |
--ip |
A/AAAA records |
-f |
--favicon |
mmh3 favicon hash |
-hr |
--show-headers |
Full response headers |
-tls |
--tls-info |
TLS certificate fields |
-cn |
--cname |
CNAME chain (max 3) + scan target hostname |
Mirrors: SuperNETs • GitHub • GitLab • Codeberg
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file httpz_scanner-3.1.2.tar.gz.
File metadata
- Download URL: httpz_scanner-3.1.2.tar.gz
- Upload date:
- Size: 22.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
93106e78c3e9ec9df42b6c3d49f81908d2c7ed2b5eb0f574b8caa4199539807a
|
|
| MD5 |
55007dceecca64e1155744f67fd7e744
|
|
| BLAKE2b-256 |
0d2934131e880c4afe80a0a9370f23b8bb3dca68517b1568237f9f6effd1651a
|
File details
Details for the file httpz_scanner-3.1.2-py3-none-any.whl.
File metadata
- Download URL: httpz_scanner-3.1.2-py3-none-any.whl
- Upload date:
- Size: 22.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
db1f102ae70a7558a28e489a0488375153b5f8548599062fd825bcfffaa4f48c
|
|
| MD5 |
346a9699f3ac915584ca3861dbd0a2bb
|
|
| BLAKE2b-256 |
bf5b54bda6bcb09a861eea3e99042fec91009de82df05c4dd287b03d7ca450b9
|