# scraperecon

A CLI tool that runs a sequential reconnaissance pipeline against a target URL before a developer writes a scraper.

Run this before you write a scraper. It tells you what bot protection a site has, whether plain HTTP or TLS impersonation is enough to get through, and how aggressively it rate limits — before you've written a single line of scraper code.


## Example

scraperecon https://target.com
scraperecon — https://target.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Stage 1 — Plain HTTP (httpx, scraper User-Agent)
  Status:   403 Forbidden
  Time:     212ms
  Verdict:  Blocked

Stage 2 — TLS Impersonation (chrome131)
  Status:   200 OK
  Time:     389ms
  Verdict:  Open
  Note:     TLS fingerprint was the blocker ✓

Stage 3 — Vendor Detection
  Vendor:     Cloudflare
  Confidence: High
  Signals:    cf-ray header, __cf_bm cookie, challenges.cloudflare.com in body

Stage 4 — Rate Limit Probe
  Skipped (pass --probe-rate to enable)

Recommendation
  Use curl_cffi with chrome131 TLS profile
  No CAPTCHA detected at probe volume
  Proxy rotation not required at low request rates

## What it does

scraperecon runs four stages against a URL in order, stopping early where it can.

### Stage 1 — Plain HTTP

A basic GET with no tricks. If this comes back clean, you don't need anything else — plain httpx or requests will work fine and you can stop here.

It also checks whether a 200 response is actually real content or a JS challenge page. Cloudflare in particular loves returning 200 with a challenge rather than a 403. scraperecon catches that and marks it Challenged instead of lying to you with a green Open.
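
For comparison, here is a minimal sketch of the Stage 1 idea in plain Python. It is not scraperecon's actual code: the User-Agent string, marker list, and verdict labels are illustrative, and real challenge detection looks at more signals than a handful of substrings.

```python
import httpx

# Illustrative markers for "this 200 is really a JS challenge page".
CHALLENGE_MARKERS = [
    "challenges.cloudflare.com",
    "Just a moment",
    "Checking your browser",
]

def probe_plain_http(url: str, timeout: float = 10.0) -> str:
    # A basic GET with an obviously non-browser User-Agent.
    resp = httpx.get(
        url,
        headers={"User-Agent": "python-httpx"},
        timeout=timeout,
        follow_redirects=True,
    )
    if resp.status_code in (403, 429, 503):
        return "Blocked"
    if any(marker in resp.text for marker in CHALLENGE_MARKERS):
        return "Challenged"  # 200 status, but not the real content
    return "Open"

print(probe_plain_http("https://example.com"))
```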

### Stage 2 — TLS Impersonation

If Stage 1 was blocked or challenged, it retries using curl_cffi impersonating Chrome's TLS fingerprint. A lot of bot detection happens at the TLS handshake level — Python's requests library has a completely different fingerprint from a real browser, and that alone is enough to get you blocked on many sites before the server has even looked at your headers. If Stage 2 passes where Stage 1 didn't, you know exactly what the fix is.
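
Doing the same retry by hand looks roughly like this with curl_cffi (a sketch, assuming your installed curl_cffi version ships the chrome131 impersonation profile; scraperecon's own code may differ):

```python
from curl_cffi import requests as cffi_requests

def probe_tls_impersonation(url: str, profile: str = "chrome131") -> int:
    # Same GET as Stage 1, but presented with a real Chrome TLS fingerprint.
    resp = cffi_requests.get(url, impersonate=profile, timeout=10)
    return resp.status_code

status = probe_tls_impersonation("https://example.com")
print("Open" if status == 200 else f"Blocked ({status})")
```

If this returns 200 where the plain httpx request got a 403, the TLS fingerprint was the blocker, which is exactly the conclusion Stage 2 reports.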

### Stage 3 — Vendor Detection

Inspects headers, cookies, and the response body for known signatures and tells you which bot protection vendor is running. This matters because Cloudflare, DataDome, Akamai, and PerimeterX all require different bypass strategies. Knowing which one you're dealing with upfront saves you from trying things that were never going to work.

### Stage 4 — Rate Limit Probe (opt-in)

Fires N requests with configurable concurrency and watches what happens — hard 429s, silent response time degradation, mid-session redirects. Off by default because blasting a site without thinking about it is bad practice. Pass --probe-rate when you actually need the data.
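
A rough illustration of what such a probe can look like (not the shipped implementation): a thread pool fires the requests, and you compare status codes and early versus late response times to spot hard 429s or silent slowdowns.

```python
from concurrent.futures import ThreadPoolExecutor
import time

import httpx

def rate_probe(url: str, total: int = 20, concurrency: int = 5) -> None:
    def one_request(client: httpx.Client) -> tuple[int, float]:
        start = time.monotonic()
        resp = client.get(url)
        return resp.status_code, time.monotonic() - start

    # httpx.Client is thread-safe, so the workers can share one connection pool.
    with httpx.Client(timeout=10, follow_redirects=True) as client:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            results = list(pool.map(one_request, [client] * total))

    codes = [code for code, _ in results]
    times = [latency for _, latency in results]
    print(f"429 responses: {codes.count(429)}/{total}")
    print(f"mean latency, first 5 vs last 5: "
          f"{sum(times[:5]) / 5:.2f}s vs {sum(times[-5:]) / 5:.2f}s")

rate_probe("https://example.com")
```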


## Install

pipx install scraperecon

## Usage

scraperecon https://target.com
scraperecon https://target.com --probe-rate
scraperecon https://target.com --probe-rate --concurrency 10 --requests 50
scraperecon https://target.com --impersonate firefox120
scraperecon https://target.com --save
scraperecon https://target.com --json | jq .recommendation

| Flag            | Default   | Description                                                                          |
| --------------- | --------- | ------------------------------------------------------------------------------------ |
| `--probe-rate`  | off       | Run Stage 4 rate limit probe                                                         |
| `--concurrency` | 5         | Workers for rate probe                                                               |
| `--requests`    | 20        | Total requests for rate probe                                                        |
| `--impersonate` | chrome131 | TLS profile for Stage 2. Options: `chrome131`, `chrome120`, `firefox120`, `safari17` |
| `--timeout`     | 10        | Per-request timeout in seconds                                                       |
| `--json`        | off       | Machine-readable JSON output                                                         |
| `--save`        | off       | Save the full HTML responses to local files (`<domain>_stage1.html`, etc.)           |
| `--skip-tls`    | off       | Skip Stage 2                                                                         |
| `--skip-vendor` | off       | Skip Stage 3                                                                         |
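
The `--json` flag makes it easy to gate other tooling on the result. A hypothetical example follows: the jq one-liner above shows a top-level `recommendation` key; everything else about the JSON shape is an assumption here.

```python
import json
import subprocess
import sys

# Run the recon and parse the machine-readable report.
result = subprocess.run(
    ["scraperecon", "https://target.com", "--json"],
    capture_output=True,
    text=True,
    check=True,
)
report = json.loads(result.stdout)
recommendation = str(report.get("recommendation", ""))

# Fail a CI job (or a pre-flight script) if plain HTTP won't be enough.
if "Plain HTTP" not in recommendation:
    print(f"Target needs extra setup: {recommendation}", file=sys.stderr)
    sys.exit(1)
```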

---

## Reading the recommendation

At the end of every run you get a plain-English recommendation based on what was found.

- **Plain HTTP should be sufficient** — `httpx` or `requests` will work. No special setup needed.
- **Use curl_cffi with `<profile>`** — TLS fingerprinting is blocking you. Switch to `curl_cffi` with the listed profile.
- **May need browser automation** — both plain and TLS requests were blocked. You're likely looking at a full JS challenge (Turnstile, hCaptcha). Playwright with a stealth plugin is probably your next move.
- **Proxy rotation recommended** — the rate probe hit throttling. At any real request volume you'll need rotating proxies.
- **CAPTCHA detected** — the response body contained CAPTCHA indicators. Automated solving or a managed scraping service is required.

---

## Adding vendor signatures

Signatures live in `scraperecon/data/signatures.json`. It's a flat JSON file — no code required. If you know a signal that's missing, open a PR.

```json
{
  "name": "YourVendor",
  "signals": [
    { "type": "header_present", "key": "x-your-vendor", "weight": 0.8 },
    { "type": "cookie_name", "value": "your_cookie", "weight": 0.6 }
  ]
}
```

Signal types: `header_present`, `header_value`, `cookie_name`, `body_contains`, `status_code`.
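
For context, here is one way an entry like that could be scored against a response. The field names follow the schema above; the scoring loop and the assumption that `header_value` signals carry both a `key` and a `value` are illustrative, not scraperecon's shipped logic.

```python
import httpx

def score_vendor(signature: dict, resp: httpx.Response) -> float:
    """Sum the weights of every signal the response matches."""
    score = 0.0
    for signal in signature["signals"]:
        kind, weight = signal["type"], signal["weight"]
        if kind == "header_present" and signal["key"] in resp.headers:
            score += weight
        elif kind == "header_value" and resp.headers.get(signal["key"], "") == signal.get("value"):
            score += weight
        elif kind == "cookie_name" and signal["value"] in resp.cookies:
            score += weight
        elif kind == "body_contains" and signal["value"] in resp.text:
            score += weight
        elif kind == "status_code" and resp.status_code == signal["value"]:
            score += weight
    return score
```

A weighted sum like this maps naturally onto the High/Medium/Low confidence shown in the example output, though the tool's actual thresholds aren't documented here.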


## What it won't do

scraperecon is a recon tool, not a scraping library. It tells you what you need — it doesn't do it for you. No CAPTCHA solving, no Playwright integration, no proxy support, no persistent history.


---

Every scraper project starts with the same 20 minutes of manual work: try curl, get blocked, try curl_cffi, check the headers, fire some requests and see what happens. This automates that.

