Web scraping API — clean output from any URL
Pawgrab
Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.
Features
- Single URL scraping with multiple output formats
- Async site crawling (BFS, depth/page limits, Redis job queue)
- Structured extraction via OpenAI, CSS selectors, XPath, or regex
- Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
- Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
- Robots.txt compliance
- Per-domain and API-level rate limiting
- Proxy rotation with health checking
- Unified error responses with machine-readable error codes
- SSE heartbeats for reliable streaming through reverse proxies
- Idempotency keys for safe retries on crawl/batch endpoints
- Request ID correlation and response timing headers
- Docker Compose deployment (API + worker + Redis)
Install
pip install pawgrab
patchright install chromium
Quickstart
# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine
# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract
# Run
pawgrab serve
Or with Docker:
cp .env.example .env
docker compose up
API
All endpoints are under /v1, except GET /health, which is served at the root.
POST /v1/scrape
curl -X POST http://localhost:8000/v1/scrape \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com"}'
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to scrape |
| `formats` | array | `["markdown"]` | `markdown`, `html`, `text`, `json` |
| `wait_for_js` | bool/null | `null` | Force JS (true), skip (false), auto (null) |
| `timeout` | int | `30000` | Timeout in ms |
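The same request can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server from the Quickstart is listening on http://localhost:8000, and the helper names `build_scrape_request` and `scrape` are illustrative, not part of Pawgrab. The shape of the response body is not documented here, so the sketch simply returns the decoded JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from the Quickstart


def build_scrape_request(url, formats=None, wait_for_js=None, timeout_ms=30000):
    """Build (but do not send) a POST /v1/scrape request using the documented fields."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],  # documented default
        "wait_for_js": wait_for_js,          # None is sent as JSON null (auto)
        "timeout": timeout_ms,               # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, **options):
    """Send the request and return the decoded JSON body, whatever its shape."""
    request = build_scrape_request(url, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `scrape("https://example.com", formats=["markdown", "text"])` would send the request; `build_scrape_request` is kept separate so the payload can be inspected without a running server.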
POST /v1/crawl
Returns a job ID with HTTP 202. Poll for the result with GET /v1/crawl/{job_id}.
curl -X POST http://localhost:8000/v1/crawl \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "max_pages": 5}'
Supports the Idempotency-Key header for safe retries.
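A submit-then-poll client could look like the sketch below. It assumes a local server at http://localhost:8000 and uses a random UUID as the idempotency key, which is a common convention rather than a documented requirement. The job status payload is not specified in this README, so the polling helper takes a caller-supplied `is_finished` predicate instead of guessing field names. All function names are illustrative.

```python
import json
import time
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumption: local server from the Quickstart


def build_crawl_request(url, max_pages, idempotency_key=None):
    """Build a POST /v1/crawl request; reuse the same key when retrying a submission."""
    key = idempotency_key or str(uuid.uuid4())
    body = json.dumps({"url": url, "max_pages": max_pages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/crawl",
        data=body,
        headers={"Content-Type": "application/json", "Idempotency-Key": key},
        method="POST",
    )


def get_crawl_status(job_id):
    """Fetch GET /v1/crawl/{job_id} and return the decoded JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/crawl/{job_id}") as response:
        return json.load(response)


def poll_until(fetch_status, is_finished, interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until is_finished(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch_status()
        if is_finished(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait between polls, but not after the last one
    raise TimeoutError(f"job not finished after {max_attempts} polls")
```

For example, `poll_until(lambda: get_crawl_status(job_id), lambda r: r.get("status") == "completed")` would work if the status body carried a `status` field with that value; both the field name and the value are hypothetical here and should be checked against the actual response.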
POST /v1/extract
Requires PAWGRAB_OPENAI_API_KEY.
curl -X POST http://localhost:8000/v1/extract \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "prompt": "Extract the main heading"}'
POST /v1/search
Searches the web and scrapes each result in parallel.
curl -X POST http://localhost:8000/v1/search \
-H 'Content-Type: application/json' \
-d '{"query": "python web scraping"}'
GET /health
Returns ok, degraded, or unhealthy with per-component checks.
curl http://localhost:8000/health
Error Responses
All errors return a consistent JSON shape:
{
"success": false,
"error": "Human-readable message",
"code": "machine_readable_code",
"details": null,
"request_id": "a1b2c3d4e5f6"
}
Error codes: validation_error, invalid_api_key, rate_limited, robots_blocked, resource_not_found, timeout, fetch_failed, browser_unavailable, queue_unavailable, llm_unavailable, extraction_failed, search_failed, internal_error.
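Because the error shape is uniform, a client can map it onto a single exception type. The sketch below is one way to do that; the `PawgrabError` class, `parse_error` helper, and the set of codes treated as retryable are client-side assumptions, not documented server behaviour.

```python
import json

# Assumption: these codes describe transient conditions worth retrying.
# The README does not say which errors are retryable.
RETRYABLE_CODES = {
    "rate_limited",
    "timeout",
    "fetch_failed",
    "browser_unavailable",
    "queue_unavailable",
    "llm_unavailable",
}


class PawgrabError(Exception):
    """Client-side wrapper around the documented error JSON shape."""

    def __init__(self, message, code, details=None, request_id=None):
        super().__init__(f"{code}: {message}")
        self.message = message
        self.code = code
        self.details = details
        self.request_id = request_id

    @property
    def retryable(self):
        return self.code in RETRYABLE_CODES


def parse_error(body):
    """Return a PawgrabError for an error body, or None if the body is not an error."""
    data = json.loads(body)
    if not isinstance(data, dict) or data.get("success") is not False:
        return None
    return PawgrabError(
        message=data.get("error", ""),
        code=data.get("code", "internal_error"),
        details=data.get("details"),
        request_id=data.get("request_id"),
    )
```

Keeping `request_id` on the exception makes it easy to quote the identifier when reporting a failure.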
Response Headers
Every response includes:
| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier for correlation |
| `X-API-Version` | API version |
| `X-Response-Time` | Request duration (e.g. 42.3ms) |
| `X-RateLimit-Limit` | Requests allowed per minute |
| `X-RateLimit-Remaining` | Requests remaining in current window |
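A client can read these headers to log timings and back off before hitting the limit. The sketch below assumes that `X-Response-Time` always follows the `42.3ms` pattern shown above and that the rate-limit headers carry plain integers; neither format is specified beyond the example, and the helper names are illustrative.

```python
def parse_response_time_ms(value):
    """Convert an X-Response-Time value such as '42.3ms' to milliseconds as a float."""
    text = value.strip()
    if text.endswith("ms"):
        text = text[:-2]
    return float(text)


def rate_limit_status(headers):
    """Summarise the rate-limit headers from any mapping of header name to value."""
    # HTTP header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    limit = int(lowered["x-ratelimit-limit"])
    remaining = int(lowered["x-ratelimit-remaining"])
    return {"limit": limit, "remaining": remaining, "exhausted": remaining <= 0}
```

With `urllib.request`, `dict(response.headers.items())` yields a mapping that can be passed straight to `rate_limit_status`.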
CLI
pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload
Configuration
All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
Key settings:
| Variable | Default | Description |
|---|---|---|
| `PAWGRAB_API_KEY` | empty | API key for Bearer auth (empty = no auth) |
| `PAWGRAB_RATE_LIMIT_RPM` | `60` | Per-domain rate limit (requests/min) |
| `PAWGRAB_API_RATE_LIMIT_RPM` | `600` | API-level rate limit per client (requests/min) |
| `PAWGRAB_REDIS_URL` | `redis://localhost:6379/0` | Redis for job queue and idempotency |
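A client script can mirror these settings so that it talks to the server with matching values. The sketch below is not Pawgrab's own settings loader; `load_settings` and `auth_headers` are hypothetical helpers that read the variables and defaults listed above. It also assumes that Bearer auth means the standard `Authorization: Bearer <key>` header.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "PAWGRAB_API_KEY": "",
    "PAWGRAB_RATE_LIMIT_RPM": "60",
    "PAWGRAB_API_RATE_LIMIT_RPM": "600",
    "PAWGRAB_REDIS_URL": "redis://localhost:6379/0",
}


def load_settings(environ=None):
    """Read the key PAWGRAB_ variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "api_key": raw["PAWGRAB_API_KEY"],
        "rate_limit_rpm": int(raw["PAWGRAB_RATE_LIMIT_RPM"]),
        "api_rate_limit_rpm": int(raw["PAWGRAB_API_RATE_LIMIT_RPM"]),
        "redis_url": raw["PAWGRAB_REDIS_URL"],
    }


def auth_headers(api_key):
    """Return the Bearer header, or no header when the key is empty (auth disabled)."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

Passing a plain dict as `environ` makes the loader easy to exercise without touching the real environment.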
License