Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

  • Single URL scraping with multiple output formats
  • Async site crawling (BFS, depth/page limits, Redis job queue)
  • Structured extraction via OpenAI, CSS selectors, XPath, or regex
  • Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
  • Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
  • Robots.txt compliance
  • Per-domain and API-level rate limiting
  • Proxy rotation with health checking
  • Unified error responses with machine-readable error codes
  • SSE heartbeats for reliable streaming through reverse proxies
  • Idempotency keys for safe retries on crawl/batch endpoints
  • Request ID correlation and response timing headers
  • Docker Compose deployment (API + worker + Redis)

Install

pip install pawgrab
patchright install chromium

Quickstart

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints live under /v1.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to scrape |
| formats | array | ["markdown"] | markdown, html, text, json |
| wait_for_js | bool/null | null | Force JS (true), skip (false), auto (null) |
| timeout | int | 30000 | Timeout in ms |
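As a sketch, the documented fields can be assembled into a request body in Python before POSTing it to the endpoint. The helper name build_scrape_body is hypothetical; only url is required, and the other fields fall back to the documented defaults:

```python
import json

def build_scrape_body(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a /v1/scrape request body from the documented fields.

    Only ``url`` is required; ``formats`` defaults to ["markdown"],
    ``wait_for_js`` stays null so the server auto-detects JS pages,
    and ``timeout`` is in milliseconds.
    """
    body = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,
        "timeout": timeout,
    }
    return json.dumps(body)

# POST this body to http://localhost:8000/v1/scrape with
# Content-Type: application/json (e.g. via urllib.request or httpx).
print(build_scrape_body("https://example.com"))
```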

POST /v1/crawl

Returns a job ID (HTTP 202). Poll with GET /v1/crawl/{job_id}.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'

Supports Idempotency-Key header for safe retries.
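A minimal submit-then-poll loop might look like the sketch below. The HTTP calls are injected as plain callables so the logic runs offline; the response field names job_id and status, and the terminal values completed/failed, are assumptions rather than documented response fields:

```python
import uuid

def crawl_and_wait(post, get, url, max_pages=5, max_polls=10):
    """Submit a crawl with an Idempotency-Key, then poll until done.

    ``post``/``get`` are injected callables standing in for HTTP calls.
    Real code would also sleep between polls.
    """
    key = str(uuid.uuid4())  # resending with the same key retries safely
    job = post("/v1/crawl",
               {"url": url, "max_pages": max_pages},
               headers={"Idempotency-Key": key})
    for _ in range(max_polls):
        state = get(f"/v1/crawl/{job['job_id']}")
        if state["status"] in ("completed", "failed"):
            return state
    raise TimeoutError("crawl did not finish in time")
```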

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

POST /v1/search

Searches the web and scrapes each result in parallel.

curl -X POST http://localhost:8000/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "python web scraping"}'

GET /health

Returns ok, degraded, or unhealthy with per-component checks.

curl http://localhost:8000/health
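For monitoring, the three statuses map naturally onto Nagios-style exit codes. The JSON field name status is an assumption; the docs only specify the three values:

```python
def health_exit_code(payload):
    """Map a /health payload to a monitoring exit code.

    ok -> 0, degraded -> 1 (warning), unhealthy -> 2 (critical),
    anything unrecognized -> 3 (unknown).
    """
    return {"ok": 0, "degraded": 1, "unhealthy": 2}.get(payload.get("status"), 3)
```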

Error Responses

All errors return a consistent JSON shape:

{
  "success": false,
  "error": "Human-readable message",
  "code": "machine_readable_code",
  "details": null,
  "request_id": "a1b2c3d4e5f6"
}

Error codes: validation_error, invalid_api_key, rate_limited, robots_blocked, resource_not_found, timeout, fetch_failed, browser_unavailable, queue_unavailable, llm_unavailable, extraction_failed, search_failed, internal_error.
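Since every error carries a machine-readable code, clients can branch on the code instead of parsing the message. A sketch of one plausible retry policy; which codes count as transient is this example's judgment, not something the API prescribes:

```python
# Codes that plausibly resolve on retry (illustrative choice).
TRANSIENT = {"rate_limited", "timeout", "fetch_failed",
             "browser_unavailable", "queue_unavailable", "llm_unavailable"}

def should_retry(envelope):
    """Decide whether a Pawgrab error envelope is worth retrying."""
    if envelope.get("success", True):
        return False  # not an error envelope at all
    return envelope.get("code") in TRANSIENT
```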

Response Headers

Every response includes:

| Header | Description |
|---|---|
| X-Request-ID | Unique request identifier for correlation |
| X-API-Version | API version |
| X-Response-Time | Request duration (e.g. 42.3ms) |
| X-RateLimit-Limit | Requests allowed per minute |
| X-RateLimit-Remaining | Requests remaining in current window |
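Clients can read these headers to pace themselves. A small parsing sketch, assuming only the header formats shown above:

```python
def remaining_budget(headers):
    """Return (limit, remaining) from Pawgrab's rate-limit headers."""
    return (int(headers["X-RateLimit-Limit"]),
            int(headers["X-RateLimit-Remaining"]))

def parse_response_time_ms(headers):
    """Parse X-Response-Time values like '42.3ms' into a float (ms)."""
    return float(headers["X-Response-Time"].removesuffix("ms"))
```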

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are configured via environment variables with the PAWGRAB_ prefix. See .env.example for the full list.

Key settings:

| Variable | Default | Description |
|---|---|---|
| PAWGRAB_API_KEY | empty | API key for Bearer auth (empty = no auth) |
| PAWGRAB_RATE_LIMIT_RPM | 60 | Per-domain rate limit (requests/min) |
| PAWGRAB_API_RATE_LIMIT_RPM | 600 | API-level rate limit per client (requests/min) |
| PAWGRAB_REDIS_URL | redis://localhost:6379/0 | Redis for job queue and idempotency |
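Settings resolution follows the usual prefixed-env-var pattern. A sketch of how a deployment script might read them, with the defaults mirroring the table above (the helper pawgrab_setting is illustrative, not part of the package):

```python
import os

def pawgrab_setting(name, default=""):
    """Read a PAWGRAB_-prefixed setting from the environment."""
    return os.environ.get(f"PAWGRAB_{name}", default)

redis_url = pawgrab_setting("REDIS_URL", "redis://localhost:6379/0")
per_domain_rpm = int(pawgrab_setting("RATE_LIMIT_RPM", "60"))
auth_enabled = bool(pawgrab_setting("API_KEY"))  # empty = no auth
```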

License

DBaJ-GPL v69.420


Download files

Download the file for your platform.

Source Distribution

pawgrab-0.1.0.tar.gz (246.9 kB)


Built Distribution


pawgrab-0.1.0-py3-none-any.whl (103.6 kB)


File details

Details for the file pawgrab-0.1.0.tar.gz.

File metadata

  • Download URL: pawgrab-0.1.0.tar.gz
  • Upload date:
  • Size: 246.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.1.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c4a1871c100212f389723ac3911e84cd87dc6dd34bfca229eb87fa672b21592 |
| MD5 | f033cf8507b690d437e09636b41c67c0 |
| BLAKE2b-256 | 8e1baf053daa738bc524cf86e3f61433d0cc8848337562ec102562e6197a4bdf |


Provenance

The following attestation bundles were made for pawgrab-0.1.0.tar.gz:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pawgrab-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pawgrab-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 103.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | de5c7e3e4198be4a1c5a31883b9dc35b6ac119f7a4e06ecc932fa71e0deef5e6 |
| MD5 | 91c47add6db211c1daa3f51a31bc72d2 |
| BLAKE2b-256 | b9fad94c269ebcf009bc0e80d593ecbd824aad1ac62b4b042dd8082345d2b9aa |


Provenance

The following attestation bundles were made for pawgrab-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
