pawgrab

Web scraping API — clean output from any URL

Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

Single URL scraping with multiple output formats
Async site crawling (BFS, depth/page limits, Redis job queue)
Structured extraction via OpenAI, CSS selectors, XPath, or regex
Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
Robots.txt compliance
Per-domain and API-level rate limiting
Proxy rotation with health checking
Unified error responses with machine-readable error codes
SSE heartbeats for reliable streaming through reverse proxies
Idempotency keys for safe retries on crawl/batch endpoints
Request ID correlation and response timing headers
Docker Compose deployment (API + worker + Redis)
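
The crawl feature above is described as BFS with depth and page limits. As a rough illustration of that traversal strategy only (this is not Pawgrab's code; `bfs_crawl`, `get_links`, and the in-memory `SITE` map are hypothetical stand-ins for real fetching and link extraction), a minimal sketch could look like this:

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=2, max_pages=10):
    """Visit pages breadth-first until the depth or page limit is hit.

    `get_links` is a hypothetical callable: URL -> list of linked URLs.
    """
    seen = {start}               # every URL ever queued, so nothing is queued twice
    queue = deque([(start, 0)])  # (url, depth) pairs in FIFO order
    visited = []                 # URLs in the order they were visited
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue             # at the depth limit: visit, but do not expand
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site map standing in for real pages and their links.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": ["/e"],
    "/d": [],
    "/e": [],
}

order = bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=2, max_pages=5)
print(order)
```

With these limits, `/e` is never reached because it sits three links away from the start page.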

Install

pip install pawgrab
patchright install chromium

Quickstart

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints are under /v1, except GET /health.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

Field	Type	Default	Description
`url`	string	required	URL to scrape
`formats`	array	`["markdown"]`	`markdown`, `html`, `text`, `json`
`wait_for_js`	bool/null	`null`	Force JS (`true`), skip (`false`), auto (`null`)
`timeout`	int	`30000`	Timeout in ms
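
The same request can be made from Python. The sketch below uses only the standard library and the fields from the table above; `BASE_URL`, `build_scrape_request`, and `scrape` are names chosen for this example, and the response body is returned as decoded JSON without assuming its shape, which is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the local server from the Quickstart


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a POST /v1/scrape request using the documented fields and defaults."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,  # True = force JS, False = skip, None = auto
        "timeout": timeout,          # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, **options):
    """Send the request and return the decoded JSON body (needs a running server)."""
    request = build_scrape_request(url, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `scrape("https://example.com", formats=["text"])` would send the same kind of request as the curl command above.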

POST /v1/crawl

Returns a job ID with HTTP 202. Poll the job with GET /v1/crawl/{job_id}.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'

Supports the Idempotency-Key header for safe retries.
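
A client would typically generate one key per logical crawl and reuse it on every retry of that crawl. The sketch below only builds the requests; `build_crawl_request` and `status_url` are hypothetical helper names, the key format (a UUID) is an assumption, and how the server answers a repeated key is not described in this README.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumption: the local server from the Quickstart


def build_crawl_request(url, max_pages, idempotency_key):
    """Build a POST /v1/crawl request carrying an Idempotency-Key header."""
    body = json.dumps({"url": url, "max_pages": max_pages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/crawl",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        method="POST",
    )


def status_url(job_id):
    """URL to poll for a crawl job, per GET /v1/crawl/{job_id}."""
    return f"{BASE_URL}/v1/crawl/{job_id}"


# One key per logical crawl; a retry reuses the same key rather than minting a new one.
key = str(uuid.uuid4())
first = build_crawl_request("https://example.com", 5, key)
retry = build_crawl_request("https://example.com", 5, key)
```

Sending `first` and, after a network failure, `retry` with `urllib.request.urlopen` would present the same key both times.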

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

POST /v1/search

Searches the web and scrapes each result in parallel.

curl -X POST http://localhost:8000/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "python web scraping"}'

GET /health

Returns ok, degraded, or unhealthy with per-component checks.

curl http://localhost:8000/health

Error Responses

All errors return a consistent JSON shape:

{
  "success": false,
  "error": "Human-readable message",
  "code": "machine_readable_code",
  "details": null,
  "request_id": "a1b2c3d4e5f6"
}

Error codes: validation_error, invalid_api_key, rate_limited, robots_blocked, resource_not_found, timeout, fetch_failed, browser_unavailable, queue_unavailable, llm_unavailable, extraction_failed, search_failed, internal_error.
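
Because the shape is consistent, a client can turn any error body into one exception type. The following is a minimal sketch: `ApiError` and `error_from_body` are hypothetical names, the sample body is illustrative, and the set of codes treated as retryable is this example's guess rather than documented behavior.

```python
import json

# Assumption: these codes look transient; the README does not say which are safe to retry.
RETRYABLE_CODES = {
    "rate_limited",
    "timeout",
    "browser_unavailable",
    "queue_unavailable",
    "llm_unavailable",
}


class ApiError(Exception):
    """Client-side wrapper around the documented error shape."""

    def __init__(self, message, code, details=None, request_id=None):
        super().__init__(f"{code}: {message}")
        self.message = message
        self.code = code
        self.details = details
        self.request_id = request_id

    @property
    def retryable(self):
        return self.code in RETRYABLE_CODES


def error_from_body(raw):
    """Return an ApiError if the body has "success": false, otherwise None."""
    data = json.loads(raw)
    if data.get("success") is not False:
        return None
    return ApiError(
        data.get("error", ""),
        data.get("code", "internal_error"),
        data.get("details"),
        data.get("request_id"),
    )


# Illustrative body following the documented shape.
sample = """{
  "success": false,
  "error": "Too many requests",
  "code": "rate_limited",
  "details": null,
  "request_id": "a1b2c3d4e5f6"
}"""
err = error_from_body(sample)
print(err, err.retryable, err.request_id)
```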

Response Headers

Every response includes:

Header	Description
`X-Request-ID`	Unique request identifier for correlation
`X-API-Version`	API version
`X-Response-Time`	Request duration (e.g. `42.3ms`)
`X-RateLimit-Limit`	Requests allowed per minute
`X-RateLimit-Remaining`	Requests remaining in current window
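
These headers are plain strings, so a client that wants to log timing or back off near the rate limit has to parse them. The sketch below assumes `X-Response-Time` always ends in `ms` as in the example above; `parse_response_meta` is a hypothetical helper and the sample values are made up for illustration.

```python
def parse_response_meta(headers):
    """Convert the documented response headers into typed values."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name):
        value = lowered.get(name)
        return int(value) if value is not None and value.isdigit() else None

    raw_time = lowered.get("x-response-time", "")
    try:
        # "42.3ms" -> 42.3; anything without the "ms" suffix is treated as missing
        elapsed_ms = float(raw_time[:-2]) if raw_time.endswith("ms") else None
    except ValueError:
        elapsed_ms = None

    return {
        "request_id": lowered.get("x-request-id"),
        "api_version": lowered.get("x-api-version"),
        "elapsed_ms": elapsed_ms,
        "limit": as_int("x-ratelimit-limit"),
        "remaining": as_int("x-ratelimit-remaining"),
    }


# Illustrative header values only.
meta = parse_response_meta({
    "X-Request-ID": "a1b2c3d4e5f6",
    "X-Response-Time": "42.3ms",
    "X-RateLimit-Limit": "600",
    "X-RateLimit-Remaining": "599",
})
print(meta)
```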

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.

Key settings:

Variable	Default	Description
`PAWGRAB_API_KEY`	empty	API key for Bearer auth (empty = no auth)
`PAWGRAB_RATE_LIMIT_RPM`	`60`	Per-domain rate limit (requests/min)
`PAWGRAB_API_RATE_LIMIT_RPM`	`600`	API-level rate limit per client (requests/min)
`PAWGRAB_REDIS_URL`	`redis://localhost:6379/0`	Redis for job queue and idempotency
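
When `PAWGRAB_API_KEY` is set on the server, clients presumably need to send the key as a Bearer token. The sketch below assumes the standard `Authorization: Bearer <key>` header form; `build_headers` is a hypothetical helper, and reading the key from the client's own environment is just one convenient choice.

```python
import os


def build_headers(api_key=""):
    """Return request headers, adding Bearer auth only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: standard Bearer scheme in the Authorization header.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Empty or unset key mirrors the documented "empty = no auth" default.
client_headers = build_headers(os.environ.get("PAWGRAB_API_KEY", ""))
```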

License

DBaJ-GPL v69.420

Project details

Release history

Version	Released
`0.1.0` (this version)	Mar 12, 2026
`0.0.4`	Mar 10, 2026
`0.0.3`	Mar 10, 2026
`0.0.2`	Mar 10, 2026
`0.0.1`	Mar 10, 2026

Download files

Source Distribution

pawgrab-0.1.0.tar.gz (246.9 kB)

Uploaded Mar 12, 2026 Source

Built Distribution

pawgrab-0.1.0-py3-none-any.whl (103.6 kB)

Uploaded Mar 12, 2026 Python 3

File details

Details for the file pawgrab-0.1.0.tar.gz.

File metadata

Download URL: pawgrab-0.1.0.tar.gz
Upload date: Mar 12, 2026
Size: 246.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3c4a1871c100212f389723ac3911e84cd87dc6dd34bfca229eb87fa672b21592`
MD5	`f033cf8507b690d437e09636b41c67c0`
BLAKE2b-256	`8e1baf053daa738bc524cf86e3f61433d0cc8848337562ec102562e6197a4bdf`

Provenance

The following attestation bundles were made for pawgrab-0.1.0.tar.gz:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pawgrab-0.1.0.tar.gz
- Subject digest: 3c4a1871c100212f389723ac3911e84cd87dc6dd34bfca229eb87fa672b21592
- Sigstore transparency entry: 1092085001
- Sigstore integration time: Mar 12, 2026
Source repository:
- Permalink: jaywyawhare/Pawgrab@d802d524a16db3e64a5361e4d66b480846e042c7
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/jaywyawhare
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@d802d524a16db3e64a5361e4d66b480846e042c7
- Trigger Event: release

File details

Details for the file pawgrab-0.1.0-py3-none-any.whl.

File metadata

Download URL: pawgrab-0.1.0-py3-none-any.whl
Upload date: Mar 12, 2026
Size: 103.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`de5c7e3e4198be4a1c5a31883b9dc35b6ac119f7a4e06ecc932fa71e0deef5e6`
MD5	`91c47add6db211c1daa3f51a31bc72d2`
BLAKE2b-256	`b9fad94c269ebcf009bc0e80d593ecbd824aad1ac62b4b042dd8082345d2b9aa`

Provenance

The following attestation bundles were made for pawgrab-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pawgrab-0.1.0-py3-none-any.whl
- Subject digest: de5c7e3e4198be4a1c5a31883b9dc35b6ac119f7a4e06ecc932fa71e0deef5e6
- Sigstore transparency entry: 1092085014
- Sigstore integration time: Mar 12, 2026
Source repository:
- Permalink: jaywyawhare/Pawgrab@d802d524a16db3e64a5361e4d66b480846e042c7
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/jaywyawhare
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@d802d524a16db3e64a5361e4d66b480846e042c7
- Trigger Event: release
