Web scraping API — clean output from any URL
Pawgrab
Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.
Features
- Single URL scraping with multiple output formats
- Async site crawling (BFS, depth/page limits, Redis job queue)
- Structured extraction via OpenAI, CSS selectors, XPath, or regex
- Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
- Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
- Robots.txt compliance
- Per-domain rate limiting
- Proxy rotation with health checking
- Docker Compose deployment (API + worker + Redis)
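To make the crawl limits in the feature list concrete, here is a minimal sketch of a breadth-first crawl bounded by depth and page count. It is not Pawgrab's implementation: `bfs_crawl`, `get_links`, and the in-memory `SITE` graph are hypothetical names used only for illustration, and the real crawler additionally runs through a Redis job queue.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 10,
) -> list[str]:
    """Visit pages breadth-first, stopping at max_depth or max_pages."""
    visited: list[str] = []
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links past the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/"],
}

print(bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=1, max_pages=5))
# → ['/', '/a', '/b']
```

Passing the link extractor as a callable keeps the traversal logic separate from fetching, so the limits can be checked without any network access.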
Install
pip install pawgrab
patchright install chromium
Quickstart
# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine
# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract
# Run
pawgrab serve
Or with Docker:
cp .env.example .env
docker compose up
API
All endpoints are under /v1, except /health.
POST /v1/scrape
curl -X POST http://localhost:8000/v1/scrape \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com"}'
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to scrape |
| formats | array | ["markdown"] | markdown, html, text, json |
| wait_for_js | bool/null | null | Force JS (true), skip (false), auto (null) |
| timeout | int | 30000 | Timeout in ms |
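The same request can be built from Python with only the standard library. This is a sketch, assuming the server from the Quickstart is listening on localhost:8000; `build_scrape_request` is a hypothetical helper, not part of the pawgrab package, and the response shape is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from the Quickstart


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a POST /v1/scrape request using the documented fields and defaults."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,  # True forces JS, False skips it, None is auto
        "timeout": timeout,          # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://example.com", formats=["markdown", "text"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```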
POST /v1/crawl
Returns a job ID with HTTP 202. Poll for results with GET /v1/crawl/{job_id}.
curl -X POST http://localhost:8000/v1/crawl \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "max_pages": 5}'
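Because the crawl runs asynchronously, a client has to poll until the job finishes. The sketch below shows one way to structure that loop. The field name `status` and the value `completed` are assumptions made for illustration; the actual response schema is not described here. In real use, `fetch_status` would issue GET /v1/crawl/{job_id} and return the decoded JSON.

```python
import time
from typing import Any, Callable


def poll_job(
    fetch_status: Callable[[], Any],
    is_finished: Callable[[Any], bool],
    interval_s: float = 1.0,
    max_attempts: int = 30,
) -> Any:
    """Call fetch_status until is_finished accepts the result or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if is_finished(status):
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job not finished after {max_attempts} attempts")


# Hypothetical responses standing in for GET /v1/crawl/{job_id}.
fake_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed"},
])

final = poll_job(
    fetch_status=lambda: next(fake_responses),
    is_finished=lambda s: s["status"] == "completed",  # assumed field and value
    interval_s=0,
)
print(final)
```

Capping the number of attempts keeps a stuck or lost job from blocking the caller indefinitely.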
POST /v1/extract
Requires PAWGRAB_OPENAI_API_KEY.
curl -X POST http://localhost:8000/v1/extract \
-H 'Content-Type: application/json' \
-d '{"url": "https://example.com", "prompt": "Extract the main heading"}'
GET /health
curl http://localhost:8000/health
CLI
pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload
Configuration
All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
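As an illustration of the prefix convention only, the sketch below collects PAWGRAB_-prefixed variables from an environment mapping. `load_prefixed_settings` is a hypothetical helper; how Pawgrab itself loads and validates its settings is not shown in this document.

```python
import os

PREFIX = "PAWGRAB_"


def load_prefixed_settings(environ=None, prefix=PREFIX):
    """Return prefixed variables with the prefix stripped and keys lowercased."""
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


# Example mapping; the key value is a placeholder, not a real credential.
example_env = {"PAWGRAB_OPENAI_API_KEY": "sk-placeholder", "PATH": "/usr/bin"}
print(load_prefixed_settings(example_env))
# → {'openai_api_key': 'sk-placeholder'}
```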
License