Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

  • Single URL scraping with multiple output formats
  • Async site crawling (BFS, depth/page limits, Redis job queue)
  • Structured extraction via OpenAI, CSS selectors, XPath, or regex
  • Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
  • Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
  • Robots.txt compliance
  • Per-domain rate limiting
  • Proxy rotation with health checking
  • Docker Compose deployment (API + worker + Redis)
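
The crawling feature above mentions BFS with depth and page limits. As an illustration of that idea only, and not Pawgrab's actual crawler, here is a minimal sketch. The names bfs_crawl, get_links, and SITE are hypothetical, and a toy in-memory link graph stands in for real HTTP fetches.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 10,
) -> list[str]:
    """Visit pages breadth-first, stopping at max_depth or max_pages."""
    visited: list[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph (hypothetical) standing in for real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": ["/e"],
}

order = bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=2, max_pages=4)
print(order)  # ['/', '/a', '/b', '/c']
```

The page limit stops the walk before "/d" is visited, and the depth limit keeps "/e" from ever being queued.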

Install

pip install pawgrab
patchright install chromium

Quickstart

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints are under /v1, except /health.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

Field        Type       Default       Description
url          string     required      URL to scrape
formats      array      ["markdown"]  markdown, html, text, json
wait_for_js  bool/null  null          Force JS (true), skip (false), auto (null)
timeout      int        30000         Timeout in ms
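
The same request can be made from Python. The sketch below uses only the standard library and the fields documented above. The helper names build_scrape_request and scrape are hypothetical, and the response is returned as decoded JSON without assuming its shape, since the shape is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the curl examples


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a POST /v1/scrape request from the documented fields."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,  # True forces JS, False skips it, None is auto
        "timeout": timeout,  # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, **options):
    """Send the request and return the decoded JSON body."""
    request = build_scrape_request(url, **options)
    # Give the HTTP call a little more time than the server-side timeout.
    client_timeout = options.get("timeout", 30000) / 1000 + 5
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server.
request = build_scrape_request("https://example.com", formats=["markdown", "text"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, scrape("https://example.com", formats=["text"]) would send the request.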

POST /v1/crawl

Returns a job ID with HTTP 202. Poll for status with GET /v1/crawl/{job_id}.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'
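
A submit-then-poll loop for this endpoint might look like the sketch below. The response fields are not documented here, so two things are assumptions: the ID is read from a "job_id" key, and completion is decided by a caller-supplied is_finished predicate (the demo checks a hypothetical "status" value). The helpers http_json, crawl_and_wait, and fake_send are also hypothetical. The demo runs against a fake transport, so no server is contacted.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def http_json(method, path, payload=None):
    """Send a JSON request to the API and return the decoded JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_and_wait(url, max_pages, is_finished, send=http_json,
                   interval=2.0, max_polls=30, sleep=time.sleep):
    """Submit a crawl, then poll GET /v1/crawl/{job_id} until is_finished(job)."""
    submitted = send("POST", "/v1/crawl", {"url": url, "max_pages": max_pages})
    job_id = submitted["job_id"]  # ASSUMPTION: name of the ID field in the 202 body
    for _ in range(max_polls):
        job = send("GET", f"/v1/crawl/{job_id}")
        if is_finished(job):
            return job
        sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")


# Offline demonstration with a fake transport.
def fake_send(method, path, payload=None):
    if method == "POST":
        return {"job_id": "abc123"}
    fake_send.polls += 1
    return {"status": "completed" if fake_send.polls >= 3 else "running"}


fake_send.polls = 0
result = crawl_and_wait(
    "https://example.com", 5,
    is_finished=lambda job: job.get("status") == "completed",  # ASSUMPTION
    send=fake_send, sleep=lambda seconds: None,
)
print(result)
```

Bounding the loop with max_polls keeps a stuck job from blocking the caller forever.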

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

GET /health

curl http://localhost:8000/health

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
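
To illustrate the prefix convention, the sketch below collects PAWGRAB_-prefixed variables from a mapping and checks for the one key this README names, PAWGRAB_OPENAI_API_KEY. It is not how Pawgrab itself loads settings. The helpers pawgrab_settings and can_use_extract are hypothetical, and the example values are made up.

```python
import os


def pawgrab_settings(environ=None, prefix="PAWGRAB_"):
    """Return prefixed environment variables keyed by their lowercased suffix."""
    environ = os.environ if environ is None else environ
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


def can_use_extract(settings):
    """/v1/extract needs PAWGRAB_OPENAI_API_KEY, so check for a non-empty value."""
    return bool(settings.get("openai_api_key"))


# Made-up environment for demonstration; pass nothing to read os.environ.
example_env = {"PAWGRAB_OPENAI_API_KEY": "sk-example", "HOME": "/home/user"}
settings = pawgrab_settings(example_env)
print(sorted(settings))           # ['openai_api_key']
print(can_use_extract(settings))  # True
```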

License

DBaJ-GPL v69.420
