Web scraping API — clean output from any URL

Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

  • Single URL scraping with multiple output formats
  • Async site crawling (BFS, depth/page limits, Redis job queue)
  • Structured extraction via OpenAI, CSS selectors, XPath, or regex
  • Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
  • Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
  • Robots.txt compliance
  • Per-domain rate limiting
  • Proxy rotation with health checking
  • Docker Compose deployment (API + worker + Redis)
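The crawl feature in the list above is described as breadth-first with depth and page limits. The following is a minimal sketch of what those two limits mean in a BFS traversal. It is not Pawgrab's implementation: `bfs_crawl`, `get_links`, and the in-memory `SITE` graph are hypothetical names used only for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 10,
) -> list[str]:
    """Visit pages breadth-first, stopping at max_depth or max_pages."""
    visited: list[str] = []
    seen = {start}
    queue = deque([(start, 0)])  # (url, depth from the start page)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": ["/e"],
}

order = bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=2, max_pages=4)
print(order)  # ['/', '/a', '/b', '/c']
```

Here the page limit stops the crawl after four pages, before `/d` is visited. With `max_depth=1` and a generous page limit, only `/`, `/a`, and `/b` would be visited.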

Install

pip install pawgrab
patchright install chromium

Quickstart

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints are under /v1, except the /health check.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

Field        Type       Default       Description
url          string     required      URL to scrape
formats      array      ["markdown"]  markdown, html, text, json
wait_for_js  bool/null  null          Force JS (true), skip (false), auto (null)
timeout      int        30000         Timeout in ms
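The same request can be built from Python using only the standard library. This is a sketch, not an official client: `build_scrape_request` is a hypothetical helper, `BASE_URL` assumes a local `pawgrab serve` on port 8000, and the response body format is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from `pawgrab serve`


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build (but do not send) a POST /v1/scrape request."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,  # True = force JS, False = skip, None = auto
        "timeout": timeout,  # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://example.com", formats=["markdown", "text"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect the JSON body before pointing it at a live server.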

POST /v1/crawl

Returns a job ID with HTTP 202. Poll for status and results with GET /v1/crawl/{job_id}.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'
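Because the crawl runs asynchronously, a client needs a polling loop. A minimal sketch follows. The shape of the job response is not documented here, so the "done" check is passed in by the caller; `poll_job`, `fetch_job`, and the `status` field with its values are hypothetical. The demonstration uses a fake fetcher so it runs without a server.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumption: local `pawgrab serve`


def fetch_job(job_id: str) -> dict:
    """GET /v1/crawl/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/crawl/{job_id}") as resp:
        return json.load(resp)


def poll_job(
    job_id: str,
    is_finished: Callable[[dict], bool],
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until is_finished(job) is true or attempts run out."""
    for attempt in range(max_attempts):
        job = fetch(job_id)
        if is_finished(job):
            return job
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before the next poll
    raise TimeoutError(f"crawl job {job_id} did not finish in time")


# Offline demonstration with a fake fetcher. The "status" field and its
# values are hypothetical, not taken from the Pawgrab documentation.
fake_responses = iter(
    [{"status": "running"}, {"status": "running"}, {"status": "completed"}]
)
result = poll_job(
    "demo-job",
    is_finished=lambda job: job.get("status") == "completed",
    fetch=lambda _job_id: next(fake_responses),
    interval=0.0,
)
print(result)  # {'status': 'completed'}
```

Against a real server, inspect one response from GET /v1/crawl/{job_id} first and write `is_finished` to match the fields it actually returns.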

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

GET /health

curl http://localhost:8000/health

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
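To check which PAWGRAB_ variables are set before starting the server, something like the following can help. It is only an inspection sketch: `pawgrab_env` is a hypothetical helper, and how Pawgrab itself parses its settings is not described here. `PAWGRAB_OPENAI_API_KEY` is the only variable name taken from this document.

```python
import os

PREFIX = "PAWGRAB_"


def pawgrab_env(environ=None):
    """Return PAWGRAB_-prefixed variables with the prefix stripped."""
    environ = os.environ if environ is None else environ
    return {
        name[len(PREFIX):].lower(): value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }


# Example with an explicit mapping instead of the real environment.
sample = {"PAWGRAB_OPENAI_API_KEY": "sk-placeholder", "PATH": "/usr/bin"}
print(pawgrab_env(sample))  # {'openai_api_key': 'sk-placeholder'}
```

Calling `pawgrab_env()` with no argument reads the current process environment.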

License

DBaJ-GPL v69.420

Download files

Source Distribution

pawgrab-0.0.3.tar.gz (234.6 kB)

Built Distribution

pawgrab-0.0.3-py3-none-any.whl (91.1 kB)

File details

Details for the file pawgrab-0.0.3.tar.gz.

  • Size: 234.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: workflow.yml on jaywyawhare/Pawgrab

File hashes

Algorithm    Hash digest
SHA256       3b6b36e5bcaff097d1db67817e0218aca8f2b435a431d1127d50662c70619868
MD5          699bd038fbf7cb402a7eb776a2a7d68a
BLAKE2b-256  8d0058c78cc534f82c5c9dc300e443f5712fd9cc390944578abc38169a636ec0

Details for the file pawgrab-0.0.3-py3-none-any.whl.

  • Size: 91.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: workflow.yml on jaywyawhare/Pawgrab

File hashes

Algorithm    Hash digest
SHA256       83de12954fa951a71976a4f8532c2dbc9edeb0a167c4e1550096f532277f18b1
MD5          52c1d18a5ae93ce6ebc39006170c8dc9
BLAKE2b-256  ef33f964875bfdb87c9f217d870a90bb7f0ca03952da841ce1cbf1338730a5aa