Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

  • Single URL scraping with multiple output formats
  • Async site crawling (BFS, depth/page limits, Redis job queue)
  • Structured extraction via OpenAI, CSS selectors, XPath, or regex
  • Automatic JS detection: tries curl_cffi first, falls back to Playwright for JS-heavy pages
  • Anti-bot evasion: TLS fingerprint impersonation and stealth browser profiles
  • Robots.txt compliance
  • Per-domain rate limiting
  • Proxy rotation with health checking
  • Docker Compose deployment (API + worker + Redis)
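
The crawling feature above mentions BFS with depth and page limits. As a rough illustration of that idea (not Pawgrab's actual crawler), the sketch below walks an in-memory link graph breadth-first. The function name `bfs_crawl`, the `get_links` callback, and the `max_depth` parameter are hypothetical; only `max_pages` appears in this README.

```python
from collections import deque


def bfs_crawl(start_url, get_links, max_depth=2, max_pages=10):
    """Visit pages breadth-first, stopping at max_depth hops or max_pages pages.

    get_links is a hypothetical callback returning the outgoing links of a URL.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": ["e"], "e": []}
print(bfs_crawl("a", lambda u: graph.get(u, []), max_depth=2, max_pages=10))
```

In a real crawler the queue would presumably live in Redis and `get_links` would fetch and parse each page; the limits work the same way.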

Quickstart

pip install -e ".[dev]"
playwright install chromium

# Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints are served under /v1, except the health check at /health.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

Field        Type       Default       Description
url          string     required      URL to scrape
formats      array      ["markdown"]  Output formats: markdown, html, text, json
wait_for_js  bool/null  null          Force JS rendering (true), skip it (false), or auto-detect (null)
timeout      int        30000         Timeout in ms
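
The same request can be made from Python with the standard library. This is a minimal sketch that assumes the server runs at http://localhost:8000 and that the response body is JSON; the response shape is not documented here, and the helper names `build_scrape_request` and `scrape` are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server started with `pawgrab serve`


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a POST /v1/scrape request from the documented fields."""
    payload = {
        "url": url,
        "formats": formats if formats is not None else ["markdown"],
        "wait_for_js": wait_for_js,  # None serializes to JSON null (auto-detect)
        "timeout": timeout,  # milliseconds
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, **options):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_scrape_request(url, **options)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `scrape("https://example.com", formats=["text"])` needs a running server; `build_scrape_request` can be inspected without one.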

POST /v1/crawl

Returns a job ID with HTTP 202. Poll the job status with GET /v1/crawl/{job_id}.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'
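
Because the crawl runs asynchronously, a client has to poll until the job finishes. The helper below is a generic polling sketch, not part of Pawgrab. The status payload is not documented in this README, so the `"status"` key and `"completed"` value in the usage lines are hypothetical, and the fetch function is injected rather than hard-coded.

```python
import time


def poll_until_done(fetch_status, is_done, interval_s=1.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch_status()
        if is_done(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job not finished after {max_attempts} polls")


# Usage with a fake status source; a real one would GET /v1/crawl/{job_id}.
# The "status" field and its values are assumptions, not the documented API.
fake_states = iter([{"status": "running"}, {"status": "completed"}])
final = poll_until_done(
    lambda: next(fake_states),
    lambda r: r["status"] == "completed",
    interval_s=0.0,
)
print(final)
```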

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

GET /health

curl http://localhost:8000/health

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
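
To see which prefixed variables are set in the current environment, something like the following can help. It only illustrates the prefix convention and is not how Pawgrab loads its settings; `prefixed_settings` is a hypothetical helper, and PAWGRAB_OPENAI_API_KEY is the only variable name this README confirms.

```python
import os

PREFIX = "PAWGRAB_"


def prefixed_settings(environ=None, prefix=PREFIX):
    """Return {lowercased name without prefix: value} for variables starting with prefix."""
    source = os.environ if environ is None else environ
    return {
        name[len(prefix):].lower(): value
        for name, value in source.items()
        if name.startswith(prefix)
    }


# Example with an explicit mapping instead of the real environment.
print(prefixed_settings({"PAWGRAB_OPENAI_API_KEY": "sk-test", "PATH": "/usr/bin"}))
```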

License

DBaJ-GPL v69.420
