Web scraping API — clean output from any URL

Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.

Features

  • Single URL scraping with multiple output formats
  • Async site crawling (BFS, depth/page limits, Redis job queue)
  • Structured extraction via OpenAI, CSS selectors, XPath, or regex
  • Auto JS detection - curl_cffi first, Playwright fallback for JS-heavy pages
  • Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
  • Robots.txt compliance
  • Per-domain rate limiting
  • Proxy rotation with health checking
  • Docker Compose deployment (API + worker + Redis)
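
The breadth-first crawl with depth and page limits listed above can be illustrated with a small sketch. This is not Pawgrab's implementation: the function name `bfs_crawl`, its parameters, and the in-memory link graph `SITE` are all hypothetical, and a real crawler would fetch pages and run asynchronously against a job queue.

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=2, max_pages=10):
    """Return pages in breadth-first discovery order, honoring both limits.

    get_links is any callable mapping a URL to the URLs it links to
    (hypothetical here; a real crawler would fetch and parse the page).
    """
    discovered = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            if link in seen:
                continue  # never queue the same page twice
            if len(discovered) >= max_pages:
                return discovered  # page limit reached
            seen.add(link)
            discovered.append(link)
            queue.append((link, depth + 1))
    return discovered


# Toy link graph standing in for a real site.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": [],
}

order = bfs_crawl("/", lambda u: SITE.get(u, []), max_depth=2, max_pages=4)
print(order)  # ['/', '/a', '/b', '/c']
```

The queue guarantees that every page at depth 1 is discovered before any page at depth 2, which is why `/d` is the page dropped when the limit of four is reached.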

Install

pip install pawgrab
playwright install chromium

Quickstart

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve

Or with Docker:

cp .env.example .env
docker compose up

API

All endpoints are served under /v1, except the health check at /health.

POST /v1/scrape

curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | URL to scrape |
| formats | array | ["markdown"] | markdown, html, text, json |
| wait_for_js | bool/null | null | Force JS (true), skip (false), auto (null) |
| timeout | int | 30000 | Timeout in ms |
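
The same request can be sent from Python using only the standard library. The sketch below is a minimal client under stated assumptions: the helper names `build_scrape_request` and `scrape` are hypothetical, `BASE_URL` assumes the local server from the quickstart, and the response schema is not documented here, so the decoded JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from the quickstart


def build_scrape_request(url, formats=None, wait_for_js=None, timeout=30000):
    """Build a POST /v1/scrape request from the documented fields."""
    payload = {
        "url": url,
        "formats": formats or ["markdown"],
        "wait_for_js": wait_for_js,  # None is sent as JSON null (auto-detect)
        "timeout": timeout,          # milliseconds, as in the field table
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, **options):
    """Send the request and return the decoded JSON body unchanged."""
    request = build_scrape_request(url, **options)
    # Client-side socket timeout in seconds, slightly longer than the
    # server-side timeout so the server can answer first (a design choice
    # of this sketch, not something Pawgrab requires).
    client_timeout = options.get("timeout", 30000) / 1000 + 5
    with urllib.request.urlopen(request, timeout=client_timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `scrape("https://example.com", formats=["markdown", "text"])` would return whatever JSON the endpoint produces. Building the request separately from sending it keeps the payload easy to inspect without network access.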

POST /v1/crawl

Returns a job ID with HTTP 202. Poll GET /v1/crawl/{job_id} for the job's status.

curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'
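
The submit-then-poll flow might look like the sketch below. The response schema is not documented in this README, so the field names `job_id` and `status` and the terminal states `completed` and `failed` are assumptions, exposed as parameters so they can be adjusted. The helpers `http_json` and `crawl_and_wait` are hypothetical names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server from the quickstart


def http_json(method, path, payload=None):
    """Send a JSON request to the API and return the decoded JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_and_wait(url, max_pages=5, call=http_json, interval=2.0, max_polls=30,
                   id_field="job_id", status_field="status",
                   done_states=("completed", "failed")):
    """Submit a crawl job, then poll until it reaches a terminal state.

    id_field, status_field and done_states are guesses at the response
    schema; check the actual responses and adjust them.
    """
    job = call("POST", "/v1/crawl", {"url": url, "max_pages": max_pages})
    job_id = job[id_field]
    for _ in range(max_polls):
        state = call("GET", f"/v1/crawl/{job_id}")
        if state.get(status_field) in done_states:
            return state
        time.sleep(interval)  # wait before asking again
    raise TimeoutError(f"crawl job {job_id} did not finish after {max_polls} polls")
```

Passing the transport in as `call` keeps the polling logic testable without a running server, and the `max_polls` cap prevents an endless loop if a job never finishes.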

POST /v1/extract

Requires PAWGRAB_OPENAI_API_KEY.

curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'

GET /health

curl http://localhost:8000/health

CLI

pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload

Configuration

All settings are read from environment variables with the PAWGRAB_ prefix. See .env.example for the full list.
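
To check which settings are present in the current environment, a small helper can filter on the prefix. This is only an inspection aid and not how Pawgrab itself loads configuration; the function name `pawgrab_setting_names` is hypothetical, and the only variable name taken from this README is PAWGRAB_OPENAI_API_KEY.

```python
import os

PREFIX = "PAWGRAB_"


def pawgrab_setting_names(environ=None):
    """List the names of PAWGRAB_-prefixed variables, with the prefix removed.

    Only names are returned, never values, so secrets such as API keys
    are not printed by accident.
    """
    environ = os.environ if environ is None else environ
    return sorted(name[len(PREFIX):] for name in environ if name.startswith(PREFIX))


print(pawgrab_setting_names({"PAWGRAB_OPENAI_API_KEY": "sk-example", "PATH": "/usr/bin"}))
# ['OPENAI_API_KEY']
```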

License

DBaJ-GPL v69.420

Download files

Source Distribution

pawgrab-0.0.2.tar.gz (161.0 kB)

Uploaded Source

Built Distribution

pawgrab-0.0.2-py3-none-any.whl (87.6 kB)

Uploaded Python 3

File details

Details for the file pawgrab-0.0.2.tar.gz.

File metadata

  • Download URL: pawgrab-0.0.2.tar.gz
  • Upload date:
  • Size: 161.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.0.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 5704f1c301555aad23064cb1e88d2c72ed05a1795c2be9b8a7890a51d600ee90 |
| MD5 | fc721925d6cd13e8e1f22b6a97a1f57b |
| BLAKE2b-256 | 3c4043398ac0dd5ecabc942de29e70a3c273e8a4a814cd18f88892b678957221 |

Provenance

The following attestation bundles were made for pawgrab-0.0.2.tar.gz:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pawgrab-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: pawgrab-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 87.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pawgrab-0.0.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 596a3e98dbf88068a2031e274d2f43eedb5825dcd3e20e2f8e0bf9cb8ab75bce |
| MD5 | 507996f1c61dc404d83966a2582c31e3 |
| BLAKE2b-256 | bd79d51692af111eef7472d11295561c4b9e1277133b615e3ecaee13021dc0cd |

Provenance

The following attestation bundles were made for pawgrab-0.0.2-py3-none-any.whl:

Publisher: workflow.yml on jaywyawhare/Pawgrab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
