Domain allowlist, per-domain rate caps, response audit, and a Streamlit dashboard for Bright Data scraping agents. The picks-and-shovels layer for any agent that consumes the live web.

Project description

birddog

Audited Bright Data egress for AI agents. Drop one context manager around an agent that scrapes the web and you get:

  1. Domain allowlist — deny everything outside it, log the attempt
  2. Per-domain rate caps — simple token bucket per host
  3. Response audit log — one JSONL line per fetch (url, status, bytes, ms)
  4. Bright Data Web Unlocker proxy — opt-in: route via Bright Data
  5. Streamlit dashboard — point it at the JSONL, get per-host bytes, denial counts, latency p50

Built for the kind of agent that hits live sites: research bots, price trackers, RAG ingest jobs. If you've ever watched an agent rip through a sponsor's free tier in 30 seconds, this is for you.

Install

pip install birddog                    # core
pip install "birddog[dashboard]"       # + Streamlit dashboard

Python 3.10+.

Why

LLM agents don't know what a sane scraping cadence looks like. They'll hammer a site, ignore robots.txt, follow links into spammy subdomains, and burn through a Bright Data quota in a single run.

birddog puts a leash on the egress side:

Concern                 What birddog does
Wandering off-domain    Allowlist with example.com + *.example.com
Burst scraping          Token bucket per host (qps + burst)
"What did it fetch?"    JSONL audit log, one event per fetch
Anti-bot blocks         Optional Bright Data Web Unlocker proxy
Post-run review         Bundled Streamlit dashboard

It does not parse HTML, manage cookies, render JS, or rotate user agents. That's what Bright Data + your scraping code are for.
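
To make the allowlist row concrete, here is a minimal sketch of exact-host and wildcard matching. It is not birddog's implementation: the function name host_allowed is hypothetical, and the rule that "*.example.com" covers subdomains but not the bare domain is an assumption, suggested only by the table listing both forms.

```python
from urllib.parse import urlsplit


def host_allowed(url: str, allowed: set[str]) -> bool:
    """Return True if the URL's host matches an exact or '*.suffix' entry.

    Hypothetical helper for illustration; birddog's own matching may differ.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    for pattern in allowed:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # Assumption: '*.example.com' matches any subdomain,
            # but not 'example.com' itself.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


ALLOWED = {"example.com", "*.example.com", "docs.brightdata.com"}
print(host_allowed("https://shop.example.com/item/1", ALLOWED))
print(host_allowed("https://evil.example/exfil", ALLOWED))
```

Matching on the parsed hostname rather than on the raw URL string avoids being fooled by paths or query strings that merely contain an allowed domain.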

Usage

from birddog import Birddog

bd = Birddog(
    allowed_domains={"docs.brightdata.com", "*.example.com"},
    per_domain_qps=1.0,
    per_domain_burst=2.0,
    audit_path="runs/scrape.jsonl",
    # Optional — route through Bright Data Web Unlocker:
    bright_data={
        "host": "brd.superproxy.io:33335",
        "username": "brd-customer-...-zone-web_unlocker",
        "password": "...",
    },
)

with bd.session("research-bot") as s:
    r = s.fetch("https://docs.brightdata.com/api")
    print(r.status, r.bytes_len, "bytes")

    # Second hit on the same host within 1s -> raises RateLimitedError (qps cap = 1).
    # Catch it if the session should continue past this call.
    s.fetch("https://docs.brightdata.com/pricing")

    # Off-allowlist host -> raises DomainDeniedError; the attempt is also logged.
    s.fetch("https://evil.example/exfil")

FetchResult carries url, status, text, headers, bytes_len, elapsed_ms, and a via_brightdata flag so downstream code can tell whether the response came through the proxy.
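
The per_domain_qps and per_domain_burst settings describe a token bucket. The following is a generic sketch of that technique, not birddog's code: the TokenBucket class, the try_acquire method, the injectable clock, and the choice to start the bucket full are all assumptions made for illustration. The demo uses a burst of 1.0 so that a second immediate call is refused.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `qps` tokens per second and holds at most `burst` tokens."""

    def __init__(self, qps: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.qps = qps
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # assumption: the bucket starts full
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Spend one token if available; return False if the caller must wait."""
        now = self.clock()
        # Add the tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.qps)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock keeps the demo deterministic.
now = [0.0]
bucket = TokenBucket(qps=1.0, burst=1.0, clock=lambda: now[0])
first = bucket.try_acquire()    # bucket is full, so this succeeds
second = bucket.try_acquire()   # no time has passed, so no token is left
now[0] += 1.0                   # one second later, one token has been refilled
third = bucket.try_acquire()
print(first, second, third)     # True False True
```

A per-host limiter would keep one such bucket per hostname, for example in a dict keyed by host.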

Audit log

One JSON object per line (the examples below are wrapped for readability), e.g.:

{"ts":1747779600.12,"session_id":"research-bot","kind":"fetch_ok",
 "url":"https://docs.brightdata.com/api","host":"docs.brightdata.com",
 "status":200,"bytes":4221,"elapsed_ms":312.4}
{"ts":1747779600.45,"session_id":"research-bot","kind":"domain_denied",
 "url":"https://evil.example/exfil","host":"evil.example",
 "error":"host 'evil.example' not in allowlist"}

Kinds: session_open, fetch_ok, fetch_failed, domain_denied, rate_limited, session_close.
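
Because the log is plain JSONL, it can be inspected with the standard library alone. The sketch below tallies events by kind; count_kinds is a hypothetical helper, and the two sample lines are the events shown above, each joined onto a single line.

```python
import json
from collections import Counter
from typing import Iterable


def count_kinds(lines: Iterable[str]) -> Counter:
    """Count audit events by their 'kind' field, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["kind"]] += 1
    return counts


SAMPLE = [
    '{"ts":1747779600.12,"session_id":"research-bot","kind":"fetch_ok",'
    '"url":"https://docs.brightdata.com/api","host":"docs.brightdata.com",'
    '"status":200,"bytes":4221,"elapsed_ms":312.4}',
    '{"ts":1747779600.45,"session_id":"research-bot","kind":"domain_denied",'
    '"url":"https://evil.example/exfil","host":"evil.example",'
    '"error":"host \'evil.example\' not in allowlist"}',
]

for kind, n in sorted(count_kinds(SAMPLE).items()):
    print(kind, n)
```

To run it against a real log, pass an open file handle instead of SAMPLE, since iterating a text file yields one line at a time.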

Dashboard

pip install "birddog[dashboard]"
streamlit run -m birddog.dashboard -- --audit runs/scrape.jsonl

Shows total fetches, denials, bytes, and a per-host breakdown of fetches + bytes + p50 latency.
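
For a rough idea of how those numbers can be derived from the audit log, here is a sketch of the per-host aggregation. It is not the dashboard's code: per_host_summary is a hypothetical name, the events in EVENTS are made-up illustrative data, and using the median of elapsed_ms over fetch_ok events as p50 is an assumption.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Iterable


def per_host_summary(lines: Iterable[str]) -> tuple[dict, int]:
    """Return ({host: {fetches, bytes, p50_ms}}, denial_count) from JSONL lines."""
    by_host = defaultdict(lambda: {"fetches": 0, "bytes": 0, "latencies": []})
    denials = 0
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event["kind"] == "domain_denied":
            denials += 1
        elif event["kind"] == "fetch_ok":
            row = by_host[event["host"]]
            row["fetches"] += 1
            row["bytes"] += event["bytes"]
            row["latencies"].append(event["elapsed_ms"])
    summary = {
        host: {
            "fetches": row["fetches"],
            "bytes": row["bytes"],
            "p50_ms": median(row["latencies"]),
        }
        for host, row in by_host.items()
    }
    return summary, denials


# Illustrative events only; real logs carry more fields (ts, session_id, url, ...).
EVENTS = [
    json.dumps({"kind": "fetch_ok", "host": "a.example.com", "bytes": 1000, "elapsed_ms": 100.0}),
    json.dumps({"kind": "fetch_ok", "host": "a.example.com", "bytes": 2000, "elapsed_ms": 300.0}),
    json.dumps({"kind": "fetch_ok", "host": "a.example.com", "bytes": 500, "elapsed_ms": 200.0}),
    json.dumps({"kind": "domain_denied", "host": "evil.example"}),
]

summary, denials = per_host_summary(EVENTS)
print(summary, denials)
```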

Demos

Two runnable examples in examples/:

1. Smoke test — scrape_demo.py

python examples/scrape_demo.py

Hits each feature once: happy path, domain denial, rate-limit burst, summary. Offline via httpx.MockTransport.

2. Realistic agent — watchdog_agent.py

python examples/watchdog_agent.py

A small price-tracker agent. Polls a watchlist of product pages, extracts prices, alerts when something moves more than a per-product threshold. Three passes show:

  • allowlist denials (off-domain mirror URL is dropped)
  • per-domain rate cap kicking in on pass 3
  • threshold alerts (e.g. Δ = -6.4% against a 3.0% threshold)
  • a runs/watchdog.jsonl audit log you can dashboard
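
The threshold alert in the list above comes down to a percentage-change check. The sketch below is one plausible reading, not the demo's code: percent_change and should_alert are hypothetical names, the prices are invented to reproduce the -6.4% figure, and comparing the absolute change against the threshold is an assumption.

```python
def percent_change(old: float, new: float) -> float:
    """Signed change from old to new, in percent of the old value."""
    return (new - old) / old * 100.0


def should_alert(old: float, new: float, threshold_pct: float) -> bool:
    """Alert when the move, in either direction, exceeds the threshold."""
    return abs(percent_change(old, new)) > threshold_pct


# Hypothetical prices chosen so that the move is -6.4%.
old_price, new_price, threshold = 250.0, 234.0, 3.0
delta = percent_change(old_price, new_price)
print(f"{delta:+.1f}%", should_alert(old_price, new_price, threshold))
```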

Set BIRDDOG_USE_BRIGHTDATA=1 and your Bright Data Web Unlocker environment variables to switch the demo to a real proxy.

Companion libraries

birddog is the egress half of a small agent stack:

  • agentleash — USD/call budget cap + tool-arg schema gate
  • agentvet — tool-arg validation with LLM-friendly retry hints
  • agentsnap — snapshot tests for agent traces
  • agenttrace — cost + latency aggregation per run

Pair birddog with agentleash and you have egress allowlist + budget cap on the same agent.

License

MIT

Download files

Source Distribution

birddog-0.1.0.tar.gz (12.3 kB view details)

Uploaded Source

Built Distribution

birddog-0.1.0-py3-none-any.whl (9.5 kB view details)

Uploaded Python 3

File details

Details for the file birddog-0.1.0.tar.gz.

File metadata

  • Download URL: birddog-0.1.0.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for birddog-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6ed459047b1007606b2e16bb185374e4fc9c7914f0e9b72041610fef9c8b2fe2
MD5 fea76af8be4b6107654274d3a17837b4
BLAKE2b-256 c3a919a59a1d0932269c9a0b7667f54a90b5ac2915ce371ff2a648bdca9dc80a

File details

Details for the file birddog-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: birddog-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for birddog-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1b89fb5de280bfa895a6e1cb5ebfd55295c695d0d84e79e7836bf54e65565e17
MD5 e6d9250a1985ce48df0157705657cf8e
BLAKE2b-256 c655170dc27bed0d8dbac921060d8cf2b2e5bc4f77e0ab21588c71170a7d79d7
