Domain allowlist, per-domain rate caps, response audit, and a Streamlit dashboard for Bright Data scraping agents. The picks-and-shovels layer for any agent that consumes the live web.
birddog
Audited Bright Data egress for AI agents. Drop one context manager around an agent that scrapes the web and you get:
- Domain allowlist — deny everything outside it, log the attempt
- Per-domain rate caps — simple token bucket per host
- Response audit log — one JSONL line per fetch (url, status, bytes, ms)
- Bright Data Web Unlocker proxy — opt-in: route via Bright Data
- Streamlit dashboard — point it at the JSONL, get per-host bytes, denial counts, latency p50
Built for the kind of agent that hits live sites: research bots, price trackers, RAG ingest jobs. If you've ever watched an agent rip through a sponsor's free tier in 30 seconds, this is for you.
Install
pip install birddog # core
pip install "birddog[dashboard]" # + Streamlit dashboard
Requires Python 3.10+.
Why
LLM agents don't know what a sane scraping cadence looks like. They'll hammer a site, ignore robots.txt, follow links into spammy subdomains, and burn through a Bright Data quota in a single run.
birddog puts a leash on the egress side:
| Concern | What birddog does |
|---|---|
| Wandering off-domain | Allowlist with exact (example.com) and wildcard (*.example.com) entries |
| Burst scraping | Token bucket per host (qps + burst) |
| "What did it fetch?" | JSONL audit log, one event per fetch |
| Anti-bot blocks | Optional Bright Data Web Unlocker proxy |
| Post-run review | Bundled Streamlit dashboard |
It does not parse HTML, manage cookies, render JS, or rotate user agents. That's what Bright Data + your scraping code are for.
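To make the allowlist row above concrete, here is a minimal sketch of exact plus wildcard host matching. It assumes that `*.example.com` covers subdomains only and that the bare apex needs its own `example.com` entry, which is why the table lists both. `host_allowed` is a hypothetical helper written for illustration, not part of birddog's API, and the library's real matching rules may differ.

```python
from urllib.parse import urlsplit


def host_allowed(url: str, allowed: set[str]) -> bool:
    """Return True if the URL's host matches an exact or wildcard entry.

    Hypothetical helper for illustration only; birddog's own matching
    may treat the apex domain or ports differently.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    for entry in allowed:
        entry = entry.lower()
        if entry.startswith("*."):
            # Wildcard entry: compare against ".example.com" so that
            # "notexample.com" and the bare apex do not match.
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            return True
    return False


allowed = {"example.com", "*.example.com"}
print(host_allowed("https://shop.example.com/item/1", allowed))
print(host_allowed("https://evil.example/exfil", allowed))
```

Matching on the leading dot rather than a plain suffix is the design choice that matters here: it keeps look-alike hosts such as notexample.com out of the allowlist.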
Usage
from birddog import Birddog

bd = Birddog(
    allowed_domains={"docs.brightdata.com", "*.example.com"},
    per_domain_qps=1.0,
    per_domain_burst=2.0,
    audit_path="runs/scrape.jsonl",
    # Optional: route through Bright Data Web Unlocker
    bright_data={
        "host": "brd.superproxy.io:33335",
        "username": "brd-customer-...-zone-web_unlocker",
        "password": "...",
    },
)

with bd.session("research-bot") as s:
    r = s.fetch("https://docs.brightdata.com/api")
    print(r.status, r.bytes_len, "bytes")
    # second hit within 1s -> RateLimitedError (qps cap = 1)
    s.fetch("https://docs.brightdata.com/pricing")
    # off-allowlist -> DomainDeniedError, also logged
    s.fetch("https://evil.example/exfil")
FetchResult carries url, status, text, headers, elapsed_ms, the bytes_len
used in the example above, and a via_brightdata flag so downstream code
can tell whether the response came through the proxy.
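The per_domain_qps and per_domain_burst settings describe a token bucket. The sketch below shows the textbook version, assuming the bucket starts full, refills at `qps` tokens per second, and is capped at `burst`. `TokenBucket`, `try_acquire`, and the injectable `clock` are hypothetical names chosen for this example; birddog's internal implementation is not shown in this README and may account for tokens differently.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Generic token bucket; an illustration, not birddog's own class."""

    def __init__(self, qps: float, burst: float, clock=time.monotonic):
        self.qps = qps
        self.burst = burst
        self.clock = clock
        self.tokens = burst        # assumption: the bucket starts full
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per host, created lazily on first use.
buckets = defaultdict(lambda: TokenBucket(qps=1.0, burst=1.0))
print(buckets["docs.brightdata.com"].try_acquire())  # first request
print(buckets["docs.brightdata.com"].try_acquire())  # immediate retry
```

In this textbook model a burst of 2.0 would admit two back-to-back requests before the cap applies, so the exact request at which birddog raises RateLimitedError depends on its own accounting.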
Audit log
One JSON object per line, e.g.:
{"ts":1747779600.12,"session_id":"research-bot","kind":"fetch_ok","url":"https://docs.brightdata.com/api","host":"docs.brightdata.com","status":200,"bytes":4221,"elapsed_ms":312.4}
{"ts":1747779600.45,"session_id":"research-bot","kind":"domain_denied","url":"https://evil.example/exfil","host":"evil.example","error":"host 'evil.example' not in allowlist"}
Kinds: session_open, fetch_ok, fetch_failed, domain_denied,
rate_limited, session_close.
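Because each event is a standalone JSON object, the log can be processed line by line with the standard library. A minimal sketch that tallies events by their kind field follows; `count_kinds` is a hypothetical helper, and the two sample events reuse the fields shown above.

```python
import json
from collections import Counter
from typing import Iterable


def count_kinds(lines: Iterable[str]) -> Counter:
    """Tally audit events by their "kind" field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[event["kind"]] += 1
    return counts


# Sample events mirroring the two log lines shown above.
sample = [
    json.dumps({
        "ts": 1747779600.12, "session_id": "research-bot", "kind": "fetch_ok",
        "url": "https://docs.brightdata.com/api", "host": "docs.brightdata.com",
        "status": 200, "bytes": 4221, "elapsed_ms": 312.4,
    }),
    json.dumps({
        "ts": 1747779600.45, "session_id": "research-bot", "kind": "domain_denied",
        "url": "https://evil.example/exfil", "host": "evil.example",
        "error": "host 'evil.example' not in allowlist",
    }),
]
print(count_kinds(sample))
```

For a real run, pass an open file handle instead of the sample list, for example the file at runs/scrape.jsonl.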
Dashboard
pip install "birddog[dashboard]"
streamlit run -m birddog.dashboard -- --audit runs/scrape.jsonl
Shows total fetches, denials, bytes, and a per-host breakdown of fetches + bytes + p50 latency.
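The same per-host numbers can be approximated without Streamlit. The sketch below assumes that only fetch_ok events contribute to fetches, bytes, and latency, and that p50 means the median of elapsed_ms; the dashboard's actual aggregation may differ. `per_host_summary` and the sample events are hypothetical.

```python
import json
from collections import defaultdict
from statistics import median


def per_host_summary(lines):
    """Aggregate fetch_ok events into fetches, bytes, and p50 latency per host."""
    bytes_by_host = defaultdict(int)
    latencies = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("kind") != "fetch_ok":
            continue  # assumption: denials and failures are counted elsewhere
        host = event["host"]
        bytes_by_host[host] += event.get("bytes", 0)
        latencies[host].append(event["elapsed_ms"])
    return {
        host: {
            "fetches": len(ms),
            "bytes": bytes_by_host[host],
            "p50_ms": median(ms),
        }
        for host, ms in latencies.items()
    }


# Made-up events for illustration only.
events = [
    {"kind": "fetch_ok", "host": "a.example.com", "bytes": 100, "elapsed_ms": 10.0},
    {"kind": "fetch_ok", "host": "a.example.com", "bytes": 300, "elapsed_ms": 30.0},
    {"kind": "fetch_ok", "host": "a.example.com", "bytes": 200, "elapsed_ms": 20.0},
    {"kind": "domain_denied", "host": "evil.example"},
]
print(per_host_summary(json.dumps(e) for e in events))
```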
Demos
Two runnable examples in examples/:
1. Smoke test — scrape_demo.py
python examples/scrape_demo.py
Hits each feature once: happy path, domain denial, rate-limit burst,
summary. Offline via httpx.MockTransport.
2. Realistic agent — watchdog_agent.py
python examples/watchdog_agent.py
A small price-tracker agent. Polls a watchlist of product pages, extracts prices, alerts when something moves more than a per-product threshold. Three passes show:
- allowlist denials (off-domain mirror URL is dropped)
- per-domain rate cap kicking in on pass 3
- threshold alerts (Δ -6.4% > 3.0%)
- a runs/watchdog.jsonl audit log you can dashboard
Set BIRDDOG_USE_BRIGHTDATA=1 + your Bright Data Web Unlocker env
vars to flip the demo to a real proxy.
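The threshold alert in the list above (Δ -6.4% > 3.0%) suggests that the absolute percentage move is compared against a per-product threshold. A minimal sketch of that arithmetic, under that assumption; `pct_change`, `should_alert`, and the prices are hypothetical and are not taken from watchdog_agent.py.

```python
def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    if old == 0:
        raise ValueError("old price must be non-zero")
    return (new - old) / old * 100.0


def should_alert(old: float, new: float, threshold_pct: float) -> bool:
    """Alert when the absolute move exceeds the threshold (assumed rule)."""
    return abs(pct_change(old, new)) > threshold_pct


old_price, new_price = 50.00, 46.80   # hypothetical prices
delta = pct_change(old_price, new_price)
print(f"Δ {delta:+.1f}% alert={should_alert(old_price, new_price, 3.0)}")
```

Comparing the absolute value means price drops and price rises both trigger an alert; a tracker that only cares about drops would compare the signed change instead.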
Companion libraries
birddog is the egress half of a small agent stack:
- agentleash — USD/call budget cap + tool-arg schema gate
- agentvet — tool-arg validation with LLM-friendly retry hints
- agentsnap — snapshot tests for agent traces
- agenttrace — cost + latency aggregation per run
Pair birddog with agentleash and you get an egress allowlist plus a
budget cap on the same agent.
License
MIT
File details
Details for the file birddog-0.1.0.tar.gz.
File metadata
- Download URL: birddog-0.1.0.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ed459047b1007606b2e16bb185374e4fc9c7914f0e9b72041610fef9c8b2fe2 |
| MD5 | fea76af8be4b6107654274d3a17837b4 |
| BLAKE2b-256 | c3a919a59a1d0932269c9a0b7667f54a90b5ac2915ce371ff2a648bdca9dc80a |
File details
Details for the file birddog-0.1.0-py3-none-any.whl.
File metadata
- Download URL: birddog-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b89fb5de280bfa895a6e1cb5ebfd55295c695d0d84e79e7836bf54e65565e17 |
| MD5 | e6d9250a1985ce48df0157705657cf8e |
| BLAKE2b-256 | c655170dc27bed0d8dbac921060d8cf2b2e5bc4f77e0ab21588c71170a7d79d7 |