Audited Bright Data egress for AI scraping agents. Domain allowlist, per-domain rate caps, JSONL audit, Streamlit dashboard, optional Web Unlocker proxy, and optional on-close Merkle attestation via mantle-agent-attest.

birddog

Audited Bright Data egress for AI agents. Drop one context manager around an agent that scrapes the web and you get:

  1. Domain allowlist — deny everything outside it, log the attempt
  2. Per-domain rate caps — simple token bucket per host
  3. Response audit log — one JSONL line per fetch (url, status, bytes, ms)
  4. Bright Data Web Unlocker proxy — opt-in: route via Bright Data
  5. Streamlit dashboard — point it at the JSONL, get per-host bytes, denial counts, latency p50

Built for the kind of agent that hits live sites: research bots, price trackers, RAG ingest jobs. If you've ever watched an agent rip through a sponsor's free tier in 30 seconds, this is for you.

Install

pip install birddog                    # core
pip install "birddog[dashboard]"       # + Streamlit dashboard

Requires Python 3.10+.

Why

LLM agents don't know what a sane scraping cadence looks like. They'll hammer a site, ignore robots.txt, follow links into spammy subdomains, and burn through a Bright Data quota in a single run.

birddog puts a leash on the egress side:

| Concern | What birddog does |
| --- | --- |
| Wandering off-domain | Allowlist with example.com + *.example.com |
| Burst scraping | Token bucket per host (qps + burst) |
| "What did it fetch?" | JSONL audit log, one event per fetch |
| Anti-bot blocks | Optional Bright Data Web Unlocker proxy |
| Post-run review | Bundled Streamlit dashboard |

It does not parse HTML, manage cookies, render JS, or rotate user agents. That's what Bright Data + your scraping code are for.
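
To make the allowlist row concrete, here is a minimal sketch of matching a URL's host against exact entries and `*.` wildcards. It is not birddog's own code: `host_allowed` and `url_allowed` are hypothetical names, and the rule that `*.example.com` covers subdomains but not `example.com` itself is an assumption, suggested by the table listing both forms.

```python
from urllib.parse import urlsplit


def host_allowed(host: str, patterns: set[str]) -> bool:
    """Return True if host equals an entry or falls under a '*.suffix' wildcard."""
    host = host.lower().rstrip(".")
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # "*.example.com" -> ".example.com"
            # Assumption: the wildcard needs at least one label before the suffix,
            # so "example.com" on its own does not match "*.example.com".
            if len(host) > len(suffix) and host.endswith(suffix):
                return True
        elif host == pattern:
            return True
    return False


def url_allowed(url: str, patterns: set[str]) -> bool:
    """Extract the hostname from a URL and check it against the allowlist."""
    host = urlsplit(url).hostname or ""
    return host_allowed(host, patterns)


allowed = {"docs.brightdata.com", "*.example.com"}
```

Matching on the dotted suffix rather than a bare `endswith("example.com")` keeps lookalike hosts such as `notexample.com` out.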

Usage

from birddog import Birddog

bd = Birddog(
    allowed_domains={"docs.brightdata.com", "*.example.com"},
    per_domain_qps=1.0,
    per_domain_burst=2.0,
    audit_path="runs/scrape.jsonl",
    # Optional — route through Bright Data Web Unlocker:
    bright_data={
        "host": "brd.superproxy.io:33335",
        "username": "brd-customer-...-zone-web_unlocker",
        "password": "...",
    },
)

with bd.session("research-bot") as s:
    r = s.fetch("https://docs.brightdata.com/api")
    print(r.status, r.bytes_len, "bytes")

    # second hit within 1s -> RateLimitedError (qps cap = 1)
    s.fetch("https://docs.brightdata.com/pricing")

    # off-allowlist -> DomainDeniedError, also logged
    s.fetch("https://evil.example/exfil")

FetchResult carries url, status, text, headers, elapsed_ms, and a via_brightdata flag so downstream code can tell whether the response came through the proxy.
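
The `per_domain_qps` and `per_domain_burst` settings above describe a token bucket per host. The sketch below shows the general technique, not birddog's implementation: `TokenBucket` and `PerHostLimiter` are hypothetical names, and it assumes each bucket starts full. Under that assumption a burst of 2 lets two immediate requests through before the third is refused, so birddog's exact counting may differ from this sketch.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `burst`; each request spends one."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # assumption: the bucket starts full
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        # Credit tokens for the time elapsed since the last call, capped at burst.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerHostLimiter:
    """Keeps one independent bucket per host, created on first use."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, host: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(host)
        if bucket is None:
            bucket = self.buckets[host] = TokenBucket(self.rate, self.burst, now)
        return bucket.try_acquire(now)
```

Using a monotonic clock avoids surprises when the wall clock is adjusted mid-run.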

Audit log

One JSON object per line (wrapped here for readability), e.g.:

{"ts":1747779600.12,"session_id":"research-bot","kind":"fetch_ok",
 "url":"https://docs.brightdata.com/api","host":"docs.brightdata.com",
 "status":200,"bytes":4221,"elapsed_ms":312.4}
{"ts":1747779600.45,"session_id":"research-bot","kind":"domain_denied",
 "url":"https://evil.example/exfil","host":"evil.example",
 "error":"host 'evil.example' not in allowlist"}

Kinds: session_open, fetch_ok, fetch_failed, domain_denied, rate_limited, session_close.
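
Because each event is a standalone JSON object on its own line, the log can be read with the standard library alone. The sketch below counts events by `kind`. The helper names `iter_events` and `count_kinds` are hypothetical, and the sample records only mirror the fields shown above.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Two sample records in the shape shown above, each on a single line.
SAMPLE_LINES = [
    '{"ts":1747779600.12,"session_id":"research-bot","kind":"fetch_ok",'
    '"url":"https://docs.brightdata.com/api","host":"docs.brightdata.com",'
    '"status":200,"bytes":4221,"elapsed_ms":312.4}',
    '{"ts":1747779600.45,"session_id":"research-bot","kind":"domain_denied",'
    '"url":"https://evil.example/exfil","host":"evil.example",'
    '"error":"host \'evil.example\' not in allowlist"}',
]


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_kinds(lines: Iterable[str]) -> Counter:
    """Count events by their 'kind' field."""
    return Counter(event.get("kind", "unknown") for event in iter_events(lines))


counts = count_kinds(SAMPLE_LINES)
```

To run it on a real log, pass an open file handle instead of `SAMPLE_LINES`, for example one opened on `runs/scrape.jsonl`.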

Dashboard

pip install "birddog[dashboard]"
streamlit run -m birddog.dashboard -- --audit runs/scrape.jsonl

Shows total fetches, denials, bytes, and a per-host breakdown of fetches + bytes + p50 latency.
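
If you want the same numbers without Streamlit, a per-host breakdown can be approximated in a few lines. This is a sketch, not the dashboard's code: `summarize_by_host` is a hypothetical helper, and it assumes that only `fetch_ok` events count toward fetches, bytes, and latency.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Iterable


def summarize_by_host(lines: Iterable[str]) -> dict[str, dict]:
    """Aggregate fetch count, denial count, bytes, and p50 latency per host."""
    rows = defaultdict(lambda: {"fetches": 0, "denials": 0, "bytes": 0, "latencies": []})
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        host = event.get("host")
        if host is None:
            continue  # e.g. session_open / session_close carry no host
        row = rows[host]
        kind = event.get("kind")
        if kind == "fetch_ok":
            row["fetches"] += 1
            row["bytes"] += event.get("bytes", 0)
            if "elapsed_ms" in event:
                row["latencies"].append(event["elapsed_ms"])
        elif kind == "domain_denied":
            row["denials"] += 1
    return {
        host: {
            "fetches": row["fetches"],
            "denials": row["denials"],
            "bytes": row["bytes"],
            # p50 is the median of the recorded latencies, if any.
            "p50_ms": median(row["latencies"]) if row["latencies"] else None,
        }
        for host, row in rows.items()
    }
```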

Demos

Two runnable examples in examples/:

1. Smoke test — scrape_demo.py

python examples/scrape_demo.py

Hits each feature once: happy path, domain denial, rate-limit burst, summary. Offline via httpx.MockTransport.

2. Realistic agent — watchdog_agent.py

python examples/watchdog_agent.py

A small price-tracker agent. Polls a watchlist of product pages, extracts prices, alerts when something moves more than a per-product threshold. Three passes show:

  • allowlist denials (off-domain mirror URL is dropped)
  • per-domain rate cap kicking in on pass 3
  • threshold alerts (Δ -6.4% > 3.0%)
  • a runs/watchdog.jsonl audit log you can dashboard
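
The threshold alert in the list above boils down to comparing the size of a percentage change against a per-product limit. A minimal sketch, assuming the alert fires on the absolute change; `pct_change` and `should_alert` are hypothetical names, not functions from watchdog_agent.py:

```python
def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    if old == 0:
        raise ValueError("old price must be non-zero")
    return (new - old) / old * 100.0


def should_alert(old: float, new: float, threshold_pct: float) -> bool:
    """Alert when the move, up or down, exceeds the threshold."""
    return abs(pct_change(old, new)) > threshold_pct
```

With a 3.0% threshold, a drop from 50.00 to 46.80 (a change of about -6.4%) would trigger an alert, while a drop to 49.00 (-2.0%) would not.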

Set BIRDDOG_USE_BRIGHTDATA=1 + your Bright Data Web Unlocker env vars to flip the demo to a real proxy.

Companion libraries

birddog is the egress half of a small agent stack:

  • agentleash — USD/call budget cap + tool-arg schema gate
  • agentvet — tool-arg validation with LLM-friendly retry hints
  • agentsnap — snapshot tests for agent traces
  • agenttrace — cost + latency aggregation per run

Pair birddog with agentleash and you have egress allowlist + budget cap on the same agent.

License

MIT
