
High-speed malicious URL detection using a Bloom Filter


dvara

High-speed malicious URL detection using a Bloom Filter. Holds up to 3 million URLs in about 5 MB of RAM.

Requires Python 3.11+. License: MIT.

pip install dvara

dvara check https://suspicious-site.com
🚨 MALICIOUS | urlhaus | malware_download | 2.5ms | online

dvara check https://google.com
✅ CLEAN | 0.1ms | online

What is dvara?

dvara is a Python CLI and library for detecting malicious URLs using a Bloom Filter, a probabilistic set-membership structure that early versions of Chrome Safe Browsing also used internally.

It ingests threat feeds from URLhaus, PhishTank, and OpenPhish (~86,000 URLs, refreshed daily), stores them in a Bloom Filter of about 5 MB, and clears a clean URL in under 1 ms without touching a database. Only URLs that hit the filter go on to a database lookup, which takes about 2.5 ms end to end.


Architecture

Daily ingestion job
→ Pulls URLhaus + PhishTank + OpenPhish (~86k URLs today; the filter is sized for 3M)
→ Builds Bloom Filter (5.14 MB, 0.1% target FPR)
→ Saves to ~/.dvara/filter.bin

dvara check [url]  (online mode)
→ FastAPI backend
→ Hash URL → check 10 bit positions in Bloom Filter
→ Any bit OFF  → CLEAN instantly (0.1ms, DB never touched)
→ All bits ON  → query PostgreSQL confirmed_urls table
→ Found        → MALICIOUS + source + category
→ Not found    → SUSPICIOUS (false positive)

dvara check [url] --offline
→ Loads filter from ~/.dvara/filter.bin
→ Checks locally, zero network calls
→ dvara update to refresh

Two-stage design (the key insight)

| Stage | What | Latency | When |
|---|---|---|---|
| 1. Bloom Filter | Redis bitstring, 10 hash lookups | 0.1ms | Every request |
| 2. PostgreSQL | confirmed_urls table lookup | 1–3ms | Only on bloom hits |

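URLs that the filter reports as clean never touch the database. False negatives are impossible by construction: adding a URL sets all 10 of its bits, so a later lookup of that exact URL always finds them set.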
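The control flow of the two stages can be sketched in a few lines. This is an illustration, not dvara's actual code: `two_stage_check`, `Verdict`, and the `"bloom"` stage label are hypothetical names, a plain set stands in for the Bloom Filter, and a dict stands in for the PostgreSQL confirmed_urls table.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class Verdict:
    result: str                      # "CLEAN", "MALICIOUS", or "SUSPICIOUS"
    stage: str                       # "bloom" (assumed label) or "db"
    source: Optional[str] = None
    category: Optional[str] = None


def two_stage_check(
    url: str,
    might_contain: Callable[[str], bool],
    confirmed: Mapping[str, tuple[str, str]],
) -> Verdict:
    # Stage 1: a negative answer from the filter is definitive, so stop here.
    if not might_contain(url):
        return Verdict("CLEAN", "bloom")
    # Stage 2: the filter said "maybe", so ask the authoritative store.
    hit = confirmed.get(url)
    if hit is None:
        return Verdict("SUSPICIOUS", "db")  # the filter hit was a false positive
    source, category = hit
    return Verdict("MALICIOUS", "db", source, category)


# Stand-ins: the set plays the filter (with one deliberate false positive),
# the dict plays the confirmed_urls table.
filter_members = {"http://bad.example/a", "http://collision.example/"}
confirmed_urls = {"http://bad.example/a": ("urlhaus", "malware_download")}

for u in ("https://ok.example/", "http://bad.example/a", "http://collision.example/"):
    print(u, two_stage_check(u, filter_members.__contains__, confirmed_urls).result)
```

Only the second and third URLs reach the dict lookup; the first returns after stage 1.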


Benchmarks

| Metric | Result |
|---|---|
| Clean URL check | 0.1ms |
| Malicious URL check (full pipeline) | 2.5ms |
| URLs stored | 85,976 (scales to 3M) |
| Filter size | 5.14 MB |
| False negative rate | 0% (guaranteed) |
| Target false positive rate | 0.1% |
| Actual false positive rate | ~0% at current fill |
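The "~0% at current fill" figure follows from the standard Bloom filter approximation FPR ≈ (1 − e^(−kn/m))^k. A minimal sketch, assuming the filter is allocated at its full 3M-URL size (about 43M bits, 10 hashes) while holding only the current 85,976 URLs; `expected_fpr` is a hypothetical helper name:

```python
import math


def expected_fpr(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Approximate false positive rate of a Bloom filter holding n_items."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


M_BITS = 43_000_000  # "~43M bits", the rounded size for n = 3M and p = 0.001
K = 10

print(f"at 85,976 URLs:    {expected_fpr(85_976, M_BITS, K):.2e}")
print(f"at 3,000,000 URLs: {expected_fpr(3_000_000, M_BITS, K):.2e}")
```

At the current fill the expected rate is around 1e-17, which is why measured false positives are effectively zero. At full capacity it rises to roughly the 0.1% target.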

Installation

pip install dvara

Quick Start (no server needed)

dvara ships with a built-in filter. After installing, offline checks work immediately:

dvara check https://suspicious-site.com --offline

No API key, no Docker, no setup. Just install and check.

For running the backend server

pip install "dvara[server]"

CLI Usage

Check a URL (online mode — hits API)

dvara check https://suspicious-site.com

Check a URL (offline mode — local filter, zero network)

dvara check https://suspicious-site.com --offline

Show filter and API stats

dvara stats

Update local filter cache

dvara update

Run ingestion manually

dvara ingest
dvara ingest --dry-run

Running the Backend

With Docker Compose (recommended)

git clone https://github.com/dhruv-0512/dvara
cd dvara
docker compose up --build

This starts:

  • FastAPI — API server on port 8000
  • Redis — Bloom filter bitstring cache
  • PostgreSQL — confirmed URLs table

Manually

pip install "dvara[server]"

# Build the filter
python -m dvara.ingestion

# Start the API
python -m uvicorn dvara.app:app --reload

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/check?url=... | GET | Two-stage URL check |
| /api/confirm?url=... | GET | Direct DB lookup |
| /api/stats | GET | Filter + connection stats |
| /api/reload | POST | Reload filter from disk |
| /health | GET | Health check |

Example response

{
  "url": "http://110.36.95.252:49267/bin.sh",
  "result": "MALICIOUS",
  "source": "urlhaus",
  "category": "malware_download",
  "latency_ms": 2.5,
  "stage": "db",
  "checked_at": "2026-05-02T13:01:04.767776+00:00"
}
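A client could consume this response along the following lines. This is a sketch under assumptions: `CheckResult`, `parse_check_response`, and `check_url` are hypothetical names, the base URL assumes the Docker Compose setup on port 8000, and treating `source` and `category` as optional (for example on CLEAN results) is a guess rather than documented behavior.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


@dataclass
class CheckResult:
    url: str
    result: str
    latency_ms: float
    stage: str
    checked_at: datetime
    source: Optional[str] = None      # assumed absent when not MALICIOUS
    category: Optional[str] = None    # assumed absent when not MALICIOUS


def parse_check_response(body: str) -> CheckResult:
    data = json.loads(body)
    return CheckResult(
        url=data["url"],
        result=data["result"],
        latency_ms=float(data["latency_ms"]),
        stage=data["stage"],
        checked_at=datetime.fromisoformat(data["checked_at"]),
        source=data.get("source"),
        category=data.get("category"),
    )


def check_url(url: str, base: str = "http://localhost:8000") -> CheckResult:
    # urlencode escapes the target URL so it survives as a query parameter.
    query = urlencode({"url": url})
    with urlopen(f"{base}/api/check?{query}", timeout=5) as resp:
        return parse_check_response(resp.read().decode("utf-8"))


SAMPLE = """{
  "url": "http://110.36.95.252:49267/bin.sh",
  "result": "MALICIOUS",
  "source": "urlhaus",
  "category": "malware_download",
  "latency_ms": 2.5,
  "stage": "db",
  "checked_at": "2026-05-02T13:01:04.767776+00:00"
}"""

parsed = parse_check_response(SAMPLE)
print(parsed.result, parsed.source, parsed.checked_at.isoformat())
```

`check_url` needs a running backend; `parse_check_response` can be exercised offline against the sample above.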

The Math

  • n = 3,000,000 URLs, p = 0.001 (0.1% FPR)
  • Bit array size: m = -(n × ln(p)) / (ln(2))² ≈ 43.1M bits ≈ 5.39 MB (5.14 MiB)
  • Hash count: k = (m/n) × ln(2) ≈ 9.97, rounded to 10 hash functions
  • Hash algorithm: MurmurHash3 with seeds 0–9
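The sizing arithmetic above can be reproduced directly. A minimal sketch; `bloom_parameters` is a hypothetical helper, not part of dvara's API:

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n items at target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


m, k = bloom_parameters(3_000_000, 0.001)
print(f"m = {m:,} bits = {m / 8 / 2**20:.2f} MiB, k = {k}")
```

The result is about 14.4 bits per URL regardless of how long the URLs are, which is where the memory savings come from.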

Why Bloom Filter and not a hash set?

Storing 3M URLs in a Python set takes 500MB+ of RAM. A Bloom Filter at 0.1% FPR needs about 5 MB. A false positive only triggers the DB confirm step, which is an acceptable cost, and a URL that was added to the filter can never be reported as clean. For this workload the Bloom Filter is the right tool.
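To make the trade-off concrete, here is a minimal Bloom filter sketch. It is not dvara's `BloomFilter` class: `TinyBloom` is a hypothetical name, and salted BLAKE2b from the standard library stands in for seeded MurmurHash3 so that the example has no third-party dependencies.

```python
import hashlib
from typing import Iterator


class TinyBloom:
    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        data = item.encode("utf-8")
        for seed in range(self.k):
            # One salted hash per seed; a stand-in for MurmurHash3(seed).
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # A single unset bit proves the item was never added.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


bloom = TinyBloom(m_bits=20_000, k_hashes=10)
added = [f"http://bad.example/{i}" for i in range(1000)]
for url in added:
    bloom.add(url)

# Every inserted URL is found again: no false negatives.
print(all(url in bloom for url in added))

# Unseen URLs are almost always rejected; any that pass are false positives.
unseen = [f"https://ok.example/{i}" for i in range(1000)]
print(sum(url in bloom for url in unseen), "false positives out of", len(unseen))
```

The filter stores 1,000 URLs in 2,500 bytes because it keeps only bits, never the URLs themselves.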

Why Redis and not disk?

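Multiple FastAPI workers need to read the same filter simultaneously. With a file on disk, each worker would load its own copy and every refresh would have to be coordinated across workers. A Redis bitstring is a single copy that all workers read over the network, so adding workers requires no extra synchronization.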
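A sketch of how filter bits might be read and written through a shared store. The key name `dvara:bloom` and the helpers `mark` and `maybe_present` are hypothetical. `InMemoryBits` is a local stand-in so the example runs without a server; redis-py's `Redis.setbit(name, offset, value)` and `Redis.getbit(name, offset)` have the same call shape, so a real client could be passed in its place.

```python
from typing import Iterable, Protocol


class BitStore(Protocol):
    def setbit(self, name: str, offset: int, value: int) -> int: ...
    def getbit(self, name: str, offset: int) -> int: ...


class InMemoryBits:
    """Local stand-in that mimics the SETBIT/GETBIT semantics of Redis."""

    def __init__(self) -> None:
        self._bits: dict[str, set[int]] = {}

    def setbit(self, name: str, offset: int, value: int) -> int:
        bits = self._bits.setdefault(name, set())
        previous = int(offset in bits)
        if value:
            bits.add(offset)
        else:
            bits.discard(offset)
        return previous  # like Redis, return the bit's previous value

    def getbit(self, name: str, offset: int) -> int:
        return int(offset in self._bits.get(name, ()))


KEY = "dvara:bloom"  # hypothetical key name


def mark(store: BitStore, positions: Iterable[int]) -> None:
    for pos in positions:
        store.setbit(KEY, pos, 1)


def maybe_present(store: BitStore, positions: Iterable[int]) -> bool:
    # Stops at the first unset bit, so most clean URLs need few round trips.
    return all(store.getbit(KEY, pos) for pos in positions)


store = InMemoryBits()
mark(store, [3, 17, 42])
print(maybe_present(store, [3, 17, 42]))
print(maybe_present(store, [3, 17, 43]))
```

With a real Redis client the ten `getbit` calls could be batched to cut round trips; that optimization is left out here.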


Threat Feed Sources

| Feed | Format | URLs |
|---|---|---|
| URLhaus | CSV | ~26,000 |
| PhishTank | JSON (gzipped) | ~59,000 |
| OpenPhish | Plaintext | ~300 |

Project Structure

dvara/
├── bloom.py        ← BloomFilter class (core)
├── ingestion.py    ← Fetch feeds, build filter
├── app.py          ← FastAPI backend
├── cli.py          ← Click CLI commands
└── config.py       ← Constants and env vars

Why "dvara"?

Dvara (द्वार) is the Sanskrit word for gateway or door.

Every URL is a gateway — dvara stands at that door and decides what gets through.


License

MIT
