dvara
High-speed malicious URL detection using a Bloom Filter. Holds up to 3 million URLs in about 5 MB of RAM.
pip install dvara
dvara check https://suspicious-site.com
🚨 MALICIOUS | urlhaus | malware_download | 2.5ms | online
dvara check https://google.com
✅ CLEAN | 0.1ms | online
What is dvara?
dvara is a Python CLI and library for detecting malicious URLs using a Bloom Filter — the same probabilistic data structure used internally by Chrome Safe Browsing.
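It ingests threat feeds from URLhaus, PhishTank, and OpenPhish (about 86,000 URLs, refreshed daily), stores them in a roughly 5 MB Bloom Filter, and clears a clean URL in about 0.1 ms without touching a database. Only URLs that hit the filter go on to a database lookup, which brings the full check to about 2.5 ms.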
Architecture
Daily ingestion job
→ Pulls URLhaus + PhishTank + OpenPhish (~86k URLs today; filter sized for 3M)
→ Builds Bloom Filter (5.14 MB, 0.1% target FPR)
→ Saves to ~/.dvara/filter.bin
dvara check [url] (online mode)
→ FastAPI backend
→ Hash URL → check 10 bit positions in Bloom Filter
→ Any bit OFF → CLEAN instantly (0.1ms, DB never touched)
→ All bits ON → query PostgreSQL confirmed_urls table
→ Found → MALICIOUS + source + category
→ Not found → SUSPICIOUS (false positive)
dvara check [url] --offline
→ Loads filter from ~/.dvara/filter.bin
→ Checks locally, zero network calls
→ dvara update to refresh
Two-stage design (the key insight)
| Stage | What | Latency | When |
|---|---|---|---|
| 1 — Bloom Filter | Redis bitstring, 10 hash lookups | 0.1ms | Every request |
| 2 — PostgreSQL | confirmed_urls table lookup | 1–3ms | Only on bloom hits |
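Clean URLs reach the database only on the rare Bloom Filter false positive. False negatives are mathematically impossible: every URL that was added to the filter always tests positive.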
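The two stages can be sketched in a few lines. This is a minimal illustration, not dvara's actual code: `two_stage_check`, `Verdict`, the `might_contain` callable, and the `"bloom"` stage label are hypothetical names, and a plain dict stands in for the PostgreSQL confirmed_urls table.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class Verdict:
    result: str                     # "CLEAN", "MALICIOUS", or "SUSPICIOUS"
    stage: str                      # "bloom" (assumed label) or "db"
    source: Optional[str] = None
    category: Optional[str] = None


def two_stage_check(
    url: str,
    might_contain: Callable[[str], bool],        # stage 1: Bloom Filter test
    confirmed: Mapping[str, tuple[str, str]],    # stage 2: stand-in for the DB table
) -> Verdict:
    # Stage 1: a negative answer from the filter is definitive, so stop here.
    if not might_contain(url):
        return Verdict("CLEAN", "bloom")
    # Stage 2: a positive answer may be a false positive, so confirm it.
    row = confirmed.get(url)
    if row is None:
        return Verdict("SUSPICIOUS", "db")
    source, category = row
    return Verdict("MALICIOUS", "db", source, category)
```

Because stage 1 can only err on the positive side, the confirm step is what separates a real hit from a filter collision.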
Benchmarks
| Metric | Result |
|---|---|
| Clean URL check | 0.1ms |
| Malicious URL check (full pipeline) | 2.5ms |
| URLs stored | 85,976 (scales to 3M) |
| Filter size | 5.14 MB |
| False negative rate | 0% for URLs in the filter (guaranteed) |
| Target false positive rate | 0.1% |
| Actual false positive rate | ~0% at current fill |
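The gap between the target and actual false positive rate follows from the standard Bloom Filter estimate p ≈ (1 − e^(−kn/m))^k. The sketch below evaluates it at the current fill and at full capacity. The value of `M_BITS` is an assumption derived from the sizing formula in The Math section, not a constant read from dvara.

```python
import math


def expected_fpr(n: int, m: int, k: int) -> float:
    """Standard estimate of the false positive rate for n items in m bits with k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


M_BITS = 43_132_763   # assumed: approximate m for n = 3,000,000 and p = 0.001
K = 10

fpr_now = expected_fpr(85_976, M_BITS, K)       # current fill
fpr_full = expected_fpr(3_000_000, M_BITS, K)   # designed capacity

print(f"current fill: {fpr_now:.3e}")
print(f"full capacity: {fpr_full:.3e}")
```

At the current fill the estimate is many orders of magnitude below the 0.1% target, and it only approaches 0.1% as the filter nears 3M entries.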
Installation
pip install dvara
Quick Start (no server needed)
dvara ships with a built-in filter. After installing, offline checks work immediately:
dvara check https://suspicious-site.com --offline
No API key, no Docker, no setup. Just install and check.
For running the backend server
pip install dvara[server]
CLI Usage
Check a URL (online mode — hits API)
dvara check https://suspicious-site.com
Check a URL (offline mode — local filter, zero network)
dvara check https://suspicious-site.com --offline
Show filter and API stats
dvara stats
Update local filter cache
dvara update
Run ingestion manually
dvara ingest
dvara ingest --dry-run
Running the Backend
With Docker Compose (recommended)
git clone https://github.com/dhruv-0512/dvara
cd dvara
docker compose up --build
This starts:
- FastAPI — API server on port 8000
- Redis — Bloom filter bitstring cache
- PostgreSQL — confirmed URLs table
Manually
pip install dvara[server]
# Build the filter
python -m dvara.ingestion
# Start the API
python -m uvicorn dvara.app:app --reload
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/check?url=... | GET | Two-stage URL check |
| /api/confirm?url=... | GET | Direct DB lookup |
| /api/stats | GET | Filter + connection stats |
| /api/reload | POST | Reload filter from disk |
| /health | GET | Health check |
Example response
{
  "url": "http://110.36.95.252:49267/bin.sh",
  "result": "MALICIOUS",
  "source": "urlhaus",
  "category": "malware_download",
  "latency_ms": 2.5,
  "stage": "db",
  "checked_at": "2026-05-02T13:01:04.767776+00:00"
}
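Calling the check endpoint from Python might look like the sketch below, which uses only the standard library. The base URL assumes a local backend on port 8000, and `build_check_url`, `check_url`, and `summarize` are hypothetical helpers, not part of the dvara package. It also assumes every verdict carries the `result` and `latency_ms` fields shown in the example response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: backend running locally


def build_check_url(target: str, base_url: str = BASE_URL) -> str:
    # The target URL must be percent-encoded before it goes into the query string.
    return f"{base_url}/api/check?{urlencode({'url': target})}"


def check_url(target: str, base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Send GET /api/check and return the decoded JSON body."""
    with urlopen(build_check_url(target, base_url), timeout=timeout) as resp:
        return json.load(resp)


def summarize(payload: dict) -> str:
    """Condense a response into one line."""
    parts = [payload["result"]]
    if payload.get("source"):
        parts.append(f"{payload['source']}/{payload.get('category')}")
    parts.append(f"{payload['latency_ms']}ms")
    return " | ".join(parts)
```

With the backend running, `summarize(check_url("http://110.36.95.252:49267/bin.sh"))` would condense a response like the one above into a single line.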
The Math
- n = 3,000,000 URLs, p = 0.001 (0.1% FPR)
- Bit array size: m = -(n × ln(p)) / (ln(2))² ≈ 43.1M bits ≈ 5.39 MB (5.14 MiB)
- Hash count: k = (m/n) × ln(2) ≈ 9.97, rounded to 10 hash functions
- Hash algorithm: MurmurHash3 with seeds 0–9
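The sizing formulas and the bit-level operations can be sketched as follows. This is an illustration under stated assumptions, not the contents of bloom.py. To stay standard-library only, it swaps MurmurHash3 for `hashlib.blake2b` with the seed passed as the salt, and the names `optimal_size` and `BloomFilter` here are hypothetical.

```python
import hashlib
import math


def optimal_size(n: int, p: float) -> tuple[int, int]:
    """Return (m bits, k hashes) for n items at target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


class BloomFilter:
    def __init__(self, n: int, p: float) -> None:
        self.m, self.k = optimal_size(n, p)
        self.bits = bytearray((self.m + 7) // 8)   # one bit per position

    def _positions(self, item: str):
        data = item.encode("utf-8")
        for seed in range(self.k):                 # seeds 0..k-1, one per hash
            # Assumption: blake2b with a per-seed salt stands in for MurmurHash3.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # A single unset bit proves the item was never added.
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


m, k = optimal_size(3_000_000, 0.001)
print(m, k, f"{m / 8 / 1_000_000:.2f} MB")
```

The membership test returns as soon as one bit is found unset, which is why clean URLs resolve so quickly.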
Why Bloom Filter and not a hash set?
Storing 3M URLs in a Python set takes on the order of 500MB or more. A Bloom Filter at 0.1% FPR takes about 5MB. A false positive only costs one extra DB confirm, which is acceptable, and a URL that is in the filter is never missed. For this workload the Bloom Filter is the right tool.
Why Redis and not disk?
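Multiple FastAPI workers need to read the same filter simultaneously. Loading it from disk gives each worker a private copy that has to be reloaded separately after every update. A Redis bitstring is a single copy that all workers read, on any host, so an update is visible to all of them at once and the API scales horizontally without extra coordination.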
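With redis-py, a shared bitstring is read and written through `getbit` and `setbit`. The sketch below batches the lookups for one URL in a pipeline so they cost a single round trip. The key name `dvara:bloom` and both helper functions are assumptions for illustration; how dvara actually lays out its Redis data is not stated here.

```python
from typing import Iterable

BLOOM_KEY = "dvara:bloom"  # hypothetical key name, not taken from dvara


def set_positions(client, positions: Iterable[int], key: str = BLOOM_KEY) -> None:
    """Ingestion side: turn on every bit position computed for one URL."""
    pipe = client.pipeline(transaction=False)
    for pos in positions:
        pipe.setbit(key, pos, 1)
    pipe.execute()


def might_contain(client, positions: Iterable[int], key: str = BLOOM_KEY) -> bool:
    """Request side: True only if every bit position for the URL is set."""
    pipe = client.pipeline(transaction=False)
    for pos in positions:
        pipe.getbit(key, pos)
    return all(pipe.execute())   # execute() returns one 0/1 reply per queued GETBIT


# Usage against a real server (requires the third-party redis package):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   set_positions(client, [17, 4242, 99001])
#   might_contain(client, [17, 4242, 99001])
```

Any worker holding a connection to the same Redis instance sees the same bits, so no per-worker copy of the filter is needed.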
Threat Feed Sources
| Feed | Format | URLs |
|---|---|---|
| URLhaus | CSV | ~26,000 |
| PhishTank | JSON (gzipped) | ~59,000 |
| OpenPhish | Plaintext | ~300 |
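Since the three feeds arrive in different formats, ingestion has to normalize them into one URL list before building the filter. The sketch below shows one way to do that on already-downloaded payloads. The column index for the CSV and the `url` key for the JSON are assumptions about the feed layouts, and none of these function names come from ingestion.py.

```python
import csv
import gzip
import json


def parse_urlhaus_csv(text: str, url_column: int = 2) -> list[str]:
    """Extract URLs from CSV text, skipping blank lines and '#' comment lines."""
    data_lines = (ln for ln in text.splitlines() if ln and not ln.startswith("#"))
    # Assumption: the URL sits in the third column (index 2).
    return [row[url_column] for row in csv.reader(data_lines) if len(row) > url_column]


def parse_phishtank_json_gz(blob: bytes, url_key: str = "url") -> list[str]:
    """Extract URLs from a gzipped JSON array of objects."""
    entries = json.loads(gzip.decompress(blob))
    # Assumption: each entry stores its URL under the "url" key.
    return [entry[url_key] for entry in entries if url_key in entry]


def parse_openphish_text(text: str) -> list[str]:
    """Extract URLs from plaintext with one URL per line."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def merge_feeds(*feeds: list[str]) -> list[str]:
    """Concatenate feeds and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(url for feed in feeds for url in feed))
```

Deduplicating before insertion keeps the reported URL count honest, although adding the same URL to a Bloom Filter twice is harmless.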
Project Structure
dvara/
├── bloom.py ← BloomFilter class (core)
├── ingestion.py ← Fetch feeds, build filter
├── app.py ← FastAPI backend
├── cli.py ← Click CLI commands
└── config.py ← Constants and env vars
Why "dvara"?
Dvara (द्वार) is the Sanskrit word for gateway or door.
Every URL is a gateway — dvara stands at that door and decides what gets through.
License
MIT