
High-speed malicious URL detection using a Bloom Filter


dvara

High-speed malicious URL detection using a Bloom Filter. Holds up to 3 million URLs in about 5MB of RAM.

Python 3.11+ | License: MIT

pip install dvara

dvara check https://suspicious-site.com
🚨 MALICIOUS | urlhaus | malware_download | 2.5ms | online

dvara check https://google.com
✅ CLEAN | 0.1ms | online

What is dvara?

dvara is a Python CLI and library for detecting malicious URLs using a Bloom Filter, a probabilistic data structure that Chrome's Safe Browsing client has also used.

It ingests threat feeds from URLhaus, PhishTank, and OpenPhish (~86,000 URLs, refreshed daily), stores them in a Bloom Filter of about 5MB, and answers for clean URLs in about 0.1ms without touching a database. Only URLs that hit the filter go on to a database lookup, which brings the full check to about 2.5ms.


Architecture

Daily ingestion job
→ Pulls URLhaus + PhishTank + OpenPhish (~86k URLs today; filter sized for 3M)
→ Builds Bloom Filter (5.14 MiB, 0.1% target FPR)
→ Saves to ~/.dvara/filter.bin

dvara check [url]  (online mode)
→ FastAPI backend
→ Hash URL → check 10 bit positions in Bloom Filter
→ Any bit OFF  → CLEAN instantly (0.1ms, DB never touched)
→ All bits ON  → query PostgreSQL confirmed_urls table
→ Found        → MALICIOUS + source + category
→ Not found    → SUSPICIOUS (Bloom Filter false positive)

dvara check [url] --offline
→ Loads filter from ~/.dvara/filter.bin
→ Checks locally, zero network calls
→ dvara update to refresh
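
The offline flow above comes down to writing the bit array to a file once and reading it back before each local check. The sketch below shows one way that could look. The on-disk layout (a small header followed by the raw bits) and the names `save_filter` and `load_filter` are assumptions for illustration, not dvara's actual file format or API.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical layout: bit count (uint64) and hash count (uint32), then the raw bit array.
HEADER = struct.Struct("<QI")

def save_filter(path: Path, m_bits: int, k_hashes: int, bits: bytes) -> None:
    """Write the header and the bit array in a single file."""
    path.write_bytes(HEADER.pack(m_bits, k_hashes) + bits)

def load_filter(path: Path) -> tuple[int, int, bytearray]:
    """Read the file back into (bit count, hash count, mutable bit array)."""
    raw = path.read_bytes()
    m_bits, k_hashes = HEADER.unpack_from(raw)
    return m_bits, k_hashes, bytearray(raw[HEADER.size:])

# Round trip through a temporary directory instead of ~/.dvara/filter.bin.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "filter.bin"
    save_filter(target, 64, 10, bytes(8))
    loaded = load_filter(target)

print(loaded[:2])  # (64, 10)
```

Keeping the parameters in the file means a reader does not have to guess how many bits and hashes the filter was built with.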

Two-stage design (the key insight)

| Stage | What | Latency | When |
|---|---|---|---|
| 1: Bloom Filter | Redis bitstring, 10 hash lookups | 0.1ms | Every request |
| 2: PostgreSQL | confirmed_urls table lookup | 1–3ms | Only on bloom hits |

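Clean URLs never touch the database, apart from the rare false positive that the database lookup then resolves. False negatives are impossible by construction: a URL that was added to the filter always has all of its bits set, so it is never reported CLEAN.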
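
The control flow of the two stages can be sketched in a few lines. This is an illustration rather than dvara's code: `two_stage_check` is a made-up name, a plain callable stands in for the Bloom Filter, a dict stands in for the confirmed_urls table, and the `"bloom"` stage label is an assumption (the README only shows `"db"`).

```python
from typing import Callable, Mapping

def two_stage_check(
    url: str,
    maybe_in_filter: Callable[[str], bool],  # stage 1: Bloom Filter membership test
    confirmed: Mapping[str, dict],           # stage 2: stand-in for confirmed_urls
) -> dict:
    # Stage 1: a negative answer from the filter is definitive, so return early.
    if not maybe_in_filter(url):
        return {"url": url, "result": "CLEAN", "stage": "bloom"}  # label is hypothetical
    # Stage 2: the filter said "maybe", so only now consult the slower store.
    record = confirmed.get(url)
    if record is None:
        # Filter hit without a confirming row: a Bloom Filter false positive.
        return {"url": url, "result": "SUSPICIOUS", "stage": "db"}
    return {"url": url, "result": "MALICIOUS", "stage": "db", **record}

# Toy data: a set plays the filter, plus one forced false positive.
known = {"http://bad.example/payload"}
table = {"http://bad.example/payload": {"source": "urlhaus", "category": "malware_download"}}

def maybe(url: str) -> bool:
    return url in known or url == "http://collision.example/"

print(two_stage_check("https://good.example/", maybe, table)["result"])       # CLEAN
print(two_stage_check("http://bad.example/payload", maybe, table)["result"])  # MALICIOUS
print(two_stage_check("http://collision.example/", maybe, table)["result"])   # SUSPICIOUS
```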


Benchmarks

| Metric | Result |
|---|---|
| Clean URL check | 0.1ms |
| Malicious URL check (full pipeline) | 2.5ms |
| URLs stored | 85,976 (scales to 3M) |
| Filter size | 5.14 MiB |
| False negative rate | 0% for URLs in the filter (guaranteed) |
| Target false positive rate | 0.1% |
| Actual false positive rate | ~0% at current fill |
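
The "~0% at current fill" figure follows from the standard Bloom Filter estimate, FPR ≈ (1 − e^(−kn/m))^k. The sketch below evaluates it for the current fill and for the design capacity. The bit count `M_BITS` is an assumption derived from the sizing formula for n = 3M and p = 0.001, and `expected_fpr` is an illustrative helper, not part of dvara.

```python
import math

def expected_fpr(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Standard Bloom Filter estimate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

M_BITS = 43_132_763  # assumed: sizing formula result for n = 3M, p = 0.001
K = 10

fpr_now = expected_fpr(M_BITS, K, 85_976)      # fill reported in the benchmarks
fpr_full = expected_fpr(M_BITS, K, 3_000_000)  # design capacity

print(f"current fill: {fpr_now:.1e}")
print(f"at capacity:  {fpr_full:.1e}")
```

With only about 86k of 3M slots used, the estimate comes out at roughly 1e-17, which is why false positives are effectively absent today. It climbs toward the 0.1% target as the filter fills.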

Installation

pip install dvara

For running the backend server

pip install "dvara[server]"

CLI Usage

Check a URL (online mode — hits API)

dvara check https://suspicious-site.com

Check a URL (offline mode — local filter, zero network)

dvara check https://suspicious-site.com --offline

Show filter and API stats

dvara stats

Update local filter cache

dvara update

Run ingestion manually

dvara ingest
dvara ingest --dry-run

Running the Backend

With Docker Compose (recommended)

git clone https://github.com/yourusername/dvara
cd dvara
docker compose up --build

This starts:

  • FastAPI — API server on port 8000
  • Redis — Bloom filter bitstring cache
  • PostgreSQL — confirmed URLs table

Manually

pip install "dvara[server]"

# Build the filter
python -m dvara.ingestion

# Start the API
python -m uvicorn dvara.app:app --reload

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/check?url=... | GET | Two-stage URL check |
| /api/confirm?url=... | GET | Direct DB lookup |
| /api/stats | GET | Filter + connection stats |
| /api/reload | POST | Reload filter from disk |
| /health | GET | Health check |

Example response

{
  "url": "http://110.36.95.252:49267/bin.sh",
  "result": "MALICIOUS",
  "source": "urlhaus",
  "category": "malware_download",
  "latency_ms": 2.5,
  "stage": "db",
  "checked_at": "2026-05-02T13:01:04.767776+00:00"
}
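
When calling `/api/check` from code, the target URL has to be percent-encoded so that its own `:`, `/`, `?` and `&` characters are not parsed as part of the request. A minimal client-side sketch using only the standard library follows. The base URL assumes the port 8000 server from the Docker Compose setup, and `build_check_url` and `is_malicious` are illustrative helpers, not part of the dvara library.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumed: local backend on the documented port

def build_check_url(target: str, base_url: str = BASE_URL) -> str:
    """Build the request URL with the target percent-encoded as one query parameter."""
    return f"{base_url}/api/check?{urlencode({'url': target})}"

def is_malicious(response_body: str) -> bool:
    """Interpret a JSON response shaped like the example above."""
    return json.loads(response_body).get("result") == "MALICIOUS"

print(build_check_url("http://110.36.95.252:49267/bin.sh"))
# http://localhost:8000/api/check?url=http%3A%2F%2F110.36.95.252%3A49267%2Fbin.sh
```

The resulting string can be passed to `urllib.request.urlopen` (or any HTTP client) once the backend is running.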

The Math

  • n = 3,000,000 URLs, p = 0.001 (0.1% FPR)
  • Bit array size: m = -(n × ln(p)) / (ln(2))² ≈ 43.1M bits ≈ 5.39 MB (5.14 MiB)
  • Hash count: k = (m/n) × ln(2) ≈ 9.97, rounded to 10 hash functions
  • Hash algorithm: MurmurHash3 with seeds 0–9
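
The two formulas are easy to check numerically. The sketch below recomputes m and k for the stated n and p. `bloom_parameters` is an illustrative helper name, not a function from dvara.

```python
import math

def bloom_parameters(n_items: int, fpr: float) -> tuple[int, int]:
    """Return (bit count m, hash count k) for n items at the target false positive rate."""
    m = math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2)
    k = round(m / n_items * math.log(2))
    return m, k

m, k = bloom_parameters(3_000_000, 0.001)
print(m, k)                                              # about 43.1 million bits, 10 hashes
print(f"{m / 8 / 1e6:.2f} MB, {m / 8 / 2**20:.2f} MiB")  # 5.39 MB, 5.14 MiB
```

The same bit count reads as 5.39 MB in decimal units or 5.14 MiB in binary units. The second figure matches the 5.14 filter size reported in the benchmarks.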

Why Bloom Filter and not a hash set?

Storing 3M URLs in a Python hash set takes 500MB+. A Bloom Filter at 0.1% FPR takes about 5.14 MiB, roughly 100× smaller. False positives only trigger the DB confirm, which is acceptable. False negatives cannot occur for URLs that were inserted. That trade makes the Bloom Filter the right tool here.
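
To make the trade-off concrete, here is a minimal Bloom Filter sketch. It is not the code in bloom.py: to stay within the standard library it derives the bit positions from salted BLAKE2b digests, where the README states MurmurHash3 with seeds 0–9, and the class and method names are assumptions.

```python
import hashlib
from typing import Iterator

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # one bit per position, rounded up to bytes

    def _positions(self, item: str) -> Iterator[int]:
        # One salted digest per seed; the salt plays the role of a MurmurHash3 seed.
        for seed in range(self.k):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # Any unset bit proves the item was never added; all set means "maybe present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))

bf = BloomFilter(m_bits=43_132_763, k_hashes=10)  # about 5.14 MiB of bits
bf.add("http://bad.example/payload")
print("http://bad.example/payload" in bf)  # True: inserted items always match
print("https://good.example/" in bf)       # False with overwhelming probability
```

Memory use is fixed by `m_bits` and does not grow with the number or length of the URLs added, which is where the saving over a hash set comes from.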

Why Redis and not disk?

Multiple FastAPI workers need to read the same filter simultaneously. Reading it from disk means coordinating access (locking) while the file is being rewritten. A Redis bitstring is a single copy that every worker reads, so workers can be added without extra coordination.
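
A membership test against a Redis bitstring can be done in one round trip by pipelining the GETBIT calls. The sketch below assumes a redis-py style client (`pipeline()`, `getbit()`, `execute()`). The function name, the key name, and the in-memory stand-in class are illustrative assumptions; the stand-in exists only so the example runs without a server, and this is not necessarily how dvara talks to Redis.

```python
from typing import Iterable

def positions_set(client, key: str, positions: Iterable[int]) -> bool:
    """Return True only if every bit position is 1, using one pipelined round trip."""
    pipe = client.pipeline()
    for pos in positions:
        pipe.getbit(key, pos)
    return all(pipe.execute())

class FakeRedis:
    """In-memory stand-in exposing the three calls used above (not part of redis-py)."""

    def __init__(self, set_bits: Iterable[int]) -> None:
        self._set_bits = set(set_bits)
        self._queued: list[int] = []

    def pipeline(self) -> "FakeRedis":
        return self

    def getbit(self, key: str, offset: int) -> "FakeRedis":
        self._queued.append(int(offset in self._set_bits))
        return self

    def execute(self) -> list[int]:
        results, self._queued = self._queued, []
        return results

client = FakeRedis({3, 17, 42})
print(positions_set(client, "dvara:filter", [3, 17, 42]))  # True: all bits set
print(positions_set(client, "dvara:filter", [3, 17, 99]))  # False: bit 99 is unset
```

With a real server, `client` would be something like `redis.Redis(host="localhost", port=6379)`, and the same function would work unchanged.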


Threat Feed Sources

| Feed | Format | URLs |
|---|---|---|
| URLhaus | CSV | ~26,000 |
| PhishTank | JSON (gzipped) | ~59,000 |
| OpenPhish | Plaintext | ~300 |
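
Because the feeds arrive in different formats, ingestion has to reduce each one to a plain list of URLs and then merge them. The sketch below covers the plaintext and CSV cases on made-up sample data. The helper names are illustrative, and the CSV column index and the `#` comment convention are assumptions passed in or stated here, not a description of any feed's real layout.

```python
import csv
import io

def urls_from_plaintext(text: str) -> list[str]:
    """One URL per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def urls_from_csv(text: str, url_column: int) -> list[str]:
    """Read URLs from one CSV column, ignoring lines that start with '#'."""
    rows = csv.reader(line for line in io.StringIO(text) if not line.startswith("#"))
    return [row[url_column] for row in rows if len(row) > url_column]

# Made-up sample data; real feeds would be downloaded first.
plain_feed = "http://phish.example/login\n\nhttp://phish.example/verify\n"
csv_feed = '# id,url\n"1","http://bad.example/a"\n"2","http://phish.example/login"\n'

# A set merge removes URLs that appear in more than one feed.
merged = sorted(set(urls_from_plaintext(plain_feed)) | set(urls_from_csv(csv_feed, url_column=1)))
print(len(merged))  # 3
```

Deduplicating before insertion keeps the stored-URL count honest; adding the same URL twice to a Bloom Filter is harmless but would inflate the statistics.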

Project Structure

dvara/
├── bloom.py        ← BloomFilter class (core)
├── ingestion.py    ← Fetch feeds, build filter
├── app.py          ← FastAPI backend
├── cli.py          ← Click CLI commands
└── config.py       ← Constants and env vars

License

MIT
