
High-speed malicious URL detection using a Bloom Filter


dvara

High-speed malicious URL detection using a Bloom Filter. Checks URLs against up to 3 million known-bad entries in about 5MB of RAM.

Requires Python 3.11+. License: MIT.

pip install dvara

dvara check https://suspicious-site.com
🚨 MALICIOUS | urlhaus | malware_download | 2.5ms | online

dvara check https://google.com
✅ CLEAN | 0.1ms | online

What is dvara?

dvara is a Python CLI and library for detecting malicious URLs using a Bloom Filter, a probabilistic data structure that early versions of Chrome's Safe Browsing client also used.

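It ingests threat feeds from URLhaus, PhishTank, and OpenPhish (~86,000 URLs, refreshed daily), stores them in a Bloom Filter of about 5MB, and clears a clean URL in roughly 0.1ms without touching a database. Only URLs that hit the filter go on to a database lookup, which brings the full check to about 2.5ms.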


Architecture

Daily ingestion job
→ Pulls URLhaus + PhishTank + OpenPhish (~86k URLs today, sized for up to 3M)
→ Builds Bloom Filter (5.14 MiB, 0.1% target FPR)
→ Saves to ~/.dvara/filter.bin

dvara check [url]  (online mode)
→ FastAPI backend
→ Hash URL → check 10 bit positions in Bloom Filter
→ Any bit OFF  → CLEAN instantly (0.1ms, DB never touched)
→ All bits ON  → query PostgreSQL confirmed_urls table
→ Found        → MALICIOUS + source + category
→ Not found    → SUSPICIOUS (false positive)

dvara check [url] --offline
→ Loads filter from ~/.dvara/filter.bin
→ Checks locally, zero network calls
→ dvara update to refresh

Two-stage design (the key insight)

| Stage | What | Latency | When |
| --- | --- | --- | --- |
| 1: Bloom Filter | Redis bitstring, 10 hash lookups | 0.1ms | Every request |
| 2: PostgreSQL | confirmed_urls table lookup | 1–3ms | Only on bloom hits |

Clean URLs never touch the database. A URL that was added to the filter always produces a hit, so false negatives are impossible for URLs present in the ingested feeds.
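The two-stage flow can be sketched in a few lines. This is a minimal illustration, not dvara's actual `bloom.py` or `app.py`: the `BloomFilter` class and `check_url` function are hypothetical, a plain dict stands in for the PostgreSQL `confirmed_urls` table, salted BLAKE2b from the standard library stands in for seeded MurmurHash3, and the `"bloom"` stage label is an assumption.

```python
import hashlib


class BloomFilter:
    """Tiny in-memory Bloom filter (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One salted digest per seed; a stand-in for MurmurHash3 with seeds 0..k-1.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # A single unset bit proves the item was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def check_url(url: str, bloom: BloomFilter, confirmed: dict) -> dict:
    """Stage 1: Bloom filter. Stage 2: confirmed-URL lookup, only on a hit."""
    if url not in bloom:
        return {"result": "CLEAN", "stage": "bloom"}
    record = confirmed.get(url)
    if record is None:
        # The filter said "maybe" but the table disagrees: a false positive.
        return {"result": "SUSPICIOUS", "stage": "db"}
    return {"result": "MALICIOUS", "stage": "db", **record}


bloom = BloomFilter(num_bits=10_000, num_hashes=10)
confirmed = {
    "http://bad.example/x": {"source": "urlhaus", "category": "malware_download"}
}
bloom.add("http://bad.example/x")
print(check_url("http://bad.example/x", bloom, confirmed))
print(check_url("https://good.example/", bloom, confirmed))
```

The second stage is reached only when every probed bit is set, which is why the database load depends on the number of hits rather than on total traffic.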


Benchmarks

| Metric | Result |
| --- | --- |
| Clean URL check | 0.1ms |
| Malicious URL check (full pipeline) | 2.5ms |
| URLs stored | 85,976 (scales to 3M) |
| Filter size | 5.14 MiB |
| False negative rate | 0% for URLs in the filter (guaranteed) |
| Target false positive rate | 0.1% |
| Actual false positive rate | ~0% at current fill |
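The gap between the target and actual false positive rates follows from the standard Bloom filter estimate (1 - e^(-kn/m))^k. The sketch below evaluates it at the current fill and at the design capacity, assuming the filter is sized for 3,000,000 URLs at p = 0.001 with k = 10; the function name `expected_fpr` is illustrative.

```python
import math


def expected_fpr(n: int, m: int, k: int) -> float:
    """Approximate false positive rate for n items, m bits, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Bit count for the design capacity (3M URLs at 0.1% FPR).
m_bits = math.ceil(-3_000_000 * math.log(0.001) / math.log(2) ** 2)
k = 10

print(f"at 85,976 URLs: {expected_fpr(85_976, m_bits, k):.3e}")
print(f"at 3,000,000 URLs: {expected_fpr(3_000_000, m_bits, k):.3e}")
```

With under 3% of the design capacity in use, almost every bit is still zero, so the expected rate is far below one in a trillion. It climbs toward the 0.1% target only as the filter approaches 3M entries.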

Installation

pip install dvara

For running the backend server

pip install "dvara[server]"

CLI Usage

Check a URL (online mode — hits API)

dvara check https://suspicious-site.com

Check a URL (offline mode — local filter, zero network)

dvara check https://suspicious-site.com --offline

Show filter and API stats

dvara stats

Update local filter cache

dvara update

Run ingestion manually

dvara ingest
dvara ingest --dry-run

Running the Backend

With Docker Compose (recommended)

git clone https://github.com/yourusername/dvara
cd dvara
docker compose up --build

This starts:

  • FastAPI — API server on port 8000
  • Redis — Bloom filter bitstring cache
  • PostgreSQL — confirmed URLs table

Manually

pip install "dvara[server]"

# Build the filter
python -m dvara.ingestion

# Start the API
python -m uvicorn dvara.app:app --reload

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/check?url=... | GET | Two-stage URL check |
| /api/confirm?url=... | GET | Direct DB lookup |
| /api/stats | GET | Filter + connection stats |
| /api/reload | POST | Reload filter from disk |
| /health | GET | Health check |

Example response

{
  "url": "http://110.36.95.252:49267/bin.sh",
  "result": "MALICIOUS",
  "source": "urlhaus",
  "category": "malware_download",
  "latency_ms": 2.5,
  "stage": "db",
  "checked_at": "2026-05-02T13:01:04.767776+00:00"
}
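Because the URL being checked travels inside a query parameter, it has to be percent-encoded, or its own `?` and `&` characters will be read as part of the request. The sketch below builds a request URL and reads the fields shown in the example response. The helper names and the `http://localhost:8000` base address are assumptions (port 8000 comes from the Docker Compose setup), and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def build_check_url(base: str, target: str) -> str:
    """Return the /api/check request URL with the target safely encoded."""
    return f"{base.rstrip('/')}/api/check?{urlencode({'url': target})}"


def parse_result(payload: str) -> tuple:
    """Extract (result, source, category) from a JSON response body."""
    data = json.loads(payload)
    return data["result"], data.get("source"), data.get("category")


request_url = build_check_url(
    "http://localhost:8000", "http://110.36.95.252:49267/bin.sh"
)
print(request_url)

sample = '{"url": "http://110.36.95.252:49267/bin.sh", "result": "MALICIOUS", "source": "urlhaus", "category": "malware_download", "latency_ms": 2.5, "stage": "db"}'
print(parse_result(sample))
```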

The Math

  • n = 3,000,000 URLs, p = 0.001 (0.1% FPR)
  • Bit array size: m = -(n × ln(p)) / (ln(2))² ≈ 43.1M bits ≈ 5.39 MB (5.14 MiB)
  • Hash count: k = (m/n) × ln(2) ≈ 9.97, rounded to 10 hash functions
  • Hash algorithm: MurmurHash3 with seeds 0–9
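The sizing formulas above can be checked with a few lines of standard-library Python. The function name `bloom_parameters` is illustrative and not part of dvara's API.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple:
    """Return (bit count m, hash count k) for n items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


m, k = bloom_parameters(3_000_000, 0.001)
print(f"m = {m:,} bits")
print(f"size = {m / 8 / 1e6:.2f} MB = {m / 8 / 2**20:.2f} MiB")
print(f"k = {k}")
```

The two size figures differ only in units: megabytes count 10^6 bytes, mebibytes count 2^20 bytes.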

Why Bloom Filter and not a hash set?

Storing 3M URLs in a Python hash set takes 500MB+. A Bloom Filter at 0.1% FPR takes about 5.4 MB (5.14 MiB). A false positive only triggers the DB confirm step, which is acceptable, and a URL that was added to the filter can never be reported as clean. For this workload, the Bloom Filter is the right tool.

Why Redis and not disk?

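Multiple FastAPI workers need to read the same filter at the same time. With a file on disk, each worker loads its own copy into memory and has to be told to reload after every ingestion run. A Redis bitstring gives all workers, on one host or many, a single shared copy that ingestion updates in one place, which makes horizontal scaling straightforward.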
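A rough sketch of how a worker might probe a shared bitstring: queue one `GETBIT` per hash position on a pipeline, so all of them travel in a single round trip. The function works with any client exposing redis-py's `pipeline()`, `getbit()`, and `execute()` methods. The `FakeRedis` classes are an in-memory stand-in so the example runs without a server, and the key name `dvara:bloom` is hypothetical.

```python
class FakePipeline:
    """Queues GETBIT calls and answers them from an in-memory store."""

    def __init__(self, store: dict) -> None:
        self._store = store
        self._ops = []

    def getbit(self, key: str, offset: int) -> "FakePipeline":
        self._ops.append((key, offset))
        return self

    def execute(self) -> list:
        results = []
        for key, offset in self._ops:
            data = self._store.get(key, b"")
            byte_index, bit_index = divmod(offset, 8)
            if byte_index >= len(data):
                results.append(0)  # Redis treats bits past the end as 0
            else:
                # Redis numbers bits from the most significant bit of each byte.
                results.append((data[byte_index] >> (7 - bit_index)) & 1)
        self._ops = []
        return results


class FakeRedis:
    """Minimal stand-in for a Redis client (SETBIT and pipelines only)."""

    def __init__(self) -> None:
        self._data = {}

    def setbit(self, key: str, offset: int, value: int) -> int:
        buf = self._data.setdefault(key, bytearray())
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(buf):
            buf.extend(bytes(byte_index + 1 - len(buf)))
        mask = 1 << (7 - bit_index)
        old = 1 if buf[byte_index] & mask else 0
        if value:
            buf[byte_index] |= mask
        else:
            buf[byte_index] &= ~mask & 0xFF
        return old

    def pipeline(self) -> FakePipeline:
        return FakePipeline(self._data)


def maybe_present(client, key: str, positions) -> bool:
    """True only if every probed bit is set; one pipeline means one round trip."""
    pipe = client.pipeline()
    for pos in positions:
        pipe.getbit(key, pos)
    return all(pipe.execute())


client = FakeRedis()
for pos in (3, 17, 4000):
    client.setbit("dvara:bloom", pos, 1)
print(maybe_present(client, "dvara:bloom", [3, 17, 4000]))
print(maybe_present(client, "dvara:bloom", [3, 18, 4000]))
```

With a real server, the same `maybe_present` function should accept a `redis.Redis` instance in place of `FakeRedis`.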


Threat Feed Sources

| Feed | Format | URLs |
| --- | --- | --- |
| URLhaus | CSV | ~26,000 |
| PhishTank | JSON (gzipped) | ~59,000 |
| OpenPhish | Plaintext | ~300 |
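Since the feeds arrive in different formats and can overlap, ingestion has to normalize them into one deduplicated set before building the filter. The sketch below shows one way to do that for a CSV feed and a plaintext feed. The sample data, the `url` column name, the `#` comment convention, and all function names are assumptions for illustration; this is not dvara's `ingestion.py`, and the real feed layouts should be checked against each provider's documentation.

```python
import csv
import io


def urls_from_csv(text: str, column: str = "url") -> list:
    """Read one column from CSV text, skipping lines that start with '#'."""
    rows = (line for line in io.StringIO(text) if not line.startswith("#"))
    return [row[column].strip() for row in csv.DictReader(rows) if row.get(column)]


def urls_from_plaintext(text: str) -> list:
    """Read one URL per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def merge_feeds(*feeds) -> dict:
    """Map each URL to the first feed that reported it."""
    merged = {}
    for source, urls in feeds:
        for url in urls:
            merged.setdefault(url, source)
    return merged


csv_text = (
    "# hypothetical sample feed\n"
    "id,url,threat\n"
    "1,http://a.example/x,malware_download\n"
    "2,http://b.example/y,malware_download\n"
)
plain_text = "http://b.example/y\nhttp://c.example/z\n"

merged = merge_feeds(
    ("urlhaus", urls_from_csv(csv_text)),
    ("openphish", urls_from_plaintext(plain_text)),
)
print(len(merged), merged)
```

Keeping the source next to each URL is what lets the confirm step report a `source` field alongside the verdict.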

Project Structure

dvara/
├── bloom.py        ← BloomFilter class (core)
├── ingestion.py    ← Fetch feeds, build filter
├── app.py          ← FastAPI backend
├── cli.py          ← Click CLI commands
└── config.py       ← Constants and env vars

License

MIT
