High-speed malicious URL detection using a Bloom Filter

dvara

High-speed malicious URL detection using a probabilistic Bloom Filter pipeline.

Python 3.11+ | License: MIT

pip install dvara

dvara check https://google.com
✅ CLEAN | 0.03ms | online

dvara check "http://xn--90abegbttpjb3bzb2j.xn--p1ai/doc/En/ACCOUNT/Auditor-of-State-Notification-of-EFT-Deposit"
🚨 MALICIOUS | 213.2ms | online

What is dvara?

dvara is a Python CLI and backend system for malicious URL detection using a probabilistic Bloom Filter architecture inspired by systems like Google Safe Browsing.

It ingests live threat intelligence feeds from:

  • URLhaus
  • PhishTank
  • OpenPhish
  • Cert.pl

and currently indexes:

268,970 confirmed malicious URLs

inside a compressed Bloom Filter occupying only:

5.14 MB

Most clean URLs are resolved entirely in-memory without touching the database.

Only Bloom filter hits trigger PostgreSQL confirmation.


Architecture

Threat feeds
    ↓
URL normalization + deduplication
    ↓
Bloom Filter generation
    ↓
PostgreSQL confirmed_urls database
    ↓
FastAPI backend deployment
    ↓
CLI / API URL checks
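
The normalization and deduplication stage could look roughly like the sketch below. The specific rules (lowercasing the host, dropping default ports and fragments, forcing a non-empty path) and the names `normalize_url` and `dedupe` are assumptions made for illustration; the rules dvara actually applies are not described here.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(raw: str) -> str:
    """Return a canonical form of `raw` (illustrative rules only)."""
    parts = urlsplit(raw.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme default.
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe(urls):
    """Yield each normalized URL once, preserving first-seen order."""
    seen = set()
    for raw in urls:
        url = normalize_url(raw)
        if url not in seen:
            seen.add(url)
            yield url


feed = ["http://example.com", "http://EXAMPLE.com/", "http://example.com:8080/"]
print(list(dedupe(feed)))
```

Normalizing before insertion matters because the Bloom Filter and the SHA256 keys both operate on exact byte strings: two spellings of the same URL would otherwise be indexed as different entries.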

URL check pipeline

dvara check [url]
    ↓
Bloom Filter lookup (~3µs local)
    ↓
No match
    → CLEAN instantly

Possible match
    ↓
SHA256(url)
    ↓
PostgreSQL confirmation lookup
    ↓
MALICIOUS or SUSPICIOUS
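
The pipeline above can be sketched in a few lines. This is a minimal illustration, not dvara's code: a plain Python set stands in for the Bloom Filter (with one deliberately planted false positive), a set of digests stands in for PostgreSQL, and the mapping "confirmed hit is MALICIOUS, unconfirmed hit is SUSPICIOUS" is an assumption about how the two outcomes are assigned. The function name `check_url` is hypothetical.

```python
import hashlib


def check_url(url, bloom, confirmed_hashes):
    """Two-stage check: cheap membership test first, exact confirmation second.

    `bloom` is anything supporting `in` (stand-in for the Bloom Filter);
    `confirmed_hashes` is a set of SHA-256 hex digests (stand-in for PostgreSQL).
    """
    if url not in bloom:
        return "CLEAN"  # no match: answered without touching the database
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Assumption: a confirmed hit is MALICIOUS, an unconfirmed one SUSPICIOUS.
    return "MALICIOUS" if key in confirmed_hashes else "SUSPICIOUS"


bad = "http://bad.example/payload"
false_positive = "http://unlucky.example/"
bloom = {bad, false_positive}  # plain set simulating one false positive
confirmed = {hashlib.sha256(bad.encode("utf-8")).hexdigest()}

for u in ("https://example.com/", bad, false_positive):
    print(u, "->", check_url(u, bloom, confirmed))
```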

Why Bloom Filters?

A conventional hash set holding millions of URLs can consume hundreds of megabytes of RAM.

Bloom Filters allow:

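  • massive memory compression
  • constant-time lookups (k hash probes, independent of how many URLs are stored)
  • zero false negatives (a URL that was added is never reported as absent)
  • extremely high throughput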

Tradeoff:

  • small false positive probability

False positives are resolved using PostgreSQL confirmation.
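
To make these properties concrete, here is a small self-contained Bloom Filter. It is a sketch under stated assumptions: it derives its k bit positions from a single SHA-256 digest using double hashing so that it runs on the standard library alone, whereas dvara uses MurmurHash3 for Bloom lookups. The class and its parameters are illustrative and do not mirror dvara's `bloom.py`.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, item: str):
        # Double hashing: position_i = h1 + i * h2 (mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indexes(item))


# Sized for roughly 1,000 items at a 0.1% target false positive rate.
bf = BloomFilter(num_bits=14_378, num_hashes=10)
inserted = [f"http://bad.example/{i}" for i in range(1_000)]
for url in inserted:
    bf.add(url)

missed = sum(url not in bf for url in inserted)
false_hits = sum(f"http://clean.example/{i}" in bf for i in range(10_000))
print("false negatives:", missed)
print("false positives out of 10,000:", false_hits)
```

Every bit set by `add` stays set, so an inserted URL can never fail the `in` test; that is where the no-false-negatives guarantee comes from. False positives arise only when unrelated insertions happen to cover all k positions of a URL that was never added.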


Benchmarks

Generated using:

python -m dvara.benchmarks

Metric                      Result
Local Bloom lookup latency  ~0.003ms (3µs)
Throughput                  ~145k URLs/sec
Indexed malicious URLs      268,970
Filter size                 5.14 MB
Peak RAM usage              ~10.53 MB
False negatives             0 observed
False positives             0 / 100,000 tested
Bloom capacity              3,000,000 URLs

Benchmark latency refers to local in-memory Bloom Filter checks. Network/API requests are naturally slower due to HTTP and database confirmation stages.


Threat Intelligence Sources

Feed       Type
URLhaus    Malware URLs
PhishTank  Verified phishing URLs
OpenPhish  Active phishing feeds
Cert.pl    Malicious domains

Installation

CLI only

pip install dvara

Backend/server dependencies

pip install "dvara[server]"

CLI Usage

Check URL (online)

dvara check https://example.com

Check URL (offline)

dvara check https://example.com --offline

Show stats

dvara stats

Update local filter

dvara update

Run ingestion

dvara ingest

Running the Backend

Docker Compose

git clone https://github.com/dhruv-0512/dvara
cd dvara

docker compose up --build

Services:

  • FastAPI
  • PostgreSQL
  • Redis

Manual setup

pip install "dvara[server]"

python -m dvara.ingestion

uvicorn dvara.app:app --reload

API Endpoints

Endpoint      Description
/api/check    Full two-stage URL check
/api/confirm  Direct PostgreSQL lookup
/api/stats    Bloom + backend stats
/api/reload   Reload filter
/health       Health check

Example API Response

{
  "url": "http://malicious-site.com",
  "result": "MALICIOUS",
  "latency_ms": 213.2,
  "stage": "db",
  "checked_at": "2026-05-09T09:08:32.663182+00:00"
}
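
A client might turn this response into a typed object as sketched below. Only the field names and sample values come from the example response above; the `CheckResult` dataclass and `parse_check_response` helper are hypothetical names, and the sketch deliberately parses a literal string rather than guessing the HTTP method or request parameters of `/api/check`.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CheckResult:
    url: str
    result: str
    latency_ms: float
    stage: str
    checked_at: datetime


def parse_check_response(body: str) -> CheckResult:
    """Parse a JSON body shaped like the example response."""
    data = json.loads(body)
    return CheckResult(
        url=data["url"],
        result=data["result"],
        latency_ms=float(data["latency_ms"]),
        stage=data["stage"],
        # ISO 8601 timestamp with an explicit UTC offset.
        checked_at=datetime.fromisoformat(data["checked_at"]),
    )


sample = """{
  "url": "http://malicious-site.com",
  "result": "MALICIOUS",
  "latency_ms": 213.2,
  "stage": "db",
  "checked_at": "2026-05-09T09:08:32.663182+00:00"
}"""

parsed = parse_check_response(sample)
print(parsed.result, parsed.stage, parsed.checked_at.isoformat())
```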

Project Structure

dvara/
├── app.py
├── bloom.py
├── cli.py
├── config.py
├── ingestion.py
└── benchmarks.py

Technical Details

Bloom Filter Parameters

Capacity:            3,000,000 URLs
Target FPR:          0.1%
Hash functions (k):  10
Current fill ratio:  ~6%
Filter size:         5.14 MB
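
These parameters are consistent with the textbook Bloom Filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below assumes dvara sizes its filter this way (the source does not say so explicitly) and recomputes the figures from the stated capacity and target FPR. The function names are illustrative.

```python
import math


def optimal_bits(capacity: int, fpr: float) -> int:
    """Bits needed to hold `capacity` items at false positive rate `fpr`."""
    return math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2)


def optimal_hashes(num_bits: int, capacity: int) -> int:
    """Number of hash functions that minimizes the FPR at full capacity."""
    return round(num_bits / capacity * math.log(2))


def fill_ratio(num_bits: int, num_hashes: int, inserted: int) -> float:
    """Expected fraction of bits set after `inserted` items."""
    return 1.0 - math.exp(-num_hashes * inserted / num_bits)


m = optimal_bits(3_000_000, 0.001)
k = optimal_hashes(m, 3_000_000)
fill = fill_ratio(m, k, 268_970)

print(f"bits: {m:,}")
print(f"size: {m / 8 / 2**20:.2f} MiB")
print(f"hash functions: {k}")
print(f"fill ratio at 268,970 URLs: {fill:.1%}")
print(f"expected FPR at that fill: {fill ** k:.1e}")
```

Under these assumptions the result is about 43.1 million bits, which matches the 5.14 MB figure when MB is read as mebibytes (2^20 bytes), together with k = 10 and a fill ratio of roughly 6%. Because the filter is far below capacity, the expected false positive rate at the current fill is many orders of magnitude below the 0.1% target, which fits the benchmark observation of 0 false positives in 100,000 tests.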

Hashing

  • MurmurHash3 for Bloom lookups
  • SHA256 for PostgreSQL confirmation keys
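
The confirmation key can be illustrated as follows. This is a sketch only: an in-memory SQLite database stands in for PostgreSQL so the example runs anywhere, the table name `confirmed_urls` comes from the architecture diagram, and the column names `url_sha256` and `source` as well as the helpers `url_key` and `is_confirmed` are assumptions.

```python
import hashlib
import sqlite3


def url_key(url: str) -> str:
    """Fixed-length lookup key: the SHA-256 hex digest of the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# SQLite stand-in for the PostgreSQL confirmed_urls table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE confirmed_urls (url_sha256 TEXT PRIMARY KEY, source TEXT)"
)
conn.execute(
    "INSERT INTO confirmed_urls VALUES (?, ?)",
    (url_key("http://bad.example/payload"), "URLhaus"),
)


def is_confirmed(conn: sqlite3.Connection, url: str) -> bool:
    """Exact lookup by digest; run only after a Bloom Filter hit."""
    row = conn.execute(
        "SELECT 1 FROM confirmed_urls WHERE url_sha256 = ?", (url_key(url),)
    ).fetchone()
    return row is not None


print(is_confirmed(conn, "http://bad.example/payload"))
print(is_confirmed(conn, "https://example.com/"))
```

Keying on a digest gives every row a 64-character key regardless of URL length, which keeps the index uniform. A fast non-cryptographic hash suits the Bloom stage, where collisions only cost an extra database lookup, while a cryptographic hash suits the confirmation stage, where a collision would mean a wrong verdict.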

Deployment Stack

Component        Service
API              Render
Database         Supabase PostgreSQL
Redis            Upstash Redis
Package hosting  PyPI

Why "dvara"?

dvara (द्वार) is the Sanskrit word for:

gateway / doorway

Every URL is a gateway.

dvara stands at that gateway and decides what gets through.


Security Note

dvara is intended for defensive cybersecurity research, malicious URL analysis, and educational purposes.

While the system uses real threat intelligence feeds and probabilistic detection techniques, it should not be treated as a replacement for enterprise secure web gateways, antivirus engines, or production threat prevention systems.

License

MIT
