Skip to main content

Production-ready DNS threat detection using machine learning

Project description

DNS Threat Detector

Python License Version

A production-ready machine learning system for detecting malicious DNS domains including DGA domains, typosquatting attempts, malware C&C domains, and phishing sites.

Features

  • 99.68% F1-Score on comprehensive test dataset
  • 100% Typosquatting Detection with zero false positives
  • Sub-millisecond Latency (0.439ms average inference time)
  • Multi-tier Safelist with O(1) lookup for instant benign classification
  • 99 Protected Brands including Google, Microsoft, PayPal, Amazon, etc.
  • Hybrid Architecture combining LightGBM, LSTM, and meta-learning
  • Easy Integration with Python API and CLI tool

Performance Metrics

Metric Value
F1-Score 99.68%
Accuracy 99.38%
Precision 97.15%
Recall 99.95%
Typosquatting Detection 100%
False Positive Rate 28.5%
False Negative Rate 0.05%
Avg Latency 0.439 ms
Throughput ~2,275 domains/sec

Installation

pip install dns-threat-detector

Quick Start

Python API

from dns_threat_detector import DNS_ThreatDetector

# Initialize detector with safelist enabled
detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Predict a single domain
result = detector.predict('gooogle.com')
print(result)
# Output:
# {
#   'prediction': 'MALICIOUS',
#   'confidence': 0.9000,
#   'reason': 'Typosquatting (dist=1 to google)',
#   'method': 'typosquatting_rule',
#   'latency_ms': 0.234
# }

# Batch predictions
domains = ['google.com', 'gooogle.com', 'example.com']
results = detector.predict_batch(domains)

Command-Line Interface

# Predict a single domain
dns-detect predict gooogle.com

# Get JSON output
dns-detect predict gooogle.com --json

# Batch process domains from file
dns-detect batch domains.txt --output results.json

# Show model information
dns-detect info

# Run self-tests
dns-detect test

Architecture

The DNS Threat Detector uses a sophisticated hybrid ensemble approach:

1. LightGBM Classifier

  • Gradient-boosted decision trees
  • 11 features (4 FQDN + 7 typosquatting-specific)
  • 200 trees with max depth 7
  • Handles structured feature patterns

2. Bidirectional LSTM

  • Character-level neural network
  • 41-character vocabulary
  • 159K parameters
  • Embedding(41→32) → Bi-LSTM(32→64×2) → FC(128→64→2)
  • Captures sequential patterns

3. Meta-Learner

  • Logistic regression stacking ensemble
  • Combines LightGBM and LSTM predictions
  • Learned weights: LSTM=7.04, LightGBM=2.53
  • Final classification decision

4. Typosquatting Detector

  • Rule-based + ML hybrid approach
  • Edit distance (Levenshtein) to 99 top brands
  • Distance 1-3 → Malicious (typosquatting)
  • Exact brand match → Benign (whitelist)

5. Multi-Tier Safelist (Optional)

  • Tier 1: 30K critical domains (government, finance)
  • Tier 2: 29K high-trust domains (tech, education)
  • Tier 3: 85K general trusted domains
  • O(1) in-memory lookup
  • 322× speedup for safelisted domains

Prediction Flow

Domain Input
    ↓
┌─────────────────────┐
│  Safelist Check     │ → BENIGN (if listed)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Brand Whitelist    │ → BENIGN (exact match)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Typosquatting Rule │ → MALICIOUS (edit dist 1-3)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  ML Ensemble        │ → MALICIOUS/BENIGN
│  (LightGBM + LSTM)  │
└─────────────────────┘

Features Extracted

FQDN Features (4)

  1. domain_length - Length of domain name excluding TLD
  2. subdomain_count - Number of subdomains
  3. numeric_chars - Count of numeric characters
  4. entropy - Shannon entropy of character distribution

Typosquatting Features (7)

  1. min_edit_distance - Minimum Levenshtein distance to top brands
  2. edit_distance_ratio - Normalized edit distance by brand length
  3. length_diff_to_closest - Length difference to closest brand
  4. has_extra_char - Binary: domain has 1 extra character
  5. has_missing_char - Binary: domain missing 1 character
  6. has_swapped_char - Binary: adjacent characters swapped
  7. digit_substitution - Binary: contains digit substitution

API Reference

DNS_ThreatDetector

class DNS_ThreatDetector(
    models_dir: Optional[str] = None,
    use_safelist: bool = False,
    safelist_dir: Optional[str] = None,
    safelist_tiers: List[int] = [1, 2, 3]
)

Methods

load_models() Load all model components (LightGBM, LSTM, meta-learner, safelist)

predict(domain: str) -> Dict Predict if a domain is malicious or benign

Returns:

{
    'prediction': 'MALICIOUS' | 'BENIGN',
    'confidence': float,  # 0.0 to 1.0
    'reason': str,  # Human-readable explanation
    'method': str,  # 'safelist' | 'brand_whitelist' | 'typosquatting_rule' | 'ensemble'
    'latency_ms': float  # Inference time in milliseconds
}

predict_batch(domains: List[str]) -> List[Dict] Predict multiple domains

get_model_info() -> Dict Get comprehensive model information and statistics

save_metadata(output_path: str) Save model metadata to JSON file

Advanced Usage

Custom Model Paths

detector = DNS_ThreatDetector(
    models_dir='/path/to/models',
    use_safelist=True,
    safelist_dir='/path/to/safelists',
    safelist_tiers=[1, 2, 3]
)
detector.load_models()

Disable Safelist

# Faster initialization, no safelist loading
detector = DNS_ThreatDetector(use_safelist=False)
detector.load_models()

Batch Processing with Progress

from tqdm import tqdm

domains = ['example1.com', 'example2.com', ...]
results = []

for domain in tqdm(domains):
    result = detector.predict(domain)
    results.append(result)

Model Statistics

info = detector.get_model_info()
print(f"Total predictions: {info['usage_statistics']['total_predictions']}")
print(f"Safelist hits: {info['usage_statistics']['safelist_hits']}")
print(f"Typosquatting detections: {info['usage_statistics']['typosquatting_detections']}")

CLI Reference

Commands

dns-detect predict <domain> Predict a single domain

  • --json: Output as JSON
  • --no-safelist: Disable safelist checking

dns-detect batch <file> Batch process domains from file (one domain per line)

  • --output <file>: Output file path (default: results.json)
  • --no-safelist: Disable safelist checking

dns-detect info Show model information and statistics

  • --no-safelist: Show info without loading safelist

dns-detect test Run built-in self-tests

Examples

Detecting Typosquatting

detector = DNS_ThreatDetector()
detector.load_models()

# Legitimate brand
result = detector.predict('google.com')
# → BENIGN (brand_whitelist)

# Typosquatting attempts
result = detector.predict('gooogle.com')  # Extra 'o'
# → MALICIOUS (typosquatting_rule, dist=1)

result = detector.predict('g00gle.com')  # Digit substitution
# → MALICIOUS (typosquatting_rule, dist=2)

Processing Domain Lists

import pandas as pd

detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Read domains from CSV
df = pd.read_csv('domains.csv')

# Add predictions
df['prediction'] = df['domain'].apply(
    lambda d: detector.predict(d)['prediction']
)
df['confidence'] = df['domain'].apply(
    lambda d: detector.predict(d)['confidence']
)

# Filter malicious domains
malicious = df[df['prediction'] == 'MALICIOUS']
print(malicious)

Requirements

  • Python ≥ 3.8
  • PyTorch ≥ 2.0.0
  • LightGBM ≥ 4.0.0
  • scikit-learn ≥ 1.3.0
  • pandas ≥ 2.0.0
  • numpy ≥ 1.24.0

Model Size

  • Total package size: ~60 MB
  • LightGBM models: ~10 MB
  • LSTM model: ~5 MB
  • Safelist files (tiers 1-3): ~20 MB
  • Tokenizer: ~1 MB

Dataset

Trained on 51,000 domains:

  • 50% benign (legitimate domains)
  • 50% malicious (DGA, typosquatting, malware C&C)
  • 80/20 train/test split with stratification

Citation

If you use this tool in your research or project, please cite:

@software{dns_threat_detector,
  title = {DNS Threat Detector},
  author = {UMUDGA Project},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/umudga/dns-threat-detector}
}

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

Support

Changelog

Version 1.0.0 (2025-10-30)

  • Initial release
  • Hybrid ensemble architecture (LightGBM + LSTM + Meta-learner)
  • 99.68% F1-score on test data
  • 100% typosquatting detection
  • Multi-tier safelist integration
  • CLI tool with batch processing
  • Comprehensive API documentation

Acknowledgments

Developed by the UMUDGA Project team as part of a final-year academic research project on DNS threat detection using machine learning.

Disclaimer

This tool is provided for educational and research purposes. While it achieves high accuracy, no detection system is perfect. Always use multiple layers of security in production environments.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dns_threat_detector-1.0.0.tar.gz (1.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dns_threat_detector-1.0.0-py3-none-any.whl (1.8 MB view details)

Uploaded Python 3

File details

Details for the file dns_threat_detector-1.0.0.tar.gz.

File metadata

  • Download URL: dns_threat_detector-1.0.0.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for dns_threat_detector-1.0.0.tar.gz
Algorithm Hash digest
SHA256 2b86c5e7d224b5d6d9ead3e44cfdc066db9a315d8550e976a80276ce50cab302
MD5 146b318f6350f9e26cfe1d5307b54568
BLAKE2b-256 7ef94dab76fb385245789f99923c2a816a0c84c783a80f1dbd9cad5d42167ecf

See more details on using hashes here.

File details

Details for the file dns_threat_detector-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for dns_threat_detector-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7a5309f2306780702ab13f3e2441d1d0e6d89c54b1eb266c55ef0b153095e72c
MD5 f77946c6cbe5abecb2331fec815c62b5
BLAKE2b-256 715e98340afc891ab588a409a97c322ea0816bb14078c4b594fd085b078b0836

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page