Production-ready DNS threat detection using machine learning
Project description
DNS Threat Detector
A production-ready machine learning system for detecting malicious DNS domains including DGA domains, typosquatting attempts, malware C&C domains, and phishing sites.
Features
- 99.68% F1-Score on comprehensive test dataset
- 100% Typosquatting Detection with zero false positives
- Sub-millisecond Latency (0.439ms average inference time)
- Multi-tier Safelist with O(1) lookup for instant benign classification
- 99 Protected Brands including Google, Microsoft, PayPal, Amazon, etc.
- Hybrid Architecture combining LightGBM, LSTM, and meta-learning
- Easy Integration with Python API and CLI tool
Performance Metrics
| Metric | Value |
|---|---|
| F1-Score | 99.68% |
| Accuracy | 99.38% |
| Precision | 97.15% |
| Recall | 99.95% |
| Typosquatting Detection | 100% |
| False Positive Rate | 28.5% |
| False Negative Rate | 0.05% |
| Avg Latency | 0.439 ms |
| Throughput | ~2,275 domains/sec |
Installation
pip install dns-threat-detector
Quick Start
Python API
from dns_threat_detector import DNS_ThreatDetector
# Initialize detector with safelist enabled
detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()
# Predict a single domain
result = detector.predict('gooogle.com')
print(result)
# Output:
# {
# 'prediction': 'MALICIOUS',
# 'confidence': 0.9000,
# 'reason': 'Typosquatting (dist=1 to google)',
# 'method': 'typosquatting_rule',
# 'latency_ms': 0.234
# }
# Batch predictions
domains = ['google.com', 'gooogle.com', 'example.com']
results = detector.predict_batch(domains)
Command-Line Interface
# Predict a single domain
dns-detect predict gooogle.com
# Get JSON output
dns-detect predict gooogle.com --json
# Batch process domains from file
dns-detect batch domains.txt --output results.json
# Show model information
dns-detect info
# Run self-tests
dns-detect test
Architecture
The DNS Threat Detector uses a sophisticated hybrid ensemble approach:
1. LightGBM Classifier
- Gradient-boosted decision trees
- 11 features (4 FQDN + 7 typosquatting-specific)
- 200 trees with max depth 7
- Handles structured feature patterns
2. Bidirectional LSTM
- Character-level neural network
- 41-character vocabulary
- 159K parameters
- Embedding(41→32) → Bi-LSTM(32→64×2) → FC(128→64→2)
- Captures sequential patterns
3. Meta-Learner
- Logistic regression stacking ensemble
- Combines LightGBM and LSTM predictions
- Learned weights: LSTM=7.04, LightGBM=2.53
- Final classification decision
4. Typosquatting Detector
- Rule-based + ML hybrid approach
- Edit distance (Levenshtein) to 99 top brands
- Distance 1-3 → Malicious (typosquatting)
- Exact brand match → Benign (whitelist)
5. Multi-Tier Safelist (Optional)
- Tier 1: 30K critical domains (government, finance)
- Tier 2: 29K high-trust domains (tech, education)
- Tier 3: 85K general trusted domains
- O(1) in-memory lookup
- 322× speedup for safelisted domains
Prediction Flow
Domain Input
↓
┌─────────────────────┐
│ Safelist Check │ → BENIGN (if listed)
└─────────────────────┘
↓
┌─────────────────────┐
│ Brand Whitelist │ → BENIGN (exact match)
└─────────────────────┘
↓
┌─────────────────────┐
│ Typosquatting Rule │ → MALICIOUS (edit dist 1-3)
└─────────────────────┘
↓
┌─────────────────────┐
│ ML Ensemble │ → MALICIOUS/BENIGN
│ (LightGBM + LSTM) │
└─────────────────────┘
Features Extracted
FQDN Features (4)
domain_length- Length of domain name excluding TLDsubdomain_count- Number of subdomainsnumeric_chars- Count of numeric charactersentropy- Shannon entropy of character distribution
Typosquatting Features (7)
min_edit_distance- Minimum Levenshtein distance to top brandsedit_distance_ratio- Normalized edit distance by brand lengthlength_diff_to_closest- Length difference to closest brandhas_extra_char- Binary: domain has 1 extra characterhas_missing_char- Binary: domain missing 1 characterhas_swapped_char- Binary: adjacent characters swappeddigit_substitution- Binary: contains digit substitution
API Reference
DNS_ThreatDetector
class DNS_ThreatDetector(
models_dir: Optional[str] = None,
use_safelist: bool = False,
safelist_dir: Optional[str] = None,
safelist_tiers: List[int] = [1, 2, 3]
)
Methods
load_models()
Load all model components (LightGBM, LSTM, meta-learner, safelist)
predict(domain: str) -> Dict
Predict if a domain is malicious or benign
Returns:
{
'prediction': 'MALICIOUS' | 'BENIGN',
'confidence': float, # 0.0 to 1.0
'reason': str, # Human-readable explanation
'method': str, # 'safelist' | 'brand_whitelist' | 'typosquatting_rule' | 'ensemble'
'latency_ms': float # Inference time in milliseconds
}
predict_batch(domains: List[str]) -> List[Dict]
Predict multiple domains
get_model_info() -> Dict
Get comprehensive model information and statistics
save_metadata(output_path: str)
Save model metadata to JSON file
Advanced Usage
Custom Model Paths
detector = DNS_ThreatDetector(
models_dir='/path/to/models',
use_safelist=True,
safelist_dir='/path/to/safelists',
safelist_tiers=[1, 2, 3]
)
detector.load_models()
Disable Safelist
# Faster initialization, no safelist loading
detector = DNS_ThreatDetector(use_safelist=False)
detector.load_models()
Batch Processing with Progress
from tqdm import tqdm
domains = ['example1.com', 'example2.com', ...]
results = []
for domain in tqdm(domains):
result = detector.predict(domain)
results.append(result)
Model Statistics
info = detector.get_model_info()
print(f"Total predictions: {info['usage_statistics']['total_predictions']}")
print(f"Safelist hits: {info['usage_statistics']['safelist_hits']}")
print(f"Typosquatting detections: {info['usage_statistics']['typosquatting_detections']}")
CLI Reference
Commands
dns-detect predict <domain>
Predict a single domain
--json: Output as JSON--no-safelist: Disable safelist checking
dns-detect batch <file>
Batch process domains from file (one domain per line)
--output <file>: Output file path (default: results.json)--no-safelist: Disable safelist checking
dns-detect info
Show model information and statistics
--no-safelist: Show info without loading safelist
dns-detect test
Run built-in self-tests
Examples
Detecting Typosquatting
detector = DNS_ThreatDetector()
detector.load_models()
# Legitimate brand
result = detector.predict('google.com')
# → BENIGN (brand_whitelist)
# Typosquatting attempts
result = detector.predict('gooogle.com') # Extra 'o'
# → MALICIOUS (typosquatting_rule, dist=1)
result = detector.predict('g00gle.com') # Digit substitution
# → MALICIOUS (typosquatting_rule, dist=2)
Processing Domain Lists
import pandas as pd
detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()
# Read domains from CSV
df = pd.read_csv('domains.csv')
# Add predictions
df['prediction'] = df['domain'].apply(
lambda d: detector.predict(d)['prediction']
)
df['confidence'] = df['domain'].apply(
lambda d: detector.predict(d)['confidence']
)
# Filter malicious domains
malicious = df[df['prediction'] == 'MALICIOUS']
print(malicious)
Requirements
- Python ≥ 3.8
- PyTorch ≥ 2.0.0
- LightGBM ≥ 4.0.0
- scikit-learn ≥ 1.3.0
- pandas ≥ 2.0.0
- numpy ≥ 1.24.0
Model Size
- Total package size: ~60 MB
- LightGBM models: ~10 MB
- LSTM model: ~5 MB
- Safelist files (tiers 1-3): ~20 MB
- Tokenizer: ~1 MB
Dataset
Trained on 51,000 domains:
- 50% benign (legitimate domains)
- 50% malicious (DGA, typosquatting, malware C&C)
- 80/20 train/test split with stratification
Citation
If you use this tool in your research or project, please cite:
@software{dns_threat_detector,
title = {DNS Threat Detector},
author = {UMUDGA Project},
year = {2025},
version = {1.0.0},
url = {https://github.com/umudga/dns-threat-detector}
}
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Support
- GitHub Issues: https://github.com/umudga/dns-threat-detector/issues
- Documentation: https://github.com/umudga/dns-threat-detector/wiki
Changelog
Version 1.0.0 (2025-10-30)
- Initial release
- Hybrid ensemble architecture (LightGBM + LSTM + Meta-learner)
- 99.68% F1-score on test data
- 100% typosquatting detection
- Multi-tier safelist integration
- CLI tool with batch processing
- Comprehensive API documentation
Acknowledgments
Developed by the UMUDGA Project team as part of a final-year academic research project on DNS threat detection using machine learning.
Disclaimer
This tool is provided for educational and research purposes. While it achieves high accuracy, no detection system is perfect. Always use multiple layers of security in production environments.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dns_threat_detector-1.0.0.tar.gz.
File metadata
- Download URL: dns_threat_detector-1.0.0.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2b86c5e7d224b5d6d9ead3e44cfdc066db9a315d8550e976a80276ce50cab302
|
|
| MD5 |
146b318f6350f9e26cfe1d5307b54568
|
|
| BLAKE2b-256 |
7ef94dab76fb385245789f99923c2a816a0c84c783a80f1dbd9cad5d42167ecf
|
File details
Details for the file dns_threat_detector-1.0.0-py3-none-any.whl.
File metadata
- Download URL: dns_threat_detector-1.0.0-py3-none-any.whl
- Upload date:
- Size: 1.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7a5309f2306780702ab13f3e2441d1d0e6d89c54b1eb266c55ef0b153095e72c
|
|
| MD5 |
f77946c6cbe5abecb2331fec815c62b5
|
|
| BLAKE2b-256 |
715e98340afc891ab588a409a97c322ea0816bb14078c4b594fd085b078b0836
|