Bot detection and traffic classification for scientific data repository logs
DeepLogBot
Overview
DeepLogBot (CLI: deeplogbot) detects and classifies download patterns in scientific data repository logs, distinguishing between:
- Organic users — Human researchers with natural download patterns
- Bots — Automated scrapers, crawlers, and coordinated bot farms
- Download hubs — Legitimate mirrors, institutional pipelines, and data aggregators
Applied to the PRIDE Archive (159M download records), the system identified that 88% of traffic is bot-generated. After filtering, 19.1M clean downloads remain across 34,085 datasets and 213 countries.
Classification Categories
Each geographic location is classified into one of three categories:
- Bot — Automated scrapers, crawlers, and coordinated bot farms
- Hub — Legitimate automation: institutional mirrors, CI/CD pipelines, educational workshops
- Organic — Human researchers with natural download patterns
Classification Methods
DeepLogBot provides 2 classification methods:
| Method | Macro F1 | Speed | Description |
|---|---|---|---|
| rules | 0.632 | Fast | YAML-configurable thresholds, no training required |
| deep | 0.775 | Medium | Multi-stage learned pipeline with soft priors |
Benchmarked on a 1M-record sample with manually curated ground truth.
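Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class such as hubs counts as much as a common one. The sketch below shows the computation on toy labels; the function name `macro_f1` and the example data are illustrative and are not taken from the benchmark.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels=("bot", "hub", "organic")):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only (not the benchmark data)
y_true = ["bot", "bot", "hub", "organic", "organic", "organic"]
y_pred = ["bot", "hub", "hub", "organic", "organic", "bot"]
score = macro_f1(y_true, y_pred)
```

Here the per-class scores are 0.5 (bot), 2/3 (hub), and 0.8 (organic), giving a macro F1 of about 0.656.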
Rule-Based (--classification-method rules)
Hierarchical threshold classification using YAML-configurable rules. Fast, interpretable, and requires no training. Best for production use with known patterns.
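A minimal sketch of how min/max threshold patterns of this kind could be evaluated. It mirrors the example thresholds from the Configuration section, but the function names, the rule ordering, and the fallback to organic are assumptions made for illustration, not DeepLogBot's actual implementation.

```python
def in_range(value, bounds):
    """True if value satisfies the optional 'min' and 'max' bounds."""
    if "min" in bounds and value < bounds["min"]:
        return False
    if "max" in bounds and value > bounds["max"]:
        return False
    return True

def matches(features, pattern):
    """A pattern matches when every listed feature is within its bounds."""
    return all(in_range(features[name], bounds) for name, bounds in pattern.items())

# Hypothetical in-memory form of the example rules (both require an anomaly)
RULES = [
    ("bot", [{"downloads_per_user": {"max": 100}, "unique_users": {"min": 5000}}]),
    ("hub", [{"downloads_per_user": {"min": 500}}]),
]

def classify(features, is_anomaly):
    """Check categories in order; anything unmatched is treated as organic."""
    for label, patterns in RULES:
        if is_anomaly and any(matches(features, p) for p in patterns):
            return label
    return "organic"

# Many users with few downloads each, flagged as anomalous
label = classify({"downloads_per_user": 20, "unique_users": 8000}, is_anomaly=True)
```

Because every threshold is an explicit bound on a named feature, a classification can be traced back to the single pattern that produced it.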
Deep Architecture (--classification-method deep)
Multi-stage learned pipeline:
- Seed Selection — Identify high-confidence bot/organic/hub seeds from feature distributions
- Organic VAE — Learn the normal-behavior manifold; score reconstruction error
- Deep Isolation Forest — Non-linear anomaly detection on VAE latent space
- Temporal Consistency — Modified z-score spike detection (no fixed thresholds)
- Fusion Meta-Learner — Gradient-boosted combination of all anomaly signals
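To illustrate the temporal-consistency stage, the sketch below computes the standard modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD) so that a single burst does not inflate the scale. The daily counts are hypothetical, and how DeepLogBot turns these scores into a spike decision is not shown here.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so MAD gives no scale
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical daily download counts for one location, with one burst
daily = [12, 15, 11, 14, 13, 240, 12]
scores = modified_z_scores(daily)

# Index of the day that deviates most above the median
spike_day = max(range(len(daily)), key=lambda i: scores[i])
```

The burst of 240 downloads scores far above every other day because the median (13) and MAD (1) are barely affected by it.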
Additional components:
- Soft priors — Pre-filter signals encoded as continuous features (no hard lockout)
- Reconciliation — Override thresholds for cases where pipeline and pre-filter disagree
- Hub protection — Prevent legitimate automation from being classified as bots
- Post-classification — Hub protection and final label assignment
Installation
pip install -e .
Requirements
- Python 3.9+
- pandas, numpy, scikit-learn, scipy, duckdb
- Optional: torch (for deep method)
Usage
Command Line
# Rule-based classification (default)
deeplogbot -i data.parquet -o output/
# Deep architecture
deeplogbot -i data.parquet -o output/ -m deep
# With sampling for large datasets
deeplogbot -i data.parquet -o output/ -m deep --sample-size 1000000
Options:
| Option | Description | Default |
|---|---|---|
| -i, --input | Input parquet file | Required |
| -o, --output-dir | Output directory | output/bot_analysis |
| -m, --classification-method | rules or deep | rules |
| -c, --contamination | Anomaly proportion | 0.15 |
| -s, --sample-size | Sample N records | None (use all) |
| -p, --provider | Log provider | ebi |
Python API
from deeplogbot import run_bot_annotator

# Rule-based classification
results = run_bot_annotator(
    input_parquet='data.parquet',
    output_dir='output/',
    classification_method='rules'
)

# Deep architecture
results = run_bot_annotator(
    input_parquet='data.parquet',
    output_dir='output/',
    classification_method='deep'
)

print(f"Bots detected: {results['bot_count']}")
print(f"Hubs detected: {results['hub_count']}")
Project Structure
deeplogbot/
├── __init__.py                      # Package exports
├── main.py                          # CLI entry point and pipeline
├── config.py                        # Configuration loading
├── config.yaml                      # Main configuration file
│
├── features/                        # Feature extraction (~117 features)
│   ├── base.py                      # Base extractor class
│   ├── schema.py                    # Log schema definitions
│   ├── registry.py                  # Feature documentation registry
│   └── providers/
│       └── ebi/                     # EBI/PRIDE provider
│           ├── ebi.py               # Location feature extraction
│           ├── behavioral.py        # Behavioral features
│           ├── discriminative.py    # Discriminative features
│           ├── timeseries.py        # Time series features
│           └── schema.py            # EBI-specific schema
│
├── models/
│   ├── isoforest/                   # Isolation Forest anomaly detection
│   │   └── models.py
│   └── classification/              # Classification methods
│       ├── rules.py                 # Rule-based hierarchical classifier
│       ├── deep_architecture.py     # Deep pipeline orchestration
│       ├── seed_selection.py        # High-confidence seed identification
│       ├── organic_vae.py           # VAE + Deep Isolation Forest
│       ├── temporal_consistency.py  # Modified z-score spike detection
│       ├── fusion.py                # Gradient-boosted meta-learner
│       ├── post_classification.py   # Hub protection & label finalization
│       └── feature_validation.py    # Feature usage validation
│
├── reports/                         # Output generation
│   ├── reporting.py                 # Text report generation
│   ├── annotation.py                # Parquet annotation
│   ├── statistics.py                # Summary statistics
│   ├── html_report.py               # Interactive HTML reports
│   └── visualizations.py            # Charts and plots
│
├── utils/                           # Utilities
│   └── geography.py                 # Geographic lookups
│
└── providers/
    └── base_taxonomy.yaml           # Classification taxonomy
Configuration
Configuration is in deeplogbot/config.yaml:
isolation_forest:
  contamination: 0.15
  n_estimators: 200
  random_state: 42

classification:
  rule_based:
    bots:
      require_anomaly: true
      patterns:
        - downloads_per_user: {max: 100}
          unique_users: {min: 5000}
    hubs:
      require_anomaly: true
      patterns:
        - downloads_per_user: {min: 500}

deep_reconciliation:
  override_threshold: 0.7
  strict_threshold: 0.8
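In scikit-learn's IsolationForest, `contamination` is the expected proportion of outliers and determines the score threshold. Assuming DeepLogBot passes the value through unchanged, the sketch below shows the practical effect of 0.15: roughly the 15% most anomalous locations are flagged. The function `flag_anomalies` and the scores are hypothetical, and a higher score is taken to mean more anomalous.

```python
def flag_anomalies(scores, contamination=0.15):
    """Flag the highest-scoring fraction of items as anomalous."""
    n_flag = round(len(scores) * contamination)
    # Rank item indices from most to least anomalous
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    return [i in flagged for i in range(len(scores))]

# Hypothetical anomaly scores for 20 locations; the last three stand out
scores = [0.1] * 17 + [0.9, 0.8, 0.95]
flags = flag_anomalies(scores)
```

Raising the value flags more locations as candidates for the bot and hub rules, and lowering it flags fewer.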
Classifying a Download Parquet File
Given a parquet file of download logs (one row per download event), DeepLogBot aggregates records by geographic location, extracts ~117 behavioral and discriminative features, classifies each location as bot/hub/organic, and writes a new annotated parquet with classification columns appended to every row.
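The aggregate-then-annotate flow described above can be sketched in plain Python. The two summary features stand in for the full feature set, grouping by country plus location is an assumption, and the labels are supplied by hand rather than produced by a classifier.

```python
from collections import defaultdict

# Hypothetical download events using the documented input columns
events = [
    {"accession": "PXD000001", "geo_location": "Cambridge", "country": "GB"},
    {"accession": "PXD000002", "geo_location": "Cambridge", "country": "GB"},
    {"accession": "PXD000001", "geo_location": "Cambridge", "country": "GB"},
    {"accession": "PXD000001", "geo_location": "Ashburn", "country": "US"},
]

def location_key(row):
    return (row["country"], row["geo_location"])

def aggregate_by_location(rows):
    """Collapse per-download rows into per-location summary features."""
    stats = defaultdict(lambda: {"downloads": 0, "datasets": set()})
    for row in rows:
        entry = stats[location_key(row)]
        entry["downloads"] += 1
        entry["datasets"].add(row["accession"])
    return {
        key: {"downloads": s["downloads"], "unique_datasets": len(s["datasets"])}
        for key, s in stats.items()
    }

def annotate(rows, labels):
    """Copy each location's label back onto every one of its rows."""
    out = []
    for row in rows:
        label = labels[location_key(row)]
        out.append({**row, "is_bot": label == "bot",
                    "is_hub": label == "hub", "is_organic": label == "organic"})
    return out

features = aggregate_by_location(events)
# Labels assigned by hand for the example
labels = {("GB", "Cambridge"): "organic", ("US", "Ashburn"): "bot"}
annotated = annotate(events, labels)
```

Classification happens once per location, while the output keeps one row per download, so every row from a given location carries the same flags.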
Input format
The input parquet must contain at minimum:
| Column | Description |
|---|---|
| accession | Dataset accession (e.g., PXD000001) |
| geo_location | Geographic location string (city/region) |
| country | Country name or code |
| year | Download year |
| date | Download date |
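Before a long run it can help to confirm that these columns are present. The helper below is a hypothetical pre-flight check, not part of DeepLogBot; it only compares column names.

```python
REQUIRED_COLUMNS = {"accession", "geo_location", "country", "year", "date"}

def missing_columns(columns):
    """Return the required columns absent from a schema, sorted for stable output."""
    return sorted(REQUIRED_COLUMNS - set(columns))

# Example: a file that lacks the 'year' and 'date' columns
missing = missing_columns(["accession", "geo_location", "country", "user"])
```

If pyarrow is installed, the column names can be read without loading the data via `pyarrow.parquet.read_schema(path).names`.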
Running classification
# Classify with the deep method (recommended) — writes <input>_annotated.parquet
deeplogbot -i downloads.parquet -o output/ -m deep
# Classify with rules (faster, no torch required)
deeplogbot -i downloads.parquet -o output/ -m rules
# For large files, sample first to speed up classification
deeplogbot -i downloads.parquet -o output/ -m deep --sample-size 5000000
The annotated parquet is written to the output directory with an _annotated suffix (e.g., output/downloads_annotated.parquet). You can also specify an explicit output path:
deeplogbot -i downloads.parquet -o output/ -m deep --output output/classified.parquet
Output strategies
| Strategy | Flag | Behavior |
|---|---|---|
| new_file (default) | --output-strategy new_file | Creates <input>_annotated.parquet in the output directory |
| overwrite | --output-strategy overwrite | Rewrites the original parquet in place |
| reports_only | --reports-only | Generates text/HTML reports without writing a parquet |
Using the annotated parquet
import duckdb

conn = duckdb.connect()

# Preview the classification columns
df = conn.execute("""
    SELECT accession, country, year,
           is_bot, is_hub, is_organic,
           classification_confidence
    FROM read_parquet('output/downloads_annotated.parquet')
    LIMIT 10
""").df()

# Filter to clean (non-bot, non-hub) downloads
clean = conn.execute("""
    SELECT accession, country, COUNT(*) AS downloads
    FROM read_parquet('output/downloads_annotated.parquet')
    WHERE is_bot = false AND is_hub = false
    GROUP BY accession, country
    ORDER BY downloads DESC
""").df()
Output Format
The annotated output parquet contains:
| Column | Description |
|---|---|
| is_bot | Bot classification flag |
| is_hub | Download hub classification flag |
| is_organic | Organic user classification flag |
| classification_confidence | Confidence score (0-1) |
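For downstream analysis a single category column is often easier to work with than three flags. The sketch below assumes that exactly one flag is true per row, which follows from each location being assigned one of three categories but is not stated explicitly for the output file; `label_from_flags` is a hypothetical helper.

```python
def label_from_flags(row):
    """Collapse the three boolean flags into one category string."""
    active = [name for name in ("bot", "hub", "organic") if row[f"is_{name}"]]
    if len(active) != 1:
        # Surface rows that break the one-category-per-row assumption
        raise ValueError(f"expected exactly one flag set, got {active}")
    return active[0]

row = {"is_bot": False, "is_hub": True, "is_organic": False,
       "classification_confidence": 0.91}
label = label_from_flags(row)
```

Raising an error on zero or multiple flags makes it obvious if the assumption does not hold for a particular file.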
Reports generated:
- bot_detection_report.txt: Summary with counts and breakdowns
- location_analysis.csv: Per-location features and classifications
- Interactive HTML report (if enabled)
License
MIT