An analysis pipeline for parsing and crawling evasive phishing emails
CrawlerBox
Description
CrawlerBox is an automated analysis framework designed for parsing emails and crawling embedded web resources. This infrastructure was developed to facilitate the study of evasive phishing emails reported by end users.
For more detailed information on CrawlerBox, its functionality, and the results obtained, please refer to our paper "A Closer Look at Modern Evasive Phishing Emails".
Getting started
Installation
CrawlerBox is meant to be run on Windows.
Local installation
Local installation can be done using uv:
git clone https://github.com/AmadeusITGroup/CrawlerBox.git
cd CrawlerBox
uv venv -p python3.10
uv pip install -e .
.venv\Scripts\activate.bat
Necessary dependencies and configuration
First, install vcredist_x64.exe from the Visual C++ Redistributable Packages for Visual Studio 2013. The library that reads QR codes (QReader) requires it.
CrawlerBox relies on external services to operate (e.g., Cisco Umbrella and Shodan for data enrichment). Additionally, it connects to two external servers: one database for retrieving newly user-reported messages and another for storing the obtained results. Before running CrawlerBox, you must configure these dependencies. Please use the config.py file accordingly.
Please also consider rewriting the functions in personalized_config.py: fetch_new_emails_by_date, fetch_new_emails_by_id, and url_rewrite. The first two functions should match your implementation for fetching newly reported emails, and url_rewrite is designed to extract and return a decoded URL from a given string. If the URLs within the messages are rewritten (e.g., by Microsoft's Safe Links or Proofpoint's URL Defense), you might need to decode them before the crawler loads them.
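As an illustration of what a url_rewrite replacement might look like, here is a minimal sketch. It assumes the function receives a single URL string and returns a string, which is our reading of the description above and not a confirmed signature. It handles only Microsoft Safe Links (original URL in the `url` query parameter) and Proofpoint URL Defense v2 links (original URL in the `u` parameter, with `-XX` standing for `%XX` and `_` for `/`). Other rewriting schemes, including Proofpoint v3, are not covered.

```python
from urllib.parse import parse_qs, unquote, urlparse


def url_rewrite(url: str) -> str:
    """Return the original URL hidden in a rewritten link, or the input unchanged.

    Sketch only: the signature and the set of supported rewriters are assumptions.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    query = parse_qs(parsed.query)  # values are already percent-decoded

    # Microsoft Safe Links: the target sits in the "url" query parameter.
    if host.endswith("safelinks.protection.outlook.com") and "url" in query:
        return query["url"][0]

    # Proofpoint URL Defense v2: the target sits in "u",
    # with "-XX" used in place of "%XX" and "_" in place of "/".
    if (
        host == "urldefense.proofpoint.com"
        and parsed.path.startswith("/v2/")
        and "u" in query
    ):
        encoded = query["u"][0]
        return unquote(encoded.replace("-", "%").replace("_", "/"))

    # Not a recognized rewritten link: leave it untouched.
    return url
```

Returning the input unchanged for unrecognized links keeps the function safe to call on every extracted URL, whether or not your mail gateway rewrites links.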
Running CrawlerBox
You can run CrawlerBox in three ways.
With the -id (--phish_id) option:
The "id" argument is the id of the message to be analyzed, as stored in your input DB. Example:
run_crawlerbox -id xxxx-xxxx-xxxx-xxxxxxx
With the -d (--date) option:
The "d" argument represents a date string. CrawlerBox fetches all the reported emails on date "d" and analyzes them. Example:
run_crawlerbox -d 2025-01-01
With no options:
CrawlerBox runs continuously and fetches newly reported emails every ten minutes. It automatically starts the analysis of the fetched messages. Example:
run_crawlerbox
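The three invocation modes above can be pictured with the following sketch of a command-line dispatcher. This is not CrawlerBox's actual entry point: the function names, the mode labels, and the choice to make the two options mutually exclusive are all assumptions made for illustration. Only the option names (-id/--phish_id, -d/--date) and the ten-minute interval come from the description above.

```python
import argparse
from datetime import date

# "every ten minutes", per the description of the no-option mode
POLL_INTERVAL_SECONDS = 10 * 60


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="run_crawlerbox")
    # Assumption: the two options cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-id", "--phish_id", help="id of one reported message")
    group.add_argument(
        "-d", "--date", type=date.fromisoformat,
        help="analyze all messages reported on this date (YYYY-MM-DD)",
    )
    return parser


def select_mode(argv):
    """Map command-line arguments to a (mode, parameter) pair."""
    args = build_parser().parse_args(argv)
    if args.phish_id is not None:
        return ("single", args.phish_id)
    if args.date is not None:
        return ("by_date", args.date)
    # No option given: poll for new reports at a fixed interval.
    return ("continuous", POLL_INTERVAL_SECONDS)
```

Parsing the date with `date.fromisoformat` rejects malformed input such as `2025-13-01` at the command line, before any database query is attempted.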
Citation
Please consider citing our paper if you find it useful:
@inproceedings{boulila2025,
title = {A Closer Look at Modern Evasive Phishing Emails},
author = {Boulila, Elyssa and Dacier, Marc and Vengadessa Peroumal, Siva Prem and Veys, Nicolas and Aonzo, Simone},
booktitle={2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
year = {2025},
organization = {IEEE}
}
Contributing
We welcome your contributions. Please feel free to fork the code, experiment with it, and send us patches through pull requests or issues.
We do have a Code of conduct. Make sure to check it out before contributing.