
An analysis pipeline for parsing and crawling evasive phishing emails


CrawlerBox

Description

CrawlerBox is an automated analysis framework designed for parsing emails and crawling embedded web resources. This infrastructure was developed to facilitate the study of evasive phishing emails reported by end users.

For more detailed information on CrawlerBox, its functionality, and the results obtained, please refer to our paper "A Closer Look at Modern Evasive Phishing Emails".

Figure 1: CrawlerBox Analysis Pipeline

Getting started

Installation

CrawlerBox is meant to be run on Windows.

Local installation

Install CrawlerBox locally with uv:

git clone https://github.com/AmadeusITGroup/CrawlerBox.git
cd CrawlerBox
uv venv -p python3.10
uv pip install -e .
.venv\Scripts\activate.bat 

Necessary dependencies and configuration

First, install vcredist_x64.exe from the Visual C++ Redistributable Packages for Visual Studio 2013. The library that reads QR codes (QReader) requires it.

CrawlerBox relies on external services to operate (e.g., Cisco Umbrella and Shodan for data enrichment). Additionally, it connects to two external servers: one database for retrieving newly user-reported messages and another for storing the obtained results. Before running CrawlerBox, configure these dependencies in the config.py file.

Please also consider rewriting the functions in personalized_config.py: fetch_new_emails_by_date, fetch_new_emails_by_id, and url_rewrite. The first two functions should match your implementation for fetching newly reported emails; url_rewrite is designed to extract and return a decoded URL from a given string. If the URLs within the messages are rewritten (e.g., by Microsoft's Safe Links or Proofpoint's URL Defense), you might need to decode them before the crawler loads them.
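As an illustration, a url_rewrite function could look like the sketch below. It is not the implementation shipped with CrawlerBox. It assumes that the input string is the rewritten link itself, that the function takes one string and returns one string, that Safe Links carries the target in the "url" query parameter of a *.safelinks.protection.outlook.com host, and that only version 2 of Proofpoint's URL Defense encoding needs handling. Adapt it to the rewriting scheme used in your environment.

```python
from urllib.parse import parse_qs, unquote, urlparse


def url_rewrite(text: str) -> str:
    """Return the URL hidden inside a rewritten link, or the input unchanged.

    Hypothetical signature: the real function in personalized_config.py
    may take or return something different.
    """
    parsed = urlparse(text.strip())
    host = (parsed.hostname or "").lower()
    query = parse_qs(parsed.query)  # values come back percent-decoded

    # Microsoft Safe Links: the target is the "url" query parameter.
    if host.endswith(".safelinks.protection.outlook.com") and "url" in query:
        return query["url"][0]

    # Proofpoint URL Defense v2: the target is in "u", with "/" written
    # as "_" and "%" written as "-" (so "-3A" stands for ":").
    if (
        host == "urldefense.proofpoint.com"
        and parsed.path.startswith("/v2/")
        and "u" in query
    ):
        encoded = query["u"][0]
        return unquote(encoded.replace("_", "/").replace("-", "%"))

    # Anything else is assumed not to be rewritten.
    return text
```

Links that match neither pattern are returned untouched, so the function can be applied to every extracted URL.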

Running CrawlerBox

You can run CrawlerBox in three ways.

With the -id (--phish_id) option:

The "id" argument is the id of the message to be analyzed, as stored in your input DB. Example:

run_crawlerbox -id xxxx-xxxx-xxxx-xxxxxxx

With the -d (--date) option:

The "d" argument represents a date string. CrawlerBox fetches all the reported emails on date "d" and analyzes them. Example:

run_crawlerbox -d 2025-01-01

With no options:

CrawlerBox runs continuously and fetches newly reported emails every ten minutes. It automatically starts the analysis of the fetched messages. Example:

run_crawlerbox
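The three modes can be pictured with the following sketch. It is illustrative only and is not CrawlerBox's actual entry point. The function names select_mode and poll are hypothetical, the ISO date format is inferred from the example above, and treating -id and -d as mutually exclusive is an assumption.

```python
import argparse
import time
from datetime import date


def select_mode(argv=None):
    """Map command-line arguments to one of the three run modes."""
    parser = argparse.ArgumentParser(prog="run_crawlerbox")
    group = parser.add_mutually_exclusive_group()  # assumption: one mode per run
    group.add_argument("-id", "--phish_id", help="id of one reported message")
    group.add_argument(
        "-d", "--date", type=date.fromisoformat,
        help="analyze all emails reported on this date (YYYY-MM-DD)",
    )
    args = parser.parse_args(argv)
    if args.phish_id is not None:
        return ("id", args.phish_id)
    if args.date is not None:
        return ("date", args.date)
    return ("poll", None)


def poll(fetch_new, analyze, interval_seconds=600, max_rounds=None):
    """Fetch and analyze new messages every interval_seconds (ten minutes by default).

    fetch_new and analyze are caller-supplied callables; max_rounds exists
    only so the loop can be stopped in tests.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for message in fetch_new():
            analyze(message)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)
```

With no arguments, select_mode falls through to the polling mode, which mirrors the behavior described for running run_crawlerbox without options.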

Citation

Please consider citing our paper if you find it useful:

@inproceedings{boulila2025,
  title = {A Closer Look at Modern Evasive Phishing Emails},
  author = {Boulila, Elyssa and Dacier, Marc and Vengadessa Peroumal, Siva Prem and Veys, Nicolas and Aonzo, Simone},
  booktitle={2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
  year = {2025},
  organization = {IEEE}
}

Contributing

We welcome your contributions. Please feel free to fork the code, experiment with it, write patches, and send us pull requests or open issues.

We do have a Code of conduct. Make sure to check it out before contributing.
