Phishing Web Collector
⚔️ PhishingWebCollector: A Python Library for Phishing Website Collection ⚔️
Overview
PhishingWebCollector is a Python library that integrates 22 phishing feeds into one solution and offers a platform for collecting and managing malicious website data.
It is suitable for practical cybersecurity applications, such as updating local blacklists, and for research, such as building phishing detection datasets.
It uses the asyncio module to collect data from multiple feeds concurrently.
Users can gather historical data from free feeds to construct extensive datasets without costly API subscriptions.
Its ease of use, scalability, and support for various data formats enhance the threat detection capabilities of cybersecurity teams and researchers while minimizing technical overhead.
- Free software: MIT license
- Python versions: 3.9 | 3.10 | 3.11
- Tested OS: Windows, Ubuntu, Fedora, and CentOS. It may also work on other systems, but they have not been tested.
- All-in-One Solution: PhishingWebCollector is an all-in-one solution that allows for the collection of a wide range of information about websites.
- Efficiency and Expertise: Building a similar solution independently would be very time-consuming and require specialized knowledge.
- Open Source Advantage: Publishing this tool as open source simplifies many studies and lets researchers and industry professionals focus on more advanced tasks.
- Continuous Improvement: New techniques will be added over time, so the tool keeps growing.
Features
- Integration of 22 Different Sources: Reduces the need to maintain multiple integrations.
- Local Data Collection: Supports building and maintaining local phishing databases.
- Data Export: Allows exporting all collected data in a unified JSON format.
- Asynchronous Performance: Uses asyncio for faster, simultaneous data collection.
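To illustrate what a unified JSON export can look like, here is a minimal sketch using only the standard library. The `Entry` record, its field names, and the `normalize` and `export_json` helpers are hypothetical and chosen for illustration; the schema that PhishingWebCollector actually writes may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical unified record; the library's real schema may differ."""
    url: str
    source: str
    fetched_at: str  # ISO 8601 timestamp


def normalize(raw_lines, source, fetched_at):
    """Map raw feed lines to unified records, skipping blanks and comments."""
    records = []
    for line in raw_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        records.append(Entry(url=line, source=source, fetched_at=fetched_at))
    return records


def export_json(records):
    """Serialize records from every feed into one JSON array."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Two made-up feeds with slightly different raw formatting.
feed_a = ["# comment", "http://bad.example/login", ""]
feed_b = ["https://evil.example/pay "]
stamp = "2025-01-01T00:00:00+00:00"

records = normalize(feed_a, "FeedA", stamp) + normalize(feed_b, "FeedB", stamp)
payload = export_json(records)
print(payload)
```

Keeping one record shape for every source means downstream code only needs a single parser, regardless of how each feed formats its raw data.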
Integrations
- AdGuardHome
- BinaryDefence
- BlockListDe
- Botvrij
- C2IntelFeeds
- C2Tracker
- CertPL
- DangerousDomains
- GreenSnow
- MalwareWorld
- MiraiSecurity
- OpenPhish
- PhishTank
- PhishingArmy
- PhishingDatabase
- PhishStats
- Proofpoint
- ThreatView
- TweetFeed
- URLAbuse
- URLHaus
- Valdin
Why PhishingWebCollector?
While many tools and scripts can collect phishing data, none offer a complete all-in-one solution like PhishingWebCollector. It combines comprehensive functionality with high performance, asynchronous data collection, and easy configuration, making it both efficient and user-friendly.
How to use
The library can be installed using pip:
pip install phishing-web-collector
Code usage
Getting all phishing domains from all available sources
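import phishing_web_collector as pwc
# Create a manager for every available feed; storage_path is the local data directory.
manager = pwc.FeedManager(
    sources=list(pwc.FeedSource),
    storage_path="feeds_data"
)
# Refresh all feeds (synchronous call).
manager.sync_refresh_all()
# Retrieve the collected entries.
entries = manager.sync_retrieve_all()
# Reduce each entry URL to its domain.
phishing_domains = [pwc.get_domain_from_url(item.url) for item in entries]
for domain in phishing_domains:
    print(domain)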
As a result, you will get the list of phishing domains.
All modules are exported into the main package, so you can import the package and invoke them directly.
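The resulting list can contain the same domain many times, because several feeds often report the same site. The sketch below shows one way to deduplicate it. The `domain_of` function is a standard-library stand-in written for this example, not the implementation of `pwc.get_domain_from_url`, which may behave differently (for instance, it might reduce the host to its registered domain).

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Stand-in for pwc.get_domain_from_url: return the lowercase host name.

    Assumption: the library helper may behave differently.
    """
    # urlsplit only fills in the host when a scheme separator is present.
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def unique_domains(urls):
    """Deduplicate domains and drop entries that have no host."""
    return sorted({d for d in map(domain_of, urls) if d})


urls = [
    "http://Login.Example.com/account",
    "https://login.example.com:8443/reset?x=1",
    "pay.example.net/invoice",
    "",
]
print(unique_domains(urls))
# → ['login.example.com', 'pay.example.net']
```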
Jupyter Notebook Usage
If you would like to try PhishingWebCollector without installing it on your machine, consider using the preconfigured Jupyter notebook. It shows how to collect phishing domains from all available sources and save them to a CSV file. You can run it in your browser, without any installation, using Google Colab.
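For readers who want the CSV step outside the notebook, here is a minimal sketch with the standard `csv` module. The `write_domains_csv` helper and the single `domain` column are assumptions made for this example; the notebook may use a different layout.

```python
import csv
import io


def write_domains_csv(domains, handle):
    """Write a header and one row per unique domain, sorted alphabetically."""
    writer = csv.writer(handle)
    writer.writerow(["domain"])
    for domain in sorted(set(domains)):
        writer.writerow([domain])


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_domains_csv(["b.example", "a.example", "b.example"], buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead (phishing_domains comes from the code above):
# with open("phishing_domains.csv", "w", newline="", encoding="utf-8") as f:
#     write_domains_csv(phishing_domains, f)
```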
To see how much faster asynchronous data collection is than synchronous collection, run the asynchronous benchmark notebook.
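The idea behind that comparison can be sketched without any network access. In the example below, the feed names and the fixed `DELAY` are made up, and `time.sleep` / `asyncio.sleep` stand in for real downloads; it is not the benchmark from the notebook.

```python
import asyncio
import time

DELAY = 0.05  # simulated network latency per feed, in seconds
FEEDS = ["feed_a", "feed_b", "feed_c", "feed_d"]  # hypothetical feed names


def fetch_sync(name):
    time.sleep(DELAY)  # blocks until this "download" finishes
    return f"{name}: ok"


async def fetch_async(name):
    await asyncio.sleep(DELAY)  # yields control so other fetches can proceed
    return f"{name}: ok"


def run_sync():
    """Fetch feeds one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fetch_sync(name) for name in FEEDS]
    return results, time.perf_counter() - start


async def _gather_all():
    # gather schedules all fetches at once and preserves input order.
    return await asyncio.gather(*(fetch_async(name) for name in FEEDS))


def run_async():
    """Fetch feeds concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(_gather_all())
    return list(results), time.perf_counter() - start


sync_results, sync_time = run_sync()
async_results, async_time = run_async()
print(f"sync:  {sync_time:.3f}s")
print(f"async: {async_time:.3f}s")
```

The synchronous run takes roughly `DELAY` times the number of feeds, while the asynchronous run takes roughly one `DELAY`, because the waits overlap. Inside a Jupyter notebook, where an event loop is already running, call `await _gather_all()` instead of `asyncio.run`.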
To see how to invoke individual feeds directly, run the direct feed invocation notebook.
Docker usage
If you want to use PhishingWebCollector in a Docker container, please check this README file.
Contributing
To contribute, refer to the CONTRIBUTING.md file. We are a welcoming community; just follow the Code of Conduct.
Maintainers
Project maintainers are:
- Damian Frąszczak
- Edyta Frąszczak