Domains blocklist aggregator
Project description
Blocklist aggregator
This python module does the aggregation of several ads/tracking/malware lists, and merges them into a unified list with duplicates removed.
See the blocklist-domains repository for an implementation.
Table of contents
Installation
If you want to generate your own unified blocklist, install this module with the pip command.
pip install blocklist_aggregator
Configuration
See the default configuration file
The configuration contains:
- the ads/tracking/malware URL lists with the pattern (regex) to use
- the domains list to exclude (whitelist)
- additionnal domains list to block (blacklist)
The configuration can be overwritten at runtime.
cfg_yaml = "verbose: true"
unified = blocklist_aggregator.fetch(ext_cfg=cfg_yaml)
Basic example
This basic example enable to get a unified list of domains. You can save-it in a file or do what you want.
import blocklist_aggregator
unified = blocklist_aggregator.fetch()
print(unified)
[ "doubleclick.net", ..., "telemetry.dropbox.com" ]
print(len(unified))
152978
Fetching in verbose mode
Fetching ads/tracking/malware URL lists in debug mode
2020-10-29 06:47:17,220 Starting new HTTPS connection (1): easylist.to:443
2020-10-29 06:47:17,473 https://easylist.to:443 "GET /easylist/easylist.txt HTTP/1.1" 200 472389
2020-10-29 06:47:17,669 *** Searching valid domains...
2020-10-29 06:47:17,710 *** domains=23672 duplicated=10.26%
2020-10-29 06:47:17,710 Starting new HTTPS connection (1): raw.githubusercontent.com:443
2020-10-29 06:47:18,013 https://raw.githubusercontent.com:443 "GET /paulgb/BarbBlock/master/BarbBlock.txt HTTP/1.1" 200 4701
2020-10-29 06:47:18,020 *** Searching valid domains...
2020-10-29 06:47:18,022 *** domains=550 duplicated=1.09%
2020-10-29 06:47:18,027 Starting new HTTPS connection (1): winhelp2002.mvps.org:443
2020-10-29 06:47:19,065 https://winhelp2002.mvps.org:443 "GET /hosts.txt HTTP/1.1" 200 87636
2020-10-29 06:47:19,343 *** Searching valid domains...
2020-10-29 06:47:19,413 *** domains=11730 duplicated=1.49%
2020-10-29 06:47:19,413 Starting new HTTPS connection (1): adaway.org:443
2020-10-29 06:47:19,654 https://adaway.org:443 "GET /hosts.txt HTTP/1.1" 200 48536
2020-10-29 06:47:19,666 *** Searching valid domains...
2020-10-29 06:47:19,714 *** domains=9002 duplicated=7.52%
2020-10-29 06:47:19,721 Starting new HTTPS connection (1): raw.githubusercontent.com:443
2020-10-29 06:47:20,193 https://raw.githubusercontent.com:443 "GET /StevenBlack/hosts/master/data/StevenBlack/hosts HTTP/1.1" 200 21683
2020-10-29 06:47:20,201 *** Searching valid domains...
2020-10-29 06:47:20,212 *** domains=2938 duplicated=0.0%
2020-10-29 06:47:20,212 Starting new HTTPS connection (1): www.malwaredomainlist.com:443
2020-10-29 06:47:20,944 https://www.malwaredomainlist.com:443 "GET /hostslist/hosts.txt HTTP/1.1" 200 35585
2020-10-29 06:47:21,039 *** Searching valid domains...
2020-10-29 06:47:21,057 *** domains=1106 duplicated=0.0%
2020-10-29 06:47:21,057 Starting new HTTPS connection (1): urlhaus.abuse.ch:443
2020-10-29 06:47:21,340 https://urlhaus.abuse.ch:443 "GET /downloads/hostfile/ HTTP/1.1" 200 19478
2020-10-29 06:47:21,459 *** Searching valid domains...
2020-10-29 06:47:21,479 *** domains=2280 duplicated=0.04%
2020-10-29 06:47:21,485 Starting new HTTPS connection (1): pgl.yoyo.org:443
2020-10-29 06:47:21,731 https://pgl.yoyo.org:443 "GET /adservers/serverlist.php?hostformat=hosts;showintro=0 HTTP/1.1" 200 24153
2020-10-29 06:47:21,743 *** Searching valid domains...
2020-10-29 06:47:21,792 *** domains=3568 duplicated=0.14%
2020-10-29 06:47:21,799 Starting new HTTPS connection (1): someonewhocares.org:443
2020-10-29 06:47:22,849 https://someonewhocares.org:443 "GET /hosts/hosts HTTP/1.1" 200 449957
2020-10-29 06:47:24,029 *** Searching valid domains...
2020-10-29 06:47:24,138 *** domains=14662 duplicated=0.85%
2020-10-29 06:47:24,143 Starting new HTTPS connection (1): raw.githubusercontent.com:443
2020-10-29 06:47:24,378 https://raw.githubusercontent.com:443 "GET /notracking/hosts-blocklists/master/hostnames.txt HTTP/1.1" 200 1468999
2020-10-29 06:47:24,738 *** Searching valid domains...
2020-10-29 06:47:25,124 *** domains=194622 duplicated=50.0%
2020-10-29 06:47:25,128 Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-10-29 06:47:25,824 https://s3.amazonaws.com:443 "GET /lists.disconnect.me/simple_ad.txt HTTP/1.1" 200 43616
2020-10-29 06:47:25,866 *** Searching valid domains...
2020-10-29 06:47:25,873 *** domains=2702 duplicated=0.0%
2020-10-29 06:47:25,873 Starting new HTTPS connection (1): s3.amazonaws.com:443
2020-10-29 06:47:26,383 https://s3.amazonaws.com:443 "GET /lists.disconnect.me/simple_tracking.txt HTTP/1.1" 200 613
2020-10-29 06:47:26,390 *** Searching valid domains...
2020-10-29 06:47:26,390 *** domains=35 duplicated=0.0%
2020-10-29 06:47:26,393 Starting new HTTPS connection (1): raw.githubusercontent.com:443
2020-10-29 06:47:26,573 https://raw.githubusercontent.com:443 "GET /davidonzo/Threat-Intel/master/lists/latestdomains.piHole.txt HTTP/1.1" 200 21830
2020-10-29 06:47:26,575 *** Searching valid domains...
2020-10-29 06:47:26,613 *** domains=2193 duplicated=0.05%
2020-10-29 06:47:26,624 Starting new HTTPS connection (1): raw.githubusercontent.com:443
2020-10-29 06:47:26,839 https://raw.githubusercontent.com:443 "GET /mitchellkrogza/Badd-Boyz-Hosts/master/hosts HTTP/1.1" 200 7888
2020-10-29 06:47:26,850 *** Searching valid domains...
2020-10-29 06:47:26,857 *** domains=834 duplicated=0.48%
2020-10-29 06:47:26,893 blocklist total=152978 duplicated=9.57%
2020-10-29 06:47:26,941 blocklist without domains from whitelist total=152977
About
Author | Denis Machard d.machard@gmail.com |
PyPI | https://pypi.org/project/blocklist_aggregator/ |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Close
Hashes for blocklist_aggregator-0.0.8.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2b5cf98124562e2c9eaba0e502af0d956cbf4c6d32e20f9f77dafa79f8ee5a13 |
|
MD5 | 1e35ec63141abc6e0d9fbd67b0a65dbe |
|
BLAKE2b-256 | 40ccf39ed970807549b98b724044f7a89bd60b00b82fd29ed22d1f062c900e52 |
Close
Hashes for blocklist_aggregator-0.0.8-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5298af23e5fa0b55b3f4286d23637252d5206c8a2d6dec6d3035a6982da43285 |
|
MD5 | 9297f7fb1aed84adad5bf48c1f75e270 |
|
BLAKE2b-256 | 763af13295b0a7754fca5743f7c847d4d9ac7aa2ecf3274d352b4dbbdbae7e3a |