ConsentCrawl

Automatically check for GDPR/CCPA consent by running a Playwright headless browser to check for marketing and analytics scripts firing before and after consent.

  • Detect 25+ consent managers
  • Detect unconsented third-party domains and cookies
  • Classify tracking domains based on 7 commonly used ad blocking lists
  • Keep screenshots before and after consent
  • Capture JSON-LD and meta tags for convenience
  • Run multiple URLs in batch
  • Add custom blocklists and consent manager lists

CLI Arguments

usage:

consentcrawl [-h] [--debug] [--headless [HEADLESS]] [--screenshot] [--bootstrap]
                    [--batch_size BATCH_SIZE] [--show_output] [--db_file DB_FILE]
                    [--blocklists BLOCKLISTS]
                    url
Argument Description
url (required) URL or file with URLs to test
--debug Enable debug logging
--headless Run browser in headless mode (true/false)
--screenshot Take screenshots of each page before and after consent is given (if a consent manager is detected)
--bootstrap Force bootstrap (refresh) of blocklists
--batch_size, -b Number of URLs (and browser windows) to run in each batch. Default: 15; increase or decrease depending on your system capacity.
--show_output, -o Show output of the last results in terminal (max 25 results)
--db_file, -db Path to crawl results and blocklist database
--blocklists, -bf Path to custom blocklists file (YAML)

In action

Download and install with: pip install consentcrawl

The Playwright (headless) browsers are not installed automatically. Run playwright install to install all of them, or name a specific one, e.g. playwright install chromium

When running consentcrawl, you can provide a single URL, a comma-separated list of URLs, or a file (.txt) with one URL per line:

consentcrawl google.com,google.nl,google.de --headless=false -o

If you have jq installed you can pipe the output to jq to directly get, for example, all tracking domains without consent:

consentcrawl leboncoin.fr,marktplaats.nl,ebay.com -o | jq '.[] | .tracking_domains_no_consent'

Returns:

[
  "tiqcdn.com",
  "criteo.net",
  "googlesyndication.com",
  "doubleclick.net"
]
[
  "google-analytics.com",
  "scorecardresearch.com",
  "bing.com",
  "doubleclick.net",
  "googleadservices.com",
  "criteo.net",
  "googletagmanager.com",
  "spotxchange.com"
]
[
  "doubleclick.net"
]
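
The same filtering can be done in Python instead of jq. The sketch below assumes the -o output is a JSON array of objects that each carry a tracking_domains_no_consent list (the key used in the jq filter above). The sample data is abbreviated from the output shown, and count_unconsented_domains is an illustrative helper, not part of ConsentCrawl.

```python
import json
from collections import Counter

def count_unconsented_domains(raw_json: str) -> Counter:
    """Count how many crawled sites contacted each tracking domain before consent."""
    results = json.loads(raw_json)
    counts = Counter()
    for site in results:
        # A missing or null key is treated as "no domains found".
        counts.update(set(site.get("tracking_domains_no_consent") or []))
    return counts

# Abbreviated sample shaped like the CLI output (assumed structure).
sample = json.dumps([
    {"tracking_domains_no_consent": ["tiqcdn.com", "criteo.net", "googlesyndication.com", "doubleclick.net"]},
    {"tracking_domains_no_consent": ["google-analytics.com", "criteo.net", "doubleclick.net"]},
    {"tracking_domains_no_consent": ["doubleclick.net"]},
])
counts = count_unconsented_domains(sample)
print(counts.most_common(2))  # [('doubleclick.net', 3), ('criteo.net', 2)]
```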

By default the results of your queries will be stored in a SQLite database called crawl_results.db.
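
The table layout inside crawl_results.db is not described here, so rather than guess table names, the following sketch lists whatever tables exist and their row counts using SQLite's built-in sqlite_master catalog. summarize_tables is an illustrative helper name.

```python
import sqlite3

def summarize_tables(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for every table in the SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come from the catalog itself; double quotes guard unusual names.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling summarize_tables("crawl_results.db") after a crawl should show which tables were written. Note that sqlite3.connect creates an empty file if the path does not exist yet.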

Or, if you want to import it into an existing Python script:

import asyncio
from consentcrawl import crawl

results = asyncio.run(crawl.crawl_single("dumky.net"))

The Playwright browser runs asynchronously, which is great for running multiple URLs in parallel. To run a single URL from regular (synchronous) code, wrap the asynchronous function in asyncio.run().
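
For several URLs from Python, one option is to await the coroutine in fixed-size batches with asyncio.gather, mirroring the CLI's --batch_size default of 15. This is a sketch, not ConsentCrawl's own batching code: run_in_batches and fake_crawl are hypothetical names, and fake_crawl stands in for crawl.crawl_single so the example runs without a browser.

```python
import asyncio
from typing import Any, Awaitable, Callable

async def run_in_batches(
    urls: list[str],
    fetch: Callable[[str], Awaitable[Any]],
    batch_size: int = 15,
) -> list[Any]:
    """Await fetch(url) for every URL, at most batch_size at a time."""
    results: list[Any] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # return_exceptions=True puts an exception in the result list
        # instead of raising on the first failing URL.
        results.extend(
            await asyncio.gather(*(fetch(u) for u in batch), return_exceptions=True)
        )
    return results

async def fake_crawl(url: str) -> dict:
    # Stand-in for crawl.crawl_single; returns a placeholder result.
    await asyncio.sleep(0)
    return {"url": url}

results = asyncio.run(
    run_in_batches(["a.example", "b.example", "c.example"], fake_crawl, batch_size=2)
)
print(results)
```

With the library installed, passing crawl.crawl_single as fetch should work the same way, assuming it accepts a single URL string as in the example above.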

How it works

Playwright allows you to automate browser windows. This script takes a list of URLs, runs a Playwright browser instance and fetches data about cookies and requested domains for each URL. The URLs are fetched asynchronously and in batches to speed up the process. After a URL is fetched, the script tries to identify the consent manager and click 'accept' to determine whether, and which, marketing and analytics tags fire before and after consent. It uses a 'blocklist' to decide whether a domain is a tracking (marketing/analytics) domain.
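
The before/after comparison can be pictured as set arithmetic over the requested domains. The sketch below is illustrative only: the blocklist is a tiny hand-written set, the matching rule (a host matches if it equals a listed domain or is a subdomain of one) is an assumption, and is_tracking and compare_consent are hypothetical names rather than ConsentCrawl functions.

```python
def is_tracking(host: str, blocklist: set[str]) -> bool:
    """True if host equals a blocklisted domain or is a subdomain of one."""
    parts = host.lower().rstrip(".").split(".")
    # Check "a.b.c", then "b.c", then "c" against the blocklist.
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))

def compare_consent(
    before: set[str], after: set[str], blocklist: set[str]
) -> dict[str, set[str]]:
    """Split tracking hosts into those seen before consent and those seen only after."""
    tracking_before = {h for h in before if is_tracking(h, blocklist)}
    tracking_after = {h for h in after if is_tracking(h, blocklist)}
    return {
        "no_consent": tracking_before,
        "only_after_consent": tracking_after - tracking_before,
    }

blocklist = {"doubleclick.net", "criteo.net"}
report = compare_consent(
    before={"example.com", "stats.doubleclick.net"},
    after={"example.com", "stats.doubleclick.net", "static.criteo.net"},
    blocklist=blocklist,
)
print(report)
```

Hosts in no_consent are the interesting ones for a compliance check, since they were contacted before the banner was accepted.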

Available Consent Managers:

  • OneTrust
  • Optanon
  • CookieLaw (CookiePro)
  • Drupal EU Cookie Compliance
  • JoomlaShaper SP Cookie Consent Extension
  • FastCMP
  • Google Funding Choices
  • Klaro
  • Ensighten
  • GX Software
  • EZ Cookie
  • CookieBot
  • CookieHub
  • TYPO3 Wacon Cookie Management Extension
  • TYPO3 Cookie Consent Extension
  • Commanders Act - Trust Commander
  • CookieFirst
  • Osano
  • Orejime
  • Axeptio
  • Civic UK Cookie Control
  • UserCentrics
  • CookieYes
  • Secure Privacy
  • Quantcast
  • Didomi
  • MediaVine CMP
  • CookieLaw
  • ConsentManager.net
  • HubSpot Cookie Banner
  • LiveRamp PrivacyManager.io
  • TrustArc Truste
  • SFBX AppConsent
  • Piwik PRO GDPR Consent Manager
  • Finsweet Cookie Consent for Webflow
  • Non-specific / Custom (looks for general CSS selectors like "#acceptCookies" or ".cookie-accept")
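
For the non-specific fallback, trying a list of generic selectors in order might look like the following with Playwright's async API. This is a sketch under assumptions: click_first_match is a hypothetical helper, the selector list contains only the two examples named above, and page is expected to be a Playwright Page (only page.locator, Locator.count and Locator.first.click are used).

```python
from typing import Optional, Sequence

# Only the two generic selectors mentioned in the list above.
GENERIC_SELECTORS = ("#acceptCookies", ".cookie-accept")

async def click_first_match(
    page, selectors: Sequence[str] = GENERIC_SELECTORS
) -> Optional[str]:
    """Click the first selector that matches an element; return it, or None."""
    for selector in selectors:
        locator = page.locator(selector)
        if await locator.count() > 0:
            # Several elements may match; click the first one.
            await locator.first.click()
            return selector
    return None
```

A real crawler would likely also check that the element is visible and handle click timeouts, which this sketch leaves out.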

Are you missing a consent manager? Have a look at the full list and feel free to open an issue or pull request!

Examples

The examples folder shows how to run ConsentCrawl:

  • as a GitHub Action
  • on Google Cloud Run with a simple FastAPI server that responds with the ConsentCrawl results on a POST request to a /consentcrawl endpoint.

To Do

  • Follow redirects on URLs
  • Detect consent managers with cookies instead of just CSS selectors
  • Show progress when using CLI
