# FSM-Crawl

A high-performance parallel web crawler built with Playwright and Python, featuring automatic cookie consent management and configurable crawling strategies.
## Features

- **Parallel Tab Crawling**: Open up to 10 tabs simultaneously for faster crawling
- **Automatic Cookie Consent**: Intelligently accepts cookie consent prompts across multiple sites
- **Multiple Crawling Strategies**:
  - Normal crawl (no consent)
  - Normal crawl with cookies
  - Explorative crawl (probability-based navigation)
- **Request/Response Logging**: Detailed CSV logs of all network activity
- **Distributed Crawling**: Support for sharded crawls across multiple machines
- **Headless & Headed Modes**: Run with or without a browser UI
## Installation

### Option 1: PyPI (Recommended)

```bash
pip install fsm-crawl
```

### Option 2: From Git

```bash
pip install git+https://github.com/yourusername/fsm-crawl.git
```

### Option 3: Development Install

```bash
git clone https://github.com/yourusername/fsm-crawl.git
cd fsm-crawl
pip install -e .
```
## Quick Start

### Basic Usage

```bash
# Run the default normal crawl with 10 parallel tabs on the first 1000 URLs
fsm-crawl

# Run with cookies enabled
fsm-crawl --experiment normal_with_cookies

# Run the explorative crawl strategy
fsm-crawl --experiment explorative
```
### Configuration

```bash
# Specify the input URL file
fsm-crawl --path urls.csv

# Set the output prefix for logs
fsm-crawl --prefix my_crawl

# Custom number of parallel tabs (1-20 recommended)
fsm-crawl --num-tabs 5

# Run in headless mode (no browser window)
fsm-crawl --headless

# Distributed crawling with shards
fsm-crawl --shard-index 0 --shard-count 4  # First shard of 4
fsm-crawl --shard-index 1 --shard-count 4  # Second shard of 4
```
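The exact shard-selection logic is internal to fsm-crawl, but round-robin slicing over the URL list is a common way to implement `--shard-index`/`--shard-count`; here is a minimal sketch of that idea (the function name and slicing strategy are illustrative assumptions, not fsm-crawl's actual code):

```python
# Hypothetical sketch of round-robin sharding over a URL list.
# fsm-crawl's real shard-selection logic may differ.

def select_shard(urls, shard_index, shard_count):
    """Return the subset of URLs this shard is responsible for."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Each shard takes every shard_count-th URL starting at its index,
    # so the full list is covered exactly once across all shards.
    return urls[shard_index::shard_count]

urls = [f"https://site-{i}.example" for i in range(10)]
shards = [select_shard(urls, i, 4) for i in range(4)]
```

With this scheme, every machine can be given the same input file, and the shard parameters alone determine which URLs it visits.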
## CLI Commands

```text
usage: fsm-crawl [-h] [--shard-index SHARD_INDEX] [--shard-count SHARD_COUNT]
                 [-e {normal,normal_with_cookies,explorative}]
                 [-p PATH] [--prefix PREFIX] [--path2 PATH2]
                 [--prefix2 PREFIX2] [--headless] [--engine {playwright}]
                 [--num-tabs NUM_TABS]

Run FSM web crawler experiments

optional arguments:
  -h, --help            show this help message and exit
  --shard-index SHARD_INDEX
                        Shard ID (for distributed runs)
  --shard-count SHARD_COUNT
                        Total number of shards (for distributed runs)
  -e {normal,normal_with_cookies,explorative}, --experiment {normal,normal_with_cookies,explorative}
                        Which experiment to run
  -p PATH, --path PATH  Path to input CSV for URL manager
  --prefix PREFIX       Filename prefix for output logs
  --path2 PATH2         Path to second CSV for two-run mode
  --prefix2 PREFIX2     Filename prefix for second run
  --headless            Run browser in headless mode
  --engine {playwright}
                        Browser engine to use
  --num-tabs NUM_TABS   Number of parallel tabs (default: 10)
```
## Input Format

The input CSV file should contain URLs, one per line or in standard CSV format:

```text
https://example.com
https://another-site.com
https://third-site.org
...
```
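How fsm-crawl parses this file internally isn't documented here, but a file in this shape can be read with nothing beyond the standard library; the sketch below assumes the URL is the first column of each row (an assumption, not fsm-crawl's actual parser):

```python
import csv

def load_urls(path):
    """Read URLs from a CSV file, taking the first column of each row
    and skipping blank lines. Illustrative only; fsm-crawl's own
    parser may behave differently."""
    urls = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls
```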
## Output

The crawler generates two CSV files in the `crawl_logs/` directory:

- `request_*.csv`: All HTTP requests made during crawling
- `response_*.csv`: All HTTP responses received

Each row contains:

- Timestamp
- Request method, URL, and headers
- Response status, headers, and cookies
- Classification labels for blocking
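The exact column names in the logs aren't documented here, but since the output is plain CSV, a quick summary pass needs only the standard library. The sketch below assumes the request log has a header row with a `method` column (an assumption; check your generated files for the real column names):

```python
import csv
from collections import Counter

def count_methods(log_path):
    """Tally HTTP methods in a request log CSV.
    Assumes a header row with a 'method' column; the actual
    fsm-crawl column names may differ."""
    with open(log_path, newline="") as f:
        return Counter(row["method"] for row in csv.DictReader(f))
```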
## Examples

### Crawl with custom tabs and output prefix

```bash
fsm-crawl --experiment normal_with_cookies --path my_urls.csv --prefix my_crawl --num-tabs 8
```

### Distributed crawling across 4 machines

```bash
# Machine 1
fsm-crawl --shard-index 0 --shard-count 4 --prefix distributed_crawl

# Machine 2
fsm-crawl --shard-index 1 --shard-count 4 --prefix distributed_crawl

# Machine 3
fsm-crawl --shard-index 2 --shard-count 4 --prefix distributed_crawl

# Machine 4
fsm-crawl --shard-index 3 --shard-count 4 --prefix distributed_crawl
```

### Headless mode with explorative strategy

```bash
fsm-crawl --experiment explorative --headless --num-tabs 10
```
## Development

### Install dev dependencies

```bash
pip install -e ".[dev]"
```

### Run tests

```bash
pytest
```

### Run with Poetry

```bash
poetry run fsm-crawl --help
```
## Configuration Files

The crawler uses a `blocking/consent-manager.yaml` file to define CMP (Consent Management Platform) detection and cookie-acceptance rules.
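The schema of `consent-manager.yaml` isn't documented here. As a rough orientation, per-CMP rules in such files typically pair a detection selector with an acceptance action; the sketch below is entirely hypothetical (every key and selector is invented for illustration, so consult the shipped file for the real schema):

```yaml
# Hypothetical structure; the real consent-manager.yaml schema may differ.
cmps:
  - name: example-cmp
    detect:
      selector: "#example-consent-banner"   # invented selector
    accept:
      click: "button.accept-all"            # invented selector
```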
## Architecture

- `PlaywrightEngine`: Manages browser tabs and page interactions
- `BrowserManager`: Coordinates parallel crawling across tabs
- `ConsentManager`: Handles cookie-acceptance automation
- `RequestResponseLoggingPipeline`: Logs all network activity to CSV
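The components above aren't shown in code here. As an illustration of the coordination pattern a `BrowserManager`-style component implies (capping the number of tabs in flight), here is a stdlib-only asyncio sketch; all names are invented and the page work is stubbed out, so this is a pattern demo, not fsm-crawl's implementation:

```python
import asyncio

async def crawl_all(urls, num_tabs=10):
    """Crawl URLs with at most num_tabs in flight, mirroring how a
    tab-pool manager might cap open browser tabs.
    Illustrative sketch; not fsm-crawl's actual implementation."""
    sem = asyncio.Semaphore(num_tabs)

    async def crawl_one(url):
        async with sem:          # wait for a free "tab" slot
            await asyncio.sleep(0)  # stand-in for real page work
            return url, "ok"

    # gather preserves input order even though tasks finish concurrently
    return await asyncio.gather(*(crawl_one(u) for u in urls))

results = asyncio.run(
    crawl_all([f"https://site-{i}.example" for i in range(25)], num_tabs=5)
)
```

A semaphore keeps the implementation simple: each pending URL waits for a slot rather than being pre-partitioned across workers, which naturally balances load when pages take uneven time.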
## Requirements

- Python 3.11+
- Chromium browser (installed automatically via Playwright)
- 2 GB RAM minimum (4 GB+ recommended for 10 tabs)

## License

MIT

## Support

For issues, please open a GitHub issue or contact the maintainers.
## Citation

If you use FSM-Crawl in your research, please cite:

```bibtex
@software{fsm-crawl,
  title  = {FSM-Crawl: A Parallel Web Crawler with Consent Management},
  author = {Schwerdtner, Henry},
  year   = {2026},
  url    = {https://github.com/yourusername/fsm-crawl}
}
```
## File details

### fsm_crawl-0.2.0.tar.gz

- Download URL: fsm_crawl-0.2.0.tar.gz
- Size: 18.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | 24becc1f5e198c378825cc66643e656ed6da73a6c8a712bb254e90dd47510e27 |
| MD5 | 6dd02e0cb17e804ddc6dbe748eb6148f |
| BLAKE2b-256 | f06036e3c45abe913848ae39629b0356c7300888d9aa05083df66dce97eee80b |
#### Provenance

The following attestation bundles were made for fsm_crawl-0.2.0.tar.gz:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fsm_crawl-0.2.0.tar.gz
- Subject digest: 24becc1f5e198c378825cc66643e656ed6da73a6c8a712bb254e90dd47510e27
- Sigstore transparency entry: 864170908
- Permalink: tracker-detector/fsm-crawl@a7ff027d189222dd67ef20b6f0132b435c88d3de
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/tracker-detector
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy.yaml@a7ff027d189222dd67ef20b6f0132b435c88d3de
- Trigger Event: release
### fsm_crawl-0.2.0-py3-none-any.whl

- Download URL: fsm_crawl-0.2.0-py3-none-any.whl
- Size: 785.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | b25ea930a23ca57269a7deb3ab394c3ed968031752c1c89740669fe9d096002c |
| MD5 | 6bf667143f9f2e2f0fb399d5666f051d |
| BLAKE2b-256 | c8bfda0dfa439c0af48a8a90a2538d7e00e56af5dbc912a1e285cab4d56a8635 |
#### Provenance

The following attestation bundles were made for fsm_crawl-0.2.0-py3-none-any.whl:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fsm_crawl-0.2.0-py3-none-any.whl
- Subject digest: b25ea930a23ca57269a7deb3ab394c3ed968031752c1c89740669fe9d096002c
- Sigstore transparency entry: 864170919
- Permalink: tracker-detector/fsm-crawl@a7ff027d189222dd67ef20b6f0132b435c88d3de
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/tracker-detector
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy.yaml@a7ff027d189222dd67ef20b6f0132b435c88d3de
- Trigger Event: release