
A parallel web crawler with consent management using Playwright

Project description

FSM-Crawl

A high-performance parallel web crawler built with Playwright and Python. Features automatic cookie consent management and configurable crawling strategies.

Features

  • Parallel Tab Crawling: Open up to 10 tabs simultaneously for faster crawling
  • Automatic Cookie Consent: Intelligently accepts cookies across multiple sites
  • Multiple Crawling Strategies:
    • Normal crawl (no consent)
    • Normal crawl with cookies
    • Explorative crawl (probability-based navigation)
  • Request/Response Logging: Detailed CSV logs of all network activity
  • Distributed Crawling: Support for sharded crawls across multiple machines
  • Headless & Headed Modes: Run with or without browser UI
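The "probability-based navigation" of the explorative strategy can be pictured as an independent coin flip per discovered link. A toy sketch of that idea (the package's actual strategy and its parameters are internal and may differ):

```python
import random

# Toy model of probability-based navigation: follow each discovered
# link independently with probability follow_prob.
def pick_links(links, follow_prob, rng=None):
    rng = rng or random.Random()
    return [link for link in links if rng.random() < follow_prob]

links = [f"https://example.com/page{i}" for i in range(100)]
# With follow_prob=0.3, roughly 30% of the links survive the filter.
sampled = pick_links(links, 0.3, random.Random(42))
```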

Installation

Option 1: PyPI (Recommended)

pip install fsm-crawl

Option 2: From Git

pip install git+https://github.com/yourusername/fsm-crawl.git

Option 3: Development Install

git clone https://github.com/yourusername/fsm-crawl.git
cd fsm-crawl
pip install -e .

Quick Start

Basic Usage

# Run default normal crawl with 10 parallel tabs on first 1000 URLs
fsm-crawl

# Run with cookies enabled
fsm-crawl --experiment normal_with_cookies

# Run explorative crawl strategy
fsm-crawl --experiment explorative

Configuration

# Specify input URL file
fsm-crawl --path urls.csv

# Set output prefix for logs
fsm-crawl --prefix my_crawl

# Custom number of parallel tabs (1-20 recommended)
fsm-crawl --num-tabs 5

# Run in headless mode (no browser window)
fsm-crawl --headless

# Distributed crawling with shards
fsm-crawl --shard-index 0 --shard-count 4  # First shard of 4
fsm-crawl --shard-index 1 --shard-count 4  # Second shard of 4
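Sharding usually partitions the URL list deterministically so that shards are disjoint and together cover every URL. One plausible scheme is a round-robin split by index (an assumption for illustration; the package's actual partitioning may differ):

```python
# Round-robin partition: shard i takes every shard_count-th URL,
# starting at offset i. Shards are disjoint and cover all URLs.
def shard_urls(urls, shard_index, shard_count):
    return urls[shard_index::shard_count]

urls = [f"https://site-{i}.example" for i in range(10)]
shards = [shard_urls(urls, i, 4) for i in range(4)]
```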

CLI Commands

usage: fsm-crawl [-h] [--shard-index SHARD_INDEX] [--shard-count SHARD_COUNT]
                  [-e {normal,normal_with_cookies,explorative}]
                  [-p PATH] [--prefix PREFIX] [--path2 PATH2]
                  [--prefix2 PREFIX2] [--headless] [--engine {playwright}]
                  [--num-tabs NUM_TABS]

Run FSM web crawler experiments

optional arguments:
  -h, --help            show this help message and exit
  --shard-index SHARD_INDEX
                        Shard ID (for distributed runs)
  --shard-count SHARD_COUNT
                        Total number of shards (for distributed runs)
  -e {normal,normal_with_cookies,explorative}, --experiment {normal,normal_with_cookies,explorative}
                        Which experiment to run
  -p PATH, --path PATH  Path to input CSV for URL manager
  --prefix PREFIX       Filename prefix for output logs
  --path2 PATH2         Path to second CSV for two-run mode
  --prefix2 PREFIX2     Filename prefix for second run
  --headless            Run browser in headless mode
  --engine {playwright}
                        Browser engine to use
  --num-tabs NUM_TABS   Number of parallel tabs (default: 10)
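The help text above maps onto an argparse parser roughly like the following. This is a reconstruction from the usage string, not the package's source; the defaults marked "assumed" are guesses:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="fsm-crawl", description="Run FSM web crawler experiments"
    )
    parser.add_argument("--shard-index", type=int, help="Shard ID (for distributed runs)")
    parser.add_argument("--shard-count", type=int, help="Total number of shards")
    parser.add_argument("-e", "--experiment", default="normal",  # assumed default
                        choices=["normal", "normal_with_cookies", "explorative"],
                        help="Which experiment to run")
    parser.add_argument("-p", "--path", help="Path to input CSV for URL manager")
    parser.add_argument("--prefix", help="Filename prefix for output logs")
    parser.add_argument("--path2", help="Path to second CSV for two-run mode")
    parser.add_argument("--prefix2", help="Filename prefix for second run")
    parser.add_argument("--headless", action="store_true", help="Run browser headless")
    parser.add_argument("--engine", choices=["playwright"], default="playwright")
    parser.add_argument("--num-tabs", type=int, default=10, help="Parallel tabs")
    return parser

args = build_parser().parse_args(["--experiment", "explorative", "--num-tabs", "5"])
```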

Input Format

The input CSV file should contain one URL per line, either as a bare list or as the first column of a standard CSV:

https://example.com
https://another-site.com
https://third-site.org
...
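Reading such a file in Python is straightforward. A sketch that takes the first column of each row and skips blank rows (it assumes the file has no header row):

```python
import csv

# Read URLs from the first column of a CSV, skipping blank rows.
def load_urls(path):
    with open(path, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
```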

Output

The crawler generates two CSV files in the crawl_logs/ directory:

  • request_*.csv: All HTTP requests made during crawling
  • response_*.csv: All HTTP responses received

Each row contains:

  • Timestamp
  • Request method, URL, headers
  • Response status, headers, cookies
  • Classification labels for blocking
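After a crawl, the logs can be post-processed with the standard library. A sketch that tallies response status codes, assuming the CSV has a header row with a `status` column (the actual column names are not documented here and may differ):

```python
import csv
from collections import Counter

# Tally HTTP status codes from a response log CSV.
# Assumes a header row with a "status" column — an assumption.
def status_counts(response_csv_path):
    with open(response_csv_path, newline="") as f:
        return Counter(row["status"] for row in csv.DictReader(f))
```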

Examples

Crawl with custom tabs and output prefix

fsm-crawl --experiment normal_with_cookies --path my_urls.csv --prefix my_crawl --num-tabs 8

Distributed crawling across 4 machines

# Machine 1
fsm-crawl --shard-index 0 --shard-count 4 --prefix distributed_crawl

# Machine 2
fsm-crawl --shard-index 1 --shard-count 4 --prefix distributed_crawl

# Machine 3
fsm-crawl --shard-index 2 --shard-count 4 --prefix distributed_crawl

# Machine 4
fsm-crawl --shard-index 3 --shard-count 4 --prefix distributed_crawl

Headless mode with explorative strategy

fsm-crawl --experiment explorative --headless --num-tabs 10

Development

Install dev dependencies

pip install -e ".[dev]"

Run tests

pytest

Run with Poetry

poetry run fsm-crawl --help

Configuration Files

The crawler uses a blocking/consent-manager.yaml file to define CMP (Consent Management Platform) detection and cookie acceptance rules.
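The schema of that file is defined by the package itself. Purely as an illustration, CMP rule files of this kind often pair a detection selector with an accept-button selector; every key below is hypothetical:

```yaml
# Hypothetical shape — consult the shipped consent-manager.yaml for the real schema.
cmps:
  - name: example-cmp
    detect: "div#cookie-banner"   # CSS selector that signals the CMP is present
    accept: "button.accept-all"   # element clicked to grant consent
```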

Architecture

  • PlaywrightEngine: Manages browser tabs and page interactions
  • BrowserManager: Coordinates parallel crawling across tabs
  • ConsentManager: Handles cookie acceptance automation
  • RequestResponseLoggingPipeline: Logs all network activity to CSV

Requirements

  • Python 3.11+
  • Chromium browser (installed automatically via Playwright)
  • 2GB RAM minimum (recommended 4GB+ for 10 tabs)

License

MIT

Support

For issues, please open a GitHub issue or contact the maintainers.

Citation

If you use FSM-Crawl in your research, please cite:

@software{fsm-crawl,
  title={FSM-Crawl: A Parallel Web Crawler with Consent Management},
  author={Schwerdtner, Henry},
  year={2026},
  url={https://github.com/yourusername/fsm-crawl}
}

Project details


Download files

Download the file for your platform.

Source Distribution

fsm_crawl-0.2.0.tar.gz (18.2 kB)

Uploaded Source

Built Distribution


fsm_crawl-0.2.0-py3-none-any.whl (785.0 kB)

Uploaded Python 3

File details

Details for the file fsm_crawl-0.2.0.tar.gz.

File metadata

  • Download URL: fsm_crawl-0.2.0.tar.gz
  • Upload date:
  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fsm_crawl-0.2.0.tar.gz:

  • SHA256: 24becc1f5e198c378825cc66643e656ed6da73a6c8a712bb254e90dd47510e27
  • MD5: 6dd02e0cb17e804ddc6dbe748eb6148f
  • BLAKE2b-256: f06036e3c45abe913848ae39629b0356c7300888d9aa05083df66dce97eee80b


Provenance

The following attestation bundles were made for fsm_crawl-0.2.0.tar.gz:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fsm_crawl-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: fsm_crawl-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 785.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fsm_crawl-0.2.0-py3-none-any.whl:

  • SHA256: b25ea930a23ca57269a7deb3ab394c3ed968031752c1c89740669fe9d096002c
  • MD5: 6bf667143f9f2e2f0fb399d5666f051d
  • BLAKE2b-256: c8bfda0dfa439c0af48a8a90a2538d7e00e56af5dbc912a1e285cab4d56a8635


Provenance

The following attestation bundles were made for fsm_crawl-0.2.0-py3-none-any.whl:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
