
FSM-Crawl

A high-performance parallel web crawler built with Playwright and Python. Features automatic cookie consent management and configurable crawling strategies.

Features

  • Parallel Tab Crawling: Open up to 10 tabs simultaneously for faster crawling
  • Automatic Cookie Consent: Intelligently accepts cookies across multiple sites
  • Multiple Crawling Strategies:
    • Normal crawl (no consent)
    • Normal crawl with cookies
    • Explorative crawl (probability-based navigation)
  • Request/Response Logging: Detailed CSV logs of all network activity
  • Distributed Crawling: Support for sharded crawls across multiple machines
  • Headless & Headed Modes: Run with or without browser UI
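The explorative strategy is only named above; one plausible shape for probability-based navigation (an assumption for illustration, not the crawler's actual code) is:

```python
import random

# Illustrative sketch of a probability-based "explorative" step: with
# probability stop_prob the crawler stops on the current page, otherwise it
# follows a uniformly random on-page link. The real strategy in fsm-crawl is
# not documented here, so treat this as an assumption about its shape.
def next_link(links, stop_prob=0.2, rng=None):
    rng = rng or random
    if not links or rng.random() < stop_prob:
        return None            # stop exploring from this page
    return rng.choice(links)   # hop to a random outgoing link
```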

Installation

Option 1: PyPI (Recommended)

pip install fsm-crawl

Option 2: From Git

pip install git+https://github.com/yourusername/fsm-crawl.git

Option 3: Development Install

git clone https://github.com/yourusername/fsm-crawl.git
cd fsm-crawl
pip install -e .

Quick Start

Basic Usage

# Run the default normal crawl with 10 parallel tabs on the first 1000 URLs
fsm-crawl

# Run with cookies enabled
fsm-crawl --experiment normal_with_cookies

# Run explorative crawl strategy
fsm-crawl --experiment explorative

Configuration

# Specify input URL file
fsm-crawl --path urls.csv

# Set output prefix for logs
fsm-crawl --prefix my_crawl

# Custom number of parallel tabs (1-20 recommended)
fsm-crawl --num-tabs 5

# Run in headless mode (no browser window)
fsm-crawl --headless

# Distributed crawling with shards
fsm-crawl --shard-index 0 --shard-count 4  # First shard of 4
fsm-crawl --shard-index 1 --shard-count 4  # Second shard of 4

CLI Commands

usage: fsm-crawl [-h] [--shard-index SHARD_INDEX] [--shard-count SHARD_COUNT]
                  [-e {normal,normal_with_cookies,explorative}]
                  [-p PATH] [--prefix PREFIX] [--path2 PATH2]
                  [--prefix2 PREFIX2] [--headless] [--engine {playwright}]
                  [--num-tabs NUM_TABS]

Run FSM web crawler experiments

options:
  -h, --help            show this help message and exit
  --shard-index SHARD_INDEX
                        Shard ID (for distributed runs)
  --shard-count SHARD_COUNT
                        Total number of shards (for distributed runs)
  -e {normal,normal_with_cookies,explorative}, --experiment {normal,normal_with_cookies,explorative}
                        Which experiment to run
  -p PATH, --path PATH  Path to input CSV for URL manager
  --prefix PREFIX       Filename prefix for output logs
  --path2 PATH2         Path to second CSV for two-run mode
  --prefix2 PREFIX2     Filename prefix for second run
  --headless            Run browser in headless mode
  --engine {playwright}
                        Browser engine to use
  --num-tabs NUM_TABS   Number of parallel tabs (default: 10)

Input Format

The input CSV file should list one URL per line, either as a bare line or as the first column of a standard CSV:

https://example.com
https://another-site.com
https://third-site.org
...
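A valid input file can be produced with Python's standard csv module. The example below builds the single-column format in memory; in practice, write the same rows to a file such as urls.csv:

```python
import csv
import io

# Build a single-column CSV in memory: one URL per row, first column.
urls = [
    "https://example.com",
    "https://another-site.com",
    "https://third-site.org",
]
buf = io.StringIO()
writer = csv.writer(buf)
for url in urls:
    writer.writerow([url])

# Read it back the way a URL manager might: first column of each non-empty row.
buf.seek(0)
loaded = [row[0] for row in csv.reader(buf) if row]
print(loaded)
```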

Output

The crawler generates two CSV files in the crawl_logs/ directory:

  • request_*.csv: All HTTP requests made during crawling
  • response_*.csv: All HTTP responses received

Each row contains:

  • Timestamp
  • Request method, URL, headers
  • Response status, headers, cookies
  • Classification labels for blocking
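The logs are plain CSV, so they can be summarized with the standard library. The column names below are a hypothetical excerpt for illustration; the actual headers written by the crawler may differ:

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a response_*.csv log (column names are assumed).
sample = """timestamp,url,status
2024-01-01T00:00:00Z,https://example.com/,200
2024-01-01T00:00:01Z,https://example.com/app.js,200
2024-01-01T00:00:02Z,https://example.com/missing,404
"""

# Count responses per HTTP status code.
status_counts = Counter(row["status"] for row in csv.DictReader(io.StringIO(sample)))
print(status_counts)
```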

Examples

Crawl with custom tabs and output prefix

fsm-crawl --experiment normal_with_cookies --path my_urls.csv --prefix my_crawl --num-tabs 8

Distributed crawling across 4 machines

# Machine 1
fsm-crawl --shard-index 0 --shard-count 4 --prefix distributed_crawl

# Machine 2
fsm-crawl --shard-index 1 --shard-count 4 --prefix distributed_crawl

# Machine 3
fsm-crawl --shard-index 2 --shard-count 4 --prefix distributed_crawl

# Machine 4
fsm-crawl --shard-index 3 --shard-count 4 --prefix distributed_crawl
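Sharding splits one URL list into disjoint subsets so each machine crawls a different slice. One common scheme is round-robin partitioning by index; the sketch below illustrates that idea and is an assumption, not the crawler's actual partitioning code:

```python
def shard_urls(urls, shard_index, shard_count):
    # Round-robin partition: URL i goes to shard (i % shard_count), so the
    # shards are disjoint and together cover the whole list.
    return [u for i, u in enumerate(urls) if i % shard_count == shard_index]

urls = [f"https://site{i}.example" for i in range(10)]
print(shard_urls(urls, 0, 4))  # the slice crawled by shard 0 of 4
```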

Headless mode with explorative strategy

fsm-crawl --experiment explorative --headless --num-tabs 10

Development

Install dev dependencies

pip install -e ".[dev]"

Run tests

pytest

Run with Poetry

poetry run fsm-crawl --help

Configuration Files

The crawler uses a blocking/consent-manager.yaml file to define CMP (Consent Management Platform) detection and cookie acceptance rules.
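The YAML schema is not documented here, but CMP rules of this kind typically pair a detection selector with an accept-button selector. The sketch below models that shape as plain Python data; the field names and selectors are assumptions for illustration, not the file's actual schema:

```python
# Hypothetical CMP rules mirroring what blocking/consent-manager.yaml might
# contain: a selector that detects the banner and one that accepts cookies.
rules = [
    {"cmp": "onetrust", "detect": "#onetrust-banner-sdk",
     "accept": "#onetrust-accept-btn-handler"},
    {"cmp": "cookiebot", "detect": "#CybotCookiebotDialog",
     "accept": "#CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"},
]

def accept_selector(present_ids):
    """Return the accept-button selector for the first CMP whose banner is present."""
    for rule in rules:
        if rule["detect"].lstrip("#") in present_ids:
            return rule["accept"]
    return None

print(accept_selector({"onetrust-banner-sdk"}))
```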

Architecture

  • PlaywrightEngine: Manages browser tabs and page interactions
  • BrowserManager: Coordinates parallel crawling across tabs
  • ConsentManager: Handles cookie acceptance automation
  • RequestResponseLoggingPipeline: Logs all network activity to CSV
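The coordination idea behind parallel tab crawling can be sketched with asyncio alone: bound concurrency with a semaphore so at most num-tabs visits run at once. In the real crawler each slot would drive a Playwright page; here a no-op stands in for the page visit, and all names are illustrative rather than the project's actual API:

```python
import asyncio

async def crawl(urls, num_tabs=10):
    # At most num_tabs "pages" may be open at once.
    sem = asyncio.Semaphore(num_tabs)
    results = []

    async def visit(url):
        async with sem:             # acquire a tab slot
            await asyncio.sleep(0)  # placeholder for page.goto(url) + logging
            results.append(url)

    await asyncio.gather(*(visit(u) for u in urls))
    return results

urls = [f"https://site{i}.example" for i in range(5)]
print(asyncio.run(crawl(urls, num_tabs=2)))
```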

Requirements

  • Python 3.11+
  • Chromium browser (installed automatically via Playwright)
  • 2 GB RAM minimum (4 GB+ recommended for 10 tabs)

License

MIT

Support

For issues, please open a GitHub issue or contact the maintainers.

Citation

If you use FSM-Crawl in your research, please cite:

@software{fsm-crawl,
  title={FSM-Crawl: A Parallel Web Crawler with Consent Management},
  author={Schwerdtner, Henry},
  year={2026},
  url={https://github.com/yourusername/fsm-crawl}
}
