
A parallel web crawler with consent management using Playwright

Project description

FSM-Crawl

A high-performance parallel web crawler built with Playwright and Python. Features automatic cookie consent management and configurable crawling strategies.

Features

  • Parallel Tab Crawling: Open up to 10 tabs simultaneously for faster crawling
  • Automatic Cookie Consent: Intelligently accepts cookies across multiple sites
  • Multiple Crawling Strategies:
    • Normal crawl (no consent)
    • Normal crawl with cookies
    • Explorative crawl (probability-based navigation)
  • Request/Response Logging: Detailed CSV logs of all network activity
  • Distributed Crawling: Support for sharded crawls across multiple machines
  • Headless & Headed Modes: Run with or without browser UI
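The "probability-based navigation" of the explorative strategy can be pictured as an independent coin flip per discovered link. A toy sketch of that idea (the package's actual strategy and its parameters are internal and may differ):

```python
import random

# Toy model of probability-based navigation: follow each discovered
# link independently with probability follow_prob.
def pick_links(links, follow_prob, rng=None):
    rng = rng or random.Random()
    return [link for link in links if rng.random() < follow_prob]

links = [f"https://example.com/page{i}" for i in range(100)]
# With follow_prob=0.3, roughly 30% of the links survive the filter.
sampled = pick_links(links, 0.3, random.Random(42))
```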

Installation

Option 1: PyPI (Recommended)

pip install fsm-crawl

Option 2: From Git

pip install git+https://github.com/yourusername/fsm-crawl.git

Option 3: Development Install

git clone https://github.com/yourusername/fsm-crawl.git
cd fsm-crawl
pip install -e .

Quick Start

Basic Usage

# Run default normal crawl with 10 parallel tabs on first 1000 URLs
fsm-crawl

# Run with cookies enabled
fsm-crawl --experiment normal_with_cookies

# Run explorative crawl strategy
fsm-crawl --experiment explorative

Configuration

# Specify input URL file
fsm-crawl --path urls.csv

# Set output prefix for logs
fsm-crawl --prefix my_crawl

# Custom number of parallel tabs (1-20 recommended)
fsm-crawl --num-tabs 5

# Run in headless mode (no browser window)
fsm-crawl --headless

# Distributed crawling with shards
fsm-crawl --shard-index 0 --shard-count 4  # First shard of 4
fsm-crawl --shard-index 1 --shard-count 4  # Second shard of 4
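Sharding usually partitions the URL list deterministically so that shards are disjoint and together cover every URL. One plausible scheme is a round-robin split by index (an assumption for illustration; the package's actual partitioning may differ):

```python
# Round-robin partition: shard i takes every shard_count-th URL,
# starting at offset i. Shards are disjoint and cover all URLs.
def shard_urls(urls, shard_index, shard_count):
    return urls[shard_index::shard_count]

urls = [f"https://site-{i}.example" for i in range(10)]
shards = [shard_urls(urls, i, 4) for i in range(4)]
```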

CLI Commands

usage: fsm-crawl [-h] [--shard-index SHARD_INDEX] [--shard-count SHARD_COUNT]
                  [-e {normal,normal_with_cookies,explorative}]
                  [-p PATH] [--prefix PREFIX] [--path2 PATH2]
                  [--prefix2 PREFIX2] [--headless] [--engine {playwright}]
                  [--num-tabs NUM_TABS]

Run FSM web crawler experiments

optional arguments:
  -h, --help            show this help message and exit
  --shard-index SHARD_INDEX
                        Shard ID (for distributed runs)
  --shard-count SHARD_COUNT
                        Total number of shards (for distributed runs)
  -e {normal,normal_with_cookies,explorative}, --experiment {normal,normal_with_cookies,explorative}
                        Which experiment to run
  -p PATH, --path PATH  Path to input CSV for URL manager
  --prefix PREFIX       Filename prefix for output logs
  --path2 PATH2         Path to second CSV for two-run mode
  --prefix2 PREFIX2     Filename prefix for second run
  --headless            Run browser in headless mode
  --engine {playwright}
                        Browser engine to use
  --num-tabs NUM_TABS   Number of parallel tabs (default: 10)
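The help text above maps onto an argparse parser roughly like the following. This is a reconstruction from the usage string, not the package's source; the defaults marked "assumed" are guesses:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="fsm-crawl", description="Run FSM web crawler experiments"
    )
    parser.add_argument("--shard-index", type=int, help="Shard ID (for distributed runs)")
    parser.add_argument("--shard-count", type=int, help="Total number of shards")
    parser.add_argument("-e", "--experiment", default="normal",  # assumed default
                        choices=["normal", "normal_with_cookies", "explorative"],
                        help="Which experiment to run")
    parser.add_argument("-p", "--path", help="Path to input CSV for URL manager")
    parser.add_argument("--prefix", help="Filename prefix for output logs")
    parser.add_argument("--path2", help="Path to second CSV for two-run mode")
    parser.add_argument("--prefix2", help="Filename prefix for second run")
    parser.add_argument("--headless", action="store_true", help="Run browser headless")
    parser.add_argument("--engine", choices=["playwright"], default="playwright")
    parser.add_argument("--num-tabs", type=int, default=10, help="Parallel tabs")
    return parser

args = build_parser().parse_args(["--experiment", "explorative", "--num-tabs", "5"])
```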

Input Format

The input CSV file should contain one URL per line, either as a bare list or as the first column of a standard CSV:

https://example.com
https://another-site.com
https://third-site.org
...
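Reading such a file in Python is straightforward. A sketch that takes the first column of each row and skips blank rows (it assumes the file has no header row):

```python
import csv

# Read URLs from the first column of a CSV, skipping blank rows.
def load_urls(path):
    with open(path, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
```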

Output

The crawler generates two CSV files in the crawl_logs/ directory:

  • request_*.csv: All HTTP requests made during crawling
  • response_*.csv: All HTTP responses received

Each row contains:

  • Timestamp
  • Request method, URL, headers
  • Response status, headers, cookies
  • Classification labels for blocking
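After a crawl, the logs can be post-processed with the standard library. A sketch that tallies response status codes, assuming the CSV has a header row with a `status` column (the actual column names are not documented here and may differ):

```python
import csv
from collections import Counter

# Tally HTTP status codes from a response log CSV.
# Assumes a header row with a "status" column — an assumption.
def status_counts(response_csv_path):
    with open(response_csv_path, newline="") as f:
        return Counter(row["status"] for row in csv.DictReader(f))
```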

Examples

Crawl with custom tabs and output prefix

fsm-crawl --experiment normal_with_cookies --path my_urls.csv --prefix my_crawl --num-tabs 8

Distributed crawling across 4 machines

# Machine 1
fsm-crawl --shard-index 0 --shard-count 4 --prefix distributed_crawl

# Machine 2
fsm-crawl --shard-index 1 --shard-count 4 --prefix distributed_crawl

# Machine 3
fsm-crawl --shard-index 2 --shard-count 4 --prefix distributed_crawl

# Machine 4
fsm-crawl --shard-index 3 --shard-count 4 --prefix distributed_crawl

Headless mode with explorative strategy

fsm-crawl --experiment explorative --headless --num-tabs 10

Development

Install dev dependencies

pip install -e ".[dev]"

Run tests

pytest

Run with Poetry

poetry run fsm-crawl --help

Configuration Files

The crawler uses a blocking/consent-manager.yaml file to define CMP (Consent Management Platform) detection and cookie acceptance rules.
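The schema of that file is defined by the package itself. Purely as an illustration, CMP rule files of this kind often pair a detection selector with an accept-button selector; every key below is hypothetical:

```yaml
# Hypothetical shape — consult the shipped consent-manager.yaml for the real schema.
cmps:
  - name: example-cmp
    detect: "div#cookie-banner"   # CSS selector that signals the CMP is present
    accept: "button.accept-all"   # element clicked to grant consent
```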

Architecture

  • PlaywrightEngine: Manages browser tabs and page interactions
  • BrowserManager: Coordinates parallel crawling across tabs
  • ConsentManager: Handles cookie acceptance automation
  • RequestResponseLoggingPipeline: Logs all network activity to CSV

Requirements

  • Python 3.11+
  • Chromium browser (installed automatically via Playwright)
  • 2GB RAM minimum (recommended 4GB+ for 10 tabs)

License

MIT

Support

For issues, please open a GitHub issue or contact the maintainers.

Citation

If you use FSM-Crawl in your research, please cite:

@software{fsm-crawl,
  title={FSM-Crawl: A Parallel Web Crawler with Consent Management},
  author={Schwerdtner, Henry},
  year={2026},
  url={https://github.com/yourusername/fsm-crawl}
}

Project details


Download files

Download the file for your platform.

Source Distribution

fsm_crawl-0.2.0.tar.gz (18.2 kB)

Uploaded Source

Built Distribution


fsm_crawl-0.2.0-py3-none-any.whl (785.0 kB)

Uploaded Python 3

File details

Details for the file fsm_crawl-0.2.0.tar.gz.

File metadata

  • Download URL: fsm_crawl-0.2.0.tar.gz
  • Upload date:
  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fsm_crawl-0.2.0.tar.gz:

  • SHA256: 24becc1f5e198c378825cc66643e656ed6da73a6c8a712bb254e90dd47510e27
  • MD5: 6dd02e0cb17e804ddc6dbe748eb6148f
  • BLAKE2b-256: f06036e3c45abe913848ae39629b0356c7300888d9aa05083df66dce97eee80b


Provenance

The following attestation bundles were made for fsm_crawl-0.2.0.tar.gz:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fsm_crawl-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: fsm_crawl-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 785.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fsm_crawl-0.2.0-py3-none-any.whl:

  • SHA256: b25ea930a23ca57269a7deb3ab394c3ed968031752c1c89740669fe9d096002c
  • MD5: 6bf667143f9f2e2f0fb399d5666f051d
  • BLAKE2b-256: c8bfda0dfa439c0af48a8a90a2538d7e00e56af5dbc912a1e285cab4d56a8635


Provenance

The following attestation bundles were made for fsm_crawl-0.2.0-py3-none-any.whl:

Publisher: deploy.yaml on tracker-detector/fsm-crawl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
