A stealth web crawler using headless Chrome

Stealth Crawler

A headless-Chrome web crawler that discovers same-host links and optionally saves each page as HTML, Markdown, PDF, or a screenshot. Use it as a library or via the stealth-crawler CLI.


Features

  • Asynchronous, headless Chrome browsing via pydoll
  • Discovers internal links starting from a root URL
  • Optional content saving:
    • HTML
    • Markdown (via html2text)
    • PDF snapshots
    • PNG screenshots
  • Rich progress bars with rich
  • Configurable URL filtering (base, exclude)
  • Pure-Python API and CLI
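
To make the link-discovery and filtering features concrete, here is a minimal standard-library sketch of how same-host links could be selected using a base prefix and a list of excluded paths. The helper name filter_links and its exact rules are assumptions for illustration; this is not stealth-crawler's own implementation.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def filter_links(page_url, hrefs, base=None, exclude=()):
    """Return absolute, same-host links that pass the base/exclude filters.

    Hypothetical helper for illustration; not part of stealth-crawler.
    """
    base = base or page_url
    host = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        url, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc != host:
            continue  # link points at a different host
        if not url.startswith(base):
            continue  # outside the allowed base prefix
        if any(parsed.path.startswith(prefix) for prefix in exclude):
            continue  # path is explicitly excluded
        if url not in kept:
            kept.append(url)
    return kept


links = filter_links(
    "https://example.com/docs/",
    ["intro.html", "/admin/login", "https://other.org/", "#top", "mailto:a@example.com"],
    base="https://example.com",
    exclude=["/admin"],
)
print(links)
```

Only the relative link and the fragment-only link (which resolves to the page itself) survive here; the excluded path, the external host, and the mailto: link are dropped.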

Installation

Install the latest stable release for everyday use:

pip install stealth-crawler

Or in an isolated environment with pipx:

pipx install stealth-crawler

Or via Poetry:

poetry add stealth-crawler

Quickstart

Command-Line

# Discover URLs only
stealth-crawler crawl https://example.com --urls-only

# Crawl and save HTML + Markdown
stealth-crawler crawl https://example.com \
  --save-html --save-md \
  --output-dir ./output

# Exclude specific paths
stealth-crawler crawl https://example.com \
  --exclude /private,/logout

Run stealth-crawler --help for full options.

Python API

import asyncio
from stealthcrawler import StealthCrawler

# Configure the crawler: which URLs to follow and what to save.
crawler = StealthCrawler(
    base="https://example.com",
    exclude=["/admin"],
    save_html=True,
    save_md=True,
    output_dir="export",
)

# crawl() is a coroutine, so drive it with asyncio.run().
urls = asyncio.run(crawler.crawl("https://example.com"))
print(urls)
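
Once a crawl has finished, the discovered URLs can be summarized with the standard library. The sketch below assumes that crawl() returns a collection of URL strings, which the example above suggests but does not state; group_by_section is a hypothetical helper, and the sample list stands in for a real crawl result.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls):
    """Group URLs by their first path segment ("/" for the site root)."""
    sections = defaultdict(list)
    for url in sorted(set(urls)):
        path = urlparse(url).path.strip("/")
        section = "/" + path.split("/")[0] if path else "/"
        sections[section].append(url)
    return dict(sections)


# Stand-in for the value returned by crawler.crawl(...).
sample = [
    "https://example.com/",
    "https://example.com/docs/intro",
    "https://example.com/docs/api",
    "https://example.com/blog/2024/post",
]
grouped = group_by_section(sample)
for section, members in grouped.items():
    print(section, len(members))
```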

Configuration

| Option        | CLI flag     | API param  | Default   |
|---------------|--------------|------------|-----------|
| Base URL(s)   | --base       | base       | start URL |
| Exclude paths | --exclude    | exclude    | none      |
| Save HTML     | --save-html  | save_html  | False     |
| Save Markdown | --save-md    | save_md    | False     |
| URLs only     | --urls-only  | urls_only  | False     |
| Output folder | --output-dir | output_dir | ./output  |

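The table pairs each CLI flag with an API parameter. As a rough illustration of that correspondence, the argparse sketch below turns a list of the flags above into keyword arguments using the same names and defaults. The parser, the build_kwargs and split_csv helpers, and the comma-splitting of --exclude are assumptions made for this example; the real CLI may parse its arguments differently.

```python
import argparse


def split_csv(value):
    """Split a comma-separated flag value such as "/private,/logout"."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_kwargs(argv):
    """Map CLI-style arguments to (start_url, keyword arguments).

    Hypothetical sketch; it mirrors the table, not the real CLI code.
    """
    parser = argparse.ArgumentParser(prog="crawl-sketch")
    parser.add_argument("url")
    parser.add_argument("--base", default=None)
    parser.add_argument("--exclude", type=split_csv, default=[])
    parser.add_argument("--save-html", action="store_true")
    parser.add_argument("--save-md", action="store_true")
    parser.add_argument("--urls-only", action="store_true")
    parser.add_argument("--output-dir", default="./output")
    kwargs = vars(parser.parse_args(argv))
    start_url = kwargs.pop("url")
    if kwargs["base"] is None:
        kwargs["base"] = start_url  # the table lists the start URL as the default
    return start_url, kwargs


start_url, kwargs = build_kwargs(
    ["https://example.com", "--exclude", "/private,/logout", "--save-html"]
)
print(start_url, kwargs)
```

argparse converts dashes in flag names to underscores, which is why --save-html appears as save_html in the resulting dictionary.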
Testing & Quality

  • Run tests:

    pytest
    
  • Format and lint:

    black src tests
    ruff check src tests
    

Contributing

  1. Fork the repository and create a feature branch.

  2. Set up your development environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"
    

    Or with uv:

    uv venv .venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
    
  3. Implement your changes, add tests, then run:

    black src tests
    ruff check src tests
    pytest
    
  4. Open a pull request against main.


License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). You are free to use, modify, and redistribute it under the terms of the GPL. See LICENSE for full details.
