A stealth web crawler using headless Chrome
Stealth Crawler
A headless-Chrome web crawler that discovers same-host links and optionally saves HTML, Markdown, PDF, or screenshots. Use as a library or via the stealth-crawler CLI.
Features
- Asynchronous, headless Chrome browsing via pydoll
- Discovers internal links starting from a root URL
- Optional content saving:
  - HTML
  - Markdown (via html2text)
  - PDF snapshots
  - PNG screenshots
- Rich progress bars with rich
- Configurable URL filtering (base, exclude)
- Pure-Python API and CLI
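
To make the link-discovery and URL-filtering features concrete, here is a minimal sketch of same-host filtering using only the standard library. It is not the crawler's actual implementation: the helper names `is_crawlable` and `filter_links` are hypothetical, and treating `exclude` entries as path prefixes is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def is_crawlable(url, bases, excludes):
    """Return True if url sits under one of the bases and no excluded path.

    Assumption: bases are absolute URLs and excludes are path prefixes.
    """
    target = urlparse(url)
    for base in bases:
        allowed = urlparse(base)
        if (
            target.scheme == allowed.scheme
            and target.netloc == allowed.netloc
            and target.path.startswith(allowed.path)
        ):
            break
    else:
        # No base matched, so the link points outside the crawl scope.
        return False
    return not any(target.path.startswith(prefix) for prefix in excludes)


def filter_links(page_url, hrefs, bases, excludes=()):
    """Resolve raw href values against page_url and keep the crawlable ones."""
    kept = []
    seen = set()
    for href in hrefs:
        # Make the link absolute and drop any #fragment before comparing.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute in seen:
            continue
        seen.add(absolute)
        if is_crawlable(absolute, bases, excludes):
            kept.append(absolute)
    return kept


links = filter_links(
    "https://example.com/docs/",
    ["intro.html", "/admin/users", "https://other.org/", "intro.html#top"],
    bases=["https://example.com"],
    excludes=["/admin"],
)
print(links)
```

Comparing the parsed host rather than using a plain string prefix keeps look-alike hosts such as `example.com.evil.org` out of the result.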
Installation
Install the latest stable release for everyday use:
pip install stealth-crawler
Or in an isolated environment with pipx:
pipx install stealth-crawler
Or via Poetry:
poetry add stealth-crawler
Quickstart
Command-Line
# Discover URLs only
stealth-crawler crawl https://example.com --urls-only
# Crawl and save HTML + Markdown
stealth-crawler crawl https://example.com \
    --save-html --save-md \
    --output-dir ./output
# Exclude specific paths
stealth-crawler crawl https://example.com \
    --exclude /private,/logout
Run stealth-crawler --help for full options.
Python API
import asyncio

from stealthcrawler import StealthCrawler

# Configure the crawl scope and which formats to save
crawler = StealthCrawler(
    base="https://example.com",
    exclude=["/admin"],
    save_html=True,
    save_md=True,
    output_dir="export",
)

# crawl() is a coroutine, so run it with asyncio.run()
urls = asyncio.run(crawler.crawl("https://example.com"))
print(urls)
Configuration
| Option | CLI flag | API param | Default |
|---|---|---|---|
| Base URL(s) | --base | base | start URL |
| Exclude paths | --exclude | exclude | none |
| Save HTML | --save-html | save_html | False |
| Save Markdown | --save-md | save_md | False |
| URLs only | --urls-only | urls_only | False |
| Output folder | --output-dir | output_dir | ./output |
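
The mapping between CLI flags and API parameters in the table can be illustrated with a small `argparse` sketch. This is not the project's real CLI code: `build_parser`, `split_csv`, and `to_api_kwargs` are hypothetical names, and accepting a comma-separated list for `--base` is an assumption (the Quickstart only shows commas for `--exclude`).

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the table above."""
    parser = argparse.ArgumentParser(prog="stealth-crawler-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    crawl = sub.add_parser("crawl")
    crawl.add_argument("url")
    crawl.add_argument("--base", default=None)      # default: the start URL
    crawl.add_argument("--exclude", default="")     # default: none
    crawl.add_argument("--save-html", action="store_true")
    crawl.add_argument("--save-md", action="store_true")
    crawl.add_argument("--urls-only", action="store_true")
    crawl.add_argument("--output-dir", default="./output")
    return parser


def split_csv(value):
    """Split a comma-separated flag value into a clean list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def to_api_kwargs(args):
    """Translate parsed CLI arguments into API-style keyword arguments."""
    return {
        "base": split_csv(args.base) if args.base else [args.url],
        "exclude": split_csv(args.exclude),
        "save_html": args.save_html,
        "save_md": args.save_md,
        "urls_only": args.urls_only,
        "output_dir": args.output_dir,
    }


args = build_parser().parse_args(
    ["crawl", "https://example.com", "--exclude", "/private,/logout", "--save-html"]
)
kwargs = to_api_kwargs(args)
print(kwargs)
```

Flags that are omitted fall back to the defaults listed in the table, so only `--exclude` and `--save-html` differ from the defaults in this run.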
Testing & Quality
- Run tests:
    pytest
- Check formatting & linting:
    black src tests
    ruff check src tests
Contributing
- Fork the repository and create a feature branch.
- Set up your development environment:
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"
  Or with uv:
    uv venv .venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
- Implement your changes, add tests, run:
    black src tests
    ruff check src tests
    pytest
- Open a pull request against main.
License
This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). You are free to use, modify, and redistribute under the terms of the GPL. See LICENSE for full details.
File details
Details for the file stealth_crawler-0.9.0.tar.gz.
File metadata
- Download URL: stealth_crawler-0.9.0.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 87d0a565bd8cef0ed5e0f174a00fba5efe613b56ee308cff6ae0215fe951b44c |
| MD5 | 6a6ee29817b555b3f979d1b9f7b0f216 |
| BLAKE2b-256 | 7e6bde61031fcff80ef57fef0620a3c337128d4708b04531b82346f3cf5bc28e |
File details
Details for the file stealth_crawler-0.9.0-py3-none-any.whl.
File metadata
- Download URL: stealth_crawler-0.9.0-py3-none-any.whl
- Upload date:
- Size: 22.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25c1ec7a41ea86d549f153bfe548999f5d67c07879d9a250cefb69fccfdc360c |
| MD5 | 077e8b396d45102b0e241b1a2f575498 |
| BLAKE2b-256 | 2130af0acf97f4dce00c8c8d05a6e5bf1f555226ee4aa724954b21580da4ffd4 |
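
The SHA256 digests listed for each file can be used to check a download before installing it. The following is a minimal sketch using the standard library; the helper names `sha256_of` and `matches` are hypothetical, and the demo hashes a small temporary file rather than a real release archive.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo on a throwaway file so the sketch runs without any download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    ok = matches(sample, hashlib.sha256(b"hello").hexdigest())
    bad = matches(sample, "0" * 64)

print(ok, bad)
```

For a real check, pass the path of the downloaded archive or wheel together with the SHA256 value from the matching table.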