A flexible and advanced web crawler for modern SPAs and traditional websites.
Sitewise Crawler 🕷️
An advanced, flexible, and production-ready web crawler for modern websites. It automatically detects SPAs (Single Page Applications) and switches between fast, request-based fetching and full JavaScript rendering with Playwright.
✨ Features
- 🚀 Hybrid Rendering: Automatically detects React, Vue, Angular, and Next.js to switch rendering engines on the fly.
- 🧠 Smart Extraction: Built-in main content extraction that removes headers, footers, and sidebars.
- 🔗 SPA Link Discovery: Discovers links even in complex client-side routers.
- 🛠️ Fully Configurable: Control depth, concurrency, rate limits, and custom wait selectors.
- 📝 Pydantic Models: Type-safe configuration and results.
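To make the hybrid-rendering idea concrete, here is a minimal sketch of how framework detection could work on raw HTML. It is not sitewise-crawler's actual detection logic: the marker table `SPA_MARKERS` and the helpers `detect_framework` and `needs_browser` are hypothetical names, and the patterns are common framework fingerprints chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical marker table: fingerprints that frameworks commonly leave in
# served HTML. Illustrative heuristics only, not the library's real rules.
SPA_MARKERS = {
    "next.js": re.compile(r'id="__NEXT_DATA__"'),
    "angular": re.compile(r'\bng-version="'),
    "react": re.compile(r'data-reactroot|<div id="root">\s*</div>'),
    "vue": re.compile(r'data-v-app|<div id="app">\s*</div>'),
}


def detect_framework(html: str) -> Optional[str]:
    """Return the first framework whose marker appears in the HTML, else None."""
    for name, pattern in SPA_MARKERS.items():
        if pattern.search(html):
            return name
    return None


def needs_browser(html: str) -> bool:
    """Decide whether a page should be re-fetched with a real browser."""
    return detect_framework(html) is not None


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div><script src="/main.js"></script></body></html>'
    static = "<html><body><h1>Hello</h1><p>Static page</p></body></html>"
    print(detect_framework(shell), needs_browser(static))
```

A crawler built this way would fetch each page over plain HTTP first and fall back to a browser only when a marker matches, which keeps static sites fast. Heuristics like these can misfire, so a real implementation would likely also compare the amount of visible text before and after rendering.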
📦 Installation
pip install sitewise-crawler
playwright install chromium
🚀 Quick Start
import asyncio

from sitewise_crawler import SPACrawler, CrawlerConfig


async def main():
    # 1. Configure the crawler
    config = CrawlerConfig(
        start_url="https://example.com",
        max_depth=2,
        max_pages=10,
        use_playwright=True,
        headless=True,
    )

    # 2. Initialize and run
    crawler = SPACrawler(config)

    # Optional: Add a callback for each page crawled
    crawler.on_page_crawled = lambda page: print(f"Crawled: {page.url} | Title: {page.title}")

    result = await crawler.crawl()

    # 3. Process results
    if result.success:
        print(f"\n✅ Crawl complete! Found {result.total_pages} pages.")
        for page in result.pages_all:
            print(f"- {page.url} ({len(page.content)} chars)")


if __name__ == "__main__":
    asyncio.run(main())
⚙️ Configuration Options
The CrawlerConfig class supports the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| start_url | str | Required | The entry point for the crawler. |
| max_depth | int | 3 | Maximum crawl depth from the start URL. |
| max_pages | int | 100 | Stop crawling after this many pages. |
| use_playwright | bool | True | Enable JavaScript rendering for SPAs. |
| headless | bool | True | Run browser in headless mode. |
| rate_limit_delay | float | 1.0 | Seconds to wait between requests. |
| wait_for_selector | str | None | CSS selector to wait for before extracting SPA content. |
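As a rough illustration of how max_depth and max_pages could interact, the sketch below walks an in-memory link graph breadth-first and stops at whichever limit is hit first. It is an assumption about the semantics, not the crawler's real scheduler: the function `bounded_crawl_order` and the `SITE` graph are hypothetical, and the start URL is assumed to count as depth 0 and as the first page.

```python
from collections import deque
from typing import Dict, List


def bounded_crawl_order(
    links: Dict[str, List[str]],
    start_url: str,
    max_depth: int = 3,
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first visit order, limited by depth and by page count."""
    visited = [start_url]          # pages in the order they are accepted
    seen = {start_url}             # guards against revisiting a URL
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


# Hypothetical site: each key maps a page to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}

if __name__ == "__main__":
    print(bounded_crawl_order(SITE, "/", max_depth=2))
```

Under these assumptions, a shallow max_depth keeps the crawl close to the start URL, while max_pages acts as a hard budget even when the depth limit would allow more pages.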
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.