A powerful and extensible web crawling framework built with Scrapy and Playwright.
Lazy Crawler
Extensible web crawling and data extraction framework.
A technical foundation for building scalable data pipelines using Scrapy and Playwright.
Lazy Crawler is an extensible web crawling framework designed for developers who need to build high-performance data extraction pipelines. It combines the speed of Scrapy with the dynamic rendering capabilities of Playwright to handle complex web environments.
Features
- Automated Workflows: Streamlined project structure for rapid deployment of new spiders.
- Dynamic Content Support: Built-in Playwright integration for rendering JavaScript-heavy applications.
- Multi-backend Support: Native export functionality for MongoDB, PostgreSQL, CSV, JSON, and Google Sheets.
- Developer First: Focuses on clean abstractions and extensibility over rigid configurations.
- Resilient Extraction: Integrated support for proxy rotation and anti-detection measures.
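As one illustration of the proxy rotation mentioned above: in plain Scrapy, a downloader middleware can assign a proxy per request through `request.meta["proxy"]`. The following is a minimal sketch of that general technique, not Lazy Crawler's own implementation. The class name `RotatingProxyMiddleware`, the `PROXY_LIST` setting, and any proxy URLs are hypothetical.

```python
import itertools


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: round-robin proxy per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() yields the proxies in order, forever.
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting name, not one defined by Scrapy.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Respect a proxy that was set explicitly on the request.
        if "proxy" not in request.meta:
            request.meta["proxy"] = next(self._pool)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

A middleware like this would be registered in `DOWNLOADER_MIDDLEWARES` with a priority below 750 so that it runs before Scrapy's built-in `HttpProxyMiddleware`.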
Quick Start
1. Installation
This project uses uv for dependency management.
uv pip install .
For development:
# Initialize and install in editable mode
uv pip install -e .
Note: Install Playwright browser binaries after the initial setup:
playwright install
2. Static Site Crawler
Create my_agent.py:
import scrapy
from scrapy.crawler import CrawlerProcess

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class MyAgent(LazyBaseCrawler):
    name = "my_agent"

    def start_requests(self):
        yield scrapy.Request("https://example.com", self.parse)

    def parse(self, response):
        # Yield one item per page: the first <h1> text and the page URL.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }


# Run the spider from a plain Python script; start() blocks until it finishes.
process = CrawlerProcess()
process.crawl(MyAgent)
process.start()
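To write the yielded items to disk from a script like this, one option is Scrapy's standard `FEEDS` setting, passed to `CrawlerProcess`. This is a sketch under assumptions: the output paths are hypothetical, and whether Lazy Crawler provides its own export configuration on top of this is not stated here.

```python
# FEEDS is a standard Scrapy setting; the output paths are hypothetical.
FEED_SETTINGS = {
    "FEEDS": {
        "output/items.json": {"format": "json", "encoding": "utf8", "overwrite": True},
        "output/items.csv": {"format": "csv"},
    }
}

# Usage with the script above:
#   process = CrawlerProcess(settings=FEED_SETTINGS)
```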
3. Dynamic Content (JavaScript)
Leverage Playwright for sites that require browser rendering:
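import scrapy

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class DynamicAgent(LazyBaseCrawler):
    name = "dynamic"

    def start_requests(self):
        # meta={"playwright": True} asks for the page to be rendered with Playwright.
        yield scrapy.Request(
            "https://dynamic-site.com",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # Select from the rendered page content.
        data = response.css(".rendered-content::text").get()
        yield {"content": data}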
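The `meta={"playwright": True}` key matches the convention of the scrapy-playwright plugin. Assuming that plugin is what sits underneath, these are the settings it documents for enabling browser rendering. Lazy Crawler may already apply them for you; treat this as a reference sketch rather than a required step.

```python
# Settings documented by scrapy-playwright (assumed to be the underlying plugin).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: choose the browser and how it is launched.
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}
```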
Data Management
1. MongoDB Integration
Configuration (.env):
MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lazy_crawler_db
Settings:
ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
}
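For orientation, here is roughly what a MongoDB item pipeline looks like in Scrapy. It follows the general pattern from the Scrapy documentation and is not the source of `lazy_crawler.crawler.pipelines.MongoPipeline`. The class name `EnvMongoPipeline` and the `items` collection name are hypothetical, and the sketch assumes the `.env` values have already been loaded into the process environment.

```python
import os


class EnvMongoPipeline:
    """Hypothetical pipeline: stores each item as one MongoDB document."""

    def __init__(self, uri, database, collection="items"):
        self.uri = uri
        self.database = database
        self.collection = collection
        self.client = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the same variable names as the .env example above.
        return cls(
            uri=os.environ.get("MONGO_URI", "mongodb://localhost:27017"),
            database=os.environ.get("MONGO_DATABASE", "lazy_crawler_db"),
        )

    def open_spider(self, spider):
        # Imported here so the module loads even where pymongo is absent.
        import pymongo

        self.client = pymongo.MongoClient(self.uri)

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) makes a plain copy that pymongo can serialize.
        self.client[self.database][self.collection].insert_one(dict(item))
        return item
```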
2. Google Sheets Export
Configuration (.env):
GOOGLE_SHEETS_CREDS_FILE=creds.json
GOOGLE_SHEETS_SPREADSHEET_NAME=CrawlData
GOOGLE_SHEETS_WORKSHEET_NAME=Results
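A worksheet is a grid, so scraped items have to be flattened into a header row plus one row per item before they can be appended. The helper below is a minimal sketch of that step only. The function name `items_to_rows` is hypothetical, and it does not show how Lazy Crawler's exporter does it. The resulting rows could be handed to a Sheets client library.

```python
def items_to_rows(items):
    """Turn a list of dict items into [header, row, row, ...]."""
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)

    rows = [columns]
    for item in items:
        # Missing or None fields become empty cells; other values are stringified.
        rows.append(["" if item.get(c) is None else str(item[c]) for c in columns])
    return rows
```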
Dashboard & API
The project includes a dashboard for monitoring crawl progress and exploring extracted data.
Start the service:
uv run python -m lazy_crawler.app.main
- Dashboard: http://localhost:8000/
- API Documentation: http://localhost:8000/docs
Docker Deployment (Production)
Deploy using the provided orchestration files:
# Quick deployment
./deploy.sh
# Manual startup
docker compose up --build -d
- Dashboard: http://localhost/
- API Docs: http://localhost/docs
- Health: http://localhost/health
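After deployment, the health endpoint can be polled from a script, for example in a smoke test. This sketch assumes only that `/health` answers with HTTP 200 when the service is up; the response body format is not documented here, so it is ignored. The function name `is_healthy` is hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(url="http://localhost/health", timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```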
Customization
The framework is designed to be modified. You can extend LazyBaseCrawler or implement custom pipelines to handle specific data requirements.
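As an example of a custom pipeline, the sketch below normalizes whitespace in string fields before items reach a storage backend. It relies only on Scrapy's standard `process_item(item, spider)` interface. The class name and the module path in the comment are hypothetical.

```python
class WhitespaceCleanupPipeline:
    """Hypothetical item pipeline: trims and collapses whitespace in strings."""

    def process_item(self, item, spider):
        # list(...) takes a snapshot so values can be reassigned while looping.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = " ".join(value.split())
        return item


# Registration (hypothetical module path); 300 runs before a pipeline at 400:
#   ITEM_PIPELINES = {"my_project.pipelines.WhitespaceCleanupPipeline": 300}
```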
Contributing
Technical contributions and bug reports are welcome. Please refer to CONTRIBUTING.md.
License
Lazy Crawler is licensed under the MIT License.
Download files
File details
Details for the file lazy_crawler-2.0.0.tar.gz.
File metadata
- Download URL: lazy_crawler-2.0.0.tar.gz
- Upload date:
- Size: 60.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 (Ubuntu 22.04)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6c1a74454fe717b36ade8bcc9e99c972cd21c01d53a191830f5c98f796ed682 |
| MD5 | d06a7cf190673488d58f02d40b9592ad |
| BLAKE2b-256 | bd98f797b99d39431ff81143b1e6a21daad45b85ec67c3262f925d55c35ce49c |
File details
Details for the file lazy_crawler-2.0.0-py3-none-any.whl.
File metadata
- Download URL: lazy_crawler-2.0.0-py3-none-any.whl
- Upload date:
- Size: 81.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 (Ubuntu 22.04)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc650f7440110a6e51730cca4202f502fb9080f500e0ac9abb26dc9d8831304e |
| MD5 | 1782636eeafb0f24c84e79678741d415 |
| BLAKE2b-256 | 2842a057f95d23bc00aaebf6ad38071f89827e6b23b5cd1baddc76d181a6a338 |