A powerful and extensible web crawling framework built with Scrapy and Playwright.

Lazy Crawler

Extensible web crawling and data extraction framework.

A technical foundation for building scalable data pipelines using Scrapy and Playwright.

Lazy Crawler is an extensible web crawling framework designed for developers who need to build high-performance data extraction pipelines. It combines the speed of Scrapy with the dynamic rendering capabilities of Playwright to handle complex web environments.

Features

Automated Workflows: Streamlined project structure for rapid deployment of new spiders.
Dynamic Content Support: Built-in Playwright integration for rendering JavaScript-heavy applications.
Multi-backend Support: Native export functionality for MongoDB, PostgreSQL, CSV, JSON, and Google Sheets.
Developer First: Focuses on clean abstractions and extensibility over rigid configurations.
Resilient Extraction: Integrated support for proxy rotation and anti-detection measures.

Quick Start

1. Installation

This project uses uv for dependency management.

uv pip install .

For development:

# Initialize and install in editable mode
uv pip install -e .

Note: Install the Playwright browser binaries after the initial setup:

playwright install

2. Static Site Crawler

Create my_agent.py:

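import scrapy
from scrapy.crawler import CrawlerProcess

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class MyAgent(LazyBaseCrawler):
    name = "my_agent"

    def start_requests(self):
        # Seed the crawl with a single URL; parse() handles the response.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        # Emit one item per page: the first <h1> text and the page URL.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }


if __name__ == "__main__":
    # Run the spider in-process; start() blocks until the crawl finishes.
    process = CrawlerProcess()
    process.crawl(MyAgent)
    process.start()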

3. Dynamic Content (JavaScript)

Leverage Playwright for sites that require browser rendering:

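import scrapy

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class DynamicAgent(LazyBaseCrawler):
    name = "dynamic"

    def start_requests(self):
        yield scrapy.Request(
            "https://dynamic-site.com",
            # Route this request through the Playwright integration.
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # Selectors run against the browser-rendered page.
        data = response.css(".rendered-content::text").get()
        yield {"content": data}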

Data Management

1. MongoDB Integration

Configuration (.env):

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lazy_crawler_db

Settings:

ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
}
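The two `.env` values above are plain `KEY=VALUE` pairs. How Lazy Crawler loads them internally is not documented here, so the following is only a minimal standard-library sketch of reading and validating them before a crawl; `parse_env` and `mongo_settings` are hypothetical helper names, not part of the package.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so URIs containing "=" stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mongo_settings(env: dict[str, str]) -> dict[str, str]:
    """Return the MongoDB settings, failing early if one is missing or empty."""
    missing = [k for k in ("MONGO_URI", "MONGO_DATABASE") if not env.get(k)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return {"uri": env["MONGO_URI"], "database": env["MONGO_DATABASE"]}


if __name__ == "__main__":
    sample = "MONGO_URI=mongodb://localhost:27017\nMONGO_DATABASE=lazy_crawler_db\n"
    print(mongo_settings(parse_env(sample)))
```

Checking the values up front surfaces a missing variable before the crawl starts rather than when the first item reaches the pipeline.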

2. Google Sheets Export

Configuration (.env):

GOOGLE_SHEETS_CREDS_FILE=creds.json
GOOGLE_SHEETS_SPREADSHEET_NAME=CrawlData
GOOGLE_SHEETS_WORKSHEET_NAME=Results
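A worksheet is a grid, so scraped items (dictionaries) have to be flattened into a header row plus one row per item before they can be written. The export code itself is not shown in this README; the sketch below only illustrates that flattening step with the standard library, and `items_to_rows` is a hypothetical helper rather than a Lazy Crawler function.

```python
def items_to_rows(items: list[dict]) -> list[list[str]]:
    """Build a header row plus one row per item, with a stable column order."""
    columns: list[str] = []
    for item in items:
        for key in item:
            # Keep columns in first-seen order so the header is deterministic.
            if key not in columns:
                columns.append(key)
    rows: list[list[str]] = [columns]
    for item in items:
        # Missing or None fields become empty cells.
        rows.append(["" if item.get(c) is None else str(item[c]) for c in columns])
    return rows


if __name__ == "__main__":
    scraped = [
        {"title": "Example Domain", "url": "https://example.com"},
        {"url": "https://example.org", "status": 200},
    ]
    for row in items_to_rows(scraped):
        print(row)
```

Items with different keys still line up under a single header, which keeps the sheet rectangular.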

Dashboard & API

The project includes a dashboard for monitoring crawl progress and exploring extracted data.

Start the service:

uv run python -m lazy_crawler.app.main

Dashboard: http://localhost:8000/
API Documentation: http://localhost:8000/docs

Docker Deployment (Production)

Deploy using the provided orchestration files:

# Quick deployment
./deploy.sh

# Manual startup
docker compose up --build -d

Dashboard: http://localhost/
API Docs: http://localhost/docs
Health: http://localhost/health
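After `docker compose up`, it can be useful to confirm that the stack is reachable before starting a crawl. The sketch below polls the health URL listed above using only the standard library. It assumes the endpoint answers with an HTTP 2xx status when the service is healthy, which this README does not state; `is_healthy` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost/health", timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False
```

Calling `is_healthy()` in a deployment script and exiting non-zero when it returns `False` is one simple way to gate later steps on the service being up.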

Customization

The framework is designed to be modified. You can extend LazyBaseCrawler or implement custom pipelines to handle specific data requirements.
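As an illustration of a custom pipeline, here is a minimal sketch. In Scrapy, an item pipeline is a plain class with a `process_item(item, spider)` method that returns the item. The class name, the `"title"` field, and the module path in the comment are assumptions chosen for the example, not part of Lazy Crawler.

```python
class NormalizeTitlePipeline:
    """Hypothetical pipeline: collapses runs of whitespace in the "title" field."""

    def process_item(self, item, spider):
        title = item.get("title")
        if isinstance(title, str):
            # "  Hello \n World " becomes "Hello World".
            item["title"] = " ".join(title.split())
        return item


# To enable it, register the class in the settings next to any built-in
# pipelines (the module path below is a placeholder for your own project):
#
# ITEM_PIPELINES = {
#     "my_project.pipelines.NormalizeTitlePipeline": 300,
# }

if __name__ == "__main__":
    pipeline = NormalizeTitlePipeline()
    print(pipeline.process_item({"title": "  Hello \n  World "}, spider=None))
```

Lower numbers in `ITEM_PIPELINES` run first, so a value below 400 would place this step ahead of the `MongoPipeline` entry from the MongoDB section.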

Contributing

Technical contributions and bug reports are welcome. Please refer to CONTRIBUTING.md.

License

Lazy Crawler is licensed under the MIT License.

Created by Pradip P.

Project details

Release history

2.0.1: Dec 27, 2025
2.0.0: Dec 27, 2025 (this version)

Download files

Source Distribution

lazy_crawler-2.0.0.tar.gz (60.3 kB)

Uploaded Dec 27, 2025 Source

Built Distribution

lazy_crawler-2.0.0-py3-none-any.whl (81.9 kB)

Uploaded Dec 27, 2025 Python 3

File details

Details for the file lazy_crawler-2.0.0.tar.gz.

File metadata

Download URL: lazy_crawler-2.0.0.tar.gz
Upload date: Dec 27, 2025
Size: 60.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for lazy_crawler-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e6c1a74454fe717b36ade8bcc9e99c972cd21c01d53a191830f5c98f796ed682`
MD5	`d06a7cf190673488d58f02d40b9592ad`
BLAKE2b-256	`bd98f797b99d39431ff81143b1e6a21daad45b85ec67c3262f925d55c35ce49c`

File details

Details for the file lazy_crawler-2.0.0-py3-none-any.whl.

File metadata

Download URL: lazy_crawler-2.0.0-py3-none-any.whl
Upload date: Dec 27, 2025
Size: 81.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for lazy_crawler-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dc650f7440110a6e51730cca4202f502fb9080f500e0ac9abb26dc9d8831304e`
MD5	`1782636eeafb0f24c84e79678741d415`
BLAKE2b-256	`2842a057f95d23bc00aaebf6ad38071f89827e6b23b5cd1baddc76d181a6a338`
