A powerful and extensible web crawling framework built with Scrapy and Playwright.

Lazy Crawler

Extensible web crawling and data extraction framework.

A technical foundation for building scalable data pipelines using Scrapy and Playwright.

🌐 Architecture Overview

graph TD
    A[Target Website] -->|Spider Scans| B(Lazy Crawler Core)
    B -->|Dynamic Rendering| C{Playwright}
    C -->|Rendered HTML| B
    B -->|Extracted Data| D{Data Pipelines}
    D -->|Export| E[Google Sheets]
    D -->|Export| F[PostgreSQL / MongoDB]
    D -->|Export| G[JSON / CSV]
    H[Admin Dashboard] -->|Monitor| B

Core Stack: Python, Scrapy, Playwright
Code Quality: PEP8 style, pre-commit
Documentation: docs, license

Lazy Crawler is an extensible web crawling framework for developers and organizations that need robust data extraction pipelines. It combines the speed of Scrapy with the dynamic rendering capabilities of Playwright to handle modern, JavaScript-heavy websites.

What is Lazy Crawler?

If you need to collect data from websites, whether product prices, news articles, or social media updates, Lazy Crawler handles the hard parts for you:

  • Automatic Scrolling & Clicking: It can "browse" like a human to see content that only appears when you scroll or click.
  • Multiple Save Locations: Send your data directly to JSON, CSV, or Excel (.xlsx) files, Google Sheets, or databases (PostgreSQL/MongoDB).
  • Security & Reliability: Built-in protection against being blocked, including smart rate limiting and proxy support.
  • Easy Dashboard: A simple web interface to see how your data collection is going in real time.
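
To make the proxy support mentioned above more concrete, here is a minimal sketch of round-robin proxy rotation with failure tracking. It is not Lazy Crawler's proxy manager: the `ProxyPool` class, its method names, and the `max_failures` threshold are hypothetical and exist only to illustrate the idea.

```python
from itertools import cycle


class ProxyPool:
    """Hypothetical round-robin pool that skips proxies with too many failures."""

    def __init__(self, proxies, max_failures=3):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self._failures = {}  # proxy URL -> consecutive failure count
        self._cycle = cycle(self.proxies)

    def report_failure(self, proxy):
        # Count a failed request; the proxy is skipped once it reaches the limit.
        self._failures[proxy] = self._failures.get(proxy, 0) + 1

    def report_success(self, proxy):
        # A successful request resets the failure count.
        self._failures.pop(proxy, None)

    def is_healthy(self, proxy):
        return self._failures.get(proxy, 0) < self.max_failures

    def next_proxy(self):
        # Look at each proxy at most once per call, in rotation order.
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            if self.is_healthy(proxy):
                return proxy
        raise RuntimeError("no healthy proxies available")
```

A caller would take a proxy from `next_proxy()` for each request and report the outcome back, so that repeatedly failing proxies drop out of the rotation.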

Features

  • Automated Workflows: Fast setup for new data collection tasks ("spiders").
  • Modern Web Support: Built-in Playwright integration for sites like Amazon, Twitter, or React apps.
  • Google Sheets Integration: Push data directly to your spreadsheets for easy sharing.
  • Smart Rate Limiting (Enhanced): Limits request rates to protect both the application and target websites from abuse, and identifies clients by IP address and user so that the limits are harder to bypass.
  • Integrated Proxy Manager: Built-in system for automatic proxy rotation and health checks, compatible with both Scrapy and Playwright.
  • Developer First: Clean, modular code that is easy to extend.
  • Production Ready: Full Docker support for stable, long-running deployments.
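
The rate limiting feature above can be pictured as a sliding-window counter kept per client. The sketch below is an illustration under stated assumptions, not the framework's implementation: `SlidingWindowLimiter`, the `client_key` format, and the injectable `clock` parameter are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most max_requests per window_seconds per client."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of allowed requests

    def allow(self, client_key):
        now = self._clock()
        hits = self._hits[client_key]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Combining the IP address and a user identifier into one key (for example `"203.0.113.7:alice"`) is one way to limit both together; how Lazy Crawler actually identifies clients is not documented here.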

Quick Start

1. Installation

This project uses uv for dependency management.

# Install from PyPI
pip install lazy-crawler

# OR add it as a dependency of a uv-managed project
uv add lazy-crawler

For development, from a checkout of the project:

# Create the virtual environment and sync dependencies
uv sync

Note: Install the Playwright browser binaries after the initial setup:

playwright install

2. Static Site Crawler

Create my_agent.py:

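import scrapy
from scrapy.crawler import CrawlerProcess

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class MyAgent(LazyBaseCrawler):
    name = "my_agent"

    def start_requests(self):
        # Seed the crawl with a single static page.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        # Emit one item per page: the first <h1> text and the page URL.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }


if __name__ == "__main__":
    # Run the spider in-process; start() blocks until the crawl finishes.
    process = CrawlerProcess()
    process.crawl(MyAgent)
    process.start()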

3. Dynamic Content (JavaScript)

Leverage Playwright for sites that require browser rendering:

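import scrapy

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class DynamicAgent(LazyBaseCrawler):
    name = "dynamic"

    def start_requests(self):
        # The "playwright" meta flag requests browser rendering for this page.
        yield scrapy.Request(
            "https://dynamic-site.com",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # Select from the HTML returned after rendering.
        data = response.css(".rendered-content::text").get()
        yield {"content": data}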

Data Management

1. MongoDB Integration

Configuration (.env):

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lazy_crawler_db

Settings:

ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
}

2. Google Sheets Export

Configuration (.env):

GOOGLE_SHEETS_CREDS_FILE=creds.json
GOOGLE_SHEETS_SPREADSHEET_NAME=CrawlData
GOOGLE_SHEETS_WORKSHEET_NAME=Results
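
Before starting a crawl, it can help to confirm that the three keys above are actually set. The following is a minimal standard-library sketch; how Lazy Crawler itself loads `.env` is not specified here, and the `parse_env` and `missing_keys` helpers are hypothetical.

```python
REQUIRED_KEYS = (
    "GOOGLE_SHEETS_CREDS_FILE",
    "GOOGLE_SHEETS_SPREADSHEET_NAME",
    "GOOGLE_SHEETS_WORKSHEET_NAME",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]
```

Reading the file with `pathlib.Path(".env").read_text()` and passing the result to `parse_env` gives a dictionary that `missing_keys` can check before the export runs.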

3. JSON & CSV Export

Enable the built-in pipelines to save to local files:

custom_settings = {
    "ITEM_PIPELINES": {
        # Export to scraped_data.json
        "lazy_crawler.crawler.pipelines.JsonWriterPipeline": 300,

        # Export to scraped_data_{timestamp}.csv
        "lazy_crawler.crawler.pipelines.CSVPipeline": 301,
    }
}
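
The integers in ITEM_PIPELINES are Scrapy priorities: items pass through pipelines in ascending order of value, conventionally in the 0 to 1000 range. The sketch below combines pipeline paths from this README into one dictionary and shows the resulting order; the `execution_order` helper is hypothetical and only mirrors the sorting rule.

```python
ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
    "lazy_crawler.crawler.pipelines.JsonWriterPipeline": 300,
    "lazy_crawler.crawler.pipelines.CSVPipeline": 301,
}


def execution_order(pipelines):
    """Return pipeline paths sorted the way Scrapy applies them (lowest value first)."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in execution_order(ITEM_PIPELINES):
    print(path)
```

With these values, each item reaches the JSON writer first, then the CSV writer, and the MongoDB pipeline last.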

4. Excel Export

Enable the Excel pipeline to save data as .xlsx:

custom_settings = {
    "ITEM_PIPELINES": {
        "lazy_crawler.crawler.pipelines.ExcelWriterPipeline": 302,
    }
}

Dashboard & API

The project includes a dashboard for monitoring crawl progress and exploring extracted data.

Start the service:

uv run python -m lazy_crawler.app.main

  • Dashboard: http://localhost:8000/
  • API Documentation: http://localhost:8000/docs

Docker Deployment (Production)

Deploy using the provided orchestration files:

# Manual startup
docker compose up --build -d

  • Dashboard: http://localhost/
  • API Docs: http://localhost/docs
  • Health: http://localhost/health
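
After `docker compose up`, a deployment script may want to wait until the health endpoint responds. The following is a minimal sketch using only the standard library. It assumes the endpoint returns a 2xx status when the service is ready; the response body is not inspected because its format is not documented here, and the function names are hypothetical.

```python
import time
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all derive from OSError.
        return False


def wait_until_healthy(url, attempts=10, delay=1.0):
    """Poll the URL up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if is_healthy(url):
            return True
        time.sleep(delay)
    return False
```

For the Docker deployment above, the call would be `wait_until_healthy("http://localhost/health")`.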

Customization

The framework is designed to be modified. You can extend LazyBaseCrawler or implement custom pipelines to handle specific data requirements.
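
As an illustration of a custom pipeline, the sketch below follows Scrapy's standard pipeline interface, a plain class with a `process_item(item, spider)` method. The class name and its whitespace-cleaning behavior are hypothetical examples, not part of Lazy Crawler.

```python
class NormalizeWhitespacePipeline:
    """Hypothetical pipeline: collapse runs of whitespace in every string field."""

    def process_item(self, item, spider):
        # Copy the key/value pairs first so the item can be updated while iterating.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = " ".join(value.split())
        # Returning the item passes it on to the next pipeline.
        return item
```

To enable it, add the class's import path to ITEM_PIPELINES with a value lower than the export pipelines (for example 200), so items are cleaned before they are written. The import path depends on where the class lives in your own project.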

Contributing

Technical contributions and bug reports are welcome.

License

Lazy Crawler is licensed under the MIT License.


Created by Pradip P.
