Lazy Crawler

Extensible web crawling and data extraction framework.

A technical foundation for building scalable data pipelines using Scrapy and Playwright.

Lazy Crawler is an extensible web crawling framework for developers who need to build high-performance data extraction pipelines. It combines the speed of Scrapy with Playwright's browser rendering so that spiders can also handle sites that build their content with JavaScript.

Features

  • Automated Workflows: Streamlined project structure for rapid deployment of new spiders.
  • Dynamic Content Support: Built-in Playwright integration for rendering JavaScript-heavy applications.
  • Multi-backend Support: Native export functionality for MongoDB, PostgreSQL, CSV, JSON, and Google Sheets.
  • Developer First: Focuses on clean abstractions and extensibility over rigid configurations.
  • Resilient Extraction: Integrated support for proxy rotation and anti-detection measures.

Quick Start

1. Installation

This project uses uv for dependency management.

uv pip install .

For development:

# Install in editable mode so local changes take effect without reinstalling
uv pip install -e .

Note: Install the Playwright browser binaries after the initial setup:

playwright install

2. Static Site Crawler

Create my_agent.py:

import scrapy
from scrapy.crawler import CrawlerProcess

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class MyAgent(LazyBaseCrawler):
    name = "my_agent"

    def start_requests(self):
        # Seed the crawl with a single page.
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        # Emit one item per page: the first <h1> text and the page URL.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }


if __name__ == "__main__":
    # Run the spider in-process; start() blocks until the crawl finishes.
    process = CrawlerProcess()
    process.crawl(MyAgent)
    process.start()
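
The feature list mentions CSV and JSON export, but this page does not show how Lazy Crawler wires it up. In plain Scrapy, file exports are configured through the `FEEDS` setting, so a minimal sketch might look like the following. The output paths and the `feed_formats` helper are hypothetical, and this is standard Scrapy configuration rather than a documented Lazy Crawler API.

```python
# Sketch only: standard Scrapy feed-export settings, not a Lazy Crawler API.
# The output paths below are hypothetical.
FEED_SETTINGS = {
    "FEEDS": {
        "output/items.json": {"format": "json", "encoding": "utf8", "overwrite": True},
        "output/items.csv": {"format": "csv"},
    }
}


def feed_formats(settings):
    """Return the export format configured for each output path."""
    return {path: options["format"] for path, options in settings["FEEDS"].items()}


print(feed_formats(FEED_SETTINGS))
```

Passing the dictionary as `CrawlerProcess(settings=FEED_SETTINGS)` in `my_agent.py` should write the yielded items to both files, assuming the base crawler does not override `FEEDS`.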

3. Dynamic Content (JavaScript)

Leverage Playwright for sites that require browser rendering:

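import scrapy

from lazy_crawler.crawler.spiders.base_crawler import LazyBaseCrawler


class DynamicAgent(LazyBaseCrawler):
    name = "dynamic"

    def start_requests(self):
        # The "playwright" meta key asks for the page to be rendered in a browser.
        yield scrapy.Request(
            "https://dynamic-site.com",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        # The response holds the rendered DOM, so normal selectors apply.
        data = response.css(".rendered-content::text").get()
        yield {"content": data}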
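
The `meta={"playwright": True}` flag follows the convention of the scrapy-playwright plugin. Whether `LazyBaseCrawler` configures that plugin for you is an assumption this page does not confirm. If it does not, a plain Scrapy project would typically need settings along these lines (the `PLAYWRIGHT_SETTINGS` name is only illustrative):

```python
# Sketch, assuming the scrapy-playwright plugin is what handles the
# "playwright" meta key. LazyBaseCrawler may already apply these for you.
PLAYWRIGHT_SETTINGS = {
    # Route HTTP and HTTPS downloads through the Playwright handler.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    # Browser engine to launch: "chromium", "firefox", or "webkit".
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
}
```

If these turn out to be needed, they can be passed when the process is created, for example `CrawlerProcess(settings=PLAYWRIGHT_SETTINGS)`.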

Data Management

1. MongoDB Integration

Configuration (.env):

MONGO_URI=mongodb://localhost:27017
MONGO_DATABASE=lazy_crawler_db

Settings:

ITEM_PIPELINES = {
    "lazy_crawler.crawler.pipelines.MongoPipeline": 400,
}
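
A misconfigured `.env` file tends to surface only when the first item reaches the pipeline. The helper below is a hypothetical pre-flight check, not part of Lazy Crawler. It assumes the two variables have already been loaded into the process environment (for example by a dotenv loader) and only verifies that both are set before a crawl starts.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MongoConfig:
    """Hypothetical container for the two MongoDB settings."""
    uri: str
    database: str


def load_mongo_config(env=None):
    """Read MONGO_URI and MONGO_DATABASE, failing early if either is missing."""
    env = os.environ if env is None else env
    missing = [key for key in ("MONGO_URI", "MONGO_DATABASE") if not env.get(key)]
    if missing:
        raise KeyError(f"Missing MongoDB settings: {', '.join(missing)}")
    return MongoConfig(uri=env["MONGO_URI"], database=env["MONGO_DATABASE"])


# Example with the values from the .env snippet above, passed explicitly.
config = load_mongo_config({
    "MONGO_URI": "mongodb://localhost:27017",
    "MONGO_DATABASE": "lazy_crawler_db",
})
print(config.database)
```

Calling `load_mongo_config()` with no argument reads from `os.environ` instead of an explicit mapping.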

2. Google Sheets Export

Configuration (.env):

GOOGLE_SHEETS_CREDS_FILE=creds.json
GOOGLE_SHEETS_SPREADSHEET_NAME=CrawlData
GOOGLE_SHEETS_WORKSHEET_NAME=Results
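
How the Google Sheets exporter lays out data is not described here. As an illustration only: a worksheet is a rectangular grid, so items with differing keys have to be aligned to a shared header before they can be appended. `items_to_rows` is a hypothetical helper showing one way to do that, not a Lazy Crawler function.

```python
def items_to_rows(items):
    """Flatten item dicts into a header row followed by one row per item."""
    # Build the header from every key seen, in first-seen order.
    header = []
    for item in items:
        for key in item:
            if key not in header:
                header.append(key)

    rows = [header]
    for item in items:
        # Missing or None values become empty cells; everything else is stringified.
        rows.append(["" if item.get(key) is None else str(item[key]) for key in header])
    return rows


rows = items_to_rows([
    {"title": "A", "url": "https://example.com/a"},
    {"title": None, "url": "https://example.com/b", "price": 3},
])
for row in rows:
    print(row)
```

Keeping the header in first-seen order means columns stay stable as long as the spider yields its fields in a consistent order.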

Dashboard & API

The project includes a dashboard for monitoring crawl progress and exploring extracted data.

Start the service:

uv run python -m lazy_crawler.app.main
  • Dashboard: http://localhost:8000/
  • API Documentation: http://localhost:8000/docs

Docker Deployment (Production)

Deploy using the provided orchestration files:

# Quick deployment
./deploy.sh

# Manual startup
docker compose up --build -d
  • Dashboard: http://localhost/
  • API Docs: http://localhost/docs
  • Health: http://localhost/health
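
After `docker compose up`, the containers may need a moment before the service answers. The sketch below probes the health URL listed above using only the standard library. The response body format is not documented on this page, so the check looks at the HTTP status only. `is_healthy` and its `opener` parameter are hypothetical names introduced for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(url="http://localhost/health", timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as unhealthy.
        return False
```

A deployment script could call `is_healthy()` in a short retry loop and only proceed once it returns `True`. The `opener` argument exists so the function can be exercised without a running service.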

Customization

The framework is designed to be modified. You can extend LazyBaseCrawler or implement custom pipelines to handle specific data requirements.
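
As a minimal sketch of a custom pipeline: Scrapy item pipelines are plain classes with a `process_item(item, spider)` method, so no Lazy Crawler base class is assumed here. `NormalizeTitlePipeline` is a hypothetical example that collapses stray whitespace in the `title` field yielded by the spiders above.

```python
class NormalizeTitlePipeline:
    """Hypothetical pipeline that collapses whitespace in an item's title."""

    def process_item(self, item, spider):
        title = item.get("title")
        if isinstance(title, str):
            # split() with no arguments drops leading, trailing, and repeated whitespace.
            item["title"] = " ".join(title.split())
        # Returning the item passes it on to the next pipeline.
        return item


pipeline = NormalizeTitlePipeline()
print(pipeline.process_item({"title": "  Hello \n World ", "url": "https://example.com"}, None))
```

To enable it, the class would be added to `ITEM_PIPELINES` under its import path (for example a hypothetical `my_project.pipelines.NormalizeTitlePipeline`). A priority below 400 would make it run before the `MongoPipeline` entry shown earlier, since Scrapy runs lower numbers first.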

Contributing

Technical contributions and bug reports are welcome. Please refer to CONTRIBUTING.md.

License

Lazy Crawler is licensed under the MIT License.


Created by Pradip P.
