
The Definitive Web Scraping Framework for Python


Scrava

Scrava is a powerful, composable web scraping framework for Python that provides a unified API for building scalable web scrapers by orchestrating the best tools in the Python ecosystem.

🏢 Built by Nextract Data Solutions - Your partner for enterprise web scraping and data extraction.


🎯 Philosophy

Scrava doesn't reinvent the wheel. Instead, it takes a composition-over-invention approach:

  • Unifying Force: Eliminates boilerplate and integration complexity
  • Battle-Tested Libraries: Built on httpx, Playwright, parsel, and more
  • Developer Experience: Designed to be intuitive and a "piece of cake" for newcomers
  • Production-Ready: Structured logging, statistics, error handling, and more

✨ Features

  • 🚀 Async-First: Built on asyncio for maximum performance
  • 🔄 Dual-Mode Fetching: HTTP (httpx) and Browser (Playwright) support
  • 📦 Flexible Queuing: In-memory or Redis-backed with duplicate filtering
  • 🪝 Powerful Hooks: Intercept and modify requests, responses, and data flow
  • 💾 Pipeline System: MongoDB, JSON, or custom data storage
  • 🎯 Pydantic Integration: Type-safe data models with validation
  • 📊 Structured Logging: Production-grade logging with structlog
  • ⚙️ Config Management: YAML + Pydantic for type-safe configuration
  • 🛠️ CLI Tools: Project scaffolding, bot runner, and interactive shell

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • pip (latest version recommended)

Platform-Specific Notes

macOS (Apple Silicon - M1/M2/M3/M4):

# Use native ARM64 Python for best performance
arch -arm64 pip install scrava

macOS (Intel):

pip install scrava

Windows:

pip install scrava

Linux:

pip install scrava

Installation Options

# Basic installation (works on all platforms)
pip install scrava

# With browser support (Playwright)
# Quote the extras so shells such as zsh don't expand the brackets
pip install "scrava[browser]"

# With Redis queue support
pip install "scrava[redis]"

# With MongoDB pipeline support
pip install "scrava[mongodb]"

# Install everything
pip install "scrava[all]"

Development Installation

# Clone and install in editable mode
git clone https://github.com/yourusername/scrava.git
cd scrava
pip install -e .

# With all optional dependencies
pip install -e ".[all]"

Quick Installation Scripts

For easier installation, use our platform-specific scripts:

macOS/Linux:

# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/yourusername/scrava/main/install.sh | bash

# Or download and run manually
chmod +x install.sh
./install.sh

Windows (PowerShell):

# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/yourusername/scrava/main/install.ps1 | iex

# Or download and run manually
.\install.ps1

Verify Installation

# Check if Scrava is properly installed
scrava version

# Run the welcome screen
scrava

Troubleshooting

If you encounter installation issues, see PLATFORM.md for detailed platform-specific instructions.

🚀 Quick Start

1. Create a New Project

scrava new my_project
cd my_project

2. Define Your Bot

# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""
    
    start_urls = ['https://books.toscrape.com']
    
    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())
            
            yield Book(
                title=title,
                price=price,
                url=url
            )
        
        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
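
The price conversion in the bot above assumes that every product has a price and that the only non-numeric character is the pound sign. A slightly more defensive version could look like the sketch below. `parse_price` is a hypothetical helper written for this example, not part of Scrava; it uses only the standard library.

```python
import re
from typing import Optional

# Hypothetical helper, not part of Scrava: matches the first decimal number
# in a string such as "£51.77".
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[float]:
    """Return the first number found in text, or None if there is none."""
    if not text:
        return None
    # Drop thousands separators before searching, e.g. "1,234.50".
    match = _PRICE_RE.search(text.replace(",", ""))
    return float(match.group()) if match else None


print(parse_price("£51.77"))
print(parse_price(None))
```

Inside `process`, a bot could then skip a product when `parse_price(price_text)` returns `None` instead of raising an exception.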

3. Run Your Bot

scrava run book_bot

🏗️ Core Components

Request & Response

from scrava import Request, Response

# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,  # Higher priority = processed first
    meta={'browser': True}  # Use browser rendering
)

# Response provides powerful selectors
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()
    
    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()
    
    # Join relative URLs
    absolute_url = response.urljoin('/path')
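
Joining relative URLs follows well-defined rules. The sketch below shows them with the standard library's `urllib.parse.urljoin`; that `response.urljoin` resolves links the same way against the response URL is an assumption, not something this README states.

```python
from urllib.parse import urljoin

# The URL of the page the links were found on.
base = "https://books.toscrape.com/catalogue/page-2.html"

# A relative path replaces the last segment of the base URL.
relative = urljoin(base, "page-3.html")

# A root-relative path replaces the whole path.
rooted = urljoin(base, "/index.html")

# An absolute URL is returned unchanged.
absolute = urljoin(base, "https://example.com/a")

print(relative)
print(rooted)
print(absolute)
```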

Bot Lifecycle

from scrava import BaseBot, Request, Response

class MyBot(BaseBot):
    start_urls = ['https://example.com']
    
    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}
    
    async def process(self, response: Response):
        """Main processing method."""
        yield Record(...)
        yield Request(...)
    
    async def teardown(self):
        """Called after crawling completes."""
        pass

Queue System

from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue

# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())

# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
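
To make the ideas of priority ordering and duplicate filtering concrete, here is a minimal standard-library sketch. `SketchQueue` is a hypothetical class written for this example; it is not how `MemoryQueue` or `RedisQueue` are implemented, and it treats two requests as duplicates only when their URLs are identical.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple


class SketchQueue:
    """Hypothetical in-memory queue: highest priority first, duplicates dropped."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, url: str, priority: int = 0) -> bool:
        """Queue url unless it was seen before; return True if it was queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = SketchQueue()
queue.push("https://example.com/a", priority=1)
queue.push("https://example.com/b", priority=10)
queue.push("https://example.com/a", priority=5)  # duplicate URL, ignored
order = [queue.pop(), queue.pop(), queue.pop()]
print(order)
```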

Fetchers

# HTTP fetcher (default)
from scrava import Crawler
from scrava.fetchers import HttpxFetcher

crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)

# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher

crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)

# Use browser for specific requests (inside a bot's process method)
yield Request(url, meta={'browser': True})

Hooks

Request Hooks

from scrava import Crawler
from scrava.hooks import RequestHook

class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None
    
    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None

crawler = Crawler(request_hooks=[UserAgentHook()])

Built-in Cache Hook

from scrava import Crawler
from scrava.hooks import CacheHook

# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)

# Disable caching for specific requests (inside a bot's process method)
yield Request(url, meta={'cache': False})
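
The `expiration` value is a lifetime in seconds. The sketch below shows one way an expiring cache can work, using only the standard library. `SketchCache`, its key format, and its in-memory storage are assumptions made for illustration; the README does not describe how `CacheHook` stores or keys its entries.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class SketchCache:
    """Hypothetical response cache keyed by a hash of method and URL."""

    def __init__(self, expiration: float) -> None:
        self.expiration = expiration  # lifetime of an entry, in seconds
        self._store: Dict[str, Tuple[float, bytes]] = {}

    @staticmethod
    def key(method: str, url: str) -> str:
        return hashlib.sha256(f"{method.upper()} {url}".encode()).hexdigest()

    def set(self, method: str, url: str, body: bytes,
            now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._store[self.key(method, url)] = (stored_at, body)

    def get(self, method: str, url: str,
            now: Optional[float] = None) -> Optional[bytes]:
        entry = self._store.get(self.key(method, url))
        if entry is None:
            return None
        stored_at, body = entry
        current = time.time() if now is None else now
        if current - stored_at > self.expiration:
            return None  # entry is stale; the caller should fetch again
        return body


cache = SketchCache(expiration=86400)  # one day, as in the example above
cache.set("GET", "https://example.com", b"<html></html>", now=0)
fresh = cache.get("GET", "https://example.com", now=3600)
stale = cache.get("GET", "https://example.com", now=86401)
print(fresh, stale)
```

The `now` parameter exists only so the expiry logic can be exercised without waiting.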

Pipelines

from scrava import Crawler
from scrava.pipelines import JsonPipeline, MongoPipeline

# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)

# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)

# Custom pipeline
from scrava.pipelines import BasePipeline

class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store record
        await self.save_to_db(record)
        return record
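
The `batch_size` and `batch_timeout` options suggest that records are buffered and written in groups. A minimal sketch of that idea follows. `BatchBuffer` is a hypothetical class for this example and makes no claim about how `MongoPipeline` batches internally; in particular, it checks the timeout only when a new record arrives, whereas a real pipeline would probably also flush on a timer.

```python
import time
from typing import Any, Callable, List, Optional


class BatchBuffer:
    """Hypothetical buffer that flushes on size or on elapsed time."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 batch_size: int = 100, batch_timeout: float = 5.0) -> None:
        self._flush = flush
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self._items: List[Any] = []
        self._first_added: Optional[float] = None

    def add(self, item: Any, now: Optional[float] = None) -> None:
        current = time.monotonic() if now is None else now
        if not self._items:
            self._first_added = current  # start the clock for this batch
        self._items.append(item)
        too_big = len(self._items) >= self.batch_size
        too_old = current - self._first_added >= self.batch_timeout
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is buffered; call once more at shutdown."""
        if self._items:
            self._flush(self._items)
            self._items = []
            self._first_added = None


batches: List[List[int]] = []
buffer = BatchBuffer(batches.append, batch_size=3, batch_timeout=5.0)
for n in range(7):
    buffer.add(n, now=0.0)
buffer.flush()  # write the final partial batch
print(batches)
```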

Configuration

# config/settings.yaml
project_name: "my_project"

scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false

cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400

queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"

pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"

logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true

# Load the settings in Python
from scrava.config import load_settings

settings = load_settings('config/settings.yaml')
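
Settings such as `concurrent_reqs` and `download_delay` cap how hard a crawl hits a server. The sketch below shows one common way to enforce such limits in asyncio, with a semaphore and a sleep. The `crawl` and `fetch` functions are hypothetical stand-ins and no network traffic is sent; how Scrava itself applies these settings is not described here.

```python
import asyncio
from typing import List


async def crawl(urls: List[str], concurrent_reqs: int = 16,
                download_delay: float = 0.0) -> int:
    """Pretend to fetch urls; return the peak number of in-flight requests."""
    semaphore = asyncio.Semaphore(concurrent_reqs)
    in_flight = 0
    peak = 0

    async def fetch(url: str) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # at most concurrent_reqs tasks pass at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network round trip
            in_flight -= 1
            # Hold the slot a little longer so requests are spaced out.
            await asyncio.sleep(download_delay)

    await asyncio.gather(*(fetch(url) for url in urls))
    return peak


urls = [f"https://example.com/page/{n}" for n in range(10)]
peak = asyncio.run(crawl(urls, concurrent_reqs=3))
print(peak)
```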

Logging

from scrava.logging import setup_logging, get_logger

# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)

# Get logger
logger = get_logger(__name__)

logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com

🔧 CLI Commands

# Create a new project
scrava new <project_name>

# Run a bot
scrava run <bot_name>

# List all bots
scrava list

# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser  # Use browser rendering

# Show version
scrava version

📚 Advanced Examples

Custom Callback Methods

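from pydantic import BaseModel
from scrava import BaseBot, Request, Response


class Product(BaseModel):
    """A scraped product record."""
    name: str
    price: float


class ProductBot(BaseBot):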
    start_urls = ['https://shop.example.com']
    
    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)
    
    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )
    
    async def parse_product(self, response: Response):
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )

Browser Automation

async def process(self, response: Response):
    # Scroll page, click buttons, etc. with JavaScript
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )

Error Handling Hook

from scrava.hooks import RequestHook

class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        retry_count = request.meta.get('retry_count', 0)
        if retry_count < 3:
            # Re-queue the request with an incremented counter
            request.meta['retry_count'] = retry_count + 1
            await bot.queue.push(request)
        return None
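
The counter logic in the hook above can be checked on its own. The sketch below reproduces it with a plain dictionary in place of `request.meta`; `next_retry_meta` is a hypothetical helper for this example, not a Scrava function. It shows that a limit of 3 retries means up to 4 attempts in total, counting the original fetch.

```python
from typing import Any, Dict, Optional


def next_retry_meta(meta: Dict[str, Any],
                    max_retries: int = 3) -> Optional[Dict[str, Any]]:
    """Return meta for one more attempt, or None once retries are used up."""
    retry_count = meta.get("retry_count", 0)
    if retry_count >= max_retries:
        return None
    return {**meta, "retry_count": retry_count + 1}


# Simulate a request that fails every time.
meta: Optional[Dict[str, Any]] = {"browser": True}
attempts = 1  # the original fetch
while True:
    meta = next_retry_meta(meta)
    if meta is None:
        break
    attempts += 1
print(attempts)
```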

Data Validation Pipeline

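from scrava.logging import get_logger
from scrava.pipelines import BasePipeline

logger = get_logger(__name__)

class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic validates field types when the model is built; this adds a range check
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Filter out
        return record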
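
The pipeline above returns `None` to filter a record out. The sketch below shows how a chain of stages could honor that convention, with plain dictionaries as records. The `run_stages` function and both stages are hypothetical, and the assumption that later pipelines are skipped after a `None` is inferred from the "Filter out" comment rather than confirmed by this README.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Record = Dict[str, Any]
Stage = Callable[[Record], Awaitable[Optional[Record]]]


async def drop_negative_price(record: Record) -> Optional[Record]:
    """Filter stage: returning None drops the record."""
    return None if record["price"] < 0 else record


async def round_price(record: Record) -> Optional[Record]:
    """Transform stage: returns a modified copy of the record."""
    return {**record, "price": round(record["price"], 2)}


async def run_stages(record: Record, stages: List[Stage]) -> Optional[Record]:
    """Pass record through each stage; stop as soon as one returns None."""
    current: Optional[Record] = record
    for stage in stages:
        current = await stage(current)
        if current is None:
            return None
    return current


stages = [drop_negative_price, round_price]
kept = asyncio.run(run_stages({"title": "A", "price": 9.999}, stages))
dropped = asyncio.run(run_stages({"title": "B", "price": -1.0}, stages))
print(kept, dropped)
```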

🎯 Best Practices

  1. Use Pydantic Models: Define clear schemas for your scraped data
  2. Leverage Hooks: Keep bot logic clean by using hooks for cross-cutting concerns
  3. Configure Delays: Be respectful with download_delay to avoid overwhelming servers
  4. Enable Caching: Speed up development with the built-in CacheHook
  5. Structure Logs: Use structured logging for easy debugging and monitoring
  6. Handle Errors: Implement retry logic and error hooks for robust crawls
  7. Test Selectors: Use scrava shell <url> to test CSS/XPath selectors interactively

🔗 Architecture

┌─────────────┐
│    Bot      │  ← Your scraping logic
└──────┬──────┘
       │
       ↓
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├→ Queue      (MemoryQueue / RedisQueue)
       ├→ Fetcher    (HttpxFetcher / PlaywrightFetcher)
       ├→ Hooks      (RequestHook / BotHook)
       └→ Pipelines  (MongoPipeline / JsonPipeline)
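
The flow in the diagram can be sketched as a small loop: pop a URL from the queue, fetch it, hand the page to the bot, then store the records and queue the follow-up links it yields. Everything in the sketch below (the `SITE` data, `Follow`, `fetch`, `process`, `crawl`) is hypothetical and uses only the standard library. It processes one URL at a time for clarity, whereas a real orchestrator would presumably run many fetches concurrently and pass requests through hooks.

```python
import asyncio
from collections import deque
from typing import AsyncIterator, Deque, Dict, List, Set, Union

# Stand-in site: URL -> items on the page and links to follow (made-up data).
SITE: Dict[str, Dict[str, List[str]]] = {
    "/page-1": {"items": ["a", "b"], "links": ["/page-2"]},
    "/page-2": {"items": ["c"], "links": ["/page-1"]},  # links back to page 1
}


class Follow:
    """Marker for a URL that the bot wants crawled next."""

    def __init__(self, url: str) -> None:
        self.url = url


async def fetch(url: str) -> Dict[str, List[str]]:
    await asyncio.sleep(0)  # stand-in for network I/O
    return SITE[url]


async def process(page: Dict[str, List[str]]) -> AsyncIterator[Union[str, Follow]]:
    """Bot logic: yield records and follow-up requests."""
    for item in page["items"]:
        yield item
    for link in page["links"]:
        yield Follow(link)


async def crawl(start_url: str) -> List[str]:
    queue: Deque[str] = deque([start_url])
    seen: Set[str] = {start_url}
    stored: List[str] = []  # plays the role of a pipeline's output
    while queue:
        page = await fetch(queue.popleft())
        async for result in process(page):
            if isinstance(result, Follow):
                if result.url not in seen:  # duplicate filter
                    seen.add(result.url)
                    queue.append(result.url)
            else:
                stored.append(result)
    return stored


records = asyncio.run(crawl("/page-1"))
print(records)
```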

📖 Documentation

For full documentation, visit: https://scrava.readthedocs.io

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

Scrava is built on the shoulders of giants, including httpx, Playwright, parsel, Pydantic, and structlog.


🏢 About Nextract Data Solutions

Scrava is developed and maintained by Nextract Data Solutions, a leading provider of enterprise web scraping and data extraction services.

Need enterprise-grade data extraction?

While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:

  • ✅ Custom enterprise scraping solutions
  • ✅ Data-as-a-Service (DaaS) subscriptions
  • ✅ Data enrichment and validation
  • ✅ 99.9% accuracy and reliability
  • ✅ Dedicated support and SLA guarantees

📞 Contact Nextract

Schedule a Free Strategy Call | Download Capabilities Deck


Happy Scraping! 🕷️
