The Definitive Web Scraping Framework for Python

Scrava

Scrava is a powerful, composable web scraping framework for Python that provides a unified API for building scalable web scrapers by orchestrating the best tools in the Python ecosystem.

🏢 Built by Nextract Data Solutions - Your partner for enterprise web scraping and data extraction.

🎯 Philosophy

Scrava doesn't reinvent the wheel. Instead, it takes a composition-over-invention approach:

Unifying Force: Eliminates boilerplate and integration complexity
Battle-Tested Libraries: Built on httpx, Playwright, parsel, and more
Developer Experience: Designed to be intuitive and approachable for newcomers
Production-Ready: Structured logging, statistics, error handling, and more

✨ Features

🚀 Async-First: Built on asyncio for concurrent, non-blocking I/O
🔄 Dual-Mode Fetching: HTTP (httpx) and Browser (Playwright) support
📦 Flexible Queuing: In-memory or Redis-backed with duplicate filtering
🪝 Powerful Hooks: Intercept and modify requests, responses, and data flow
💾 Pipeline System: MongoDB, JSON, or custom data storage
🎯 Pydantic Integration: Type-safe data models with validation
📊 Structured Logging: Production-grade logging with structlog
⚙️ Config Management: YAML + Pydantic for type-safe configuration
🛠️ CLI Tools: Project scaffolding, bot runner, and interactive shell

📦 Installation

Prerequisites

Python 3.8 or higher
pip (latest version recommended)

Platform-Specific Notes

macOS (Apple Silicon - M1/M2/M3/M4):

# Use native ARM64 Python for best performance
arch -arm64 pip install scrava

macOS (Intel):

pip install scrava

Windows:

pip install scrava

Linux:

pip install scrava

Installation Options

# Basic installation (works on all platforms)
pip install scrava

# With browser support (Playwright)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "scrava[browser]"

# With Redis queue support
pip install "scrava[redis]"

# With MongoDB pipeline support
pip install "scrava[mongodb]"

# Install everything
pip install "scrava[all]"

Development Installation

# Clone and install in editable mode
git clone https://github.com/yourusername/scrava.git
cd scrava
pip install -e .

# With all optional dependencies
pip install -e ".[all]"

Quick Installation Scripts

For easier installation, use our platform-specific scripts:

macOS/Linux:

# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/yourusername/scrava/main/install.sh | bash

# Or download and run manually
chmod +x install.sh
./install.sh

Windows (PowerShell):

# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/yourusername/scrava/main/install.ps1 | iex

# Or download and run manually
.\install.ps1

Verify Installation

# Check if Scrava is properly installed
scrava version

# Run the welcome screen
scrava

Troubleshooting

If you encounter installation issues, see PLATFORM.md for detailed platform-specific instructions.

🚀 Quick Start

1. Create a New Project

scrava new my_project
cd my_project

2. Define Your Bot

# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""
    
    start_urls = ['https://books.toscrape.com']
    
    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())
            
            yield Book(
                title=title,
                price=price,
                url=url
            )
        
        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
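
The price parsing in `process` above assumes the selector always returns text of the form `£51.77`; a missing element would make `price_text` `None` and raise an error. A slightly more defensive helper could look like the sketch below. It uses only the standard library, and the name `parse_price` is hypothetical, not part of Scrava.

```python
import re
from typing import Optional

# Hypothetical helper, not part of Scrava: extract the first decimal number
# from a price string such as "£51.77", tolerating None and stray whitespace.
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[float]:
    """Return the price as a float, or None if no number is found."""
    if not text:
        return None
    # Drop thousands separators before matching, e.g. "1,234.50" -> "1234.50".
    match = _PRICE_RE.search(text.replace(",", ""))
    return float(match.group()) if match else None


print(parse_price("£51.77"))
print(parse_price(None))
```

Inside the bot, a record could then be skipped whenever the helper returns `None` instead of letting the exception stop the page.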

3. Run Your Bot

scrava run book_bot

🏗️ Core Components

Request & Response

from scrava import Request, Response

# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,  # Higher priority = processed first
    meta={'browser': True}  # Use browser rendering
)

# Response provides powerful selectors
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()
    
    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()
    
    # Join relative URLs
    absolute_url = response.urljoin('/path')
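
To see how relative links resolve against a page URL, the standard library's `urllib.parse.urljoin` is a useful reference. The sketch below assumes `response.urljoin` behaves comparably, with the response URL as the base; that is an assumption, not something the Scrava documentation confirms.

```python
from urllib.parse import urljoin

# Base URL standing in for response.url (illustrative value).
base = "https://books.toscrape.com/catalogue/page-1.html"

# A relative path replaces the last path segment of the base URL.
print(urljoin(base, "page-2.html"))
# A root-relative path replaces the whole path.
print(urljoin(base, "/path"))
# An absolute URL is returned unchanged.
print(urljoin(base, "https://example.com/other"))
```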

Bot Lifecycle

from scrava import BaseBot, Request, Response

class MyBot(BaseBot):
    start_urls = ['https://example.com']

    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}

    async def process(self, response: Response):
        """Main processing method."""
        # Record is a placeholder for your own Pydantic model
        yield Record(...)
        yield Request(...)
    
    async def teardown(self):
        """Called after crawling completes."""
        pass
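
The order of the lifecycle calls can be illustrated with plain asyncio. This is a minimal sketch, not Scrava's crawler: `DemoBot` and `run_bot` are hypothetical names, and the guarantee that `teardown` runs even after an error is an assumption of this sketch.

```python
import asyncio


class DemoBot:
    """Stand-in for a bot; it does not inherit from Scrava's BaseBot."""

    def __init__(self):
        self.events = []

    async def setup(self):
        self.events.append("setup")

    async def process(self, response):
        # An async generator: each yielded value is handed back to the driver.
        self.events.append(f"process:{response}")
        yield {"page": response}

    async def teardown(self):
        self.events.append("teardown")


async def run_bot(bot, responses):
    """Hypothetical driver: set up once, process each response, always tear down."""
    records = []
    await bot.setup()
    try:
        for response in responses:
            async for item in bot.process(response):
                records.append(item)
    finally:
        await bot.teardown()
    return records


bot = DemoBot()
records = asyncio.run(run_bot(bot, ["a", "b"]))
print(bot.events)
print(records)
```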

Queue System

from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue

# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())

# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
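
The idea behind a priority queue with duplicate filtering can be sketched with the standard library. This is an illustration only, not Scrava's `MemoryQueue`: the class name, the fingerprint scheme (method plus URL), and the method names are all assumptions.

```python
import hashlib
import heapq
import itertools


class DedupPriorityQueue:
    """Illustrative in-memory queue; not Scrava's MemoryQueue."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    @staticmethod
    def fingerprint(method, url):
        # Assumed fingerprint: hash of the HTTP method and the URL.
        return hashlib.sha256(f"{method.upper()} {url}".encode()).hexdigest()

    def push(self, url, method="GET", priority=0):
        """Queue a URL; return False if an identical request was seen before."""
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the highest-priority URL, or None when the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = DedupPriorityQueue()
queue.push("https://example.com/a")
queue.push("https://example.com/b", priority=10)
print(queue.push("https://example.com/a"))  # duplicate, so it is rejected
print(queue.pop())
```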

Fetchers

# HTTP fetcher (default)
from scrava import Crawler, Request
from scrava.fetchers import HttpxFetcher

crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)

# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher

crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)

# Use browser for specific requests (inside a bot's process method)
yield Request(url, meta={'browser': True})

Hooks

Request Hooks

from scrava import Crawler
from scrava.hooks import RequestHook

class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None
    
    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None

crawler = Crawler(request_hooks=[UserAgentHook()])

Built-in Cache Hook

from scrava import Crawler, Request
from scrava.hooks import CacheHook

# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)

# Disable caching for specific requests (inside a bot's process method)
yield Request(url, meta={'cache': False})
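
The `expiration` setting can be pictured as a time-to-live on each cached response. The sketch below is an in-memory illustration under that assumption; it is not how `CacheHook` is implemented, and `TTLCache` and its methods are hypothetical names.

```python
import hashlib
import time


class TTLCache:
    """Illustrative response cache keyed by URL hash; not Scrava's CacheHook."""

    def __init__(self, expiration, clock=time.time):
        self.expiration = expiration  # seconds an entry stays valid
        self._clock = clock           # injectable clock, handy for testing
        self._store = {}

    def _key(self, url):
        return hashlib.sha256(url.encode()).hexdigest()

    def get(self, url):
        """Return the cached body, or None if missing or expired."""
        entry = self._store.get(self._key(url))
        if entry is None:
            return None
        stored_at, body = entry
        if self._clock() - stored_at > self.expiration:
            del self._store[self._key(url)]  # evict the stale entry
            return None
        return body

    def set(self, url, body):
        self._store[self._key(url)] = (self._clock(), body)


now = [0.0]  # fake clock so the demo does not need to wait a day
cache = TTLCache(expiration=86400, clock=lambda: now[0])
cache.set("https://example.com", "<html>...</html>")
print(cache.get("https://example.com"))
now[0] = 86401.0
print(cache.get("https://example.com"))
```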

Pipelines

from scrava import Crawler
from scrava.pipelines import JsonPipeline, MongoPipeline

# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)

# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)

# Custom pipeline
from scrava.pipelines import BasePipeline

class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store record
        await self.save_to_db(record)
        return record
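
The `batch_size` and `batch_timeout` options suggest that records are buffered and written in groups. One way such batching could work is sketched below; this is an assumption about the general technique, not `MongoPipeline`'s code, and `Batcher` is a hypothetical name. Note that this simplified version checks the timeout only when a new record arrives.

```python
import time


class Batcher:
    """Illustrative size- and time-based batcher; not Scrava's MongoPipeline."""

    def __init__(self, flush, batch_size=100, batch_timeout=5.0, clock=time.monotonic):
        self._flush = flush            # callable that receives a list of records
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self._clock = clock
        self._buffer = []
        self._first_at = None          # time the oldest buffered record arrived

    def add(self, record):
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(record)
        full = len(self._buffer) >= self.batch_size
        stale = self._clock() - self._first_at >= self.batch_timeout
        if full or stale:
            self.flush()

    def flush(self):
        """Write out whatever is buffered; call once more at shutdown."""
        if self._buffer:
            self._flush(list(self._buffer))
            self._buffer.clear()


batches = []
batcher = Batcher(batches.append, batch_size=2)
for record in (1, 2, 3):
    batcher.add(record)
batcher.flush()  # final flush picks up the partial batch
print(batches)
```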

Configuration

# config/settings.yaml
project_name: "my_project"

scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false

cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400

queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"

pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"

logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true

# Load the settings in Python
from scrava.config import load_settings

settings = load_settings('config/settings.yaml')

Logging

from scrava.logging import setup_logging, get_logger

# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)

# Get logger
logger = get_logger(__name__)

logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com

🔧 CLI Commands

# Create a new project
scrava new <project_name>

# Run a bot
scrava run <bot_name>

# List all bots
scrava list

# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser  # Use browser rendering

# Show version
scrava version

📚 Advanced Examples

Custom Callback Methods

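from pydantic import BaseModel
from scrava import BaseBot, Request, Response


class Product(BaseModel):
    """A scraped product record."""
    name: str
    price: float


class ProductBot(BaseBot):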
    start_urls = ['https://shop.example.com']
    
    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)
    
    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )
    
    async def parse_product(self, response: Response):
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )

Browser Automation

async def process(self, response: Response):
    # Scroll page, click buttons, etc. with JavaScript
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )

Error Handling Hook

from scrava.hooks import RequestHook

class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        retry_count = request.meta.get('retry_count', 0)
        if retry_count < 3:
            # Re-queue the request with an incremented counter (at most 3 retries)
            request.meta['retry_count'] = retry_count + 1
            await bot.queue.push(request)
        return None
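
The counter in the retry hook allows three retries on top of the initial attempt. The standalone sketch below runs the same bookkeeping on a plain dictionary to make that explicit; `should_retry` is a hypothetical helper, not a Scrava function.

```python
def should_retry(meta, max_retries=3):
    """Increment meta['retry_count'] and report whether another attempt is allowed."""
    count = meta.get("retry_count", 0)
    if count >= max_retries:
        return False
    meta["retry_count"] = count + 1
    return True


meta = {}
attempts = 1  # the initial attempt
while should_retry(meta):
    attempts += 1  # each allowed retry is one more attempt

print(attempts, meta)
```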

Data Validation Pipeline

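from scrava.logging import get_logger
from scrava.pipelines import BasePipeline

logger = get_logger(__name__)

class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic has already validated field types; this adds a business rule
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Filter out
        return record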
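
Returning `None` to filter out a record implies that later pipelines never see it. A chain with that behavior can be sketched in plain asyncio as follows. The stage functions and `run_pipelines` are hypothetical, and the short-circuit rule is an assumption about how the pipelines are chained.

```python
import asyncio


async def drop_negative_price(record):
    # Filter stage: returning None removes the record from the flow.
    return None if record["price"] < 0 else record


async def round_price(record):
    # Transform stage: return a modified copy of the record.
    return {**record, "price": round(record["price"], 2)}


async def run_pipelines(record, pipelines):
    """Pass a record through each stage; stop as soon as one returns None."""
    for stage in pipelines:
        record = await stage(record)
        if record is None:
            return None
    return record


stages = [drop_negative_price, round_price]
print(asyncio.run(run_pipelines({"title": "A", "price": 9.999}, stages)))
print(asyncio.run(run_pipelines({"title": "B", "price": -1.0}, stages)))
```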

🎯 Best Practices

Use Pydantic Models: Define clear schemas for your scraped data
Leverage Hooks: Keep bot logic clean by using hooks for cross-cutting concerns
Configure Delays: Be respectful with download_delay to avoid overwhelming servers
Enable Caching: Speed up development with the built-in CacheHook
Structure Logs: Use structured logging for easy debugging and monitoring
Handle Errors: Implement retry logic and error hooks for robust crawls
Test Selectors: Use scrava shell <url> to test CSS/XPath selectors interactively
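
The effect of limiting concurrency and adding a delay can be sketched with an `asyncio.Semaphore`. This illustrates the general technique only; how Scrava applies `concurrent_reqs` and `download_delay` internally is not shown in this document, and `polite_fetch` is a hypothetical function that sleeps instead of making network requests.

```python
import asyncio


async def polite_fetch(urls, concurrent_reqs=2, download_delay=0.01):
    """Process URLs with at most `concurrent_reqs` in flight at once."""
    semaphore = asyncio.Semaphore(concurrent_reqs)
    in_flight = 0
    peak = 0
    results = []

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            # The sleep stands in for the delay plus the network round trip.
            await asyncio.sleep(download_delay)
            results.append(url)
            in_flight -= 1

    await asyncio.gather(*(worker(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(5)]
results, peak = asyncio.run(polite_fetch(urls))
print(len(results), peak)
```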

🔗 Architecture

┌─────────────┐
│    Bot      │  ← Your scraping logic
└──────┬──────┘
       │
       ↓
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├→ Queue      (MemoryQueue / RedisQueue)
       ├→ Fetcher    (HttpxFetcher / PlaywrightFetcher)
       ├→ Hooks      (RequestHook / BotHook)
       └→ Pipelines  (MongoPipeline / JsonPipeline)

📖 Documentation

For full documentation, visit: https://scrava.readthedocs.io

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

Scrava is built on the shoulders of giants:

httpx - HTTP client
Playwright - Browser automation
parsel - Data extraction
Pydantic - Data validation
structlog - Structured logging
Typer - CLI framework

🏢 About Nextract Data Solutions

Scrava is developed and maintained by Nextract Data Solutions, a leading provider of enterprise web scraping and data extraction services.

Need enterprise-grade data extraction?

While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:

✅ Custom enterprise scraping solutions
✅ Data-as-a-Service (DaaS) subscriptions
✅ Data enrichment and validation
✅ 99.9% accuracy and reliability
✅ Dedicated support and SLA guarantees

📞 Contact Nextract

Website: https://nextract.dev
Email: hello@nextract.dev
Phone: +91 85110-98799
GitHub: @nextractdevelopers

Schedule a Free Strategy Call | Download Capabilities Deck

Happy Scraping! 🕷️

Release history

0.1.1 (Oct 12, 2025)
0.1.0 (Oct 12, 2025, this version)

Download files

Source Distribution

scrava-0.1.0.tar.gz (35.1 kB view details)

Uploaded Oct 12, 2025 Source

Built Distribution

scrava-0.1.0-py3-none-any.whl (13.3 kB view details)

Uploaded Oct 12, 2025 Python 3

File details

Details for the file scrava-0.1.0.tar.gz.

File metadata

Download URL: scrava-0.1.0.tar.gz
Upload date: Oct 12, 2025
Size: 35.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for scrava-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ee927a98996b4da0046975b8b73362e53344f95540e12e943622e2eb86d4a4c8`
MD5	`a3922753dbbd486715d1e0f8e4d335c9`
BLAKE2b-256	`966dbf7245ab864258e79cb0b4d5ca79e7cbec9b3c3fa9d087d66a44351c9004`

File details

Details for the file scrava-0.1.0-py3-none-any.whl.

File metadata

Download URL: scrava-0.1.0-py3-none-any.whl
Upload date: Oct 12, 2025
Size: 13.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for scrava-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d9fee2c79519848f0dc122424caa362fb493bbe1347b437e79657f2a876d401`
MD5	`6345bd69397ab50645cc887779d78b18`
BLAKE2b-256	`c3f7160d746d6e90e4e9f2866fb306757beb622f14e305e144338362295e524b`

scrava 0.1.0
