The Definitive Web Scraping Framework for Python
Scrava
Scrava is a powerful, composable web scraping framework for Python that provides a unified API for building scalable web scrapers by orchestrating the best tools in the Python ecosystem.
Built by Nextract Data Solutions - Your partner for enterprise web scraping and data extraction.
Philosophy
Scrava doesn't reinvent the wheel. Instead, it provides a composition-over-invention approach:
- Unifying Force: Eliminates boilerplate and integration complexity
- Battle-Tested Libraries: Built on httpx, Playwright, parsel, and more
- Developer Experience: Designed to be intuitive and easy for newcomers to pick up
- Production-Ready: Structured logging, statistics, error handling, and more
Features
- Async-First: Built on asyncio for maximum performance
- Dual-Mode Fetching: HTTP (httpx) and Browser (Playwright) support
- Flexible Queuing: In-memory or Redis-backed with duplicate filtering
- Powerful Hooks: Intercept and modify requests, responses, and data flow
- Pipeline System: MongoDB, JSON, or custom data storage
- Pydantic Integration: Type-safe data models with validation
- Structured Logging: Production-grade logging with structlog
- Config Management: YAML + Pydantic for type-safe configuration
- CLI Tools: Project scaffolding, bot runner, and interactive shell
Installation
Prerequisites
- Python 3.8 or higher
- pip (latest version recommended)
Platform-Specific Notes
macOS (Apple Silicon - M1/M2/M3/M4):
# Use native ARM64 Python for best performance
arch -arm64 pip install scrava
macOS (Intel):
pip install scrava
Windows:
pip install scrava
Linux:
pip install scrava
Installation Options
# Basic installation (works on all platforms)
pip install scrava
# With browser support (Playwright)
pip install scrava[browser]
# With Redis queue support
pip install scrava[redis]
# With MongoDB pipeline support
pip install scrava[mongodb]
# Install everything
pip install scrava[all]
Development Installation
# Clone and install in editable mode
git clone https://github.com/yourusername/scrava.git
cd scrava
pip install -e .
# With all optional dependencies
pip install -e ".[all]"
Quick Installation Scripts
For easier installation, use our platform-specific scripts:
macOS/Linux:
# Auto-detects architecture and installs correctly
curl -sSL https://raw.githubusercontent.com/yourusername/scrava/main/install.sh | bash
# Or download and run manually
chmod +x install.sh
./install.sh
Windows (PowerShell):
# Download and run the installation script
iwr -useb https://raw.githubusercontent.com/yourusername/scrava/main/install.ps1 | iex
# Or download and run manually
.\install.ps1
Verify Installation
# Check if Scrava is properly installed
scrava version
# Run the welcome screen
scrava
Troubleshooting
If you encounter installation issues, see PLATFORM.md for detailed platform-specific instructions.
Quick Start
1. Create a New Project
scrava new my_project
cd my_project
2. Define Your Bot
# bots/book_bot.py
from pydantic import BaseModel, HttpUrl
from scrava import BaseBot, Request, Response


class Book(BaseModel):
    """A scraped book record."""
    title: str
    price: float
    url: HttpUrl
    in_stock: bool = True


class BookBot(BaseBot):
    """Bot for scraping books.toscrape.com"""

    start_urls = ['https://books.toscrape.com']

    async def process(self, response: Response):
        """Extract book data from the page."""
        # Extract books using parsel selectors
        for book in response.selector.css('article.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price_text = book.css('.price_color::text').get()
            price = float(price_text.replace('£', ''))
            url = response.urljoin(book.css('h3 a::attr(href)').get())
            yield Book(
                title=title,
                price=price,
                url=url
            )

        # Follow pagination
        next_page = response.selector.css('.next a::attr(href)').get()
        if next_page:
            yield Request(response.urljoin(next_page))
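The `float(price_text.replace(...))` call in the bot above raises if the selector returns `None` or if the text carries an unexpected symbol. A minimal sketch of a more defensive parser follows, using only the standard library. `parse_price` is a hypothetical helper for illustration and is not part of Scrava.

```python
import re
from typing import Optional

# Hypothetical helper (not part of Scrava): matches the first decimal number in a string.
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[float]:
    """Return the numeric part of a price such as '£51.77', or None if there is none."""
    if not text:
        return None
    # Drop thousands separators so '1,234.50' parses as one number.
    match = _PRICE_RE.search(text.replace(",", ""))
    return float(match.group()) if match else None


print(parse_price("£51.77"))
print(parse_price(None))
```

Inside `process`, a book whose price comes back as `None` could be skipped or logged rather than passed to the `Book` model, where `price: float` would reject it.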
3. Run Your Bot
scrava run book_bot
Core Components
Request & Response
from scrava import Request, Response

# Create a request
request = Request(
    url='https://example.com',
    method='GET',
    headers={'User-Agent': 'MyBot/1.0'},
    priority=10,  # Higher priority = processed first
    meta={'browser': True}  # Use browser rendering
)

# Response provides powerful selectors (shown as a method of a BaseBot subclass)
async def process(self, response: Response):
    # CSS selectors
    title = response.selector.css('h1::text').get()
    # XPath selectors
    links = response.selector.xpath('//a/@href').getall()
    # Join relative URLs
    absolute_url = response.urljoin('/path')
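To make the "join relative URLs" step concrete, here is a short sketch using the standard library's `urllib.parse.urljoin`. It assumes that `response.urljoin` resolves links against the response URL in the same way, which the text above does not confirm. The base URL is only an illustration.

```python
from urllib.parse import urljoin

# Illustrative page URL; in a bot this would be the URL of the response being processed.
base = "https://books.toscrape.com/catalogue/page-2.html"

# A bare relative link replaces the last path segment.
print(urljoin(base, "page-3.html"))
# A leading slash resolves against the site root.
print(urljoin(base, "/path"))
# '..' climbs one directory before resolving.
print(urljoin(base, "../index.html"))
# An absolute URL is returned unchanged.
print(urljoin(base, "https://example.com/x"))
```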
Bot Lifecycle
from scrava import BaseBot, Request, Response


class MyBot(BaseBot):
    start_urls = ['https://example.com']

    async def setup(self):
        """Called before crawling starts."""
        self.session_data = {}

    async def process(self, response: Response):
        """Main processing method."""
        # Record stands for your own Pydantic model; arguments are elided
        yield Record(...)
        yield Request(...)

    async def teardown(self):
        """Called after crawling completes."""
        pass
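The lifecycle above can be pictured with a toy driver that calls `setup`, consumes the async generator returned by `process`, and then always calls `teardown`. This is a sketch of the general pattern only, not Scrava's actual orchestrator. `DemoBot`, `FakeRequest`, and `run_bot` are hypothetical names introduced for the example.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Stand-in for a request object; hypothetical, not Scrava's Request."""
    url: str


class DemoBot:
    """Toy bot exposing the same three lifecycle methods as the example above."""

    def __init__(self):
        self.events = []

    async def setup(self):
        self.events.append("setup")

    async def process(self, page: str):
        # An async generator can yield a mix of records and follow-up requests.
        yield {"title": page.upper()}
        yield FakeRequest(url=f"https://example.com/{page}/next")

    async def teardown(self):
        self.events.append("teardown")


async def run_bot(bot, pages):
    """Hypothetical driver: set up once, process each page, always tear down."""
    records, requests = [], []
    await bot.setup()
    try:
        for page in pages:
            async for item in bot.process(page):
                # Route each yielded item by its type.
                if isinstance(item, FakeRequest):
                    requests.append(item)
                else:
                    records.append(item)
    finally:
        await bot.teardown()
    return records, requests


bot = DemoBot()
records, requests = asyncio.run(run_bot(bot, ["a", "b"]))
print(records, [r.url for r in requests], bot.events)
```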
Queue System
from scrava import Crawler
from scrava.queue import MemoryQueue, RedisQueue
# In-memory queue (default)
crawler = Crawler(queue=MemoryQueue())
# Redis-backed queue for distributed crawls
crawler = Crawler(queue=RedisQueue(redis_url="redis://localhost:6379/0"))
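As a rough illustration of the two queue behaviours described in this README (higher priority is processed first, and duplicates are filtered), here is a toy in-memory queue built on `heapq`. It is a sketch under those assumptions and does not reflect how `MemoryQueue` or `RedisQueue` are implemented. `DedupPriorityQueue` is a hypothetical name.

```python
import heapq
import itertools


class DedupPriorityQueue:
    """Toy in-memory queue: highest priority first, each URL accepted once."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def push(self, url: str, priority: int = 0) -> bool:
        if url in self._seen:
            return False  # duplicate, dropped
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = DedupPriorityQueue()
queue.push("https://example.com/a", priority=1)
queue.push("https://example.com/b", priority=10)
queue.push("https://example.com/a", priority=5)  # ignored: URL already seen
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

A Redis-backed variant would typically keep the seen-set and the ordered queue in Redis structures so several workers can share them.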
Fetchers
# HTTP fetcher (default)
from scrava import Crawler, Request
from scrava.fetchers import HttpxFetcher

crawler = Crawler(
    fetcher=HttpxFetcher(
        timeout=30.0,
        follow_redirects=True,
        verify_ssl=True
    )
)

# Browser fetcher for JavaScript-heavy sites
from scrava.fetchers import PlaywrightFetcher

crawler = Crawler(
    browser_fetcher=PlaywrightFetcher(
        headless=True,
        browser_type='chromium',
        context_pool_size=5
    ),
    enable_browser=True
)

# Use browser for specific requests (inside a bot's process method)
yield Request(url, meta={'browser': True})
Hooks
Request Hooks
from scrava import Crawler
from scrava.hooks import RequestHook


class UserAgentHook(RequestHook):
    async def process_req(self, request, bot):
        # Modify request before fetching
        request.headers['User-Agent'] = 'MyBot/1.0'
        return None

    async def process_res(self, request, response, bot):
        # Process response after fetching
        print(f"Got {response.status} from {response.url}")
        return None


crawler = Crawler(request_hooks=[UserAgentHook()])
Built-in Cache Hook
from scrava import Crawler, Request
from scrava.hooks import CacheHook

# Enable caching
crawler = Crawler(
    request_hooks=[
        CacheHook(expiration=86400)  # Cache for 1 day
    ]
)

# Disable caching for specific requests (inside a bot's process method)
yield Request(url, meta={'cache': False})
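The `expiration=86400` argument suggests time-based expiry. The sketch below shows one way such a cache could behave, with an injectable clock so the expiry can be demonstrated without waiting. It is an illustration of the idea only. `ExpiringCache` is a hypothetical class, and `CacheHook`'s real storage and key scheme are not described in this README.

```python
import time


class ExpiringCache:
    """Toy response cache keyed by URL; entries expire after `expiration` seconds."""

    def __init__(self, expiration: float, clock=time.monotonic):
        self.expiration = expiration
        self._clock = clock
        self._store = {}

    def set(self, url: str, body: str) -> None:
        # Remember when the entry was stored alongside the body.
        self._store[url] = (self._clock(), body)

    def get(self, url: str):
        entry = self._store.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if self._clock() - stored_at >= self.expiration:
            del self._store[url]  # stale entry, evict it
            return None
        return body


# A fake clock makes the one-day expiry observable immediately.
now = [0.0]
cache = ExpiringCache(expiration=86400, clock=lambda: now[0])
cache.set("https://example.com", "<html>...</html>")
fresh = cache.get("https://example.com")
now[0] += 86400  # one day later
stale = cache.get("https://example.com")
print(fresh, stale)
```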
Pipelines
from scrava import Crawler
from scrava.pipelines import JsonPipeline, MongoPipeline

# JSON output
crawler = Crawler(
    pipelines=[JsonPipeline(output_file='output.jsonl')]
)

# MongoDB with batching
crawler = Crawler(
    pipelines=[
        MongoPipeline(
            uri='mongodb://localhost:27017',
            database='scrava',
            batch_size=100,
            batch_timeout=5.0
        )
    ]
)

# Custom pipeline
from scrava.pipelines import BasePipeline


class CustomPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Process and store record; save_to_db is a method you implement
        await self.save_to_db(record)
        return record
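The `batch_size` and `batch_timeout` parameters point to a buffer that flushes either when it is full or when its oldest record has waited long enough. The following is a minimal sketch of that idea and an assumption about the semantics, not `MongoPipeline`'s actual code. `BatchBuffer` is a hypothetical name, and the timeout is only checked when a record is added.

```python
import time


class BatchBuffer:
    """Toy batcher: flush when full, or when the oldest buffered record is too old."""

    def __init__(self, flush, batch_size=100, batch_timeout=5.0, clock=time.monotonic):
        self._flush = flush            # callable that receives a list of records
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self._clock = clock
        self._items = []
        self._first_at = None

    def add(self, record) -> None:
        if not self._items:
            self._first_at = self._clock()  # start timing with the first record
        self._items.append(record)
        full = len(self._items) >= self.batch_size
        timed_out = self._clock() - self._first_at >= self.batch_timeout
        if full or timed_out:
            self.flush()

    def flush(self) -> None:
        if self._items:
            self._flush(list(self._items))
            self._items.clear()


batches = []
now = [0.0]
buf = BatchBuffer(batches.append, batch_size=3, batch_timeout=5.0, clock=lambda: now[0])
for i in range(4):
    buf.add(i)   # the third record fills the first batch
now[0] = 6.0
buf.add(4)       # the pending batch is older than 5 seconds, so it flushes
buf.flush()      # nothing left to write
print(batches)
```

A real pipeline would also need a background timer and a final flush at shutdown so that a quiet crawl does not leave records stranded in the buffer.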
Configuration
# config/settings.yaml
project_name: "my_project"

scrava:
  concurrent_reqs: 16
  download_delay: 0.0
  enable_browser: false

cache:
  enabled: true
  path: ".scrava_cache"
  expiration_secs: 86400

queue:
  backend: "scrava.queue.memory.MemoryQueue"
  redis_url: "redis://localhost:6379/0"

pipeline:
  enabled:
    - scrava.pipelines.json.JsonPipeline
  mongodb_uri: "mongodb://localhost:27017"
  mongodb_database: "scrava"

logging:
  level: "INFO"
  format: "console"  # or "json" for production
  use_colors: true

Load the settings from Python:
from scrava.config import load_settings

settings = load_settings('config/settings.yaml')
Logging
from scrava.logging import setup_logging, get_logger

# Setup logging
setup_logging(
    level="INFO",
    format="console",  # "json" for production
    use_colors=True
)

# Get logger
logger = get_logger(__name__)
logger.info("Bot started", bot_name="my_bot", url="https://example.com")
# Output: 2024-10-27 10:30:05 [info] Bot started bot_name=my_bot url=https://example.com
CLI Commands
# Create a new project
scrava new <project_name>
# Run a bot
scrava run <bot_name>
# List all bots
scrava list
# Interactive selector shell
scrava shell <url>
scrava shell <url> --browser # Use browser rendering
# Show version
scrava version
Advanced Examples
Custom Callback Methods
from scrava import BaseBot, Request, Response


class ProductBot(BaseBot):
    start_urls = ['https://shop.example.com']

    async def process(self, response: Response):
        # Extract category links
        for category in response.selector.css('.category'):
            url = response.urljoin(category.css('a::attr(href)').get())
            yield Request(url, callback=self.parse_category)

    async def parse_category(self, response: Response):
        # Extract products
        for product in response.selector.css('.product'):
            yield Request(
                response.urljoin(product.css('a::attr(href)').get()),
                callback=self.parse_product
            )

    async def parse_product(self, response: Response):
        # Product is your own Pydantic model with name and price fields
        yield Product(
            name=response.selector.css('h1::text').get(),
            price=float(response.selector.css('.price::text').get())
        )
Browser Automation
async def process(self, response: Response):
    # Scroll page, click buttons, etc. with JavaScript
    yield Request(
        url='https://spa-site.com',
        meta={
            'browser': True,
            'wait_for': '.dynamic-content',
            'scroll': True
        }
    )
Error Handling Hook
from scrava.hooks import RequestHook


class RetryHook(RequestHook):
    async def process_exc(self, request, exception, bot):
        if request.meta.get('retry_count', 0) < 3:
            # Retry with incremented counter
            request.meta['retry_count'] = request.meta.get('retry_count', 0) + 1
            await bot.queue.push(request)
        return None
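The retry bookkeeping in the hook above can be exercised on its own with plain dictionaries and a list standing in for the request and the queue. This is a sketch of the counting logic only. `schedule_retry` and `MAX_RETRIES` are hypothetical names, and no Scrava objects are involved.

```python
MAX_RETRIES = 3  # same budget as the hook above


def schedule_retry(request: dict, queue: list, max_retries: int = MAX_RETRIES) -> bool:
    """Re-enqueue a failed request unless its retry budget is spent."""
    meta = request.setdefault("meta", {})
    attempts = meta.get("retry_count", 0)
    if attempts >= max_retries:
        return False  # give up; the caller can log or record the failure
    meta["retry_count"] = attempts + 1
    queue.append(request)
    return True


queue = []
request = {"url": "https://example.com/flaky"}
# Simulate five consecutive failures of the same request.
results = [schedule_retry(request, queue) for _ in range(5)]
print(results, request["meta"]["retry_count"], len(queue))
```

Keeping the counter in the request's own metadata means the budget travels with the request through the queue, so no separate lookup table is needed.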
Data Validation Pipeline
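from scrava.logging import get_logger
from scrava.pipelines import BasePipeline

logger = get_logger(__name__)


class ValidationPipeline(BasePipeline):
    async def process_rec(self, record, bot):
        # Pydantic has already validated field types; enforce business rules here
        if record.price < 0:
            logger.warning("Invalid price", record=record)
            return None  # Filter out
        return record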
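The pipeline above returns `None` to filter a record out. One plausible reading is that stages run in order and the chain stops at the first `None`. The sketch below demonstrates that pattern with plain async functions. It is an assumption about the flow, and `run_pipelines` and the two stages are hypothetical names rather than Scrava APIs.

```python
import asyncio


async def drop_negative_price(record):
    # Returning None signals that the record should be discarded.
    return None if record["price"] < 0 else record


async def round_price(record):
    # Return a new dict rather than mutating the input.
    return {**record, "price": round(record["price"], 2)}


async def run_pipelines(record, pipelines):
    """Pass a record through each stage; stop as soon as one returns None."""
    for stage in pipelines:
        record = await stage(record)
        if record is None:
            return None
    return record


stages = [drop_negative_price, round_price]
kept = asyncio.run(run_pipelines({"title": "A", "price": 9.999}, stages))
dropped = asyncio.run(run_pipelines({"title": "B", "price": -1.0}, stages))
print(kept, dropped)
```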
Best Practices
- Use Pydantic Models: Define clear schemas for your scraped data
- Leverage Hooks: Keep bot logic clean by using hooks for cross-cutting concerns
- Configure Delays: Be respectful with download_delay to avoid overwhelming servers
- Enable Caching: Speed up development with the built-in CacheHook
- Structure Logs: Use structured logging for easy debugging and monitoring
- Handle Errors: Implement retry logic and error hooks for robust crawls
- Test Selectors: Use scrava shell <url> to test CSS/XPath selectors interactively
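The advice on delays ties back to the `concurrent_reqs` and `download_delay` settings. A common way to enforce such limits in asyncio code is a semaphore plus a sleep, sketched below. This shows the general technique and is not Scrava's scheduler. `crawl` and `fetch` are hypothetical names, and the sleep stands in for real network I/O.

```python
import asyncio


async def crawl(urls, concurrent_reqs=16, download_delay=0.0):
    """Toy crawl loop: at most `concurrent_reqs` fetches are in flight at once."""
    semaphore = asyncio.Semaphore(concurrent_reqs)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            # Stands in for the network call plus the politeness delay.
            await asyncio.sleep(download_delay)
            in_flight -= 1
            return url

    # gather preserves the order of the input URLs in its result list.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
results, peak = asyncio.run(crawl(urls, concurrent_reqs=3, download_delay=0.01))
print(len(results), peak)
```

The recorded peak never exceeds `concurrent_reqs`, which is the property that keeps a crawl from overwhelming a server.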
Architecture
┌─────────────┐
│     Bot     │  ← Your scraping logic
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Core     │  ← Orchestrator (asyncio event loop)
└──────┬──────┘
       │
       ├─ Queue (MemoryQueue / RedisQueue)
       ├─ Fetcher (HttpxFetcher / PlaywrightFetcher)
       ├─ Hooks (RequestHook / BotHook)
       └─ Pipelines (MongoPipeline / JsonPipeline)
Documentation
For full documentation, visit: https://scrava.readthedocs.io
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
Acknowledgments
Scrava is built on the shoulders of giants:
- httpx - HTTP client
- Playwright - Browser automation
- parsel - Data extraction
- Pydantic - Data validation
- structlog - Structured logging
- Typer - CLI framework
About Nextract Data Solutions
Scrava is developed and maintained by Nextract Data Solutions, a leading provider of enterprise web scraping and data extraction services.
Need enterprise-grade data extraction?
While Scrava is perfect for developers building their own scrapers, Nextract Data Solutions offers done-for-you web scraping and data pipelines for businesses that need:
- Custom enterprise scraping solutions
- Data-as-a-Service (DaaS) subscriptions
- Data enrichment and validation
- 99.9% accuracy and reliability
- Dedicated support and SLA guarantees
Contact Nextract
- Website: https://nextract.dev
- Email: hello@nextract.dev
- Phone: +91 85110-98799
- GitHub: @nextractdevelopers
Schedule a Free Strategy Call | Download Capabilities Deck
Happy Scraping!