A powerful Python SDK for web scraping, crawling, and data extraction - inspired by Firecrawl
RapidCrawl
A powerful Python SDK for web scraping, crawling, and data extraction. RapidCrawl provides a comprehensive toolkit for extracting data from websites, handling dynamic content, and converting web pages into clean, structured formats suitable for AI and LLM applications.
Features
- Scrape: Convert any URL into clean markdown, HTML, text, or structured data
- Crawl: Recursively crawl websites with depth control and filtering
- Map: Quickly discover all URLs on a website
- Search: Web search with automatic result scraping
- Screenshot: Capture full-page screenshots
- Dynamic Content: Handle JavaScript-rendered pages with Playwright
- Multiple Formats: Support for Markdown, HTML, PDF, images, and more
- Async Support: High-performance asynchronous operations
- Error Handling: Comprehensive error handling and retry logic
- CLI Tool: Feature-rich command-line interface
Table of Contents
- Installation
- Quick Start
- Features
- CLI Usage
- Configuration
- API Reference
- Examples
- Advanced Usage
- Performance
- Troubleshooting
- Development
- Contributing
- Security
- License
- Support
Installation
Using pip
pip install rapid-crawl
Using pip with all optional dependencies
pip install rapid-crawl[dev]
From source
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
pip install -e .
Install Playwright browsers (required for dynamic content)
playwright install chromium
Quick Start
Python SDK
from rapidcrawl import RapidCrawlApp
# Initialize the client
app = RapidCrawlApp()
# Scrape a single page
result = app.scrape_url("https://example.com")
print(result.content["markdown"])
# Crawl a website
crawl_result = app.crawl_url(
    "https://example.com",
    max_pages=10,
    max_depth=2
)
# Map all URLs
map_result = app.map_url("https://example.com")
print(f"Found {map_result.total_urls} URLs")
# Search and scrape
search_result = app.search(
    "python web scraping",
    num_results=5,
    scrape_results=True
)
Command Line
# Scrape a URL
rapidcrawl scrape https://example.com
# Crawl a website
rapidcrawl crawl https://example.com --max-pages 10
# Map URLs
rapidcrawl map https://example.com --limit 100
# Search
rapidcrawl search "python tutorials" --scrape
Features
Scraping
Convert any web page into clean, structured data:
from rapidcrawl import RapidCrawlApp, OutputFormat
app = RapidCrawlApp()
# Basic scraping
result = app.scrape_url("https://example.com")
# Multiple formats
result = app.scrape_url(
    "https://example.com",
    formats=["markdown", "html", "screenshot"],
    wait_for=".content",  # Wait for element
    timeout=60000,  # 60 seconds timeout
)
# Extract structured data
result = app.scrape_url(
    "https://example.com/product",
    extract_schema=[
        {"name": "title", "selector": "h1"},
        {"name": "price", "selector": ".price", "type": "number"},
        {"name": "description", "selector": ".description"}
    ]
)
print(result.structured_data)
# {'title': 'Product Name', 'price': 29.99, 'description': '...'}
# Mobile viewport
result = app.scrape_url(
    "https://example.com",
    mobile=True
)
# With actions (click, type, scroll)
result = app.scrape_url(
    "https://example.com",
    actions=[
        {"type": "click", "selector": ".load-more"},
        {"type": "wait", "value": 2000},
        {"type": "scroll", "value": 1000}
    ]
)
Crawling
Recursively crawl websites with advanced filtering:
# Basic crawling
result = app.crawl_url(
    "https://example.com",
    max_pages=50,
    max_depth=3
)
# With URL filtering
result = app.crawl_url(
    "https://example.com",
    include_patterns=[r"/blog/.*", r"/docs/.*"],
    exclude_patterns=[r".*\.pdf$", r".*/tag/.*"]
)
# Async crawling for better performance
import asyncio

async def crawl_async():
    result = await app.crawl_url_async(
        "https://example.com",
        max_pages=100,
        max_depth=5,
        allow_subdomains=True
    )
    return result

result = asyncio.run(crawl_async())
# With webhook notifications
result = app.crawl_url(
    "https://example.com",
    webhook_url="https://your-webhook.com/progress"
)
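To see those progress notifications arrive, a minimal receiver could look like the sketch below. It is only an illustration: the exact payload RapidCrawl posts is not documented in this README, so the handler simply echoes whatever JSON it receives, and the port is a placeholder.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted by the crawler and print it.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print("Crawl progress:", payload)
        self.send_response(200)
        self.end_headers()

# Expose this at a public URL and pass that URL as webhook_url.
HTTPServer(("", 8000), CrawlWebhookHandler).serve_forever()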
Mapping
Quickly discover all URLs on a website:
# Basic mapping
result = app.map_url("https://example.com")
print(f"Found {result.total_urls} URLs")
# Filter URLs by search term
result = app.map_url(
    "https://example.com",
    search="product",
    limit=1000
)
# Include subdomains
result = app.map_url(
    "https://example.com",
    include_subdomains=True,
    ignore_sitemap=False  # Use sitemap.xml if available
)
# Access the URLs
for url in result.urls[:10]:
    print(url)
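Mapping pairs well with scraping: discover URLs first, then fetch only the pages you care about. A small sketch using just the map_url and scrape_url calls shown above (the "docs" filter is only an example):
# Map matching pages, then scrape the first few as markdown.
map_result = app.map_url("https://example.com", search="docs", limit=50)
pages = []
for url in map_result.urls[:5]:
    page = app.scrape_url(url, formats=["markdown"])
    if page.success:
        pages.append(page.content["markdown"])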
Searching
Search the web and optionally scrape results:
# Basic search
result = app.search("python web scraping tutorial")
# Search with scraping
result = app.search(
    "latest AI news",
    num_results=10,
    scrape_results=True,
    formats=["markdown", "text"]
)
# Access results
for item in result.results:
    print(f"{item.position}. {item.title}")
    print(f"   URL: {item.url}")
    if item.scraped_content:
        print(f"   Content: {item.scraped_content.content['markdown'][:200]}...")
# Different search engines
result = app.search(
    "machine learning",
    engine="duckduckgo",  # or "google", "bing"
    num_results=20
)
# With date filtering
from datetime import datetime, timedelta

result = app.search(
    "tech news",
    start_date=datetime.now() - timedelta(days=7),
    end_date=datetime.now()
)
CLI Usage
RapidCrawl provides a comprehensive command-line interface:
Setup Wizard
# Interactive setup
rapidcrawl setup
Scraping
# Basic scrape
rapidcrawl scrape https://example.com
# Save to file
rapidcrawl scrape https://example.com -o output.md
# Multiple formats
rapidcrawl scrape https://example.com -f markdown -f html -f screenshot
# Wait for element
rapidcrawl scrape https://example.com --wait-for ".content"
# Extract structured data
rapidcrawl scrape https://example.com \
--extract-schema '[{"name": "title", "selector": "h1"}]'
Crawling
# Basic crawl
rapidcrawl crawl https://example.com
# Advanced crawl
rapidcrawl crawl https://example.com \
--max-pages 100 \
--max-depth 3 \
--include "*/blog/*" \
--exclude "*.pdf" \
--output ./crawl-results/
Mapping
# Map all URLs
rapidcrawl map https://example.com
# Filter and save
rapidcrawl map https://example.com \
--search "product" \
--limit 1000 \
--output urls.txt
Searching
# Basic search
rapidcrawl search "python tutorials"
# Search and scrape
rapidcrawl search "machine learning" \
--scrape \
--num-results 20 \
--engine google \
--output results/
Configuration
Environment Variables
Create a .env file in your project root:
# API Configuration
RAPIDCRAWL_API_KEY=your_api_key_here
RAPIDCRAWL_BASE_URL=https://api.rapidcrawl.io/v1
RAPIDCRAWL_TIMEOUT=30
# Optional
RAPIDCRAWL_MAX_RETRIES=3
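RapidCrawl reads these settings from the environment. If you keep them in a .env file, one option (an assumption here, python-dotenv is not a required dependency of this package) is to load the file before creating the client:
from dotenv import load_dotenv  # pip install python-dotenv
from rapidcrawl import RapidCrawlApp

load_dotenv()          # copies the .env values into the process environment
app = RapidCrawlApp()  # picks up RAPIDCRAWL_API_KEY and related settings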
Python Configuration
from rapidcrawl import RapidCrawlApp
# Custom configuration
app = RapidCrawlApp(
    api_key="your_api_key",
    base_url="https://custom-api.example.com",
    timeout=60.0,
    max_retries=5,
    debug=True
)
Manual Configuration Options
If the automated setup doesn't work, you can manually configure RapidCrawl:
- API Key: Set via environment variable or pass to constructor
- Base URL: For self-hosted instances
- Timeout: Request timeout in seconds
- SSL Verification: Disable for self-signed certificates
- Debug Mode: Enable verbose logging
API Reference
RapidCrawlApp
The main client class for interacting with RapidCrawl.
Constructor
RapidCrawlApp(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: Optional[float] = None,
    max_retries: Optional[int] = None,
    verify_ssl: bool = True,
    debug: bool = False
)
Methods
- scrape_url(url, **options): Scrape a single URL
- crawl_url(url, **options): Crawl a website
- crawl_url_async(url, **options): Crawl a website asynchronously
- map_url(url, **options): Map website URLs
- search(query, **options): Search the web
- extract(urls, schema, prompt): Extract structured data (see the sketch below)
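The extract() method is not demonstrated elsewhere in this README, so the following is only an illustrative sketch: it assumes extract() accepts a list of URLs, a schema in the same format as scrape_url's extract_schema, and an optional natural-language prompt.
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

# Hypothetical call; the argument semantics are assumed, not documented here.
data = app.extract(
    urls=["https://example.com/product"],
    schema=[
        {"name": "title", "selector": "h1"},
        {"name": "price", "selector": ".price", "type": "number"}
    ],
    prompt="Extract the product title and price"
)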
Models
ScrapeOptions
from rapidcrawl.models import ScrapeOptions, OutputFormat
options = ScrapeOptions(
    url="https://example.com",
    formats=[OutputFormat.MARKDOWN, OutputFormat.HTML],
    wait_for=".content",
    timeout=30000,
    mobile=False,
    actions=[...],
    extract_schema=[...],
    headers={"User-Agent": "Custom UA"}
)
CrawlOptions
from rapidcrawl.models import CrawlOptions
options = CrawlOptions(
    url="https://example.com",
    max_pages=100,
    max_depth=3,
    include_patterns=["*/blog/*"],
    exclude_patterns=["*.pdf"],
    allow_subdomains=False,
    webhook_url="https://webhook.example.com"
)
Examples
For comprehensive examples, see the examples directory:
- Basic Scraping - Getting started with web scraping
- Web Crawling - Crawling websites recursively
- Search and Map - Search and URL mapping
- Data Extraction - Structured data extraction
- Advanced Usage - Production patterns
E-commerce Price Monitoring
from rapidcrawl import RapidCrawlApp
import json
app = RapidCrawlApp()
# Define extraction schema
schema = [
    {"name": "title", "selector": "h1.product-title"},
    {"name": "price", "selector": ".price-now", "type": "number"},
    {"name": "stock", "selector": ".availability"},
    {"name": "image", "selector": "img.product-image", "attribute": "src"}
]
# Monitor multiple products
products = [
    "https://shop.example.com/product1",
    "https://shop.example.com/product2",
]
results = []
for url in products:
    result = app.scrape_url(url, extract_schema=schema)
    if result.success:
        results.append({
            "url": url,
            "data": result.structured_data,
            "timestamp": result.scraped_at
        })
# Save results
with open("prices.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
Content Aggregation
import asyncio
from rapidcrawl import RapidCrawlApp
app = RapidCrawlApp()
async def aggregate_news():
    # Search multiple queries
    queries = [
        "artificial intelligence breakthroughs",
        "quantum computing news",
        "robotics innovation"
    ]
    all_articles = []
    for query in queries:
        result = app.search(
            query,
            num_results=5,
            scrape_results=True,
            formats=["markdown"]
        )
        for item in result.results:
            if item.scraped_content and item.scraped_content.success:
                all_articles.append({
                    "title": item.title,
                    "url": item.url,
                    "content": item.scraped_content.content["markdown"],
                    "query": query
                })
    return all_articles
# Run aggregation
articles = asyncio.run(aggregate_news())
Website Change Detection
import hashlib
import time
from rapidcrawl import RapidCrawlApp
app = RapidCrawlApp()
def monitor_changes(url, interval=3600):
    """Monitor a webpage for changes."""
    previous_hash = None
    while True:
        result = app.scrape_url(url, formats=["text"])
        if result.success:
            content = result.content["text"]
            current_hash = hashlib.md5(content.encode()).hexdigest()
            if previous_hash and current_hash != previous_hash:
                print(f"Change detected at {url}!")
                # Send notification, save diff, etc.
            previous_hash = current_hash
        time.sleep(interval)
# Monitor a page
monitor_changes("https://example.com/status", interval=300) # Check every 5 minutes
Advanced Usage
Rate Limiting
import time
from rapidcrawl import RapidCrawlApp
class RateLimitedScraper:
    def __init__(self, requests_per_second=2):
        self.app = RapidCrawlApp()
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0

    def scrape_url(self, url):
        current = time.time()
        elapsed = current - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()
        return self.app.scrape_url(url)
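Using the wrapper is the same as calling the client directly, just throttled:
scraper = RateLimitedScraper(requests_per_second=1)
for url in ["https://example.com/a", "https://example.com/b"]:
    result = scraper.scrape_url(url)  # sleeps as needed between requests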
Caching Results
from rapidcrawl import RapidCrawlApp
import hashlib
import time

class CachedScraper:
    def __init__(self):
        self.app = RapidCrawlApp()
        self.cache = {}

    def scrape_with_cache(self, url, max_age_hours=24):
        cache_key = hashlib.md5(url.encode()).hexdigest()
        if cache_key in self.cache:
            cached_time, cached_result = self.cache[cache_key]
            age_hours = (time.time() - cached_time) / 3600
            if age_hours < max_age_hours:
                return cached_result
        result = self.app.scrape_url(url)
        self.cache[cache_key] = (time.time(), result)
        return result
Error Handling
import time

from rapidcrawl import RapidCrawlApp
from rapidcrawl.exceptions import (
    RateLimitError,
    TimeoutError,
    NetworkError
)

def robust_scrape(url, max_retries=3):
    app = RapidCrawlApp()
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url)
        except RateLimitError as e:
            wait_time = e.retry_after or 60
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except TimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except NetworkError as e:
            print(f"Network error: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
Concurrent Scraping
from concurrent.futures import ThreadPoolExecutor, as_completed

from rapidcrawl import RapidCrawlApp

def concurrent_scrape(urls, max_workers=5):
    app = RapidCrawlApp()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(app.scrape_url, url): url
            for url in urls
        }
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = {"error": str(e)}
    return results
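For example, scraping a short list of URLs in parallel and checking which ones failed:
pages = concurrent_scrape([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
])
for url, result in pages.items():
    failed = isinstance(result, dict) and "error" in result
    print(url, "error" if failed else "ok")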
For more advanced patterns, see the Advanced Usage Guide.
Performance
Benchmarks
| Operation | URLs | Time | Speed |
|---|---|---|---|
| Sequential Scraping | 10 | 12.3s | 0.8 pages/sec |
| Concurrent Scraping | 10 | 3.1s | 3.2 pages/sec |
| Async Crawling | 100 | 28.5s | 3.5 pages/sec |
| URL Mapping | 1000 | 5.2s | 192 URLs/sec |
Optimization Tips
- Use Async Operations: For crawling large sites, use crawl_url_async()
- Enable Connection Pooling: Reuse HTTP connections
- Limit Concurrent Requests: Prevent overwhelming servers
- Cache Results: Avoid re-scraping unchanged content
- Use Specific Formats: Only request the output formats you need (see the one-liner below)
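For example, requesting a single format keeps each scrape as light as possible:
result = app.scrape_url("https://example.com", formats=["markdown"])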
Troubleshooting
Common Issues
Installation Problems
# Update pip
python -m pip install --upgrade pip
# Install in virtual environment
python -m venv venv
source venv/bin/activate
pip install rapid-crawl
Playwright Issues
# Install browser dependencies
playwright install-deps chromium
# Or use Firefox
playwright install firefox
SSL Certificate Errors
# For self-signed certificates (development only!)
app = RapidCrawlApp(verify_ssl=False)
Rate Limiting
# Handle rate limits gracefully
import time
from rapidcrawl.exceptions import RateLimitError

try:
    result = app.scrape_url(url)
except RateLimitError as e:
    time.sleep(e.retry_after or 60)
    result = app.scrape_url(url)
For detailed troubleshooting, see the Troubleshooting Guide.
Development
Setting up development environment
# Clone the repository
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
Running tests
# Run all tests
pytest
# Run with coverage
pytest --cov=rapidcrawl
# Run specific test file
pytest tests/test_scrape.py
Code formatting
# Format code
black src/rapidcrawl
# Run linter
ruff check src/rapidcrawl
# Type checking
mypy src/rapidcrawl
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Start
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Development Guidelines
- Write tests for new features
- Follow PEP 8 style guide
- Update documentation
- Add type hints
- Run tests before submitting
Security
Security is important to us. Please see our Security Policy for details on:
- Reporting vulnerabilities
- Security best practices
- API key management
- Data privacy
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
Documentation
- Full Documentation
- API Reference
- Examples
- Advanced Usage
- Troubleshooting
Community
- Report Issues
- Discussions
- Email Support
Resources
- Changelog
- Security Policy
- Contributing Guide
- License
Developer
Ahsan Mahmood
- Website: https://aoneahsan.com
- Email: aoneahsan@gmail.com
- LinkedIn: https://linkedin.com/in/aoneahsan
- Twitter: @aoneahsan
Acknowledgments
- Inspired by Firecrawl
- Built with Playwright for dynamic content handling
- Uses BeautifulSoup for HTML parsing
- Click for the CLI interface
- Rich for beautiful terminal output
RapidCrawl - Fast, reliable web scraping for Python
Made with ❤️ by Ahsan Mahmood
Star us on GitHub • Install from PyPI • Report a Bug