
RapidCrawl


A powerful Python SDK for web scraping, crawling, and data extraction. RapidCrawl provides a comprehensive toolkit for extracting data from websites, handling dynamic content, and converting web pages into clean, structured formats suitable for AI and LLM applications.

🚀 Features

  • 🔍 Scrape: Convert any URL into clean markdown, HTML, text, or structured data
  • 🕷️ Crawl: Recursively crawl websites with depth control and filtering
  • 🗺️ Map: Quickly discover all URLs on a website
  • 🔎 Search: Web search with automatic result scraping
  • 📸 Screenshot: Capture full-page screenshots
  • 🎭 Dynamic Content: Handle JavaScript-rendered pages with Playwright
  • 📄 Multiple Formats: Support for Markdown, HTML, PDF, images, and more
  • 🚄 Async Support: High-performance asynchronous operations
  • 🛡️ Error Handling: Comprehensive error handling and retry logic
  • 📦 CLI Tool: Feature-rich command-line interface


📦 Installation

Using pip

pip install rapid-crawl

Using pip with all optional dependencies

pip install rapid-crawl[dev]

From source

git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
pip install -e .

Install Playwright browsers (required for dynamic content)

playwright install chromium

🚀 Quick Start

Python SDK

from rapidcrawl import RapidCrawlApp

# Initialize the client
app = RapidCrawlApp()

# Scrape a single page
result = app.scrape_url("https://example.com")
print(result.content["markdown"])

# Crawl a website
crawl_result = app.crawl_url(
    "https://example.com",
    max_pages=10,
    max_depth=2
)

# Map all URLs
map_result = app.map_url("https://example.com")
print(f"Found {map_result.total_urls} URLs")

# Search and scrape
search_result = app.search(
    "python web scraping",
    num_results=5,
    scrape_results=True
)

Command Line

# Scrape a URL
rapidcrawl scrape https://example.com

# Crawl a website
rapidcrawl crawl https://example.com --max-pages 10

# Map URLs
rapidcrawl map https://example.com --limit 100

# Search
rapidcrawl search "python tutorials" --scrape

🎯 Features in Detail

Scraping

Convert any web page into clean, structured data:

from rapidcrawl import RapidCrawlApp, OutputFormat

app = RapidCrawlApp()

# Basic scraping
result = app.scrape_url("https://example.com")

# Multiple formats
result = app.scrape_url(
    "https://example.com",
    formats=["markdown", "html", "screenshot"],
    wait_for=".content",  # Wait for element
    timeout=60000,        # 60 seconds timeout
)

# Extract structured data
result = app.scrape_url(
    "https://example.com/product",
    extract_schema=[
        {"name": "title", "selector": "h1"},
        {"name": "price", "selector": ".price", "type": "number"},
        {"name": "description", "selector": ".description"}
    ]
)

print(result.structured_data)
# {'title': 'Product Name', 'price': 29.99, 'description': '...'}

# Mobile viewport
result = app.scrape_url(
    "https://example.com",
    mobile=True
)

# With actions (click, type, scroll)
result = app.scrape_url(
    "https://example.com",
    actions=[
        {"type": "click", "selector": ".load-more"},
        {"type": "wait", "value": 2000},
        {"type": "scroll", "value": 1000}
    ]
)

Crawling

Recursively crawl websites with advanced filtering:

# Basic crawling
result = app.crawl_url(
    "https://example.com",
    max_pages=50,
    max_depth=3
)

# With URL filtering
result = app.crawl_url(
    "https://example.com",
    include_patterns=[r"/blog/.*", r"/docs/.*"],
    exclude_patterns=[r".*\.pdf$", r".*/tag/.*"]
)

# Async crawling for better performance
import asyncio

async def crawl_async():
    result = await app.crawl_url_async(
        "https://example.com",
        max_pages=100,
        max_depth=5,
        allow_subdomains=True
    )
    return result

result = asyncio.run(crawl_async())

# With webhook notifications
result = app.crawl_url(
    "https://example.com",
    webhook_url="https://your-webhook.com/progress"
)

Mapping

Quickly discover all URLs on a website:

# Basic mapping
result = app.map_url("https://example.com")
print(f"Found {result.total_urls} URLs")

# Filter URLs by search term
result = app.map_url(
    "https://example.com",
    search="product",
    limit=1000
)

# Include subdomains
result = app.map_url(
    "https://example.com",
    include_subdomains=True,
    ignore_sitemap=False  # Use sitemap.xml if available
)

# Access the URLs
for url in result.urls[:10]:
    print(url)

Searching

Search the web and optionally scrape results:

# Basic search
result = app.search("python web scraping tutorial")

# Search with scraping
result = app.search(
    "latest AI news",
    num_results=10,
    scrape_results=True,
    formats=["markdown", "text"]
)

# Access results
for item in result.results:
    print(f"{item.position}. {item.title}")
    print(f"   URL: {item.url}")
    if item.scraped_content:
        print(f"   Content: {item.scraped_content.content['markdown'][:200]}...")

# Different search engines
result = app.search(
    "machine learning",
    engine="duckduckgo",  # or "google", "bing"
    num_results=20
)

# With date filtering
from datetime import datetime, timedelta

result = app.search(
    "tech news",
    start_date=datetime.now() - timedelta(days=7),
    end_date=datetime.now()
)

💻 CLI Usage

RapidCrawl provides a comprehensive command-line interface:

Setup Wizard

# Interactive setup
rapidcrawl setup

Scraping

# Basic scrape
rapidcrawl scrape https://example.com

# Save to file
rapidcrawl scrape https://example.com -o output.md

# Multiple formats
rapidcrawl scrape https://example.com -f markdown -f html -f screenshot

# Wait for element
rapidcrawl scrape https://example.com --wait-for ".content"

# Extract structured data
rapidcrawl scrape https://example.com \
  --extract-schema '[{"name": "title", "selector": "h1"}]'

Crawling

# Basic crawl
rapidcrawl crawl https://example.com

# Advanced crawl
rapidcrawl crawl https://example.com \
  --max-pages 100 \
  --max-depth 3 \
  --include "*/blog/*" \
  --exclude "*.pdf" \
  --output ./crawl-results/

Mapping

# Map all URLs
rapidcrawl map https://example.com

# Filter and save
rapidcrawl map https://example.com \
  --search "product" \
  --limit 1000 \
  --output urls.txt

Searching

# Basic search
rapidcrawl search "python tutorials"

# Search and scrape
rapidcrawl search "machine learning" \
  --scrape \
  --num-results 20 \
  --engine google \
  --output results/

⚙️ Configuration

Environment Variables

Create a .env file in your project root:

# API Configuration
RAPIDCRAWL_API_KEY=your_api_key_here
RAPIDCRAWL_BASE_URL=https://api.rapidcrawl.io/v1
RAPIDCRAWL_TIMEOUT=30

# Optional
RAPIDCRAWL_MAX_RETRIES=3

Python Configuration

from rapidcrawl import RapidCrawlApp

# Custom configuration
app = RapidCrawlApp(
    api_key="your_api_key",
    base_url="https://custom-api.example.com",
    timeout=60.0,
    max_retries=5,
    debug=True
)

Manual Configuration Options

If the automated setup doesn't work, you can configure RapidCrawl manually (a minimal sketch follows the list below):

  1. API Key: Set via environment variable or pass to constructor
  2. Base URL: For self-hosted instances
  3. Timeout: Request timeout in seconds
  4. SSL Verification: Disable for self-signed certificates
  5. Debug Mode: Enable verbose logging
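
A minimal sketch of manual configuration, assuming the client reads the RAPIDCRAWL_* environment variables listed above; the constructor options mirror the Python Configuration example.

import os

from rapidcrawl import RapidCrawlApp

# Option 1: configure via environment variables (assumes the client
# picks up the RAPIDCRAWL_* variables shown in the Configuration section)
os.environ["RAPIDCRAWL_API_KEY"] = "your_api_key_here"
os.environ["RAPIDCRAWL_TIMEOUT"] = "30"
app = RapidCrawlApp()

# Option 2: pass everything explicitly to the constructor
app = RapidCrawlApp(
    api_key="your_api_key_here",
    base_url="https://api.rapidcrawl.io/v1",
    timeout=30.0,
    verify_ssl=False,  # only for self-signed certificates (development)
    debug=True,        # verbose logging
)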

📚 API Reference

RapidCrawlApp

The main client class for interacting with RapidCrawl.

Constructor

RapidCrawlApp(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    timeout: Optional[float] = None,
    max_retries: Optional[int] = None,
    verify_ssl: bool = True,
    debug: bool = False
)

Methods

  • scrape_url(url, **options): Scrape a single URL
  • crawl_url(url, **options): Crawl a website
  • crawl_url_async(url, **options): Async crawl
  • map_url(url, **options): Map website URLs
  • search(query, **options): Search the web
  • extract(urls, schema, prompt): Extract structured data from multiple URLs (see the sketch below)
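
Most of these methods are demonstrated in earlier sections; extract() is not, so here is a minimal sketch based only on the signature listed above, reusing the schema format from scrape_url(). The exact parameter semantics and return shape are assumptions.

from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

# Schema format borrowed from scrape_url()'s extract_schema
# (assumption: extract() accepts the same structure)
schema = [
    {"name": "title", "selector": "h1"},
    {"name": "price", "selector": ".price", "type": "number"},
]

result = app.extract(
    urls=[
        "https://shop.example.com/product1",
        "https://shop.example.com/product2",
    ],
    schema=schema,
    prompt="Extract the product title and price from each page",
)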

Models

ScrapeOptions

from rapidcrawl.models import ScrapeOptions, OutputFormat

options = ScrapeOptions(
    url="https://example.com",
    formats=[OutputFormat.MARKDOWN, OutputFormat.HTML],
    wait_for=".content",
    timeout=30000,
    mobile=False,
    actions=[...],
    extract_schema=[...],
    headers={"User-Agent": "Custom UA"}
)

CrawlOptions

from rapidcrawl.models import CrawlOptions

options = CrawlOptions(
    url="https://example.com",
    max_pages=100,
    max_depth=3,
    include_patterns=["*/blog/*"],
    exclude_patterns=["*.pdf"],
    allow_subdomains=False,
    webhook_url="https://webhook.example.com"
)

🔧 Examples

For comprehensive examples, see the examples directory:

E-commerce Price Monitoring

from rapidcrawl import RapidCrawlApp
import json

app = RapidCrawlApp()

# Define extraction schema
schema = [
    {"name": "title", "selector": "h1.product-title"},
    {"name": "price", "selector": ".price-now", "type": "number"},
    {"name": "stock", "selector": ".availability"},
    {"name": "image", "selector": "img.product-image", "attribute": "src"}
]

# Monitor multiple products
products = [
    "https://shop.example.com/product1",
    "https://shop.example.com/product2",
]

results = []
for url in products:
    result = app.scrape_url(url, extract_schema=schema)
    if result.success:
        results.append({
            "url": url,
            "data": result.structured_data,
            "timestamp": result.scraped_at
        })

# Save results
with open("prices.json", "w") as f:
    json.dump(results, f, indent=2, default=str)

Content Aggregation

import asyncio
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

async def aggregate_news():
    # Search multiple queries
    queries = [
        "artificial intelligence breakthroughs",
        "quantum computing news",
        "robotics innovation"
    ]
    
    all_articles = []
    
    for query in queries:
        result = app.search(
            query,
            num_results=5,
            scrape_results=True,
            formats=["markdown"]
        )
        
        for item in result.results:
            if item.scraped_content and item.scraped_content.success:
                all_articles.append({
                    "title": item.title,
                    "url": item.url,
                    "content": item.scraped_content.content["markdown"],
                    "query": query
                })
    
    return all_articles

# Run aggregation
articles = asyncio.run(aggregate_news())

Website Change Detection

import hashlib
import time
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

def monitor_changes(url, interval=3600):
    """Monitor a webpage for changes."""
    previous_hash = None
    
    while True:
        result = app.scrape_url(url, formats=["text"])
        
        if result.success:
            content = result.content["text"]
            current_hash = hashlib.md5(content.encode()).hexdigest()
            
            if previous_hash and current_hash != previous_hash:
                print(f"Change detected at {url}!")
                # Send notification, save diff, etc.
            
            previous_hash = current_hash
        
        time.sleep(interval)

# Monitor a page
monitor_changes("https://example.com/status", interval=300)  # Check every 5 minutes

🚀 Advanced Usage

Rate Limiting

import time
from rapidcrawl import RapidCrawlApp

class RateLimitedScraper:
    def __init__(self, requests_per_second=2):
        self.app = RapidCrawlApp()
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0
    
    def scrape_url(self, url):
        current = time.time()
        elapsed = current - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        
        self.last_request = time.time()
        return self.app.scrape_url(url)

Caching Results

import hashlib
import time

from rapidcrawl import RapidCrawlApp

class CachedScraper:
    def __init__(self):
        self.app = RapidCrawlApp()
        self.cache = {}
    
    def scrape_with_cache(self, url, max_age_hours=24):
        cache_key = hashlib.md5(url.encode()).hexdigest()
        
        if cache_key in self.cache:
            cached_time, cached_result = self.cache[cache_key]
            age_hours = (time.time() - cached_time) / 3600
            if age_hours < max_age_hours:
                return cached_result
        
        result = self.app.scrape_url(url)
        self.cache[cache_key] = (time.time(), result)
        return result

Error Handling

import time

from rapidcrawl import RapidCrawlApp
from rapidcrawl.exceptions import (
    RateLimitError,
    TimeoutError,
    NetworkError
)

def robust_scrape(url, max_retries=3):
    app = RapidCrawlApp()
    
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url)
        except RateLimitError as e:
            wait_time = e.retry_after or 60
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except TimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except NetworkError as e:
            print(f"Network error: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff

Concurrent Scraping

from concurrent.futures import ThreadPoolExecutor, as_completed

from rapidcrawl import RapidCrawlApp

def concurrent_scrape(urls, max_workers=5):
    app = RapidCrawlApp()
    results = {}
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(app.scrape_url, url): url 
            for url in urls
        }
        
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = {"error": str(e)}
    
    return results

For more advanced patterns, see the Advanced Usage Guide.

⚡ Performance

Benchmarks

Operation             URLs   Time    Speed
Sequential Scraping   10     12.3s   0.8 pages/sec
Concurrent Scraping   10     3.1s    3.2 pages/sec
Async Crawling        100    28.5s   3.5 pages/sec
URL Mapping           1000   5.2s    192 URLs/sec

Optimization Tips

  1. Use Async Operations: For crawling large sites, use crawl_url_async()
  2. Enable Connection Pooling: Reuse HTTP connections
  3. Limit Concurrent Requests: Prevent overwhelming servers
  4. Cache Results: Avoid re-scraping unchanged content
  5. Use Specific Formats: Only request needed output formats (see the sketch after this list)
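
A short sketch applying tips 1 and 5, using the crawl_url_async() and formats options shown in earlier sections:

import asyncio

from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

async def fast_crawl():
    # Tip 1: use the async crawler for large sites
    return await app.crawl_url_async(
        "https://example.com",
        max_pages=200,
        max_depth=3,
    )

result = asyncio.run(fast_crawl())

# Tip 5: only request the formats you actually need when scraping
page = app.scrape_url("https://example.com", formats=["markdown"])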

🔍 Troubleshooting

Common Issues

Installation Problems

# Update pip
python -m pip install --upgrade pip

# Install in virtual environment
python -m venv venv
source venv/bin/activate
pip install rapid-crawl

Playwright Issues

# Install browser dependencies
playwright install-deps chromium

# Or use Firefox
playwright install firefox

SSL Certificate Errors

# For self-signed certificates (development only!)
app = RapidCrawlApp(verify_ssl=False)

Rate Limiting

# Handle rate limits gracefully
import time

from rapidcrawl.exceptions import RateLimitError

try:
    result = app.scrape_url(url)
except RateLimitError as e:
    time.sleep(e.retry_after or 60)
    result = app.scrape_url(url)

For detailed troubleshooting, see the Troubleshooting Guide.

🛠️ Development

Setting up development environment

# Clone the repository
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running tests

# Run all tests
pytest

# Run with coverage
pytest --cov=rapidcrawl

# Run specific test file
pytest tests/test_scrape.py

Code formatting

# Format code
black src/rapidcrawl

# Run linter
ruff check src/rapidcrawl

# Type checking
mypy src/rapidcrawl

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Start

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Development Guidelines

  • Write tests for new features
  • Follow PEP 8 style guide
  • Update documentation
  • Add type hints
  • Run tests before submitting

🔒 Security

Security is important to us. Please see our Security Policy for details on:

  • Reporting vulnerabilities
  • Security best practices
  • API key management
  • Data privacy

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💬 Support

Documentation, community discussion, and additional resources are available in the project's GitHub repository: https://github.com/aoneahsan/rapid-crawl

👨‍💻 Developer

Ahsan Mahmood

🙏 Acknowledgments

RapidCrawl is inspired by Firecrawl.



RapidCrawl - Fast, reliable web scraping for Python
Made with ❤️ by Ahsan Mahmood

⭐ Star us on GitHub • 📦 Install from PyPI • 🐛 Report a Bug
