Skip to main content

A powerful, local-first Python SDK for web scraping, crawling, and data extraction

Project description

RapidCrawl

PyPI version Python versions License Stars

A powerful, local-first Python SDK for web scraping, crawling, and data extraction. RapidCrawl provides a comprehensive toolkit for extracting data from websites, handling dynamic content, and converting web pages into clean, structured formats suitable for AI and LLM applications.

Local-first: RapidCrawl runs entirely on your machine. There is no required API key and no hosted service โ€” pip install and start scraping. (api_key/base_url exist only as optional hooks for a possible future hosted backend.)

๐Ÿš€ Features

  • ๐Ÿ” Scrape: Convert any URL into clean markdown, HTML, text, or structured data
  • ๐Ÿ•ท๏ธ Crawl: Recursively crawl websites with depth control, filtering, and concurrency
  • ๐Ÿ—บ๏ธ Map: Quickly discover all URLs on a website (with sitemap-index support)
  • ๐Ÿ”Ž Search: Web search (Google/Bing/DuckDuckGo) with automatic result scraping
  • ๐Ÿ“ธ Screenshot: Capture full-page screenshots
  • ๐ŸŽญ Dynamic Content: Handle JavaScript-rendered pages with Playwright
  • ๐Ÿ“„ Multiple Formats: Markdown, HTML, text, links, screenshots, PDF, and images
  • ๐Ÿš„ Async: Genuine concurrency for scrape, crawl, map, search, and extract
  • ๐ŸŒ Proxies: Per-client or per-request proxies for both HTTP and the browser
  • ๐Ÿ—ƒ๏ธ Caching: Optional TTL cache (in-memory + on-disk) to skip repeat fetches
  • ๐Ÿ’พ Export: Save any result to JSON, CSV, or Markdown
  • ๐Ÿค– LLM extraction (optional): Natural-language extraction via OpenAI/Anthropic
  • ๐Ÿ›ก๏ธ Error Handling: Typed exceptions and automatic retries
  • ๐Ÿ“ฆ CLI Tool: Feature-rich command-line interface

๐Ÿ“‹ Table of Contents

๐Ÿ“ฆ Installation

Using pip

pip install rapid-crawl

With optional LLM extraction (OpenAI / Anthropic)

pip install "rapid-crawl[llm]"

With development dependencies

pip install "rapid-crawl[dev]"

From source

git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl
pip install -e .

Install Playwright browsers (required for dynamic content)

playwright install chromium

๐Ÿš€ Quick Start

Python SDK

from rapidcrawl import RapidCrawlApp

# Initialize the client
app = RapidCrawlApp()

# Scrape a single page
result = app.scrape_url("https://example.com")
print(result.content["markdown"])

# Crawl a website
crawl_result = app.crawl_url(
    "https://example.com",
    max_pages=10,
    max_depth=2
)

# Map all URLs
map_result = app.map_url("https://example.com")
print(f"Found {map_result.total_urls} URLs")

# Search and scrape
search_result = app.search(
    "python web scraping",
    num_results=5,
    scrape_results=True
)

Command Line

# Scrape a URL
rapidcrawl scrape https://example.com

# Crawl a website
rapidcrawl crawl https://example.com --max-pages 10

# Map URLs
rapidcrawl map https://example.com --limit 100

# Search
rapidcrawl search "python tutorials" --scrape

๐ŸŽฏ Features

Scraping

Convert any web page into clean, structured data:

from rapidcrawl import RapidCrawlApp, OutputFormat

app = RapidCrawlApp()

# Basic scraping
result = app.scrape_url("https://example.com")

# Multiple formats
result = app.scrape_url(
    "https://example.com",
    formats=["markdown", "html", "screenshot"],
    wait_for=".content",  # Wait for element
    timeout=60000,        # 60 seconds timeout
)

# Extract structured data
result = app.scrape_url(
    "https://example.com/product",
    extract_schema=[
        {"name": "title", "selector": "h1"},
        {"name": "price", "selector": ".price", "type": "number"},
        {"name": "description", "selector": ".description"}
    ]
)

print(result.structured_data)
# {'title': 'Product Name', 'price': 29.99, 'description': '...'}

# Mobile viewport
result = app.scrape_url(
    "https://example.com",
    mobile=True
)

# With actions (click, type, scroll)
result = app.scrape_url(
    "https://example.com",
    actions=[
        {"type": "click", "selector": ".load-more"},
        {"type": "wait", "value": 2000},
        {"type": "scroll", "value": 1000}
    ]
)

Crawling

Recursively crawl websites with advanced filtering:

# Basic crawling
result = app.crawl_url(
    "https://example.com",
    max_pages=50,
    max_depth=3
)

# With URL filtering
result = app.crawl_url(
    "https://example.com",
    include_patterns=[r"/blog/.*", r"/docs/.*"],
    exclude_patterns=[r".*\.pdf$", r".*/tag/.*"]
)

# Async crawling for better performance
import asyncio

async def crawl_async():
    result = await app.crawl_url_async(
        "https://example.com",
        max_pages=100,
        max_depth=5,
        allow_subdomains=True
    )
    return result

result = asyncio.run(crawl_async())

# With webhook notifications
result = app.crawl_url(
    "https://example.com",
    webhook_url="https://your-webhook.com/progress"
)

Mapping

Quickly discover all URLs on a website:

# Basic mapping
result = app.map_url("https://example.com")
print(f"Found {result.total_urls} URLs")

# Filter URLs by search term
result = app.map_url(
    "https://example.com",
    search="product",
    limit=1000
)

# Include subdomains
result = app.map_url(
    "https://example.com",
    include_subdomains=True,
    ignore_sitemap=False  # Use sitemap.xml if available
)

# Access the URLs
for url in result.urls[:10]:
    print(url)

Searching

Search the web and optionally scrape results:

# Basic search
result = app.search("python web scraping tutorial")

# Search with scraping
result = app.search(
    "latest AI news",
    num_results=10,
    scrape_results=True,
    formats=["markdown", "text"]
)

# Access results
for item in result.results:
    print(f"{item.position}. {item.title}")
    print(f"   URL: {item.url}")
    if item.scraped_content:
        print(f"   Content: {item.scraped_content.content['markdown'][:200]}...")

# Different search engines
result = app.search(
    "machine learning",
    engine="duckduckgo",  # or "google", "bing"
    num_results=20
)

# With date filtering
from datetime import datetime, timedelta

result = app.search(
    "tech news",
    start_date=datetime.now() - timedelta(days=7),
    end_date=datetime.now()
)

โœจ New in 0.2.0

Proxies

# Global proxy for all requests (HTTP + browser)
app = RapidCrawlApp(proxy="http://user:pass@host:8080")

# Or per request
result = app.scrape_url("https://example.com", proxy="http://host:3128")

Caching

# Enable an in-memory + on-disk TTL cache
app = RapidCrawlApp(cache=True, cache_ttl=3600, cache_dir=".rapidcrawl_cache")

result = app.scrape_url("https://example.com")   # fetched
again  = app.scrape_url("https://example.com")   # served from cache
print(again.from_cache)  # True

# Bypass the cache for a single request
fresh = app.scrape_url("https://example.com", use_cache=False)

Export results

result = app.crawl_url("https://example.com", max_pages=20)

app.export(result, "crawl.json")   # format inferred from extension
app.export(result, "crawl.csv")    # one row per page
app.export(result, "crawl.md")     # human-readable Markdown

Async

import asyncio
from rapidcrawl import RapidCrawlApp

async def main():
    async with RapidCrawlApp() as app:
        result = await app.scrape_url_async("https://example.com")
        crawl  = await app.crawl_url_async("https://example.com", max_pages=50, max_concurrency=10)
        results = await app.extract_async(
            ["https://a.com", "https://b.com"],
            schema=[{"name": "title", "selector": "h1"}],
        )

asyncio.run(main())

LLM-based extraction (optional)

Schema-based (CSS) extraction is keyless and always available. For natural-language extraction, install the extra and provide a key:

pip install "rapid-crawl[llm]"
app = RapidCrawlApp(llm_provider="openai", llm_api_key="sk-...")  # or set OPENAI_API_KEY

result = app.scrape_url(
    "https://example.com/product",
    extract_prompt="Extract the product name, price, and whether it is in stock.",
)
print(result.structured_data)

๐Ÿ’ป CLI Usage

RapidCrawl provides a comprehensive command-line interface:

Global options

Options that apply to every command (placed before the subcommand):

rapidcrawl --proxy http://host:8080 --cache --user-agent "MyBot/1.0" scrape https://example.com
rapidcrawl --no-verify-ssl --debug map https://example.com

Setup Wizard

# Interactive setup (optional โ€” writes a .env file)
rapidcrawl setup

Scraping

# Basic scrape
rapidcrawl scrape https://example.com

# Save to file
rapidcrawl scrape https://example.com -o output.md

# Multiple formats
rapidcrawl scrape https://example.com -f markdown -f html -f screenshot

# Wait for element
rapidcrawl scrape https://example.com --wait-for ".content"

# Extract structured data
rapidcrawl scrape https://example.com \
  --extract-schema '[{"name": "title", "selector": "h1"}]'

Crawling

# Basic crawl
rapidcrawl crawl https://example.com

# Advanced crawl
rapidcrawl crawl https://example.com \
  --max-pages 100 \
  --max-depth 3 \
  --include "*/blog/*" \
  --exclude "*.pdf" \
  --output ./crawl-results/

Mapping

# Map all URLs
rapidcrawl map https://example.com

# Filter and save
rapidcrawl map https://example.com \
  --search "product" \
  --limit 1000 \
  --output urls.txt

Searching

# Basic search
rapidcrawl search "python tutorials"

# Search and scrape
rapidcrawl search "machine learning" \
  --scrape \
  --num-results 20 \
  --engine google \
  --output results/

โš™๏ธ Configuration

RapidCrawl works with zero configuration. Everything below is optional.

Environment Variables

Create a .env file in your project root (all optional):

# General
RAPIDCRAWL_TIMEOUT=30
RAPIDCRAWL_MAX_RETRIES=3
RAPIDCRAWL_USER_AGENT=MyBot/1.0

# Proxy (URL string, with optional credentials)
RAPIDCRAWL_PROXY=http://user:pass@host:8080

# Caching
RAPIDCRAWL_CACHE=true
RAPIDCRAWL_CACHE_TTL=3600
RAPIDCRAWL_CACHE_DIR=.rapidcrawl_cache

# Optional LLM extraction (extract_prompt)
RAPIDCRAWL_LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Python Configuration

from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp(
    timeout=60.0,
    max_retries=5,
    user_agent="MyBot/1.0",
    proxy="http://user:pass@host:8080",   # also accepts a ProxyConfig or dict
    cache=True,                            # enable result caching
    cache_ttl=3600,
    cache_dir=".rapidcrawl_cache",         # persist cache to disk
    verify_ssl=True,
    debug=True,
    # Optional LLM extraction:
    llm_provider="openai",                 # or "anthropic"
    llm_api_key="sk-...",                  # or set OPENAI_API_KEY / ANTHROPIC_API_KEY
)

Configuration Options

Option Purpose
timeout Default request timeout in seconds
max_retries Retries for transient network errors
user_agent Override the default browser-like User-Agent
proxy Proxy for HTTP and browser requests
cache / cache_ttl / cache_dir Result caching
verify_ssl Disable for self-signed certificates
llm_* Optional OpenAI/Anthropic config for extract_prompt

๐Ÿ“š API Reference

RapidCrawlApp

The main client class for interacting with RapidCrawl.

Constructor

RapidCrawlApp(
    api_key: Optional[str] = None,       # optional; reserved for a future backend
    base_url: Optional[str] = None,      # optional; reserved for a future backend
    timeout: Optional[float] = None,
    max_retries: Optional[int] = None,
    verify_ssl: bool = True,
    debug: bool = False,
    user_agent: Optional[str] = None,
    proxy: Optional[Union[str, ProxyConfig, dict]] = None,
    cache: bool = False,
    cache_ttl: Optional[float] = None,
    cache_dir: Optional[str] = None,
    llm_provider: Optional[str] = None,  # "openai" | "anthropic"
    llm_api_key: Optional[str] = None,
    llm_model: Optional[str] = None,
)

Methods

Sync Async Description
scrape_url(url, **opts) scrape_url_async(...) Scrape a single URL
crawl_url(url, **opts) crawl_url_async(...) Crawl a website (concurrent)
map_url(url, **opts) map_url_async(...) Discover URLs
search(query, **opts) search_async(...) Search the web
extract(urls, schema, prompt) extract_async(...) Extract structured data
export(result, path, fmt=None) export_async(...) Save a result to JSON/CSV/Markdown
clear_cache() โ€” Clear the result cache

The client is also a context manager (with RapidCrawlApp() as app: / async with ...).

Models

ScrapeOptions

from rapidcrawl.models import ScrapeOptions, OutputFormat

options = ScrapeOptions(
    url="https://example.com",
    formats=[OutputFormat.MARKDOWN, OutputFormat.HTML],
    wait_for=".content",
    timeout=30000,
    mobile=False,
    actions=[...],
    extract_schema=[...],
    headers={"User-Agent": "Custom UA"}
)

CrawlOptions

from rapidcrawl.models import CrawlOptions

options = CrawlOptions(
    url="https://example.com",
    max_pages=100,
    max_depth=3,
    include_patterns=["*/blog/*"],
    exclude_patterns=["*.pdf"],
    allow_subdomains=False,
    webhook_url="https://webhook.example.com"
)

๐Ÿ”ง Examples

For comprehensive examples, see the examples directory:

E-commerce Price Monitoring

from rapidcrawl import RapidCrawlApp
import json

app = RapidCrawlApp()

# Define extraction schema
schema = [
    {"name": "title", "selector": "h1.product-title"},
    {"name": "price", "selector": ".price-now", "type": "number"},
    {"name": "stock", "selector": ".availability"},
    {"name": "image", "selector": "img.product-image", "attribute": "src"}
]

# Monitor multiple products
products = [
    "https://shop.example.com/product1",
    "https://shop.example.com/product2",
]

results = []
for url in products:
    result = app.scrape_url(url, extract_schema=schema)
    if result.success:
        results.append({
            "url": url,
            "data": result.structured_data,
            "timestamp": result.scraped_at
        })

# Save results
with open("prices.json", "w") as f:
    json.dump(results, f, indent=2, default=str)

Content Aggregation

import asyncio
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

async def aggregate_news():
    # Search multiple queries
    queries = [
        "artificial intelligence breakthroughs",
        "quantum computing news",
        "robotics innovation"
    ]
    
    all_articles = []
    
    for query in queries:
        result = app.search(
            query,
            num_results=5,
            scrape_results=True,
            formats=["markdown"]
        )
        
        for item in result.results:
            if item.scraped_content and item.scraped_content.success:
                all_articles.append({
                    "title": item.title,
                    "url": item.url,
                    "content": item.scraped_content.content["markdown"],
                    "query": query
                })
    
    return all_articles

# Run aggregation
articles = asyncio.run(aggregate_news())

Website Change Detection

import hashlib
import time
from rapidcrawl import RapidCrawlApp

app = RapidCrawlApp()

def monitor_changes(url, interval=3600):
    """Monitor a webpage for changes."""
    previous_hash = None
    
    while True:
        result = app.scrape_url(url, formats=["text"])
        
        if result.success:
            content = result.content["text"]
            current_hash = hashlib.md5(content.encode()).hexdigest()
            
            if previous_hash and current_hash != previous_hash:
                print(f"Change detected at {url}!")
                # Send notification, save diff, etc.
            
            previous_hash = current_hash
        
        time.sleep(interval)

# Monitor a page
monitor_changes("https://example.com/status", interval=300)  # Check every 5 minutes

๐Ÿš€ Advanced Usage

Rate Limiting

import time
from rapidcrawl import RapidCrawlApp

class RateLimitedScraper:
    def __init__(self, requests_per_second=2):
        self.app = RapidCrawlApp()
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0
    
    def scrape_url(self, url):
        current = time.time()
        elapsed = current - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        
        self.last_request = time.time()
        return self.app.scrape_url(url)

Caching Results

from functools import lru_cache
import hashlib

class CachedScraper:
    def __init__(self):
        self.app = RapidCrawlApp()
        self.cache = {}
    
    def scrape_with_cache(self, url, max_age_hours=24):
        cache_key = hashlib.md5(url.encode()).hexdigest()
        
        if cache_key in self.cache:
            cached_time, cached_result = self.cache[cache_key]
            age_hours = (time.time() - cached_time) / 3600
            if age_hours < max_age_hours:
                return cached_result
        
        result = self.app.scrape_url(url)
        self.cache[cache_key] = (time.time(), result)
        return result

Error Handling

from rapidcrawl.exceptions import (
    RateLimitError,
    TimeoutError,
    NetworkError
)

def robust_scrape(url, max_retries=3):
    app = RapidCrawlApp()
    
    for attempt in range(max_retries):
        try:
            return app.scrape_url(url)
        except RateLimitError as e:
            wait_time = e.retry_after or 60
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except TimeoutError:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except NetworkError as e:
            print(f"Network error: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff

Concurrent Scraping

from concurrent.futures import ThreadPoolExecutor, as_completed

def concurrent_scrape(urls, max_workers=5):
    app = RapidCrawlApp()
    results = {}
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(app.scrape_url, url): url 
            for url in urls
        }
        
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                results[url] = {"error": str(e)}
    
    return results

For more advanced patterns, see the Advanced Usage Guide.

โšก Performance

Benchmarks

Operation URLs Time Speed
Sequential Scraping 10 12.3s 0.8 pages/sec
Concurrent Scraping 10 3.1s 3.2 pages/sec
Async Crawling 100 28.5s 3.5 pages/sec
URL Mapping 1000 5.2s 192 URLs/sec

Optimization Tips

  1. Use Async Operations: For crawling large sites, use crawl_url_async()
  2. Enable Connection Pooling: Reuse HTTP connections
  3. Limit Concurrent Requests: Prevent overwhelming servers
  4. Cache Results: Avoid re-scraping unchanged content
  5. Use Specific Formats: Only request needed output formats

๐Ÿ” Troubleshooting

Common Issues

Installation Problems

# Update pip
python -m pip install --upgrade pip

# Install in virtual environment
python -m venv venv
source venv/bin/activate
pip install rapid-crawl

Playwright Issues

# Install browser dependencies
playwright install-deps chromium

# Or use Firefox
playwright install firefox

SSL Certificate Errors

# For self-signed certificates (development only!)
app = RapidCrawlApp(verify_ssl=False)

Rate Limiting

# Handle rate limits gracefully
try:
    result = app.scrape_url(url)
except RateLimitError as e:
    time.sleep(e.retry_after or 60)
    result = app.scrape_url(url)

For detailed troubleshooting, see the Troubleshooting Guide.

๐Ÿ› ๏ธ Development

Setting up development environment

# Clone the repository
git clone https://github.com/aoneahsan/rapid-crawl.git
cd rapid-crawl

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running tests

# Run all tests
pytest

# Run with coverage
pytest --cov=rapidcrawl

# Run specific test file
pytest tests/test_scrape.py

Code formatting

# Format code
black src/rapidcrawl

# Run linter
ruff check src/rapidcrawl

# Type checking
mypy src/rapidcrawl

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Start

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Development Guidelines

  • Write tests for new features
  • Follow PEP 8 style guide
  • Update documentation
  • Add type hints
  • Run tests before submitting

๐Ÿ”’ Security

Security is important to us. Please see our Security Policy for details on:

  • Reporting vulnerabilities
  • Security best practices
  • API key management
  • Data privacy

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ’ฌ Support

Documentation

Community

Resources

๐Ÿ‘จโ€๐Ÿ’ป Developer

Ahsan Mahmood

๐Ÿ™ Acknowledgments

๐Ÿ“Š Statistics

GitHub Stars GitHub Forks PyPI Downloads GitHub Issues GitHub Pull Requests


RapidCrawl - Fast, reliable web scraping for Python
Made with โค๏ธ by Ahsan Mahmood

โญ Star us on GitHub โ€ข ๐Ÿ“ฆ Install from PyPI โ€ข ๐Ÿ› Report a Bug

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rapid_crawl-0.2.0.tar.gz (108.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rapid_crawl-0.2.0-py3-none-any.whl (51.2 kB view details)

Uploaded Python 3

File details

Details for the file rapid_crawl-0.2.0.tar.gz.

File metadata

  • Download URL: rapid_crawl-0.2.0.tar.gz
  • Upload date:
  • Size: 108.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for rapid_crawl-0.2.0.tar.gz
Algorithm Hash digest
SHA256 afd0b2f3521c6bfbc231cfa8059a33cfa2ceb8593b79cf01cd3efdd96bd76ba7
MD5 b6db8b9cd96871aa3a2be67878da4dca
BLAKE2b-256 54e58894c11651aafcfa845fbef2c9f0ab54da72877b242c7c37f561e232f25a

See more details on using hashes here.

File details

Details for the file rapid_crawl-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: rapid_crawl-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 51.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for rapid_crawl-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8b40abe5db4d1aead854028b1791b9e76cc8ca5e09c5cf09a3533dc6fbc24aef
MD5 f53bb5d25e38fa9d95c4dcdb3dd9cc55
BLAKE2b-256 6b5c9003b3baaf8ee0c9be9df4b8c34c4d35be77827ec5c62d1b747f70aa8782

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page