OpenCrawler

AI-Powered Web Intelligence

OpenCrawler is a production-ready, enterprise-grade web scraping and crawling framework with advanced AI integration, comprehensive monitoring, and scalable architecture.

🚀 Quick Installation

# Install from PyPI
pip install opencrawler

# Install with AI capabilities
pip install "opencrawler[ai]"

# Install with all features
pip install "opencrawler[all]"

Features

Core Capabilities

  • Multi-Engine Support: Playwright, Selenium, Requests, CloudScraper
  • AI-Powered Extraction: OpenAI Agents SDK integration for intelligent data extraction
  • Stealth Technology: Advanced anti-detection and bot bypass capabilities
  • Distributed Processing: Scalable architecture for high-volume operations
  • Real-time Monitoring: Comprehensive metrics and health monitoring
  • Enterprise Security: RBAC, audit trails, and compliance features

Advanced Features

  • LLM Integration: Support for OpenAI, Anthropic, and local models
  • Microservice Architecture: FastAPI-based REST API with auto-documentation
  • Database Support: PostgreSQL, TimescaleDB, Redis integration
  • Container Ready: Docker and Kubernetes deployment configurations
  • Performance Optimization: Intelligent caching, rate limiting, and resource management
  • Error Recovery: Sophisticated error handling and retry mechanisms

Quick Start

Basic Usage

import asyncio
from webscraper.core.advanced_scraper import AdvancedWebScraper

async def main():
    # Initialize scraper
    scraper = AdvancedWebScraper()
    await scraper.setup()

    try:
        # Scrape a webpage
        result = await scraper.scrape_url("https://example.com")
        print(f"Title: {result.get('title')}")
        print(f"Content length: {len(result.get('content', ''))}")
    finally:
        # Cleanup runs even if scraping raises
        await scraper.cleanup()

asyncio.run(main())

CLI Usage

# Basic scraping
opencrawler scrape https://example.com

# Advanced scraping with AI
opencrawler scrape https://example.com --ai-extract --model gpt-4

# Start API server
opencrawler api --host 0.0.0.0 --port 8000

# Run system validation
opencrawler-validate --level production

Architecture

OpenCrawler follows a modular, microservice-oriented architecture:

OpenCrawler/
├── webscraper/
│   ├── core/           # Core scraping engines
│   ├── ai/             # AI/LLM integration
│   ├── api/            # FastAPI REST API
│   ├── engines/        # Scraping engines (Playwright, Selenium, etc.)
│   ├── processors/     # Data processing pipelines
│   ├── monitoring/     # System monitoring and metrics
│   ├── security/       # Authentication and security
│   ├── utils/          # Utilities and helpers
│   └── orchestrator/   # System orchestration
├── tests/              # Comprehensive test suite
├── deployment/         # Docker and Kubernetes configs
├── docs/               # Documentation
└── examples/           # Usage examples

Configuration

Environment Variables

# OpenAI API (optional)
export OPENAI_API_KEY="your-api-key-here"

# Database (optional)
export DATABASE_URL="postgresql://user:pass@localhost/opencrawler"

# Redis (optional)
export REDIS_URL="redis://localhost:6379"

# Test mode
export OPENCRAWLER_TEST_MODE=true
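As a rough illustration of how an application might collect these variables at startup, here is a minimal sketch. Only the variable names come from this README; `Settings` and `load_settings` are hypothetical helpers, not part of OpenCrawler, and the set of values treated as "true" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    openai_api_key: Optional[str]
    database_url: Optional[str]
    redis_url: Optional[str]
    test_mode: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Optional variables become None when unset or empty.
    def optional(name: str) -> Optional[str]:
        return env.get(name) or None

    # Assumption: "true", "1" and "yes" (any case) enable test mode.
    flag = env.get("OPENCRAWLER_TEST_MODE", "").strip().lower()
    return Settings(
        openai_api_key=optional("OPENAI_API_KEY"),
        database_url=optional("DATABASE_URL"),
        redis_url=optional("REDIS_URL"),
        test_mode=flag in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.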

Configuration File

Create a config.yaml file:

scraper:
  engines: ["playwright", "requests"]
  stealth_level: "medium"
  javascript_enabled: true
  
ai:
  enabled: true
  model: "gpt-4"
  temperature: 0.7
  
database:
  url: "postgresql://localhost/opencrawler"
  pool_size: 10
  
monitoring:
  enabled: true
  metrics_port: 9090
  
security:
  enable_auth: true
  rate_limit: 100

API Reference

REST API

Start the API server:

opencrawler-api --port 8000

Endpoints

  • GET /health - Health check
  • POST /scrape - Scrape a single URL
  • POST /crawl - Crawl multiple URLs
  • GET /metrics - System metrics
  • GET /docs - API documentation

Example Request

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "extract_ai": true}'
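The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; `build_scrape_request` is a hypothetical helper name, and the response format is not documented here, so the sending step is left commented out.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, target_url: str,
                         extract_ai: bool = True) -> urllib.request.Request:
    """Build the same POST /scrape request as the curl example."""
    payload = json.dumps({"url": target_url, "extract_ai": extract_ai}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request("http://localhost:8000", "https://example.com")

# Sending requires a running API server. The body is decoded as generic
# JSON because the response schema is an unknown here.
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.loads(response.read()))
```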

Python API

import asyncio

from webscraper.api.complete_api import OpenCrawlerAPI

async def main():
    # Initialize API
    api = OpenCrawlerAPI()
    await api.initialize()

    try:
        # Scrape with AI
        result = await api.scrape_with_ai(
            url="https://example.com",
            schema={"title": "string", "content": "string"}
        )
        print(result)
    finally:
        # Cleanup
        await api.cleanup()

asyncio.run(main())

Advanced Usage

AI-Powered Extraction

import asyncio

from webscraper.ai.llm_scraper import LLMScraper

async def main():
    scraper = LLMScraper()
    await scraper.initialize()

    # Extract structured data
    result = await scraper.run(
        url="https://news.example.com",
        schema={
            "title": "string",
            "author": "string",
            "date": "date",
            "content": "string"
        }
    )
    print(result)

asyncio.run(main())
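LLM output can drift from the requested schema, so it may be worth checking a result before storing it. The sketch below is not part of OpenCrawler: `check_against_schema` is a hypothetical helper, and it assumes the result is a plain dict and that `"date"` fields arrive as ISO 8601 strings.

```python
from datetime import date


def check_against_schema(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for field, kind in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if kind == "string":
            if not isinstance(value, str):
                problems.append(f"{field}: expected a string")
        elif kind == "date":
            # Assumption: dates are ISO 8601 strings such as "2024-01-31".
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{field}: expected an ISO date")
        else:
            problems.append(f"{field}: unknown schema type {kind!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that went wrong with a single extraction.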

Distributed Processing

import asyncio

from webscraper.core.distributed_processor import DistributedProcessor

async def main():
    processor = DistributedProcessor(worker_count=16)
    await processor.initialize()

    # Process multiple URLs
    results = await processor.process_batch([
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ])
    print(results)

asyncio.run(main())
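How DistributedProcessor schedules its workers is not described here. As a single-process analogy only, the sketch below shows the general idea of capping the number of fetches in flight with `asyncio.Semaphore`. `run_bounded` and `fake_fetch` are hypothetical names used for illustration, and `fake_fetch` stands in for a real network call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    worker_count: int = 16,
) -> List[dict]:
    """Run fetch over urls with at most worker_count calls in flight."""
    semaphore = asyncio.Semaphore(worker_count)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                return await fetch(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return {"url": url, "error": str(exc)}

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url: str) -> dict:
    # Stand-in for a real fetch; fails for one URL to show error handling.
    await asyncio.sleep(0)
    if "example2" in url:
        raise RuntimeError("simulated failure")
    return {"url": url, "status": 200}


results = asyncio.run(run_bounded(
    ["https://example1.com", "https://example2.com", "https://example3.com"],
    fake_fetch,
    worker_count=2,
))
```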

Custom Engines

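from webscraper.core.advanced_scraper import AdvancedWebScraper
from webscraper.engines.base_engine import BaseEngine

class CustomEngine(BaseEngine):
    async def fetch(self, url: str, **kwargs) -> dict:
        # Custom implementation
        return {"content": "...", "status": 200}

# Register custom engine
# (assumes the scraper is an AdvancedWebScraper, as in Basic Usage)
scraper = AdvancedWebScraper()
scraper.register_engine("custom", CustomEngine())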

Monitoring and Metrics

Built-in Monitoring

import asyncio

from webscraper.monitoring.advanced_monitoring import AdvancedMonitoringSystem

async def main():
    monitor = AdvancedMonitoringSystem()
    await monitor.initialize()

    # Get system metrics
    metrics = await monitor.get_system_metrics()
    print(f"CPU: {metrics['cpu_usage']}%")
    print(f"Memory: {metrics['memory_usage']}%")

asyncio.run(main())

Prometheus Integration

OpenCrawler exposes a metrics endpoint that Prometheus can scrape:

# Start with monitoring
python master_cli.py api --enable-metrics --metrics-port 9090

Metrics are available at http://localhost:9090/metrics.
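For a quick look at the endpoint without a Prometheus server, the text exposition format is simple enough to parse by hand. This is a minimal sketch: `parse_metrics` is a hypothetical helper, the metric names in `SAMPLE` are invented for illustration (the names OpenCrawler actually exports are not listed here), and the parser ignores timestamps and does not handle every corner of the format.

```python
def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus text-format samples into {name: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled sample: keep the label set as part of the key.
            head, _, tail = line.rpartition("}")
            name = head + "}"
        else:
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if fields:
            samples[name] = float(fields[0])
    return samples


# Hypothetical payload; real metric names will differ.
SAMPLE = """\
# HELP example_pages_scraped_total Pages scraped so far.
# TYPE example_pages_scraped_total counter
example_pages_scraped_total 128
example_request_seconds_sum{engine="playwright"} 42.5
"""

metrics = parse_metrics(SAMPLE)
```

Against a running instance, the text could be fetched with `urllib.request.urlopen("http://localhost:9090/metrics")` and decoded before being passed to `parse_metrics`.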

Deployment

Docker

# Build image
docker build -t opencrawler .

# Run container
docker run -p 8000:8000 opencrawler

Docker Compose

# Start all services
docker-compose up -d

# Production deployment
docker-compose -f docker-compose.production.yml up -d

Kubernetes

# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check deployment
kubectl get pods -l app=opencrawler

Production Deployment

import asyncio

from deployment.production_deployment import ProductionDeploymentSystem

async def main():
    deployment = ProductionDeploymentSystem()
    await deployment.initialize()

    # Deploy to production
    result = await deployment.deploy(
        environment="production",
        config_overrides={"replicas": 5}
    )
    print(result)

asyncio.run(main())

Testing

Running Tests

# Run all tests
pytest

# Run specific test suite
pytest tests/test_complete_system.py

# Run with coverage
pytest --cov=webscraper

# Run in test mode
OPENCRAWLER_TEST_MODE=true pytest

Test Categories

  • Unit Tests: Core component testing
  • Integration Tests: Service integration testing
  • Performance Tests: Load and performance testing
  • Security Tests: Security validation
  • End-to-End Tests: Complete workflow testing

Validation

# Run comprehensive validation
python webscraper/utils/comprehensive_validator.py --level production

# Check system health
python -c "
from webscraper.orchestrator.system_orchestrator import SystemOrchestrator
import asyncio

async def main():
    orchestrator = SystemOrchestrator()
    await orchestrator.initialize()
    health = await orchestrator.get_system_health()
    print(f'System Status: {health[\"status\"]}')
    await orchestrator.shutdown()

asyncio.run(main())
"

Performance

Benchmarks

  • Single Page: ~2-5 seconds per page
  • Concurrent Crawling: 50-100 pages/minute
  • Memory Usage: <1GB for typical workloads
  • CPU Usage: Optimized for multi-core systems
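To turn these figures into a planning estimate, the arithmetic can be sketched as follows. It assumes the quoted numbers are steady-state averages, which this README does not state; both function names are hypothetical.

```python
import math


def crawl_minutes(pages: int, pages_per_minute: float) -> int:
    """Whole minutes needed to crawl `pages` at a steady rate."""
    return math.ceil(pages / pages_per_minute)


def sequential_pages_per_minute(seconds_per_page: float) -> float:
    """Throughput of a single worker fetching one page at a time."""
    return 60 / seconds_per_page


# A 10,000-page crawl at the quoted 50-100 pages/minute:
slowest = crawl_minutes(10_000, 50)    # 200 minutes
fastest = crawl_minutes(10_000, 100)   # 100 minutes

# One worker at 2-5 seconds per page manages 12-30 pages/minute,
# so the concurrent figure implies several pages in flight at once.
one_worker_low = sequential_pages_per_minute(5)    # 12.0
one_worker_high = sequential_pages_per_minute(2)   # 30.0
```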

Optimization

from webscraper.core.advanced_scraper import AdvancedWebScraper

# Enable performance optimizations
scraper = AdvancedWebScraper(
    stealth_level="low",  # Faster but less stealthy
    javascript_enabled=False,  # Skip JS rendering
    cache_enabled=True,  # Enable caching
    concurrent_requests=10  # Increase concurrency
)

Security

Authentication

import asyncio

from webscraper.security.authentication import AuthManager

async def main():
    auth = AuthManager()
    await auth.initialize()

    # Create user
    user = await auth.create_user("username", "password", ["scraper"])

    # Authenticate
    token = await auth.authenticate("username", "password")

asyncio.run(main())

Rate Limiting

import asyncio
from webscraper.security.rate_limiter import RateLimiter

async def main():
    limiter = RateLimiter(requests_per_minute=60)
    await limiter.check_rate_limit(user_id="user123")

asyncio.run(main())
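The algorithm behind RateLimiter is not documented here. To illustrate what a per-user "requests per minute" limit can mean in practice, the sketch below implements a sliding-window limiter with the standard library. `SlidingWindowLimiter` is a hypothetical class, not OpenCrawler's implementation, and the injectable `clock` exists only to make the behaviour testable.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most requests_per_minute calls per user in any 60 s window."""

    def __init__(self, requests_per_minute: int,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = requests_per_minute
        self.clock = clock
        # user_id -> timestamps of that user's recent accepted calls
        self.calls = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        window = self.calls[user_id]
        # Drop timestamps that have aged out of the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.limit:
            return False
        window.append(now)
        return True
```

A sliding window avoids the burst at minute boundaries that a fixed per-minute counter allows, at the cost of storing one timestamp per accepted request.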

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Clone and install
git clone https://github.com/llamasearch/opencrawler.git
cd opencrawler
pip install -e ".[dev]"

# Run pre-commit hooks
pre-commit install

# Run tests
pytest

Code Style

We use Black for code formatting:

# Format code
black webscraper/

# Check formatting
black --check webscraper/

License

OpenCrawler is licensed under the MIT License. See LICENSE for details.

Changelog

See CHANGELOG.md for version history and updates.

Assets

OpenCrawler includes a complete set of professional logo assets:

Logo Variants

  • assets/opencrawler-logo.svg - Main logo with full branding (light theme)
  • assets/opencrawler-logo-dark.svg - Dark variant for light backgrounds
  • assets/opencrawler-icon.svg - Icon version for app icons and buttons
  • assets/favicon.svg - Favicon optimized for small sizes

Design Features

  • Spider/Crawler Theme: Represents web crawling and data extraction
  • AI/Neural Network Elements: Symbolizes AI-powered intelligence
  • Modern Gradients: Professional blue, green, and orange color scheme
  • Scalable Vector Graphics: Perfect quality at any size
  • Multiple Formats: SVG for web, can be converted to PNG/ICO as needed

Usage Guidelines

<!-- Main logo for documentation -->
<img src="assets/opencrawler-logo.svg" alt="OpenCrawler" width="200">

<!-- Dark variant for light backgrounds -->
<img src="assets/opencrawler-logo-dark.svg" alt="OpenCrawler" width="200">

<!-- Icon for buttons/navigation -->
<img src="assets/opencrawler-icon.svg" alt="OpenCrawler" width="32">

<!-- Favicon -->
<link rel="icon" type="image/svg+xml" href="assets/favicon.svg">

Acknowledgments

OpenCrawler is built with these excellent libraries:


Author: Nik Jois (nikjois@llamasearch.ai)
Organization: LlamaSearch.ai
Version: 1.0.1
Status: Production Ready
