OpenCrawler
AI-Powered Web Intelligence
OpenCrawler is a production-ready, enterprise-grade web scraping and crawling framework with advanced AI integration, comprehensive monitoring, and scalable architecture.
🚀 Quick Installation
# Install from PyPI
pip install opencrawler
# Install with AI capabilities
pip install "opencrawler[ai]"
# Install with all features
pip install "opencrawler[all]"
Features
Core Capabilities
- Multi-Engine Support: Playwright, Selenium, Requests, CloudScraper
- AI-Powered Extraction: OpenAI Agents SDK integration for intelligent data extraction
- Stealth Technology: Advanced anti-detection and bot bypass capabilities
- Distributed Processing: Scalable architecture for high-volume operations
- Real-time Monitoring: Comprehensive metrics and health monitoring
- Enterprise Security: RBAC, audit trails, and compliance features
Advanced Features
- LLM Integration: Support for OpenAI, Anthropic, and local models
- Microservice Architecture: FastAPI-based REST API with auto-documentation
- Database Support: PostgreSQL, TimescaleDB, Redis integration
- Container Ready: Docker and Kubernetes deployment configurations
- Performance Optimization: Intelligent caching, rate limiting, and resource management
- Error Recovery: Sophisticated error handling and retry mechanisms
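The error-recovery bullet above does not say how retries are scheduled. As a rough illustration only, the sketch below shows one common pattern, exponential backoff with a capped delay, using just the standard library. The names `retry_async` and `make_flaky_fetch` are hypothetical and are not part of OpenCrawler's API.

```python
import asyncio

async def retry_async(func, *, attempts=3, base_delay=0.01, max_delay=1.0):
    """Call an async function, retrying on any exception with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await func()
        except Exception as exc:  # a real scraper would catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay
                await asyncio.sleep(min(base_delay * 2 ** attempt, max_delay))
    raise last_error

def make_flaky_fetch(failures):
    """Build a hypothetical fetch that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def flaky_fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("simulated transient failure")
        return {"status": 200, "calls": state["calls"]}

    return flaky_fetch

# Two simulated failures, then success on the third and final attempt
result = asyncio.run(retry_async(make_flaky_fetch(2)))
print(result)
```

Capping the delay keeps a long run of failures from producing unbounded waits; production code would typically also add jitter and retry only on errors known to be transient.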
Quick Start
Basic Usage
import asyncio
from webscraper.core.advanced_scraper import AdvancedWebScraper
async def main():
    # Initialize scraper
    scraper = AdvancedWebScraper()
    await scraper.setup()
    # Scrape a webpage
    result = await scraper.scrape_url("https://example.com")
    print(f"Title: {result.get('title')}")
    print(f"Content length: {len(result.get('content', ''))}")
    # Cleanup
    await scraper.cleanup()
asyncio.run(main())
CLI Usage
# Basic scraping
opencrawler scrape https://example.com
# Advanced scraping with AI
opencrawler scrape https://example.com --ai-extract --model gpt-4
# Start API server
opencrawler api --host 0.0.0.0 --port 8000
# Run system validation
opencrawler-validate --level production
Architecture
OpenCrawler follows a modular, microservice-oriented architecture:
OpenCrawler/
├── webscraper/
│ ├── core/ # Core scraping engines
│ ├── ai/ # AI/LLM integration
│ ├── api/ # FastAPI REST API
│ ├── engines/ # Scraping engines (Playwright, Selenium, etc.)
│ ├── processors/ # Data processing pipelines
│ ├── monitoring/ # System monitoring and metrics
│ ├── security/ # Authentication and security
│ ├── utils/ # Utilities and helpers
│ └── orchestrator/ # System orchestration
├── tests/ # Comprehensive test suite
├── deployment/ # Docker and Kubernetes configs
├── docs/ # Documentation
└── examples/ # Usage examples
Configuration
Environment Variables
# OpenAI API (optional)
export OPENAI_API_KEY="your-api-key-here"
# Database (optional)
export DATABASE_URL="postgresql://user:pass@localhost/opencrawler"
# Redis (optional)
export REDIS_URL="redis://localhost:6379"
# Test mode
export OPENCRAWLER_TEST_MODE=true
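For scripts that need the same settings, the variables above can be collected into one object. This is a minimal sketch and not OpenCrawler's own loader: `EnvSettings` and `load_env_settings` are hypothetical names, and the set of strings treated as "true" for `OPENCRAWLER_TEST_MODE` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class EnvSettings:
    openai_api_key: Optional[str]
    database_url: Optional[str]
    redis_url: Optional[str]
    test_mode: bool

def load_env_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read the documented variables; unset optional ones become None."""
    # Assumption: "true", "1" and "yes" (any case) all enable test mode.
    raw_flag = env.get("OPENCRAWLER_TEST_MODE", "")
    test_mode = raw_flag.strip().lower() in {"true", "1", "yes"}
    return EnvSettings(
        openai_api_key=env.get("OPENAI_API_KEY"),
        database_url=env.get("DATABASE_URL"),
        redis_url=env.get("REDIS_URL"),
        test_mode=test_mode,
    )

# Passing a plain dict keeps the example independent of the real environment
settings = load_env_settings({
    "REDIS_URL": "redis://localhost:6379",
    "OPENCRAWLER_TEST_MODE": "true",
})
print(settings.redis_url, settings.test_mode)
```

Accepting any mapping rather than reading `os.environ` directly makes the loader easy to exercise in tests.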
Configuration File
Create a config.yaml file:
scraper:
  engines: ["playwright", "requests"]
  stealth_level: "medium"
  javascript_enabled: true
ai:
  enabled: true
  model: "gpt-4"
  temperature: 0.7
database:
  url: "postgresql://localhost/opencrawler"
  pool_size: 10
monitoring:
  enabled: true
  metrics_port: 9090
security:
  enable_auth: true
  rate_limit: 100
API Reference
REST API
Start the API server:
opencrawler-api --port 8000
Endpoints
- GET /health - Health check
- POST /scrape - Scrape a single URL
- POST /crawl - Crawl multiple URLs
- GET /metrics - System metrics
- GET /docs - API documentation
Example Request
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "extract_ai": true}'
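The same request can be built from Python without extra dependencies. The sketch below mirrors the curl call using only `urllib` from the standard library. The helper name `build_scrape_request` is hypothetical, and the payload fields are taken from the curl example rather than from a documented schema.

```python
import json
import urllib.request

def build_scrape_request(base_url: str, target_url: str,
                         extract_ai: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST /scrape request with a JSON body."""
    payload = json.dumps({"url": target_url, "extract_ai": extract_ai}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_scrape_request("http://localhost:8000", "https://example.com")
print(request.get_method(), request.full_url)

# To actually send it (requires a running API server):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the example runnable without a server and makes the payload easy to inspect.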
Python API
import asyncio
from webscraper.api.complete_api import OpenCrawlerAPI
async def main():
    # Initialize API
    api = OpenCrawlerAPI()
    await api.initialize()
    # Scrape with AI
    result = await api.scrape_with_ai(
        url="https://example.com",
        schema={"title": "string", "content": "string"}
    )
    # Cleanup
    await api.cleanup()
asyncio.run(main())
Advanced Usage
AI-Powered Extraction
import asyncio
from webscraper.ai.llm_scraper import LLMScraper
async def main():
    scraper = LLMScraper()
    await scraper.initialize()
    # Extract structured data
    result = await scraper.run(
        url="https://news.example.com",
        schema={
            "title": "string",
            "author": "string",
            "date": "date",
            "content": "string"
        }
    )
asyncio.run(main())
Distributed Processing
import asyncio
from webscraper.core.distributed_processor import DistributedProcessor
async def main():
    processor = DistributedProcessor(worker_count=16)
    await processor.initialize()
    # Process multiple URLs
    results = await processor.process_batch([
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ])
asyncio.run(main())
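How `DistributedProcessor` schedules work internally is not described here. As an illustration of the general idea behind a `worker_count` limit, the sketch below bounds concurrency with an `asyncio.Semaphore`. The `process_batch` function and `fake_fetch` are stand-ins written for this example, not OpenCrawler code.

```python
import asyncio

async def process_batch(urls, fetch, worker_count=4):
    """Run `fetch` over `urls` with at most `worker_count` calls in flight."""
    semaphore = asyncio.Semaphore(worker_count)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))

async def fake_fetch(url):
    """Hypothetical stand-in for a real page fetch."""
    await asyncio.sleep(0.01)
    return {"url": url, "status": 200}

urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = asyncio.run(process_batch(urls, fake_fetch, worker_count=2))
print([r["url"] for r in results])
```

A semaphore caps in-flight requests within a single process; spreading work across machines would additionally need a shared queue, which this sketch does not attempt.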
Custom Engines
from webscraper.engines.base_engine import BaseEngine
class CustomEngine(BaseEngine):
    async def fetch(self, url: str, **kwargs) -> dict:
        # Custom implementation
        return {"content": "...", "status": 200}
# Register custom engine on an existing scraper instance
scraper.register_engine("custom", CustomEngine())
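To show the shape of the engine contract without installing anything, here is a self-contained sketch of an abstract `fetch` interface plus a name-based registry. `EngineBase`, `StaticEngine`, and `EngineRegistry` are hypothetical stand-ins. OpenCrawler's real `BaseEngine` and `register_engine` may differ in signature and behavior.

```python
import asyncio
from abc import ABC, abstractmethod

class EngineBase(ABC):
    """Stand-in for the fetch contract shown above (not OpenCrawler's BaseEngine)."""

    @abstractmethod
    async def fetch(self, url: str, **kwargs) -> dict: ...

class StaticEngine(EngineBase):
    """Hypothetical engine that returns canned content instead of touching the network."""

    def __init__(self, pages: dict):
        self.pages = pages

    async def fetch(self, url: str, **kwargs) -> dict:
        if url in self.pages:
            return {"content": self.pages[url], "status": 200}
        return {"content": "", "status": 404}

class EngineRegistry:
    """Hypothetical name-to-engine lookup, mimicking register_engine()."""

    def __init__(self):
        self._engines = {}

    def register_engine(self, name: str, engine: EngineBase) -> None:
        if not isinstance(engine, EngineBase):
            raise TypeError("engine must implement fetch()")
        self._engines[name] = engine

    async def fetch(self, name: str, url: str, **kwargs) -> dict:
        return await self._engines[name].fetch(url, **kwargs)

registry = EngineRegistry()
registry.register_engine("static", StaticEngine({"https://example.com": "<h1>Hello</h1>"}))
page = asyncio.run(registry.fetch("static", "https://example.com"))
print(page)
```

An engine that serves canned pages like this is also a convenient test double for code that depends on a scraper.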
Monitoring and Metrics
Built-in Monitoring
import asyncio
from webscraper.monitoring.advanced_monitoring import AdvancedMonitoringSystem
async def main():
    monitor = AdvancedMonitoringSystem()
    await monitor.initialize()
    # Get system metrics
    metrics = await monitor.get_system_metrics()
    print(f"CPU: {metrics['cpu_usage']}%")
    print(f"Memory: {metrics['memory_usage']}%")
asyncio.run(main())
Prometheus Integration
OpenCrawler exposes metrics for Prometheus to scrape:
# Start with monitoring
python master_cli.py api --enable-metrics --metrics-port 9090
Metrics are then available at http://localhost:9090/metrics
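A metrics endpoint of this kind normally serves the Prometheus text exposition format: comment lines starting with `#`, then one sample per line. The sketch below parses a simplified subset of that format. The metric names in `SAMPLE` are invented for illustration and are not taken from OpenCrawler's actual output.

```python
def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus text-format samples into {series: value}.

    Handles `name value` and `name{labels} value` lines; ignores comments,
    blank lines, and optional trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Split after the closing brace so spaces inside label values survive
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        samples[series] = float(rest.split()[0])
    return samples

# Invented sample payload; real metric names will differ
SAMPLE = """\
# HELP opencrawler_pages_scraped_total Invented metric name for illustration.
# TYPE opencrawler_pages_scraped_total counter
opencrawler_pages_scraped_total 128
opencrawler_request_seconds_sum{engine="playwright"} 312.5
"""

metrics = parse_metrics(SAMPLE)
print(metrics)
```

This is enough for a quick health script; anything beyond that is better served by a full Prometheus client library.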
Deployment
Docker
# Build image
docker build -t opencrawler .
# Run container
docker run -p 8000:8000 opencrawler
Docker Compose
# Start all services
docker-compose up -d
# Production deployment
docker-compose -f docker-compose.production.yml up -d
Kubernetes
# Deploy to Kubernetes
kubectl apply -f kubernetes/
# Check deployment
kubectl get pods -l app=opencrawler
Production Deployment
import asyncio
from deployment.production_deployment import ProductionDeploymentSystem
async def main():
    deployment = ProductionDeploymentSystem()
    await deployment.initialize()
    # Deploy to production
    result = await deployment.deploy(
        environment="production",
        config_overrides={"replicas": 5}
    )
asyncio.run(main())
Testing
Running Tests
# Run all tests
pytest
# Run specific test suite
pytest tests/test_complete_system.py
# Run with coverage
pytest --cov=webscraper
# Run in test mode
OPENCRAWLER_TEST_MODE=true pytest
Test Categories
- Unit Tests: Core component testing
- Integration Tests: Service integration testing
- Performance Tests: Load and performance testing
- Security Tests: Security validation
- End-to-End Tests: Complete workflow testing
Validation
# Run comprehensive validation
python webscraper/utils/comprehensive_validator.py --level production
# Check system health
python -c "
from webscraper.orchestrator.system_orchestrator import SystemOrchestrator
import asyncio
async def main():
    orchestrator = SystemOrchestrator()
    await orchestrator.initialize()
    health = await orchestrator.get_system_health()
    print(f'System Status: {health[\"status\"]}')
    await orchestrator.shutdown()
asyncio.run(main())
"
Performance
Benchmarks
- Single Page: ~2-5 seconds per page
- Concurrent Crawling: 50-100 pages/minute
- Memory Usage: <1GB for typical workloads
- CPU Usage: Optimized for multi-core systems
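These figures translate directly into rough job-duration estimates. At 2-5 seconds per page, a purely sequential crawl manages about 12-30 pages per minute, so the 50-100 pages/minute figure implies several pages in flight at once. The helper below is a back-of-the-envelope sketch that assumes the quoted throughput holds steady. The function names are hypothetical.

```python
def crawl_time_minutes(pages: int, pages_per_minute: float) -> float:
    """Minutes needed to crawl `pages` at a steady throughput."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute

def estimate_range(pages: int, slow_rate: float = 50, fast_rate: float = 100):
    """Return (best_case, worst_case) minutes for the quoted 50-100 pages/minute."""
    return crawl_time_minutes(pages, fast_rate), crawl_time_minutes(pages, slow_rate)

best, worst = estimate_range(10_000)
print(f"10,000 pages: {best:.0f} to {worst:.0f} minutes")
# prints "10,000 pages: 100 to 200 minutes"
```

Real runs will vary with target-site latency, JavaScript rendering, and rate limits, so treat the result as an order-of-magnitude guide.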
Optimization
from webscraper.core.advanced_scraper import AdvancedWebScraper
# Enable performance optimizations
scraper = AdvancedWebScraper(
    stealth_level="low",  # Faster but less stealthy
    javascript_enabled=False,  # Skip JS rendering
    cache_enabled=True,  # Enable caching
    concurrent_requests=10  # Increase concurrency
)
Security
Authentication
import asyncio
from webscraper.security.authentication import AuthManager
async def main():
    auth = AuthManager()
    await auth.initialize()
    # Create user
    user = await auth.create_user("username", "password", ["scraper"])
    # Authenticate
    token = await auth.authenticate("username", "password")
asyncio.run(main())
Rate Limiting
import asyncio
from webscraper.security.rate_limiter import RateLimiter
async def main():
    limiter = RateLimiter(requests_per_minute=60)
    await limiter.check_rate_limit(user_id="user123")
asyncio.run(main())
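The snippet above does not show whether `check_rate_limit` returns a flag or raises when the limit is exceeded, nor which algorithm it uses. To illustrate what a per-user requests-per-minute limit can look like, here is a sliding-window sketch built on the standard library. `SlidingWindowLimiter` is a hypothetical class and is not OpenCrawler's `RateLimiter`.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `requests_per_minute` calls per user in any 60-second window."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic):
        self.limit = requests_per_minute
        self.window = 60.0
        self.clock = clock  # injectable so the example can run without waiting
        self._hits = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

# A fake clock lets the example jump forward a minute instantly
fake_now = [0.0]
limiter = SlidingWindowLimiter(requests_per_minute=2, clock=lambda: fake_now[0])
decisions = [limiter.allow("user123") for _ in range(3)]
fake_now[0] = 60.0
decisions.append(limiter.allow("user123"))
print(decisions)
# prints [True, True, False, True]
```

This version keeps state in process memory; a limit shared across several API workers would need a common store such as the Redis instance mentioned under Configuration.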
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone and install
git clone https://github.com/llamasearch/opencrawler.git
cd opencrawler
pip install -e ".[dev]"
# Run pre-commit hooks
pre-commit install
# Run tests
pytest
Code Style
We use Black for code formatting:
# Format code
black webscraper/
# Check formatting
black --check webscraper/
License
OpenCrawler is licensed under the MIT License. See LICENSE for details.
Support
- Documentation: docs/
- Examples: examples/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Changelog
See CHANGELOG.md for version history and updates.
Assets
OpenCrawler includes a complete set of professional logo assets:
Logo Variants
- assets/opencrawler-logo.svg - Main logo with full branding (light theme)
- assets/opencrawler-logo-dark.svg - Dark variant for light backgrounds
- assets/opencrawler-icon.svg - Icon version for app icons and buttons
- assets/favicon.svg - Favicon optimized for small sizes
Design Features
- Spider/Crawler Theme: Represents web crawling and data extraction
- AI/Neural Network Elements: Symbolizes AI-powered intelligence
- Modern Gradients: Professional blue, green, and orange color scheme
- Scalable Vector Graphics: Perfect quality at any size
- Multiple Formats: SVG for web, can be converted to PNG/ICO as needed
Usage Guidelines
<!-- Main logo for documentation -->
<img src="assets/opencrawler-logo.svg" alt="OpenCrawler" width="200">
<!-- Dark variant for light backgrounds -->
<img src="assets/opencrawler-logo-dark.svg" alt="OpenCrawler" width="200">
<!-- Icon for buttons/navigation -->
<img src="assets/opencrawler-icon.svg" alt="OpenCrawler" width="32">
<!-- Favicon -->
<link rel="icon" type="image/svg+xml" href="assets/favicon.svg">
Acknowledgments
OpenCrawler is built with these excellent libraries:
- Playwright - Modern web automation
- FastAPI - High-performance API framework
- OpenAI - AI/LLM integration
- PostgreSQL - Database backend
- Docker - Containerization
- Kubernetes - Container orchestration
Author: Nik Jois <nikjois@llamasearch.ai>
Organization: LlamaSearch.ai
Version: 1.0.1
Status: Production Ready