DeepHarvest
The World's Most Complete, Resilient, Multilingual Web Crawler
Features
Core Capabilities
- Complete Coverage: Crawls entire websites including all subpages
- All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
- JavaScript Support: Full SPA support with Playwright
- Multilingual: Handles all languages, encodings, and scripts
- Distributed: Redis-based distributed crawling with multiple workers
- Resumable: Checkpoint and resume interrupted crawls
- Intelligent: ML-based trap detection, content extraction, deduplication
Advanced Features
- Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
- ML Content Extraction: Page classification, soft-404 detection, quality scoring
- Advanced URL Management: SimHash, MinHash, LSH deduplication
- Site Graph Analysis: PageRank, clustering, GraphML export
- Observability: Prometheus metrics, Grafana dashboards
- Extensible: Plugin system for custom extractors
- OSINT Mode: Entity extraction, technology detection, link graph analysis
- Browser Automation: High-level Playwright integration with screenshot capture
- Pipeline Execution: YAML-based pipeline runner for complex workflows
- API Server: REST API for programmatic access
- Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support
Quick Start
Installation
pip install deepharvest
Basic Usage
Simple Crawls
# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output
# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3
# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3
Limiting Crawl Scope
# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5
# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3
# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5
# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5
# Combine multiple limits
deepharvest crawl https://example.com \
--depth 5 \
--max-urls 500 \
--max-pages-per-domain 100 \
--max-size 5 \
--time-limit 1800 \
--output ./output
Distributed Crawling
# Run in distributed mode with Redis
deepharvest crawl https://example.com \
--distributed \
--redis-url redis://localhost:6379 \
--workers 5 \
--depth 10
Using Configuration Files
# Use a YAML config file
deepharvest crawl --config config.yaml
OSINT Mode
# Basic OSINT collection
deepharvest osint https://example.com
# With JSON output and link graph
deepharvest osint https://example.com --json --graph
# With screenshots
deepharvest osint https://example.com --screenshot
API Server
# Start API server
deepharvest serve --host 0.0.0.0 --port 8000
Pipeline Execution
# Run a pipeline from YAML file
deepharvest run pipeline.yaml
Python API
import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True,
    )
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())
Installation
From PyPI
pip install deepharvest
From Source
git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .
Using Docker
docker-compose up
Documentation
Comprehensive documentation is available in the docs/ directory:
- API Reference - Complete API documentation
- Plugin Development Guide - Create and use plugins
- OSINT Usage - OSINT mode examples
- Browser Automation - Browser automation guide
- Benchmarks - Performance benchmarks
- Troubleshooting - Common issues and solutions
- Architecture - System architecture overview
Architecture
┌─────────────────────────────────────────────────────────┐
│ DeepHarvest Core │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Frontier │ │ Fetcher │ │ JS Renderer │ │
│ │ (BFS/DFS) │ │ (HTTP/2) │ │ (Playwright) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Extractors │ │ Trap Det. │ │ URL Dedup │ │
│ │ (50+ fmt) │ │ (ML+Rules) │ │ (SimHash) │ │
│ └─────────────┘ └──────────────┘ └───────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Distributed Layer │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Redis │ │ Workers │ │ Storage │ │
│ │ Frontier │ │ (N proc) │ │ (S3/FS) │ │
│ └──────────┘ └───────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
How It Works
DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.
Core Workflow
1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.
2. URL Management (Frontier): A priority queue manages URLs to be crawled, supporting BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.
3. Content Fetching: The fetcher downloads web pages with retry logic, timeout handling, and rate limiting. It attempts HTTP/2 and falls back to HTTP/1.1.
4. HTML Parsing: A multi-strategy parser with a fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.
5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.
6. Content Extraction: Specialized extractors process different content types:
   - Text: HTML text extraction with boilerplate removal
   - Documents: PDF, DOCX, PPTX, XLSX text extraction
   - Media: image metadata, OCR, audio transcription, video metadata
   - Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
7. Link Discovery: The advanced link extractor finds URLs from multiple sources:
   - HTML attributes (href, src, srcset)
   - JavaScript code (router.push, window.location)
   - Structured data (JSON-LD, Microdata)
   - Meta tags and data URIs
8. Deduplication: A three-tier deduplication system:
   - SHA256: exact URL/content duplicates
   - SimHash: near-duplicate detection (64-bit hashing)
   - MinHash LSH: scalable similarity search for large datasets
9. Trap Detection: ML and rule-based detection prevents infinite loops caused by:
   - Calendar-based URLs (date patterns)
   - Session ID parameters
   - Pagination traps
   - Query parameter explosions
10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.
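The SimHash tier above can be sketched in a few lines. This is an illustrative implementation (token hashing via MD5; the names `simhash` and `hamming_distance` are our own here, not DeepHarvest's API):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a 64-bit SimHash fingerprint from whitespace tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash, truncated to `bits` bits
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "deep harvest crawls the entire web site"
doc2 = "deep harvest crawls the entire web page"
doc3 = "completely unrelated text about cooking pasta"

# Near-duplicates land close in Hamming space; unrelated text lands far
assert hamming_distance(simhash(doc1), simhash(doc2)) < hamming_distance(simhash(doc1), simhash(doc3))
```

Pages whose fingerprints differ by only a few bits are treated as near-duplicates; exact duplicates are already caught by the cheaper SHA256 tier.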
Distributed Architecture
In distributed mode, multiple workers share a Redis-based frontier. Each worker:
- Pulls URLs from the shared queue
- Processes pages independently
- Respects per-host concurrency limits
- Reports metrics to centralized monitoring
This enables linear scaling: N workers process approximately N times the throughput of a single worker.
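One common way to achieve that scaling is to shard the frontier by host, so each host's politeness limits stay local to a single worker. A minimal sketch, with in-memory deques standing in for the Redis queues (the `worker_for` helper is hypothetical, not part of DeepHarvest):

```python
from collections import deque
from hashlib import sha256
from urllib.parse import urlparse

def worker_for(url: str, n_workers: int) -> int:
    """Assign each host to a fixed worker so per-host limits stay local."""
    host = urlparse(url).netloc
    return int(sha256(host.encode()).hexdigest(), 16) % n_workers

urls = [
    "https://a.example/page1", "https://a.example/page2",
    "https://b.example/page1", "https://c.example/page1",
]
n_workers = 3
queues = [deque() for _ in range(n_workers)]
for u in urls:
    queues[worker_for(u, n_workers)].append(u)

# All URLs from the same host end up on the same worker's queue
assert worker_for("https://a.example/page1", 3) == worker_for("https://a.example/page2", 3)
```

Because hosts never span workers, rate limiting needs no cross-worker coordination, which is what keeps throughput close to linear in N.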
Resilience Features
- Parser Fallback: Automatic fallback between parsers when HTML is malformed
- Network Resilience: Exponential backoff retry, timeout handling, proxy support
- Memory Management: Streaming for large files, memory guards per worker
- Checkpointing: Periodic state saves enable resuming interrupted crawls
- Error Taxonomy: Structured error handling with detailed reporting
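The retry behavior can be illustrated with a small exponential-backoff wrapper (a generic sketch; `fetch_with_backoff` is not DeepHarvest's API):

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=0.5):
    """Retry a flaky fetch with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt: 0.5s, 1s, 2s, 4s (plus jitter)
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulated fetcher that fails twice before succeeding
attempts = {"n": 0}
def flaky_fetch(url):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"

assert fetch_with_backoff(flaky_fetch, "https://example.com", base_delay=0.01) == "<html>ok</html>"
assert attempts["n"] == 3
```

The jitter term prevents a fleet of workers from retrying a struggling host in lockstep.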
Machine Learning Integration
- Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
- Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
- Quality Scoring: ML-based content quality assessment
- Trap Detection: Pattern recognition for crawler traps
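The rule-based side of trap detection can be sketched with a few URL heuristics (the patterns and thresholds below are illustrative, not the shipped rules):

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative rules; the real detector combines rules like these with ML
CALENDAR_RE = re.compile(r"/\d{4}/\d{1,2}(/\d{1,2})?(/|$)")
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def looks_like_trap(url: str, max_params: int = 8) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    if CALENDAR_RE.search(parsed.path):
        return True  # calendar archives generate effectively infinite date URLs
    if SESSION_PARAMS & {k.lower() for k in params}:
        return True  # session IDs make every visit look like a new URL
    if len(params) > max_params:
        return True  # query-parameter explosion (e.g. faceted navigation)
    return False

assert looks_like_trap("https://example.com/archive/2024/05/17/")
assert looks_like_trap("https://example.com/page?sessionid=abc123")
assert not looks_like_trap("https://example.com/blog/crawler-design")
```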
Multilingual Support
- Automatic encoding detection (charset_normalizer, chardet)
- Language detection (langdetect)
- CJK (Chinese, Japanese, Korean) text processing
- RTL (Right-to-Left) language support
- Unicode normalization
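Unicode normalization matters because visually identical text can arrive as different code-point sequences; a minimal stdlib example:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """NFC-normalize so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)

# "é" as one precomposed code point vs. "e" + combining acute accent
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
assert precomposed != decomposed                      # raw strings differ
assert normalize_text(precomposed) == normalize_text(decomposed)
```

Without a step like this, deduplication and language detection would treat the two spellings of the same word as different content.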
Use Cases
- Research: Academic data collection and analysis
- SEO: Site auditing and competitive analysis
- Media Monitoring: News and content aggregation
- Business Intelligence: Market research and data mining
- AI Training: Dataset creation for ML models
- Analytics: Web structure and link analysis
Configuration
Create a config.yaml:
seed_urls:
  - https://example.com
max_depth: 5
enable_js: true
distributed: true
redis_url: redis://localhost:6379
extractors:
  - text
  - pdf
  - office
  - images
ml_features:
  trap_detection: true
  soft404_detection: true
  content_extraction: true
Run with config:
deepharvest crawl --config config.yaml
Testing
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# With coverage
pytest --cov=deepharvest tests/
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
Apache-2.0 License - see LICENSE for details.
Acknowledgments
Built with amazing open-source tools:
- Playwright for JavaScript rendering
- BeautifulSoup & lxml for HTML parsing
- PyMuPDF for PDF extraction
- Redis for distributed coordination
- scikit-learn for ML models
This project was initially developed by a single contributor, and we're excited to welcome new contributors to help make DeepHarvest even better.