
DeepHarvest

The World's Most Complete, Resilient, Multilingual Web Crawler

License: Apache-2.0 · Python 3.9+ · Docker

Features

Core Capabilities

  • Complete Coverage: Crawls entire websites including all subpages
  • All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
  • JavaScript Support: Full SPA support with Playwright
  • Multilingual: Handles all languages, encodings, and scripts
  • Distributed: Redis-based distributed crawling with multiple workers
  • Resumable: Checkpoint and resume interrupted crawls
  • Intelligent: ML-based trap detection, content extraction, deduplication

Advanced Features

  • Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
  • ML Content Extraction: Page classification, soft-404 detection, quality scoring
  • Advanced URL Management: SimHash, MinHash, LSH deduplication
  • Site Graph Analysis: PageRank, clustering, GraphML export
  • Observability: Prometheus metrics, Grafana dashboards
  • Extensible: Plugin system for custom extractors
  • OSINT Mode: Entity extraction, technology detection, link graph analysis
  • Browser Automation: High-level Playwright integration with screenshot capture
  • Pipeline Execution: YAML-based pipeline runner for complex workflows
  • API Server: REST API for programmatic access
  • Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support

Quick Start

Installation

pip install deepharvest

Basic Usage

Simple Crawls

# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output

# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3

# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3

Limiting Crawl Scope

# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5

# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3

# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5

# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5

# Combine multiple limits
deepharvest crawl https://example.com \
  --depth 5 \
  --max-urls 500 \
  --max-pages-per-domain 100 \
  --max-size 5 \
  --time-limit 1800 \
  --output ./output

Distributed Crawling

# Run in distributed mode with Redis
deepharvest crawl https://example.com \
  --distributed \
  --redis-url redis://localhost:6379 \
  --workers 5 \
  --depth 10

Using Configuration Files

# Use a YAML config file
deepharvest crawl --config config.yaml

OSINT Mode

# Basic OSINT collection
deepharvest osint https://example.com

# With JSON output and link graph
deepharvest osint https://example.com --json --graph

# With screenshots
deepharvest osint https://example.com --screenshot

API Server

# Start API server
deepharvest serve --host 0.0.0.0 --port 8000

Pipeline Execution

# Run a pipeline from YAML file
deepharvest run pipeline.yaml

Python API

import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True
    )
    
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())

Installation

From PyPI

pip install deepharvest

From Source

git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .

Using Docker

docker-compose up

Documentation

Comprehensive documentation is available in the docs/ directory.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    DeepHarvest Core                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Frontier   │  │   Fetcher    │  │  JS Renderer  │  │
│  │  (BFS/DFS)  │  │  (HTTP/2)    │  │  (Playwright) │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Extractors  │  │  Trap Det.   │  │  URL Dedup    │  │
│  │  (50+ fmt)  │  │  (ML+Rules)  │  │  (SimHash)    │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
├─────────────────────────────────────────────────────────┤
│                  Distributed Layer                       │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐            │
│  │  Redis   │  │  Workers  │  │ Storage  │            │
│  │ Frontier │  │  (N proc) │  │ (S3/FS)  │            │
│  └──────────┘  └───────────┘  └──────────┘            │
└─────────────────────────────────────────────────────────┘

How It Works

DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.

Core Workflow

  1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.

  2. URL Management (Frontier): A priority queue manages URLs to be crawled. Supports BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.

  3. Content Fetching: The fetcher downloads pages with retry logic, timeout handling, and rate limiting, attempting HTTP/2 and falling back to HTTP/1.1 when unavailable.

  4. HTML Parsing: Multi-strategy parser with fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.

  5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.

  6. Content Extraction: Specialized extractors process different content types:

    • Text: HTML text extraction with boilerplate removal
    • Documents: PDF, DOCX, PPTX, XLSX text extraction
    • Media: Image metadata, OCR, audio transcription, video metadata
    • Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
  7. Link Discovery: Advanced link extractor finds URLs from multiple sources:

    • HTML attributes (href, src, srcset)
    • JavaScript code (router.push, window.location)
    • Structured data (JSON-LD, Microdata)
    • Meta tags and data URIs
  8. Deduplication: Three-tier deduplication system:

    • SHA256: Exact URL/content duplicates
    • SimHash: Near-duplicate detection (64-bit hashing)
    • MinHash LSH: Scalable similarity search for large datasets
  9. Trap Detection: ML and rule-based detection prevents infinite loops from:

    • Calendar-based URLs (date patterns)
    • Session ID parameters
    • Pagination traps
    • Query parameter explosions
  10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.
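To make the deduplication step concrete, here is a minimal, self-contained sketch of 64-bit SimHash near-duplicate detection in pure Python. This is an illustration of the general technique, not DeepHarvest's actual implementation; the tokenization and hash choice (word-level features, MD5) are assumptions for the example.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over word-level features."""
    votes = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash of each token (first 8 bytes of MD5).
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the majority vote across all tokens.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"  # near-duplicate
doc3 = "completely different content about web crawling"

d_near = hamming(simhash(doc1), simhash(doc2))
d_far = hamming(simhash(doc1), simhash(doc3))
print(d_near, d_far)  # near-duplicates differ in far fewer bits
```

Because similar documents share most token votes, their fingerprints differ in only a few bits, so a small Hamming-distance threshold separates near-duplicates from unrelated pages.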

Distributed Architecture

In distributed mode, multiple workers share a Redis-based frontier. Each worker:

  • Pulls URLs from the shared queue
  • Processes pages independently
  • Respects per-host concurrency limits
  • Reports metrics to centralized monitoring

This enables near-linear scaling: N workers deliver roughly N times the throughput of a single worker, until per-host politeness limits or the shared frontier become the bottleneck.
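Within a single worker, the per-host concurrency limit mentioned above can be sketched with asyncio semaphores. This is a simplified illustration, not DeepHarvest's code: the `fetch` stub stands in for a real HTTP request, and in the actual system the cross-worker coordination lives in Redis.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_HOST = 2  # illustrative per-host concurrency cap

# Lazily create one semaphore per hostname.
host_locks: dict = defaultdict(lambda: asyncio.Semaphore(MAX_PER_HOST))

async def fetch(url: str) -> str:
    """Stand-in for a real HTTP fetch."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def polite_fetch(url: str) -> str:
    host = urlparse(url).netloc
    async with host_locks[host]:  # at most MAX_PER_HOST in flight per host
        return await fetch(url)

async def main() -> list:
    urls = [f"https://example.com/page{i}" for i in range(5)]
    return await asyncio.gather(*(polite_fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 5
```

The semaphore caps in-flight requests per hostname while still letting requests to different hosts proceed in parallel.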

Resilience Features

  • Parser Fallback: Automatic fallback between parsers when HTML is malformed
  • Network Resilience: Exponential backoff retry, timeout handling, proxy support
  • Memory Management: Streaming for large files, memory guards per worker
  • Checkpointing: Periodic state saves enable resuming interrupted crawls
  • Error Taxonomy: Structured error handling with detailed reporting
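The exponential-backoff retry pattern from the list above can be sketched as follows. The `flaky` fetcher is a hypothetical stand-in that fails twice before succeeding; the retry counts and delays are illustrative, not DeepHarvest's defaults.

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=0.5):
    """Retry a fetch callable with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # give up after the final attempt
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus jitter
            # to avoid synchronized retry storms across workers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "200 OK"

print(fetch_with_retry(flaky, "https://example.com", base_delay=0.01))  # 200 OK
```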

Machine Learning Integration

  • Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
  • Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
  • Quality Scoring: ML-based content quality assessment
  • Trap Detection: Pattern recognition for crawler traps
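To illustrate the soft-404 idea, here is a tiny rule-of-thumb sketch. DeepHarvest's actual detector is ML-based; the phrase list and length threshold below are assumptions chosen only to show the concept of a page that returns HTTP 200 but reads like an error page.

```python
NOT_FOUND_PHRASES = [
    "page not found", "404", "does not exist",
    "no longer available", "nothing was found",
]

def looks_like_soft_404(status: int, title: str, body: str) -> bool:
    """Heuristic: HTTP 200, but the content reads like an error page."""
    if status != 200:
        return False  # a real 404 is not a *soft* 404
    text = f"{title} {body}".lower()
    hits = sum(1 for phrase in NOT_FOUND_PHRASES if phrase in text)
    # Short bodies with error phrasing are a strong signal.
    return hits >= 1 and len(body) < 500

print(looks_like_soft_404(200, "Page Not Found", "Sorry, ..."))  # True
print(looks_like_soft_404(200, "Product Catalog", "x" * 2000))   # False
```

A learned classifier generalizes far better than phrase lists, but the input signals (status code, title, body length, error phrasing) are the same.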

Multilingual Support

  • Automatic encoding detection (charset_normalizer, chardet)
  • Language detection (langdetect)
  • CJK (Chinese, Japanese, Korean) text processing
  • RTL (Right-to-Left) language support
  • Unicode normalization
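Unicode normalization matters for deduplication and search because the same visible string can have multiple byte-level encodings. A minimal sketch using only the standard library's `unicodedata` (not DeepHarvest's internals):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """NFC-normalize so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)

# "é" as a single code point vs. "e" + combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)                                  # False
print(normalize_text(composed) == normalize_text(decomposed))  # True
```

Without this step, "café" extracted from two pages could hash to different fingerprints and defeat exact-duplicate detection.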

Use Cases

  • Research: Academic data collection and analysis
  • SEO: Site auditing and competitive analysis
  • Media Monitoring: News and content aggregation
  • Business Intelligence: Market research and data mining
  • AI Training: Dataset creation for ML models
  • Analytics: Web structure and link analysis

Configuration

Create a config.yaml:

seed_urls:
  - https://example.com
max_depth: 5
enable_js: true
distributed: true
redis_url: redis://localhost:6379
extractors:
  - text
  - pdf
  - office
  - images
ml_features:
  trap_detection: true
  soft404_detection: true
  content_extraction: true

Run with config:

deepharvest crawl --config config.yaml

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# With coverage
pytest --cov=deepharvest tests/

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

Apache-2.0 License - see LICENSE for details.

Acknowledgments

Built with amazing open-source tools:

  • Playwright for JavaScript rendering
  • BeautifulSoup & lxml for HTML parsing
  • PyMuPDF for PDF extraction
  • Redis for distributed coordination
  • scikit-learn for ML models

This project was initially developed by a single contributor, and we're excited to welcome new contributors to help make DeepHarvest even better.
