
DeepHarvest

The World's Most Complete, Resilient, Multilingual Web Crawler

License: Apache-2.0 · Python 3.9+ · Docker

Features

Core Capabilities

  • Complete Coverage: Crawls entire websites including all subpages
  • All Content Types: HTML, PDF, DOCX, PPTX, XLSX, images, audio, video
  • JavaScript Support: Full SPA support with Playwright
  • Multilingual: Handles all languages, encodings, and scripts
  • Distributed: Redis-based distributed crawling with multiple workers
  • Resumable: Checkpoint and resume interrupted crawls
  • Intelligent: ML-based trap detection, content extraction, deduplication

Advanced Features

  • Smart Trap Detection: Calendar, pagination, session ID, faceted navigation
  • ML Content Extraction: Page classification, soft-404 detection, quality scoring
  • Advanced URL Management: SimHash, MinHash, LSH deduplication
  • Site Graph Analysis: PageRank, clustering, GraphML export
  • Observability: Prometheus metrics, Grafana dashboards
  • Extensible: Plugin system for custom extractors
  • OSINT Mode: Entity extraction, technology detection, link graph analysis
  • Browser Automation: High-level Playwright integration with screenshot capture
  • Pipeline Execution: YAML-based pipeline runner for complex workflows
  • API Server: REST API for programmatic access
  • Multiple Exporters: JSONL, Parquet, SQLite, VectorDB (FAISS/Chroma) support

Quick Start

Installation

pip install deepharvest

Basic Usage

Simple Crawls

# Basic crawl with depth limit
deepharvest crawl https://example.com --depth 5 --output ./output

# Crawl without JavaScript rendering (faster)
deepharvest crawl https://example.com --no-js --depth 3

# Crawl with JavaScript rendering (for SPAs)
deepharvest crawl https://example.com --js --depth 3

Limiting Crawl Scope

# Limit total number of URLs crawled
deepharvest crawl https://example.com --max-urls 1000 --depth 5

# Limit response size (skip large files)
deepharvest crawl https://example.com --max-size 10 --depth 3

# Limit pages per domain (useful for multi-domain crawls)
deepharvest crawl https://example.com --max-pages-per-domain 50 --depth 5

# Set time limit (stop after specified seconds)
deepharvest crawl https://example.com --time-limit 3600 --depth 5

# Combine multiple limits
deepharvest crawl https://example.com \
  --depth 5 \
  --max-urls 500 \
  --max-pages-per-domain 100 \
  --max-size 5 \
  --time-limit 1800 \
  --output ./output

Distributed Crawling

# Run in distributed mode with Redis
deepharvest crawl https://example.com \
  --distributed \
  --redis-url redis://localhost:6379 \
  --workers 5 \
  --depth 10

Using Configuration Files

# Use a YAML config file
deepharvest crawl --config config.yaml

OSINT Mode

# Basic OSINT collection
deepharvest osint https://example.com

# With JSON output and link graph
deepharvest osint https://example.com --json --graph

# With screenshots
deepharvest osint https://example.com --screenshot

API Server

# Start API server
deepharvest serve --host 0.0.0.0 --port 8000

Pipeline Execution

# Run a pipeline from YAML file
deepharvest run pipeline.yaml

Python API

import asyncio
from deepharvest import DeepHarvest, CrawlConfig

async def main():
    config = CrawlConfig(
        seed_urls=["https://example.com"],
        max_depth=5,
        enable_js=True
    )
    
    crawler = DeepHarvest(config)
    await crawler.initialize()
    await crawler.crawl()
    await crawler.shutdown()

asyncio.run(main())

Installation

From PyPI

pip install deepharvest

From Source

git clone https://github.com/deepharvest/deepharvest
cd deepharvest
pip install -e .

Using Docker

docker-compose up

Documentation

Comprehensive documentation is available in the docs/ directory.

Architecture

┌─────────────────────────────────────────────────────────┐
│                    DeepHarvest Core                       │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Frontier   │  │   Fetcher    │  │  JS Renderer  │  │
│  │  (BFS/DFS)  │  │  (HTTP/2)    │  │  (Playwright) │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Extractors  │  │  Trap Det.   │  │  URL Dedup    │  │
│  │  (50+ fmt)  │  │  (ML+Rules)  │  │  (SimHash)    │  │
│  └─────────────┘  └──────────────┘  └───────────────┘  │
├─────────────────────────────────────────────────────────┤
│                  Distributed Layer                       │
│  ┌──────────┐  ┌───────────┐  ┌──────────┐            │
│  │  Redis   │  │  Workers  │  │ Storage  │            │
│  │ Frontier │  │  (N proc) │  │ (S3/FS)  │            │
│  └──────────┘  └───────────┘  └──────────┘            │
└─────────────────────────────────────────────────────────┘

How It Works

DeepHarvest operates as a distributed web crawling system that systematically discovers, fetches, and extracts content from websites. The architecture follows a modular design with clear separation of concerns.

Core Workflow

  1. Initialization: The crawler initializes components (frontier, fetcher, extractors, ML models) based on configuration.

  2. URL Management (Frontier): A priority queue manages URLs to be crawled. Supports BFS, DFS, and priority-based strategies. In distributed mode, Redis coordinates URL distribution across workers.

  3. Content Fetching: The fetcher downloads pages with retry logic, timeout handling, and rate limiting, attempting HTTP/2 and falling back to HTTP/1.1 when unavailable.

  4. HTML Parsing: Multi-strategy parser with fallback chain (lxml → html5lib → html.parser) ensures robust parsing of malformed HTML.

  5. JavaScript Rendering: For Single Page Applications (SPAs), Playwright renders pages, executes JavaScript, handles infinite scroll, and captures the final DOM state.

  6. Content Extraction: Specialized extractors process different content types:

    • Text: HTML text extraction with boilerplate removal
    • Documents: PDF, DOCX, PPTX, XLSX text extraction
    • Media: Image metadata, OCR, audio transcription, video metadata
    • Structured Data: JSON-LD, Microdata, OpenGraph, Schema.org
  7. Link Discovery: Advanced link extractor finds URLs from multiple sources:

    • HTML attributes (href, src, srcset)
    • JavaScript code (router.push, window.location)
    • Structured data (JSON-LD, Microdata)
    • Meta tags and data URIs
  8. Deduplication: Three-tier deduplication system:

    • SHA256: Exact URL/content duplicates
    • SimHash: Near-duplicate detection (64-bit hashing)
    • MinHash LSH: Scalable similarity search for large datasets
  9. Trap Detection: ML and rule-based detection prevents infinite loops from:

    • Calendar-based URLs (date patterns)
    • Session ID parameters
    • Pagination traps
    • Query parameter explosions
  10. Storage: Extracted content is stored with metadata. Supports filesystem, S3, and PostgreSQL backends.
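To make the deduplication step concrete, here is a minimal, self-contained sketch of 64-bit SimHash near-duplicate detection in pure Python. This is an illustration of the general technique, not DeepHarvest's actual implementation; the tokenization and hash choice (word-level features, MD5) are assumptions for the example.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over word-level features."""
    votes = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash of each token (first 8 bytes of MD5).
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the majority vote across all tokens.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"  # near-duplicate
doc3 = "completely different content about web crawling"

d_near = hamming(simhash(doc1), simhash(doc2))
d_far = hamming(simhash(doc1), simhash(doc3))
print(d_near, d_far)  # near-duplicates differ in far fewer bits
```

Because similar documents share most token votes, their fingerprints differ in only a few bits, so a small Hamming-distance threshold separates near-duplicates from unrelated pages.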

Distributed Architecture

In distributed mode, multiple workers share a Redis-based frontier. Each worker:

  • Pulls URLs from the shared queue
  • Processes pages independently
  • Respects per-host concurrency limits
  • Reports metrics to centralized monitoring

This enables near-linear scaling: N workers deliver roughly N times the throughput of a single worker, until per-host politeness limits or the shared frontier become the bottleneck.
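Within a single worker, the per-host concurrency limit mentioned above can be sketched with asyncio semaphores. This is a simplified illustration, not DeepHarvest's code: the `fetch` stub stands in for a real HTTP request, and in the actual system the cross-worker coordination lives in Redis.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_HOST = 2  # illustrative per-host concurrency cap

# Lazily create one semaphore per hostname.
host_locks: dict = defaultdict(lambda: asyncio.Semaphore(MAX_PER_HOST))

async def fetch(url: str) -> str:
    """Stand-in for a real HTTP fetch."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def polite_fetch(url: str) -> str:
    host = urlparse(url).netloc
    async with host_locks[host]:  # at most MAX_PER_HOST in flight per host
        return await fetch(url)

async def main() -> list:
    urls = [f"https://example.com/page{i}" for i in range(5)]
    return await asyncio.gather(*(polite_fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 5
```

The semaphore caps in-flight requests per hostname while still letting requests to different hosts proceed in parallel.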

Resilience Features

  • Parser Fallback: Automatic fallback between parsers when HTML is malformed
  • Network Resilience: Exponential backoff retry, timeout handling, proxy support
  • Memory Management: Streaming for large files, memory guards per worker
  • Checkpointing: Periodic state saves enable resuming interrupted crawls
  • Error Taxonomy: Structured error handling with detailed reporting
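The exponential-backoff retry pattern from the list above can be sketched as follows. The `flaky` fetcher is a hypothetical stand-in that fails twice before succeeding; the retry counts and delays are illustrative, not DeepHarvest's defaults.

```python
import random
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=0.5):
    """Retry a fetch callable with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # give up after the final attempt
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus jitter
            # to avoid synchronized retry storms across workers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Hypothetical flaky fetcher: fails twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "200 OK"

print(fetch_with_retry(flaky, "https://example.com", base_delay=0.01))  # 200 OK
```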

Machine Learning Integration

  • Page Classification: Identifies page types (article, product, forum, etc.) for intelligent prioritization
  • Soft-404 Detection: Identifies pages that return 200 but are effectively 404s
  • Quality Scoring: ML-based content quality assessment
  • Trap Detection: Pattern recognition for crawler traps
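To illustrate the soft-404 idea, here is a tiny rule-of-thumb sketch. DeepHarvest's actual detector is ML-based; the phrase list and length threshold below are assumptions chosen only to show the concept of a page that returns HTTP 200 but reads like an error page.

```python
NOT_FOUND_PHRASES = [
    "page not found", "404", "does not exist",
    "no longer available", "nothing was found",
]

def looks_like_soft_404(status: int, title: str, body: str) -> bool:
    """Heuristic: HTTP 200, but the content reads like an error page."""
    if status != 200:
        return False  # a real 404 is not a *soft* 404
    text = f"{title} {body}".lower()
    hits = sum(1 for phrase in NOT_FOUND_PHRASES if phrase in text)
    # Short bodies with error phrasing are a strong signal.
    return hits >= 1 and len(body) < 500

print(looks_like_soft_404(200, "Page Not Found", "Sorry, ..."))  # True
print(looks_like_soft_404(200, "Product Catalog", "x" * 2000))   # False
```

A learned classifier generalizes far better than phrase lists, but the input signals (status code, title, body length, error phrasing) are the same.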

Multilingual Support

  • Automatic encoding detection (charset_normalizer, chardet)
  • Language detection (langdetect)
  • CJK (Chinese, Japanese, Korean) text processing
  • RTL (Right-to-Left) language support
  • Unicode normalization
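Unicode normalization matters for deduplication and search because the same visible string can have multiple byte-level encodings. A minimal sketch using only the standard library's `unicodedata` (not DeepHarvest's internals):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """NFC-normalize so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)

# "é" as a single code point vs. "e" + combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)                                  # False
print(normalize_text(composed) == normalize_text(decomposed))  # True
```

Without this step, "café" extracted from two pages could hash to different fingerprints and defeat exact-duplicate detection.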

Use Cases

  • Research: Academic data collection and analysis
  • SEO: Site auditing and competitive analysis
  • Media Monitoring: News and content aggregation
  • Business Intelligence: Market research and data mining
  • AI Training: Dataset creation for ML models
  • Analytics: Web structure and link analysis

Configuration

Create a config.yaml:

seed_urls:
  - https://example.com
max_depth: 5
enable_js: true
distributed: true
redis_url: redis://localhost:6379
extractors:
  - text
  - pdf
  - office
  - images
ml_features:
  trap_detection: true
  soft404_detection: true
  content_extraction: true

Run with config:

deepharvest crawl --config config.yaml

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# With coverage
pytest --cov=deepharvest tests/

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

Apache-2.0 License - see LICENSE for details.

Acknowledgments

Built with amazing open-source tools:

  • Playwright for JavaScript rendering
  • BeautifulSoup & lxml for HTML parsing
  • PyMuPDF for PDF extraction
  • Redis for distributed coordination
  • scikit-learn for ML models

This project was initially developed by a single contributor, and we're excited to welcome new contributors to help make DeepHarvest even better.
