
A sync-first library for building website graphs, as simple as requests!

Project description

GraphCrawler Logo

GraphCrawler

Enterprise-Grade Web Crawling Framework for Graph-Based Site Analysis


Documentation · API Reference


Why GraphCrawler?

Modern web applications require sophisticated crawling solutions. GraphCrawler was built from the ground up with Clean Architecture principles, offering unmatched flexibility and performance for web analysis tasks.

🎯 Built for Scale

  • Process 1M+ pages with low-memory mode
  • Distributed crawling via Celery workers
  • Automatic rate limiting & autothrottle
  • Checkpoint/resume for long-running jobs
  • Python 3.14 free-threading support (3.2x faster)

🧩 Extensible by Design

  • Plugin architecture for custom logic
  • Multiple storage backends
  • Swappable transport drivers
  • Event-driven processing pipeline
  • Webhook notifications for real-time updates

🛡️ Production Ready

  • Battle-tested in enterprise environments
  • Comprehensive error handling
  • SSRF protection built-in
  • Full type annotations (mypy strict)
  • Anti-bot bypass (Cloudflare, DataDome, PerimeterX)

🤖 AI-Native

  • Integrated LLM extraction (OpenAI, Anthropic, Bedrock)
  • Vector embeddings for semantic search
  • Smart content classification
  • ML-powered URL prioritization
  • CAPTCHA solving integration

What's New in v4.0

🚀 Python 3.14 Free-Threading Support

# Enable free-threading for maximum speed
export PYTHON_GIL=0
python your_script.py

Performance Results:

  • 2-4x faster HTML parsing
  • 🚀 3.2x faster end-to-end crawling
  • 📉 16% less memory usage
  • ⏱️ 30% faster startup

🌱 Multiple Seed URLs

graph = gc.crawl(
    seed_urls=[
        "https://example.com/products/",
        "https://example.com/blog/",
        "https://example.com/docs/",
    ],
    max_depth=3
)

🔄 Incremental Crawling

# Start with initial crawl
graph1 = gc.crawl("https://example.com", max_pages=50)

# Later, continue from where you left off
graph2 = gc.crawl(base_graph=graph1, max_pages=100)

📡 Real-Time Dashboard & WebSocket

# Start dashboard server
uvicorn graph_crawler.api.dashboard:app --port 8000

# WebSocket endpoint for live updates
# ws://localhost:8000/ws/crawl

Installation

From PyPI (Recommended)

pip install graph-crawler

Optional Dependencies

# JavaScript/SPA rendering
pip install graph-crawler[playwright]

# Vector embeddings & ML
pip install graph-crawler[embeddings]

# MongoDB storage backend
pip install graph-crawler[mongodb]

# PostgreSQL storage backend  
pip install graph-crawler[postgresql]

# Distributed crawling (Celery)
pip install graph-crawler[distributed]

# Full installation
pip install graph-crawler[all]

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| Python | 3.11 | 3.14+ (free-threading) |
| Memory | 512 MB | 4 GB+ |
| OS | Linux, macOS, Windows | Linux (Ubuntu 22.04+) |

Quick Start

Basic Usage

import graph_crawler as gc

# Crawl a website
graph = gc.crawl(
    url="https://example.com",
    max_depth=3,
    max_pages=100
)

# Analyze results
print(f"Discovered {len(graph.nodes):,} pages")
print(f"Mapped {len(graph.edges):,} links")

# Persist to disk
gc.save_graph(graph, "example_graph.json")

Async API

import asyncio
import graph_crawler as gc

async def main():
    graph = await gc.async_crawl(
        url="https://example.com",
        max_depth=5,
        max_pages=1000,
        request_delay=0.25
    )
    
    # Process nodes concurrently
    async for node in graph.iter_nodes_async():
        print(f"[{node.response_status}] {node.url}")
    
    return graph

graph = asyncio.run(main())

Client Interface

import asyncio

from graph_crawler import GraphCrawlerClient

async def main():
    async with GraphCrawlerClient(
        driver="playwright",
        storage="sqlite"
    ) as client:
        # Configure and execute
        graph = await client.crawl(
            "https://spa-application.com",
            max_depth=4
        )

        # Save with metadata
        await client.save("spa_graph", graph, tags=["spa", "react"])

asyncio.run(main())

Core Concepts

Graph Model

GraphCrawler represents websites as directed graphs:

┌─────────────────────────────────────────────────────────┐
│                        Graph                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐             │
│  │  Node   │───▶│  Node   │───▶│  Node   │             │
│  │ (root)  │    │ /about  │    │ /team   │             │
│  └─────────┘    └─────────┘    └─────────┘             │
│       │              │                                   │
│       │         ┌────┴────┐                             │
│       ▼         ▼         ▼                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                 │
│  │  Node   │  │  Node   │  │  Node   │                 │
│  │ /blog   │  │/contact │  │/careers │                 │
│  └─────────┘  └─────────┘  └─────────┘                 │
└─────────────────────────────────────────────────────────┘

Node — represents a single page with:

  • URL, depth, status code
  • Content hash (SHA-256) and SimHash
  • Metadata (title, description, headers)
  • Custom user data

Edge — represents a link between pages:

  • Source and target URLs
  • Link text and attributes
  • Edge type (internal, external, resource)
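
A quick way to see these fields in practice is to iterate a crawled graph. The sketch below sticks to attributes used elsewhere in this README (url, depth, response_status, user_data, graph.edges); the per-edge attribute names are an assumption and may differ.

import graph_crawler as gc

graph = gc.crawl("https://example.com", max_depth=2, max_pages=25)

# Node attributes used throughout this README
for node in graph:
    print(node.url, node.depth, node.response_status)
    print(node.user_data)  # custom data filled in by plugins

# graph.edges is shown above; individual edge attribute names are assumed
for edge in graph.edges:
    print(edge)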

Graph Operations

# Graph algebra
merged = graph1 + graph2           # Union
diff = graph1 - graph2             # Difference  
common = graph1 & graph2           # Intersection
symmetric = graph1 ^ graph2        # Symmetric difference

# Subgraph detection
if graph1 < graph2:
    print("graph1 is a subgraph of graph2")

# Find popular pages
popular = graph.get_popular_nodes(top_n=10, by='in_degree')

# Find orphan pages (no incoming links)
for node in graph:
    if graph.get_in_degree(node) == 0 and node.depth > 0:
        print(f"Orphan: {node.url}")

Configuration

Crawl Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | str | required | Starting URL |
| seed_urls | list[str] | None | Multiple starting URLs |
| max_depth | int | 3 | Maximum link depth from root |
| max_pages | int | 100 | Maximum pages to crawl |
| same_domain | bool | True | Restrict to starting domain |
| request_delay | float | 0.5 | Delay between requests (seconds) |
| timeout | int | 300 | Global timeout (seconds) |
| driver | str | "http" | Transport driver |
| url_rules | list[URLRule] | [] | URL filtering/priority rules |
| node_plugins | list[Plugin] | [] | Content processing plugins |
| respect_robots | bool | True | Honor robots.txt |
| base_graph | Graph | None | Continue incremental crawl |
| low_memory_mode | bool | False | Enable eviction for large crawls |
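
As a quick illustration, the call below combines several parameters from the table above; all names come from the table and the values are arbitrary.

import graph_crawler as gc

graph = gc.crawl(
    url="https://example.com",
    max_depth=4,
    max_pages=5_000,
    same_domain=True,
    request_delay=0.25,
    respect_robots=True,
    driver="http",
)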

URL Rules

from graph_crawler import URLRule, SmartURLRule, build_smart_rules

# Pattern-based filtering
rules = [
    # Skip non-HTML resources
    URLRule(pattern=r"\.(pdf|zip|exe|dmg)$", should_scan=False),
    
    # Skip admin areas
    URLRule(pattern=r"/(admin|wp-admin|dashboard)/", should_scan=False),
    
    # Prioritize product pages
    URLRule(pattern=r"/products?/[\w-]+$", priority=10),
    
    # Deprioritize pagination
    URLRule(pattern=r"\?page=\d+", priority=-5),
]

# Or use smart presets
rules = build_smart_rules(
    skip_extensions=[".pdf", ".zip"],
    skip_paths=["/admin", "/api"],
    priority_paths=["/products", "/categories"],
    skip_query_params=["session", "token"]
)

graph = gc.crawl("https://shop.example.com", url_rules=rules)

Settings Files

# crawler_settings.yaml
project_name: "my_crawler"

crawler:
  max_depth: 5
  max_pages: 10000
  request_delay: 0.25
  timeout: 3600
  respect_robots: true

driver:
  type: playwright
  headless: true
  wait_for: networkidle

storage:
  type: sqlite
  path: ./data/crawl.db

plugins:
  - graph_crawler.extensions.plugins.node.SEOPlugin
  - graph_crawler.extensions.plugins.node.StructuredDataPlugin
Load the settings in Python:

from graph_crawler import CrawlerSettings

settings = CrawlerSettings.from_yaml("crawler_settings.yaml")
graph = gc.crawl("https://example.com", settings=settings)

Drivers

GraphCrawler supports multiple transport drivers:

Basic Drivers

| Driver | Engine | Best For |
|--------|--------|----------|
| http | httpx | Static sites, APIs |
| async | aiohttp | High concurrency |
| playwright | Chromium | SPAs, modern sites |
| cloudscraper | requests | Cloudflare sites |

Professional Anti-Bot Drivers (v4.1+)

| Driver | Engine | Best For | Bypasses |
|--------|--------|----------|----------|
| undetected | undetected-chromedriver | Enterprise anti-bot | Cloudflare, DataDome, PerimeterX, Imperva, Kasada |
| nodriver | nodriver | Async Cloudflare bypass | Cloudflare Turnstile, DataDome |
| tls | curl-cffi | Fast TLS fingerprint | DataDome, PerimeterX (no JS challenges) |
| botasaurus | botasaurus | Enterprise scraping | All major anti-bot systems |

Undetected Chrome Driver

Professional anti-bot driver based on undetected-chromedriver. Automatically removes automation flags and simulates human behavior.

graph = gc.crawl(
    "https://cloudflare-protected.com",
    driver="undetected",
    driver_config={
        "headless": True,
        "proxy": "http://user:pass@proxy:8080",
        "human_behavior": True,
        "page_load_timeout": 60
    }
)

NoDriver (Async)

Async anti-bot driver with built-in Cloudflare Turnstile solver:

graph = await gc.async_crawl(
    "https://turnstile-protected.com",
    driver="nodriver",
    driver_config={
        "headless": True,
        "cf_auto_solve": True,
        "cf_wait_timeout": 30,
        "human_behavior": True
    }
)

TLS Fingerprint Client

Fast HTTP client that impersonates real browser TLS fingerprints. Best for sites without JavaScript challenges:

graph = gc.crawl(
    "https://datadome-protected.com",
    driver="tls",
    driver_config={
        "impersonate": "chrome131",  # chrome110-131, firefox117-120, safari15-17
        "proxy": "http://user:pass@proxy:8080",
        "max_retries": 3
    }
)

Botasaurus Driver

Enterprise-grade anti-bot framework with automatic bypass:

graph = gc.crawl(
    "https://enterprise-protected.com",
    driver="botasaurus",
    driver_config={
        "headless": True,
        "block_images": True,
        "proxy": "http://user:pass@proxy:8080"
    }
)

Playwright Driver

graph = gc.crawl(
    "https://react-application.com",
    driver="playwright",
    
    # Playwright-specific options
    headless=True,
    wait_for_selector=".app-loaded",
    wait_for_timeout=5000,
    viewport={"width": 1920, "height": 1080},
    
    # Browser context
    ignore_https_errors=True,
    java_script_enabled=True
)

Storage Backends

| Backend | Capacity | Query Speed | Use Case |
|---------|----------|-------------|----------|
| memory | ~10K nodes | ⚡ Fastest | Development, testing |
| json | ~50K nodes | 🐌 Slow | Small projects, export |
| sqlite | ~500K nodes | ⚡ Fast | Local production |
| mongodb | Unlimited | ⚡ Fast | Distributed, cloud |
| postgresql | Unlimited | ⚡ Fast | Analytics, enterprise |
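
Backends are selected by name, as in the client example earlier. A minimal sketch, assuming the same storage argument accepts the backend names listed in the table:

import asyncio

from graph_crawler import GraphCrawlerClient

async def main():
    # Backend names per the table above: "memory", "json", "sqlite", "mongodb", "postgresql"
    async with GraphCrawlerClient(driver="http", storage="sqlite") as client:
        graph = await client.crawl("https://example.com", max_depth=3)
        await client.save("example_graph", graph)

asyncio.run(main())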

Low-Memory Mode

For crawling sites with millions of pages:

graph = gc.crawl(
    "https://large-site.com",
    max_pages=1_000_000,
    low_memory_mode=True,          # Enable eviction
    eviction_threshold=50_000,     # Nodes in RAM
    eviction_storage="sqlite"      # Where to evict
)

Plugin System

Plugin Types

| Type | Hook Point | Use Case |
|------|------------|----------|
| NodePlugin | After page fetch | Content extraction, SEO analysis |
| EdgePlugin | After link discovery | Link classification, filtering |
| EnginePlugin | Before/during crawl | URL prioritization, rate limiting |
| ExportPlugin | During export | Data transformation |

Built-in Node Plugins

from graph_crawler.extensions.plugins.node import (
    StructuredDataPlugin,  # JSON-LD, OpenGraph, Microdata, RDFa, Twitter Cards
    SEOPlugin,             # Meta tags, headings, schema
    ContentHashPlugin,     # Duplicate detection (SHA-256 + SimHash)
    VectorizationPlugin,   # Text embeddings
)

from graph_crawler.extensions.plugins.node.extractors import (
    PhoneExtractorPlugin,  # UA, US, RU phone formats
    EmailExtractorPlugin,  # RFC 5322 compliant
    PriceExtractorPlugin,  # USD, EUR, UAH with ranges
)

graph = gc.crawl(
    "https://shop.example.com",
    node_plugins=[
        StructuredDataPlugin(),
        PhoneExtractorPlugin(),
        EmailExtractorPlugin(),
        PriceExtractorPlugin(),
    ]
)

# Access extracted data
for node in graph:
    print(f"Page: {node.url}")
    print(f"  Phones: {node.user_data.get('phones', [])}")
    print(f"  Emails: {node.user_data.get('emails', [])}")
    print(f"  Prices: {node.user_data.get('prices', [])}")
    print(f"  OpenGraph: {node.user_data.get('opengraph', {})}")

Custom Plugin

import re

from graph_crawler import BaseNodePlugin, NodePluginType

class PriceMonitorPlugin(BaseNodePlugin):
    """Monitor product prices across e-commerce sites."""
    
    plugin_type = NodePluginType.ON_HTML_PARSED
    priority = 100
    
    def execute(self, context):
        if not context.html_tree:
            return context
        
        # Extract price using CSS selectors
        price_elem = context.html_tree.select_one('[data-price], .price, #price')
        if price_elem:
            context.user_data["price"] = {
                "raw": price_elem.get_text(strip=True),
                "currency": self._detect_currency(price_elem),
                "value": self._parse_price(price_elem)
            }
        
        return context

    def _detect_currency(self, elem):
        # Minimal helper for the example: detect a currency from common symbols
        text = elem.get_text()
        for symbol, code in (("$", "USD"), ("€", "EUR"), ("₴", "UAH")):
            if symbol in text:
                return code
        return None

    def _parse_price(self, elem):
        # Minimal helper for the example: first number found in the text, if any
        match = re.search(r"\d+(?:[.,]\d+)?", elem.get_text())
        return float(match.group().replace(",", ".")) if match else None

graph = gc.crawl(url, node_plugins=[PriceMonitorPlugin()])

Smart Crawling (ML)

SmartPageFinderPlugin

An ML-powered plugin that uses an LLM or keyword analysis to find relevant pages:

from graph_crawler.extensions.plugins.node import SmartPageFinderPlugin

plugin = SmartPageFinderPlugin(
    search_prompt="Python developer jobs in Kyiv",
    config={
        "min_relevance_score": 0.7,
        "analyze_links": True,
        "model": "gpt-4o-mini"  # Optional LLM
    }
)

graph = gc.crawl("https://jobs.example.com", node_plugins=[plugin])

# Find target pages
for node in graph:
    if node.user_data.get("is_target_page"):
        print(f"Found: {node.url}")
        print(f"  Score: {node.user_data['relevance_score']:.2f}")
        print(f"  Reason: {node.user_data['relevance_reason']}")

VectorCrawlEnginePlugin

Vector-based URL prioritization using embeddings:

from graph_crawler.extensions.plugins.crawl_engine import VectorCrawlEnginePlugin

plugin = VectorCrawlEnginePlugin(
    keywords=["python", "developer", "remote", "jobs"],
    min_priority=1,
    max_priority=15,
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)
plugin.setup()  # Load model

graph = gc.crawl(
    "https://careers.example.com",
    engine_plugins=[plugin]
)

SmartCrawlEnginePlugin

Intelligent URL prioritization before scanning:

from graph_crawler.extensions.plugins.crawl_engine import SmartCrawlEnginePlugin

plugin = SmartCrawlEnginePlugin(
    search_prompt="Machine learning engineer positions",
    config={
        "aggressive_filtering": True,  # Skip irrelevant URLs without scanning
        "use_llm": False  # Use fast keyword-based analysis
    }
)

graph = gc.crawl("https://linkedin.com/jobs", engine_plugins=[plugin])

Anti-Bot & CAPTCHA

Anti-Bot Detection

GraphCrawler can detect and bypass various anti-bot systems:

from graph_crawler.extensions.plugins.engine import AntiBotSystem, detect_anti_bot_system

# Automatic detection
system = detect_anti_bot_system(html_content)
# Returns: AntiBotSystem.CLOUDFLARE, AKAMAI, DATADOME, PERIMETERX, etc.

Supported anti-bot systems:

  • Cloudflare (Challenge, Turnstile)
  • Akamai (Bot Manager)
  • DataDome
  • PerimeterX
  • Imperva/Incapsula

CAPTCHA Solving

Integration with popular CAPTCHA solving services:

from graph_crawler.extensions.plugins.engine.captcha import (
    CaptchaPlugin,
    create_solver,
    CaptchaType
)

# Create solver
solver = create_solver(
    service="2captcha",  # or "anticaptcha", "capsolver"
    api_key="your-api-key"
)

# Check balance
balance = solver.check_balance()
print(f"Balance: ${balance}")

# Use with crawler
plugin = CaptchaPlugin(solver=solver)
graph = gc.crawl(
    "https://protected-site.com",
    driver="playwright",
    engine_plugins=[plugin]
)

Supported CAPTCHA types:

  • reCAPTCHA v2/v3
  • hCaptcha
  • Image CAPTCHA

Distributed Crawling

Scale horizontally with Celery workers:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    Redis    │◀────│   Worker 1  │
└─────────────┘     │   (Broker)  │     └─────────────┘
                    └─────────────┘             │
                          ▲                     ▼
                          │               ┌───────────┐
                    ┌─────────────┐       │  MongoDB  │
                    │   Worker 2  │──────▶│ (Results) │
                    └─────────────┘       └───────────┘
                          ▲
                    ┌─────────────┐
                    │   Worker N  │
                    └─────────────┘

Setup

# Start Redis
docker run -d -p 6379:6379 redis:alpine

# Start workers
celery -A graph_crawler.infrastructure.messaging worker -l INFO -c 4

Usage

from graph_crawler import EasyDistributedCrawler

crawler = EasyDistributedCrawler(
    broker_url="redis://localhost:6379/0",
    result_backend="redis://localhost:6379/1",
    mongodb_uri="mongodb://localhost:27017"
)

# Submit crawl job
job_id = await crawler.submit(
    url="https://large-site.com",
    max_pages=100_000,
    max_depth=10,
    workers=8
)

# Monitor progress
while True:
    status = await crawler.get_status(job_id)
    print(f"Progress: {status.pages_crawled}/{status.pages_total}")
    
    if status.is_complete:
        break
    
    await asyncio.sleep(5)

# Get results
graph = await crawler.get_result(job_id)

REST API & Dashboard

REST API

Built-in FastAPI-based REST API for remote control:

# Start API server
uvicorn graph_crawler.api.rest_api:router --port 8001

Endpoints:

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/crawl/start | Start new crawl |
| POST | /api/v1/crawl/{id}/pause | Pause crawl |
| POST | /api/v1/crawl/{id}/resume | Resume crawl |
| POST | /api/v1/crawl/{id}/stop | Stop crawl |
| GET | /api/v1/crawl/{id}/status | Get crawl status |
| GET | /api/v1/crawl/list | List all crawls |
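
A crawl can then be started and polled over HTTP with any client. The sketch below uses httpx against the endpoints listed above; the request body fields, response fields, and state values are assumptions based on the crawl parameters documented earlier, so check the actual API schema.

import time

import httpx

API = "http://localhost:8001"

# Start a crawl (body fields are assumed, based on the crawl parameters above)
resp = httpx.post(f"{API}/api/v1/crawl/start", json={
    "url": "https://example.com",
    "max_depth": 3,
    "max_pages": 100,
})
crawl_id = resp.json()["id"]  # assumed response field

# Poll the documented status endpoint until the crawl finishes
while True:
    status = httpx.get(f"{API}/api/v1/crawl/{crawl_id}/status").json()
    print(status)
    if status.get("state") in ("finished", "stopped", "error"):  # assumed state values
        break
    time.sleep(5)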

Real-Time Dashboard

# Start dashboard
uvicorn graph_crawler.api.dashboard:app --port 8000

Features:

  • 📊 Real-time statistics via WebSocket
  • 📈 Live crawl progress visualization
  • ⏯️ Pause/Resume/Stop controls
  • 📝 Error monitoring
  • 📉 Performance metrics

WebSocket Events:

  • initial_state — Current state on connect
  • stats_update — Statistics update
  • page_crawled — New page scanned
  • error — Crawl error occurred
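
A minimal listener for these events, using the third-party websockets package; the endpoint comes from the dashboard snippet earlier, and the message layout beyond the event names listed above is an assumption.

import asyncio
import json

import websockets  # pip install websockets

async def listen():
    async with websockets.connect("ws://localhost:8000/ws/crawl") as ws:
        async for raw in ws:
            message = json.loads(raw)
            # Event names from the list above: initial_state, stats_update,
            # page_crawled, error; the payload layout itself is assumed.
            print(message.get("event"), message.get("data"))

asyncio.run(listen())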

Webhooks

Receive real-time notifications for crawl events:

from graph_crawler.api.webhooks import WebhookManager, WebhookEvent

# Setup webhooks
manager = WebhookManager()

manager.add_webhook(
    url="https://your-server.com/webhook",
    events=[
        WebhookEvent.CRAWL_STARTED,
        WebhookEvent.CRAWL_FINISHED,
        WebhookEvent.PAGE_CRAWLED,
        WebhookEvent.CRAWL_ERROR,
        WebhookEvent.MILESTONE_REACHED,  # Every N pages
    ],
    secret="your-hmac-secret",  # For signature verification
    headers={"Authorization": "Bearer token"}
)

# Start async delivery
await manager.start()

# Integrate with crawler
await integrate_webhooks_with_crawler(event_bus, webhook_configs)

Webhook Payload:

{
    "event": "page_crawled",
    "data": {
        "url": "https://example.com/page",
        "status": 200,
        "depth": 2
    },
    "timestamp": "2024-01-15T10:30:00Z"
}
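
On the receiving side, the secret passed to add_webhook can be used to verify deliveries. The sketch below assumes an HMAC-SHA256 signature of the raw body carried in an X-Webhook-Signature header; the actual header name and signing scheme should be checked against the library, so treat this as an illustration only.

import hashlib
import hmac

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
SECRET = b"your-hmac-secret"

@app.post("/webhook")
async def receive(request: Request, x_webhook_signature: str = Header(default="")):
    body = await request.body()
    # Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, x_webhook_signature):
        raise HTTPException(status_code=401, detail="invalid signature")
    event = await request.json()
    print(event.get("event"), event.get("data"))
    return {"ok": True}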

AI Integration

⚠️ Important: AI features require your own API keys from external providers. GraphCrawler provides the integration layer and abstractions, not built-in AI capabilities.

Required API Keys

| Provider | Environment Variable | Get Key |
|----------|----------------------|---------|
| OpenAI | OPENAI_API_KEY | platform.openai.com |
| Anthropic | ANTHROPIC_API_KEY | console.anthropic.com |
| AWS Bedrock | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY | AWS Console |
| Emergent | EMERGENT_LLM_KEY | Universal key for EmergentModel |

# Example .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
EMERGENT_LLM_KEY=...

LLM-Powered Extraction

from graph_crawler.ai import ExtractionPlugin
from graph_crawler.ai.models import OpenAIModel, AnthropicModel, BedrockModel

# Configure model
model = OpenAIModel(
    api_key="sk-...",
    model="gpt-4o",
    temperature=0
)

# Create extraction plugin
extractor = ExtractionPlugin(
    model=model,
    prompt="""
    Extract the following from this page:
    - Main topic
    - Key entities (people, companies, products)
    - Sentiment (positive/neutral/negative)
    
    Return as JSON.
    """
)

graph = gc.crawl(
    "https://news-site.com",
    max_pages=50,
    node_plugins=[extractor]
)

# Access AI-extracted data
for node in graph:
    ai_data = node.user_data.get("ai_extraction", {})
    print(f"{node.url}: {ai_data.get('main_topic')}")

Vector Search

from graph_crawler.extensions.plugins.node.vectorization import (
    VectorizationPlugin,
    semantic_search,
    cluster_by_similarity
)

vectorizer = VectorizationPlugin(
    model="text-embedding-3-small",
    api_key="sk-..."
)

graph = gc.crawl("https://docs.example.com", node_plugins=[vectorizer])

# Semantic search across crawled pages
results = semantic_search(
    graph=graph,
    query="How to configure authentication?",
    top_k=5
)

for node, score in results:
    print(f"[{score:.3f}] {node.url}")

# Cluster similar pages
clusters = cluster_by_similarity(graph, method="kmeans", n_clusters=5)

Data Extraction

Built-in Extractors

| Extractor | Data Types | Formats |
|-----------|------------|---------|
| PhoneExtractor | Phone numbers | UA, US, RU, international |
| EmailExtractor | Email addresses | RFC 5322 compliant |
| PriceExtractor | Prices | USD, EUR, UAH, ranges |
| StructuredData | Schema.org | JSON-LD, Microdata, RDFa |
| OpenGraph | Social meta | og:title, og:image, etc. |
| TwitterCards | Twitter meta | twitter:card, etc. |

Structured Data Extraction

from graph_crawler.extensions.plugins.node import StructuredDataPlugin

graph = gc.crawl(
    "https://shop.example.com",
    node_plugins=[StructuredDataPlugin()]
)

for node in graph:
    # JSON-LD data
    jsonld = node.user_data.get("jsonld", [])
    for item in jsonld:
        if item.get("@type") == "Product":
            print(f"Product: {item.get('name')}")
            print(f"Price: {item.get('offers', {}).get('price')}")
    
    # OpenGraph
    og = node.user_data.get("opengraph", {})
    print(f"OG Title: {og.get('og:title')}")
    
    # Microdata
    microdata = node.user_data.get("microdata", [])

CLI Reference

# Crawl website
graph-crawler crawl https://example.com \
    --max-depth 5 \
    --max-pages 1000 \
    --driver playwright \
    --output ./results/

# List saved graphs
graph-crawler list --storage sqlite --path ./data/

# Graph information
graph-crawler info my_graph --detailed

# Export graph
graph-crawler export my_graph \
    --format csv \
    --output ./exports/graph.csv

# Compare two graphs
graph-crawler diff graph_v1 graph_v2 --show-added --show-removed

# Start API server
graph-crawler serve --host 0.0.0.0 --port 8000

# Initialize new project
graph-crawler init my_crawler_project

Architecture

GraphCrawler follows Clean Architecture principles:

┌──────────────────────────────────────────────────────────────┐
│                        Presentation                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │    CLI     │  │  REST API  │  │  WebSocket │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                         Public API                            │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │  crawl() • async_crawl() • GraphCrawlerClient           │ │
│  └─────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│                      Application Layer                        │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │  Use Cases │  │  Services  │  │    DTOs    │             │
│  │  (Spider)  │  │ (Exporter) │  │  (Mapper)  │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                        Domain Layer                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │  Entities  │  │   Value    │  │ Interfaces │             │
│  │Graph,Node, │  │  Objects   │  │IDriver,    │             │
│  │   Edge     │  │Settings,   │  │IStorage    │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                     Infrastructure Layer                      │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │ Transport  │  │Persistence │  │ Messaging  │             │
│  │HTTP,       │  │SQLite,     │  │Celery,     │             │
│  │Playwright  │  │MongoDB     │  │Redis       │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                      Extensions Layer                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │   Node     │  │   Engine   │  │    AI      │             │
│  │  Plugins   │  │  Plugins   │  │  Models    │             │
│  └────────────┘  └────────────┘  └────────────┘             │
└──────────────────────────────────────────────────────────────┘

Directory Structure

graph_crawler/
├── api/                      # Public API surface
│   ├── sync.py              # Synchronous API
│   ├── async_.py            # Asynchronous API
│   ├── client/              # OOP client interface
│   ├── rest_api.py          # FastAPI REST endpoints
│   ├── dashboard.py         # Real-time dashboard
│   ├── webhooks.py          # Webhook notifications
│   └── websocket_manager.py # WebSocket handling
├── domain/                   # Core business logic
│   ├── entities/            # Graph, Node, Edge
│   ├── value_objects/       # Settings, Configs, Rules
│   ├── interfaces/          # Abstract contracts
│   └── events/              # Domain events (EventBus)
├── application/              # Application services
│   ├── use_cases/           # Crawling, export logic
│   ├── services/            # Factories, helpers
│   └── dto/                 # Data transfer objects
├── infrastructure/           # External implementations
│   ├── transport/           # HTTP, Playwright drivers
│   ├── persistence/         # Storage backends
│   └── messaging/           # Celery, Redis
├── extensions/               # Plugin system
│   ├── plugins/
│   │   ├── node/           # Content extraction plugins
│   │   ├── crawl_engine/   # URL prioritization plugins
│   │   └── engine/         # Anti-bot, CAPTCHA plugins
│   └── middleware/          # Request middleware
├── ai/                       # AI/ML integrations
│   ├── models/              # OpenAI, Anthropic, Bedrock
│   └── extraction/          # LLM extraction
└── shared/                   # Cross-cutting concerns
    ├── exceptions.py        # Custom exceptions
    ├── constants.py         # Configuration
    └── utils/               # Helpers

Performance

Benchmarks

Tested on AWS c5.2xlarge (8 vCPU, 16 GB RAM):

| Scenario | Pages | Time | Memory | Rate |
|----------|-------|------|--------|------|
| Static site (HTTP) | 10,000 | 45s | 512 MB | 222 p/s |
| SPA (Playwright) | 1,000 | 180s | 2 GB | 5.5 p/s |
| Distributed (4 workers) | 100,000 | 15min | 8 GB | 111 p/s |
| Low-memory mode | 1,000,000 | 4h | 1 GB | 69 p/s |
| Python 3.14 free-threading | 10,000 | 14s | 430 MB | 714 p/s |

Optimization Tips

# 1. Use async driver for static sites
graph = gc.crawl(url, driver="async", concurrency=50)

# 2. Disable unnecessary features
graph = gc.crawl(
    url,
    compute_hashes=False,      # Skip content hashing
    extract_metadata=False,    # Skip meta extraction
    store_html=False           # Don't persist HTML
)

# 3. Use URL rules to focus crawl
rules = [URLRule(pattern=r"/blog/", should_scan=False)]
graph = gc.crawl(url, url_rules=rules)

# 4. Enable low-memory for large crawls
graph = gc.crawl(url, max_pages=500_000, low_memory_mode=True)

# 5. Enable Python 3.14 free-threading
# export PYTHON_GIL=0



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

graph_crawler-4.0.53.tar.gz (870.1 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

graph_crawler-4.0.53-py3-none-any.whl (1.0 MB)

Uploaded Python 3

File details

Details for the file graph_crawler-4.0.53.tar.gz.

File metadata

  • Download URL: graph_crawler-4.0.53.tar.gz
  • Size: 870.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.53.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 51462e6a33b4861f21f4cf7abfbd85af33389fca25c5a57583dbdfcd1f2e17ab |
| MD5 | 792993365c0c5224fc1b773761c009ee |
| BLAKE2b-256 | 3950966281d0ded8370aa9218b1cc9767fdbd506f34c61e43ad06c4fe8a85ef8 |


File details

Details for the file graph_crawler-4.0.53-py3-none-any.whl.

File metadata

  • Download URL: graph_crawler-4.0.53-py3-none-any.whl
  • Size: 1.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.53-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 534b0adfc53dff62bce215b5356e4c3e91f7357d19f9649df6b154a31bf9de68 |
| MD5 | 74f8e1736135b94074485cb26443e2f2 |
| BLAKE2b-256 | 623f3e802337916e4d15ca75f8d6cc8b94fbe4297fe0a70620e3a69b7f2d2ea5 |

