
A sync-first library for building website graphs, as simple as requests!

Project description

GraphCrawler Logo

GraphCrawler

Enterprise-Grade Web Crawling Framework for Graph-Based Site Analysis


Documentation · API Reference


Why GraphCrawler?

Modern web applications require sophisticated crawling solutions. GraphCrawler was built from the ground up with Clean Architecture principles, offering unmatched flexibility and performance for web analysis tasks.

🎯 Built for Scale

  • Process 1M+ pages with low-memory mode
  • Distributed crawling via Celery workers
  • Automatic rate limiting & autothrottle
  • Checkpoint/resume for long-running jobs
  • Python 3.14 free-threading support (3.2x faster)

🧩 Extensible by Design

  • Plugin architecture for custom logic
  • Multiple storage backends
  • Swappable transport drivers
  • Event-driven processing pipeline
  • Webhook notifications for real-time updates

🛡️ Production Ready

  • Battle-tested in enterprise environments
  • Comprehensive error handling
  • SSRF protection built-in
  • Full type annotations (mypy strict)
  • Anti-bot bypass (Cloudflare, DataDome, PerimeterX)

🤖 AI-Native

  • Integrated LLM extraction (OpenAI, Anthropic, Bedrock)
  • Vector embeddings for semantic search
  • Smart content classification
  • ML-powered URL prioritization
  • CAPTCHA solving integration

What's New in v4.0

🚀 Python 3.14 Free-Threading Support

# Enable free-threading for maximum speed
export PYTHON_GIL=0
python your_script.py

Performance Results:

  • 2-4x faster HTML parsing
  • 🚀 3.2x faster end-to-end crawling
  • 📉 16% less memory usage
  • ⏱️ 30% faster startup

🌱 Multiple Seed URLs

graph = gc.crawl(
    seed_urls=[
        "https://example.com/products/",
        "https://example.com/blog/",
        "https://example.com/docs/",
    ],
    max_depth=3
)

🔄 Incremental Crawling

# Start with initial crawl
graph1 = gc.crawl("https://example.com", max_pages=50)

# Later, continue from where you left off
graph2 = gc.crawl(base_graph=graph1, max_pages=100)

📡 Real-Time Dashboard & WebSocket

# Start dashboard server
uvicorn graph_crawler.api.dashboard:app --port 8000

# WebSocket endpoint for live updates
# ws://localhost:8000/ws/crawl

Installation

From PyPI (Recommended)

pip install graph-crawler

Optional Dependencies

# JavaScript/SPA rendering
pip install graph-crawler[playwright]

# Vector embeddings & ML
pip install graph-crawler[embeddings]

# MongoDB storage backend
pip install graph-crawler[mongodb]

# PostgreSQL storage backend  
pip install graph-crawler[postgresql]

# Distributed crawling (Celery)
pip install graph-crawler[distributed]

# Full installation
pip install graph-crawler[all]

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| Python | 3.11 | 3.14+ (free-threading) |
| Memory | 512 MB | 4 GB+ |
| OS | Linux, macOS, Windows | Linux (Ubuntu 22.04+) |

Quick Start

Basic Usage

import graph_crawler as gc

# Crawl a website
graph = gc.crawl(
    url="https://example.com",
    max_depth=3,
    max_pages=100
)

# Analyze results
print(f"Discovered {len(graph.nodes):,} pages")
print(f"Mapped {len(graph.edges):,} links")

# Persist to disk
gc.save_graph(graph, "example_graph.json")

Async API

import asyncio
import graph_crawler as gc

async def main():
    graph = await gc.async_crawl(
        url="https://example.com",
        max_depth=5,
        max_pages=1000,
        request_delay=0.25
    )
    
    # Process nodes concurrently
    async for node in graph.iter_nodes_async():
        print(f"[{node.response_status}] {node.url}")
    
    return graph

graph = asyncio.run(main())

Client Interface

import asyncio

from graph_crawler import GraphCrawlerClient

async def main():
    async with GraphCrawlerClient(
        driver="playwright",
        storage="sqlite"
    ) as client:
        # Configure and execute
        graph = await client.crawl(
            "https://spa-application.com",
            max_depth=4
        )

        # Save with metadata
        await client.save("spa_graph", graph, tags=["spa", "react"])

asyncio.run(main())

Core Concepts

Graph Model

GraphCrawler represents websites as directed graphs:

┌─────────────────────────────────────────────────────────┐
│                        Graph                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐             │
│  │  Node   │───▶│  Node   │───▶│  Node   │             │
│  │ (root)  │    │ /about  │    │ /team   │             │
│  └─────────┘    └─────────┘    └─────────┘             │
│       │              │                                   │
│       │         ┌────┴────┐                             │
│       ▼         ▼         ▼                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                 │
│  │  Node   │  │  Node   │  │  Node   │                 │
│  │ /blog   │  │/contact │  │/careers │                 │
│  └─────────┘  └─────────┘  └─────────┘                 │
└─────────────────────────────────────────────────────────┘

Node — represents a single page with:

  • URL, depth, status code
  • Content hash (SHA-256) and SimHash
  • Metadata (title, description, headers)
  • Custom user data

Edge — represents a link between pages:

  • Source and target URLs
  • Link text and attributes
  • Edge type (internal, external, resource)
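
A quick way to see these fields in practice is to iterate a crawled graph. The sketch below sticks to attributes used elsewhere in this README (url, depth, response_status, user_data, graph.edges); the per-edge attribute names are an assumption and may differ.

import graph_crawler as gc

graph = gc.crawl("https://example.com", max_depth=2, max_pages=25)

# Node attributes used throughout this README
for node in graph:
    print(node.url, node.depth, node.response_status)
    print(node.user_data)  # custom data filled in by plugins

# graph.edges is shown above; individual edge attribute names are assumed
for edge in graph.edges:
    print(edge)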

Graph Operations

# Graph algebra
merged = graph1 + graph2           # Union
diff = graph1 - graph2             # Difference  
common = graph1 & graph2           # Intersection
symmetric = graph1 ^ graph2        # Symmetric difference

# Subgraph detection
if graph1 < graph2:
    print("graph1 is a subgraph of graph2")

# Find popular pages
popular = graph.get_popular_nodes(top_n=10, by='in_degree')

# Find orphan pages (no incoming links)
for node in graph:
    if graph.get_in_degree(node) == 0 and node.depth > 0:
        print(f"Orphan: {node.url}")

Configuration

Crawl Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | str | required | Starting URL |
| seed_urls | list[str] | None | Multiple starting URLs |
| max_depth | int | 3 | Maximum link depth from root |
| max_pages | int | 100 | Maximum pages to crawl |
| same_domain | bool | True | Restrict to starting domain |
| request_delay | float | 0.5 | Delay between requests (seconds) |
| timeout | int | 300 | Global timeout (seconds) |
| driver | str | "http" | Transport driver |
| url_rules | list[URLRule] | [] | URL filtering/priority rules |
| node_plugins | list[Plugin] | [] | Content processing plugins |
| respect_robots | bool | True | Honor robots.txt |
| base_graph | Graph | None | Continue incremental crawl |
| low_memory_mode | bool | False | Enable eviction for large crawls |
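
As a quick illustration, the call below combines several parameters from the table above; all names come from the table and the values are arbitrary.

import graph_crawler as gc

graph = gc.crawl(
    url="https://example.com",
    max_depth=4,
    max_pages=5_000,
    same_domain=True,
    request_delay=0.25,
    respect_robots=True,
    driver="http",
)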

URL Rules

from graph_crawler import URLRule, SmartURLRule, build_smart_rules

# Pattern-based filtering
rules = [
    # Skip non-HTML resources
    URLRule(pattern=r"\.(pdf|zip|exe|dmg)$", should_scan=False),
    
    # Skip admin areas
    URLRule(pattern=r"/(admin|wp-admin|dashboard)/", should_scan=False),
    
    # Prioritize product pages
    URLRule(pattern=r"/products?/[\w-]+$", priority=10),
    
    # Deprioritize pagination
    URLRule(pattern=r"\?page=\d+", priority=-5),
]

# Or use smart presets
rules = build_smart_rules(
    skip_extensions=[".pdf", ".zip"],
    skip_paths=["/admin", "/api"],
    priority_paths=["/products", "/categories"],
    skip_query_params=["session", "token"]
)

graph = gc.crawl("https://shop.example.com", url_rules=rules)

Settings Files

# crawler_settings.yaml
project_name: "my_crawler"

crawler:
  max_depth: 5
  max_pages: 10000
  request_delay: 0.25
  timeout: 3600
  respect_robots: true

driver:
  type: playwright
  headless: true
  wait_for: networkidle

storage:
  type: sqlite
  path: ./data/crawl.db

plugins:
  - graph_crawler.extensions.plugins.node.SEOPlugin
  - graph_crawler.extensions.plugins.node.StructuredDataPlugin
Load the settings in Python:

from graph_crawler import CrawlerSettings

settings = CrawlerSettings.from_yaml("crawler_settings.yaml")
graph = gc.crawl("https://example.com", settings=settings)

Drivers

GraphCrawler supports multiple transport drivers:

Basic Drivers

| Driver | Engine | Best For |
|--------|--------|----------|
| http | httpx | Static sites, APIs |
| async | aiohttp | High concurrency |
| playwright | Chromium | SPAs, modern sites |
| cloudscraper | requests | Cloudflare sites |

Professional Anti-Bot Drivers (v4.1+)

| Driver | Engine | Best For | Bypasses |
|--------|--------|----------|----------|
| undetected | undetected-chromedriver | Enterprise anti-bot | Cloudflare, DataDome, PerimeterX, Imperva, Kasada |
| nodriver | nodriver | Async Cloudflare bypass | Cloudflare Turnstile, DataDome |
| tls | curl-cffi | Fast TLS fingerprint | DataDome, PerimeterX (no JS challenges) |
| botasaurus | botasaurus | Enterprise scraping | All major anti-bot systems |

Undetected Chrome Driver

Professional anti-bot driver based on undetected-chromedriver. Automatically removes automation flags and simulates human behavior.

graph = gc.crawl(
    "https://cloudflare-protected.com",
    driver="undetected",
    driver_config={
        "headless": True,
        "proxy": "http://user:pass@proxy:8080",
        "human_behavior": True,
        "page_load_timeout": 60
    }
)

NoDriver (Async)

Async anti-bot driver with built-in Cloudflare Turnstile solver:

graph = await gc.async_crawl(
    "https://turnstile-protected.com",
    driver="nodriver",
    driver_config={
        "headless": True,
        "cf_auto_solve": True,
        "cf_wait_timeout": 30,
        "human_behavior": True
    }
)

TLS Fingerprint Client

Fast HTTP client that impersonates real browser TLS fingerprints. Best for sites without JavaScript challenges:

graph = gc.crawl(
    "https://datadome-protected.com",
    driver="tls",
    driver_config={
        "impersonate": "chrome131",  # chrome110-131, firefox117-120, safari15-17
        "proxy": "http://user:pass@proxy:8080",
        "max_retries": 3
    }
)

Botasaurus Driver

Enterprise-grade anti-bot framework with automatic bypass:

graph = gc.crawl(
    "https://enterprise-protected.com",
    driver="botasaurus",
    driver_config={
        "headless": True,
        "block_images": True,
        "proxy": "http://user:pass@proxy:8080"
    }
)

Playwright Driver

graph = gc.crawl(
    "https://react-application.com",
    driver="playwright",
    
    # Playwright-specific options
    headless=True,
    wait_for_selector=".app-loaded",
    wait_for_timeout=5000,
    viewport={"width": 1920, "height": 1080},
    
    # Browser context
    ignore_https_errors=True,
    java_script_enabled=True
)

Storage Backends

| Backend | Capacity | Query Speed | Use Case |
|---------|----------|-------------|----------|
| memory | ~10K nodes | ⚡ Fastest | Development, testing |
| json | ~50K nodes | 🐌 Slow | Small projects, export |
| sqlite | ~500K nodes | ⚡ Fast | Local production |
| mongodb | Unlimited | ⚡ Fast | Distributed, cloud |
| postgresql | Unlimited | ⚡ Fast | Analytics, enterprise |
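
Backends are selected by name, as in the client example earlier. A minimal sketch, assuming the same storage argument accepts the backend names listed in the table:

import asyncio

from graph_crawler import GraphCrawlerClient

async def main():
    # Backend names per the table above: "memory", "json", "sqlite", "mongodb", "postgresql"
    async with GraphCrawlerClient(driver="http", storage="sqlite") as client:
        graph = await client.crawl("https://example.com", max_depth=3)
        await client.save("example_graph", graph)

asyncio.run(main())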

Low-Memory Mode

For crawling sites with millions of pages:

graph = gc.crawl(
    "https://large-site.com",
    max_pages=1_000_000,
    low_memory_mode=True,          # Enable eviction
    eviction_threshold=50_000,     # Nodes in RAM
    eviction_storage="sqlite"      # Where to evict
)

Plugin System

Plugin Types

| Type | Hook Point | Use Case |
|------|------------|----------|
| NodePlugin | After page fetch | Content extraction, SEO analysis |
| EdgePlugin | After link discovery | Link classification, filtering |
| EnginePlugin | Before/during crawl | URL prioritization, rate limiting |
| ExportPlugin | During export | Data transformation |

Built-in Node Plugins

from graph_crawler.extensions.plugins.node import (
    StructuredDataPlugin,  # JSON-LD, OpenGraph, Microdata, RDFa, Twitter Cards
    SEOPlugin,             # Meta tags, headings, schema
    ContentHashPlugin,     # Duplicate detection (SHA-256 + SimHash)
    VectorizationPlugin,   # Text embeddings
)

from graph_crawler.extensions.plugins.node.extractors import (
    PhoneExtractorPlugin,  # UA, US, RU phone formats
    EmailExtractorPlugin,  # RFC 5322 compliant
    PriceExtractorPlugin,  # USD, EUR, UAH with ranges
)

graph = gc.crawl(
    "https://shop.example.com",
    node_plugins=[
        StructuredDataPlugin(),
        PhoneExtractorPlugin(),
        EmailExtractorPlugin(),
        PriceExtractorPlugin(),
    ]
)

# Access extracted data
for node in graph:
    print(f"Page: {node.url}")
    print(f"  Phones: {node.user_data.get('phones', [])}")
    print(f"  Emails: {node.user_data.get('emails', [])}")
    print(f"  Prices: {node.user_data.get('prices', [])}")
    print(f"  OpenGraph: {node.user_data.get('opengraph', {})}")

Custom Plugin

import re

from graph_crawler import BaseNodePlugin, NodePluginType

class PriceMonitorPlugin(BaseNodePlugin):
    """Monitor product prices across e-commerce sites."""
    
    plugin_type = NodePluginType.ON_HTML_PARSED
    priority = 100
    
    def execute(self, context):
        if not context.html_tree:
            return context
        
        # Extract price using CSS selectors
        price_elem = context.html_tree.select_one('[data-price], .price, #price')
        if price_elem:
            context.user_data["price"] = {
                "raw": price_elem.get_text(strip=True),
                "currency": self._detect_currency(price_elem),
                "value": self._parse_price(price_elem)
            }
        
        return context

    def _detect_currency(self, elem):
        # Minimal helper for the example: detect a currency from common symbols
        text = elem.get_text()
        for symbol, code in (("$", "USD"), ("€", "EUR"), ("₴", "UAH")):
            if symbol in text:
                return code
        return None

    def _parse_price(self, elem):
        # Minimal helper for the example: first number found in the text, if any
        match = re.search(r"\d+(?:[.,]\d+)?", elem.get_text())
        return float(match.group().replace(",", ".")) if match else None

graph = gc.crawl(url, node_plugins=[PriceMonitorPlugin()])

Smart Crawling (ML)

SmartPageFinderPlugin

An ML-powered plugin that uses an LLM or keyword analysis to find relevant pages:

from graph_crawler.extensions.plugins.node import SmartPageFinderPlugin

plugin = SmartPageFinderPlugin(
    search_prompt="Python developer jobs in Kyiv",
    config={
        "min_relevance_score": 0.7,
        "analyze_links": True,
        "model": "gpt-4o-mini"  # Optional LLM
    }
)

graph = gc.crawl("https://jobs.example.com", node_plugins=[plugin])

# Find target pages
for node in graph:
    if node.user_data.get("is_target_page"):
        print(f"Found: {node.url}")
        print(f"  Score: {node.user_data['relevance_score']:.2f}")
        print(f"  Reason: {node.user_data['relevance_reason']}")

VectorCrawlEnginePlugin

Vector-based URL prioritization using embeddings:

from graph_crawler.extensions.plugins.crawl_engine import VectorCrawlEnginePlugin

plugin = VectorCrawlEnginePlugin(
    keywords=["python", "developer", "remote", "jobs"],
    min_priority=1,
    max_priority=15,
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)
plugin.setup()  # Load model

graph = gc.crawl(
    "https://careers.example.com",
    engine_plugins=[plugin]
)

SmartCrawlEnginePlugin

Intelligent URL prioritization before scanning:

from graph_crawler.extensions.plugins.crawl_engine import SmartCrawlEnginePlugin

plugin = SmartCrawlEnginePlugin(
    search_prompt="Machine learning engineer positions",
    config={
        "aggressive_filtering": True,  # Skip irrelevant URLs without scanning
        "use_llm": False  # Use fast keyword-based analysis
    }
)

graph = gc.crawl("https://linkedin.com/jobs", engine_plugins=[plugin])

Anti-Bot & CAPTCHA

Anti-Bot Detection

GraphCrawler can detect and bypass various anti-bot systems:

from graph_crawler.extensions.plugins.engine import AntiBotSystem, detect_anti_bot_system

# Automatic detection
system = detect_anti_bot_system(html_content)
# Returns: AntiBotSystem.CLOUDFLARE, AKAMAI, DATADOME, PERIMETERX, etc.

Supported anti-bot systems:

  • Cloudflare (Challenge, Turnstile)
  • Akamai (Bot Manager)
  • DataDome
  • PerimeterX
  • Imperva/Incapsula

CAPTCHA Solving

Integration with popular CAPTCHA solving services:

from graph_crawler.extensions.plugins.engine.captcha import (
    CaptchaPlugin,
    create_solver,
    CaptchaType
)

# Create solver
solver = create_solver(
    service="2captcha",  # or "anticaptcha", "capsolver"
    api_key="your-api-key"
)

# Check balance
balance = solver.check_balance()
print(f"Balance: ${balance}")

# Use with crawler
plugin = CaptchaPlugin(solver=solver)
graph = gc.crawl(
    "https://protected-site.com",
    driver="playwright",
    engine_plugins=[plugin]
)

Supported CAPTCHA types:

  • reCAPTCHA v2/v3
  • hCaptcha
  • Image CAPTCHA

Distributed Crawling

Scale horizontally with Celery workers:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    Redis    │◀────│   Worker 1  │
└─────────────┘     │   (Broker)  │     └─────────────┘
                    └─────────────┘             │
                          ▲                     ▼
                          │               ┌───────────┐
                    ┌─────────────┐       │  MongoDB  │
                    │   Worker 2  │──────▶│ (Results) │
                    └─────────────┘       └───────────┘
                          ▲
                    ┌─────────────┐
                    │   Worker N  │
                    └─────────────┘

Setup

# Start Redis
docker run -d -p 6379:6379 redis:alpine

# Start workers
celery -A graph_crawler.infrastructure.messaging worker -l INFO -c 4

Usage

from graph_crawler import EasyDistributedCrawler

crawler = EasyDistributedCrawler(
    broker_url="redis://localhost:6379/0",
    result_backend="redis://localhost:6379/1",
    mongodb_uri="mongodb://localhost:27017"
)

# Submit crawl job
job_id = await crawler.submit(
    url="https://large-site.com",
    max_pages=100_000,
    max_depth=10,
    workers=8
)

# Monitor progress
while True:
    status = await crawler.get_status(job_id)
    print(f"Progress: {status.pages_crawled}/{status.pages_total}")
    
    if status.is_complete:
        break
    
    await asyncio.sleep(5)

# Get results
graph = await crawler.get_result(job_id)

REST API & Dashboard

REST API

Built-in FastAPI-based REST API for remote control:

# Start API server
uvicorn graph_crawler.api.rest_api:router --port 8001

Endpoints:

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/crawl/start | Start new crawl |
| POST | /api/v1/crawl/{id}/pause | Pause crawl |
| POST | /api/v1/crawl/{id}/resume | Resume crawl |
| POST | /api/v1/crawl/{id}/stop | Stop crawl |
| GET | /api/v1/crawl/{id}/status | Get crawl status |
| GET | /api/v1/crawl/list | List all crawls |
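
A crawl can then be started and polled over HTTP with any client. The sketch below uses httpx against the endpoints listed above; the request body fields, response fields, and state values are assumptions based on the crawl parameters documented earlier, so check the actual API schema.

import time

import httpx

API = "http://localhost:8001"

# Start a crawl (body fields are assumed, based on the crawl parameters above)
resp = httpx.post(f"{API}/api/v1/crawl/start", json={
    "url": "https://example.com",
    "max_depth": 3,
    "max_pages": 100,
})
crawl_id = resp.json()["id"]  # assumed response field

# Poll the documented status endpoint until the crawl finishes
while True:
    status = httpx.get(f"{API}/api/v1/crawl/{crawl_id}/status").json()
    print(status)
    if status.get("state") in ("finished", "stopped", "error"):  # assumed state values
        break
    time.sleep(5)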

Real-Time Dashboard

# Start dashboard
uvicorn graph_crawler.api.dashboard:app --port 8000

Features:

  • 📊 Real-time statistics via WebSocket
  • 📈 Live crawl progress visualization
  • ⏯️ Pause/Resume/Stop controls
  • 📝 Error monitoring
  • 📉 Performance metrics

WebSocket Events:

  • initial_state — Current state on connect
  • stats_update — Statistics update
  • page_crawled — New page scanned
  • error — Crawl error occurred
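
A minimal listener for these events, using the third-party websockets package; the endpoint comes from the dashboard snippet earlier, and the message layout beyond the event names listed above is an assumption.

import asyncio
import json

import websockets  # pip install websockets

async def listen():
    async with websockets.connect("ws://localhost:8000/ws/crawl") as ws:
        async for raw in ws:
            message = json.loads(raw)
            # Event names from the list above: initial_state, stats_update,
            # page_crawled, error; the payload layout itself is assumed.
            print(message.get("event"), message.get("data"))

asyncio.run(listen())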

Webhooks

Receive real-time notifications for crawl events:

from graph_crawler.api.webhooks import WebhookManager, WebhookEvent

# Setup webhooks
manager = WebhookManager()

manager.add_webhook(
    url="https://your-server.com/webhook",
    events=[
        WebhookEvent.CRAWL_STARTED,
        WebhookEvent.CRAWL_FINISHED,
        WebhookEvent.PAGE_CRAWLED,
        WebhookEvent.CRAWL_ERROR,
        WebhookEvent.MILESTONE_REACHED,  # Every N pages
    ],
    secret="your-hmac-secret",  # For signature verification
    headers={"Authorization": "Bearer token"}
)

# Start async delivery
await manager.start()

# Integrate with crawler
await integrate_webhooks_with_crawler(event_bus, webhook_configs)

Webhook Payload:

{
    "event": "page_crawled",
    "data": {
        "url": "https://example.com/page",
        "status": 200,
        "depth": 2
    },
    "timestamp": "2024-01-15T10:30:00Z"
}
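
On the receiving side, the secret passed to add_webhook can be used to verify deliveries. The sketch below assumes an HMAC-SHA256 signature of the raw body carried in an X-Webhook-Signature header; the actual header name and signing scheme should be checked against the library, so treat this as an illustration only.

import hashlib
import hmac

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
SECRET = b"your-hmac-secret"

@app.post("/webhook")
async def receive(request: Request, x_webhook_signature: str = Header(default="")):
    body = await request.body()
    # Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, x_webhook_signature):
        raise HTTPException(status_code=401, detail="invalid signature")
    event = await request.json()
    print(event.get("event"), event.get("data"))
    return {"ok": True}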

AI Integration

⚠️ Important: AI features require your own API keys from external providers. GraphCrawler provides the integration layer and abstractions, not built-in AI capabilities.

Required API Keys

| Provider | Environment Variable | Get Key |
|----------|----------------------|---------|
| OpenAI | OPENAI_API_KEY | platform.openai.com |
| Anthropic | ANTHROPIC_API_KEY | console.anthropic.com |
| AWS Bedrock | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY | AWS Console |
| Emergent | EMERGENT_LLM_KEY | Universal key for EmergentModel |

# Example .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
EMERGENT_LLM_KEY=...

LLM-Powered Extraction

from graph_crawler.ai import ExtractionPlugin
from graph_crawler.ai.models import OpenAIModel, AnthropicModel, BedrockModel

# Configure model
model = OpenAIModel(
    api_key="sk-...",
    model="gpt-4o",
    temperature=0
)

# Create extraction plugin
extractor = ExtractionPlugin(
    model=model,
    prompt="""
    Extract the following from this page:
    - Main topic
    - Key entities (people, companies, products)
    - Sentiment (positive/neutral/negative)
    
    Return as JSON.
    """
)

graph = gc.crawl(
    "https://news-site.com",
    max_pages=50,
    node_plugins=[extractor]
)

# Access AI-extracted data
for node in graph:
    ai_data = node.user_data.get("ai_extraction", {})
    print(f"{node.url}: {ai_data.get('main_topic')}")

Vector Search

from graph_crawler.extensions.plugins.node.vectorization import (
    VectorizationPlugin,
    semantic_search,
    cluster_by_similarity
)

vectorizer = VectorizationPlugin(
    model="text-embedding-3-small",
    api_key="sk-..."
)

graph = gc.crawl("https://docs.example.com", node_plugins=[vectorizer])

# Semantic search across crawled pages
results = semantic_search(
    graph=graph,
    query="How to configure authentication?",
    top_k=5
)

for node, score in results:
    print(f"[{score:.3f}] {node.url}")

# Cluster similar pages
clusters = cluster_by_similarity(graph, method="kmeans", n_clusters=5)

Data Extraction

Built-in Extractors

| Extractor | Data Types | Formats |
|-----------|------------|---------|
| PhoneExtractor | Phone numbers | UA, US, RU, international |
| EmailExtractor | Email addresses | RFC 5322 compliant |
| PriceExtractor | Prices | USD, EUR, UAH, ranges |
| StructuredData | Schema.org | JSON-LD, Microdata, RDFa |
| OpenGraph | Social meta | og:title, og:image, etc. |
| TwitterCards | Twitter meta | twitter:card, etc. |

Structured Data Extraction

from graph_crawler.extensions.plugins.node import StructuredDataPlugin

graph = gc.crawl(
    "https://shop.example.com",
    node_plugins=[StructuredDataPlugin()]
)

for node in graph:
    # JSON-LD data
    jsonld = node.user_data.get("jsonld", [])
    for item in jsonld:
        if item.get("@type") == "Product":
            print(f"Product: {item.get('name')}")
            print(f"Price: {item.get('offers', {}).get('price')}")
    
    # OpenGraph
    og = node.user_data.get("opengraph", {})
    print(f"OG Title: {og.get('og:title')}")
    
    # Microdata
    microdata = node.user_data.get("microdata", [])

CLI Reference

# Crawl website
graph-crawler crawl https://example.com \
    --max-depth 5 \
    --max-pages 1000 \
    --driver playwright \
    --output ./results/

# List saved graphs
graph-crawler list --storage sqlite --path ./data/

# Graph information
graph-crawler info my_graph --detailed

# Export graph
graph-crawler export my_graph \
    --format csv \
    --output ./exports/graph.csv

# Compare two graphs
graph-crawler diff graph_v1 graph_v2 --show-added --show-removed

# Start API server
graph-crawler serve --host 0.0.0.0 --port 8000

# Initialize new project
graph-crawler init my_crawler_project

Architecture

GraphCrawler follows Clean Architecture principles:

┌──────────────────────────────────────────────────────────────┐
│                        Presentation                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │    CLI     │  │  REST API  │  │  WebSocket │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                         Public API                            │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │  crawl() • async_crawl() • GraphCrawlerClient           │ │
│  └─────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│                      Application Layer                        │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │  Use Cases │  │  Services  │  │    DTOs    │             │
│  │  (Spider)  │  │ (Exporter) │  │  (Mapper)  │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                        Domain Layer                           │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │  Entities  │  │   Value    │  │ Interfaces │             │
│  │Graph,Node, │  │  Objects   │  │IDriver,    │             │
│  │   Edge     │  │Settings,   │  │IStorage    │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                     Infrastructure Layer                      │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │ Transport  │  │Persistence │  │ Messaging  │             │
│  │HTTP,       │  │SQLite,     │  │Celery,     │             │
│  │Playwright  │  │MongoDB     │  │Redis       │             │
│  └────────────┘  └────────────┘  └────────────┘             │
├──────────────────────────────────────────────────────────────┤
│                      Extensions Layer                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐             │
│  │   Node     │  │   Engine   │  │    AI      │             │
│  │  Plugins   │  │  Plugins   │  │  Models    │             │
│  └────────────┘  └────────────┘  └────────────┘             │
└──────────────────────────────────────────────────────────────┘

Directory Structure

graph_crawler/
├── api/                      # Public API surface
│   ├── sync.py              # Synchronous API
│   ├── async_.py            # Asynchronous API
│   ├── client/              # OOP client interface
│   ├── rest_api.py          # FastAPI REST endpoints
│   ├── dashboard.py         # Real-time dashboard
│   ├── webhooks.py          # Webhook notifications
│   └── websocket_manager.py # WebSocket handling
├── domain/                   # Core business logic
│   ├── entities/            # Graph, Node, Edge
│   ├── value_objects/       # Settings, Configs, Rules
│   ├── interfaces/          # Abstract contracts
│   └── events/              # Domain events (EventBus)
├── application/              # Application services
│   ├── use_cases/           # Crawling, export logic
│   ├── services/            # Factories, helpers
│   └── dto/                 # Data transfer objects
├── infrastructure/           # External implementations
│   ├── transport/           # HTTP, Playwright drivers
│   ├── persistence/         # Storage backends
│   └── messaging/           # Celery, Redis
├── extensions/               # Plugin system
│   ├── plugins/
│   │   ├── node/           # Content extraction plugins
│   │   ├── crawl_engine/   # URL prioritization plugins
│   │   └── engine/         # Anti-bot, CAPTCHA plugins
│   └── middleware/          # Request middleware
├── ai/                       # AI/ML integrations
│   ├── models/              # OpenAI, Anthropic, Bedrock
│   └── extraction/          # LLM extraction
└── shared/                   # Cross-cutting concerns
    ├── exceptions.py        # Custom exceptions
    ├── constants.py         # Configuration
    └── utils/               # Helpers

Performance

Benchmarks

Tested on AWS c5.2xlarge (8 vCPU, 16 GB RAM):

| Scenario | Pages | Time | Memory | Rate |
|----------|-------|------|--------|------|
| Static site (HTTP) | 10,000 | 45s | 512 MB | 222 p/s |
| SPA (Playwright) | 1,000 | 180s | 2 GB | 5.5 p/s |
| Distributed (4 workers) | 100,000 | 15min | 8 GB | 111 p/s |
| Low-memory mode | 1,000,000 | 4h | 1 GB | 69 p/s |
| Python 3.14 free-threading | 10,000 | 14s | 430 MB | 714 p/s |

Optimization Tips

# 1. Use async driver for static sites
graph = gc.crawl(url, driver="async", concurrency=50)

# 2. Disable unnecessary features
graph = gc.crawl(
    url,
    compute_hashes=False,      # Skip content hashing
    extract_metadata=False,    # Skip meta extraction
    store_html=False           # Don't persist HTML
)

# 3. Use URL rules to focus crawl
rules = [URLRule(pattern=r"/blog/", should_scan=False)]
graph = gc.crawl(url, url_rules=rules)

# 4. Enable low-memory for large crawls
graph = gc.crawl(url, max_pages=500_000, low_memory_mode=True)

# 5. Enable Python 3.14 free-threading
# export PYTHON_GIL=0



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

graph_crawler-4.0.53.tar.gz (870.1 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

graph_crawler-4.0.53-py3-none-any.whl (1.0 MB)

Uploaded Python 3

File details

Details for the file graph_crawler-4.0.53.tar.gz.

File metadata

  • Download URL: graph_crawler-4.0.53.tar.gz
  • Size: 870.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.53.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 51462e6a33b4861f21f4cf7abfbd85af33389fca25c5a57583dbdfcd1f2e17ab |
| MD5 | 792993365c0c5224fc1b773761c009ee |
| BLAKE2b-256 | 3950966281d0ded8370aa9218b1cc9767fdbd506f34c61e43ad06c4fe8a85ef8 |


File details

Details for the file graph_crawler-4.0.53-py3-none-any.whl.

File metadata

  • Download URL: graph_crawler-4.0.53-py3-none-any.whl
  • Size: 1.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.53-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 534b0adfc53dff62bce215b5356e4c3e91f7357d19f9649df6b154a31bf9de68 |
| MD5 | 74f8e1736135b94074485cb26443e2f2 |
| BLAKE2b-256 | 623f3e802337916e4d15ca75f8d6cc8b94fbe4297fe0a70620e3a69b7f2d2ea5 |

