A sync-first library for building website graphs - as simple as requests!
GraphCrawler
Enterprise-Grade Web Crawling Framework for Graph-Based Site Analysis
Documentation • Examples • API Reference • Changelog
Why GraphCrawler?
Modern web applications require sophisticated crawling solutions. GraphCrawler was built from the ground up with Clean Architecture principles, offering unmatched flexibility and performance for web analysis tasks.
- 🎯 Built for Scale
- 🧩 Extensible by Design
- 🛡️ Production Ready
- 🤖 AI-Native
Table of Contents
- What's New in v4.0
- Installation
- Quick Start
- Core Concepts
- Configuration
- Drivers
- Storage Backends
- Plugin System
- Smart Crawling (ML)
- Anti-Bot & CAPTCHA
- Distributed Crawling
- REST API & Dashboard
- Webhooks
- AI Integration
- Data Extraction
- CLI Reference
- Architecture
- Performance
What's New in v4.0
🚀 Python 3.14 Free-Threading Support
# Enable free-threading for maximum speed
export PYTHON_GIL=0
python your_script.py
Performance Results:
- ⚡ 2-4x faster HTML parsing
- 🚀 3.2x faster end-to-end crawling
- 📉 16% less memory usage
- ⏱️ 30% faster startup
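To confirm at runtime that the interpreter really is running without the GIL, a minimal sketch using only the standard library (`sys._is_gil_enabled()` exists on CPython 3.13+; on older builds the fallback below assumes the GIL is active):

```python
import sys

# Report whether the GIL is disabled; on builds without the helper,
# fall back to assuming the GIL is active.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
if gil_enabled:
    print("GIL is active - set PYTHON_GIL=0 on a free-threaded build for best throughput")
else:
    print("Free-threading is active")
```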
🌱 Multiple Seed URLs
graph = gc.crawl(
seed_urls=[
"https://example.com/products/",
"https://example.com/blog/",
"https://example.com/docs/",
],
max_depth=3
)
🔄 Incremental Crawling
# Start with initial crawl
graph1 = gc.crawl("https://example.com", max_pages=50)
# Later, continue from where you left off
graph2 = gc.crawl(base_graph=graph1, max_pages=100)
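To see only what the second pass discovered, the graph algebra described under Core Concepts can be applied to the two snapshots (a small sketch assuming the `-` operator and `nodes` attribute shown elsewhere in this README):

```python
# Pages present in graph2 but not in graph1, i.e. newly discovered
new_pages = graph2 - graph1
print(f"Discovered {len(new_pages.nodes)} new pages in the incremental pass")
```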
📡 Real-Time Dashboard & WebSocket
# Start dashboard server
uvicorn graph_crawler.api.dashboard:app --port 8000
# WebSocket endpoint for live updates
# ws://localhost:8000/ws/crawl
Installation
From PyPI (Recommended)
pip install graph-crawler
Optional Dependencies
# JavaScript/SPA rendering
pip install graph-crawler[playwright]
# Vector embeddings & ML
pip install graph-crawler[embeddings]
# MongoDB storage backend
pip install graph-crawler[mongodb]
# PostgreSQL storage backend
pip install graph-crawler[postgresql]
# Distributed crawling (Celery)
pip install graph-crawler[distributed]
# Full installation
pip install graph-crawler[all]
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.11 | 3.14+ (free-threading) |
| Memory | 512 MB | 4 GB+ |
| OS | Linux, macOS, Windows | Linux (Ubuntu 22.04+) |
Quick Start
Basic Usage
import graph_crawler as gc
# Crawl a website
graph = gc.crawl(
url="https://example.com",
max_depth=3,
max_pages=100
)
# Analyze results
print(f"Discovered {len(graph.nodes):,} pages")
print(f"Mapped {len(graph.edges):,} links")
# Persist to disk
gc.save_graph(graph, "example_graph.json")
Async API
import asyncio
import graph_crawler as gc
async def main():
    graph = await gc.async_crawl(
        url="https://example.com",
        max_depth=5,
        max_pages=1000,
        request_delay=0.25
    )
    # Process nodes concurrently
    async for node in graph.iter_nodes_async():
        print(f"[{node.response_status}] {node.url}")
    return graph
graph = asyncio.run(main())
Client Interface
from graph_crawler import GraphCrawlerClient
async with GraphCrawlerClient(
    driver="playwright",
    storage="sqlite"
) as client:
    # Configure and execute
    graph = await client.crawl(
        "https://spa-application.com",
        max_depth=4
    )
    # Save with metadata
    await client.save("spa_graph", graph, tags=["spa", "react"])
Core Concepts
Graph Model
GraphCrawler represents websites as directed graphs:
┌─────────────────────────────────────────────────────────┐
│ Graph │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node │───▶│ Node │───▶│ Node │ │
│ │ (root) │ │ /about │ │ /team │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ │ ┌────┴────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node │ │ Node │ │ Node │ │
│ │ /blog │ │/contact │ │/careers │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
Node — represents a single page with:
- URL, depth, status code
- Content hash (SHA-256) and SimHash
- Metadata (title, description, headers)
- Custom user data
Edge — represents a link between pages:
- Source and target URLs
- Link text and attributes
- Edge type (internal, external, resource)
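A quick way to see these attributes in practice. This is a sketch based on the node fields used elsewhere in this README (`url`, `depth`, `response_status`, `user_data`); the edge attribute names below are assumptions, not confirmed API:

```python
import graph_crawler as gc

graph = gc.crawl("https://example.com", max_depth=2, max_pages=25)

# Node attributes used throughout this README
for node in graph.nodes:
    print(node.url, node.depth, node.response_status)

# Edge attributes: 'source' and 'target' are hypothetical names,
# check the Edge entity for the real ones
for edge in graph.edges:
    print(getattr(edge, "source", None), "->", getattr(edge, "target", None))
```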
Graph Operations
# Graph algebra
merged = graph1 + graph2 # Union
diff = graph1 - graph2 # Difference
common = graph1 & graph2 # Intersection
symmetric = graph1 ^ graph2 # Symmetric difference
# Subgraph detection
if graph1 < graph2:
    print("graph1 is a subgraph of graph2")
# Find popular pages
popular = graph.get_popular_nodes(top_n=10, by='in_degree')
# Find orphan pages (no incoming links)
for node in graph:
    if graph.get_in_degree(node) == 0 and node.depth > 0:
        print(f"Orphan: {node.url}")
Configuration
Crawl Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | required | Starting URL |
| `seed_urls` | `list[str]` | `None` | Multiple starting URLs |
| `max_depth` | `int` | `3` | Maximum link depth from root |
| `max_pages` | `int` | `100` | Maximum pages to crawl |
| `same_domain` | `bool` | `True` | Restrict to starting domain |
| `request_delay` | `float` | `0.5` | Delay between requests (seconds) |
| `timeout` | `int` | `300` | Global timeout (seconds) |
| `driver` | `str` | `"http"` | Transport driver |
| `url_rules` | `list[URLRule]` | `[]` | URL filtering/priority rules |
| `node_plugins` | `list[Plugin]` | `[]` | Content processing plugins |
| `respect_robots` | `bool` | `True` | Honor robots.txt |
| `base_graph` | `Graph` | `None` | Continue incremental crawl |
| `low_memory_mode` | `bool` | `False` | Enable eviction for large crawls |
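Several of these parameters can be combined in a single call; all keyword names below come from the table above and the values are purely illustrative:

```python
import graph_crawler as gc

graph = gc.crawl(
    url="https://example.com",
    max_depth=4,
    max_pages=500,
    same_domain=True,
    request_delay=1.0,   # be polite: roughly one request per second
    respect_robots=True,
    driver="http",
)
```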
URL Rules
from graph_crawler import URLRule, SmartURLRule, build_smart_rules
# Pattern-based filtering
rules = [
# Skip non-HTML resources
URLRule(pattern=r"\.(pdf|zip|exe|dmg)$", should_scan=False),
# Skip admin areas
URLRule(pattern=r"/(admin|wp-admin|dashboard)/", should_scan=False),
# Prioritize product pages
URLRule(pattern=r"/products?/[\w-]+$", priority=10),
# Deprioritize pagination
URLRule(pattern=r"\?page=\d+", priority=-5),
]
# Or use smart presets
rules = build_smart_rules(
skip_extensions=[".pdf", ".zip"],
skip_paths=["/admin", "/api"],
priority_paths=["/products", "/categories"],
skip_query_params=["session", "token"]
)
graph = gc.crawl("https://shop.example.com", url_rules=rules)
Settings Files
# crawler_settings.yaml
project_name: "my_crawler"
crawler:
  max_depth: 5
  max_pages: 10000
  request_delay: 0.25
  timeout: 3600
  respect_robots: true
driver:
  type: playwright
  headless: true
  wait_for: networkidle
storage:
  type: sqlite
  path: ./data/crawl.db
plugins:
  - graph_crawler.extensions.plugins.node.SEOPlugin
  - graph_crawler.extensions.plugins.node.StructuredDataPlugin
from graph_crawler import CrawlerSettings
settings = CrawlerSettings.from_yaml("crawler_settings.yaml")
graph = gc.crawl("https://example.com", settings=settings)
Drivers
GraphCrawler supports multiple transport drivers:
| Driver | Engine | Best For | Anti-Bot | JS Rendering |
|---|---|---|---|---|
| `http` | httpx | Static sites, APIs | ❌ | ❌ |
| `async` | aiohttp | High concurrency | ❌ | ❌ |
| `playwright` | Chromium | SPAs, modern sites | ✅ | ✅ |
| `cloudscraper` | requests | Cloudflare sites | ✅ | ❌ |
Playwright Driver
graph = gc.crawl(
"https://react-application.com",
driver="playwright",
# Playwright-specific options
headless=True,
wait_for_selector=".app-loaded",
wait_for_timeout=5000,
viewport={"width": 1920, "height": 1080},
# Browser context
ignore_https_errors=True,
java_script_enabled=True
)
Storage Backends
| Backend | Capacity | Persistence | Query Speed | Use Case |
|---|---|---|---|---|
| `memory` | ~10K nodes | ❌ | ⚡ Fastest | Development, testing |
| `json` | ~50K nodes | ✅ | 🐌 Slow | Small projects, export |
| `sqlite` | ~500K nodes | ✅ | ⚡ Fast | Local production |
| `mongodb` | Unlimited | ✅ | ⚡ Fast | Distributed, cloud |
| `postgresql` | Unlimited | ✅ | ⚡ Fast | Analytics, enterprise |
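A hedged sketch of selecting a backend. The `storage` keyword mirrors the `GraphCrawlerClient` example in Quick Start; backend names come from the table above, while path and connection handling may differ in practice:

```python
import asyncio
from graph_crawler import GraphCrawlerClient

async def main():
    # "sqlite" is one of the backend names listed above
    async with GraphCrawlerClient(driver="http", storage="sqlite") as client:
        graph = await client.crawl("https://example.com", max_depth=3)
        await client.save("example_graph", graph)

asyncio.run(main())
```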
Low-Memory Mode
For crawling sites with millions of pages:
graph = gc.crawl(
"https://large-site.com",
max_pages=1_000_000,
low_memory_mode=True, # Enable eviction
eviction_threshold=50_000, # Nodes in RAM
eviction_storage="sqlite" # Where to evict
)
Plugin System
Plugin Types
| Type | Hook Point | Use Case |
|---|---|---|
| `NodePlugin` | After page fetch | Content extraction, SEO analysis |
| `EdgePlugin` | After link discovery | Link classification, filtering |
| `EnginePlugin` | Before/during crawl | URL prioritization, rate limiting |
| `ExportPlugin` | During export | Data transformation |
Built-in Node Plugins
from graph_crawler.extensions.plugins.node import (
StructuredDataPlugin, # JSON-LD, OpenGraph, Microdata, RDFa, Twitter Cards
SEOPlugin, # Meta tags, headings, schema
ContentHashPlugin, # Duplicate detection (SHA-256 + SimHash)
VectorizationPlugin, # Text embeddings
)
from graph_crawler.extensions.plugins.node.extractors import (
PhoneExtractorPlugin, # UA, US, RU phone formats
EmailExtractorPlugin, # RFC 5322 compliant
PriceExtractorPlugin, # USD, EUR, UAH with ranges
)
graph = gc.crawl(
"https://shop.example.com",
node_plugins=[
StructuredDataPlugin(),
PhoneExtractorPlugin(),
EmailExtractorPlugin(),
PriceExtractorPlugin(),
]
)
# Access extracted data
for node in graph.nodes:
print(f"Page: {node.url}")
print(f" Phones: {node.user_data.get('phones', [])}")
print(f" Emails: {node.user_data.get('emails', [])}")
print(f" Prices: {node.user_data.get('prices', [])}")
print(f" OpenGraph: {node.user_data.get('opengraph', {})}")
Custom Plugin
from graph_crawler import BaseNodePlugin, NodePluginType
class PriceMonitorPlugin(BaseNodePlugin):
    """Monitor product prices across e-commerce sites."""

    plugin_type = NodePluginType.ON_HTML_PARSED
    priority = 100

    def execute(self, context):
        if not context.html_tree:
            return context
        # Extract price using CSS selectors
        price_elem = context.html_tree.select_one('[data-price], .price, #price')
        if price_elem:
            context.user_data["price"] = {
                "raw": price_elem.get_text(strip=True),
                "currency": self._detect_currency(price_elem),
                "value": self._parse_price(price_elem)
            }
        return context
graph = gc.crawl(url, node_plugins=[PriceMonitorPlugin()])
Smart Crawling (ML)
SmartPageFinderPlugin
ML-powered plugin that uses LLM or keyword analysis to find relevant pages:
from graph_crawler.extensions.plugins.node import SmartPageFinderPlugin
plugin = SmartPageFinderPlugin(
search_prompt="Python developer jobs in Kyiv",
config={
"min_relevance_score": 0.7,
"analyze_links": True,
"model": "gpt-4o-mini" # Optional LLM
}
)
graph = gc.crawl("https://jobs.example.com", node_plugins=[plugin])
# Find target pages
for node in graph.nodes:
if node.user_data.get("is_target_page"):
print(f"Found: {node.url}")
print(f" Score: {node.user_data['relevance_score']:.2f}")
print(f" Reason: {node.user_data['relevance_reason']}")
VectorCrawlEnginePlugin
Vector-based URL prioritization using embeddings:
from graph_crawler.extensions.plugins.crawl_engine import VectorCrawlEnginePlugin
plugin = VectorCrawlEnginePlugin(
keywords=["python", "developer", "remote", "jobs"],
min_priority=1,
max_priority=15,
model_name="paraphrase-multilingual-MiniLM-L12-v2"
)
plugin.setup() # Load model
graph = gc.crawl(
"https://careers.example.com",
engine_plugins=[plugin]
)
SmartCrawlEnginePlugin
Intelligent URL prioritization before scanning:
from graph_crawler.extensions.plugins.crawl_engine import SmartCrawlEnginePlugin
plugin = SmartCrawlEnginePlugin(
search_prompt="Machine learning engineer positions",
config={
"aggressive_filtering": True, # Skip irrelevant URLs without scanning
"use_llm": False # Use fast keyword-based analysis
}
)
graph = gc.crawl("https://linkedin.com/jobs", engine_plugins=[plugin])
Anti-Bot & CAPTCHA
Anti-Bot Detection
GraphCrawler can detect and bypass various anti-bot systems:
from graph_crawler.extensions.plugins.engine import AntiBotSystem, detect_anti_bot_system
# Automatic detection
system = detect_anti_bot_system(html_content)
# Returns: AntiBotSystem.CLOUDFLARE, AKAMAI, DATADOME, PERIMETERX, etc.
Supported anti-bot systems:
- Cloudflare (Challenge, Turnstile)
- Akamai (Bot Manager)
- DataDome
- PerimeterX
- Imperva/Incapsula
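As a practical follow-up to the detection call above, the result can drive the driver choice. This is a sketch: it assumes `detect_anti_bot_system()` returns a falsy value when nothing is detected, and only the enum members listed above are used:

```python
from graph_crawler.extensions.plugins.engine import AntiBotSystem, detect_anti_bot_system

def pick_driver(html_content: str) -> str:
    """Sketch: choose a transport driver based on the detected protection."""
    system = detect_anti_bot_system(html_content)
    if system == AntiBotSystem.CLOUDFLARE:
        return "cloudscraper"   # Cloudflare-aware driver (see the Drivers table)
    if system:                  # any other detected system: fall back to a real browser
        return "playwright"
    return "http"               # assumes a falsy result means nothing was detected
```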
CAPTCHA Solving
Integration with popular CAPTCHA solving services:
from graph_crawler.extensions.plugins.engine.captcha import (
CaptchaPlugin,
create_solver,
CaptchaType
)
# Create solver
solver = create_solver(
service="2captcha", # or "anticaptcha", "capsolver"
api_key="your-api-key"
)
# Check balance
balance = solver.check_balance()
print(f"Balance: ${balance}")
# Use with crawler
plugin = CaptchaPlugin(solver=solver)
graph = gc.crawl(
"https://protected-site.com",
driver="playwright",
engine_plugins=[plugin]
)
Supported CAPTCHA types:
- reCAPTCHA v2/v3
- hCaptcha
- Image CAPTCHA
Distributed Crawling
Scale horizontally with Celery workers:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ Redis │◀────│ Worker 1 │
└─────────────┘ │ (Broker) │ └─────────────┘
└─────────────┘ │
▲ ▼
│ ┌───────────┐
┌─────────────┐ │ MongoDB │
│ Worker 2 │──────▶│ (Results) │
└─────────────┘ └───────────┘
▲
┌─────────────┐
│ Worker N │
└─────────────┘
Setup
# Start Redis
docker run -d -p 6379:6379 redis:alpine
# Start workers
celery -A graph_crawler.infrastructure.messaging worker -l INFO -c 4
Usage
from graph_crawler import EasyDistributedCrawler
crawler = EasyDistributedCrawler(
broker_url="redis://localhost:6379/0",
result_backend="redis://localhost:6379/1",
mongodb_uri="mongodb://localhost:27017"
)
# Submit crawl job
job_id = await crawler.submit(
url="https://large-site.com",
max_pages=100_000,
max_depth=10,
workers=8
)
# Monitor progress
while True:
    status = await crawler.get_status(job_id)
    print(f"Progress: {status.pages_crawled}/{status.pages_total}")
    if status.is_complete:
        break
    await asyncio.sleep(5)
# Get results
graph = await crawler.get_result(job_id)
REST API & Dashboard
REST API
Built-in FastAPI-based REST API for remote control:
# Start API server
uvicorn graph_crawler.api.rest_api:router --port 8001
Endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/crawl/start` | Start new crawl |
| POST | `/api/v1/crawl/{id}/pause` | Pause crawl |
| POST | `/api/v1/crawl/{id}/resume` | Resume crawl |
| POST | `/api/v1/crawl/{id}/stop` | Stop crawl |
| GET | `/api/v1/crawl/{id}/status` | Get crawl status |
| GET | `/api/v1/crawl/list` | List all crawls |
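A minimal client-side sketch against these endpoints using `httpx`. The request payload fields are assumptions (they mirror the `crawl()` parameters), the `id` field and the status values are hypothetical, and the real response schema may differ:

```python
import time
import httpx

API = "http://localhost:8001"

# Start a crawl; the JSON fields are assumed to mirror crawl() parameters
resp = httpx.post(
    f"{API}/api/v1/crawl/start",
    json={"url": "https://example.com", "max_depth": 3, "max_pages": 100},
)
crawl_id = resp.json().get("id")  # field name is an assumption

# Poll status until the crawl stops
while True:
    status = httpx.get(f"{API}/api/v1/crawl/{crawl_id}/status").json()
    print(status)
    if status.get("state") in {"finished", "stopped", "error"}:  # assumed states
        break
    time.sleep(5)
```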
Real-Time Dashboard
# Start dashboard
uvicorn graph_crawler.api.dashboard:app --port 8000
Features:
- 📊 Real-time statistics via WebSocket
- 📈 Live crawl progress visualization
- ⏯️ Pause/Resume/Stop controls
- 📝 Error monitoring
- 📉 Performance metrics
WebSocket Events:
- `initial_state` — Current state on connect
- `stats_update` — Statistics update
- `page_crawled` — New page scanned
- `error` — Crawl error occurred
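A small consumer for the live feed, using the third-party `websockets` package (not part of GraphCrawler). The endpoint and event names are the ones listed above; the JSON envelope with an `event` key is an assumption:

```python
import asyncio
import json
import websockets  # pip install websockets

async def watch_crawl():
    async with websockets.connect("ws://localhost:8000/ws/crawl") as ws:
        async for raw in ws:
            event = json.loads(raw)
            # Event names from the list above: initial_state, stats_update,
            # page_crawled, error
            if event.get("event") == "page_crawled":
                print("crawled:", event.get("data"))

asyncio.run(watch_crawl())
```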
Webhooks
Receive real-time notifications for crawl events:
from graph_crawler.api.webhooks import WebhookManager, WebhookEvent
# Setup webhooks
manager = WebhookManager()
manager.add_webhook(
url="https://your-server.com/webhook",
events=[
WebhookEvent.CRAWL_STARTED,
WebhookEvent.CRAWL_FINISHED,
WebhookEvent.PAGE_CRAWLED,
WebhookEvent.CRAWL_ERROR,
WebhookEvent.MILESTONE_REACHED, # Every N pages
],
secret="your-hmac-secret", # For signature verification
headers={"Authorization": "Bearer token"}
)
# Start async delivery
await manager.start()
# Integrate with crawler
await integrate_webhooks_with_crawler(event_bus, webhook_configs)
Webhook Payload:
{
  "event": "page_crawled",
  "data": {
    "url": "https://example.com/page",
    "status": 200,
    "depth": 2
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
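On the receiving side, the HMAC secret can be used to verify that a delivery really came from the crawler. A sketch with FastAPI: the signature header name (`X-Webhook-Signature`) and the signing scheme (hex HMAC-SHA256 over the raw body) are assumptions, so check the webhook module for the actual convention:

```python
import hashlib
import hmac

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
SECRET = b"your-hmac-secret"

@app.post("/webhook")
async def receive(request: Request, x_webhook_signature: str = Header(default="")):
    body = await request.body()
    # Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, x_webhook_signature):
        raise HTTPException(status_code=401, detail="bad signature")
    payload = await request.json()
    print(payload["event"], payload.get("data"))
    return {"ok": True}
```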
AI Integration
LLM-Powered Extraction
from graph_crawler.ai import ExtractionPlugin
from graph_crawler.ai.models import OpenAIModel, AnthropicModel, BedrockModel
# Configure model
model = OpenAIModel(
api_key="sk-...",
model="gpt-4o",
temperature=0
)
# Create extraction plugin
extractor = ExtractionPlugin(
model=model,
prompt="""
Extract the following from this page:
- Main topic
- Key entities (people, companies, products)
- Sentiment (positive/neutral/negative)
Return as JSON.
"""
)
graph = gc.crawl(
"https://news-site.com",
max_pages=50,
node_plugins=[extractor]
)
# Access AI-extracted data
for node in graph.nodes:
    ai_data = node.user_data.get("ai_extraction", {})
    print(f"{node.url}: {ai_data.get('main_topic')}")
Vector Search
from graph_crawler.extensions.plugins.node.vectorization import (
VectorizationPlugin,
semantic_search,
cluster_by_similarity
)
vectorizer = VectorizationPlugin(
model="text-embedding-3-small",
api_key="sk-..."
)
graph = gc.crawl("https://docs.example.com", node_plugins=[vectorizer])
# Semantic search across crawled pages
results = semantic_search(
graph=graph,
query="How to configure authentication?",
top_k=5
)
for node, score in results:
print(f"[{score:.3f}] {node.url}")
# Cluster similar pages
clusters = cluster_by_similarity(graph, method="kmeans", n_clusters=5)
Data Extraction
Built-in Extractors
| Extractor | Data Types | Formats |
|---|---|---|
| PhoneExtractor | Phone numbers | UA, US, RU, international |
| EmailExtractor | Email addresses | RFC 5322 compliant |
| PriceExtractor | Prices | USD, EUR, UAH, ranges |
| StructuredData | Schema.org | JSON-LD, Microdata, RDFa |
| OpenGraph | Social meta | og:title, og:image, etc. |
| TwitterCards | Twitter meta | twitter:card, etc. |
Structured Data Extraction
from graph_crawler.extensions.plugins.node import StructuredDataPlugin
graph = gc.crawl(
"https://shop.example.com",
node_plugins=[StructuredDataPlugin()]
)
for node in graph.nodes:
    # JSON-LD data
    jsonld = node.user_data.get("jsonld", [])
    for item in jsonld:
        if item.get("@type") == "Product":
            print(f"Product: {item.get('name')}")
            print(f"Price: {item.get('offers', {}).get('price')}")
    # OpenGraph
    og = node.user_data.get("opengraph", {})
    print(f"OG Title: {og.get('og:title')}")
    # Microdata
    microdata = node.user_data.get("microdata", [])
CLI Reference
# Crawl website
graph-crawler crawl https://example.com \
--max-depth 5 \
--max-pages 1000 \
--driver playwright \
--output ./results/
# List saved graphs
graph-crawler list --storage sqlite --path ./data/
# Graph information
graph-crawler info my_graph --detailed
# Export graph
graph-crawler export my_graph \
--format csv \
--output ./exports/graph.csv
# Compare two graphs
graph-crawler diff graph_v1 graph_v2 --show-added --show-removed
# Start API server
graph-crawler serve --host 0.0.0.0 --port 8000
# Initialize new project
graph-crawler init my_crawler_project
Architecture
GraphCrawler follows Clean Architecture principles:
┌──────────────────────────────────────────────────────────────┐
│ Presentation │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ CLI │ │ REST API │ │ WebSocket │ │
│ └────────────┘ └────────────┘ └────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Public API │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ crawl() • async_crawl() • GraphCrawlerClient │ │
│ └─────────────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Application Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Use Cases │ │ Services │ │ DTOs │ │
│ │ (Spider) │ │ (Exporter) │ │ (Mapper) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Domain Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Entities │ │ Value │ │ Interfaces │ │
│ │Graph,Node, │ │ Objects │ │IDriver, │ │
│ │ Edge │ │Settings, │ │IStorage │ │
│ └────────────┘ └────────────┘ └────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Infrastructure Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Transport │ │Persistence │ │ Messaging │ │
│ │HTTP, │ │SQLite, │ │Celery, │ │
│ │Playwright │ │MongoDB │ │Redis │ │
│ └────────────┘ └────────────┘ └────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Extensions Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Node │ │ Engine │ │ AI │ │
│ │ Plugins │ │ Plugins │ │ Models │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────────┘
Directory Structure
graph_crawler/
├── api/ # Public API surface
│ ├── sync.py # Synchronous API
│ ├── async_.py # Asynchronous API
│ ├── client/ # OOP client interface
│ ├── rest_api.py # FastAPI REST endpoints
│ ├── dashboard.py # Real-time dashboard
│ ├── webhooks.py # Webhook notifications
│ └── websocket_manager.py # WebSocket handling
├── domain/ # Core business logic
│ ├── entities/ # Graph, Node, Edge
│ ├── value_objects/ # Settings, Configs, Rules
│ ├── interfaces/ # Abstract contracts
│ └── events/ # Domain events (EventBus)
├── application/ # Application services
│ ├── use_cases/ # Crawling, export logic
│ ├── services/ # Factories, helpers
│ └── dto/ # Data transfer objects
├── infrastructure/ # External implementations
│ ├── transport/ # HTTP, Playwright drivers
│ ├── persistence/ # Storage backends
│ └── messaging/ # Celery, Redis
├── extensions/ # Plugin system
│ ├── plugins/
│ │ ├── node/ # Content extraction plugins
│ │ ├── crawl_engine/ # URL prioritization plugins
│ │ └── engine/ # Anti-bot, CAPTCHA plugins
│ └── middleware/ # Request middleware
├── ai/ # AI/ML integrations
│ ├── models/ # OpenAI, Anthropic, Bedrock
│ └── extraction/ # LLM extraction
└── shared/ # Cross-cutting concerns
├── exceptions.py # Custom exceptions
├── constants.py # Configuration
└── utils/ # Helpers
Performance
Benchmarks
Tested on AWS c5.2xlarge (8 vCPU, 16 GB RAM):
| Scenario | Pages | Time | Memory | Rate |
|---|---|---|---|---|
| Static site (HTTP) | 10,000 | 45s | 512 MB | 222 p/s |
| SPA (Playwright) | 1,000 | 180s | 2 GB | 5.5 p/s |
| Distributed (4 workers) | 100,000 | 15min | 8 GB | 111 p/s |
| Low-memory mode | 1,000,000 | 4h | 1 GB | 69 p/s |
| Python 3.14 free-threading | 10,000 | 14s | 430 MB | 714 p/s |
Optimization Tips
# 1. Use async driver for static sites
graph = gc.crawl(url, driver="async", concurrency=50)
# 2. Disable unnecessary features
graph = gc.crawl(
url,
compute_hashes=False, # Skip content hashing
extract_metadata=False, # Skip meta extraction
store_html=False # Don't persist HTML
)
# 3. Use URL rules to focus crawl
rules = [URLRule(pattern=r"/blog/", should_scan=False)]
graph = gc.crawl(url, url_rules=rules)
# 4. Enable low-memory for large crawls
graph = gc.crawl(url, max_pages=500_000, low_memory_mode=True)
# 5. Enable Python 3.14 free-threading
# export PYTHON_GIL=0