A sync-first library for building website graphs, as simple as requests!

GraphCrawler

A Python library for crawling websites and building a graph of their structure.

🚀 Python 3.14 Optimizations

GraphCrawler 4.0 is optimized for Python 3.14 with free-threading support:

⚡ 2-4x faster HTML parsing (free-threading)
🚀 3.2x faster end-to-end crawling
📉 16% lower memory usage
⏱️ 30% faster startup

Free-threading Mode (recommended)

# Enable free-threading for maximum speed (requires a free-threaded Python build)
export PYTHON_GIL=0
python your_script.py
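
Before relying on `PYTHON_GIL=0`, it can help to confirm that the interpreter is actually a free-threaded build. The sketch below uses only the standard library: `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()` (the latter exists in CPython 3.13+). The helper name `gil_status` is hypothetical and not part of GraphCrawler.

```python
import sys
import sysconfig


def gil_status() -> str:
    """Describe whether this interpreter is free-threaded and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 only on free-threaded builds; it is 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; older versions always use the GIL.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    if not free_threaded_build:
        return "standard build: GIL always enabled"
    return "free-threaded build: GIL " + ("enabled" if gil_enabled else "disabled")


if __name__ == "__main__":
    print(gil_status())
```

Running this once with and once without `PYTHON_GIL=0` shows whether the variable has any effect on the installed interpreter.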

Installation

pip install -e .

Optional dependencies

# Playwright driver (for JavaScript sites)
pip install -e ".[playwright]"

# Text vectorization (plugin)
pip install -e ".[embeddings]"

# Content extractors (plugins)
pip install -e ".[articles]"

# MongoDB/PostgreSQL storage
pip install -e ".[mongodb,postgresql]"

# Everything at once
pip install -e ".[all]"

Quick start

import graph_crawler as gc

# Synchronous API (recommended)
graph = gc.crawl("https://example.com")

print(f"Found {len(graph.nodes)} pages")
print(f"Found {len(graph.edges)} links")

API

Sync API

import graph_crawler as gc

# crawl() function
graph = gc.crawl(
    "https://example.com",
    max_depth=3,        # Maximum depth (default: 3)
    max_pages=100,      # Maximum number of pages (default: 100)
    same_domain=True,   # Stay on the current domain (default: True)
    timeout=300,        # Timeout in seconds
    request_delay=0.5,  # Delay between requests (default: 0.5)
    follow_links=True,  # Follow links (default: True)
    driver="http",      # "http", "async", "playwright"
)

# Crawler class (reusable)
with gc.Crawler(max_depth=3) as crawler:
    graph1 = crawler.crawl("https://site1.com")
    graph2 = crawler.crawl("https://site2.com")

The follow_links parameter

# follow_links=False - scan only the listed URLs
urls = ["https://site.com/page1", "https://site.com/page2"]
graph = gc.crawl(seed_urls=urls, follow_links=False)
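
To make the interaction of `max_depth`, `max_pages`, `same_domain`, and `follow_links` concrete, here is a minimal sketch of a bounded breadth-first traversal over a hard-coded link map. It is not GraphCrawler's implementation: `LINKS` and `bounded_bfs` are hypothetical, no network access happens, and treating seed URLs as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "site": page URL -> outgoing links.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
    "https://other.org/x": [],
}


def bounded_bfs(seeds, max_depth=3, max_pages=100, same_domain=True, follow_links=True):
    """Return visited URLs in visit order, honouring the four limits."""
    allowed_hosts = {urlparse(u).netloc for u in seeds}
    queue = deque((u, 0) for u in seeds)  # seeds are assumed to be depth 0
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if not follow_links or depth >= max_depth:
            continue  # the page is scanned, but its links are not enqueued
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc not in allowed_hosts:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


print(bounded_bfs(["https://example.com/"], max_depth=1))
print(bounded_bfs(["https://example.com/a", "https://example.com/c"], follow_links=False))
```

With `follow_links=False` the traversal visits exactly the seed URLs, which matches the behaviour described for `seed_urls` above.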

Async API

import asyncio
import graph_crawler as gc

async def main():
    # async_crawl() function
    graph = await gc.async_crawl("https://example.com")
    
    # AsyncCrawler class (parallel crawling)
    async with gc.AsyncCrawler() as crawler:
        graphs = await asyncio.gather(
            crawler.crawl("https://site1.com"),
            crawler.crawl("https://site2.com"),
        )
    return graphs

graphs = asyncio.run(main())
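
When many sites are crawled in parallel, it is often useful to cap concurrency. The sketch below shows the general `asyncio.gather` plus `asyncio.Semaphore` pattern with a stand-in coroutine. `fake_crawl` and `crawl_many` are hypothetical names, and nothing here calls GraphCrawler or the network.

```python
import asyncio


async def fake_crawl(url: str, delay: float) -> dict:
    """Stand-in for a crawl coroutine: sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return {"url": url, "pages": 1}


async def crawl_many(urls, limit: int = 2):
    """Run crawls concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url, delay):
        async with semaphore:
            return await fake_crawl(url, delay)

    # Later URLs get shorter delays, yet gather() returns results in input order.
    tasks = [guarded(u, 0.03 - 0.01 * i) for i, u in enumerate(urls)]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    crawl_many(["https://site1.com", "https://site2.com", "https://site3.com"])
)
print([r["url"] for r in results])
```

`asyncio.gather` preserves the order of its arguments, so results can be matched back to the input URLs by position.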

Graph operations

# Statistics
stats = graph.get_stats()
# {'total_nodes': 47, 'scanned_nodes': 45, 'total_edges': 156, ...}

# Node lookup
node = graph.get_node_by_url("https://example.com/page")

# Operations on graphs
merged = graph1 + graph2      # Union
diff = graph2 - graph1        # Difference
common = graph1 & graph2      # Intersection

# Comparison
if graph1 < graph2:
    print("graph1 is a subgraph of graph2")

# Export
graph.export_edges("edges.json", format="json")
graph.export_edges("edges.csv", format="csv")
graph.export_edges("graph.dot", format="dot")
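
The operators above read naturally as set operations. The toy class below sketches one plausible interpretation using plain sets of URLs and edges. `TinyGraph` is hypothetical, and the exact semantics of GraphCrawler's `+`, `-`, `&`, and `<` (for example, how edges are handled in a difference) are an assumption here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TinyGraph:
    """Toy graph: a set of URLs plus a set of (source, target) edges."""
    nodes: frozenset = field(default_factory=frozenset)
    edges: frozenset = field(default_factory=frozenset)

    def __add__(self, other):  # union of nodes and edges
        return TinyGraph(self.nodes | other.nodes, self.edges | other.edges)

    def __and__(self, other):  # intersection of nodes and edges
        return TinyGraph(self.nodes & other.nodes, self.edges & other.edges)

    def __sub__(self, other):  # nodes only in self, plus edges between them
        nodes = self.nodes - other.nodes
        edges = frozenset(e for e in self.edges if e[0] in nodes and e[1] in nodes)
        return TinyGraph(nodes, edges)

    def __lt__(self, other):  # strict subgraph test
        contained = self.nodes <= other.nodes and self.edges <= other.edges
        return contained and self != other


g1 = TinyGraph(frozenset({"/", "/a"}), frozenset({("/", "/a")}))
g2 = TinyGraph(frozenset({"/", "/a", "/b"}), frozenset({("/", "/a"), ("/a", "/b")}))

print(g1 < g2)            # g1 is strictly contained in g2
print((g2 - g1).nodes)    # pages that appear only in g2
```

A typical use of the difference is comparing two crawls of the same site taken at different times to find newly added pages.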

URL Rules

from graph_crawler import crawl, URLRule

rules = [
    URLRule(pattern=r".*\.pdf$", should_scan=False),     # Ignore PDFs
    URLRule(pattern=r"/products/", priority=10),         # High priority
    URLRule(pattern=r"/admin/", should_scan=False),      # Ignore admin pages
    
    # should_follow_links - controls whether links on the page are followed
    URLRule(
        pattern=r'external\.com',
        should_scan=True,           # Scan the page
        should_follow_links=False   # Do not follow its links
    ),
]

graph = crawl("https://example.com", url_rules=rules)
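
Rule patterns are regular expressions, so it is worth checking what they match before starting a long crawl. The sketch below tests the four patterns from the rules above with `re.search`. Whether GraphCrawler uses `re.search` or `re.match` internally is not stated in this README, so that choice is an assumption, and `matching_rules` is a hypothetical helper.

```python
import re

# The same patterns as in the rules above.
PATTERNS = {
    "pdf": r".*\.pdf$",
    "products": r"/products/",
    "admin": r"/admin/",
    "external": r"external\.com",
}


def matching_rules(url: str) -> list[str]:
    """Return names of patterns found anywhere in the URL (re.search semantics)."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


for url in (
    "https://example.com/docs/file.pdf",
    "https://example.com/file.pdf?download=1",
    "https://example.com/products/1",
):
    print(url, matching_rules(url))
```

Note that `.*\.pdf$` does not match a PDF URL with a query string, because `$` anchors the pattern at the end of the URL. Under `re.match` semantics, unanchored patterns such as `/products/` would not match full URLs at all, since matching starts at `https`.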

Edge Rules (edge control)

from graph_crawler import crawl, EdgeRule

edge_rules = [
    # Do not create edges when the depth difference is > 2
    EdgeRule(max_depth_diff=2, action='skip'),
    
    # Do not create edges from blog to products
    EdgeRule(
        source_pattern=r'.*/blog/.*',
        target_pattern=r'.*/products/.*',
        action='skip'
    ),
]

graph = crawl("https://example.com", edge_rules=edge_rules)
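
The two rules above can be read as predicates over a candidate edge. The sketch below models that reading with a small dataclass. `SkipRule` and `keep_edge` are hypothetical stand-ins, and the assumption that all conditions set on a rule must hold for the edge to be skipped is mine, not something this README states.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkipRule:
    """Hypothetical stand-in for an edge rule with action='skip'."""
    source_pattern: Optional[str] = None
    target_pattern: Optional[str] = None
    max_depth_diff: Optional[int] = None

    def skips(self, source, target, source_depth, target_depth) -> bool:
        """True when every condition set on this rule applies to the edge."""
        if self.max_depth_diff is not None:
            if abs(target_depth - source_depth) <= self.max_depth_diff:
                return False  # depth difference is within the allowed range
        if self.source_pattern and not re.search(self.source_pattern, source):
            return False
        if self.target_pattern and not re.search(self.target_pattern, target):
            return False
        return True


def keep_edge(rules, source, target, source_depth, target_depth) -> bool:
    """An edge is kept only if no rule skips it."""
    return not any(r.skips(source, target, source_depth, target_depth) for r in rules)


rules = [
    SkipRule(max_depth_diff=2),
    SkipRule(source_pattern=r".*/blog/.*", target_pattern=r".*/products/.*"),
]
print(keep_edge(rules, "https://example.com/blog/post", "https://example.com/products/1", 1, 2))
print(keep_edge(rules, "https://example.com/blog/a", "https://example.com/blog/b", 1, 2))
```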

ContentType (content type)

from graph_crawler import ContentType

# Content type detection
content_type = ContentType.from_content_type_header("text/html; charset=utf-8")
# ContentType.HTML

content_type = ContentType.from_url("https://api.example.com/data.json")
# ContentType.JSON

# Filtering nodes by type
html_nodes = [n for n in graph if n.content_type == ContentType.HTML]

# Checks
if content_type.is_text_based():
    print("Text content")
if content_type.is_scannable():
    print("Can scan for links")
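
For comparison, the same two detection steps (from a `Content-Type` header and from a URL) can be sketched with the standard library alone. This is not how `ContentType` is implemented. `mime_from_header` and `mime_from_url` are hypothetical helpers built on `email.message.Message` and `mimetypes.guess_type`.

```python
import mimetypes
from email.message import Message
from urllib.parse import urlparse


def mime_from_header(header_value: str) -> str:
    """Extract the bare MIME type from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type()  # parameters such as charset are dropped


def mime_from_url(url: str):
    """Guess the MIME type from the extension of the URL path, or return None."""
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    return guessed


print(mime_from_header("text/html; charset=utf-8"))
print(mime_from_url("https://api.example.com/data.json"))
```

Guessing from the URL is only a heuristic. When a response is available, the header is the more reliable source.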

Plugins

from graph_crawler import crawl, BaseNodePlugin, NodePluginType

class CustomPlugin(BaseNodePlugin):
    @property
    def name(self):
        return "custom_plugin"
    
    @property
    def plugin_type(self):
        return NodePluginType.ON_HTML_PARSED
    
    def execute(self, context):
        # context.html_tree - BeautifulSoup object
        # context.extracted_links - list of links
        # context.user_data - dictionary for custom data
        images = context.html_tree.find_all('img')
        context.user_data['image_count'] = len(images)
        return context

graph = crawl("https://example.com", plugins=[CustomPlugin()])
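
Plugin logic can be exercised without a crawl by passing it a stand-in context. The sketch below repeats the body of `execute` as a plain function and feeds it a fake context built from the standard library. `TagCollector` and `count_images` are hypothetical, and the context is assumed to need only the three attributes listed in the comments above.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class TagCollector(HTMLParser):
    """Minimal stand-in for a parsed tree: records every start tag name."""

    def __init__(self, html: str):
        super().__init__()
        self.tags = []
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def find_all(self, name):
        # Mimics only the call shape of BeautifulSoup's find_all(name).
        return [t for t in self.tags if t == name]


def count_images(context):
    """Same logic as CustomPlugin.execute above, written as a plain function."""
    images = context.html_tree.find_all("img")
    context.user_data["image_count"] = len(images)
    return context


context = SimpleNamespace(
    html_tree=TagCollector('<p><img src="a.png"><img src="b.png"/></p>'),
    extracted_links=[],
    user_data={},
)
count_images(context)
print(context.user_data)
```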

Drivers

Driver	Description	Use case
`http`	Async HTTP (aiohttp)	Static sites (default)
`async`	Alias for http	Backward compatibility
`playwright`	Browser with JS rendering	JavaScript sites

# HTTP driver (default)
graph = gc.crawl("https://example.com", driver="http")

# Playwright for JavaScript sites
graph = gc.crawl("https://spa-example.com", driver="playwright")

Storage

Storage	Description	Recommended for
`memory`	In memory	< 1,000 pages
`json`	JSON file	1,000 - 20,000 pages
`sqlite`	SQLite database	20,000+ pages
`postgresql`	PostgreSQL	Large projects
`mongodb`	MongoDB	Large projects
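
The size thresholds in the table can be encoded as a small helper. `suggest_storage` is a hypothetical function, not part of GraphCrawler. It covers only the three backends that have page-count ranges in the table, and placing exactly 20,000 pages in the `sqlite` bucket is an assumption.

```python
def suggest_storage(expected_pages: int) -> str:
    """Map an expected page count to a storage name from the table above."""
    if expected_pages < 0:
        raise ValueError("expected_pages must be non-negative")
    if expected_pages < 1_000:
        return "memory"
    if expected_pages < 20_000:
        return "json"
    # 20,000+ pages; PostgreSQL and MongoDB are listed without a page threshold.
    return "sqlite"


for pages in (500, 5_000, 50_000):
    print(pages, suggest_storage(pages))
```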

Project structure

graph_crawler/
├── api/              # Simple API (crawl, Crawler, async_crawl)
├── client/           # GraphCrawlerClient
├── core/             # Node, Edge, Graph, Events, Models
├── crawler/          # Spider, Scheduler, LinkProcessor, Filters
├── drivers/          # HTTP, Playwright drivers
├── storage/          # Memory, JSON, SQLite, PostgreSQL, MongoDB
├── plugins/          # Node plugins (vectorization, content_extractors)
├── middleware/       # Rate limiting, Retry, Robots.txt, Proxy
├── factories/        # Driver, Storage factories
├── containers/       # Dependency Injection containers
├── adapters/         # BeautifulSoup adapter
├── exporters/        # JSON, CSV, DOT exporters
└── utils/            # URL utils, DNS cache, Bloom filter

Testing

pytest
pytest --cov=graph_crawler

Requirements

Python 3.11+ (minimum version)
Dependencies: see requirements.txt

Which Python version should you choose?

Version	Recommended for	Notes
3.14	Maximum speed	Free-threading (GIL=0), ~3.2x faster
3.12-3.13	Visualization out of the box	Stable dependencies (pyvis, networkx)
3.11	Compatibility	All features work

Download files

Source Distribution

graph_crawler-4.0.6.tar.gz (687.0 kB view details)

Uploaded Feb 7, 2026 Source

Built Distribution

graph_crawler-4.0.6-py3-none-any.whl (815.1 kB view details)

Uploaded Feb 7, 2026 Python 3

File details

Details for the file graph_crawler-4.0.6.tar.gz.

File metadata

Download URL: graph_crawler-4.0.6.tar.gz
Upload date: Feb 7, 2026
Size: 687.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.6.tar.gz
Algorithm	Hash digest
SHA256	`b9e8a72da0ccef72f70ea94e79a22a4632f94739e9c3474cee93115afd41ddfa`
MD5	`630244747c44ea9aece2f3e6c3c8f14f`
BLAKE2b-256	`928266049077daad094cf5b87b63f0ebb0545e9d28e7a36c373170c58d4441f6`
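
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `verify` are hypothetical helper names, and the file is assumed to have been downloaded to the given local path.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("graph_crawler-4.0.6.tar.gz", "<SHA256 from the table above>")` should return `True` for an intact download.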

File details

Details for the file graph_crawler-4.0.6-py3-none-any.whl.

File metadata

Download URL: graph_crawler-4.0.6-py3-none-any.whl
Upload date: Feb 7, 2026
Size: 815.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for graph_crawler-4.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`80c858c52575086e38d04682d98b4b65f0d0c6627b582bc6be7b56b824406260`
MD5	`28f98b83f94fb6f8071cff73fe3a8249`
BLAKE2b-256	`a1b8f8c350841db43e46d5254277d214391bd978074a40095cf62fc310e6c2ed`
