
A sync-first library for building website graphs - as simple as requests!

Project description

GraphCrawler

Python 3.11+ · PyPI · License: MIT

A library for building a graph of a website's structure.

Installation

pip install graph-crawler

Optional dependencies:

pip install graph-crawler[playwright]    # JavaScript-heavy sites
pip install graph-crawler[embeddings]    # Vectorization / embeddings
pip install graph-crawler[mongodb]       # MongoDB storage
pip install graph-crawler[all]           # Everything

Usage

import graph_crawler as gc

# Basic crawl
graph = gc.crawl("https://example.com", max_depth=2, max_pages=50)

print(f"Сторінок: {len(graph.nodes)}")
print(f"Посилань: {len(graph.edges)}")

# Save the graph
gc.save_graph(graph, "site.json")
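
The page shows saving but not loading; assuming a matching load_graph helper exists (hypothetical, not confirmed by this page), restoring a saved graph might look like:

# Hypothetical counterpart to save_graph - verify the actual name in the project docs
graph = gc.load_graph("site.json")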

Async API

import asyncio
import graph_crawler as gc

async def main():
    graph = await gc.async_crawl("https://example.com")
    return graph

graph = asyncio.run(main())
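
Since async_crawl is an ordinary coroutine, several crawls can run concurrently with standard asyncio tooling; a minimal sketch (the URLs are placeholders):

import asyncio
import graph_crawler as gc

async def crawl_many(urls):
    # Launch one async_crawl per URL and await them all together
    return await asyncio.gather(*(gc.async_crawl(u) for u in urls))

graphs = asyncio.run(crawl_many(["https://example.com", "https://example.org"]))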

crawl() parameters

Parameter       Default   Description
max_depth       3         Crawl depth
max_pages       100       Page limit
same_domain     True      Stay on the starting domain only
request_delay   0.5       Delay between requests (sec)
timeout         300       Overall timeout (sec)
driver          "http"    Driver: http, playwright

URL Rules

from graph_crawler import crawl, URLRule

rules = [
    URLRule(pattern=r"\.pdf$", should_scan=False),
    URLRule(pattern=r"/admin/", should_scan=False),
    URLRule(pattern=r"/products/", priority=10),
]

graph = crawl("https://example.com", url_rules=rules)

Graph operations

# Statistics
stats = graph.get_stats()

# Lookup by URL
node = graph.get_node_by_url("https://example.com/page")

# Merge two graphs
merged = graph1 + graph2

# Export
graph.export_edges("edges.csv", format="csv")
graph.export_edges("graph.dot", format="dot")

Drivers

Driver       Purpose
http         Static sites (default)
playwright   JavaScript/SPA sites

# Playwright for JavaScript-rendered sites
graph = gc.crawl("https://spa-site.com", driver="playwright")

Storage

Type      Recommended for
memory    < 1K pages
json      1K - 20K pages
sqlite    20K+ pages
mongodb   Large projects
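
The page does not show how a backend is selected. As a purely hypothetical sketch, assuming crawl() accepts a storage argument (not confirmed here; check the project docs for the real mechanism):

# Hypothetical parameter name - the actual API may differ
graph = gc.crawl("https://example.com", storage="sqlite")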

CLI

graph-crawler crawl https://example.com --max-depth 2
graph-crawler list
graph-crawler info graph_name

Requirements

  • Python 3.11+

License

MIT



Download files

Download the file for your platform.

Source Distribution

graph_crawler-4.0.20.tar.gz (749.5 kB)

Uploaded Source

Built Distribution


graph_crawler-4.0.20-py3-none-any.whl (875.5 kB)

Uploaded Python 3

File details

Details for the file graph_crawler-4.0.20.tar.gz.

File metadata

  • Download URL: graph_crawler-4.0.20.tar.gz
  • Upload date:
  • Size: 749.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for graph_crawler-4.0.20.tar.gz
Algorithm Hash digest
SHA256 0525a162a55bd7e8d52f6f523119814aa6605cd7f79174dafdf80597e9a5a0c4
MD5 4cf04da2724b4895dc27c2d37a1e06bb
BLAKE2b-256 c19bf8721220c2df00e84ecf783966645e51853c62d2bf677d89a927fa8ab4e3


File details

Details for the file graph_crawler-4.0.20-py3-none-any.whl.

File metadata

  • Download URL: graph_crawler-4.0.20-py3-none-any.whl
  • Upload date:
  • Size: 875.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for graph_crawler-4.0.20-py3-none-any.whl
Algorithm Hash digest
SHA256 ff2d3a36577a7eb2f007496ecdc6b223eda0dd06943242c9f7cbfc8b0f0f9e2b
MD5 fec87922d6d8f7116ba27b9e794e25d3
BLAKE2b-256 7c3638c5e8e80fe8801bcbcbe45a5f8cbf3f970cc8e3fa9918de903aa46e13fe

