
Ergane

PyPI version · License: MIT · Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • Programmatic API — Crawler, crawl(), and stream() let you embed scraping in any Python application
  • Hook System — Intercept requests and responses with the CrawlHook protocol
  • HTTP/2 & Async — Fast concurrent connections with per-domain rate limiting and retry logic
  • Fast Parsing — Selectolax HTML parsing (16x faster than BeautifulSoup)
  • Built-in Presets — Pre-configured schemas for popular sites (no coding required)
  • Custom Schemas — Define Pydantic models with CSS selectors and type coercion
  • Multi-Format Output — Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
  • Response Caching — SQLite-based caching for faster development and debugging
  • MCP Server — Expose scraping tools to LLMs via the Model Context Protocol
  • JavaScript Rendering — Render JS-heavy pages via Playwright
  • Production Ready — robots.txt compliance, graceful shutdown, checkpoints, proxy support
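
The hook system can be pictured as a small structural protocol: any object with an on_request method (modify or skip a request) and an on_response method (modify or discard a response) qualifies. Below is a minimal sketch using typing.Protocol; the exact signatures, and whether the real CrawlHook methods are async or operate on CrawlRequest/CrawlResponse objects rather than dicts, are assumptions here, not Ergane's documented API:

```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class CrawlHook(Protocol):
    """Structural sketch of a crawl hook: two interception points."""
    def on_request(self, request: dict) -> Optional[dict]: ...
    def on_response(self, response: dict) -> Optional[dict]: ...

class UserAgentHook:
    """Example: stamp every outgoing request with a custom User-Agent."""
    def on_request(self, request: dict) -> Optional[dict]:
        request.setdefault("headers", {})["User-Agent"] = "my-bot/1.0"
        return request          # returning None would skip the request

    def on_response(self, response: dict) -> Optional[dict]:
        return response         # returning None would discard the response

hook = UserAgentHook()
print(isinstance(hook, CrawlHook))  # True: structural match via runtime_checkable
```

Because the protocol is structural, a hook never has to inherit from a library base class; duck typing is enough.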

Installation

pip install ergane

# With JavaScript rendering support
pip install ergane[js]

# With MCP server support
pip install ergane[mcp]

Quick Start

CLI

# Use a built-in preset (no code needed)
ergane --preset quotes -o quotes.csv

# Crawl a custom URL
ergane -u https://example.com -n 100 -o data.parquet

# List available presets
ergane --list-presets

Python

import asyncio
from ergane import Crawler

async def main():
    async with Crawler(
        urls=["https://quotes.toscrape.com"],
        max_pages=20,
    ) as crawler:
        async for item in crawler.stream():
            print(item.url, item.title)

asyncio.run(main())
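
stream() is an async generator, so items can be consumed or written out as workers finish rather than after the whole crawl completes. The same consumption pattern is shown below with a stand-in generator so it runs without a network or Ergane itself; Item and fake_stream are illustrative stand-ins, not Ergane API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Item:
    url: str
    title: str

# Stand-in for crawler.stream(): yields items one at a time as they
# are scraped (the real stream() yields parsed models).
async def fake_stream(n: int):
    for i in range(n):
        await asyncio.sleep(0)  # simulate yielding to the event loop for I/O
        yield Item(url=f"https://example.com/{i}", title=f"Page {i}")

async def main() -> list[Item]:
    # Collect incrementally; each item could equally be written out on arrival.
    return [item async for item in fake_stream(3)]

items = asyncio.run(main())
print([item.url for item in items])
```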

MCP Server

pip install ergane[mcp]
ergane mcp

Add to your Claude Desktop or Claude Code config:

{
  "mcpServers": {
    "ergane": {
      "command": "ergane",
      "args": ["mcp"]
    }
  }
}

The server exposes four tools: list_presets_tool, extract_tool, scrape_preset_tool, and crawl_tool.

Documentation

| Guide | Description |
| --- | --- |
| CLI Reference | Commands, flags, presets, schemas, config files, troubleshooting |
| Python Library | Crawler API, hooks, typed extraction, authentication, advanced usage |
| MCP Server | Setup, tool reference, error handling, result format |

Built-in Presets

| Preset | Site | Fields Extracted |
| --- | --- | --- |
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |
| amazon-products | amazon.com | title, price, rating, reviews, link |
| ebay-listings | ebay.com | title, price, condition, shipping, link |
| wikipedia-articles | en.wikipedia.org | title, link |
| bbc-news | bbc.com/news | title, summary, link |

Architecture

Ergane separates the engine (pure async library) from its three interfaces: the CLI (Rich progress bars, signal handling), the Python library (direct import), and the MCP server (LLM integration). Hooks plug into the pipeline at two points: after scheduling and after fetching.

         CLI (main.py)              Python Library             MCP Server
    ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
    │  Click options        │  │  from ergane import   │  │  FastMCP (stdio)     │
    │  Rich progress bar    │  │  Crawler / crawl()    │  │  4 tools + resources │
    │  Signal handling      │  │  stream()             │  │  ergane mcp          │
    │  CrawlOptions config  │  │                       │  │                      │
    └──────────┬───────────┘  └────────────┬──────────┘  └──────────┬───────────┘
               │                           │                        │
               └───────────────┬───────────┴────────────────────────┘
                              │
                              ▼
               ┌──────────────────────────────────┐
               │         Crawler  (engine)         │
               │    Pure async · no I/O concerns   │
               │    Spawns N worker coroutines     │
               └──────────────┬───────────────────┘
                              │
              ┌───────────────┼───────────────────┐
              │               │                   │
              ▼               ▼                   ▼
      ┌──────────────┐ ┌───────────┐   ┌──────────────────────┐
      │  Scheduler   │ │  Fetcher  │   │   Pipeline           │
      │  URL frontier│ │  HTTP/2   │   │  BatchWriter strategy│
      │  dedup queue │ │  retries  │   │  per-format writers  │
      └──────┬───────┘ └─────┬─────┘   └──────────────────────┘
             │               │
             ▼               ▼
  ┌──────────────────────────────────────────────────┐
  │                Worker loop (× N)                  │
  │                                                   │
  │  1. Scheduler.get()   → CrawlRequest              │
  │  2. hooks.on_request  → modify / skip             │
  │  3. Fetcher.fetch()   → CrawlResponse             │
  │  4. hooks.on_response → modify / discard          │
  │  5. Parser.extract()  → Pydantic model / dict     │
  │  6. Pipeline.add()    → buffered output           │
  │  7. extract_links()   → new URLs → Scheduler      │
  └──────────────────────────────────────────────────┘
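
The seven steps above can be sketched as a single worker coroutine driven by an asyncio.Queue. Everything here is a stand-in (PassthroughHooks, the lambda fetcher and extractor, a plain list as the pipeline); the real engine uses richer CrawlRequest/CrawlResponse types and runs N such workers concurrently:

```python
import asyncio

class PassthroughHooks:
    # Real hooks may rewrite requests/responses or return None to skip/discard.
    def on_request(self, request):
        return request
    def on_response(self, response):
        return response

async def worker(queue, hooks, fetch, extract, pipeline, extract_links):
    """One worker coroutine following the seven steps in the diagram."""
    while True:
        request = await queue.get()             # 1. Scheduler.get()
        if request is None:                     # sentinel: frontier drained
            break
        request = hooks.on_request(request)     # 2. hooks.on_request
        if request is None:
            continue
        response = await fetch(request)         # 3. Fetcher.fetch()
        response = hooks.on_response(response)  # 4. hooks.on_response
        if response is None:
            continue
        pipeline.append(extract(response))      # 5-6. parse, buffer output
        for url in extract_links(response):     # 7. new URLs -> Scheduler
            await queue.put(url)

async def demo():
    queue = asyncio.Queue()
    for url in ("https://a.example", "https://b.example"):
        await queue.put(url)
    await queue.put(None)  # sentinel so the single worker exits

    pipeline = []
    await worker(
        queue,
        PassthroughHooks(),
        fetch=lambda req: asyncio.sleep(0, result={"url": req, "body": "<html>"}),
        extract=lambda resp: {"url": resp["url"]},
        pipeline=pipeline,
        extract_links=lambda resp: [],  # no link discovery in this sketch
    )
    return pipeline

print(asyncio.run(demo()))
```

The sentinel-based shutdown is one simple way to drain a frontier; a production scheduler would instead track in-flight work and close the queue when both it and the frontier are empty.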

  ┌──────────────────────────────────────────────────┐
  │               Cross-cutting concerns              │
  │                                                   │
  │  Cache ─── SQLite response cache with TTL         │
  │  Checkpoint ─ periodic JSON snapshots for resume  │
  │  Schema ── YAML → FieldConfig → extraction        │
  └──────────────────────────────────────────────────┘
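
The Cache component above can be approximated in a few lines of sqlite3: store each body with a fetch timestamp and treat rows older than the TTL as misses. The table layout, class name, and method names below are illustrative assumptions, not Ergane's actual cache schema:

```python
import sqlite3
import time

class ResponseCache:
    """Sketch of a TTL-based SQLite response cache."""

    def __init__(self, path=":memory:", ttl=3600):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
        )

    def get(self, url):
        row = self.db.execute(
            "SELECT body, fetched_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > self.ttl:
            return None  # miss, or entry older than the TTL
        return row[0]

    def put(self, url, body):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (url, body, time.time()),
        )
        self.db.commit()

cache = ResponseCache(ttl=60)
cache.put("https://example.com", "<html>...</html>")
print(cache.get("https://example.com") is not None)  # fresh hit
```

Expired rows are simply ignored on read; a periodic DELETE of stale rows would keep the file from growing during long development sessions.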

License

MIT


Download files

Download the file for your platform.

Source Distribution

ergane-0.7.3.tar.gz (252.5 kB)


Built Distribution

ergane-0.7.3-py3-none-any.whl (67.4 kB)


File details

Details for the file ergane-0.7.3.tar.gz.

File metadata

  • Download URL: ergane-0.7.3.tar.gz
  • Upload date:
  • Size: 252.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.7.3.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6eea996a2663f5de313483260ce489907be650b938241ca742bbd02414e4d55f |
| MD5 | 9247918dde4e029de311d269e90eb213 |
| BLAKE2b-256 | ca730821614a087fdb50febb54b1d181ff4953250c0bfcbc68c8232ec6aa408f |


Provenance

The following attestation bundles were made for ergane-0.7.3.tar.gz:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ergane-0.7.3-py3-none-any.whl.

File metadata

  • Download URL: ergane-0.7.3-py3-none-any.whl
  • Upload date:
  • Size: 67.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.7.3-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6785c468b888d8af0230aa9bbe42e5f336ad8cba74631b7e3f92568a5f6efc51 |
| MD5 | d794e4e1763ca0a08693c9432f7844c2 |
| BLAKE2b-256 | 629bb0804660d526159a4408c423564d6d8ba7b5888c395c177cee02e2aa0f36 |


Provenance

The following attestation bundles were made for ergane-0.7.3-py3-none-any.whl:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
