
High-performance async web scraper with selectolax parsing

Project description

Ergane

License: MIT · Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • Programmatic API — Crawler, crawl(), and stream() let you embed scraping in any Python application
  • Hook System — Intercept requests and responses with the CrawlHook protocol
  • HTTP/2 & Async — Fast concurrent connections with per-domain rate limiting and retry logic
  • Fast Parsing — Selectolax HTML parsing, benchmarked at roughly 16× faster than BeautifulSoup
  • Built-in Presets — Pre-configured schemas for popular sites (no coding required)
  • Custom Schemas — Define Pydantic models with CSS selectors and type coercion
  • Multi-Format Output — Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
  • Response Caching — SQLite-based caching for faster development and debugging
  • MCP Server — Expose scraping tools to LLMs via the Model Context Protocol
  • JavaScript Rendering — Render JS-heavy pages via Playwright
  • Production Ready — robots.txt compliance, graceful shutdown, checkpoints, proxy support
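
The per-domain rate limiting mentioned above can be sketched with nothing but asyncio and the standard library. This is an illustrative limiter, not ergane's internal implementation (the class and method names here are hypothetical): each domain gets its own lock and a timestamp of its last request, so requests to the same domain are spaced out while different domains proceed concurrently.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit

class DomainRateLimiter:
    """Allow at most one request per `delay` seconds per domain (sketch only)."""

    def __init__(self, delay: float = 1.0) -> None:
        self.delay = delay
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._last: dict[str, float] = {}

    async def acquire(self, url: str) -> None:
        domain = urlsplit(url).netloc
        async with self._locks[domain]:
            # Sleep only if the same domain was hit less than `delay` ago.
            elapsed = time.monotonic() - self._last.get(domain, 0.0)
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self._last[domain] = time.monotonic()

async def demo() -> list[str]:
    limiter = DomainRateLimiter(delay=0.05)
    order: list[str] = []

    async def hit(url: str) -> None:
        await limiter.acquire(url)
        order.append(urlsplit(url).netloc)

    # Two requests to the same domain are serialized; the other domain is not.
    await asyncio.gather(
        hit("https://a.example/1"),
        hit("https://a.example/2"),
        hit("https://b.example/1"),
    )
    return order

print(asyncio.run(demo()))
```

Because only same-domain requests share a lock, a slow or throttled site never stalls the rest of the crawl.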

Installation

pip install ergane

# With JavaScript rendering support
pip install ergane[js]

# With MCP server support
pip install ergane[mcp]

Quick Start

CLI

# Use a built-in preset (no code needed)
ergane --preset quotes -o quotes.csv

# Crawl a custom URL
ergane -u https://example.com -n 100 -o data.parquet

# List available presets
ergane --list-presets

Python

import asyncio
from ergane import Crawler

async def main():
    async with Crawler(
        urls=["https://quotes.toscrape.com"],
        max_pages=20,
    ) as crawler:
        async for item in crawler.stream():
            print(item.url, item.title)

asyncio.run(main())

MCP Server

pip install ergane[mcp]
ergane mcp

Add to your Claude Desktop or Claude Code config:

{
  "mcpServers": {
    "ergane": {
      "command": "ergane",
      "args": ["mcp"]
    }
  }
}

The server exposes four tools: list_presets_tool, extract_tool, scrape_preset_tool, and crawl_tool.

Documentation

Guide            Description
CLI Reference    Commands, flags, presets, schemas, config files, troubleshooting
Python Library   Crawler API, hooks, typed extraction, authentication, advanced usage
MCP Server       Setup, tool reference, error handling, result format

Built-in Presets

Preset              Site                  Fields Extracted
hacker-news         news.ycombinator.com  title, link, score, author, comments
github-repos        github.com/search     name, description, stars, language, link
reddit              old.reddit.com        title, subreddit, score, author, comments, link
quotes              quotes.toscrape.com   quote, author, tags
amazon-products     amazon.com            title, price, rating, reviews, link
ebay-listings       ebay.com              title, price, condition, shipping, link
wikipedia-articles  en.wikipedia.org      title, link
bbc-news            bbc.com/news          title, summary, link

Architecture

Ergane separates the engine (pure async library) from its three interfaces: the CLI (Rich progress bars, signal handling), the Python library (direct import), and the MCP server (LLM integration). Hooks plug into the pipeline at two points: after scheduling and after fetching.
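
The two hook points can be pictured with a small structural protocol. This is a sketch, not ergane's actual CrawlHook definition: the request/response types and the exact method signatures are assumptions, but the contract matches the description above (on_request may modify or skip, on_response may modify or discard, here signalled by returning None).

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol

@dataclass
class Request:          # stand-in for ergane's real request type
    url: str
    headers: dict[str, str] = field(default_factory=dict)

@dataclass
class Response:         # stand-in for ergane's real response type
    url: str
    status: int
    body: str

class CrawlHookSketch(Protocol):
    """Hypothetical shape of a hook; None means skip/discard."""
    def on_request(self, request: Request) -> Optional[Request]: ...
    def on_response(self, response: Response) -> Optional[Response]: ...

class AuthHook:
    """Example hook: inject an auth header, drop non-200 responses."""

    def on_request(self, request: Request) -> Optional[Request]:
        request.headers["Authorization"] = "Bearer <token>"
        return request

    def on_response(self, response: Response) -> Optional[Response]:
        return response if response.status == 200 else None
```

Because hooks sit between the scheduler and the fetcher, and between the fetcher and the parser, a single hook class can handle both authentication and response filtering without touching engine code.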

         CLI (main.py)              Python Library             MCP Server
    ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
    │  Click options        │  │  from ergane import   │  │  FastMCP (stdio)     │
    │  Rich progress bar    │  │  Crawler / crawl()    │  │  4 tools + resources │
    │  Signal handling      │  │  stream()             │  │  ergane mcp          │
    │  CrawlOptions config  │  │                       │  │                      │
    └──────────┬───────────┘  └────────────┬──────────┘  └──────────┬───────────┘
               │                           │                        │
               └───────────────┬───────────┴────────────────────────┘
                              │
                              ▼
               ┌──────────────────────────────────┐
               │         Crawler  (engine)         │
               │    Pure async · no I/O concerns   │
               │    Spawns N worker coroutines     │
               └──────────────┬───────────────────┘
                              │
              ┌───────────────┼───────────────────┐
              │               │                   │
              ▼               ▼                   ▼
      ┌──────────────┐ ┌───────────┐   ┌──────────────────────┐
      │  Scheduler   │ │  Fetcher  │   │   Pipeline           │
      │  URL frontier│ │  HTTP/2   │   │  BatchWriter strategy│
      │  dedup queue │ │  retries  │   │  per-format writers  │
      └──────┬───────┘ └─────┬─────┘   └──────────────────────┘
             │               │
             ▼               ▼
  ┌──────────────────────────────────────────────────┐
  │                Worker loop (× N)                  │
  │                                                   │
  │  1. Scheduler.get()   → CrawlRequest              │
  │  2. hooks.on_request  → modify / skip             │
  │  3. Fetcher.fetch()   → CrawlResponse             │
  │  4. hooks.on_response → modify / discard          │
  │  5. Parser.extract()  → Pydantic model / dict     │
  │  6. Pipeline.add()    → buffered output           │
  │  7. extract_links()   → new URLs → Scheduler      │
  └──────────────────────────────────────────────────┘
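
The seven-step worker loop above can be sketched as a plain asyncio coroutine. Everything here (the hook dict, the stand-in fetcher and extractor) is illustrative scaffolding rather than ergane's actual code; the point is the shape of the loop: get, hook, fetch, hook, extract, collect, enqueue discovered links.

```python
import asyncio

async def worker(queue, hooks, fetch, extract, results):
    # One worker coroutine; the engine spawns N of these (sketch, not ergane's code).
    while True:
        url = await queue.get()                     # 1. Scheduler.get()
        try:
            if hooks.get("on_request") and not hooks["on_request"](url):
                continue                            # 2. hook may skip
            response = await fetch(url)             # 3. Fetcher.fetch()
            if hooks.get("on_response") and not hooks["on_response"](response):
                continue                            # 4. hook may discard
            item = extract(response)                # 5. Parser.extract()
            results.append(item)                    # 6. Pipeline.add()
            for link in item.get("links", []):      # 7. extract_links → Scheduler
                queue.put_nowait(link)
        finally:
            queue.task_done()

async def demo() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    hooks = {"on_response": lambda r: r["html"] != ""}  # discard empty bodies

    async def fetch(url):                           # stand-in fetcher, no network
        return {"url": url, "html": "<title>x</title>"}

    def extract(response):                          # root page "links" to one child
        links = ["https://a/child"] if response["url"].endswith("root") else []
        return {"url": response["url"], "links": links}

    queue.put_nowait("https://a/root")
    task = asyncio.create_task(worker(queue, hooks, fetch, extract, results))
    await queue.join()                              # wait until frontier drains
    task.cancel()
    return [item["url"] for item in results]

print(asyncio.run(demo()))
```

queue.join() returns once every enqueued URL has been matched by a task_done() call, which is how the loop knows the frontier is exhausted even though workers keep adding links as they go.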

  ┌──────────────────────────────────────────────────┐
  │               Cross-cutting concerns              │
  │                                                   │
  │  Cache ─── SQLite response cache with TTL         │
  │  Checkpoint ─ periodic JSON snapshots for resume  │
  │  Schema ── YAML → FieldConfig → extraction        │
  └──────────────────────────────────────────────────┘

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ergane-0.7.1.tar.gz (235.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ergane-0.7.1-py3-none-any.whl (66.5 kB view details)

Uploaded Python 3

File details

Details for the file ergane-0.7.1.tar.gz.

File metadata

  • Download URL: ergane-0.7.1.tar.gz
  • Upload date:
  • Size: 235.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.7.1.tar.gz
Algorithm Hash digest
SHA256 969cb6f9c1e19e25ab3108bd58deaf28e9dabc68f0df566a62633756aacd64a9
MD5 f49217c9bd2a057305d1aa1c0e9fd033
BLAKE2b-256 a4b3f27cfddadf0f35d361e42605c221cfb3aac3325400bf5a8d672f1f1d35e9

See more details on using hashes here.
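
A downloaded file can be checked against the SHA256 digest above with only the standard library. The helper below is generic; the expected digest is the one published for ergane-0.7.1.tar.gz on this page:

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in chunks so large archives are not loaded into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "969cb6f9c1e19e25ab3108bd58deaf28e9dabc68f0df566a62633756aacd64a9"
# After downloading the sdist:
# assert sha256_of("ergane-0.7.1.tar.gz") == EXPECTED
```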

Provenance

The following attestation bundles were made for ergane-0.7.1.tar.gz:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ergane-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: ergane-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 66.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 236434157cbe2960a07eb417d6883786d286383416072bbda5abdf941b3cfbbf
MD5 2f5b8fa498eeb7385fad1bfa84eaffb0
BLAKE2b-256 72658bfcc7c6ca1ff9c8b8b676086bc165a4d984a71c2494b1d1c5b6ee20cac8

See more details on using hashes here.

Provenance

The following attestation bundles were made for ergane-0.7.1-py3-none-any.whl:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
