Skip to main content

High-performance async web scraper with selectolax parsing

Project description

Ergane

PyPI version License: MIT Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • Programmatic APICrawler, crawl(), and stream() let you embed scraping in any Python application
  • Hook System — Intercept requests and responses with the CrawlHook protocol
  • HTTP/2 & Async — Fast concurrent connections with per-domain rate limiting and retry logic
  • Fast Parsing — Selectolax HTML parsing (16x faster than BeautifulSoup)
  • Built-in Presets — Pre-configured schemas for popular sites (no coding required)
  • Custom Schemas — Define Pydantic models with CSS selectors and type coercion
  • Multi-Format Output — Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
  • Response Caching — SQLite-based caching for faster development and debugging
  • MCP Server — Expose scraping tools to LLMs via the Model Context Protocol
  • Production Ready — robots.txt compliance, graceful shutdown, checkpoints, proxy support

Installation

pip install ergane

# With MCP server support (optional)
pip install ergane[mcp]

Quick Start

CLI — run from your terminal

# Use a built-in preset (no code needed)
ergane --preset quotes -o quotes.csv

# Crawl a custom URL
ergane -u https://example.com -n 100 -o data.parquet

# List available presets
ergane --list-presets

Python — embed in your application

import asyncio
from ergane import Crawler

async def main():
    async with Crawler(
        urls=["https://quotes.toscrape.com"],
        max_pages=20,
    ) as crawler:
        async for item in crawler.stream():
            print(item.url, item.title)

asyncio.run(main())

Python Library

Ergane's engine is a pure async library. The CLI is a thin wrapper around it — everything the CLI can do, your code can do too.

Crawler

The main entry point. Use it as an async context manager:

from ergane import Crawler

async with Crawler(
    urls=["https://example.com"],
    max_pages=50,
    concurrency=10,
    rate_limit=5.0,
) as crawler:
    results = await crawler.run()      # collect all items

Key parameters:

Parameter Default Description
urls (required) Seed URL(s) to start crawling
schema None Pydantic model for typed extraction
concurrency 10 Number of concurrent workers
max_pages 100 Maximum pages to crawl
max_depth 3 Maximum link-follow depth
rate_limit 10.0 Requests per second per domain
timeout 30.0 HTTP request timeout (seconds)
same_domain True Only follow links on the seed domain
hooks None List of CrawlHook instances
output None File path to write results
output_format "auto" csv, excel, parquet, json, jsonl, sqlite
cache False Enable SQLite response caching

run()

Executes the crawl and returns all extracted items as a list:

async with Crawler(urls=["https://example.com"], max_pages=10) as c:
    results = await c.run()
    print(f"Got {len(results)} items")

stream()

Yields items as they arrive — memory-efficient for large crawls:

async with Crawler(urls=["https://example.com"], max_pages=500) as c:
    async for item in c.stream():
        process(item)  # handle each item immediately

crawl()

One-shot convenience function — creates a Crawler, runs it, returns results:

from ergane import crawl

results = await crawl(
    urls=["https://example.com"],
    max_pages=10,
    concurrency=5,
)

Typed Extraction with Schemas

Pass a Pydantic model with CSS selectors to extract structured data:

from datetime import datetime
from pydantic import BaseModel
from ergane import Crawler, selector

class Quote(BaseModel):
    url: str
    crawled_at: datetime
    text: str = selector("span.text")
    author: str = selector("small.author")
    tags: list[str] = selector("div.tags a.tag")

async with Crawler(
    urls=["https://quotes.toscrape.com"],
    schema=Quote,
    max_pages=50,
) as crawler:
    for quote in await crawler.run():
        print(f"{quote.author}: {quote.text}")

The selector() helper supports:

Argument Description
css CSS selector string
attr Extract an attribute instead of text (e.g. "href", "src")
coerce Aggressive type coercion ("$19.99"19.99)
default Default value if selector matches nothing

Hooks

Hooks let you intercept and modify requests before they're sent, and responses after they're received. They follow the CrawlHook protocol:

from ergane import CrawlHook, CrawlRequest, CrawlResponse

class CrawlHook(Protocol):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None: ...
    async def on_response(self, response: CrawlResponse) -> CrawlResponse | None: ...

Return the (possibly modified) object to continue, or None to skip/discard.

BaseHook

Subclass BaseHook and override only the methods you need:

from ergane import BaseHook, CrawlRequest

class SkipAdminPages(BaseHook):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None:
        if "/admin" in request.url:
            return None  # skip this URL
        return request

Built-in Hooks

Hook Purpose
LoggingHook() Logs requests and responses at DEBUG level
AuthHeaderHook(headers) Injects custom headers (e.g. {"Authorization": "Bearer ..."})
StatusFilterHook(allowed) Discards responses outside allowed status codes (default: {200})

Using Hooks

from ergane import Crawler
from ergane.crawler.hooks import LoggingHook, AuthHeaderHook

async with Crawler(
    urls=["https://api.example.com"],
    hooks=[
        AuthHeaderHook({"Authorization": "Bearer token123"}),
        LoggingHook(),
    ],
) as crawler:
    results = await crawler.run()

Hooks run in order: for requests, each hook receives the output of the previous one. The same applies for responses.

MCP Server

Ergane includes an MCP (Model Context Protocol) server that lets LLMs crawl websites and extract structured data. Install the optional dependency:

pip install ergane[mcp]

Running the Server

# Via CLI subcommand
ergane mcp

# Via Python module
python -m ergane.mcp

Both start a stdio-based MCP server compatible with Claude Code, Claude Desktop, and other MCP clients.

Configuration

Add to your MCP client config (e.g. Claude Desktop claude_desktop_config.json):

{
  "mcpServers": {
    "ergane": {
      "command": "ergane",
      "args": ["mcp"]
    }
  }
}

Or for Claude Code (~/.claude/claude_code_config.json):

{
  "mcpServers": {
    "ergane": {
      "command": "ergane",
      "args": ["mcp"]
    }
  }
}

Available Tools

The MCP server exposes four tools:

list_presets_tool

Discover all built-in scraping presets with their target URLs and available fields.

extract_tool

Extract structured data from a single web page using CSS selectors.

Arguments:
  url          — URL to scrape (required)
  selectors    — Map of field names to CSS selectors, e.g. {"title": "h1", "price": ".price"}
  schema_yaml  — Full YAML schema (alternative to selectors)

scrape_preset_tool

Scrape a website using a built-in preset — zero configuration needed.

Arguments:
  preset     — Preset name, e.g. "hacker-news", "quotes" (required)
  max_pages  — Maximum pages to scrape (default: 5)

crawl_tool

Crawl one or more websites with full control over depth, concurrency, and output format.

Arguments:
  urls           — Starting URLs (required)
  schema_yaml    — YAML schema for CSS-based extraction
  max_pages      — Maximum pages to crawl (default: 10)
  max_depth      — Link-follow depth (default: 1, 0 = seed only)
  concurrency    — Concurrent requests (default: 5)
  output_format  — "json", "csv", or "jsonl" (default: "json")

Resources

Each built-in preset is also exposed as an MCP resource at preset://{name} (e.g. preset://hacker-news), allowing LLMs to browse preset details before scraping.

Built-in Presets

Preset Site Fields Extracted
hacker-news news.ycombinator.com title, link, score, author, comments
github-repos github.com/search name, description, stars, language, link
reddit old.reddit.com title, subreddit, score, author, comments, link
quotes quotes.toscrape.com quote, author, tags
amazon-products amazon.com title, price, rating, reviews, link
ebay-listings ebay.com title, price, condition, shipping, link
wikipedia-articles en.wikipedia.org title, link
bbc-news bbc.com/news title, summary, link

Custom Schemas

Define extraction rules in a YAML schema file:

# schema.yaml
name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true  # "$19.99" -> 19.99
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str
ergane -u https://example.com --schema schema.yaml -o products.parquet

Type coercion (coerce: true) handles common patterns: "$19.99"19.99, "1,234"1234, "yes"True.

Supported types: str, int, float, bool, datetime, list[T].

You can also load YAML schemas programmatically:

from ergane import Crawler, load_schema_from_yaml

ProductItem = load_schema_from_yaml("schema.yaml")

async with Crawler(
    urls=["https://example.com"],
    schema=ProductItem,
) as crawler:
    results = await crawler.run()

Output Formats

Output format is auto-detected from file extension:

ergane --preset quotes -o quotes.csv      # CSV
ergane --preset quotes -o quotes.xlsx     # Excel
ergane --preset quotes -o quotes.parquet  # Parquet (default)
ergane --preset quotes -o quotes.json     # JSON array
ergane --preset quotes -o quotes.jsonl    # JSONL (one object per line)
ergane --preset quotes -o quotes.sqlite   # SQLite database

You can also force a format with --format/-f regardless of file extension:

ergane --preset quotes -f jsonl -o output.dat
import polars as pl
df = pl.read_parquet("output.parquet")

Architecture

Ergane separates the engine (pure async library) from its three interfaces: the CLI (Rich progress bars, signal handling), the Python library (direct import), and the MCP server (LLM integration). Hooks plug into the pipeline at two points: after scheduling and after fetching.

         CLI (main.py)              Python Library             MCP Server
    ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
    │  Click options        │  │  from ergane import   │  │  FastMCP (stdio)     │
    │  Rich progress bar    │  │  Crawler / crawl()    │  │  4 tools + resources │
    │  Signal handling      │  │  stream()             │  │  ergane mcp          │
    │  Config file merge    │  │                       │  │                      │
    └──────────┬───────────┘  └────────────┬──────────┘  └──────────┬───────────┘
               │                           │                        │
               └───────────────┬───────────┴────────────────────────┘
                              │
                              ▼
               ┌──────────────────────────────────┐
               │         Crawler  (engine)         │
               │    Pure async · no I/O concerns   │
               │    Spawns N worker coroutines     │
               └──────────────┬───────────────────┘
                              │
              ┌───────────────┼───────────────────┐
              │               │                   │
              ▼               ▼                   ▼
      ┌──────────────┐ ┌───────────┐   ┌──────────────┐
      │  Scheduler   │ │  Fetcher  │   │   Pipeline   │
      │  URL frontier│ │  HTTP/2   │   │  Batch write │
      │  dedup queue │ │  retries  │   │  multi-format│
      └──────┬───────┘ └─────┬─────┘   └──────────────┘
             │               │
             ▼               ▼
  ┌──────────────────────────────────────────────────┐
  │                Worker loop (× N)                  │
  │                                                   │
  │  1. Scheduler.get()   → CrawlRequest              │
  │  2. hooks.on_request  → modify / skip             │
  │  3. Fetcher.fetch()   → CrawlResponse             │
  │  4. hooks.on_response → modify / discard          │
  │  5. Parser.extract()  → Pydantic model            │
  │  6. Pipeline.add()    → buffered output           │
  │  7. extract_links()   → new URLs → Scheduler      │
  └──────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────┐
  │               Cross-cutting concerns              │
  │                                                   │
  │  Cache ─── SQLite response cache with TTL         │
  │  Checkpoint ─ periodic JSON snapshots for resume  │
  │  Schema ── YAML → dynamic Pydantic model + coerce │
  └──────────────────────────────────────────────────┘

CLI Reference

Common Options

Option Short Default Description
--url -u none Start URL(s), can specify multiple
--output -o output.parquet Output file path
--max-pages -n 100 Maximum pages to crawl
--max-depth -d 3 Maximum crawl depth
--concurrency -c 10 Concurrent requests
--rate-limit -r 10.0 Requests per second per domain
--schema -s none YAML schema file for custom extraction
--preset -p none Use a built-in preset
--format -f auto Output format: csv, excel, parquet, json, jsonl, sqlite
--timeout -t 30 Request timeout in seconds
--proxy -x none HTTP/HTTPS proxy URL
--same-domain/--any-domain --same-domain Restrict crawling to seed domain
--ignore-robots false Ignore robots.txt
--cache false Enable response caching
--cache-dir .ergane_cache Cache directory
--cache-ttl 3600 Cache TTL in seconds
--resume Resume from checkpoint
--checkpoint-interval 100 Save checkpoint every N pages
--log-level INFO DEBUG, INFO, WARNING, ERROR
--log-file none Write logs to file
--no-progress Disable progress bar
--config -C none Config file path

Run ergane --help for the full list.

Advanced CLI Examples

# Crawl with a proxy
ergane -u https://example.com -o data.csv --proxy http://localhost:8080

# Resume an interrupted crawl (requires prior checkpoint)
ergane -u https://example.com -n 500 --resume

# Save checkpoints every 50 pages with debug logging
ergane -u https://example.com -n 500 --checkpoint-interval 50 \
    --log-level DEBUG --log-file crawl.log

# Use a YAML config file and override concurrency from CLI
ergane -u https://example.com -C config.yaml -c 20

# Combine preset with custom URL and explicit format
ergane --preset hacker-news -u https://news.ycombinator.com/newest \
    -f csv -o newest.csv -n 200

Configuration

Ergane looks for a config file in these locations (first match wins):

  1. Explicit path via --config/-C
  2. ~/.ergane.yaml
  3. ./.ergane.yaml
  4. ./ergane.yaml
crawler:
  max_pages: 100
  max_depth: 3
  concurrency: 10
  rate_limit: 10.0

defaults:
  output_format: "csv"
  checkpoint_interval: 100

logging:
  level: "INFO"
  file: null

CLI flags override config file values.

Troubleshooting

Getting empty or partial output

  • Check --max-depth: depth 0 means only the seed URL is crawled. Increase with -d 3 to follow links.
  • Same-domain filtering: by default Ergane only follows links on the same domain as the seed URL. Use --any-domain to crawl cross-domain.
  • Selector mismatch: if using a custom schema, verify your CSS selectors match the actual site HTML (sites change frequently).

Blocked by robots.txt

If a target site disallows your user-agent in robots.txt, Ergane will return 403 for those URLs. Options:

# Ignore robots.txt (use responsibly)
ergane -u https://example.com --ignore-robots -o data.csv

Rate limiting or 429 responses

Lower the request rate and concurrency:

ergane -u https://example.com -r 2 -c 3 -o data.csv

The built-in per-domain token-bucket rate limiter (-r) controls requests per second. Reducing concurrency (-c) also lowers overall load.

Timeouts and connection errors

Increase the request timeout and enable retries (3 retries is the default):

ergane -u https://slow-site.com -t 60 -o data.csv

Resuming after a crash

Ergane periodically saves checkpoints (default: every 100 pages). To resume:

ergane -u https://example.com -n 1000 --resume

The checkpoint file is automatically deleted after a successful crawl.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ergane-0.7.0.tar.gz (182.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ergane-0.7.0-py3-none-any.whl (51.7 kB view details)

Uploaded Python 3

File details

Details for the file ergane-0.7.0.tar.gz.

File metadata

  • Download URL: ergane-0.7.0.tar.gz
  • Upload date:
  • Size: 182.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Pop!_OS","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ergane-0.7.0.tar.gz
Algorithm Hash digest
SHA256 426b63a91f3423735ea78a8fc953ccfbf6b3d363a64e7100dafdf94d7c5ba8b7
MD5 80c0dcf6d5a8f13d9386acf617db4e3a
BLAKE2b-256 07def9b7f5688e8c9c694cd0c9359952275b0113c125723dae8c786328c569ac

See more details on using hashes here.

File details

Details for the file ergane-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: ergane-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 51.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Pop!_OS","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ergane-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 11553e80153ad779869ba4ff57f6a2f3629896d01a68b67945dbd4ec9e22bba4
MD5 4baef630062b207c4020cba39fa4e111
BLAKE2b-256 cb10b423892e1da6abe554aca66e1980d3c23294f4cc8030760ce656289f6544

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page