High-performance async web scraper with selectolax parsing
Ergane
High-performance async web scraper with HTTP/2 support, built with Python.
Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.
Features
- Programmatic API — `Crawler`, `crawl()`, and `stream()` let you embed scraping in any Python application
- Hook System — Intercept requests and responses with the `CrawlHook` protocol
- HTTP/2 & Async — Fast concurrent connections with per-domain rate limiting and retry logic
- Fast Parsing — Selectolax HTML parsing (16x faster than BeautifulSoup)
- Built-in Presets — Pre-configured schemas for popular sites (no coding required)
- Custom Schemas — Define Pydantic models with CSS selectors and type coercion
- Multi-Format Output — Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
- Response Caching — SQLite-based caching for faster development and debugging
- MCP Server — Expose scraping tools to LLMs via the Model Context Protocol
- JavaScript Rendering — Render JS-heavy pages via Playwright
- Production Ready — robots.txt compliance, graceful shutdown, checkpoints, proxy support
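The per-domain rate limiting named in the feature list can be pictured with a small standard-library sketch. This is not Ergane's implementation: `DomainRateLimiter`, `min_interval`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration, and the sketch assumes a fixed minimum interval between requests to the same host.

```python
import asyncio
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Hypothetical helper: at most one request per `min_interval` seconds per domain."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=asyncio.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the timing can be tested
        self._sleep = sleep
        self._next_slot = {}     # domain -> earliest time the next request may start

    async def wait(self, url):
        """Sleep until the URL's domain may be hit again; return the delay applied."""
        domain = urlsplit(url).netloc
        now = self._clock()
        slot = max(now, self._next_slot.get(domain, now))
        # Reserve the slot before sleeping so concurrent callers queue up behind it.
        self._next_slot[domain] = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            await self._sleep(delay)
        return delay
```

A worker would call `await limiter.wait(url)` immediately before each fetch. Requests to different domains never wait on each other, because each domain keeps its own slot.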
Installation
```bash
pip install ergane

# With JavaScript rendering support (quotes keep shells such as zsh from globbing the brackets)
pip install "ergane[js]"

# With MCP server support
pip install "ergane[mcp]"
```
Quick Start
CLI
```bash
# Use a built-in preset (no code needed)
ergane --preset quotes -o quotes.csv

# Crawl a custom URL
ergane -u https://example.com -n 100 -o data.parquet

# List available presets
ergane --list-presets
```
Python
```python
import asyncio

from ergane import Crawler


async def main():
    async with Crawler(
        urls=["https://quotes.toscrape.com"],
        max_pages=20,
    ) as crawler:
        async for item in crawler.stream():
            print(item.url, item.title)


asyncio.run(main())
```
MCP Server
```bash
pip install "ergane[mcp]"
ergane mcp
```
Add to your Claude Desktop or Claude Code config:
```json
{
  "mcpServers": {
    "ergane": {
      "command": "ergane",
      "args": ["mcp"]
    }
  }
}
```
The server exposes four tools: list_presets_tool, extract_tool, scrape_preset_tool, and crawl_tool.
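If the client config file already lists other MCP servers, the entry has to be merged in rather than pasted over the file. A minimal sketch of that merge, using only the standard library, is shown here. `add_ergane_server` is a hypothetical helper, and the config file location depends on the client, so it is passed in rather than assumed.

```python
import json
from pathlib import Path


def add_ergane_server(config_path: Path) -> dict:
    """Merge the `ergane` MCP entry into a JSON config file, keeping other keys."""
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    # Same command and args as the snippet above.
    servers["ergane"] = {"command": "ergane", "args": ["mcp"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling the helper twice leaves the file unchanged after the first call, since the entry is keyed by name.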
Documentation
| Guide | Description |
|---|---|
| CLI Reference | Commands, flags, presets, schemas, config files, troubleshooting |
| Python Library | Crawler API, hooks, typed extraction, authentication, advanced usage |
| MCP Server | Setup, tool reference, error handling, result format |
Built-in Presets
| Preset | Site | Fields Extracted |
|---|---|---|
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |
| amazon-products | amazon.com | title, price, rating, reviews, link |
| ebay-listings | ebay.com | title, price, condition, shipping, link |
| wikipedia-articles | en.wikipedia.org | title, link |
| bbc-news | bbc.com/news | title, summary, link |
Architecture
Ergane separates the engine (pure async library) from its three interfaces: the CLI (Rich progress bars, signal handling), the Python library (direct import), and the MCP server (LLM integration). Hooks plug into the pipeline at two points: after scheduling and after fetching.
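To make the two hook points concrete, here is a rough sketch of how a hook chain of this kind can behave. It is an illustration under assumptions, not the library's `CrawlHook` definition: the `Request` and `Response` dataclasses, the synchronous method signatures, and the "return `None` to drop" convention are all hypothetical, and the real protocol may well be async.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Request:  # hypothetical stand-in for CrawlRequest
    url: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:  # hypothetical stand-in for CrawlResponse
    url: str
    status: int


class Hook(Protocol):
    """Assumed shape: return the (possibly modified) object, or None to drop it."""

    def on_request(self, request: Request) -> Optional[Request]: ...
    def on_response(self, response: Response) -> Optional[Response]: ...


class SkipAdmin:
    def on_request(self, request):
        # Skip anything under /admin before it is fetched.
        return None if "/admin" in request.url else request

    def on_response(self, response):
        return response


class AddUserAgent:
    def on_request(self, request):
        request.headers["User-Agent"] = "example-bot/0.1"
        return request

    def on_response(self, response):
        # Discard anything that is not a plain 200.
        return response if response.status == 200 else None


def run_request_hooks(hooks, request):
    """Pass the request through each hook in order; stop at the first None."""
    for hook in hooks:
        request = hook.on_request(request)
        if request is None:
            return None
    return request
```

Response hooks would be chained the same way after the fetch, with `None` meaning the response is discarded instead of parsed.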
```
  CLI (main.py)             Python Library            MCP Server
┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│  Click options       │  │  from ergane import  │  │  FastMCP (stdio)     │
│  Rich progress bar   │  │  Crawler / crawl()   │  │  4 tools + resources │
│  Signal handling     │  │  stream()            │  │  ergane mcp          │
│  CrawlOptions config │  │                      │  │                      │
└──────────┬───────────┘  └────────────┬─────────┘  └──────────┬───────────┘
           │                           │                       │
           └───────────────┬───────────┴───────────────────────┘
                           │
                           ▼
            ┌──────────────────────────────────┐
            │        Crawler (engine)          │
            │   Pure async · no I/O concerns   │
            │   Spawns N worker coroutines     │
            └──────────────┬───────────────────┘
                           │
           ┌───────────────┼───────────────────┐
           │               │                   │
           ▼               ▼                   ▼
   ┌──────────────┐  ┌───────────┐  ┌──────────────────────┐
   │  Scheduler   │  │  Fetcher  │  │  Pipeline            │
   │  URL frontier│  │  HTTP/2   │  │  BatchWriter strategy│
   │  dedup queue │  │  retries  │  │  per-format writers  │
   └───────┬──────┘  └─────┬─────┘  └──────────────────────┘
           │               │
           ▼               ▼
   ┌──────────────────────────────────────────────────┐
   │                Worker loop (× N)                 │
   │                                                  │
   │  1. Scheduler.get()   → CrawlRequest             │
   │  2. hooks.on_request  → modify / skip            │
   │  3. Fetcher.fetch()   → CrawlResponse            │
   │  4. hooks.on_response → modify / discard         │
   │  5. Parser.extract()  → Pydantic model / dict    │
   │  6. Pipeline.add()    → buffered output          │
   │  7. extract_links()   → new URLs → Scheduler     │
   └──────────────────────────────────────────────────┘

   ┌──────────────────────────────────────────────────┐
   │              Cross-cutting concerns              │
   │                                                  │
   │  Cache ───    SQLite response cache with TTL     │
   │  Checkpoint ─ periodic JSON snapshots for resume │
   │  Schema ──    YAML → FieldConfig → extraction    │
   └──────────────────────────────────────────────────┘
```
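The worker loop in the diagram follows a common asyncio pattern: N coroutines pull from a shared, de-duplicated queue and feed newly discovered links back into it. The sketch below illustrates that pattern only and is not Ergane's engine. `run_workers`, `fetch`, and `extract_links` are hypothetical names, the hook steps are left out to keep it short, and a plain list stands in for the output pipeline.

```python
import asyncio


async def run_workers(start_urls, fetch, extract_links, max_pages=10, workers=3):
    """Run `workers` coroutines over a shared, de-duplicated URL queue."""
    queue = asyncio.Queue()
    seen = set(start_urls)   # dedup: every URL is queued at most once
    results = []             # stands in for the buffered output pipeline
    claimed = 0              # pages reserved so far, checked against max_pages
    for url in start_urls:
        queue.put_nowait(url)

    async def worker():
        nonlocal claimed
        while True:
            url = await queue.get()                    # 1. next request
            try:
                if claimed < max_pages:
                    claimed += 1                       # reserve before awaiting so the cap holds
                    body = await fetch(url)            # 3. fetch
                    results.append((url, body))        # 5-6. extract and buffer
                    for link in extract_links(body):   # 7. new URLs go back to the queue
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait(link)
            except Exception:
                pass  # a real crawler would retry or log; the sketch just moves on
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()       # returns once every queued URL has been handled
    for task in tasks:
        task.cancel()        # workers are idle on queue.get() at this point
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Incrementing `claimed` before the first `await` matters: without it, several workers could pass the page-limit check at the same time and overshoot `max_pages`.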
License
MIT