Ergane
High-performance async web scraper with HTTP/2 support, built with Python.
Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.
Features
- HTTP/2 & Async - Fast concurrent connections with rate limiting and retry logic
- Fast Parsing - Selectolax HTML parsing (16x faster than BeautifulSoup)
- Built-in Presets - Pre-configured schemas for popular sites (no coding required)
- Custom Schemas - Define Pydantic models with CSS selectors and type coercion
- Multi-Format Output - Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
- Response Caching - SQLite-based caching for faster development and debugging
- Production Ready - robots.txt compliance, graceful shutdown, checkpoints, proxy support
Installation
pip install ergane
Quick Start
Using Presets (Easiest)
# Use a preset - no schema needed!
ergane --preset quotes -o quotes.csv
# Export to Excel
ergane --preset hacker-news -o stories.xlsx
# List available presets
ergane --list-presets
Manual Crawling
# Crawl a single site
ergane -u https://example.com -n 100
# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet
Built-in Presets
| Preset | Site | Fields Extracted |
|---|---|---|
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |
| amazon-products | amazon.com | title, price, rating, reviews, link |
| ebay-listings | ebay.com | title, price, condition, shipping, link |
| wikipedia-articles | en.wikipedia.org | title, link |
| bbc-news | bbc.com/news | title, summary, link |
Architecture
Ergane uses an async pipeline architecture orchestrated by a central Crawler engine. N worker coroutines run concurrently, each pulling URLs from the scheduler, fetching, parsing, and feeding results to the output pipeline.
CLI args ──→ Config ←── YAML file     Presets / Custom Schema
             merge      (~/.ergane.yaml)         │
               │                                 │
               ▼                                 ▼
┌─────────────────────────────────────────────────────────────┐
│                      Crawler (engine)                       │
│   Spawns N async workers · signal handling · progress bar   │
│                                                             │
│  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ Worker loop (× N) ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  │
│                                                             │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│     │                    Scheduler                    │     │
│  │  │    Min-heap priority queue · URL dedup (set)    │  │  │
│     │      Depth limit · asyncio.Event signaling      │     │
│  │  └───────┬─────────────────────────────────────────┘  │  │
│             │ get_nowait() → CrawlRequest                   │
│  │          ▼                                            │  │
│     ┌─────────────────────────────────────────────────┐     │
│  │  │                     Fetcher                     │  │  │
│     │   httpx AsyncClient (HTTP/2) · proxy support    │     │
│  │  │      Per-domain token-bucket rate limiter       │  │  │
│     │     Exponential backoff retry · robots.txt      │     │
│  │  └───────┬─────────────────────────────────────────┘  │  │
│             │ CrawlResponse                                 │
│  │          ▼                                            │  │
│     ┌─────────────────────────────────────────────────┐     │
│  │  │                     Parser                      │  │  │
│     │   selectolax HTML parsing (16× BeautifulSoup)   │     │
│  │  │       Schema mode → typed Pydantic model        │  │  │
│     │            Legacy mode → ParsedItem             │     │
│  │  │  Link extraction ────────────────────┐          │  │  │
│     └───────┬──────────────────────────────┼──────────┘     │
│  │          │ model instance               │ new URLs    │  │
│             ▼                              ▼                │
│  │  ┌────────────────────┐       ┌────────────────────┐  │  │
│     │      Pipeline      │       │     Scheduler      │     │
│  │  │   Buffer → batch   │       │    .add_many()     │  │  │
│     │  files (numbered)  │       └────────────────────┘     │
│  │  └───────┬────────────┘                               │  │
│             │                                               │
│  └ ─ ─ ─ ─ ─┼─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘  │
│             │                                               │
└─────────────┼───────────────────────────────────────────────┘
              │ flush on batch_size
              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Pipeline (output)                      │
│    Incremental batch files → consolidate & dedup by URL     │
│        Parquet · CSV · Excel · JSON · JSONL · SQLite        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
                        output.parquet
┌──────────────────────────────────────────────────────────────────┐
│                      Cross-cutting concerns                      │
│                                                                  │
│  Checkpoint ─ periodic JSON snapshots of scheduler state &       │
│               page count; enables --resume after interruption    │
│                                                                  │
│  Cache ────── optional SQLite response cache with TTL            │
│               (SHA-256 URL keys · non-blocking async I/O)        │
│                                                                  │
│  Schema ───── YAML loader → dynamic Pydantic model creation      │
│               type coercion ($19.99→19.99) · Parquet type mapping│
└──────────────────────────────────────────────────────────────────┘
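The worker loop in the diagram can be pictured with a small, self-contained sketch. This is not Ergane's code: `Scheduler`, `CrawlRequest`, `get_nowait()` and `add_many()` borrow names from the diagram, but their signatures here are assumptions, `FAKE_SITE` and `fetch_and_parse` are hypothetical stand-ins for the Fetcher and Parser, and the polling with `asyncio.sleep(0)` replaces the `asyncio.Event` signaling the diagram mentions. It needs Python 3.9+.

```python
import asyncio
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class CrawlRequest:
    priority: int                                  # lower value is popped first
    url: str = field(compare=False)
    depth: int = field(compare=False, default=0)


class Scheduler:
    """Min-heap priority queue with set-based URL dedup and a depth limit."""

    def __init__(self, max_depth: int = 3) -> None:
        self._heap: list[CrawlRequest] = []
        self._seen: set[str] = set()
        self.max_depth = max_depth

    def add_many(self, urls: list[str], depth: int = 0) -> None:
        for url in urls:
            if url in self._seen or depth > self.max_depth:
                continue
            self._seen.add(url)
            # Assumption: shallower pages get a smaller priority value.
            heapq.heappush(self._heap, CrawlRequest(depth, url, depth))

    def get_nowait(self) -> Optional[CrawlRequest]:
        return heapq.heappop(self._heap) if self._heap else None


# Hypothetical in-memory "site" standing in for real HTTP fetching and parsing.
FAKE_SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}


async def fetch_and_parse(url: str) -> list[str]:
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return FAKE_SITE.get(url, [])


async def worker(scheduler: Scheduler, results: list[str],
                 busy: list[int], max_pages: int) -> None:
    while len(results) + busy[0] < max_pages:
        request = scheduler.get_nowait()
        if request is None:
            if busy[0] == 0:
                return              # queue is empty and nobody can refill it
            await asyncio.sleep(0)  # a busy worker may still enqueue links
            continue
        busy[0] += 1
        try:
            links = await fetch_and_parse(request.url)
            results.append(request.url)
            scheduler.add_many(links, depth=request.depth + 1)
        finally:
            busy[0] -= 1


async def crawl(seed: str, workers: int = 3, max_pages: int = 100) -> list[str]:
    scheduler = Scheduler()
    scheduler.add_many([seed])
    results: list[str] = []
    busy = [0]  # number of requests currently being processed
    await asyncio.gather(
        *(worker(scheduler, results, busy, max_pages) for _ in range(workers))
    )
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl("/")))
```

Each URL is visited at most once because the dedup set is checked before anything is pushed onto the heap, and a worker only exits once the queue is empty and no other worker is mid-request.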
CLI Options
Common options:
| Option | Short | Default | Description |
|---|---|---|---|
| --url | -u | none | Start URL(s), can specify multiple |
| --output | -o | output.parquet | Output file path |
| --max-pages | -n | 100 | Maximum pages to crawl |
| --max-depth | -d | 3 | Maximum crawl depth |
| --concurrency | -c | 10 | Concurrent requests |
| --rate-limit | -r | 10.0 | Requests per second per domain |
| --schema | -s | none | YAML schema file for custom extraction |
| --preset | -p | none | Use a built-in preset |
| --format | -f | auto | Output format: csv, excel, parquet, json, jsonl, sqlite |
| --cache | | false | Enable response caching |
| --cache-dir | | .ergane_cache | Cache directory |
| --cache-ttl | | 3600 | Cache TTL in seconds |
Run ergane --help for all options including proxy, resume, logging, and config settings.
Response Caching
Enable caching to speed up development and debugging workflows:
# First run - fetches from web, caches responses
ergane --preset quotes --cache -n 10 -o quotes.csv
# Second run - instant (served from cache)
ergane --preset quotes --cache -n 10 -o quotes.csv
# Custom cache settings
ergane --preset bbc-news --cache --cache-dir ./my_cache --cache-ttl 60 -o news.csv
Cache is stored in SQLite at .ergane_cache/response_cache.db by default.
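To illustrate the idea of a SQLite cache keyed by the SHA-256 of the URL with a TTL, here is a minimal sketch. The class name `ResponseCache`, the table layout, and the explicit `now` parameter are assumptions made for the example, and it uses blocking `sqlite3` calls, whereas Ergane describes its cache I/O as non-blocking.

```python
import hashlib
import sqlite3
import time
from typing import Optional


class ResponseCache:
    """Hypothetical TTL cache: SHA-256(url) -> response body."""

    def __init__(self, path: str = ":memory:", ttl: float = 3600.0) -> None:
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "key TEXT PRIMARY KEY, body BLOB, stored_at REAL)"
        )

    @staticmethod
    def key_for(url: str) -> str:
        # Fixed-length key regardless of how long the URL is.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url: str, body: bytes, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (self.key_for(url), body, stored_at),
        )
        self.db.commit()

    def get(self, url: str, now: Optional[float] = None) -> Optional[bytes]:
        current = time.time() if now is None else now
        row = self.db.execute(
            "SELECT body, stored_at FROM responses WHERE key = ?",
            (self.key_for(url),),
        ).fetchone()
        if row is None or current - row[1] > self.ttl:
            return None  # missing or older than the TTL: treat as a miss
        return row[0]
```

With `ttl=60`, an entry stored at time 1000 is still returned at 1030 and is treated as a miss at 1100, which mirrors what a short `--cache-ttl` would do for frequently changing pages.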
Custom Schemas
Define extraction rules in a YAML schema file:
# schema.yaml
name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true  # "$19.99" -> 19.99
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str
ergane -u https://example.com --schema schema.yaml -o products.parquet
Type coercion (coerce: true) handles common patterns: "$19.99" → 19.99, "1,234" → 1234, "yes" → True.
Supported types: str, int, float, bool, datetime, list[T].
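The coercion patterns listed above can be approximated in a few lines. This is an illustrative sketch, not Ergane's implementation: the `coerce` function, the regular expression, and every boolean token other than "yes" are assumptions.

```python
import re

# Assumed token sets; the README only documents "yes" -> True.
_TRUE = {"yes", "true", "1", "y", "on"}
_FALSE = {"no", "false", "0", "n", "off"}


def coerce(raw: str, target: type):
    """Convert scraped text to `target`, tolerating currency signs and commas."""
    text = raw.strip()
    if target is bool:
        lowered = text.lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
        raise ValueError(f"cannot interpret {raw!r} as bool")
    if target in (int, float):
        # First run of digits, allowing thousands separators and a decimal part.
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
        if match is None:
            raise ValueError(f"no number found in {raw!r}")
        number = float(match.group().replace(",", ""))
        return int(number) if target is int else number
    return text


print(coerce("$19.99", float))  # 19.99
print(coerce("1,234", int))     # 1234
print(coerce("yes", bool))      # True
```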
Output Formats
Output format is auto-detected from file extension:
ergane --preset quotes -o quotes.csv # CSV
ergane --preset quotes -o quotes.xlsx # Excel
ergane --preset quotes -o quotes.parquet # Parquet (default)
ergane --preset quotes -o quotes.json # JSON array
ergane --preset quotes -o quotes.jsonl # JSONL (one object per line)
ergane --preset quotes -o quotes.sqlite # SQLite database
You can also force a format with --format/-f regardless of file extension:
ergane --preset quotes -f jsonl -o output.dat
Parquet output can be loaded with Polars:
import polars as pl
df = pl.read_parquet("output.parquet")
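The extension-based detection and the --format override described in this section amount to a lookup with a fallback. The sketch below shows one way to express that; `resolve_format` and `EXTENSION_FORMATS` are hypothetical names, and raising `ValueError` for an unknown extension is an assumption about behavior the README does not specify.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the examples in this section.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".xlsx": "excel",
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "jsonl",
    ".sqlite": "sqlite",
}


def resolve_format(output: str, forced: Optional[str] = None) -> str:
    """Return the output format, preferring an explicit choice over the extension."""
    if forced and forced != "auto":
        return forced
    suffix = Path(output).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        # Assumed behavior: fail loudly rather than guess.
        raise ValueError(f"cannot infer output format from {output!r}")
    return EXTENSION_FORMATS[suffix]


print(resolve_format("quotes.xlsx"))          # excel
print(resolve_format("output.dat", "jsonl"))  # jsonl
```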
Advanced CLI Examples
# Crawl with a proxy
ergane -u https://example.com -o data.csv --proxy http://localhost:8080
# Resume an interrupted crawl (requires prior checkpoint)
ergane -u https://example.com -n 500 --resume
# Save checkpoints every 50 pages with debug logging
ergane -u https://example.com -n 500 --checkpoint-interval 50 \
--log-level DEBUG --log-file crawl.log
# Use a YAML config file and override concurrency from CLI
ergane -u https://example.com -C config.yaml -c 20
# Combine preset with custom URL and explicit format
ergane --preset hacker-news -u https://news.ycombinator.com/newest \
-f csv -o newest.csv -n 200
Troubleshooting
Getting empty or partial output
- Check --max-depth: depth 0 means only the seed URL is crawled. Increase with -d 3 to follow links.
- Same-domain filtering: by default Ergane only follows links on the same domain as the seed URL. Use --any-domain to crawl cross-domain.
- Selector mismatch: if using a custom schema, verify your CSS selectors match the actual site HTML (sites change frequently).
Blocked by robots.txt
If a target site disallows your user-agent in robots.txt, Ergane will return 403 for those URLs. To crawl them anyway:
# Ignore robots.txt (use responsibly)
ergane -u https://example.com --ignore-robots -o data.csv
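Before reaching for --ignore-robots, it can help to check what a site's robots.txt actually says about a given URL. The sketch below uses Python's standard `urllib.robotparser`; it is independent of Ergane's internals, and the user-agent string "ExampleBot" and the rules in `ROBOTS` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it from the site.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS, "ExampleBot", "https://example.com/public/page"))
print(is_allowed(ROBOTS, "ExampleBot", "https://example.com/private/page"))
```

The first call reports that the public page is allowed and the second that the private one is not, which is the situation in which a crawler honoring robots.txt would skip the URL.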
Rate limiting or 429 responses
Lower the request rate and concurrency:
ergane -u https://example.com -r 2 -c 3 -o data.csv
The built-in per-domain token-bucket rate limiter (-r) controls requests
per second. Reducing concurrency (-c) also lowers overall load.
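For readers unfamiliar with the technique, a per-domain token bucket can be sketched as follows. This is a simplified model rather than Ergane's limiter: the class names, the burst capacity of one token, and the injectable `clock` are assumptions chosen to keep the example small and testable.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class DomainLimiter:
    """Keeps one independent bucket per host name."""

    def __init__(self, rate: float, clock=time.monotonic):
        self.rate = rate
        self._clock = clock
        self._buckets: dict = {}

    def try_acquire(self, url: str) -> bool:
        host = urlsplit(url).netloc
        if host not in self._buckets:
            self._buckets[host] = TokenBucket(self.rate, clock=self._clock)
        return self._buckets[host].try_acquire()
```

With a rate of 2, a second request to the same host is refused until half a second has passed, while a request to a different host goes through immediately because it draws from its own bucket.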
Timeouts and connection errors
Increase the request timeout with -t (failed requests are already retried up to 3 times by default):
ergane -u https://slow-site.com -t 60 -o data.csv
Resuming after a crash
Ergane periodically saves checkpoints (default: every 100 pages). To resume:
ergane -u https://example.com -n 1000 --resume
The checkpoint file is automatically deleted after a successful crawl.
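Conceptually, a checkpoint is a JSON snapshot of the scheduler state plus the page count. The sketch below shows one way such a snapshot could be written and restored; the field names, the file name, and the write-then-rename step are assumptions for illustration, not Ergane's on-disk format.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, pending: list, seen: set, pages_crawled: int) -> None:
    """Write the crawl state as JSON, replacing any previous checkpoint."""
    state = {
        "pending": pending,        # URLs still waiting in the queue
        "seen": sorted(seen),      # sets are not JSON-serializable
        "pages_crawled": pages_crawled,
    }
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(state), encoding="utf-8")
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None if there is nothing to resume from."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir) / "checkpoint.json"
        save_checkpoint(
            path,
            pending=["https://example.com/b"],
            seen={"https://example.com/", "https://example.com/b"},
            pages_crawled=1,
        )
        return load_checkpoint(path)
```

On resume, a crawler following this scheme would refill its queue from `pending`, restore the dedup set from `seen`, and continue counting from `pages_crawled`.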
License
MIT