High-performance async web scraper with selectolax parsing

Project description

Ergane

PyPI version License: MIT Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • Programmatic API — Crawler, crawl(), and stream() let you embed scraping in any Python application
  • Hook System — Intercept requests and responses with the CrawlHook protocol
  • HTTP/2 & Async — Fast concurrent connections with per-domain rate limiting and retry logic
  • Fast Parsing — Selectolax HTML parsing (16x faster than BeautifulSoup)
  • Built-in Presets — Pre-configured schemas for popular sites (no coding required)
  • Custom Schemas — Define Pydantic models with CSS selectors and type coercion
  • Multi-Format Output — Export to CSV, Excel, Parquet, JSON, JSONL, or SQLite
  • Response Caching — SQLite-based caching for faster development and debugging
  • Production Ready — robots.txt compliance, graceful shutdown, checkpoints, proxy support

Installation

pip install ergane

Quick Start

CLI — run from your terminal

# Use a built-in preset (no code needed)
ergane --preset quotes -o quotes.csv

# Crawl a custom URL
ergane -u https://example.com -n 100 -o data.parquet

# List available presets
ergane --list-presets

Python — embed in your application

import asyncio
from ergane import Crawler

async def main():
    async with Crawler(
        urls=["https://quotes.toscrape.com"],
        max_pages=20,
    ) as crawler:
        async for item in crawler.stream():
            print(item.url, item.title)

asyncio.run(main())

Python Library

Ergane's engine is a pure async library. The CLI is a thin wrapper around it — everything the CLI can do, your code can do too.

Crawler

The main entry point. Use it as an async context manager:

from ergane import Crawler

async with Crawler(
    urls=["https://example.com"],
    max_pages=50,
    concurrency=10,
    rate_limit=5.0,
) as crawler:
    results = await crawler.run()      # collect all items

Key parameters:

| Parameter | Default | Description |
|---|---|---|
| urls | (required) | Seed URL(s) to start crawling |
| schema | None | Pydantic model for typed extraction |
| concurrency | 10 | Number of concurrent workers |
| max_pages | 100 | Maximum pages to crawl |
| max_depth | 3 | Maximum link-follow depth |
| rate_limit | 10.0 | Requests per second per domain |
| timeout | 30.0 | HTTP request timeout (seconds) |
| same_domain | True | Only follow links on the seed domain |
| hooks | None | List of CrawlHook instances |
| output | None | File path to write results |
| output_format | "auto" | csv, excel, parquet, json, jsonl, sqlite |
| cache | False | Enable SQLite response caching |
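
For example, a minimal sketch combining several of these parameters (the URL and output file name are placeholders; with output_format left at "auto", the JSONL format is inferred from the file extension):

from ergane import Crawler

async with Crawler(
    urls=["https://example.com"],
    max_pages=200,
    max_depth=2,
    rate_limit=5.0,
    output="items.jsonl",   # write results to this file (format inferred from extension)
    cache=True,             # enable SQLite response caching
) as crawler:
    await crawler.run()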

run()

Executes the crawl and returns all extracted items as a list:

async with Crawler(urls=["https://example.com"], max_pages=10) as c:
    results = await c.run()
    print(f"Got {len(results)} items")

stream()

Yields items as they arrive — memory-efficient for large crawls:

async with Crawler(urls=["https://example.com"], max_pages=500) as c:
    async for item in c.stream():
        process(item)  # handle each item immediately

crawl()

One-shot convenience function — creates a Crawler, runs it, returns results:

from ergane import crawl

results = await crawl(
    urls=["https://example.com"],
    max_pages=10,
    concurrency=5,
)

Typed Extraction with Schemas

Pass a Pydantic model with CSS selectors to extract structured data:

from datetime import datetime
from pydantic import BaseModel
from ergane import Crawler, selector

class Quote(BaseModel):
    url: str
    crawled_at: datetime
    text: str = selector("span.text")
    author: str = selector("small.author")
    tags: list[str] = selector("div.tags a.tag")

async with Crawler(
    urls=["https://quotes.toscrape.com"],
    schema=Quote,
    max_pages=50,
) as crawler:
    for quote in await crawler.run():
        print(f"{quote.author}: {quote.text}")

The selector() helper supports:

| Argument | Description |
|---|---|
| css | CSS selector string |
| attr | Extract an attribute instead of text (e.g. "href", "src") |
| coerce | Aggressive type coercion ("$19.99" → 19.99) |
| default | Default value if selector matches nothing |
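
A short sketch using these arguments together; the field names and selectors below are illustrative, not tied to a real site:

from pydantic import BaseModel
from ergane import selector

class Product(BaseModel):
    url: str
    name: str = selector("h1.product-title")
    price: float = selector("span.price", coerce=True, default=0.0)  # "$19.99" -> 19.99
    image_url: str = selector("img.product", attr="src")             # take the src attribute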

Hooks

Hooks let you intercept and modify requests before they're sent, and responses after they're received. They follow the CrawlHook protocol:

from typing import Protocol
from ergane import CrawlRequest, CrawlResponse

class CrawlHook(Protocol):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None: ...
    async def on_response(self, response: CrawlResponse) -> CrawlResponse | None: ...

Return the (possibly modified) object to continue, or None to skip/discard.

BaseHook

Subclass BaseHook and override only the methods you need:

from ergane import BaseHook, CrawlRequest

class SkipAdminPages(BaseHook):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None:
        if "/admin" in request.url:
            return None  # skip this URL
        return request

Built-in Hooks

| Hook | Purpose |
|---|---|
| LoggingHook() | Logs requests and responses at DEBUG level |
| AuthHeaderHook(headers) | Injects custom headers (e.g. {"Authorization": "Bearer ..."}) |
| StatusFilterHook(allowed) | Discards responses outside allowed status codes (default: {200}) |

Using Hooks

from ergane import Crawler
from ergane.crawler.hooks import LoggingHook, AuthHeaderHook

async with Crawler(
    urls=["https://api.example.com"],
    hooks=[
        AuthHeaderHook({"Authorization": "Bearer token123"}),
        LoggingHook(),
    ],
) as crawler:
    results = await crawler.run()

Hooks run in order: for requests, each hook receives the output of the previous one. The same applies for responses.
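
A minimal sketch of that ordering (the hook classes here are made up for illustration):

from ergane import BaseHook, CrawlRequest

class FirstHook(BaseHook):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None:
        print("first:", request.url)
        return request           # passed on to the next hook

class SecondHook(BaseHook):
    async def on_request(self, request: CrawlRequest) -> CrawlRequest | None:
        # Receives whatever FirstHook returned; returning None here
        # would drop the request before it is ever fetched.
        print("second:", request.url)
        return request

# hooks=[FirstHook(), SecondHook()] runs them in that order for each request.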

Built-in Presets

| Preset | Site | Fields Extracted |
|---|---|---|
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |
| amazon-products | amazon.com | title, price, rating, reviews, link |
| ebay-listings | ebay.com | title, price, condition, shipping, link |
| wikipedia-articles | en.wikipedia.org | title, link |
| bbc-news | bbc.com/news | title, summary, link |

Custom Schemas

Define extraction rules in a YAML schema file:

# schema.yaml
name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true  # "$19.99" -> 19.99
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str

Then point the CLI at the schema file:

ergane -u https://example.com --schema schema.yaml -o products.parquet

Type coercion (coerce: true) handles common patterns: "$19.99" → 19.99, "1,234" → 1234, "yes" → True.

Supported types: str, int, float, bool, datetime, list[T].

You can also load YAML schemas programmatically:

from ergane import Crawler, load_schema_from_yaml

ProductItem = load_schema_from_yaml("schema.yaml")

async with Crawler(
    urls=["https://example.com"],
    schema=ProductItem,
) as crawler:
    results = await crawler.run()

Output Formats

Output format is auto-detected from file extension:

ergane --preset quotes -o quotes.csv      # CSV
ergane --preset quotes -o quotes.xlsx     # Excel
ergane --preset quotes -o quotes.parquet  # Parquet (default)
ergane --preset quotes -o quotes.json     # JSON array
ergane --preset quotes -o quotes.jsonl    # JSONL (one object per line)
ergane --preset quotes -o quotes.sqlite   # SQLite database

You can also force a format with --format/-f regardless of file extension:

ergane --preset quotes -f jsonl -o output.dat

Parquet output pairs well with dataframe libraries, for example Polars:

import polars as pl
df = pl.read_parquet("output.parquet")

Architecture

Ergane separates the engine (pure async library) from the CLI (Rich progress bars, signal handling). Hooks plug into the pipeline at two points: after scheduling and after fetching.

         CLI (main.py)                          Python Library
    ┌──────────────────────┐             ┌──────────────────────────┐
    │  Click options        │             │  from ergane import ...   │
    │  Rich progress bar    │             │  Crawler / crawl()       │
    │  Signal handling      │             │  stream()                │
    │  Config file merge    │             │                          │
    └──────────┬───────────┘             └────────────┬─────────────┘
               │                                      │
               └──────────────┬───────────────────────┘
                              │
                              ▼
               ┌──────────────────────────────────┐
               │         Crawler  (engine)         │
               │    Pure async · no I/O concerns   │
               │    Spawns N worker coroutines     │
               └──────────────┬───────────────────┘
                              │
              ┌───────────────┼───────────────────┐
              │               │                   │
              ▼               ▼                   ▼
      ┌──────────────┐ ┌───────────┐   ┌──────────────┐
      │  Scheduler   │ │  Fetcher  │   │   Pipeline   │
      │  URL frontier│ │  HTTP/2   │   │  Batch write │
      │  dedup queue │ │  retries  │   │  multi-format│
      └──────┬───────┘ └─────┬─────┘   └──────────────┘
             │               │
             ▼               ▼
  ┌──────────────────────────────────────────────────┐
  │                Worker loop (× N)                  │
  │                                                   │
  │  1. Scheduler.get()   → CrawlRequest              │
  │  2. hooks.on_request  → modify / skip             │
  │  3. Fetcher.fetch()   → CrawlResponse             │
  │  4. hooks.on_response → modify / discard          │
  │  5. Parser.extract()  → Pydantic model            │
  │  6. Pipeline.add()    → buffered output           │
  │  7. extract_links()   → new URLs → Scheduler      │
  └──────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────┐
  │               Cross-cutting concerns              │
  │                                                   │
  │  Cache ─── SQLite response cache with TTL         │
  │  Checkpoint ─ periodic JSON snapshots for resume  │
  │  Schema ── YAML → dynamic Pydantic model + coerce │
  └──────────────────────────────────────────────────┘
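
In rough Python terms, each worker coroutine runs a loop like the sketch below. The names (scheduler.get(), fetcher.fetch(), and so on) mirror the diagram above and are illustrative, not Ergane's internal API:

async def worker(scheduler, fetcher, parser, pipeline, hooks, extract_links):
    while True:
        request = await scheduler.get()            # 1. next URL from the frontier
        if request is None:
            break                                  # frontier exhausted

        for hook in hooks:                         # 2. request hooks, in order
            request = await hook.on_request(request)
            if request is None:
                break                              # a hook chose to skip this URL
        if request is None:
            continue

        response = await fetcher.fetch(request)    # 3. HTTP/2 fetch with retries

        for hook in hooks:                         # 4. response hooks, in order
            response = await hook.on_response(response)
            if response is None:
                break                              # a hook discarded this response
        if response is None:
            continue

        item = parser.extract(response)            # 5. CSS selectors -> Pydantic model
        await pipeline.add(item)                   # 6. buffered, multi-format output

        for url in extract_links(response):        # 7. feed discovered URLs back
            await scheduler.add(url)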

CLI Reference

Common Options

| Option | Short | Default | Description |
|---|---|---|---|
| --url | -u | none | Start URL(s), can specify multiple |
| --output | -o | output.parquet | Output file path |
| --max-pages | -n | 100 | Maximum pages to crawl |
| --max-depth | -d | 3 | Maximum crawl depth |
| --concurrency | -c | 10 | Concurrent requests |
| --rate-limit | -r | 10.0 | Requests per second per domain |
| --schema | -s | none | YAML schema file for custom extraction |
| --preset | -p | none | Use a built-in preset |
| --format | -f | auto | Output format: csv, excel, parquet, json, jsonl, sqlite |
| --timeout | -t | 30 | Request timeout in seconds |
| --proxy | -x | none | HTTP/HTTPS proxy URL |
| --same-domain/--any-domain | | --same-domain | Restrict crawling to seed domain |
| --ignore-robots | | false | Ignore robots.txt |
| --cache | | false | Enable response caching |
| --cache-dir | | .ergane_cache | Cache directory |
| --cache-ttl | | 3600 | Cache TTL in seconds |
| --resume | | | Resume from checkpoint |
| --checkpoint-interval | | 100 | Save checkpoint every N pages |
| --log-level | | INFO | DEBUG, INFO, WARNING, ERROR |
| --log-file | | none | Write logs to file |
| --no-progress | | | Disable progress bar |
| --config | -C | none | Config file path |

Run ergane --help for the full list.

Advanced CLI Examples

# Crawl with a proxy
ergane -u https://example.com -o data.csv --proxy http://localhost:8080

# Resume an interrupted crawl (requires prior checkpoint)
ergane -u https://example.com -n 500 --resume

# Save checkpoints every 50 pages with debug logging
ergane -u https://example.com -n 500 --checkpoint-interval 50 \
    --log-level DEBUG --log-file crawl.log

# Use a YAML config file and override concurrency from CLI
ergane -u https://example.com -C config.yaml -c 20

# Combine preset with custom URL and explicit format
ergane --preset hacker-news -u https://news.ycombinator.com/newest \
    -f csv -o newest.csv -n 200

Configuration

Ergane looks for a config file in these locations (first match wins):

  1. Explicit path via --config/-C
  2. ~/.ergane.yaml
  3. ./.ergane.yaml
  4. ./ergane.yaml

Example config file:

crawler:
  max_pages: 100
  max_depth: 3
  concurrency: 10
  rate_limit: 10.0

defaults:
  output_format: "csv"
  checkpoint_interval: 100

logging:
  level: "INFO"
  file: null

CLI flags override config file values.

Troubleshooting

Getting empty or partial output

  • Check --max-depth: depth 0 means only the seed URL is crawled. Increase with -d 3 to follow links.
  • Same-domain filtering: by default Ergane only follows links on the same domain as the seed URL. Use --any-domain to crawl cross-domain.
  • Selector mismatch: if using a custom schema, verify your CSS selectors match the actual site HTML (sites change frequently).
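
For example, to follow links deeper and across domains, combining the flags from the CLI reference above:

ergane -u https://example.com -d 5 --any-domain -o data.csv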

Blocked by robots.txt

If a target site disallows your user-agent in robots.txt, Ergane will return 403 for those URLs. Options:

# Ignore robots.txt (use responsibly)
ergane -u https://example.com --ignore-robots -o data.csv

Rate limiting or 429 responses

Lower the request rate and concurrency:

ergane -u https://example.com -r 2 -c 3 -o data.csv

The built-in per-domain token-bucket rate limiter (-r) controls requests per second. Reducing concurrency (-c) also lowers overall load.
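
The same knobs are available from the library; a minimal sketch using the values above:

from ergane import Crawler

async with Crawler(
    urls=["https://example.com"],
    rate_limit=2.0,    # requests per second per domain
    concurrency=3,     # concurrent workers
) as crawler:
    results = await crawler.run()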

Timeouts and connection errors

Increase the request timeout; failed requests are retried automatically (3 retries by default):

ergane -u https://slow-site.com -t 60 -o data.csv

Resuming after a crash

Ergane periodically saves checkpoints (default: every 100 pages). To resume:

ergane -u https://example.com -n 1000 --resume

The checkpoint file is automatically deleted after a successful crawl.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ergane-0.6.0.tar.gz (137.4 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ergane-0.6.0-py3-none-any.whl (46.2 kB)

Uploaded Python 3

File details

Details for the file ergane-0.6.0.tar.gz.

File metadata

  • Download URL: ergane-0.6.0.tar.gz
  • Upload date:
  • Size: 137.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.6.0.tar.gz
Algorithm Hash digest
SHA256 378fe8a1672fac923084ef3551e0e2c4ff39816a3cb98ba27cf3a489dc95c93a
MD5 ae484b596f87480af017c691d7d253bb
BLAKE2b-256 2973924585cc763b2f776bc2d49fbe4de45a0ba6adaf57b035076938659c617e

See more details on using hashes here.

Provenance

The following attestation bundles were made for ergane-0.6.0.tar.gz:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ergane-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: ergane-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 46.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7c329398ca844226a2be7bddc91808737b1a79c048f1f493e2125006b7f8a373
MD5 e7eee29152a88fdf9e1272dfa7c277dc
BLAKE2b-256 a6604dd419b326f6d89e633c70c12dcbc132f6e4dd3b09fa0602144ab304c4f2

See more details on using hashes here.

Provenance

The following attestation bundles were made for ergane-0.6.0-py3-none-any.whl:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
