High-performance async web scraper with selectolax parsing

Maintainers

pjams

Project description

Ergane

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

HTTP/2 Support - Fast concurrent connections via httpx
Rate Limiting - Per-domain token bucket throttling
Retry Logic - Exponential backoff (max 3 attempts)
robots.txt Compliance - Respects crawler directives by default
Fast HTML Parsing - Selectolax with CSS selector extraction (16x faster than BeautifulSoup)
Smart Scheduling - Priority queue with URL deduplication
Multi-Format Output - Export to CSV, Excel, or Parquet
Built-in Presets - Pre-configured schemas for common sites (no coding required)
Graceful Shutdown - Clean termination on SIGINT/SIGTERM
Custom Schemas - Define Pydantic models with CSS selectors for type-safe extraction
Native Types - Lists and nested objects stored as native Parquet types (not JSON strings)
Type Coercion - Extract "$19.99" as float(19.99), "1,234" as int(1234)
Proxy Support - Route requests through HTTP/HTTPS proxies
Resume/Checkpoint - Save and restore crawler state for long jobs
Structured Logging - Configurable log levels and file output
Progress Bar - Rich progress display with live stats
Config Files - YAML configuration for persistent settings
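
The retry feature above (exponential backoff, at most 3 attempts) can be illustrated with a short sketch. This is not Ergane's internal code: `fetch_with_retry`, the `fetch` callable, and the 0.5 s base delay are hypothetical names and values chosen for illustration.

```python
import asyncio


async def fetch_with_retry(fetch, url, max_attempts=3, base_delay=0.5,
                           sleep=asyncio.sleep):
    """Await `fetch(url)`, retrying failures with exponentially growing delays.

    `fetch` is any coroutine function taking a URL (hypothetical stand-in
    for an HTTP client call). `sleep` is injectable so tests can skip waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base, 2*base, 4*base, ...
            await sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a request that fails twice and then succeeds waits 0.5 s and then 1.0 s before the third attempt. Production code would usually catch only transient errors (timeouts, 5xx responses) rather than every exception.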

Installation

pip install ergane

For development:

pip install ergane[dev]

Quick Start

Using Presets (Easiest)

# Use a preset - no schema needed!
ergane --preset quotes -o quotes.csv

# Export to Excel
ergane --preset hacker-news -o stories.xlsx

# List available presets
ergane --list-presets

Manual Crawling

# Crawl a single site
ergane -u https://example.com -n 100

# Crawl multiple start URLs
ergane -u https://site1.com -u https://site2.com -n 500

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet
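
A crawl bounded by page count and depth needs a queue of pending URLs that never hands out the same URL twice. The sketch below shows one way a priority queue with URL deduplication and a depth limit could work. `Frontier` is a hypothetical name, and ordering by depth (shallow pages first) is an assumption, not a documented Ergane behavior.

```python
import heapq
from urllib.parse import urldefrag


class Frontier:
    """Priority queue of URLs with deduplication and a depth limit."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self._heap = []     # entries: (priority, sequence, url, depth)
        self._seen = set()  # every URL ever accepted
        self._seq = 0       # tie-breaker that keeps insertion order

    def add(self, url, depth=0, priority=None):
        """Queue `url`; return False if it is too deep or already seen."""
        url, _fragment = urldefrag(url)  # "page#top" and "page" are one page
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        prio = depth if priority is None else priority
        heapq.heappush(self._heap, (prio, self._seq, url, depth))
        self._seq += 1
        return True

    def pop(self):
        """Return the (url, depth) pair with the lowest priority value."""
        _prio, _seq, url, depth = heapq.heappop(self._heap)
        return url, depth

    def __len__(self):
        return len(self._heap)
```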

Built-in Presets

Presets provide pre-configured schemas for popular websites:

Preset	Site	Fields Extracted
`hacker-news`	news.ycombinator.com	title, link, score, author, comments
`github-repos`	github.com/search	name, description, stars, language, link
`reddit`	old.reddit.com	title, subreddit, score, author, comments, link
`quotes`	quotes.toscrape.com	quote, author, tags

# See all presets
ergane --list-presets

# Use preset with custom settings
ergane --preset hacker-news -n 100 -o hn.xlsx

CLI Options

Option	Short	Default	Description
`--url`	`-u`	none	Start URL(s), can specify multiple
`--output`	`-o`	`output.parquet`	Output file path
`--max-pages`	`-n`	`100`	Maximum pages to crawl
`--max-depth`	`-d`	`3`	Maximum crawl depth from start URLs
`--concurrency`	`-c`	`10`	Concurrent requests
`--rate-limit`	`-r`	`10.0`	Requests per second per domain
`--timeout`	`-t`	`30.0`	Request timeout in seconds
`--same-domain`		`true`	Stay on same domain as start URLs
`--any-domain`		`false`	Follow links to any domain
`--ignore-robots`		`false`	Ignore robots.txt restrictions
`--schema`	`-s`	none	YAML schema file for custom output fields
`--format`	`-f`	`auto`	Output format: `auto`, `csv`, `excel`, `parquet`
`--preset`	`-p`	none	Use a built-in preset
`--list-presets`			Show available presets and exit
`--proxy`	`-x`	none	HTTP/HTTPS proxy URL
`--resume`			Resume from last checkpoint
`--checkpoint-interval`		`100`	Save checkpoint every N pages
`--log-level`		`INFO`	Logging level (DEBUG, INFO, WARNING, ERROR)
`--log-file`		none	Write logs to file
`--no-progress`			Disable progress bar
`--config`	`-C`	none	Config file path
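
The `--rate-limit` value is applied per domain using token bucket throttling. As a rough sketch of that technique (not Ergane's actual implementation), each domain gets a bucket that refills at the configured rate. `TokenBucket` and `DomainLimiter` are hypothetical names, and setting the burst capacity equal to the rate is an assumption.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Allows up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity=None, clock=time.monotonic):
        self.rate = rate
        self.capacity = rate if capacity is None else capacity
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may go."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class DomainLimiter:
    """Keeps one independent TokenBucket per host."""

    def __init__(self, rate, clock=time.monotonic):
        self.rate = rate
        self.clock = clock
        self.buckets = {}

    def try_acquire(self, url):
        host = urlsplit(url).netloc
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.rate, clock=self.clock)
        return self.buckets[host].try_acquire()
```

Because buckets are keyed by host, a slow limit on one site does not delay requests to another. An async crawler would typically sleep until a token is available instead of returning `False`.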

Custom Schemas

Define your own output schema with CSS selectors for type-safe extraction, either as a Pydantic model in Python or as a YAML file for the CLI.

Programmatic Usage

import asyncio
from datetime import datetime

from pydantic import BaseModel

from ergane import Crawler, CrawlConfig
from ergane.schema import selector


class ProductItem(BaseModel):
    url: str                    # Auto-populated from crawled URL
    crawled_at: datetime        # Auto-populated timestamp

    name: str = selector("h1.product-title")
    price: float = selector("span.price", coerce=True)  # "$19.99" -> 19.99
    tags: list[str] = selector("span.tag")              # Native list type
    image_url: str = selector("img.product", attr="src")
    in_stock: bool = selector("span.availability")


async def main() -> None:
    # Use the schema with Crawler
    config = CrawlConfig(output_schema=ProductItem)
    crawler = Crawler(
        config=config,
        start_urls=["https://example.com/products"],
        output_path="products.parquet",
        max_pages=100,
        max_depth=2,
        same_domain=True,
    )
    await crawler.run()


# crawler.run() is a coroutine, so it needs an event loop
asyncio.run(main())

YAML Schema (CLI)

Create a schema file schema.yaml:

name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str

Then run:

ergane -u https://example.com --schema schema.yaml -o products.parquet

Type Coercion

Setting coerce: true in YAML (coerce=True in Python) converts the extracted text to the field's declared type:

Input	Target Type	Result
`"$19.99"`	`float`	`19.99`
`"1,234"`	`int`	`1234`
`"yes"` / `"true"` / `"1"`	`bool`	`True`
`"2024-01-15"`	`datetime`	`datetime(2024, 1, 15)`
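
The conversions in the table can be approximated in a few lines. This is an illustrative sketch rather than Ergane's coercion code: the `coerce` function is hypothetical, and treating every string outside "yes"/"true"/"1" as `False` is an assumption.

```python
import re
from datetime import datetime

_TRUTHY = {"yes", "true", "1"}


def coerce(text, target):
    """Convert scraped `text` to `target` (bool, int, float, datetime or str)."""
    text = text.strip()
    if target is bool:
        return text.lower() in _TRUTHY
    if target in (int, float):
        # Drop currency symbols, thousands separators and other non-numeric
        # characters: "$19.99" -> "19.99", "1,234" -> "1234".
        number = float(re.sub(r"[^0-9.\-]", "", text))
        return int(number) if target is int else number
    if target is datetime:
        return datetime.fromisoformat(text)  # expects ISO 8601 input
    return text


print(coerce("$19.99", float))           # 19.99
print(coerce("1,234", int))              # 1234
print(coerce("yes", bool))               # True
print(coerce("2024-01-15", datetime))    # 2024-01-15 00:00:00
```

The numeric branch assumes "." is the decimal separator, so a value such as "1.234,50" would need locale-aware handling.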

Supported Types

Python Type	Parquet Type	Example
`str`	`Utf8`	`"Hello"`
`int`	`Int64`	`42`
`float`	`Float64`	`3.14`
`bool`	`Boolean`	`True`
`datetime`	`Datetime`	`datetime.now()`
`list[T]`	`List(T)`	`["a", "b"]`
`BaseModel`	`Struct`	Nested object

Output Formats

Ergane supports multiple output formats, auto-detected from file extension:

Extension	Format	Best For
`.csv`	CSV	Universal compatibility, spreadsheets
`.xlsx`	Excel	Business users, non-technical stakeholders
`.parquet`	Parquet	Large datasets, data pipelines

# Auto-detect from extension
ergane --preset quotes -o quotes.csv
ergane --preset quotes -o quotes.xlsx
ergane --preset quotes -o quotes.parquet

# Or explicitly specify format
ergane --preset quotes -o data.out --format csv
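
The interaction between the file extension and `--format` can be sketched as follows. `resolve_format` is a hypothetical helper, and raising an error for an unknown extension is an assumption; the README does not say what Ergane does in that case.

```python
from pathlib import Path

# Extension-to-format mapping taken from the table above.
_BY_EXTENSION = {".csv": "csv", ".xlsx": "excel", ".parquet": "parquet"}


def resolve_format(output_path, fmt="auto"):
    """Return the output format, honoring an explicit choice over the extension."""
    if fmt != "auto":
        return fmt
    suffix = Path(output_path).suffix.lower()
    if suffix not in _BY_EXTENSION:
        raise ValueError(
            f"cannot infer format from {output_path!r}; pass an explicit format"
        )
    return _BY_EXTENSION[suffix]


print(resolve_format("quotes.xlsx"))       # excel
print(resolve_format("data.out", "csv"))   # csv
```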

Default Schema (without custom schema)

Column	Type	Description
`url`	string	Page URL
`title`	string	Page title
`text`	string	Extracted text content (max 10k chars)
`links`	string	JSON array of extracted links
`extracted_data`	string	JSON object of custom extractions
`crawled_at`	string	ISO timestamp
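
Because `links` and `extracted_data` hold JSON text in the default schema, they need decoding before use. A minimal standard-library sketch, assuming the CSV output stores those columns as the same JSON strings (`load_rows` is a hypothetical helper):

```python
import csv
import io
import json


def load_rows(csv_text):
    """Parse default-schema CSV text and decode the JSON-encoded columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells fall back to an empty list / dict.
        row["links"] = json.loads(row["links"]) if row.get("links") else []
        row["extracted_data"] = (
            json.loads(row["extracted_data"]) if row.get("extracted_data") else {}
        )
        rows.append(row)
    return rows
```

For a file on disk, pass `open("output.csv", newline="").read()` as `csv_text`.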

Reading Results

import polars as pl

# Use the reader that matches your output format
df = pl.read_parquet("output.parquet")
df = pl.read_csv("output.csv")
df = pl.read_excel("output.xlsx")

print(df.head())

Benchmarks

Ergane uses selectolax for HTML parsing, which is significantly faster than BeautifulSoup:

Operation	Selectolax	BS4 + lxml	Speedup
Parse (small)	0.05ms	0.11ms	2.0x
Parse (large)	0.19ms	6.05ms	31.1x
Extract title	0.20ms	6.06ms	30.7x
Extract links	0.25ms	6.73ms	27.3x
Extract text	0.29ms	7.03ms	24.5x
CSS selector	0.20ms	7.25ms	35.7x

Average: 16x faster (1000 iterations, 34KB HTML)

Run the benchmark:

pip install beautifulsoup4 lxml
python benchmarks/parse_benchmark.py
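
For readers who want to time their own parsing functions, the general shape of such a measurement is sketched below. This is not the contents of `benchmarks/parse_benchmark.py`; `mean_ms`, `speedup`, and the standard-library stand-in parser are hypothetical and only illustrate the method (mean wall-clock time over a fixed number of iterations).

```python
import time
from html.parser import HTMLParser


def mean_ms(func, arg, iterations=1000):
    """Return the mean wall-clock time of `func(arg)` in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        func(arg)
    return (time.perf_counter() - start) * 1000 / iterations


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def parse_stdlib(html):
    """Stand-in workload: tokenize HTML with the standard-library parser."""
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


sample = "<html><body>" + "<p class='x'>hello</p>" * 200 + "</body></html>"
print(f"stdlib parse: {mean_ms(parse_stdlib, sample, iterations=100):.3f} ms")
```

To compare two parsers, call `mean_ms` once per parser on the same input and pass both results to `speedup`.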

Proxy Support

Route requests through a proxy server:

# HTTP proxy
ergane --preset quotes --proxy http://localhost:8080 -o quotes.csv

# Authenticated proxy
ergane -u https://example.com --proxy http://user:pass@proxy:8080 -o data.csv

Resume/Checkpoint

For long-running crawls, Ergane automatically saves a checkpoint every --checkpoint-interval pages (default: 100):

# Start a large crawl
ergane --preset quotes -n 1000 -o large.csv
# Press Ctrl+C to interrupt

# Resume from checkpoint
ergane --preset quotes -n 1000 -o large.csv --resume

# Customize checkpoint interval (default: 100 pages)
ergane --preset quotes -n 1000 --checkpoint-interval 50 -o large.csv

Checkpoints are stored in .ergane_checkpoint.json and automatically deleted on successful completion.
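
The save, resume, and delete-on-success lifecycle can be sketched with the standard library. The contents of Ergane's checkpoint file are not documented here, so the `state` dictionary in the usage note is purely hypothetical, as are the function names.

```python
import json
import os
from pathlib import Path

CHECKPOINT = Path(".ergane_checkpoint.json")


def save_checkpoint(state, path=CHECKPOINT):
    """Write `state` as JSON, replacing any previous checkpoint atomically."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # a crash mid-write never leaves a half-written file


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or None when there is nothing to resume."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def clear_checkpoint(path=CHECKPOINT):
    """Remove the checkpoint after a successful run."""
    path.unlink(missing_ok=True)
```

A crawler built this way might call `save_checkpoint({"pages_crawled": 100, "pending": [...]})` every N pages, `load_checkpoint()` at startup when resuming, and `clear_checkpoint()` once the crawl finishes.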

Config Files

Store persistent settings in a YAML config file:

# ~/.ergane.yaml or ./.ergane.yaml
crawler:
  rate_limit: 10.0
  concurrency: 20
  timeout: 30.0
  respect_robots_txt: true
  user_agent: "MyBot/1.0"
  proxy: null

defaults:
  max_pages: 100
  max_depth: 3
  same_domain: true
  output_format: parquet

logging:
  level: INFO
  file: null

Ergane searches for config files in this order:

~/.ergane.yaml (home directory)
./.ergane.yaml (current directory, hidden)
./ergane.yaml (current directory)

Or specify explicitly:

ergane --config myconfig.yaml --preset quotes -o quotes.csv

CLI arguments always override config file values.
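
The lookup and override rules can be summarized in code. This sketch assumes that the first file found in the listed order is used and that an unset CLI option is represented as `None`; `find_config` and `merge_settings` are hypothetical helpers, and YAML parsing is left out.

```python
from pathlib import Path


def find_config(explicit=None, home=None, cwd=None):
    """Return the config path to use, or None if no file is found."""
    if explicit:  # --config wins over the search
        return Path(explicit)
    home = Path(home) if home else Path.home()
    cwd = Path(cwd) if cwd else Path.cwd()
    candidates = (home / ".ergane.yaml", cwd / ".ergane.yaml", cwd / "ergane.yaml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate  # assumption: first match wins
    return None


def merge_settings(defaults, file_values, cli_values):
    """Layer settings: built-in defaults < config file < CLI arguments."""
    merged = dict(defaults)
    merged.update(file_values)
    # Only options the user actually passed override the file.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


print(merge_settings({"concurrency": 10}, {"concurrency": 20}, {"concurrency": None}))
# {'concurrency': 20}
```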

Logging

Control log output with --log-level and --log-file:

# Debug output
ergane --preset quotes --log-level DEBUG -o quotes.csv

# Save logs to file
ergane --preset quotes --log-level INFO --log-file crawl.log -o quotes.csv

# Disable progress bar (useful for scripts)
ergane --preset quotes --no-progress -o quotes.csv
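
When driving a crawl from your own script, the same two knobs map onto Python's standard `logging` module. The sketch below is an assumption about how one might wire this up, not Ergane's setup code; `configure_logging` and the logger name `"crawler"` are hypothetical.

```python
import logging


def configure_logging(level="INFO", log_file=None):
    """Log to the console at `level`, and also to `log_file` when given."""
    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=getattr(logging, level.upper()),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return logging.getLogger("crawler")  # hypothetical logger name


log = configure_logging("DEBUG")
log.debug("fetching %s", "https://quotes.toscrape.com")
```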

License

MIT

Release history

Version	Released
0.7.3	Feb 27, 2026
0.7.1	Feb 24, 2026
0.7.0	Feb 18, 2026
0.6.0	Feb 13, 2026
0.5.0	Feb 7, 2026
0.4.0	Jan 26, 2026
0.3.1 (this version)	Jan 26, 2026
0.3.0	Jan 26, 2026
0.2.0	Jan 26, 2026
0.1.0	Jan 25, 2026

Download files

Source Distribution

ergane-0.3.1.tar.gz (120.6 kB), uploaded Jan 26, 2026, Source

Built Distribution

ergane-0.3.1-py3-none-any.whl (35.2 kB), uploaded Jan 26, 2026, Python 3

File details

Details for the file ergane-0.3.1.tar.gz.

File metadata

Download URL: ergane-0.3.1.tar.gz
Upload date: Jan 26, 2026
Size: 120.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`c63c73a9b5a8e4ffe73818d627f5b8a51c2d66d616feb7bc07221540be479e13`
MD5	`68c0224f28e4913ef220e9be07c9feae`
BLAKE2b-256	`c77fe22c8450e9d79d14e2d30acb0440724d92801929f669201e35d6148a1ebc`
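
A downloaded file can be checked against the SHA256 digest above with Python's `hashlib`. This is a generic sketch; `sha256_of` is a hypothetical helper, and the path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 published for ergane-0.3.1.tar.gz (from the table above).
EXPECTED = "c63c73a9b5a8e4ffe73818d627f5b8a51c2d66d616feb7bc07221540be479e13"

# Usage, after downloading the archive:
#   assert sha256_of("ergane-0.3.1.tar.gz") == EXPECTED
```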


Provenance

The following attestation bundles were made for ergane-0.3.1.tar.gz:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ergane-0.3.1.tar.gz
- Subject digest: c63c73a9b5a8e4ffe73818d627f5b8a51c2d66d616feb7bc07221540be479e13
- Sigstore transparency entry: 855092870
- Sigstore integration time: Jan 26, 2026
Source repository:
- Permalink: pyamin1878/ergane@bef60e86766791d3265149782bb68c8c5930cbd1
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/pyamin1878
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bef60e86766791d3265149782bb68c8c5930cbd1
- Trigger Event: release

File details

Details for the file ergane-0.3.1-py3-none-any.whl.

File metadata

Download URL: ergane-0.3.1-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 35.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc5cec03c5691c89cc727124c68853cb51e366c998cdd2690a1c0b7c0544c28c`
MD5	`af63c97511f04147ddf75ba48a86a294`
BLAKE2b-256	`4af7bcae098fa062a01e598cfa178b33668e978622380a8afbba0a9d24e33a6c`


Provenance

The following attestation bundles were made for ergane-0.3.1-py3-none-any.whl:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ergane-0.3.1-py3-none-any.whl
- Subject digest: cc5cec03c5691c89cc727124c68853cb51e366c998cdd2690a1c0b7c0544c28c
- Sigstore transparency entry: 855092872
- Sigstore integration time: Jan 26, 2026
Source repository:
- Permalink: pyamin1878/ergane@bef60e86766791d3265149782bb68c8c5930cbd1
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/pyamin1878
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@bef60e86766791d3265149782bb68c8c5930cbd1
- Trigger Event: release
