
High-performance async web scraper with selectolax parsing

Project description

Ergane

PyPI version License: MIT Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • HTTP/2 Support - Fast concurrent connections via httpx
  • Rate Limiting - Per-domain token bucket throttling
  • Retry Logic - Exponential backoff (max 3 attempts)
  • robots.txt Compliance - Respects crawler directives by default
  • Fast HTML Parsing - Selectolax with CSS selector extraction (16x faster than BeautifulSoup)
  • Smart Scheduling - Priority queue with URL deduplication
  • Multi-Format Output - Export to CSV, Excel, or Parquet
  • Built-in Presets - Pre-configured schemas for common sites (no coding required)
  • Graceful Shutdown - Clean termination on SIGINT/SIGTERM
  • Custom Schemas - Define Pydantic models with CSS selectors for type-safe extraction
  • Native Types - Lists and nested objects stored as native Parquet types (not JSON strings)
  • Type Coercion - Extract "$19.99" as float(19.99), "1,234" as int(1234)
  • Proxy Support - Route requests through HTTP/HTTPS proxies
  • Resume/Checkpoint - Save and restore crawler state for long jobs
  • Structured Logging - Configurable log levels and file output
  • Progress Bar - Rich progress display with live stats
  • Config Files - YAML configuration for persistent settings
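The per-domain throttling listed above is a token bucket; the idea can be sketched in a few lines. This is a stdlib-only sketch of the technique, not Ergane's actual implementation, and all names here are illustrative:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per domain, so a rate-limited site doesn't throttle the others
buckets: defaultdict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(rate=10.0, capacity=10.0)
)

assert buckets["example.com"].try_acquire()  # first request goes through immediately
```

A caller that fails to acquire a token would typically sleep briefly and retry, which is what produces the per-domain requests-per-second ceiling.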

Installation

pip install ergane

For development:

pip install ergane[dev]

Quick Start

Using Presets (Easiest)

# Use a preset - no schema needed!
ergane --preset quotes -o quotes.csv

# Export to Excel
ergane --preset hacker-news -o stories.xlsx

# List available presets
ergane --list-presets

Manual Crawling

# Crawl a single site
ergane -u https://example.com -n 100

# Crawl multiple start URLs
ergane -u https://site1.com -u https://site2.com -n 500

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet

Built-in Presets

Presets provide pre-configured schemas for popular websites:

| Preset | Site | Fields Extracted |
|---|---|---|
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |

# See all presets
ergane --list-presets

# Use preset with custom settings
ergane --preset hacker-news -n 100 -o hn.xlsx

CLI Options

| Option | Short | Default | Description |
|---|---|---|---|
| --url | -u | none | Start URL(s), can specify multiple |
| --output | -o | output.parquet | Output file path |
| --max-pages | -n | 100 | Maximum pages to crawl |
| --max-depth | -d | 3 | Maximum crawl depth from start URLs |
| --concurrency | -c | 10 | Concurrent requests |
| --rate-limit | -r | 10.0 | Requests per second per domain |
| --timeout | -t | 30.0 | Request timeout in seconds |
| --same-domain | | true | Stay on same domain as start URLs |
| --any-domain | | false | Follow links to any domain |
| --ignore-robots | | false | Ignore robots.txt restrictions |
| --schema | -s | none | YAML schema file for custom output fields |
| --format | -f | auto | Output format: auto, csv, excel, parquet |
| --preset | -p | none | Use a built-in preset |
| --list-presets | | | Show available presets and exit |
| --proxy | -x | none | HTTP/HTTPS proxy URL |
| --resume | | | Resume from last checkpoint |
| --checkpoint-interval | | 100 | Save checkpoint every N pages |
| --log-level | | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
| --log-file | | none | Write logs to file |
| --no-progress | | | Disable progress bar |
| --config | -C | none | Config file path |

Custom Schemas

Define your own output schema with CSS selectors for type-safe extraction:

Programmatic Usage

from pydantic import BaseModel
from datetime import datetime
from ergane.schema import selector

class ProductItem(BaseModel):
    url: str                    # Auto-populated from crawled URL
    crawled_at: datetime        # Auto-populated timestamp

    name: str = selector("h1.product-title")
    price: float = selector("span.price", coerce=True)  # "$19.99" -> 19.99
    tags: list[str] = selector("span.tag")              # Native list type
    image_url: str = selector("img.product", attr="src")
    in_stock: bool = selector("span.availability")

# Use with Crawler (Crawler.run() is a coroutine, so drive it with asyncio)
import asyncio

from ergane import Crawler, CrawlConfig

config = CrawlConfig(output_schema=ProductItem)
crawler = Crawler(
    config=config,
    start_urls=["https://example.com/products"],
    output_path="products.parquet",
    max_pages=100,
    max_depth=2,
    same_domain=True,
)
asyncio.run(crawler.run())

YAML Schema (CLI)

Create a schema file schema.yaml:

name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str

Then run:

ergane -u https://example.com --schema schema.yaml -o products.parquet

Type Coercion

The coerce=true option enables smart type conversion:

| Input | Target Type | Result |
|---|---|---|
| "$19.99" | float | 19.99 |
| "1,234" | int | 1234 |
| "yes" / "true" / "1" | bool | True |
| "2024-01-15" | datetime | datetime(2024, 1, 15) |
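A minimal sketch of this kind of coercion, using only the stdlib. The `coerce` helper and its exact cleaning rules are illustrative, not Ergane's internals:

```python
import re
from datetime import datetime

def coerce(value: str, target: type):
    """Coerce a scraped string to a target type, stripping currency
    symbols and thousands separators before numeric conversion."""
    if target is bool:
        return value.strip().lower() in {"yes", "true", "1"}
    if target is datetime:
        return datetime.fromisoformat(value.strip())
    # Keep only digits, sign, and decimal point: "$19.99" -> "19.99", "1,234" -> "1234"
    cleaned = re.sub(r"[^0-9.+-]", "", value)
    return target(cleaned)

assert coerce("$19.99", float) == 19.99
assert coerce("1,234", int) == 1234
assert coerce("yes", bool) is True
assert coerce("2024-01-15", datetime) == datetime(2024, 1, 15)
```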

Supported Types

| Python Type | Parquet Type | Example |
|---|---|---|
| str | Utf8 | "Hello" |
| int | Int64 | 42 |
| float | Float64 | 3.14 |
| bool | Boolean | True |
| datetime | Datetime | datetime.now() |
| list[T] | List(T) | ["a", "b"] |
| BaseModel | Struct | Nested object |

Output Formats

Ergane supports multiple output formats, auto-detected from file extension:

| Extension | Format | Best For |
|---|---|---|
| .csv | CSV | Universal compatibility, spreadsheets |
| .xlsx | Excel | Business users, non-technical stakeholders |
| .parquet | Parquet | Large datasets, data pipelines |

# Auto-detect from extension
ergane --preset quotes -o quotes.csv
ergane --preset quotes -o quotes.xlsx
ergane --preset quotes -o quotes.parquet
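Auto-detection boils down to mapping the file extension, with an explicit `--format` taking precedence. A sketch of that logic (the `detect_format` helper is hypothetical, not Ergane's API):

```python
from pathlib import Path

# Extension-to-format mapping, mirroring the table above
FORMATS = {".csv": "csv", ".xlsx": "excel", ".parquet": "parquet"}

def detect_format(path: str, explicit: str = "auto") -> str:
    """Pick the output format: an explicit --format wins,
    otherwise fall back to the file extension."""
    if explicit != "auto":
        return explicit
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"cannot auto-detect output format for {path!r}")
    return FORMATS[suffix]

assert detect_format("quotes.xlsx") == "excel"
assert detect_format("data.out", "csv") == "csv"  # explicit format overrides
```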

# Or explicitly specify format
ergane --preset quotes -o data.out --format csv

Default Schema (without custom schema)

| Column | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Page title |
| text | string | Extracted text content (max 10k chars) |
| links | string | JSON array of extracted links |
| extracted_data | string | JSON object of custom extractions |
| crawled_at | string | ISO timestamp |
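Because the default schema stores `links` and `extracted_data` as JSON strings rather than native types, decode them after loading. The row values below are made up for illustration:

```python
import json

# A row as it would look under the default schema (illustrative values)
row = {
    "url": "https://example.com",
    "links": '["https://example.com/a", "https://example.com/b"]',
    "extracted_data": '{"h1": "Example Domain"}',
}

links = json.loads(row["links"])            # back to a real list
extracted = json.loads(row["extracted_data"])  # back to a real dict

assert len(links) == 2
assert extracted["h1"] == "Example Domain"
```

With a custom schema and Parquet output this step is unnecessary, since lists and nested objects are stored as native Parquet types.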

Reading Results

import polars as pl

# Read any format
df = pl.read_parquet("output.parquet")
df = pl.read_csv("output.csv")
df = pl.read_excel("output.xlsx")

print(df.head())

Benchmarks

Ergane uses selectolax for HTML parsing, which is significantly faster than BeautifulSoup:

| Operation | Selectolax | BS4 + lxml | Speedup |
|---|---|---|---|
| Parse (small) | 0.05ms | 0.11ms | 2.0x |
| Parse (large) | 0.19ms | 6.05ms | 31.1x |
| Extract title | 0.20ms | 6.06ms | 30.7x |
| Extract links | 0.25ms | 6.73ms | 27.3x |
| Extract text | 0.29ms | 7.03ms | 24.5x |
| CSS selector | 0.20ms | 7.25ms | 35.7x |

Average: 16x faster (1000 iterations, 34KB HTML)

Run the benchmark:

pip install beautifulsoup4 lxml
python benchmarks/parse_benchmark.py

Proxy Support

Route requests through a proxy server:

# HTTP proxy
ergane --preset quotes --proxy http://localhost:8080 -o quotes.csv

# Authenticated proxy
ergane -u https://example.com --proxy http://user:pass@proxy:8080 -o data.csv

Resume/Checkpoint

For long-running crawls, Ergane automatically saves checkpoints:

# Start a large crawl
ergane --preset quotes -n 1000 -o large.csv
# Press Ctrl+C to interrupt

# Resume from checkpoint
ergane --preset quotes -n 1000 -o large.csv --resume

# Customize checkpoint interval (default: 100 pages)
ergane --preset quotes -n 1000 --checkpoint-interval 50 -o large.csv

Checkpoints are stored in .ergane_checkpoint.json and automatically deleted on successful completion.
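The checkpoint file's exact contents aren't documented here; a minimal sketch of the save/restore idea, using the `.ergane_checkpoint.json` name from above (the state fields are illustrative, not Ergane's actual format):

```python
import json
from pathlib import Path

CHECKPOINT = Path(".ergane_checkpoint.json")  # filename from the docs above

def save_checkpoint(visited: set[str], frontier: list[str]) -> None:
    """Persist crawler state so an interrupted run can pick up where it left off."""
    state = {"visited": sorted(visited), "frontier": frontier}
    CHECKPOINT.write_text(json.dumps(state))

def load_checkpoint() -> tuple[set[str], list[str]]:
    """Restore saved state, or start fresh if no checkpoint exists."""
    if not CHECKPOINT.exists():
        return set(), []
    state = json.loads(CHECKPOINT.read_text())
    return set(state["visited"]), state["frontier"]

save_checkpoint({"https://a.com"}, ["https://b.com"])
visited, frontier = load_checkpoint()
assert "https://a.com" in visited
CHECKPOINT.unlink()  # deleted on successful completion, as described above
```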

Config Files

Store persistent settings in a YAML config file:

# ~/.ergane.yaml or ./.ergane.yaml
crawler:
  rate_limit: 10.0
  concurrency: 20
  timeout: 30.0
  respect_robots_txt: true
  user_agent: "MyBot/1.0"
  proxy: null

defaults:
  max_pages: 100
  max_depth: 3
  same_domain: true
  output_format: parquet

logging:
  level: INFO
  file: null

Ergane searches for config files in order:

  1. ~/.ergane.yaml (home directory)
  2. ./.ergane.yaml (current directory, hidden)
  3. ./ergane.yaml (current directory)

Or specify explicitly:

ergane --config myconfig.yaml --preset quotes -o quotes.csv

CLI arguments always override config file values.

Logging

Control log output with --log-level and --log-file:

# Debug output
ergane --preset quotes --log-level DEBUG -o quotes.csv

# Save logs to file
ergane --preset quotes --log-level INFO --log-file crawl.log -o quotes.csv

# Disable progress bar (useful for scripts)
ergane --preset quotes --no-progress -o quotes.csv

License

MIT



Download files

Download the file for your platform.

Source Distribution

ergane-0.3.1.tar.gz (120.6 kB)

Uploaded Source

Built Distribution


ergane-0.3.1-py3-none-any.whl (35.2 kB)

Uploaded Python 3

File details

Details for the file ergane-0.3.1.tar.gz.

File metadata

  • Download URL: ergane-0.3.1.tar.gz
  • Upload date:
  • Size: 120.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.3.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | c63c73a9b5a8e4ffe73818d627f5b8a51c2d66d616feb7bc07221540be479e13 |
| MD5 | 68c0224f28e4913ef220e9be07c9feae |
| BLAKE2b-256 | c77fe22c8450e9d79d14e2d30acb0440724d92801929f669201e35d6148a1ebc |


Provenance

The following attestation bundles were made for ergane-0.3.1.tar.gz:

Publisher: publish.yml on pyamin1878/ergane

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ergane-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: ergane-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 35.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ergane-0.3.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc5cec03c5691c89cc727124c68853cb51e366c998cdd2690a1c0b7c0544c28c |
| MD5 | af63c97511f04147ddf75ba48a86a294 |
| BLAKE2b-256 | 4af7bcae098fa062a01e598cfa178b33668e978622380a8afbba0a9d24e33a6c |


Provenance

The following attestation bundles were made for ergane-0.3.1-py3-none-any.whl:

Publisher: publish.yml on pyamin1878/ergane

