
Ergane

MIT License · Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • HTTP/2 Support - Fast concurrent connections via httpx
  • Rate Limiting - Per-domain token bucket throttling
  • Retry Logic - Exponential backoff (max 3 attempts)
  • robots.txt Compliance - Respects crawler directives by default
  • Fast HTML Parsing - Selectolax with CSS selector extraction (16x faster than BeautifulSoup)
  • Smart Scheduling - Priority queue with URL deduplication
  • Multi-Format Output - Export to CSV, Excel, or Parquet
  • Built-in Presets - Pre-configured schemas for common sites (no coding required)
  • Graceful Shutdown - Clean termination on SIGINT/SIGTERM
  • Custom Schemas - Define Pydantic models with CSS selectors for type-safe extraction
  • Native Types - Lists and nested objects stored as native Parquet types (not JSON strings)
  • Type Coercion - Extract "$19.99" as float(19.99), "1,234" as int(1234)
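
As an illustration of the per-domain token bucket mentioned above, here is a minimal stdlib sketch of the idea (not Ergane's implementation; class and parameter names are assumptions):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allows up to `rate` requests/second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per domain gives per-domain throttling.
buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(rate=10.0, capacity=10.0)
)

# 15 back-to-back requests against one domain: only the burst passes.
allowed = sum(buckets["example.com"].try_acquire() for _ in range(15))
```

A crawler would sleep and retry when `try_acquire` returns False instead of dropping the request.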

Installation

pip install ergane

For development:

pip install ergane[dev]

Quick Start

Using Presets (Easiest)

# Use a preset - no schema needed!
ergane --preset quotes -o quotes.csv

# Export to Excel
ergane --preset hacker-news -o stories.xlsx

# List available presets
ergane --list-presets

Manual Crawling

# Crawl a single site
ergane -u https://example.com -n 100

# Crawl multiple start URLs
ergane -u https://site1.com -u https://site2.com -n 500

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet
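
The page and depth limits above are enforced by the scheduler. A rough sketch of a priority-queue frontier with URL deduplication (hypothetical names, not Ergane's API):

```python
import heapq

class Frontier:
    """Priority queue of (priority, depth, url) with URL deduplication."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.seen: set[str] = set()
        self.heap: list[tuple[int, int, str]] = []

    def add(self, url: str, depth: int, priority: int = 0) -> bool:
        # Skip URLs already scheduled or beyond the depth limit.
        if url in self.seen or depth > self.max_depth:
            return False
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, depth, url))
        return True

    def pop(self) -> tuple[str, int]:
        priority, depth, url = heapq.heappop(self.heap)
        return url, depth

f = Frontier(max_depth=3)
f.add("https://site1.com", depth=0)
f.add("https://site1.com", depth=0)        # duplicate, ignored
f.add("https://site1.com/deep", depth=4)   # beyond max depth, ignored
```

Because discovered URLs stay in `seen`, a page reached by several links is still fetched only once.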

Built-in Presets

Presets provide pre-configured schemas for popular websites:

| Preset | Site | Fields Extracted |
| --- | --- | --- |
| hacker-news | news.ycombinator.com | title, link, score, author, comments |
| github-repos | github.com/search | name, description, stars, language, link |
| reddit | old.reddit.com | title, subreddit, score, author, comments, link |
| quotes | quotes.toscrape.com | quote, author, tags |

# See all presets
ergane --list-presets

# Use preset with custom settings
ergane --preset hacker-news -n 100 -o hn.xlsx

CLI Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --url | -u | none | Start URL(s); can be specified multiple times |
| --output | -o | output.parquet | Output file path |
| --max-pages | -n | 100 | Maximum pages to crawl |
| --max-depth | -d | 3 | Maximum crawl depth from start URLs |
| --concurrency | -c | 10 | Concurrent requests |
| --rate-limit | -r | 10.0 | Requests per second per domain |
| --timeout | -t | 30.0 | Request timeout in seconds |
| --same-domain | | true | Stay on the same domain as start URLs |
| --any-domain | | false | Follow links to any domain |
| --ignore-robots | | false | Ignore robots.txt restrictions |
| --schema | -s | none | YAML schema file for custom output fields |
| --format | -f | auto | Output format: auto, csv, excel, parquet |
| --preset | -p | none | Use a built-in preset |
| --list-presets | | | Show available presets and exit |
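
The robots.txt check that --ignore-robots disables can be illustrated with the standard library's urllib.robotparser (a sketch of the general mechanism; Ergane's internals may differ):

```python
from urllib.robotparser import RobotFileParser

# Normally the parser fetches https://<domain>/robots.txt;
# here the directives are supplied inline for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler checks each URL before fetching it.
allowed = rp.can_fetch("ergane", "https://example.com/public/page")
blocked = rp.can_fetch("ergane", "https://example.com/private/page")
```

With --ignore-robots the crawler would simply skip this check for every URL.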

Custom Schemas

Define your own output schema with CSS selectors for type-safe extraction:

Programmatic Usage

from pydantic import BaseModel
from datetime import datetime
from ergane.schema import selector

class ProductItem(BaseModel):
    url: str                    # Auto-populated from crawled URL
    crawled_at: datetime        # Auto-populated timestamp

    name: str = selector("h1.product-title")
    price: float = selector("span.price", coerce=True)  # "$19.99" -> 19.99
    tags: list[str] = selector("span.tag")              # Native list type
    image_url: str = selector("img.product", attr="src")
    in_stock: bool = selector("span.availability")

# Use with Crawler
import asyncio

from ergane import Crawler, CrawlConfig

config = CrawlConfig(output_schema=ProductItem)
crawler = Crawler(
    config=config,
    start_urls=["https://example.com/products"],
    output_path="products.parquet",
    max_pages=100,
    max_depth=2,
    same_domain=True,
)
asyncio.run(crawler.run())

YAML Schema (CLI)

Create a schema file schema.yaml:

name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str

Then run:

ergane -u https://example.com --schema schema.yaml -o products.parquet
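
To build the output model, the CLI has to turn each YAML field spec into a Python type annotation. A sketch of that mapping, with the parsed YAML shown as a plain dict and a hand-rolled type map (illustrative names, not Ergane's actual code):

```python
# What yaml.safe_load would produce for the schema.yaml above.
schema = {
    "name": "ProductItem",
    "fields": {
        "name": {"selector": "h1.product-title", "type": "str"},
        "price": {"selector": "span.price", "type": "float", "coerce": True},
        "tags": {"selector": "span.tag", "type": "list[str]"},
        "image_url": {"selector": "img.product", "attr": "src", "type": "str"},
    },
}

# Map the schema's type names onto Python annotations.
TYPE_MAP = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "list[str]": list[str],
}

annotations = {
    field: TYPE_MAP[spec["type"]] for field, spec in schema["fields"].items()
}
```

In practice these annotations plus the selector metadata could feed something like `pydantic.create_model` to produce the same kind of model as the programmatic example above.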

Type Coercion

The coerce=true option enables smart type conversion:

Input Target Type Result
"$19.99" float 19.99
"1,234" int 1234
"yes" / "true" / "1" bool True
"2024-01-15" datetime datetime(2024, 1, 15)
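
A minimal coercion function reproducing the behavior in the table might look like this (an illustrative sketch, not Ergane's implementation):

```python
import re
from datetime import datetime

def coerce(raw: str, target: type):
    """Best-effort conversion of scraped text to a target type."""
    text = raw.strip()
    if target is float:
        # Drop currency symbols and thousands separators: "$19.99" -> 19.99
        return float(re.sub(r"[^\d.\-]", "", text))
    if target is int:
        # "1,234" -> 1234
        return int(re.sub(r"[^\d\-]", "", text))
    if target is bool:
        # "yes" / "true" / "1" -> True
        return text.lower() in {"yes", "true", "1"}
    if target is datetime:
        # "2024-01-15" -> datetime(2024, 1, 15)
        return datetime.fromisoformat(text)
    return text
```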

Supported Types

| Python Type | Parquet Type | Example |
| --- | --- | --- |
| str | Utf8 | "Hello" |
| int | Int64 | 42 |
| float | Float64 | 3.14 |
| bool | Boolean | True |
| datetime | Datetime | datetime.now() |
| list[T] | List(T) | ["a", "b"] |
| BaseModel | Struct | Nested object |

Output Formats

Ergane supports multiple output formats, auto-detected from file extension:

| Extension | Format | Best For |
| --- | --- | --- |
| .csv | CSV | Universal compatibility, spreadsheets |
| .xlsx | Excel | Business users, non-technical stakeholders |
| .parquet | Parquet | Large datasets, data pipelines |

# Auto-detect from extension
ergane --preset quotes -o quotes.csv
ergane --preset quotes -o quotes.xlsx
ergane --preset quotes -o quotes.parquet

# Or explicitly specify format
ergane --preset quotes -o data.out --format csv
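
The auto-detection step amounts to a suffix lookup with the explicit flag taking precedence. A sketch (hypothetical helper, not Ergane's actual code):

```python
from pathlib import Path

# Extension -> format table; an explicit --format value always wins.
EXTENSIONS = {".csv": "csv", ".xlsx": "excel", ".parquet": "parquet"}

def resolve_format(output_path: str, explicit: str = "auto") -> str:
    """Pick the output format from --format, falling back to the extension."""
    if explicit != "auto":
        return explicit
    suffix = Path(output_path).suffix.lower()
    if suffix not in EXTENSIONS:
        raise ValueError(f"Cannot infer format from {output_path!r}; pass --format")
    return EXTENSIONS[suffix]
```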

Default Schema (without custom schema)

| Column | Type | Description |
| --- | --- | --- |
| url | string | Page URL |
| title | string | Page title |
| text | string | Extracted text content (max 10k chars) |
| links | string | JSON array of extracted links |
| extracted_data | string | JSON object of custom extractions |
| crawled_at | string | ISO timestamp |
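
Because `links` and `extracted_data` are stored as JSON strings in the default schema, they need decoding after load. A stdlib sketch with an illustrative row (the values are made up for the example):

```python
import json

# One row as it might appear in the default schema.
row = {
    "url": "https://example.com",
    "title": "Example Domain",
    "links": '["https://example.com/a", "https://example.com/b"]',
    "extracted_data": '{"h1": "Example Domain"}',
}

links = json.loads(row["links"])               # back to a Python list
extracted = json.loads(row["extracted_data"])  # back to a Python dict
```

When working with a whole DataFrame, recent polars versions offer `pl.col("links").str.json_decode()` to do this column-wide.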

Reading Results

import polars as pl

# Read any format
df = pl.read_parquet("output.parquet")
df = pl.read_csv("output.csv")
df = pl.read_excel("output.xlsx")

print(df.head())

Benchmarks

Ergane uses selectolax for HTML parsing, which is significantly faster than BeautifulSoup:

| Operation | Selectolax | BS4 + lxml | Speedup |
| --- | --- | --- | --- |
| Parse (small) | 0.05ms | 0.11ms | 2.0x |
| Parse (large) | 0.19ms | 6.05ms | 31.1x |
| Extract title | 0.20ms | 6.06ms | 30.7x |
| Extract links | 0.25ms | 6.73ms | 27.3x |
| Extract text | 0.29ms | 7.03ms | 24.5x |
| CSS selector | 0.20ms | 7.25ms | 35.7x |

Average: 16x faster (1000 iterations, 34KB HTML)

Run the benchmark:

pip install beautifulsoup4 lxml
python benchmarks/parse_benchmark.py

License

MIT
