
Ergane

Python 3.10+ · MIT License

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • HTTP/2 Support - Fast concurrent connections via httpx
  • Rate Limiting - Per-domain token bucket throttling
  • Retry Logic - Exponential backoff (max 3 attempts)
  • robots.txt Compliance - Respects crawler directives by default
  • Fast HTML Parsing - Selectolax with CSS selector extraction (16x faster than BeautifulSoup)
  • Smart Scheduling - Priority queue with URL deduplication
  • Parquet Output - Efficient columnar storage via polars
  • Graceful Shutdown - Clean termination on SIGINT/SIGTERM
  • Custom Schemas - Define Pydantic models with CSS selectors for type-safe extraction
  • Native Types - Lists and nested objects stored as native Parquet types (not JSON strings)
  • Type Coercion - Extract "$19.99" as float(19.99), "1,234" as int(1234)
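
The rate limiter above is described only at the feature level; a minimal sketch of the per-domain token-bucket idea (class and names here are illustrative, not ergane's actual implementation) could look like:

```python
import time
from collections import defaultdict


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per domain gives per-domain throttling.
buckets: dict[str, TokenBucket] = defaultdict(lambda: TokenBucket(rate=10.0, capacity=10.0))
allowed = buckets["example.com"].try_acquire()
```

A real crawler would sleep and retry when `try_acquire` returns False rather than dropping the request.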

Installation

pip install ergane

For development:

pip install "ergane[dev]"

Quick Start

# Crawl a single site
ergane -u https://example.com -n 100

# Crawl multiple start URLs
ergane -u https://site1.com -u https://site2.com -n 500

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet

CLI Options

| Option            | Short | Default          | Description                                 |
| ----------------- | ----- | ---------------- | ------------------------------------------- |
| `--url`           | `-u`  | (required)       | Start URL(s); can be given multiple times   |
| `--output`        | `-o`  | `output.parquet` | Output file path                            |
| `--max-pages`     | `-n`  | `100`            | Maximum pages to crawl                      |
| `--max-depth`     | `-d`  | `3`              | Maximum crawl depth from start URLs         |
| `--concurrency`   | `-c`  | `10`             | Concurrent requests                         |
| `--rate-limit`    | `-r`  | `10.0`           | Requests per second per domain              |
| `--timeout`       | `-t`  | `30.0`           | Request timeout in seconds                  |
| `--same-domain`   |       | `true`           | Stay on the same domain as the start URLs   |
| `--any-domain`    |       | `false`          | Follow links to any domain                  |
| `--ignore-robots` |       | `false`          | Ignore robots.txt restrictions              |
| `--schema`        | `-s`  | none             | YAML schema file for custom output fields   |

Custom Schemas

Define your own output schema with CSS selectors for type-safe extraction:

Programmatic Usage

from pydantic import BaseModel
from datetime import datetime
from ergane.schema import selector

class ProductItem(BaseModel):
    url: str                    # Auto-populated from crawled URL
    crawled_at: datetime        # Auto-populated timestamp

    name: str = selector("h1.product-title")
    price: float = selector("span.price", coerce=True)  # "$19.99" -> 19.99
    tags: list[str] = selector("span.tag")              # Native list type
    image_url: str = selector("img.product", attr="src")
    in_stock: bool = selector("span.availability")

# Use with Crawler
import asyncio

from ergane import Crawler, CrawlConfig

config = CrawlConfig(output_schema=ProductItem)
crawler = Crawler(
    config=config,
    start_urls=["https://example.com/products"],
    output_path="products.parquet",
    max_pages=100,
    max_depth=2,
    same_domain=True,
)
asyncio.run(crawler.run())  # Crawler.run() is a coroutine

YAML Schema (CLI)

Create a schema file schema.yaml:

name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str

Then run:

ergane -u https://example.com --schema schema.yaml -o products.parquet
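
Ergane's YAML loader itself isn't shown here, but building a Pydantic model from a schema definition like the one above can be sketched with `pydantic.create_model` (a hypothetical sketch that ignores the `coerce` and `attr` options):

```python
from pydantic import create_model

# Field definitions as they might be parsed from schema.yaml (types resolved to
# Python types; selectors omitted for brevity).
fields_def = {
    "name": {"type": str},
    "price": {"type": float},
    "tags": {"type": list[str]},
}

# (type, ...) marks each field as required in pydantic's create_model.
ProductItem = create_model(
    "ProductItem",
    **{fname: (spec["type"], ...) for fname, spec in fields_def.items()},
)

item = ProductItem(name="Widget", price=19.99, tags=["sale", "new"])
```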

Type Coercion

The coerce=true option enables smart type conversion:

| Input                  | Target Type | Result                 |
| ---------------------- | ----------- | ---------------------- |
| `"$19.99"`             | `float`     | `19.99`                |
| `"1,234"`              | `int`       | `1234`                 |
| `"yes"` / `"true"` / `"1"` | `bool`  | `True`                 |
| `"2024-01-15"`         | `datetime`  | `datetime(2024, 1, 15)` |

Supported Types

| Python Type | Parquet Type | Example          |
| ----------- | ------------ | ---------------- |
| `str`       | Utf8         | `"Hello"`        |
| `int`       | Int64        | `42`             |
| `float`     | Float64      | `3.14`           |
| `bool`      | Boolean      | `True`           |
| `datetime`  | Datetime     | `datetime.now()` |
| `list[T]`   | List(T)      | `["a", "b"]`     |
| `BaseModel` | Struct       | Nested object    |

Output Format

Results are saved as a Parquet file with the following default schema (a custom schema replaces these columns):

| Column           | Type   | Description                             |
| ---------------- | ------ | --------------------------------------- |
| `url`            | string | Page URL                                |
| `title`          | string | Page title                              |
| `text`           | string | Extracted text content (max 10k chars)  |
| `links`          | string | JSON array of extracted links           |
| `extracted_data` | string | JSON object of custom extractions       |
| `crawled_at`     | string | ISO timestamp                           |

Read results with polars:

import polars as pl

df = pl.read_parquet("output.parquet")
print(df.head())

Benchmarks

Ergane uses selectolax for HTML parsing, which is significantly faster than BeautifulSoup:

| Operation     | Selectolax | BS4 + lxml | Speedup |
| ------------- | ---------- | ---------- | ------- |
| Parse (small) | 0.05 ms    | 0.11 ms    | 2.0x    |
| Parse (large) | 0.19 ms    | 6.05 ms    | 31.1x   |
| Extract title | 0.20 ms    | 6.06 ms    | 30.7x   |
| Extract links | 0.25 ms    | 6.73 ms    | 27.3x   |
| Extract text  | 0.29 ms    | 7.03 ms    | 24.5x   |
| CSS selector  | 0.20 ms    | 7.25 ms    | 35.7x   |

Average: 16x faster (1000 iterations, 34KB HTML)

Run the benchmark:

pip install beautifulsoup4 lxml
python benchmarks/parse_benchmark.py

License

MIT
