High-performance async web scraper with selectolax parsing

Ergane

Requires Python 3.10+. License: MIT.

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • HTTP/2 Support - Fast concurrent connections via httpx
  • Rate Limiting - Per-domain token bucket throttling
  • Retry Logic - Exponential backoff (max 3 attempts)
  • robots.txt Compliance - Respects crawler directives by default
  • Fast HTML Parsing - Selectolax with CSS selector extraction (16x faster than BeautifulSoup)
  • Smart Scheduling - Priority queue with URL deduplication
  • Parquet Output - Efficient columnar storage via polars
  • Graceful Shutdown - Clean termination on SIGINT/SIGTERM
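
The "Smart Scheduling" item can be pictured with a small frontier built on the standard library. This is only a sketch, not Ergane's actual scheduler: the `Frontier` class, the use of crawl depth as the priority, and the fragment-stripping deduplication rule are all assumptions made for illustration.

```python
import heapq
import itertools
from urllib.parse import urldefrag


class Frontier:
    """Hypothetical priority queue of URLs that never queues a URL twice."""

    def __init__(self, max_depth: int = 3) -> None:
        self.max_depth = max_depth
        self._heap: list[tuple[int, int, str]] = []
        self._seen: set[str] = set()
        # Tie-breaker so URLs at equal depth come out in insertion order.
        self._counter = itertools.count()

    def add(self, url: str, depth: int = 0) -> bool:
        """Queue a URL; return False if it is a duplicate or too deep."""
        # Assumption: page#a and page#b count as the same page.
        key = urldefrag(url).url
        if depth > self.max_depth or key in self._seen:
            return False
        self._seen.add(key)
        heapq.heappush(self._heap, (depth, next(self._counter), key))
        return True

    def pop(self) -> tuple[str, int]:
        """Return the shallowest pending URL together with its depth."""
        depth, _, url = heapq.heappop(self._heap)
        return url, depth

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier(max_depth=1)
frontier.add("https://example.com/a", depth=1)
frontier.add("https://example.com/")               # depth 0, popped first
frontier.add("https://example.com/a#top", depth=1)  # duplicate, ignored
frontier.add("https://example.com/b", depth=2)      # deeper than max_depth, ignored
while frontier:
    print(frontier.pop())
```

Keeping the seen-set separate from the heap means a URL stays deduplicated even after it has been popped and crawled.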

Installation

pip install ergane

For development:

pip install "ergane[dev]"

Quick Start

# Crawl a single site
ergane -u https://example.com -n 100

# Crawl multiple start URLs
ergane -u https://site1.com -u https://site2.com -n 500

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet

CLI Options

Option           Short  Default         Description
--url            -u     required        Start URL(s), can specify multiple
--output         -o     output.parquet  Output file path
--max-pages      -n     100             Maximum pages to crawl
--max-depth      -d     3               Maximum crawl depth from start URLs
--concurrency    -c     10              Concurrent requests
--rate-limit     -r     10.0            Requests per second per domain
--timeout        -t     30.0            Request timeout in seconds
--same-domain           true            Stay on same domain as start URLs
--any-domain            false           Follow links to any domain
--ignore-robots         false           Ignore robots.txt restrictions
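
To make the `--rate-limit` setting (requests per second per domain) concrete, here is a minimal token bucket sketch. It is not taken from Ergane's source: the class names, the burst size of 1, and the injectable clock are assumptions chosen to keep the example small and testable.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """One bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def wait_time(self, now: float) -> float:
        """Reserve one token; return how long the caller should sleep first."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a reservation that must be waited out.
        return -self.tokens / self.rate


class DomainRateLimiter:
    """Hypothetical per-domain limiter: one TokenBucket per network location."""

    def __init__(self, rate: float = 10.0, burst: float = 1.0,
                 clock=time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def wait_time(self, url: str) -> float:
        domain = urlsplit(url).netloc
        now = self.clock()
        bucket = self._buckets.get(domain)
        if bucket is None:
            bucket = self._buckets[domain] = TokenBucket(self.rate, self.burst, now)
        return bucket.wait_time(now)


limiter = DomainRateLimiter(rate=10.0)
for url in ("https://site1.com/a", "https://site1.com/b", "https://site2.com/"):
    print(url, limiter.wait_time(url))
```

In an async crawler the returned delay would typically be passed to `asyncio.sleep` before the request is sent, so a slow domain never blocks requests to other domains.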

Output Format

Results are saved as a Parquet file with the following schema:

Column          Type    Description
url             string  Page URL
title           string  Page title
text            string  Extracted text content (max 10k chars)
links           string  JSON array of extracted links
extracted_data  string  JSON object of custom extractions
crawled_at      string  ISO timestamp

Read results with polars:

import polars as pl

df = pl.read_parquet("output.parquet")
print(df.head())
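
Because `links`, `extracted_data`, and `crawled_at` are stored as strings, they usually need decoding before use. The helper below is a sketch using only the standard library. The `decode_row` name and the sample row are hypothetical, and it assumes that no column is null and that timestamps carry a numeric UTC offset (a trailing `Z` is only accepted by `datetime.fromisoformat` on Python 3.11+).

```python
import json
from datetime import datetime


def decode_row(row: dict) -> dict:
    """Return a copy of one result row with JSON and timestamp columns parsed."""
    decoded = dict(row)
    decoded["links"] = json.loads(row["links"])                    # list of URLs
    decoded["extracted_data"] = json.loads(row["extracted_data"])  # dict
    decoded["crawled_at"] = datetime.fromisoformat(row["crawled_at"])
    return decoded


# Hypothetical row shaped like the schema above; real values will differ.
sample = {
    "url": "https://example.com/",
    "title": "Example Domain",
    "text": "Example text",
    "links": '["https://example.com/about"]',
    "extracted_data": '{"heading": "Example Domain"}',
    "crawled_at": "2025-01-01T12:00:00+00:00",
}
row = decode_row(sample)
print(row["links"], row["extracted_data"], row["crawled_at"].isoformat())
```

With polars, rows in this dictionary form can be obtained from `df.iter_rows(named=True)` and passed to the helper one at a time.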

Benchmarks

Ergane uses selectolax for HTML parsing, which is significantly faster than BeautifulSoup:

Operation      Selectolax  BS4 + lxml  Speedup
Parse (small)  0.05ms      0.11ms      2.0x
Parse (large)  0.19ms      6.05ms      31.1x
Extract title  0.20ms      6.06ms      30.7x
Extract links  0.25ms      6.73ms      27.3x
Extract text   0.29ms      7.03ms      24.5x
CSS selector   0.20ms      7.25ms      35.7x

Average: 16x faster (1000 iterations, 34KB HTML)

Run the benchmark:

pip install beautifulsoup4 lxml
python benchmarks/parse_benchmark.py

License

MIT
