

Ergane

MIT licensed · Requires Python 3.10+

High-performance async web scraper with HTTP/2 support, built with Python.

Named after Ergane, Athena's title as goddess of crafts and weaving in Greek mythology.

Features

  • HTTP/2 & Async - Fast concurrent connections with rate limiting and retry logic
  • Fast Parsing - Selectolax HTML parsing (16x faster than BeautifulSoup)
  • Built-in Presets - Pre-configured schemas for popular sites (no coding required)
  • Custom Schemas - Define Pydantic models with CSS selectors and type coercion
  • Multi-Format Output - Export to CSV, Excel, or Parquet with native types
  • Response Caching - SQLite-based caching for faster development and debugging
  • Production Ready - robots.txt compliance, graceful shutdown, checkpoints, proxy support

Installation

pip install ergane

Quick Start

Using Presets (Easiest)

# Use a preset - no schema needed!
ergane --preset quotes -o quotes.csv

# Export to Excel
ergane --preset hacker-news -o stories.xlsx

# List available presets
ergane --list-presets

Manual Crawling

# Crawl a single site
ergane -u https://example.com -n 100

# Custom output and settings
ergane -u https://docs.python.org -n 50 -c 20 -r 5 -o python_docs.parquet

Built-in Presets

Preset              Site                  Fields extracted
hacker-news         news.ycombinator.com  title, link, score, author, comments
github-repos        github.com/search     name, description, stars, language, link
reddit              old.reddit.com        title, subreddit, score, author, comments, link
quotes              quotes.toscrape.com   quote, author, tags
amazon-products     amazon.com            title, price, rating, reviews, link
ebay-listings       ebay.com              title, price, condition, shipping, link
wikipedia-articles  en.wikipedia.org      title, link
bbc-news            bbc.com/news          title, summary, link

CLI Options

Common options:

Option         Short  Default         Description
--url          -u     none            Start URL(s); may be given multiple times
--output       -o     output.parquet  Output file path
--max-pages    -n     100             Maximum pages to crawl
--max-depth    -d     3               Maximum crawl depth
--concurrency  -c     10              Concurrent requests
--rate-limit   -r     10.0            Requests per second per domain
--schema       -s     none            YAML schema file for custom extraction
--preset       -p     none            Use a built-in preset
--format       -f     auto            Output format: csv, excel, or parquet
--cache               false           Enable response caching
--cache-dir           .ergane_cache   Cache directory
--cache-ttl           3600            Cache TTL in seconds

Run ergane --help for all options including proxy, resume, logging, and config settings.

Response Caching

Enable caching to speed up development and debugging workflows:

# First run - fetches from web, caches responses
ergane --preset quotes --cache -n 10 -o quotes.csv

# Second run - instant (served from cache)
ergane --preset quotes --cache -n 10 -o quotes.csv

# Custom cache settings
ergane --preset bbc-news --cache --cache-dir ./my_cache --cache-ttl 60 -o news.csv

Cache is stored in SQLite at .ergane_cache/response_cache.db by default.
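The caching mechanism described above (a SQLite table keyed by URL, with entries expiring after --cache-ttl seconds) can be sketched as follows. This is an illustrative sketch of the general technique, not Ergane's actual implementation; the class and table names are invented for the example:

```python
import sqlite3
import time


class ResponseCache:
    """Minimal TTL-bounded SQLite response cache (illustrative sketch)."""

    def __init__(self, path: str = ":memory:", ttl: int = 3600):
        self.ttl = ttl
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(url TEXT PRIMARY KEY, body BLOB, fetched_at REAL)"
        )

    def get(self, url: str):
        # Return the cached body only if it is younger than the TTL.
        row = self.db.execute(
            "SELECT body, fetched_at FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return row[0]  # fresh hit
        return None  # miss or expired

    def put(self, url: str, body: bytes):
        # Overwrite any stale entry and reset its timestamp.
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (url, body, time.time()),
        )
        self.db.commit()
```

A short --cache-ttl (as in the bbc-news example above) trades cache hits for freshness, which suits fast-changing pages.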

Custom Schemas

Define extraction rules in a YAML schema file:

# schema.yaml
name: ProductItem
fields:
  name:
    selector: "h1.product-title"
    type: str
  price:
    selector: "span.price"
    type: float
    coerce: true  # "$19.99" -> 19.99
  tags:
    selector: "span.tag"
    type: list[str]
  image_url:
    selector: "img.product"
    attr: src
    type: str
Then run the crawl against the schema:

ergane -u https://example.com --schema schema.yaml -o products.parquet

Type coercion (coerce: true) handles common patterns: "$19.99" → 19.99, "1,234" → 1234, "yes" → True.

Supported types: str, int, float, bool, datetime, list[T].
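The coercion patterns above can be sketched with a small helper. This is an assumption about the general behavior, not Ergane's actual coercion code; the function name is invented for the example:

```python
import re


def coerce(value: str):
    """Illustrative type coercion: booleans, then numbers, else the raw string."""
    text = value.strip()
    if text.lower() in {"yes", "true"}:
        return True
    if text.lower() in {"no", "false"}:
        return False
    # Strip currency symbols and thousands separators, keep digits, dot, minus.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    if cleaned:
        try:
            return float(cleaned) if "." in cleaned else int(cleaned)
        except ValueError:
            pass  # e.g. "1.2.3" is not a number; fall through
    return text
```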

Output Formats

Output format is auto-detected from file extension:

ergane --preset quotes -o quotes.csv      # CSV
ergane --preset quotes -o quotes.xlsx     # Excel
ergane --preset quotes -o quotes.parquet  # Parquet (default)

Read a Parquet output back with Polars:

import polars as pl
df = pl.read_parquet("output.parquet")
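Extension-based auto-detection with a Parquet fallback, as described above, amounts to a small suffix lookup. A minimal sketch (not Ergane's actual code; the function name is invented):

```python
from pathlib import Path


def detect_format(output_path: str) -> str:
    """Map a file extension to an output format, defaulting to Parquet."""
    suffix = Path(output_path).suffix.lower()
    return {".csv": "csv", ".xlsx": "excel", ".parquet": "parquet"}.get(
        suffix, "parquet"
    )
```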

License

MIT
