SSscraper

Async web scraping library for SiteSudharo.
Built on Crawl4ai — downloads page resources to local disk, AWS S3, or Cloudflare R2 and returns structured metadata for every file.

Features

  • Full-page scraping via Crawl4ai (headless browser)
  • Downloads: images, CSS, fonts, JS, documents, video, audio
  • Storage backends: Local · S3 · R2 (drop-in swappable)
  • Per-resource metadata: URL, stored path, content-type, size, MD5 + SHA-256
  • Async & concurrent — configurable parallelism
  • Retry with backoff, per-file size cap
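The last two bullets (bounded concurrency, retry with backoff, and a per-file size cap) can be sketched with the standard library alone. This is an illustration of the general technique, not SSscraper's actual implementation: `fetch_with_retry`, `download_all`, `TooLargeError`, and all parameter names and defaults below are hypothetical.

```python
import asyncio
import random


class TooLargeError(Exception):
    """Raised when a payload exceeds the per-file size cap (hypothetical name)."""


async def fetch_with_retry(fetch, url, *, retries=3, base_delay=0.5,
                           max_bytes=10_000_000):
    """Await fetch(url), retrying network errors with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            data = await fetch(url)
        except OSError:  # ConnectionError and TimeoutError are subclasses
            if attempt == retries:
                raise
            # Delay doubles on each attempt, plus a small random jitter.
            delay = base_delay * 2 ** attempt
            await asyncio.sleep(delay + random.uniform(0, base_delay / 10))
            continue
        if len(data) > max_bytes:
            raise TooLargeError(f"{url}: {len(data)} bytes exceeds {max_bytes}")
        return data


async def download_all(fetch, urls, *, concurrency=8, **kwargs):
    """Download all URLs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(url):
        async with sem:
            try:
                return url, await fetch_with_retry(fetch, url, **kwargs)
            except Exception as exc:  # record the failure instead of aborting
                return url, exc

    return await asyncio.gather(*(one(u) for u in urls))
```

Here `fetch` is any coroutine function that takes a URL and returns bytes, so the same wrapper works with whichever HTTP client is in use. Failures are returned alongside their URL rather than raised, which mirrors the idea of reporting a status per resource.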

Quick start

import asyncio
from ssscraper import SScraper, LocalStorage, ScraperConfig

async def main():
    scraper = SScraper(
        storage=LocalStorage("./downloads"),
        config=ScraperConfig(download_images=True, download_css=True, download_fonts=True),
    )
    result = await scraper.scrape("https://example.com")
    print(f"Downloaded {len(result.succeeded)} resources")
    for r in result.resources:
        print(r.model_dump_json())

asyncio.run(main())

Storage backends

Local

from ssscraper import LocalStorage
storage = LocalStorage(base_dir="./downloads")

AWS S3

from ssscraper import S3Storage
storage = S3Storage(
    bucket="my-bucket",
    prefix="sitesudharo",
    region="us-east-1",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)
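
To avoid hardcoding credentials, the keyword arguments shown above could be assembled from environment variables. A minimal sketch: `s3_kwargs_from_env` is a hypothetical helper, not part of the library; the keyword names come from the example above and the variable names are the standard AWS ones.

```python
import os


def s3_kwargs_from_env(bucket, prefix="", env=os.environ):
    """Build S3Storage keyword arguments from AWS environment variables.

    Hypothetical helper. Raises KeyError if a credential variable is missing.
    """
    return {
        "bucket": bucket,
        "prefix": prefix,
        # Falls back to the region used in the example above.
        "region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

The result could then be passed as `S3Storage(**s3_kwargs_from_env("my-bucket", "sitesudharo"))`.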

Cloudflare R2

from ssscraper import R2Storage
storage = R2Storage(
    bucket="my-bucket",
    account_id="<cf-account-id>",
    access_key_id="...",
    secret_access_key="...",
    prefix="sitesudharo",
    public_domain="assets.yourdomain.com",  # optional
)
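
The three backends are described as drop-in swappable, which suggests they share a common interface. The sketch below shows what such an interface might look like. It is an assumption for illustration only: the `Storage` protocol, its `save` method, `InMemoryStorage`, and `store_resource` are hypothetical names, not the library's documented API.

```python
import asyncio
from typing import Protocol


class Storage(Protocol):
    """Hypothetical minimal interface a storage backend could satisfy."""

    async def save(self, key: str, data: bytes, content_type: str) -> str:
        """Persist data under key and return the stored path or URL."""
        ...


class InMemoryStorage:
    """Toy backend that keeps files in a dict; handy for tests."""

    def __init__(self) -> None:
        self.files: dict[str, tuple[bytes, str]] = {}

    async def save(self, key: str, data: bytes, content_type: str) -> str:
        self.files[key] = (data, content_type)
        return f"memory://{key}"


async def store_resource(storage: Storage, key: str, data: bytes,
                         content_type: str) -> str:
    # The caller depends only on the protocol, so any backend can be swapped in.
    return await storage.save(key, data, content_type)
```

For example, `asyncio.run(store_resource(InMemoryStorage(), "img/logo.png", b"...", "image/png"))` returns `"memory://img/logo.png"`.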

ScrapeResult shape

result.page            # PageMetadata — url, title, description, scraped_at
result.html            # raw HTML
result.markdown        # Crawl4ai markdown
result.resources       # List[ResourceMetadata]
result.succeeded       # filter: status == SUCCESS
result.failed          # filter: status == FAILED
result.images          # filter: type == image
result.stylesheets     # filter: type == css
result.fonts           # filter: type == font
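
The filtered attributes above behave like views over `result.resources`. One way such views could be built is with properties, sketched here using stand-in dataclasses. `Status`, `Resource`, and `Result` are hypothetical simplifications for illustration, not the library's real models.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class Resource:
    """Stand-in for ResourceMetadata with only the fields needed here."""
    original_url: str
    resource_type: str
    status: Status


@dataclass
class Result:
    """Stand-in for ScrapeResult: every view is a filter over resources."""
    resources: list[Resource] = field(default_factory=list)

    @property
    def succeeded(self) -> list[Resource]:
        return [r for r in self.resources if r.status is Status.SUCCESS]

    @property
    def failed(self) -> list[Resource]:
        return [r for r in self.resources if r.status is Status.FAILED]

    @property
    def images(self) -> list[Resource]:
        return [r for r in self.resources if r.resource_type == "image"]
```

Because each view is computed from the same list, a resource can appear in more than one of them, for example in both `failed` and `images`.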

ResourceMetadata fields

Field            Type            Description
original_url     str             Source URL
storage_key      str             Relative key inside the storage backend
stored_path      str             Absolute local path or full cloud URL
resource_type    ResourceType    image / css / javascript / font / …
content_type     str             HTTP Content-Type
size_bytes       int             File size
checksum_md5     str             MD5 hex digest
checksum_sha256  str             SHA-256 hex digest
status           ResourceStatus  success / failed / skipped
error            str             Error message if failed
downloaded_at    datetime        UTC timestamp
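
The size and checksum fields can all be derived in a single pass over the downloaded bytes. A minimal sketch using `hashlib`: the `checksums` helper is hypothetical, and only the returned key names are taken from the fields above.

```python
import hashlib
from typing import Iterable


def checksums(chunks: Iterable[bytes]) -> dict:
    """Compute size, MD5, and SHA-256 over a stream of byte chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    for chunk in chunks:
        # Feed each chunk to both hashers so the data is read only once.
        md5.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {
        "size_bytes": size,
        "checksum_md5": md5.hexdigest(),
        "checksum_sha256": sha256.hexdigest(),
    }
```

Hashing chunk by chunk keeps memory use flat for large files such as video, and the digests match what hashing the whole payload at once would produce.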

Install

pip install -e .          # from source
pip install ssscraper     # once published

Python ≥ 3.11 required.

Config options

See ssscraper/config.py — all fields are optional with sensible defaults.
