Async web scraping library for SiteSudharo — downloads resources to local/S3/R2 with full metadata
SSscraper
Async web scraping library for SiteSudharo.
Built on Crawl4ai, it downloads page resources to local disk, AWS S3, or Cloudflare R2 and returns structured metadata for every file.
Features
- Full-page scraping via Crawl4ai (headless browser)
- Downloads: images, CSS, fonts, JS, documents, video, audio
- Storage backends: Local · S3 · R2 (drop-in swappable)
- Per-resource metadata: URL, stored path, content-type, size, MD5 + SHA-256
- Async & concurrent — configurable parallelism
- Retry with backoff, per-file size cap
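The retry behaviour listed above is not specified further in this README, so the following is only a generic sketch of retry with exponential backoff in asyncio, not SSscraper's actual implementation. The function name `retry_with_backoff` and its parameters are hypothetical.

```python
import asyncio
import random


async def retry_with_backoff(op, *, attempts=3, base_delay=0.5, max_delay=8.0):
    """Await the coroutine function `op` until it succeeds or attempts run out.

    Hypothetical helper for illustration; not part of the ssscraper API.
    """
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay,
            # with jitter so parallel downloads do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter factor is a common design choice when many downloads run in parallel, since it spreads retries out instead of sending them all at the same instant.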
Quick start
import asyncio

from ssscraper import SScraper, LocalStorage, ScraperConfig


async def main():
    scraper = SScraper(
        storage=LocalStorage("./downloads"),
        config=ScraperConfig(download_images=True, download_css=True, download_fonts=True),
    )
    result = await scraper.scrape("https://example.com")
    print(f"Downloaded {len(result.succeeded)} resources")
    for r in result.resources:
        print(r.model_dump_json())


asyncio.run(main())
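To scrape several pages in one run, the quick start can be wrapped in a small helper. This is a sketch under assumptions: `scrape_many` is a hypothetical name, it relies only on the `scrape(url)` coroutine shown above, and whether a single `SScraper` instance may be shared across concurrent calls is not stated in this README, so check that before relying on it.

```python
import asyncio


async def scrape_many(scraper, urls, limit=3):
    """Scrape several URLs, running at most `limit` scrapes at a time.

    Hypothetical helper; `scraper` is any object with an async `scrape(url)`.
    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    gate = asyncio.Semaphore(limit)

    async def one(url):
        async with gate:
            try:
                return url, await scraper.scrape(url)
            except Exception as exc:  # keep going if one page fails
                return url, exc

    return dict(await asyncio.gather(*(one(u) for u in urls)))
```

Inside `main()` this could be called as `results = await scrape_many(scraper, ["https://example.com", "https://example.org"])`.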
Storage backends
Local
from ssscraper import LocalStorage
storage = LocalStorage(base_dir="./downloads")
AWS S3
from ssscraper import S3Storage
storage = S3Storage(
    bucket="my-bucket",
    prefix="sitesudharo",
    region="us-east-1",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)
Cloudflare R2
from ssscraper import R2Storage
storage = R2Storage(
    bucket="my-bucket",
    account_id="<cf-account-id>",
    access_key_id="...",
    secret_access_key="...",
    prefix="sitesudharo",
    public_domain="assets.yourdomain.com",  # optional
)
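Rather than hard-coding credentials, the keyword arguments can be collected from environment variables. The sketch below assumes only the `R2Storage` parameter names shown above; the helper `r2_kwargs_from_env` and every `SSSCRAPER_R2_*` variable name are hypothetical and not defined by the library.

```python
import os


def r2_kwargs_from_env(env=os.environ):
    """Build R2Storage keyword arguments from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    required = {
        "bucket": "SSSCRAPER_R2_BUCKET",
        "account_id": "SSSCRAPER_R2_ACCOUNT_ID",
        "access_key_id": "SSSCRAPER_R2_ACCESS_KEY_ID",
        "secret_access_key": "SSSCRAPER_R2_SECRET_ACCESS_KEY",
    }
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    kwargs = {param: env[name] for param, name in required.items()}
    # Optional settings: the prefix falls back to the value used in the example.
    kwargs["prefix"] = env.get("SSSCRAPER_R2_PREFIX", "sitesudharo")
    if env.get("SSSCRAPER_R2_PUBLIC_DOMAIN"):
        kwargs["public_domain"] = env["SSSCRAPER_R2_PUBLIC_DOMAIN"]
    return kwargs
```

With the variables set, the backend would be created as `storage = R2Storage(**r2_kwargs_from_env())`.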
ScrapeResult shape
result.page # PageMetadata — url, title, description, scraped_at
result.html # raw HTML
result.markdown # Crawl4ai markdown
result.resources # List[ResourceMetadata]
result.succeeded # filter: status == SUCCESS
result.failed # filter: status == FAILED
result.images # filter: type == image
result.stylesheets # filter: type == css
result.fonts # filter: type == font
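As an illustration of working with `result.resources`, the sketch below totals file counts and bytes per resource type. It relies only on the `resource_type` and `size_bytes` fields of ResourceMetadata; `summarize_by_type` is a hypothetical helper, and the demo objects are stand-ins rather than real scrape output.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_by_type(resources):
    """Return {resource type: (file count, total bytes)}.

    Hypothetical helper; works on any objects exposing `resource_type`
    and `size_bytes`.
    """
    totals = defaultdict(lambda: [0, 0])
    for r in resources:
        # resource_type is documented as a ResourceType; use its enum value
        # when present, otherwise the raw attribute.
        kind = getattr(r.resource_type, "value", r.resource_type)
        totals[kind][0] += 1
        totals[kind][1] += r.size_bytes or 0  # treat a missing size as 0
    return {kind: tuple(pair) for kind, pair in totals.items()}


# Stand-in objects for demonstration only.
demo = [
    SimpleNamespace(resource_type="image", size_bytes=1200),
    SimpleNamespace(resource_type="image", size_bytes=800),
    SimpleNamespace(resource_type="css", size_bytes=300),
]
summary = summarize_by_type(demo)
```

On a real result this would be called as `summarize_by_type(result.succeeded)`.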
ResourceMetadata fields
| Field | Type | Description |
|---|---|---|
| original_url | str | Source URL |
| storage_key | str | Relative key inside the storage backend |
| stored_path | str | Absolute local path or full cloud URL |
| resource_type | ResourceType | image / css / javascript / font / … |
| content_type | str | HTTP Content-Type |
| size_bytes | int | File size |
| checksum_md5 | str | MD5 hex digest |
| checksum_sha256 | str | SHA-256 hex digest |
| status | ResourceStatus | success / failed / skipped |
| error | str | Error message if failed |
| downloaded_at | datetime | UTC timestamp |
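The two checksum fields make it possible to re-verify a downloaded file later. The sketch below uses only the standard library and assumes the file is on local disk (that is, `stored_path` is a local path, as with LocalStorage); `file_digests` and `matches_metadata` are hypothetical helper names.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) for a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def matches_metadata(path, checksum_md5, checksum_sha256):
    """Check a local file against the hex digests recorded in its metadata."""
    md5_hex, sha256_hex = file_digests(path)
    return md5_hex == checksum_md5.lower() and sha256_hex == checksum_sha256.lower()
```

For a resource `r` stored locally, the check would read `matches_metadata(r.stored_path, r.checksum_md5, r.checksum_sha256)`.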
Install
pip install -e . # from source
pip install ssscraper # once published
Python ≥ 3.11 required.
Config options
See ssscraper/config.py — all fields are optional with sensible defaults.
Download files
File details
Details for the file ssscraper-0.1.0.tar.gz.
File metadata
- Download URL: ssscraper-0.1.0.tar.gz
- Upload date:
- Size: 21.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e5b519dd382ac5b78f61e5b024afe974409bd120f99d0abb99cbfb9c3bc59d4 |
| MD5 | 11ca2ca29dfcf2e0c917406eba1f4883 |
| BLAKE2b-256 | c6a95de2cbc9eebfd441a15a4f642d5dcb9d972b5941dcaeb3f27bba3c8f1158 |
File details
Details for the file ssscraper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ssscraper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f497ddc0761252135a51277f39235e0ed9e0d0dc13b3aa65c18a2636f1287a12 |
| MD5 | 6b2c410082d6d2337e7fea0758916e1d |
| BLAKE2b-256 | bbb986902b1136b4ec2b5e0c525c3e6a35d8b34283f23ed4e712b2821d160928 |