silkworm-rs
Async-first web scraping framework built on wreq (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.
NEW: Use silkworm-mcp to build scrapers.
Features
- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.
- wreq-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]`.
- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.
- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), `SkipNonHTMLMiddleware` to drop non-HTML callbacks, and `CloudflareCrawlMiddleware` for Browser Rendering crawl jobs.
- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
- Structured logging via `logly` (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
Installation
From PyPI with pip:
pip install silkworm-rs
From PyPI with uv (recommended for faster installs):
uv pip install silkworm-rs
# or if using uv's project management:
uv add silkworm-rs
From source:
uv venv # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e .
Targets Python 3.13+; dependencies are pinned in pyproject.toml.
Quick start
Define a spider by subclassing Spider, implementing parse, and yielding items or follow-up Request objects. This example writes quotes to data/quotes.jl and enables basic user agent, retry, and non-HTML filtering middlewares.
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
RetryMiddleware,
SkipNonHTMLMiddleware,
UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline
class QuotesSpider(Spider):
name = "quotes"
start_urls = ("https://quotes.toscrape.com/",)
async def parse(self, response: Response):
if not isinstance(response, HTMLResponse):
return
html = response
for quote in await html.select(".quote"):
text_el = await quote.select_first(".text")
author_el = await quote.select_first(".author")
if text_el is None or author_el is None:
continue
tags = await quote.select(".tag")
yield {
"text": text_el.text,
"author": author_el.text,
"tags": [t.text for t in tags],
}
if next_link := await html.select_first("li.next > a"):
yield html.follow(next_link.attr("href"), callback=self.parse)
if __name__ == "__main__":
run_spider(
QuotesSpider,
request_middlewares=[UserAgentMiddleware()],
response_middlewares=[
SkipNonHTMLMiddleware(),
RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
],
item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
concurrency=16,
request_timeout=10,
log_stats_interval=30,
)
`run_spider`/`crawl` knobs:
- `concurrency`: number of concurrent HTTP requests; default 16.
- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).
- `request_timeout`: per-request timeout (seconds).
- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).
- `html_max_size_bytes`: limit on the HTML parsed into `AsyncDocument` to avoid huge payloads.
- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.
- `request_middlewares`/`response_middlewares`/`item_pipelines`: plug-ins run on every request/response/item.
- Use `run_spider_rsloop(...)` instead of `run_spider(...)` to run under rsloop (requires `pip install silkworm-rs[rsloop]`).
- Use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).
- Use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
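The `max_pending_requests` bound works like a standard bounded producer/consumer queue: once the queue is full, whatever enqueues new requests blocks until workers drain it. A minimal sketch of that backpressure pattern, using plain `asyncio` and independent of silkworm's internals (the URLs and worker logic are illustrative):

```python
import asyncio


async def main() -> int:
    concurrency = 4
    # Mirrors the documented default bound: producers block once the
    # queue holds concurrency * 10 pending requests.
    queue: asyncio.Queue = asyncio.Queue(maxsize=concurrency * 10)

    async def producer() -> None:
        for i in range(100):
            # put() suspends here whenever the queue is full (backpressure)
            await queue.put(f"https://example.com/page/{i}")
        for _ in range(concurrency):
            await queue.put(None)  # one shutdown sentinel per worker

    async def worker(results: list) -> None:
        while (url := await queue.get()) is not None:
            results.append(url)  # a real worker would fetch the URL here

    results: list = []
    await asyncio.gather(producer(), *(worker(results) for _ in range(concurrency)))
    return len(results)


print(asyncio.run(main()))  # 100
```

Because the queue is bounded, memory stays proportional to `concurrency * 10` no matter how many follow-up requests a spider yields.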
Built-in middlewares and pipelines
from silkworm.middlewares import (
CloudflareCrawlMiddleware,
DelayMiddleware,
ProxyMiddleware,
RetryMiddleware,
SkipNonHTMLMiddleware,
UserAgentMiddleware,
)
from silkworm.pipelines import (
CallbackPipeline, # invoke a custom callback function on each item
CSVPipeline,
JsonLinesPipeline,
MsgPackPipeline, # requires: pip install silkworm-rs[msgpack]
SQLitePipeline,
XMLPipeline,
TaskiqPipeline, # requires: pip install silkworm-rs[taskiq]
PolarsPipeline, # requires: pip install silkworm-rs[polars]
ExcelPipeline, # requires: pip install silkworm-rs[excel]
YAMLPipeline, # requires: pip install silkworm-rs[yaml]
AvroPipeline, # requires: pip install silkworm-rs[avro]
ElasticsearchPipeline, # requires: pip install silkworm-rs[elasticsearch]
MongoDBPipeline, # requires: pip install silkworm-rs[mongodb]
MySQLPipeline, # requires: pip install silkworm-rs[mysql]
PostgreSQLPipeline, # requires: pip install silkworm-rs[postgresql]
S3JsonLinesPipeline, # requires: pip install silkworm-rs[s3]
VortexPipeline, # requires: pip install silkworm-rs[vortex]
WebhookPipeline, # sends items to webhook endpoints using wreq
GoogleSheetsPipeline, # requires: pip install silkworm-rs[gsheets]
SnowflakePipeline, # requires: pip install silkworm-rs[snowflake]
FTPPipeline, # requires: pip install silkworm-rs[ftp]
SFTPPipeline, # requires: pip install silkworm-rs[sftp]
CassandraPipeline, # requires: pip install silkworm-rs[cassandra]
CouchDBPipeline, # requires: pip install silkworm-rs[couchdb]
DynamoDBPipeline, # requires: pip install silkworm-rs[dynamodb]
DuckDBPipeline, # requires: pip install silkworm-rs[duckdb]
)
run_spider(
QuotesSpider,
request_middlewares=[
UserAgentMiddleware(), # rotate/custom user agent
DelayMiddleware(min_delay=0.3, max_delay=1.2), # polite throttling
# ProxyMiddleware with round-robin selection (default)
# ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
# ProxyMiddleware with random selection
# ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
# ProxyMiddleware from file with random selection
# ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
],
response_middlewares=[
RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]), # backoff + retry
SkipNonHTMLMiddleware(), # drop callbacks for images/APIs/etc
],
item_pipelines=[
JsonLinesPipeline("data/quotes.jl"),
SQLitePipeline("data/quotes.db", table="quotes"),
XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
MsgPackPipeline("data/quotes.msgpack"),
],
)
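`RetryMiddleware` above is described as backing off exponentially between attempts. As a hedged sketch of what such a schedule looks like (the `base` and `factor` values here are illustrative assumptions, not silkworm's actual parameters):

```python
def backoff_schedule(max_times: int, base: float = 1.0, factor: float = 2.0) -> list:
    # Hypothetical exponential-backoff schedule: the delay doubles on each
    # retry attempt. The exact formula RetryMiddleware uses may differ.
    return [base * factor**attempt for attempt in range(max_times)]


print(backoff_schedule(3))  # [1.0, 2.0, 4.0]
```

With `max_times=3` this yields three waits that roughly double each time, which spreads retries out instead of hammering a struggling server.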
`DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay`/`max_delay` (random), or `delay_func` (custom).

`ProxyMiddleware` supports three modes:
- Round-robin (default): `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.
- Random selection: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.
- From file: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.

Middleware notes:
- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.
- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.
- `CloudflareCrawlMiddleware` is opt-in per request via `request.meta["cloudflare_crawl"]`; it submits a Cloudflare Browser Rendering crawl job, polls until completion, and hands your callback a synthetic JSON `Response` with the final API payload.

Pipeline notes:
- `JsonLinesPipeline` writes items to a local JSON Lines file and, when `opendal` is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).
- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.
- `MsgPackPipeline` writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).
- `TaskiqPipeline` sends items to a Taskiq queue for distributed processing (requires `pip install silkworm-rs[taskiq]`).
- `PolarsPipeline` writes items to a Parquet file using Polars for efficient columnar storage (requires `pip install silkworm-rs[polars]`).
- `ExcelPipeline` writes items to an Excel .xlsx file (requires `pip install silkworm-rs[excel]`).
- `YAMLPipeline` writes items to a YAML file (requires `pip install silkworm-rs[yaml]`).
- `AvroPipeline` writes items to an Avro file with optional schema (requires `pip install silkworm-rs[avro]`).
- `ElasticsearchPipeline` sends items to an Elasticsearch index (requires `pip install silkworm-rs[elasticsearch]`).
- `MongoDBPipeline` sends items to a MongoDB collection (requires `pip install silkworm-rs[mongodb]`).
- `MySQLPipeline` sends items to a MySQL database table as JSON (requires `pip install silkworm-rs[mysql]`).
- `PostgreSQLPipeline` sends items to a PostgreSQL database table as JSONB (requires `pip install silkworm-rs[postgresql]`).
- `S3JsonLinesPipeline` writes items to AWS S3 in JSON Lines format using async OpenDAL (requires `pip install silkworm-rs[s3]`).
- `VortexPipeline` writes items to a Vortex file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires `pip install silkworm-rs[vortex]`).
- `WebhookPipeline` sends items to webhook endpoints via HTTP POST/PUT using wreq (the same HTTP client as the spider) with support for batching and custom headers.
- `GoogleSheetsPipeline` appends items to Google Sheets with automatic flattening of nested data structures (requires `pip install silkworm-rs[gsheets]` and service account credentials).
- `SnowflakePipeline` sends items to Snowflake data warehouse tables as JSON (requires `pip install silkworm-rs[snowflake]`).
- `FTPPipeline` writes items to an FTP server in JSON Lines format (requires `pip install silkworm-rs[ftp]`).
- `SFTPPipeline` writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires `pip install silkworm-rs[sftp]`).
- `CassandraPipeline` sends items to Apache Cassandra database tables (requires `pip install silkworm-rs[cassandra]`).
- `CouchDBPipeline` sends items to CouchDB databases as documents (requires `pip install silkworm-rs[couchdb]`).
- `DynamoDBPipeline` sends items to AWS DynamoDB tables with automatic table creation (requires `pip install silkworm-rs[dynamodb]`).
- `DuckDBPipeline` sends items to a DuckDB database table as JSON (requires `pip install silkworm-rs[duckdb]`).
- `CallbackPipeline` invokes a custom callback function (sync or async) on each item, enabling inline processing logic without creating a full pipeline class. See example below.
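The CSV flattening described above (nested dicts become underscore-joined key paths, lists become comma-joined strings) can be sketched in a few lines. This `flatten` helper is a hypothetical illustration of the documented behavior, not silkworm's actual implementation:

```python
def flatten(item: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into key paths and join lists with commas,
    mirroring the CSVPipeline behavior described above (hypothetical
    helper, not silkworm's actual code)."""
    flat: dict = {}
    for key, value in item.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))  # recurse into nested dicts
        elif isinstance(value, list):
            flat[name] = ",".join(str(v) for v in value)  # lists -> "a,b,c"
        else:
            flat[name] = value
    return flat


row = flatten({"user": {"name": "Alice"}, "tags": ["a", "b"]})
print(row)  # {'user_name': 'Alice', 'tags': 'a,b'}
```

This is why `CSVPipeline(fieldnames=[...])` can name columns like `author` and `tags` even when items carry nested structures.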
Using CallbackPipeline for custom processing
Process items with custom callback functions without creating a full pipeline class:
from silkworm.pipelines import CallbackPipeline
# Sync callback
def print_item(item, spider):
print(f"[{spider.name}] {item}")
return item
# Async callback
async def validate_item(item, spider):
# Could do async operations like database checks
if len(item.get("text", "")) < 10:
print(f"Warning: Short text in item: {item}")
return item
# Modifying callback
def enrich_item(item, spider):
item["spider_name"] = spider.name
item["processed"] = True
return item
run_spider(
QuotesSpider,
item_pipelines=[
CallbackPipeline(callback=print_item),
CallbackPipeline(callback=validate_item),
CallbackPipeline(callback=enrich_item),
],
)
Callbacks receive `(item, spider)` and should return the processed item; returning None passes the original item through unchanged.
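The mix of sync and async callbacks plus the None-passthrough rule can be modeled in a few lines. This is a simplified model of the chaining behavior described above, not silkworm's actual engine; `FakeSpider` and the callbacks are illustrative:

```python
import asyncio
import inspect


async def apply_callbacks(item, spider, callbacks):
    # Simplified model of CallbackPipeline chaining: each callback gets
    # (item, spider); awaitable results are awaited; a None return keeps
    # the current item unchanged.
    for cb in callbacks:
        result = cb(item, spider)
        if inspect.isawaitable(result):
            result = await result
        if result is not None:
            item = result
    return item


class FakeSpider:
    name = "quotes"


def enrich(item, spider):
    item["spider_name"] = spider.name
    return item


def log_only(item, spider):
    return None  # item passes through unchanged


item = asyncio.run(apply_callbacks({"text": "hi"}, FakeSpider(), [enrich, log_only]))
print(item)  # {'text': 'hi', 'spider_name': 'quotes'}
```

Note that a logging-style callback can simply return None and stay out of the item's way, while an enriching callback returns the (possibly mutated) item for the next stage.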
Streaming items to a queue with TaskiqPipeline
Stream scraped items to a Taskiq queue for distributed processing:
from taskiq import InMemoryBroker
from silkworm.pipelines import TaskiqPipeline
broker = InMemoryBroker()
@broker.task
async def process_item(item):
# Your item processing logic here
print(f"Processing: {item}")
# Save to database, send to another service, etc.
pipeline = TaskiqPipeline(broker, task=process_item)
run_spider(MySpider, item_pipelines=[pipeline])
This enables distributed processing, retries, rate limiting, and other Taskiq features. See examples/taskiq_quotes_spider.py for a complete example.
Handling non-HTML responses
Keep crawls cheap when URLs mix HTML and binaries/APIs:
response_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)]
# Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs
run_spider(MySpider, html_max_size_bytes=1_000_000)
Performance optimization with rsloop
For improved async performance, enable rsloop as a drop-in replacement for asyncio's event loop:
pip install silkworm-rs[rsloop]
# or with uv:
uv pip install silkworm-rs[rsloop]
Then call run_spider_rsloop (same signature as run_spider):
from silkworm import run_spider_rsloop
run_spider_rsloop(
QuotesSpider,
concurrency=32,
)
Performance optimization with uvloop
For improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):
pip install silkworm-rs[uvloop]
# or with uv:
uv pip install silkworm-rs[uvloop]
Then call run_spider_uvloop (same signature as run_spider):
from silkworm import run_spider_uvloop
run_spider_uvloop(
QuotesSpider,
concurrency=32,
)
uvloop can provide 2-4x performance improvement for I/O-bound workloads.
Performance optimization with winloop (Windows)
For Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):
pip install silkworm-rs[winloop]
# or with uv:
uv pip install silkworm-rs[winloop]
Then call run_spider_winloop (same signature as run_spider):
from silkworm import run_spider_winloop
run_spider_winloop(
QuotesSpider,
concurrency=32,
)
winloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.
Running spiders with trio
If you prefer trio over asyncio, you can use run_spider_trio instead of run_spider:
pip install silkworm-rs[trio]
# or with uv:
uv pip install silkworm-rs[trio]
Then use run_spider_trio:
from silkworm import run_spider_trio
run_spider_trio(
QuotesSpider,
concurrency=16,
request_timeout=10,
)
This runs your spider using trio as the async backend via trio-asyncio compatibility layer.
JavaScript rendering with Lightpanda (CDP)
For pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. This uses the Chrome DevTools Protocol (CDP) to control a browser.
Installation
pip install silkworm-rs[cdp]
# or with uv:
uv pip install silkworm-rs[cdp]
Starting Lightpanda
lightpanda --remote-debugging-port=9222
Or use Chrome/Chromium:
chromium --remote-debugging-port=9222 --headless
Using CDP in your spider
There are two ways to use CDP: the convenience API or custom spider integration.
Convenience API (simple one-off fetches)
import asyncio
from silkworm import fetch_html_cdp
async def main():
# Fetch HTML with JavaScript rendering
text, doc = await fetch_html_cdp(
"https://example.com",
ws_endpoint="ws://127.0.0.1:9222",
timeout=30.0
)
# Extract data from rendered page
title = doc.select_first("title")
print(title.text if title else "No title")
asyncio.run(main())
Full Spider Integration
from silkworm import HTMLResponse, Request, Response, Spider
from silkworm.cdp import CDPClient
class LightpandaSpider(Spider):
name = "lightpanda"
start_urls = ("https://example.com/",)
def __init__(self, **kwargs):
super().__init__(**kwargs)
self._cdp_client = None
async def start_requests(self):
# Connect to CDP endpoint
self._cdp_client = CDPClient(
ws_endpoint="ws://127.0.0.1:9222",
timeout=30.0
)
await self._cdp_client.connect()
for url in self.start_urls:
yield Request(url=url, callback=self.parse)
async def parse(self, response: Response):
if not isinstance(response, HTMLResponse):
return
# Extract links from JavaScript-rendered page
for link in await response.select("a"):
href = link.attr("href")
if href:
yield {"url": href}
async def close(self):
if self._cdp_client:
await self._cdp_client.close()
See examples/lightpanda_simple.py and examples/lightpanda_spider.py for complete working examples.
Note: CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.
Logging and crawl statistics
- Structured logs via `logly`; set `SILKWORM_LOG_LEVEL=DEBUG` for verbose request/response/middleware output.
- Periodic statistics with `log_stats_interval`; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB.
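For example, a verbose one-off run of the quotes example (Unix shell; the script path comes from the Examples section below):

```shell
# Enable DEBUG-level structured logs for a single run
SILKWORM_LOG_LEVEL=DEBUG python examples/quotes_spider.py
```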
Limitations
- By default, HTTP fetches are wreq-based without JavaScript execution; pages requiring client-side rendering can use the optional CDP integration (see "JavaScript rendering with Lightpanda" section) or external browser automation tools.
- Request deduplication keys only on `Request.url`; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set `dont_filter=True` or make the URL unique yourself.
- HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces an `html_max_size_bytes`/`doc_max_size_bytes` cap (default 5 MB) in scraper-rs selectors, so very large pages may need a higher limit or preprocessing.
- Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.
- Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because `cassandra-driver` depends on libev there.
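The URL-only dedup limitation is easy to trip over with POST-style requests. A toy model of the behavior (a plain seen-set keyed on URL, not silkworm's actual engine; the request dicts are illustrative stand-ins for `Request` objects):

```python
# Toy model of URL-keyed request dedup: a request is dropped when its URL
# was already seen, regardless of method or body, unless dont_filter is set.
def schedule(requests: list) -> list:
    seen: set = set()
    accepted = []
    for req in requests:
        if not req.get("dont_filter") and req["url"] in seen:
            continue  # same URL, different body: still dropped
        seen.add(req["url"])
        accepted.append(req)
    return accepted


batch = [
    {"url": "https://api.example.com/search", "body": "q=rust"},
    {"url": "https://api.example.com/search", "body": "q=go"},  # dropped
    {"url": "https://api.example.com/search", "body": "q=go", "dont_filter": True},
]
print(len(schedule(batch)))  # 2
```

If you paginate an API by POST body, either set `dont_filter=True` on each request or encode the page number into the URL so every request has a distinct dedup key.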
Examples
- `python examples/quotes_spider.py` → `data/quotes.jl`
- `python examples/quotes_spider_trio.py` → `data/quotes_trio.jl` (demonstrates trio backend)
- `python examples/quotes_spider_winloop.py` → `data/quotes_winloop.jl` (demonstrates winloop backend for Windows)
- `python examples/hackernews_spider.py --pages 5` → `data/hackernews.jl`
- `python examples/lobsters_spider.py --pages 2` → `data/lobsters.jl`
- `python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl` (includes `SkipNonHTMLMiddleware` and stricter HTML size limits)
- `python examples/export_formats_demo.py --pages 2` → JSONL, XML, and CSV outputs in `data/`
- `python examples/taskiq_quotes_spider.py --pages 2` → demonstrates `TaskiqPipeline` for queue-based processing
- `python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50` → `data/sitemap_meta.jl` (extracts meta tags and Open Graph data from sitemap URLs)
- `python examples/lightpanda_simple.py` → demonstrates CDP/Lightpanda for JavaScript rendering (requires `pip install silkworm-rs[cdp]` and running Lightpanda)
- `python examples/lightpanda_spider.py` → full spider example using CDP/Lightpanda
Convenience API
For one-off fetches without a full spider:
Standard HTTP fetch
import asyncio
from silkworm import fetch_html
async def main():
text, doc = await fetch_html("https://example.com")
title = await doc.select_first("title")
print(title.text if title else "No title")
asyncio.run(main())
CDP-based fetch (with JavaScript rendering)
import asyncio
from silkworm import fetch_html_cdp
async def main():
# Requires Lightpanda/Chrome running with CDP enabled
text, doc = await fetch_html_cdp("https://example.com")
title = await doc.select_first("title")
print(title.text if title else "No title")
asyncio.run(main())
Contributing
Pull requests and issues are welcome. To set up a dev environment, install uv, create a Python 3.13 virtualenv, and sync dev dependencies:
uv venv --python python3.13
uv sync --group dev
Run the checks before opening a PR:
just fmt && just lint && just typecheck && just test
Acknowledgements
Silkworm is built on top of excellent open-source projects:
- wreq - HTTP client with browser impersonation capabilities
- scraper-rs - Fast HTML parsing library
- logly - Structured logging
- rxml - XML parsing and writing
We are grateful to the maintainers and contributors of these projects for their work.
License
MIT License. See LICENSE for details.