🕸️ Silkweb
The LLM-native Python web scraping library. Fetch anything. Extract everything. No selectors required.
Silkweb is a fully local, open-source Python library that unifies the entire web scraping stack (HTTP fetching, JavaScript rendering, anti-bot bypass, HTML parsing, and LLM-powered data extraction) behind a single import. It is the first library that lets you type a plain-English question and receive a validated, typed Python object, without writing a single CSS selector or XPath expression, and with all processing running privately on your own machine.
Table of Contents
- Why Silkweb?
- Installation
- Quick Start
- Core Concepts
- Fetcher Tiers
- LLM Auto-Extraction
- Natural Language Querying
- SilkQL Query Language
- HTML Parsing & Selectors
- Anti-Bot & Stealth
- LLM Providers & Configuration
- Output Formats
- Caching
- Crawling & Concurrency
- Session Management & Authentication
- Hidden API Discovery
- Watch & Change Detection
- CLI Reference
- Error Handling
- Observability
- Developer Experience
- Architecture Deep Dive
- Configuration Reference
- Recipes Library
- FAQ
1. Why Silkweb?
The problem with today's scraping ecosystem
Building a production web scraper in 2025 means gluing together at least four separate libraries:
httpx (fetch) + Playwright (JS render) + BeautifulSoup (parse) + custom glue (retry/cache/schema)
None of them talk to each other. None of them have LLM integration. None of them bypass modern anti-bot systems out of the box. And none of them let you just ask for what you want.
What Silkweb does differently
| Capability | Traditional approach | Silkweb |
|---|---|---|
| Fetch a page | `requests.get(url)` | `silkweb.fetch(url)`, auto-selecting HTTP, stealth HTTP, or browser |
| Parse data | Write CSS/XPath selectors | Describe what you want in plain English |
| Handle JS | Manually configure Playwright | Automatic, transparent |
| Bypass Cloudflare | Multiple plugins, trial and error | Built-in, auto-escalating tiers |
| LLM extraction | No support | First-class, locally private |
| Output typing | Manual Pydantic boilerplate | Schema inferred or user-provided |
| Cache LLM calls | Not applicable | Synthesized selectors persist; LLM called once per template |
| Run locally | Not applicable | Fully offline with Ollama |
The key insight: extract once with an LLM, scrape millions with CSS
When Silkweb first encounters a page template, it uses an LLM to understand the structure and synthesize robust CSS/XPath selectors. Those selectors are cached. Every subsequent request to pages of the same template uses pure, fast selector-based extraction โ zero LLM cost after the first page. This makes LLM-quality extraction economically viable at scale.
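In practice the pattern looks like this; a minimal sketch using `extract()` and the `__silk_meta__` provenance record described later in this README (URLs are illustrative):

```python
import silkweb
from pydantic import BaseModel

class Story(BaseModel):
    title: str
    score: int

# First page of a template: the LLM pipeline runs once and selectors are cached.
page1 = silkweb.extract("https://news.ycombinator.com", schema=Story,
                        prompt="front page stories")

# Same template, different page: pure selector extraction, no LLM call.
page2 = silkweb.extract("https://news.ycombinator.com/news?p=2", schema=Story,
                        prompt="front page stories")
print(page2[0].__silk_meta__["selector_from_cache"])  # True once learned
```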
2. Installation
Minimal install (no browser, no LLM)
pip install silkweb
With browser support (Playwright)
pip install "silkweb[browser]"
playwright install chromium
With stealth browser support (nodriver + Camoufox)
pip install "silkweb[stealth]"
Full install (all features)
pip install "silkweb[all]"
With specific LLM providers
pip install "silkweb[ollama]" # Ollama local models
pip install "silkweb[openai]" # OpenAI GPT-4o etc.
pip install "silkweb[anthropic]" # Anthropic Claude
pip install "silkweb[llama-cpp]" # llama.cpp embedded inference
Requirements
- Python 3.10 or higher
- For local LLM features: Ollama (recommended) or llama.cpp
- For browser features: Chromium (auto-downloaded by Playwright)
3. Quick Start
One-liner extraction
import silkweb
# Ask a plain English question about any URL
stories = silkweb.ask("https://news.ycombinator.com", "top 10 stories with scores and authors")
print(stories)
# [{'title': '...', 'score': 312, 'author': 'pg'}, ...]
Typed extraction with a Pydantic model
from silkweb import extract
from pydantic import BaseModel
class Story(BaseModel):
    title: str
    url: str
    score: int
    author: str
    comments: int

stories: list[Story] = extract(
    "https://news.ycombinator.com",
    schema=Story,
    prompt="all front page stories"
)
Zero-LLM CSS scraping (traditional mode)
import silkweb
page = silkweb.fetch("https://example.com")
titles = page.css("h1, h2, h3") # CSS selector
links = page.xpath("//a/@href") # XPath
text = page.text # main content via Trafilatura
Async usage
import asyncio
import silkweb
async def main():
    page = await silkweb.async_fetch("https://example.com")
    data = await silkweb.async_ask(page, "product name and price")
    return data

asyncio.run(main())
4. Core Concepts
The Page object
Every fetch returns a SilkPage object, the central data structure in Silkweb:
page = silkweb.fetch("https://example.com")
page.html # raw HTML string
page.text # main content text (Trafilatura-cleaned)
page.markdown # LLM-ready Markdown (ReaderLM-v2)
page.url # final URL (after redirects)
page.status # HTTP status code
page.headers # response headers
page.metadata # Open Graph, JSON-LD, Twitter Cards, author, date
page.fetch_tier # which fetcher tier was used (0-4)
# Selectors
page.css("selector") # returns list of SilkElement
page.xpath("expression") # returns list of SilkElement
page.find("product title") # adaptive selector (text/structure)
# LLM extraction
page.ask("product name and price")
page.extract(schema=Product)
page.query("{ products[] { name price } }") # SilkQL
The SilkElement object
element = page.css("h1")[0]
element.text # inner text
element.html # inner HTML
element.attrs # dict of attributes
element.xpath # XPath address of this element (provenance)
element["href"] # attribute shorthand
element.parent # parent element
element.children # list of child elements
element.siblings # list of sibling elements
Provenance
Every extracted field carries a __silk_meta__ provenance record:
products = page.extract(schema=Product)
print(products[0].__silk_meta__)
# {
#   'url': 'https://example.com/store',
#   'fetched_at': '2025-04-30T12:00:00Z',
#   'fetch_tier': 1,
#   'xpath': '/html/body/div[2]/article[1]/h2',
#   'llm_model': 'ollama/qwen2.5:14b',
#   'selector_from_cache': True,
#   'confidence': 0.97
# }
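Provenance makes results auditable. For instance, a minimal sketch that keeps only high-confidence records using the fields shown above (the 0.9 threshold is an arbitrary choice of ours):

```python
# Keep only records the extractor is confident about; __silk_meta__ and its
# 'confidence' field are documented above, the cut-off value is illustrative.
trusted = [p for p in products if p.__silk_meta__["confidence"] >= 0.9]

# The xpath field records exactly where each value came from.
for p in trusted:
    print(p.__silk_meta__["xpath"])
```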
5. Fetcher Tiers
Silkweb uses a five-tier fetcher architecture. By default, it starts at the cheapest tier and automatically escalates when it detects blocks, JS-only content, or CAPTCHA challenges.
Tier 0: httpx (async HTTP)
Standard HTTP/1.1 and HTTP/2 requests via httpx. Fastest and cheapest. Used for REST APIs, simple static pages, and sitemaps.
page = silkweb.fetch(url, tier=0)
When used automatically: static pages, APIs, URLs that return non-HTML content.
Tier 1: Stealth HTTP (curl_cffi)
HTTP requests with real-browser TLS fingerprints (JA3/JA4), HTTP/2 frame ordering, and header profiles matching Chrome, Firefox, or Safari. Bypasses the majority of WAF-based blocks without launching a browser.
page = silkweb.fetch(url, tier=1)
# Specify browser profile
page = silkweb.fetch(url, tier=1, impersonate="chrome_124")
# Available: chrome_120, chrome_124, firefox_121, safari_17, edge_122
When used automatically: first retry after a 403 on Tier 0; sites known to use basic WAF checks.
Tier 2: Playwright (browser)
Full headless Chromium, Firefox, or WebKit browser via Playwright. Renders JavaScript, executes dynamic content, and supports network interception.
page = silkweb.fetch(url, tier=2)
# Advanced options
page = silkweb.fetch(
    url,
    tier=2,
    browser="firefox",           # "chromium" | "firefox" | "webkit"
    wait_until="networkidle",    # "load" | "domcontentloaded" | "networkidle"
    wait_for="css:.product",     # wait for a specific element
    timeout=30_000,              # ms
    viewport={"width": 1920, "height": 1080},
    intercept_requests=True,     # capture XHR for hidden API discovery
)
When used automatically: JS-rendered pages, SPAs, pages with dynamic content loading.
Tier 3: Stealth Browser
Automatically selects the best stealth approach based on detected fingerprinting technology:
- nodriver: Direct Chrome CDP connection (no WebDriver protocol). Best for Cloudflare Turnstile, DataDome, PerimeterX.
- Camoufox: Patched Firefox binary with C++-level fingerprint spoofing. Best for sites fingerprinting Firefox.
- Patchright: Patched Playwright Chromium. Middle ground.
page = silkweb.fetch(url, tier=3)
# Force a specific stealth engine
page = silkweb.fetch(url, tier=3, stealth_engine="nodriver") # default
page = silkweb.fetch(url, tier=3, stealth_engine="camoufox")
page = silkweb.fetch(url, tier=3, stealth_engine="patchright")
When used automatically: Cloudflare challenge pages, 403s on Tier 1, sites with known aggressive bot detection.
Tier 4: Vision-Agent (LLM-driven browser)
An LLM agent controls a browser autonomously (clicking, scrolling, filling forms) until the target data is reachable. Powered by a vision LLM (default: Claude Sonnet for screenshot analysis).
page = silkweb.fetch(
    url,
    tier=4,
    goal="navigate to the product listing for laptops and extract all items",
    max_steps=10
)
When used automatically: sites that require human-like interaction sequences to reveal data. Only activated explicitly or after manual configuration.
Auto-escalation
# Auto-escalation is on by default
page = silkweb.fetch(url)
# Silkweb tries Tier 0 -> detects Cloudflare -> upgrades to Tier 1 -> success
# Disable auto-escalation
page = silkweb.fetch(url, auto_escalate=False)
# Set maximum tier for auto-escalation
page = silkweb.fetch(url, max_tier=2) # will not use stealth browser
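To see where escalation landed for a given request, inspect the `fetch_tier` attribute documented in Core Concepts:

```python
import silkweb

page = silkweb.fetch("https://example.com")  # starts cheap, escalates if blocked
print(page.fetch_tier)                       # e.g. 1 if stealth HTTP was needed
```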
6. LLM Auto-Extraction
Auto-extraction is Silkweb's flagship feature. Given any page, Silkweb uses a decomposed three-model pipeline to understand the structure, infer a schema, extract structured data, and synthesize CSS/XPath selectors that are cached for future use.
How it works (the three-model pipeline)
HTML Page
    |
    v
[Model 1: ReaderLM-v2]              HTML -> clean Flat JSON / Markdown
    |                               (removes nav, scripts, ads, boilerplate)
    v
[Token Budget Planner]              if too large: chunk with DOM-aware splitting
    |
    v
[Model 2: Qwen 2.5 Coder 14B]       infer schema from cleaned content + user prompt
    |                               (generates a Pydantic model automatically)
    v
[Model 3: LLM Extractor]            extract data matching the schema (JSON-mode)
    |                               (returns JSON with XPath provenance per field)
    v
[Model 2 again: Selector Compiler]  synthesize robust CSS + XPath selectors
    |                               with adaptive fallbacks
    v
[Selector Cache]                    stored keyed by (domain, DOM-skeleton-hash)
    |
    v
[Pydantic Validator]                validate and return typed result
On all future pages matching the same template, only the selector step runs; no LLM calls.
Basic auto-extraction
import silkweb
# Let Silkweb figure everything out
result = silkweb.ask("https://books.toscrape.com", "all books with title, price and rating")
Extraction with your own schema
from pydantic import BaseModel
from typing import Optional
import silkweb
class Book(BaseModel):
    title: str
    price: float
    rating: int  # 1-5
    in_stock: bool
    description: Optional[str] = None

books = silkweb.extract(
    "https://books.toscrape.com",
    schema=Book,
    prompt="all books on the page"
)
# returns list[Book], fully validated
Controlling the extraction pipeline
result = silkweb.extract(
    url,
    schema=Product,
    prompt="all products",
    # Model overrides
    cleaner_model="ollama/reader-lm-v2",
    extraction_model="ollama/qwen2.5:14b",
    selector_model="ollama/qwen2.5-coder:14b",
    # Chunking strategy when the page is large
    chunk_strategy="bm25",          # "bm25" | "semantic" | "dom" | "token"
    max_tokens_per_chunk=8_000,
    # HTML representation fed to the LLM
    representation="flat_json",     # "flat_json" | "slim_html" | "markdown"
    # Cache behaviour
    use_cache=True,
    force_llm=False,                # bypass cache and always call the LLM
    # Provenance
    include_provenance=True,        # attach __silk_meta__ to each result
)
Streaming extraction
For large pages, results stream back as they are extracted:
async for product in silkweb.async_stream_extract(url, schema=Product):
    print(product.name, product.price)
Schema inference without extraction
# Just infer the schema; don't extract data yet
schema = silkweb.infer_schema("https://amazon.com/dp/B0001", hint="product page")
print(schema.model_json_schema())
# { "title": "Product", "properties": { "name": {...}, "price": {...}, ... } }
7. Natural Language Querying
Natural language queries let you describe what you want in plain English. Silkweb compiles the query to a Pydantic schema, extracts the data, and returns typed Python objects.
silkweb.ask(): the simplest interface
import silkweb
# Returns list[dict]; schema inferred
data = silkweb.ask(url, "all product names and their prices in euros")
# Returns a specific type when unambiguous
count = silkweb.ask(url, "total number of results as an integer")  # -> int
# With context
data = silkweb.ask(url, "only the out-of-stock products with their restock dates")
Query modifiers
Natural language modifiers Silkweb understands:
silkweb.ask(url, "top 5 articles by comment count") # limit + sort
silkweb.ask(url, "all links that go to external domains") # filtering
silkweb.ask(url, "every table on the page as separate lists") # multi-entity
silkweb.ask(url, "the main article text and its author") # mixed types
silkweb.ask(url, "prices converted to USD using current rate") # transformation
Conversational / interactive mode
with silkweb.Session("https://example.com/store") as session:
    session.fetch()  # fetch once
    products = session.ask("all products")
    cheap = session.ask("only products under $50")
    rated = session.ask("their star ratings too")  # incremental
    # Refine iteratively without re-fetching
    final = session.ask("format as a table sorted by price")
REPL
Launch an interactive exploration session from the terminal:
silkweb shell https://example.com/store
Silkweb Shell v0.1.0 | https://example.com/store | Tier 1
Type a query, a SilkQL expression, or Python. Tab-complete available.
silk> ask("all product names and prices")
[{'name': 'Widget A', 'price': 29.99}, ...]
silk> ask("only the ones in stock")
[{'name': 'Widget A', 'price': 29.99}, ...]
silk> page.css("h1").text
'Best Widgets Online'
silk> page.metadata
{'title': ..., 'description': ..., 'author': ..., 'date': ...}
8. SilkQL Query Language
SilkQL is Silkweb's open-source structured query language for the web. Inspired by AgentQL, it is locally compilable, type-safe, and reusable across websites.
Syntax overview
{
  field_name(type_coercion, modifier)
  collection[] {
    field
    nested_field {
      sub_field
    }
  }
}
Basic example
import silkweb
query = """
{
products[] {
name
price(currency)
rating(float)
reviews_count(int)
in_stock(bool)
image_url(url)
product_url(url)
}
total_results(int)
pagination {
current_page(int)
next_page_url(url)
}
}
"""
result = silkweb.query(url, query)
Type coercions
SilkQL automatically coerces extracted strings to typed Python values:
| Coercion | Input example | Python type | Output |
|---|---|---|---|
| `(int)` | `"1,234"` | `int` | `1234` |
| `(float)` | `"€29.99"` | `float` | `29.99` |
| `(currency)` | `"$1,234.56"` | `float` | `1234.56` |
| `(bool)` | `"In Stock"` | `bool` | `True` |
| `(url)` | `"/products/1"` | `str` | `"https://example.com/products/1"` |
| `(iso_date)` | `"Apr 30, 2025"` | `datetime` | `datetime(2025, 4, 30)` |
| `(list)` | `"Red, Blue, Green"` | `list[str]` | `["Red", "Blue", "Green"]` |
| `(json)` | `'{"key": 1}'` | `dict` | `{"key": 1}` |
Field modifiers
name(optional)            # field may not exist; returns None instead of an error
price(currency, optional)
tags(list, min_count=1)   # at least 1 item required
id(int, unique)           # deduplicate if the same value is found multiple times
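Combined in a full query, the modifiers read like this (the store URL is illustrative):

```python
import silkweb

query = """
{
  products[] {
    id(int, unique)
    name
    price(currency, optional)
    tags(list, min_count=1)
  }
}
"""
result = silkweb.query("https://example.com/store", query)
```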
Automatic pagination
When a query includes a next_page_url(url) field in a pagination block, Silkweb automatically follows it and merges results:
result = silkweb.query(
    url,
    query,
    follow_pagination=True,
    max_pages=20
)
# result.products      -> merged across all pages
# result.pages_scraped -> number of pages traversed
Compiling SilkQL to Pydantic
from silkweb.silkql import compile_query
PydanticModel = compile_query(query)
# PydanticModel is now a usable Pydantic BaseModel subclass
SilkQL in Python (code API)
from silkweb import Q
result = silkweb.query(url, Q.root(
    Q.list("products",
        Q.field("name"),
        Q.field("price", type="currency"),
        Q.field("rating", type="float", optional=True),
    ),
    Q.field("next_page", type="url", optional=True)
))
9. HTML Parsing & Selectors
Silkweb provides a rich selector API on top of lxml and its own adaptive selector engine.
CSS selectors
page = silkweb.fetch(url)
# Returns list[SilkElement]
items = page.css(".product-card")
# Chained
prices = page.css(".product-card").css(".price")
# First match
title = page.css_first("h1")
# Text shorthand
title_text = page.css_first("h1").text
XPath selectors
links = page.xpath("//a[@class='product-link']/@href")
prices = page.xpath("//span[contains(@class, 'price')]/text()")
Adaptive selectors
Adaptive selectors generate multiple fallback strategies and return the first that matches, making scrapers resilient to CSS class renames:
# Tries: class match -> text match -> structural position -> attribute similarity
items = page.find(".product-title", adaptive=True)
# Explicit fallback chain
items = page.find(
    primary=".product-card h2",
    fallbacks=[
        "//div[@data-type='product']//h2",
        ".item-name",
        "//h2[contains(@class, 'title')]",
    ]
)
Built-in smart extractors
# Extract all tables as a list of DataFrames
tables = page.tables()
# Extract all JSON-LD structured data
json_ld = page.json_ld() # list[dict]
# Extract Open Graph / Twitter Card metadata
meta = page.metadata # {'title': ..., 'description': ..., 'image': ..., 'author': ...}
# Extract all links
links = page.links() # all links
external = page.links(external=True) # external only
# Extract main article text (Trafilatura)
article = page.article() # {'title', 'text', 'author', 'date', 'language'}
# Extract hydration data (Next.js / Nuxt / Remix / SvelteKit)
data = page.hydration_data() # parsed JSON from __NEXT_DATA__, __NUXT__, etc.
# often contains the complete page data as JSON
Repeated pattern detection (no LLM)
# Automatically detect and extract repeating record structures
records = page.detect_records()
# [{'title': '...', 'price': '...', 'image': '...'}, ...]
10. Anti-Bot & Stealth
Silkweb bundles the most comprehensive open-source anti-bot stack available, all configured automatically.
TLS & HTTP fingerprinting
Via curl_cffi, Silkweb mimics the exact TLS handshake, cipher suite order, HTTP/2 settings frames, and header order of real browsers.
silkweb.fetch(url, impersonate="chrome_124")
silkweb.fetch(url, impersonate="firefox_121")
silkweb.fetch(url, impersonate="safari_17")
silkweb.fetch(url, impersonate="edge_122")
Proxy management
# Single proxy
silkweb.fetch(url, proxy="http://user:pass@proxy.example.com:8080")
# Proxy pool with automatic rotation
silkweb.configure(proxies=[
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "socks5://user:pass@proxy3.example.com:1080",
])
# Rotation strategy
silkweb.configure(
    proxies=my_proxy_list,
    proxy_rotation="per_request",  # "per_request" | "per_domain" | "on_failure" | "sticky"
    sticky_session_ttl=300,        # seconds (for sticky mode)
)
Rate limiting
silkweb.configure(
    rate_limit={
        "global": 10,                 # max 10 requests/second globally
        "per_domain": 2,              # max 2 requests/second per domain
        "respect_crawl_delay": True,  # honor robots.txt Crawl-delay
        "jitter": 0.3,                # add up to 30% random delay
    }
)
Behavioral stealth
silkweb.configure(
    stealth={
        "human_mouse": True,     # Bezier-curve mouse movements
        "human_typing": True,    # randomized typing speed/delays
        "random_scroll": True,   # natural scroll patterns
        "viewport_noise": True,  # slight viewport randomization
        "timezone": "America/New_York",
        "locale": "en-US",
        "geolocation": {"lat": 40.7, "lng": -74.0},
    }
)
CAPTCHA solving
silkweb.configure(
    captcha_solver="local",  # "local" | "2captcha" | "anticaptcha" | "capsolver"
    captcha_api_key="...",   # for cloud solvers
)
The "local" solver handles:
- Cloudflare Turnstile via SeleniumBase UC Mode strategy
- reCAPTCHA v2 via audio challenge solver
- hCaptcha via WASM-based solver
Robots.txt compliance
# Default: respect robots.txt
silkweb.fetch(url)
# Override (use responsibly and legally)
silkweb.fetch(url, respect_robots=False)
# Just check without fetching
allowed = silkweb.robots_allowed(url, user_agent="SilkwebBot/1.0")
11. LLM Providers & Configuration
Configuring providers
import silkweb
silkweb.configure(
    # Assign models per task
    cleaner_model="ollama/reader-lm-v2",        # HTML -> Markdown / Flat JSON
    schema_model="ollama/qwen2.5-coder:14b",    # schema inference + selector synthesis
    extraction_model="ollama/qwen2.5:14b",      # data extraction
    embedding_model="ollama/nomic-embed-text",  # BM25/semantic chunking
    vision_model="anthropic/claude-3-5-sonnet-20241022",  # vision fallback only
)
Supported providers and model URI format
"ollama/<model>" โ Ollama at localhost:11434
"openai/<model>" โ OpenAI API
"anthropic/<model>" โ Anthropic API
"google/<model>" โ Google Gemini API
"groq/<model>" โ Groq API
"mistral/<model>" โ Mistral API
"together/<model>" โ Together AI
"bedrock/<region>/<model>" โ AWS Bedrock
"azure/<deployment>" โ Azure OpenAI
"vertex/<project>/<model>" โ Google Vertex AI
"llamacpp/<path/to/model.gguf>" โ llama.cpp embedded (no server needed)
"vllm/<model>" โ vLLM server
"lmstudio/<model>" โ LM Studio (OpenAI-compatible)
"mlx/<model>" โ Apple MLX (Apple Silicon)
"openai_compatible/<base_url>/<model>" โ Any OpenAI-compatible endpoint
Recommended local models by use case
| Task | Recommended model | VRAM | Notes |
|---|---|---|---|
| HTML cleaning | `reader-lm-v2` | 2 GB | Jina specialist, 512K context |
| Schema synthesis | `qwen2.5-coder:14b` | 8 GB | Best code/structure understanding |
| Data extraction | `qwen2.5:14b` | 8 GB | Best overall for structured output |
| Embeddings | `nomic-embed-text` | 0.5 GB | Fast, high quality |
| Vision fallback | `llava:13b` or cloud | 8 GB | For screenshot-based extraction |
| Reasoning | `deepseek-r1:14b` | 8 GB | Complex multi-step extractions |
Bundled starter mode
On first import, Silkweb auto-detects Ollama and available models:
import silkweb
# Auto-configure from detected local models
silkweb.auto_configure()
# Or pull recommended models automatically
silkweb.setup_recommended_models()
# Downloads: reader-lm-v2, qwen2.5-coder:14b, nomic-embed-text via Ollama
API keys
# Environment variables (recommended)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...
# Or in code
silkweb.configure(
    api_keys={
        "openai": "sk-...",
        "anthropic": "sk-ant-...",
    }
)
12. Output Formats
Python dict / list (default)
data = silkweb.ask(url, "products")
# [{'name': '...', 'price': 29.99}, ...]
Pydantic models
products: list[Product] = silkweb.extract(url, schema=Product)
products[0].model_dump()
products[0].model_dump_json()
Pandas DataFrame
df = silkweb.to_dataframe(url, "all products")
# Auto-detected if pandas is installed: silkweb.ask() returns a DataFrame
import pandas as pd
df = silkweb.ask(url, "all products") # returns DataFrame when pandas is active
Polars DataFrame
import polars as pl
df = silkweb.ask(url, "all products") # returns Polars DataFrame when polars is active
# Explicit
df = silkweb.to_polars(url, "all products")
JSON / JSONL
silkweb.to_json(url, "products", output="products.json")
silkweb.to_jsonl(url, "products", output="products.jsonl")
silkweb.to_json(url, "products", output="products.json.gz") # auto-gzip
CSV
silkweb.to_csv(url, "products", output="products.csv")
Parquet
silkweb.to_parquet(url, "products", output="products.parquet")
DuckDB / SQLite
silkweb.to_duckdb(url, "products", db="store.duckdb", table="products")
silkweb.to_sqlite(url, "products", db="store.sqlite", table="products")
Markdown (for RAG)
md = silkweb.to_markdown(url) # full page as Markdown
silkweb.to_markdown(url, output="page.md") # save to file
HuggingFace Dataset
dataset = silkweb.to_dataset(url, "all articles")
dataset.push_to_hub("your-org/dataset-name")
13. Caching
Three-layer cache
Layer 1: HTTP cache. Stores raw HTTP responses with conditional GET support (ETag / Last-Modified), preventing redundant network requests.
Layer 2: rendered page cache. Stores post-JavaScript DOM snapshots, preventing redundant browser launches.
Layer 3: selector cache. Stores LLM-synthesized CSS/XPath selectors keyed by (domain, DOM-skeleton-hash). This is the most important cache: it means the LLM is called only once per page template.
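For intuition on that cache key, here is a minimal sketch of what a DOM-skeleton hash could look like; an assumption about the idea, not Silkweb's actual implementation:

```python
import hashlib
from lxml import html

def dom_skeleton_hash(raw_html: str) -> str:
    # Hash only the tag structure, ignoring text and attribute values, so two
    # pages built from the same template hash identically even when content differs.
    tree = html.fromstring(raw_html)
    skeleton = "/".join(el.tag for el in tree.iter() if isinstance(el.tag, str))
    return hashlib.sha256(skeleton.encode()).hexdigest()[:16]
```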
Cache configuration
silkweb.configure(
    cache={
        "enabled": True,
        "backend": "sqlite",                    # "sqlite" | "redis" | "memory"
        "path": "~/.silkweb/cache",             # for sqlite
        "redis_url": "redis://localhost:6379",  # for the redis backend
        "http_ttl": 3600,      # HTTP cache TTL in seconds (1 hour)
        "page_ttl": 1800,      # rendered page cache TTL (30 min)
        "selector_ttl": None,  # selector cache TTL (None = forever)
        "max_size_gb": 5,      # max cache size
    }
)
Managing the cache
# Inspect cache stats
stats = silkweb.cache.stats()
# {'http_entries': 1234, 'page_entries': 89, 'selector_entries': 42, 'size_mb': 234}
# Clear specific cache layers
silkweb.cache.clear(layer="http")
silkweb.cache.clear(layer="selectors")
silkweb.cache.clear() # clear all
# Clear selectors for a specific domain (force LLM re-learning)
silkweb.cache.clear_domain("amazon.com", layer="selectors")
# Force bypass cache for a single request
page = silkweb.fetch(url, no_cache=True)
data = silkweb.ask(url, "products", force_llm=True)
14. Crawling & Concurrency
Simple multi-URL fetch
pages = silkweb.fetch_all([url1, url2, url3], concurrency=10)
Full crawl
results = silkweb.crawl(
    start_url="https://example.com",
    # What to follow
    follow_links=True,
    allowed_domains=["example.com"],
    url_pattern=r"/products/\d+",  # regex filter on URLs to follow
    # Extraction
    schema=Product,
    prompt="product data",
    # Limits
    max_pages=1000,
    max_depth=3,
    concurrency=20,
    per_domain_concurrency=5,
    # Callbacks
    on_page=lambda page: print(f"scraped {page.url}"),
    on_item=lambda item: db.insert(item),
    on_error=lambda url, err: logger.error(f"failed {url}: {err}"),
    # Dedup
    dedup=True,              # skip already-visited URLs
    dedup_backend="sqlite",  # or "redis" for distributed
    # Output
    output="products.jsonl",
)
Sitemap crawl
# Crawl all URLs from a sitemap
results = silkweb.crawl_sitemap(
    "https://example.com/sitemap.xml",
    schema=Article,
    prompt="article content",
    concurrency=30,
)
Feed crawl
# Crawl an RSS/Atom feed
items = silkweb.crawl_feed("https://news.ycombinator.com/rss")
Async streaming crawl
async for item in silkweb.async_crawl(start_url, schema=Product):
    await db.insert(item)
15. Session Management & Authentication
Basic session persistence
# Create a named session (cookies, storage, headers persist)
session = silkweb.Session("my_session")
# Log in once
session.fetch("https://example.com/login")
session.fill("#username", "user@example.com")
session.fill("#password", "password123")
session.click("#login-btn")
session.wait_for(".dashboard")
# Save session to disk
session.save() # saves to ~/.silkweb/sessions/my_session.silkweb
# Later, resume without logging in again
session = silkweb.Session.load("my_session")
page = session.fetch("https://example.com/protected-data")
Action recorder
# Record a browser session interactively
silkweb.record("my_login_flow")
# Opens a browser โ you log in manually โ recording is saved
# Replay the recording
silkweb.replay("my_login_flow")
page = silkweb.fetch("https://example.com/data", session="my_login_flow")
OAuth / SSO hand-off
# Opens a real browser for OAuth flow, captures tokens, then switches to headless
session = silkweb.oauth_session(
    url="https://app.example.com",
    session_name="example_oauth"
)
16. Hidden API Discovery
One of Silkweb's most powerful features: instead of scraping the DOM of a JavaScript-heavy page, discover the underlying JSON API it calls and use that directly.
api_info = silkweb.discover_api("https://example.com/store")
print(api_info)
# {
#   'endpoints': [
#     {
#       'url': 'https://api.example.com/v2/products?page=1&limit=24',
#       'method': 'GET',
#       'headers': {'x-api-token': '...'},
#       'response_schema': {'items': [...], 'total': 1234},
#       'pagination': 'cursor',
#     }
#   ],
#   'generated_scraper': '...',  # Python code using httpx directly
# }

# Generate and save a pure-httpx scraper (no browser needed)
silkweb.discover_api(
    "https://example.com/store",
    output="example_api_scraper.py"
)
The generated scraper uses direct HTTP calls and is typically 10–100× faster than DOM scraping.
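For a sense of what the generated code might contain, here is a hypothetical sketch of a cursor-paginated httpx loop; the endpoint, header, and field names are illustrative, not real discover_api() output:

```python
import httpx

def scrape_products() -> list[dict]:
    # Hypothetical: walk the discovered endpoint following cursor pagination.
    items, cursor = [], None
    with httpx.Client(headers={"x-api-token": "..."}) as client:
        while True:
            params = {"limit": 24}
            if cursor:
                params["cursor"] = cursor
            resp = client.get("https://api.example.com/v2/products", params=params)
            resp.raise_for_status()
            payload = resp.json()
            items.extend(payload["items"])
            cursor = payload.get("next_cursor")
            if not cursor:
                return items
```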
17. Watch & Change Detection
Monitor pages for changes and extract diffs automatically.
Basic watch
# Watch a page and print changes
silkweb.watch(
    "https://example.com/pricing",
    schema=PricingPlan,
    interval=3600,  # check every hour
    on_change=lambda diff: print(diff),
)
Diff structure
{
  'url': 'https://example.com/pricing',
  'checked_at': '2025-04-30T12:00:00Z',
  'previous_checked_at': '2025-04-30T11:00:00Z',
  'changed': True,
  'changes': [
    {
      'field': 'price',
      'record_id': 'plan_pro',
      'old_value': 49.0,
      'new_value': 59.0,
      'change_type': 'modified',
    },
    {
      'field': 'name',
      'change_type': 'added',
      'new_value': 'Enterprise Plus',
    }
  ]
}
Watch with webhook / callback
silkweb.watch(
    url,
    schema=Product,
    interval=1800,
    on_change=lambda diff: requests.post("https://myapp.com/webhook", json=diff),
    on_error=lambda err: logger.error(err),
    notify_on_no_change=False,  # silent when nothing changed
)
Running multiple watches
# Background watcher (non-blocking)
watcher = silkweb.Watcher()
watcher.add("https://site1.com/products", schema=Product, interval=3600)
watcher.add("https://site2.com/prices", schema=Price, interval=1800)
watcher.start() # runs in background thread
# ...
watcher.stop()
18. CLI Reference
# Fetch a URL and print cleaned text
silkweb fetch https://example.com
# Fetch with specific tier
silkweb fetch https://example.com --tier 1
# Ask a natural language question
silkweb ask https://example.com "all product names and prices"
# Extract with a schema file
silkweb extract https://example.com --schema product.py --output products.json
# Open interactive shell
silkweb shell https://example.com
# Crawl a site
silkweb crawl https://example.com --url-pattern "/products/*" --schema product.py --output products.jsonl
# Discover hidden APIs
silkweb discover-api https://example.com --output scraper.py
# Watch a page for changes
silkweb watch https://example.com "prices" --interval 3600
# Manage local models
silkweb models list
silkweb models pull qwen2.5:14b
silkweb models recommend # shows recommended models for your hardware
# Cache management
silkweb cache stats
silkweb cache clear --layer selectors
silkweb cache clear --domain amazon.com
# Validate a SilkQL query
silkweb silkql validate query.silk
# Browse the recipe library
silkweb recipes list
silkweb recipes show hacker-news
silkweb recipes run hacker-news --output hn.json
19. Error Handling
Exception hierarchy
SilkwebError
├── SilkwebFetchError
│   ├── SilkwebHTTPError       # non-2xx response
│   ├── SilkwebTimeoutError    # request timed out
│   ├── SilkwebBlockedError    # bot detection confirmed
│   └── SilkwebRenderError     # JS rendering failed
├── SilkwebExtractionError
│   ├── SilkwebSchemaError     # Pydantic validation failed
│   ├── SilkwebLLMError        # LLM call failed or returned invalid JSON
│   └── SilkwebSelectorError   # no elements matched selector
├── SilkwebCacheError
└── SilkwebConfigError
Error context
Every exception carries structured context:
try:
    data = silkweb.ask(url, "products")
except silkweb.SilkwebBlockedError as e:
    print(e.url)             # URL that was blocked
    print(e.status_code)     # 403
    print(e.tier_tried)      # which tier failed
    print(e.challenge_type)  # "cloudflare_turnstile"
    print(e.html_snippet)    # first 500 chars of response
Retry configuration
silkweb.configure(
    retry={
        "max_attempts": 5,
        "backoff": "exponential",  # "exponential" | "linear" | "constant"
        "backoff_base": 2,         # seconds
        "backoff_max": 60,         # max seconds between retries
        "jitter": True,
        "retry_on": [429, 503, 502, 520],  # HTTP codes to retry
        "auto_escalate_on_block": True,    # upgrade tier on BlockedError
    }
)
Self-healing selectors
silkweb.configure(
    self_heal={
        "enabled": True,
        "threshold": 0,         # re-trigger the LLM if 0 elements matched
        "validation_fn": None,  # custom Pydantic validator to trigger re-heal
        "max_heal_attempts": 3,
    }
)
20. Observability
Structured logging
import silkweb
import logging
silkweb.configure(
    log_level="INFO",   # "DEBUG" | "INFO" | "WARNING" | "ERROR"
    log_format="json",  # "json" | "text"
    log_file="silkweb.log",
)
Log output (JSON format):
{
  "timestamp": "2025-04-30T12:00:00Z",
  "event": "fetch_completed",
  "url": "https://example.com",
  "tier": 1,
  "status_code": 200,
  "duration_ms": 234,
  "cache_hit": false,
  "llm_calls": 0
}
OpenTelemetry traces
silkweb.configure(
    telemetry={
        "enabled": True,
        "exporter": "otlp",  # "otlp" | "jaeger" | "zipkin" | "console"
        "endpoint": "http://localhost:4317",
        "service_name": "my-scraper",
    }
)
Each scraping operation generates spans for: HTTP fetch -> JS render -> LLM clean -> LLM extract -> cache write -> validation.
Prometheus metrics
# Expose metrics endpoint
silkweb.configure(metrics_port=9090)
Available metrics:
- silkweb_requests_total{tier, status, domain}
- silkweb_request_duration_seconds{tier, domain}
- silkweb_llm_calls_total{model, task}
- silkweb_llm_duration_seconds{model, task}
- silkweb_cache_hits_total{layer}
- silkweb_blocks_total{domain, challenge_type}
Replay / debugging
# Save a session for debugging
silkweb.configure(replay_dir="./silkweb_replays")
# Replay a session deterministically (uses saved HTML, no network)
silkweb.replay("./silkweb_replays/session_2025-04-30.silkweb")
21. Developer Experience
VS Code Extension
Install "Silkweb" from the VS Code Marketplace for:
- SilkQL syntax highlighting and autocompletion
- Inline schema preview from a URL
- One-click "Scrape this URL" command
- Selector cache browser sidebar
Browser DevTools Extension
Install "Silkweb Inspector" for Chrome/Firefox:
- Point and click on page elements
- Generates SilkQL query automatically
- Shows cached selectors for the current domain
- Live extraction preview
Jupyter Notebook support
import silkweb
# Rich HTML rendering in notebooks
page = silkweb.fetch(url)
silkweb.display(page) # renders page screenshot + metadata
products = silkweb.ask(url, "products")
silkweb.display(products) # renders as interactive table
Testing
# Mock mode: no real HTTP requests
with silkweb.mock_mode():
    silkweb.mock.register("https://example.com", html="<h1>Test</h1>")
    page = silkweb.fetch("https://example.com")
    assert page.css_first("h1").text == "Test"

# Replay mode: use recorded sessions
with silkweb.replay_mode("./fixtures/example_session.silkweb"):
    data = silkweb.ask("https://example.com", "products")
22. Architecture Deep Dive
Module layout
silkweb/
├── __init__.py           # public API surface
├── fetch/
│   ├── tiers/
│   │   ├── httpx.py      # Tier 0
│   │   ├── curl_cffi.py  # Tier 1
│   │   ├── playwright.py # Tier 2
│   │   ├── stealth.py    # Tier 3 (nodriver / camoufox)
│   │   └── agent.py      # Tier 4 (LLM vision agent)
│   ├── orchestrator.py   # auto-escalation logic
│   └── fingerprint.py    # TLS/HTTP profile management
├── parse/
│   ├── page.py           # SilkPage, SilkElement
│   ├── selectors.py      # CSS + XPath + adaptive
│   ├── content.py        # Trafilatura, article extraction
│   ├── hydration.py      # Next.js / Nuxt / Remix JSON
│   └── patterns.py       # repeated-record detection
├── llm/
│   ├── providers/        # OpenAI, Anthropic, Ollama, llama.cpp, etc.
│   ├── pipelines/
│   │   ├── clean.py      # ReaderLM-v2 / Trafilatura
│   │   ├── schema.py     # schema inference
│   │   ├── extract.py    # data extraction
│   │   ├── selectors.py  # selector synthesis
│   │   └── heal.py       # self-healing
│   ├── chunking/         # token, BM25, semantic, DOM-aware
│   ├── representations/  # flat_json, slim_html, markdown
│   ├── constrained.py    # Outlines / lm-format-enforcer
│   └── prompts/          # versioned prompt templates
├── silkql/
│   ├── parser.py         # SilkQL grammar and parser
│   ├── compiler.py       # SilkQL -> Pydantic model
│   └── executor.py       # SilkQL -> extraction pipeline
├── cache/
│   ├── http.py           # hishel-based HTTP cache
│   ├── page.py           # rendered-page cache
│   └── selectors.py      # selector + schema cache
├── crawl/
│   ├── crawler.py        # full-site crawler
│   ├── queue.py          # async request queue
│   └── dedup.py          # URL deduplication
├── stealth/
│   ├── proxy.py          # proxy pool management
│   ├── rate_limit.py     # token-bucket rate limiter
│   ├── captcha.py        # CAPTCHA solvers
│   └── behavior.py       # mouse / scroll / typing
├── session/
│   ├── session.py        # session persistence
│   └── recorder.py       # action recorder / replayer
├── watch.py              # page change detection
├── discover.py           # hidden API discovery
├── output/               # pandas, polars, json, csv, parquet, duckdb
├── config.py             # global configuration
├── exceptions.py         # typed exception hierarchy
├── observability/        # logging, OTEL, Prometheus
└── cli/                  # Typer CLI commands
Dependency philosophy
Silkweb has a zero-LangChain, zero-LlamaIndex policy. All LLM provider integrations are direct SDK calls through a thin 300-line LLMProvider abstraction. This keeps the install small, avoids API breakage, and makes Silkweb's transitive dependency tree manageable.
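To illustrate the shape of such an abstraction, a minimal sketch (our assumption, not Silkweb's actual interface) with one Ollama-backed implementation:

```python
from typing import Protocol
import httpx

class LLMProvider(Protocol):
    def complete(self, prompt: str, *, json_mode: bool = False) -> str: ...

class OllamaProvider:
    """Sketch of a direct-SDK-style provider: one HTTP call, no framework."""

    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        self.model, self.base_url = model, base_url

    def complete(self, prompt: str, *, json_mode: bool = False) -> str:
        body = {"model": self.model, "prompt": prompt, "stream": False}
        if json_mode:
            body["format"] = "json"  # Ollama's built-in JSON mode
        resp = httpx.post(f"{self.base_url}/api/generate", json=body, timeout=120)
        resp.raise_for_status()
        return resp.json()["response"]
```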
Core dependencies
| Package | Purpose |
|---|---|
| `httpx` | Async HTTP client |
| `curl_cffi` | Browser-fingerprint HTTP |
| `playwright` | Browser automation |
| `lxml` | HTML/XML parser (CSS via lxml.cssselect, XPath via lxml) |
| `parsel` | Scrapy-style CSS/XPath |
| `trafilatura` | Article/content extraction |
| `pydantic` (v2) | Schema validation |
| `anyio` | Async backend (asyncio + trio) |
| `hishel` | HTTP caching |
| `diskcache` | Disk-based cache (present as a dependency; not currently used as a cache backend implementation) |
| `typer` + `rich` | CLI |
| `structlog` | Structured logging |
| `outlines` | Constrained LLM decoding |
Optional dependencies (extras)
| Extra | Packages | Purpose |
|---|---|---|
| `browser` | playwright, playwright-stealth | Full browser support |
| `stealth` | nodriver, camoufox, patchright | Stealth browsers |
| `ollama` | ollama | Local Ollama models |
| `openai` | openai | OpenAI API |
| `anthropic` | anthropic | Anthropic Claude |
| `llama-cpp` | llama-cpp-python | Embedded llama.cpp |
| `vllm` | vllm | vLLM server |
| `pandas` | pandas | DataFrame output |
| `polars` | polars | Polars DataFrame output |
| `duckdb` | duckdb | DuckDB output |
| `otel` | opentelemetry-* | OpenTelemetry tracing |
23. Configuration Reference
Full configuration with all defaults:
import silkweb
silkweb.configure(
    # === LLM Models ===
    cleaner_model="ollama/reader-lm-v2",
    schema_model="ollama/qwen2.5-coder:14b",
    extraction_model="ollama/qwen2.5:14b",
    embedding_model="ollama/nomic-embed-text",
    vision_model=None,              # None = disabled unless needed
    # === Fetcher ===
    default_tier="auto",            # "auto" | 0 | 1 | 2 | 3 | 4
    max_tier=3,                     # max tier for auto-escalation
    auto_escalate=True,
    timeout=30_000,                 # ms
    user_agent="Mozilla/5.0 ...",   # default browser UA
    impersonate="chrome_124",       # default curl_cffi profile
    headers={},                     # default extra headers
    # === Extraction ===
    chunk_strategy="bm25",          # "bm25" | "semantic" | "dom" | "token"
    max_tokens_per_chunk=8_000,
    representation="flat_json",     # "flat_json" | "slim_html" | "markdown"
    include_provenance=True,
    force_llm=False,
    hydration_first=True,           # try Next.js/Nuxt JSON before DOM
    # === Cache ===
    cache_enabled=True,
    cache_backend="sqlite",
    cache_path="~/.silkweb/cache",
    http_cache_ttl=3600,
    page_cache_ttl=1800,
    selector_cache_ttl=None,
    # === Proxy & Rate Limiting ===
    proxies=[],
    proxy_rotation="on_failure",
    rate_limit_global=None,
    rate_limit_per_domain=2,
    respect_robots=True,
    # === Retry ===
    max_retries=3,
    retry_backoff="exponential",
    retry_backoff_base=2,
    # === Stealth ===
    human_mouse=False,
    human_typing=False,
    captcha_solver=None,
    # === Output ===
    default_output_format="python",  # "python" | "json" | "csv" | "parquet" | "df"
    auto_detect_dataframe=True,      # return DataFrame if pandas/polars imported
    # === Observability ===
    log_level="WARNING",
    log_format="text",
    metrics_port=None,
    telemetry_enabled=False,
)
24. Recipes Library
Silkweb ships with community-contributed, version-pinned schemas and configurations for common scraping targets. Recipes are fully offline and use only the cached selector system.
silkweb recipes list
| Recipe | Description |
|---|---|
| `hacker-news` | Front page stories, scores, authors, comments |
| `github-repo` | Stars, forks, topics, README content |
| `github-issues` | Issue list with labels, assignees, timestamps |
| `amazon-product` | Title, ASIN, price, rating, reviews, variants |
| `amazon-search` | Search results with prices and ratings |
| `google-serp` | Organic results, featured snippets, PAA |
| `reddit-posts` | Post list with scores, authors, flairs |
| `linkedin-profile` | Public profile: headline, experience, education |
| `twitter-profile` | Bio, followers, following, pinned tweet |
| `youtube-video` | Title, views, description, channel, upload date |
| `wikipedia` | Article text, infobox, categories, references |
| `imdb-movie` | Title, rating, cast, plot, genres |
| `arxiv-paper` | Title, authors, abstract, categories, PDF link |
| `product-listing` | Generic e-commerce product listing (any site) |
| `news-article` | Generic article extraction (any news site) |
# Use a recipe
import silkweb
stories = silkweb.recipes.run(
    "hacker-news",
    url="https://news.ycombinator.com",
)
# Preview a recipe
print(silkweb.recipes.show("amazon-product"))
# Contribute a recipe
silkweb.recipes.create(
    name="my-recipe",
    url="https://example.com",
    schema=MySchema,
    description="Extracts products from example.com",
)
25. FAQ
Q: Does Silkweb work without any LLM? Yes. All LLM features are opt-in. Silkweb works as a fast, stealth-capable scraping library without any LLM configured.
Q: Is my data sent to a cloud LLM? Only if you configure a cloud provider. The default configuration uses Ollama on localhost. All processing is private and local by default.
Q: How does the selector cache work? The first time Silkweb extracts data from a URL template, it uses the LLM pipeline and stores the resulting selectors in a local SQLite database. All future requests to pages with the same DOM structure use only CSS/XPath, with no LLM call. The cache is keyed by a hash of the DOM skeleton (tag structure without content), so it is resilient to content changes.
Q: What happens when a cached selector stops working? Self-healing is enabled by default. If a cached selector returns 0 results or fails Pydantic validation, Silkweb automatically re-invokes the LLM to synthesize new selectors, then updates the cache.
Q: How large can pages be? Silkweb handles large pages through its token budget planner. ReaderLM-v2 typically reduces a 200K-token raw HTML page to 5–20K tokens. If still too large for the configured model context, DOM-aware chunking splits by semantic boundaries and results are merged.
Q: Can I use Silkweb for authenticated scraping?
Yes. Use silkweb.Session for session persistence, silkweb.record() for recording login flows, and the OAuth hand-off for SSO. Sessions are stored as portable .silkweb files.
Q: Is Silkweb legal to use?
Silkweb is a tool. Whether scraping a particular website is legal depends on the website's Terms of Service, local laws (CFAA, GDPR, etc.), and the nature of the data. By default, Silkweb respects robots.txt. Always check the legal context for your specific use case.
Q: How does Silkweb compare to Scrapy? Scrapy is a mature, powerful framework optimized for large-scale crawls with a complex component model. Silkweb prioritizes developer ergonomics and LLM-first extraction. They serve different needs: for very large-scale production crawls (millions of pages/day), Scrapy's ecosystem is unmatched; for rapid development, LLM extraction, and local-first use, Silkweb is the better fit.
Q: What is SilkQL? SilkQL is Silkweb's open-source structured query language for describing what to extract from a web page. It is inspired by AgentQL but is fully local, open-source, and compiles to Pydantic models. See Section 8.
Q: Can I contribute a recipe?
Yes. Recipes are YAML files in the silkweb-recipes repository. Submit a pull request with your schema, a sample URL, and expected output.
License
MIT License. Copyright © 2025 Silkweb Contributors.
Contributing
Contributions are welcome. See CONTRIBUTING.md for guidelines.
Acknowledgements
Silkweb builds on the shoulders of giants: Scrapy, Playwright, nodriver, Camoufox, curl_cffi, Trafilatura, lxml, Pydantic, Crawl4AI, ScrapeGraphAI, AgentQL, and the open-weights model community (Qwen, Meta, Jina AI).