# Scrapurrr
Agentic web scraper with schema-driven extraction.

## What is Scrapurrr?
Define a Pydantic schema, point it at a URL, and get back typed data. Scrapurrr handles rendering, anti-detection, pagination, and extraction automatically.
Core features:
- Schema-driven extraction. Define what you want, get a typed object back.
- Interactive chat CLI. Talk to scrapurrr in natural language, navigate pages, extract elements.
- Element inspection. Get CSS selectors, XPath, full XPath, JS path, outerHTML, and styles for any element.
- Agent mode. Autonomous navigation, clicking, scrolling, and form-filling across pages.
- 100+ LLM providers. OpenAI, Anthropic, Groq, Ollama, or any LiteLLM-compatible endpoint.
- Smart fetching. HTTP-first with automatic browser fallback for JS-heavy pages.
- Stealth built-in. Fingerprint masking, human-like behavior, proxy rotation.
- Batch and pagination. Concurrent multi-URL extraction with auto-pagination.
- MCP server. Expose scraping as tools for AI assistants.

## Install

```bash
pip install scrapurrr
playwright install chromium
```

## Quick Start

```python
import asyncio

from pydantic import BaseModel

from scrapurrr import Scrapurrr


class Article(BaseModel):
    title: str
    author: str
    published: str


async def main():
    async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
        article = await scraper.extract("https://example.com/article", Article)
        print(article.title)


asyncio.run(main())
```
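
The result is an ordinary Pydantic model, so standard serialization applies. A minimal follow-up sketch (assumes Pydantic v2; on v1 use `.json()` instead of `.model_dump_json()`):

```python
# Inside main(), after extraction: persist the typed result as JSON.
with open("article.json", "w") as f:
    f.write(article.model_dump_json(indent=2))
```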

## Interactive Chat

Start an interactive scraping session from the terminal:

```bash
scrapurrr -p ollama/llama3 chat
```

```text
scrapurrr v0.1.0
> go to https://shop.example.com
Navigated to https://shop.example.com
> find "price"
Found 3 elements matching "price":
  [0] span "$29.99"
      css:   span.product-price
      xpath: //span[@class='product-price']
> get xpath of all buttons
  [0] //button[@class='add-to-cart'] "Add to Cart"
  [1] //button[@id='search'] "Search"
> what products are on this page?
There are 4 products listed: Widget Pro ($29.99), Widget Max ($49.99)...
> exit
```

The browser stays open between messages. Direct commands like `go to`, `find`, `get xpath`, `scroll`, `click`, and `back` run instantly without calling the LLM. Everything else goes through the LLM with full page context.

## Element Extraction

Extract CSS selectors, XPath, JS path, outerHTML, and computed styles for any element on a page.

```python
async with Scrapurrr(provider="ollama/llama3") as scraper:
    # All elements on a page
    elements = await scraper.extract_elements("https://shop.example.com")

    # Filter by tag or text
    buttons = await scraper.extract_elements("https://shop.example.com", tag="button")
    prices = await scraper.extract_elements("https://shop.example.com", text="price")

    # Single element lookup
    el = await scraper.find_element("Add to Cart", url="https://shop.example.com")
    print(el.css)         # "button.add-to-cart"
    print(el.xpath)       # "//button[@class='add-to-cart']"
    print(el.full_xpath)  # "/html/body/div[2]/main/button[3]"
    print(el.js_path)     # "document.querySelector('button.add-to-cart')"
    print(el.outer_html)  # "<button class='add-to-cart'>Add to Cart</button>"
    print(el.styles)      # {"color": "white", "backgroundColor": "#1a73e8", ...}
```
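
The returned element objects expose all of these fields as attributes, so persisting them is straightforward. A minimal sketch (assumes only the `buttons` list and attributes shown above):

```python
import csv

# Write the selector data for every matched button to a CSV file.
with open("buttons.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["css", "xpath", "full_xpath", "js_path"])
    for el in buttons:
        writer.writerow([el.css, el.xpath, el.full_xpath, el.js_path])
```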

## Usage

### Extract from a single page

These snippets run inside an async function, as in the Quick Start above.

```python
class Product(BaseModel):
    name: str
    price: str
    rating: str


async with Scrapurrr(provider="ollama/llama3") as scraper:
    product = await scraper.extract("https://shop.example.com/item/42", Product)
```

### Extract a list of items

```python
class Job(BaseModel):
    title: str
    company: str
    location: str


async with Scrapurrr(provider="ollama/llama3") as scraper:
    jobs = await scraper.extract("https://jobs.example.com/python", list[Job])
```
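
Because the result is a plain `list[Job]`, Pydantic can serialize the whole batch. A minimal sketch (assumes Pydantic v2's `TypeAdapter`):

```python
from pydantic import TypeAdapter

# Serialize the full list of typed results to JSON bytes.
payload = TypeAdapter(list[Job]).dump_json(jobs, indent=2)
with open("jobs.json", "wb") as f:
    f.write(payload)
```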

### Agent mode

The agent navigates, clicks, scrolls, and fills forms autonomously.

```python
class SearchResult(BaseModel):
    title: str
    url: str
    snippet: str


async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
    results = await scraper.agent(
        task="Go to https://news.ycombinator.com and collect the top 5 stories",
        schema=list[SearchResult],
        max_steps=15,
    )
```

### Batch extraction

```python
urls = ["https://shop.com/product/1", "https://shop.com/product/2", ...]

async with Scrapurrr(provider="ollama/llama3") as scraper:
    products = await scraper.extract_many(urls, Product, concurrency=10)
```
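
In practice the URL list usually comes from a file, which is how the CLI's `batch` command works as well. A minimal sketch for loading one:

```python
from pathlib import Path

# One URL per line, skipping blank lines, mirroring the CLI's urls.txt input.
urls = [line.strip() for line in Path("urls.txt").read_text().splitlines() if line.strip()]
```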

### Auto-pagination

```python
async with Scrapurrr(provider="ollama/llama3") as scraper:
    all_products = await scraper.extract_all_pages(
        "https://shop.com/products?page=1",
        schema=list[Product],
        max_pages=20,
    )
```
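
Paginated listings sometimes repeat items across page boundaries. Assuming `extract_all_pages` returns one flat list (an assumption, not stated above), a simple name-keyed dedup works:

```python
# Keep one entry per product name; dict keys are unique by construction.
unique_products = list({p.name: p for p in all_products}.values())
```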

## Providers

Provider strings follow LiteLLM format: `provider/model`.

```python
# OpenAI
scraper = Scrapurrr(provider="openai/gpt-4o", api_key="sk-...")

# Anthropic
scraper = Scrapurrr(provider="anthropic/claude-sonnet-4-20250514", api_key="sk-ant-...")

# Groq
scraper = Scrapurrr(provider="groq/llama-3.1-70b-versatile", api_key="gsk_...")

# Ollama (local, no key needed)
scraper = Scrapurrr(provider="ollama/llama3")

# Self-hosted (vLLM, LM Studio)
scraper = Scrapurrr(provider="openai/mistral-7b", base_url="http://localhost:8000/v1")
```
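
Literal keys appear above only for brevity; reading them from the environment (which the config file's `env:` prefix also supports) keeps them out of source control. A minimal sketch:

```python
import os

# Pull the key from the environment instead of hard-coding it.
scraper = Scrapurrr(provider="openai/gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
```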

## Configuration

Copy the example config and point to it:

```bash
cp examples/scrapurrr.yaml scrapurrr.yaml
```

```python
from pathlib import Path

scraper = Scrapurrr(config_path=Path("scrapurrr.yaml"))
```

Constructor arguments override the config file. Environment variables are supported with the `env:` prefix:

```yaml
llm:
  provider: openai/gpt-4o
  api_key: env:OPENAI_API_KEY
```
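
Because constructor arguments win, one config file can hold shared defaults while individual scripts override single settings. A minimal sketch:

```python
# Everything comes from scrapurrr.yaml except the provider, overridden here.
scraper = Scrapurrr(config_path=Path("scrapurrr.yaml"), provider="ollama/llama3")
```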

## CLI

```bash
# Interactive chat session
scrapurrr -p ollama/llama3 chat

# Extract from a URL
scrapurrr extract "https://example.com/product" -s models:Product

# Save as CSV
scrapurrr extract "https://example.com/product" -s models:Product -o result.csv --format csv

# Agent mode
scrapurrr agent "Collect the top 10 products from https://shop.example.com" \
    -s models:Product --max-steps 30

# Batch extract from URL list
scrapurrr batch urls.txt -s models:Product --concurrency 10 -o results.json

# Start MCP server
scrapurrr serve
```

The `-s` flag takes `module:Class` format and must name a Pydantic model importable from your working directory.
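
For example, `-s models:Product` above assumes a `models.py` in the working directory. A minimal sketch of its contents (using the same Product schema as earlier):

```python
# models.py -- resolved by `-s models:Product`
from pydantic import BaseModel


class Product(BaseModel):
    name: str
    price: str
    rating: str
```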

## License

MIT. See LICENSE.

## Download files

### File details: scrapurrr-0.1.4.tar.gz (Source Distribution)

File metadata:

- Download URL: scrapurrr-0.1.4.tar.gz
- Upload date:
- Size: 220.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ba2213ae569b10183d6f4463ea2841ecc5875bcc0b9c9631ed108f3e04d490f |
| MD5 | 08357e522c9cc246d3593c84b2afcc83 |
| BLAKE2b-256 | d80a34b7d296536ffd0c780e6bdf4e4456a69c6bd6e0127931ceb5bcdb9ee3de |

### File details: scrapurrr-0.1.4-py3-none-any.whl (Built Distribution)

File metadata:

- Download URL: scrapurrr-0.1.4-py3-none-any.whl
- Upload date:
- Size: 80.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 067f53e151fbae86c0865a8a126f99530291f850e17c365a8e78ba8e389036c0 |
| MD5 | 1b02820ea8897954c1b81b8fac9ec91e |
| BLAKE2b-256 | 24832fec051aa15f8d3a8e87d5dd3aa108ddfa241582c4eada081e754b81a128 |