
Scrapurrr

Agentic web scraper with schema-driven extraction.


What is Scrapurrr?

Define a Pydantic schema, point it at a URL, and get back typed data. Scrapurrr handles rendering, anti-detection, pagination, and extraction automatically.

Core features:

  • Schema-driven extraction. Define what you want, get a typed object back.
  • Interactive chat CLI. Talk to scrapurrr in natural language, navigate pages, extract elements.
  • Element inspection. Get CSS selectors, XPath, full XPath, JS path, outerHTML, and styles for any element.
  • Agent mode. Autonomous navigation, clicking, scrolling, and form-filling across pages.
  • 100+ LLM providers. OpenAI, Anthropic, Groq, Ollama, or any LiteLLM-compatible endpoint.
  • Smart fetching. HTTP-first with automatic browser fallback for JS-heavy pages.
  • Stealth built-in. Fingerprint masking, human-like behavior, proxy rotation.
  • Batch and pagination. Concurrent multi-URL extraction with auto-pagination.
  • MCP server. Expose scraping as tools for AI assistants.

Install

pip install scrapurrr
playwright install chromium

Quick Start

import asyncio
from pydantic import BaseModel
from scrapurrr import Scrapurrr

class Article(BaseModel):
    title: str
    author: str
    published: str

async def main():
    async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
        article = await scraper.extract("https://example.com/article", Article)
        print(article.title)

asyncio.run(main())
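Conceptually, schema-driven extraction hands the schema to the LLM and validates the model's JSON reply back into a typed object. A minimal stdlib sketch of that validation step (dataclasses stand in for Pydantic here, and `parse_llm_json` is a hypothetical helper, not part of Scrapurrr's API):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Article:
    title: str
    author: str
    published: str

def parse_llm_json(raw: str, cls):
    # Keep only keys that match the schema's fields, then build the typed object.
    data = json.loads(raw)
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in names})

reply = '{"title": "Hello", "author": "Ada", "published": "2024-01-01", "extra": 1}'
article = parse_llm_json(reply, Article)
print(article.title)  # → Hello
```

Pydantic adds coercion and error reporting on top of this; the filtering step is what keeps stray keys in the LLM's reply from breaking construction.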

Interactive Chat

Start an interactive scraping session from the terminal:

scrapurrr -p ollama/llama3 chat
scrapurrr v0.1.0

> go to https://shop.example.com
Navigated to https://shop.example.com

> find "price"
Found 3 elements matching "price":
  [0] span "$29.99"
      css: span.product-price
      xpath: //span[@class='product-price']

> get xpath of all buttons
  [0] //button[@class='add-to-cart']    "Add to Cart"
  [1] //button[@id='search']            "Search"

> what products are on this page?
There are 4 products listed: Widget Pro ($29.99), Widget Max ($49.99)...

> exit

The browser stays open between messages. Direct commands like go to, find, get xpath, scroll, click, and back run instantly without calling the LLM. Everything else goes through the LLM with full page context.
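That split can be sketched as a simple prefix dispatcher, where known commands are handled locally and everything else falls through to the LLM (a generic illustration, not Scrapurrr's actual router):

```python
def route(message: str, direct: dict, llm_fallback):
    # Direct commands are matched by prefix and bypass the LLM entirely.
    for prefix, handler in direct.items():
        if message.startswith(prefix):
            return handler(message[len(prefix):].strip())
    # Anything else goes to the LLM with full page context.
    return llm_fallback(message)

direct = {
    "go to ": lambda url: f"Navigated to {url}",
    "find ": lambda text: f"Searching for {text}",
}
print(route("go to https://shop.example.com", direct, lambda m: "LLM: " + m))
# → Navigated to https://shop.example.com
```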

Element Extraction

Extract CSS selectors, XPath, JS path, outerHTML, and computed styles for any element on a page.

async with Scrapurrr(provider="ollama/llama3") as scraper:
    # All elements on a page
    elements = await scraper.extract_elements("https://shop.example.com")

    # Filter by tag or text
    buttons = await scraper.extract_elements("https://shop.example.com", tag="button")
    prices = await scraper.extract_elements("https://shop.example.com", text="price")

    # Single element lookup
    el = await scraper.find_element("Add to Cart", url="https://shop.example.com")
    print(el.css)        # "button.add-to-cart"
    print(el.xpath)      # "//button[@class='add-to-cart']"
    print(el.full_xpath) # "/html/body/div[2]/main/button[3]"
    print(el.js_path)    # "document.querySelector('button.add-to-cart')"
    print(el.outer_html) # "<button class='add-to-cart'>Add to Cart</button>"
    print(el.styles)     # {"color": "white", "backgroundColor": "#1a73e8", ...}

Usage

Extract from a single page

class Product(BaseModel):
    name: str
    price: str
    rating: str

async with Scrapurrr(provider="ollama/llama3") as scraper:
    product = await scraper.extract("https://shop.example.com/item/42", Product)

Extract a list of items

class Job(BaseModel):
    title: str
    company: str
    location: str

async with Scrapurrr(provider="ollama/llama3") as scraper:
    jobs = await scraper.extract("https://jobs.example.com/python", list[Job])

Agent mode

The agent navigates, clicks, scrolls, and fills forms autonomously.

class SearchResult(BaseModel):
    title: str
    url: str
    snippet: str

async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
    results = await scraper.agent(
        task="Go to https://news.ycombinator.com and collect the top 5 stories",
        schema=list[SearchResult],
        max_steps=15,
    )

Batch extraction

urls = ["https://shop.com/product/1", "https://shop.com/product/2", ...]

async with Scrapurrr(provider="ollama/llama3") as scraper:
    products = await scraper.extract_many(urls, Product, concurrency=10)
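Concurrency-capped fan-out like this is typically a semaphore around `asyncio.gather`; a generic sketch of the pattern (not Scrapurrr's internals, with a fake extractor standing in for a real page fetch):

```python
import asyncio

async def extract_many(urls, extract_one, concurrency=10):
    # The semaphore caps the number of in-flight extractions.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:
            return await extract_one(url)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(u) for u in urls))

async def demo():
    async def fake_extract(url):
        await asyncio.sleep(0.01)
        return {"url": url}
    urls = [f"https://shop.com/product/{i}" for i in range(5)]
    return await extract_many(urls, fake_extract, concurrency=2)

results = asyncio.run(demo())
print(len(results))  # → 5
```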

Auto-pagination

async with Scrapurrr(provider="ollama/llama3") as scraper:
    all_products = await scraper.extract_all_pages(
        "https://shop.com/products?page=1",
        schema=list[Product],
        max_pages=20,
    )
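The pagination loop itself is a small pattern: extract, look for a next-page link, and stop at `max_pages` or when a page repeats. A generic sketch under those assumptions (`extract_page` and `find_next` are hypothetical callables, not Scrapurrr's API):

```python
import asyncio

async def extract_all_pages(start_url, extract_page, find_next, max_pages=20):
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                      # guard against pagination loops
        items.extend(await extract_page(url))
        url = await find_next(url)         # None when there is no next page
    return items

# Two fake pages: p1 links to p2, p2 is the last page.
pages = {"p1": (["a", "b"], "p2"), "p2": (["c"], None)}

async def extract_page(url):
    return pages[url][0]

async def find_next(url):
    return pages[url][1]

result = asyncio.run(extract_all_pages("p1", extract_page, find_next))
print(result)  # → ['a', 'b', 'c']
```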

Providers

Provider strings follow the LiteLLM provider/model format.

# OpenAI
scraper = Scrapurrr(provider="openai/gpt-4o", api_key="sk-...")

# Anthropic
scraper = Scrapurrr(provider="anthropic/claude-sonnet-4-20250514", api_key="sk-ant-...")

# Groq
scraper = Scrapurrr(provider="groq/llama-3.1-70b-versatile", api_key="gsk_...")

# Ollama (local, no key needed)
scraper = Scrapurrr(provider="ollama/llama3")

# Self-hosted (vLLM, LM Studio)
scraper = Scrapurrr(provider="openai/mistral-7b", base_url="http://localhost:8000/v1")

Configuration

Copy the example config and point to it:

cp examples/scrapurrr.yaml scrapurrr.yaml

from pathlib import Path
scraper = Scrapurrr(config_path=Path("scrapurrr.yaml"))

Constructor arguments override the config file. Environment variables are supported with the env: prefix:

llm:
  provider: openai/gpt-4o
  api_key: env:OPENAI_API_KEY
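A plausible sketch of how an env: prefixed value gets resolved when the config is loaded (illustrative only; `resolve` is not a Scrapurrr function):

```python
import os

def resolve(value):
    # Strings prefixed with "env:" are read from the environment at load time.
    if isinstance(value, str) and value.startswith("env:"):
        return os.environ[value[len("env:"):]]
    return value

os.environ["OPENAI_API_KEY"] = "sk-demo"
print(resolve("env:OPENAI_API_KEY"))  # → sk-demo
print(resolve("openai/gpt-4o"))       # → openai/gpt-4o
```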

CLI

# Interactive chat session
scrapurrr -p ollama/llama3 chat

# Extract from a URL
scrapurrr extract "https://example.com/product" -s models:Product

# Save as CSV
scrapurrr extract "https://example.com/product" -s models:Product -o result.csv --format csv

# Agent mode
scrapurrr agent "Collect the top 10 products from https://shop.example.com" \
  -s models:Product --max-steps 30

# Batch extract from URL list
scrapurrr batch urls.txt -s models:Product --concurrency 10 -o results.json

# Start MCP server
scrapurrr serve

The -s flag takes a module:Class reference to a Pydantic model importable from your working directory.
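Resolving a module:Class reference is a standard importlib pattern; a sketch of what the -s flag presumably does under the hood (demonstrated with a stdlib class instead of a local models.py):

```python
import importlib

def load_schema(spec: str):
    # "models:Product" → import the models module, return its Product attribute.
    module_name, _, class_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

print(load_schema("collections:OrderedDict").__name__)  # → OrderedDict
```

This is why the model must be importable from the working directory: the CLI imports the module by name at runtime.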

License

MIT. See LICENSE.


