YAML-driven web scraper framework with AI agent integrations

Scrapit

A modular, YAML-driven web scraper framework. Describe any scraping target in a config file — Scrapit handles fetching, parsing, transforming, validating, and storing the data.

No code is required for new targets. Just write a YAML file.

Features

Feature	Description
YAML directives	Declarative scrape configs — selectors, transforms, validation, cache
Two backends	BeautifulSoup (fast, static) or Playwright (JS-rendered)
Fallback selectors	Per-field list of CSS selectors tried in order
`all: true`	Extract all matches for a selector, not just the first
Pagination	Follow "next page" links automatically
Spider mode	Discover and scrape all linked pages from an index
Multi-site	Scrape multiple URLs with the same spec in one directive
Transform pipeline	Declarative field transforms: strip, regex, float, replace, split…
Validation	Per-field rules: required, type, min/max, pattern, enum
Five output backends	JSON, CSV (append), SQLite (zero-config), MongoDB, Excel (append)
HTTP cache	File-based cache with TTL — avoid re-fetching during dev
Change detection	Diff result against previous run, fire webhook on change
Webhook notifications	POST JSON payload to any URL when changes detected
Stats reporter	Field coverage %, timing, error count per run
Hook system	Register callbacks for scrape lifecycle events
Async queue	RabbitMQ producer/consumer for background processing
Structured logging	Console + `output/scraper.log`

Installation

git clone <repo-url>
cd scrapit

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

# Only if you use the playwright backend:
playwright install chromium

Copy and fill in your credentials:

cp scraper/.env.example .env

# MongoDB (optional)
MONGO_URI=mongodb+srv://user:pass@cluster/
MONGO_DATABASE=mydb
MONGO_COLLECTION=scraped

# RabbitMQ (optional)
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672
RABBITMQ_USER=guest
RABBITMQ_PASS=guest

# Webhook notifications (optional)
SCRAPIT_WEBHOOK_URL=https://hooks.example.com/...

Quick Start

# scrape Wikipedia, save to JSON
python -m scraper.main scrape wikipedia --json

# scrape Hacker News (paginated), save to SQLite
python -m scraper.main scrape hn --sqlite

# spider Books to Scrape, preview only
python -m scraper.main scrape books --preview

# scrape all directives in the default folder
python -m scraper.main batch --json

# list available directives
python -m scraper.main list

CLI Reference

`scrape` — single directive

python -m scraper.main scrape <directive> [--json|--csv|--sqlite|--mongo|--excel] [--preview] [--diff]

<directive> can be a name (wikipedia), filename (wikipedia.yaml), or path.
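
A lookup of this kind could be sketched as follows. The `resolve_directive` helper and its search order are assumptions made for illustration, not Scrapit's actual code; only the default folder `scraper/directives/` is taken from this README.

```python
from pathlib import Path

DEFAULT_DIR = Path("scraper/directives")  # default folder named in this README


def resolve_directive(arg: str, base: Path = DEFAULT_DIR) -> Path:
    """Hypothetical helper: map a name, filename, or path to a YAML file."""
    candidate = Path(arg)
    # 1. An explicit path that exists on disk wins.
    if candidate.is_file():
        return candidate
    # 2. A filename such as "wikipedia.yaml" is looked up in the folder.
    if candidate.suffix in {".yaml", ".yml"}:
        return base / candidate.name
    # 3. A bare name such as "wikipedia" gets the .yaml extension appended.
    return base / f"{arg}.yaml"
```

With this sketch, `resolve_directive("wikipedia")` and `resolve_directive("wikipedia.yaml")` both point at `scraper/directives/wikipedia.yaml`.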

Flag	Description
`--json`	Save to `output/<name>.json` (default)
`--csv`	Append to `output/<name>.csv`
`--sqlite`	Save to `output/scrapit.db`
`--mongo`	Save to MongoDB
`--excel`	Append to `output/<name>.xlsx`
`--preview`	Print result, do not save
`--diff`	Compare with previous JSON output and show changes
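
To make the idea behind `--diff` concrete, here is a minimal sketch of comparing two scrape results field by field. The `diff_results` function and the shape of its return value are hypothetical; the README does not document how Scrapit represents changes internally.

```python
from typing import Any


def diff_results(previous: dict[str, Any], current: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return {field: {"old": ..., "new": ...}} for every field that differs."""
    changes = {}
    # Look at the union of keys so added and removed fields are reported too.
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes


previous = {"title": "Python", "price": 10.0}
current = {"title": "Python", "price": 12.5, "stock": 3}
changes = diff_results(previous, current)
print(changes)
# {'price': {'old': 10.0, 'new': 12.5}, 'stock': {'old': None, 'new': 3}}
```

An empty result means nothing changed, which is the natural condition for skipping a webhook call.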

`batch` — all directives in a folder

python -m scraper.main batch [folder] [--json|--csv|--sqlite|--mongo|--excel] [--preview] [--diff]

Default folder: scraper/directives/

`list` — inspect directives

python -m scraper.main list [--dir path/to/folder]

Shows site, backend, fields, transforms, validation rules, and cache config.

`query` — read stored data

# recent scrapes from SQLite
python -m scraper.main query --backend sqlite --limit 10

# filter by directive name
python -m scraper.main query --directive wikipedia

# filter by URL fragment
python -m scraper.main query --url wikipedia.org

`cache` — manage HTTP cache

python -m scraper.main cache stats        # show cache size and entry count
python -m scraper.main cache clear        # delete all cached responses
python -m scraper.main cache invalidate --url https://example.com
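
For readers curious how a file-based cache with a TTL can work, the sketch below stores one JSON file per URL and treats entries older than the TTL as missing. The `FileCache` class, its file layout, and its method names are assumptions for illustration and may differ from what `scraper/cache/` actually does.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Optional


class FileCache:
    """Hypothetical file-based cache: one JSON file per URL, expired by age."""

    def __init__(self, directory: str, ttl: int) -> None:
        self.directory = Path(directory)
        self.ttl = ttl  # seconds; 0 disables the cache
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe, fixed-length filename.
        return self.directory / (hashlib.sha256(url.encode()).hexdigest() + ".json")

    def get(self, url: str) -> Optional[str]:
        path = self._path(url)
        if self.ttl <= 0 or not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["stored_at"] > self.ttl:
            return None  # entry is stale
        return entry["body"]

    def set(self, url: str, body: str) -> None:
        if self.ttl <= 0:
            return
        entry = {"url": url, "stored_at": time.time(), "body": body}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")

    def invalidate(self, url: str) -> None:
        self._path(url).unlink(missing_ok=True)

    def clear(self) -> int:
        """Delete every entry and return how many were removed."""
        removed = 0
        for path in self.directory.glob("*.json"):
            path.unlink()
            removed += 1
        return removed
```

The three methods `get`/`invalidate`/`clear` loosely mirror the `cache` subcommands above; counting files in the directory would give the entry count that `cache stats` reports.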

Writing Directives

Minimal directive

site: https://example.com
use: beautifulsoup   # or playwright

scrape:
  field_name:
    - 'css-selector'
    - attr: text     # 'text' = inner text, or any HTML attribute (href, src, …)
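
Once parsed from YAML, each field in `scrape:` is a two-item list: a selector (or a list of fallback selectors) followed by an options mapping. The sketch below normalizes that shape into one structure. `FieldSpec` and `parse_field` are hypothetical names, and defaulting `attr` to `text` is an assumption rather than documented behavior.

```python
from typing import Any, NamedTuple


class FieldSpec(NamedTuple):
    selectors: list[str]  # tried in order; the first match wins
    attr: str             # "text" or an HTML attribute name
    collect_all: bool     # True means every match is returned as a list


def parse_field(spec: list[Any]) -> FieldSpec:
    """Normalize one entry of the `scrape:` mapping (hypothetical helper)."""
    selector = spec[0]
    options = spec[1] if len(spec) > 1 else {}
    # A single selector and a fallback list are handled the same way.
    selectors = [selector] if isinstance(selector, str) else list(selector)
    return FieldSpec(selectors, options.get("attr", "text"), bool(options.get("all", False)))


# The minimal directive's field, as PyYAML would load it:
spec = parse_field(["css-selector", {"attr": "text"}])
```

Normalizing early keeps the extraction code free of special cases for the single-selector and fallback-list forms.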

All directive options

site: https://example.com
use: beautifulsoup       # or playwright

# ── Mode ─────────────────────────────────────────────────────────────────────
mode: single             # single (default) | spider

# ── Multiple sites (same scrape spec applied to each) ────────────────────────
sites:
  - https://example.com/page-1
  - https://example.com/page-2

# ── Request options ───────────────────────────────────────────────────────────
retries: 3               # HTTP retries with exponential backoff (bs4)
timeout: 15              # seconds (bs4) or milliseconds (playwright)
headers:                 # extra HTTP headers
  Authorization: Bearer xxx
cookies:                 # bs4: dict  |  playwright: list of {name,value,domain}
  session_id: abc123
proxy: http://proxy:8080

# Proxy with authentication:
# proxy: http://user:password@proxy:8080
# proxy: https://user:password@proxy:8080
# proxy: socks5://user:password@proxy:1080

# Using environment variable (set in .env):
# proxy: ${PROXY_URL}

# ── Cache ─────────────────────────────────────────────────────────────────────
cache:
  ttl: 3600              # seconds (0 = disabled)

# ── Playwright-only ───────────────────────────────────────────────────────────
wait_for: '#content'     # wait for selector before parsing
screenshot: true         # save full-page screenshot to output/

# ── Scrape spec ───────────────────────────────────────────────────────────────
scrape:
  title:
    - 'h1'               # single selector
    - attr: text

  image:
    - ['img.hero', 'img.main', 'img']   # fallback selectors
    - attr: src

  all_links:
    - 'a.result'
    - attr: href
      all: true          # return list of all matches

# ── Pagination (bs4 only) ─────────────────────────────────────────────────────
paginate:
  selector: 'a.next-page'
  attr: href
  max_pages: 5

# ── Spider mode (bs4 only) ────────────────────────────────────────────────────
follow:
  selector: 'a.article-link'
  attr: href
  max: 50                # max pages to scrape
  same_domain: true      # stay on same domain

# ── Transform pipeline ────────────────────────────────────────────────────────
transform:
  price:
    - strip
    - {replace: {"€": "", ",": "."}}
    - float
  title:
    - strip
    - upper
  description:
    - strip
    - {slice: {end: 200}}
  tags:
    - {split: ","}
    - first

# ── Validation ────────────────────────────────────────────────────────────────
validate:
  title:
    required: true
    min_length: 2
    max_length: 500
  price:
    type: float
    min: 0
  status:
    in: [active, inactive, pending]

# ── Notifications ─────────────────────────────────────────────────────────────
notify:
  webhook: https://hooks.slack.com/...   # called when --diff detects changes
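
The `paginate` options describe a simple loop: read the "next page" link, resolve it against the current URL, and stop after `max_pages`. A minimal sketch under those assumptions follows. The `paginate` function and the in-memory `fake_site` are illustrative stand-ins; a real run would fetch each page and read the `href` of the configured selector.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def paginate(
    start_url: str,
    fetch_next_href: Callable[[str], Optional[str]],
    max_pages: int,
) -> list[str]:
    """Return the URLs visited, following "next" links up to max_pages."""
    visited: list[str] = []
    url: Optional[str] = start_url
    # Stop on a missing link, a loop back to a seen page, or the page limit.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        href = fetch_next_href(url)  # e.g. the href of 'a.next-page', or None
        # Relative links are resolved against the current page.
        url = urljoin(url, href) if href else None
    return visited


# Stand-in for a real site: page URL -> href of its "next page" link.
fake_site = {
    "https://example.com/list": "?page=2",
    "https://example.com/list?page=2": "?page=3",
    "https://example.com/list?page=3": None,
}
pages = paginate("https://example.com/list", fake_site.get, max_pages=5)
```

Here `pages` holds the three URLs in order; with `max_pages=2` the loop would stop after the second one.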

Available transforms

Transform	Argument	Description
`strip`	—	Strip whitespace
`lower` / `upper` / `title`	—	Change case
`int` / `float`	—	Parse number (removes non-numeric chars)
`regex`	`pattern`	Extract first regex match
`regex_group`	`{pattern, group}`	Extract specific capture group
`replace`	`{old: new}`	String substitution (multiple pairs)
`split`	`","`	Split string into list
`join`	`", "`	Join list into string
`first` / `last`	—	Pick first/last item from list
`default`	`value`	Fallback if value is None
`slice`	`{start, end}` or `N`	Substring / sublist
`prepend` / `append`	`"str"`	Add text before/after
`remove_tags`	—	Strip HTML tags
`template`	`"prefix {value}"`	String template with `{value}` or `{other_field}`
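
A pipeline like this can be modeled as a list of steps applied in order, where a step is either a bare name (`strip`) or a single-key mapping (`{split: ","}`). The sketch below covers a handful of the transforms in the table. `apply_transforms` and the two lookup tables are hypothetical, and `float` here is plain `float()` without the non-numeric stripping the table describes.

```python
from typing import Any, Callable


def _replace(value: str, pairs: dict[str, str]) -> str:
    # Apply each old -> new substitution in the order given.
    for old, new in pairs.items():
        value = value.replace(old, new)
    return value


# Transforms without an argument.
SIMPLE: dict[str, Callable[[Any], Any]] = {
    "strip": lambda v: v.strip(),
    "upper": lambda v: v.upper(),
    "lower": lambda v: v.lower(),
    "float": float,  # simplification: no removal of non-numeric characters
    "first": lambda v: v[0] if v else None,
    "last": lambda v: v[-1] if v else None,
}

# Transforms that take an argument, written as a single-key mapping in YAML.
WITH_ARG: dict[str, Callable[[Any, Any], Any]] = {
    "replace": _replace,
    "split": lambda v, sep: v.split(sep),
    "join": lambda v, sep: sep.join(v),
    "default": lambda v, fallback: fallback if v is None else v,
}


def apply_transforms(value: Any, steps: list[Any]) -> Any:
    """Run each step in order, feeding the output of one into the next."""
    for step in steps:
        if isinstance(step, str):
            value = SIMPLE[step](value)
        else:
            name, arg = next(iter(step.items()))  # e.g. {"split": ","}
            value = WITH_ARG[name](value, arg)
    return value


price = apply_transforms(" 12,50 € ", ["strip", {"replace": {"€": "", ",": "."}}, "float"])
```

Because each step only sees the previous step's output, order matters: `float` must come after the `replace` that turns the decimal comma into a dot.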

Available validation rules

Rule	Example	Description
`required`	`true`	Must not be None
`type`	`float`	Type check: `str`, `int`, `float`, `list`, `bool`
`not_empty`	`true`	Must not be empty string/list
`min` / `max`	`0` / `1000`	Numeric range
`min_length` / `max_length`	`2` / `500`	String/list length
`pattern`	`^\d{4}$`	Regex must match
`in`	`[a, b, c]`	Value must be in enum
`not_in`	`[a, b, c]`	Value must NOT be in enum
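
As a rough illustration of how such rules can be checked, the sketch below validates one field against a rules mapping and collects error messages. `validate_field`, the message wording, and the decision to skip other rules when the value is `None` are all assumptions, not Scrapit's documented behavior.

```python
import re
from typing import Any

TYPES = {"str": str, "int": int, "float": float, "list": list, "bool": bool}


def validate_field(name: str, value: Any, rules: dict[str, Any]) -> list[str]:
    """Return error messages for one field; an empty list means it passed."""
    if value is None:
        # The remaining rules need a value to inspect.
        return [f"{name}: required"] if rules.get("required") else []
    errors = []
    if "type" in rules and not isinstance(value, TYPES[rules["type"]]):
        errors.append(f"{name}: expected {rules['type']}")
    if rules.get("not_empty") and len(value) == 0:
        errors.append(f"{name}: must not be empty")
    if "min" in rules and value < rules["min"]:
        errors.append(f"{name}: below min {rules['min']}")
    if "max" in rules and value > rules["max"]:
        errors.append(f"{name}: above max {rules['max']}")
    if "min_length" in rules and len(value) < rules["min_length"]:
        errors.append(f"{name}: shorter than {rules['min_length']}")
    if "max_length" in rules and len(value) > rules["max_length"]:
        errors.append(f"{name}: longer than {rules['max_length']}")
    if "pattern" in rules and not re.search(rules["pattern"], str(value)):
        errors.append(f"{name}: does not match {rules['pattern']}")
    if "in" in rules and value not in rules["in"]:
        errors.append(f"{name}: not an allowed value")
    if "not_in" in rules and value in rules["not_in"]:
        errors.append(f"{name}: forbidden value")
    return errors
```

Returning a list instead of raising on the first failure lets a run report every problem with a record at once.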

Output

All outputs go to output/ at the project root.

File	Description
`output/<name>.json`	Last scrape as JSON
`output/<name>.csv`	All scrapes in append-mode CSV
`output/scrapit.db`	SQLite database with all scrapes
`output/scraper.log`	Full log (also printed to console)
`output/<name>_<ts>.png`	Screenshot (Playwright + `screenshot: true`)

Project Structure

scrapit/
  scraper/
    main.py                   CLI entry point (scrape/batch/list/query/cache)
    config.py                 Environment variables and paths
    logger.py                 Logging → console + output/scraper.log
    hooks.py                  Lifecycle hook registry
    reporter.py               Timing and field coverage stats
    directives/               Built-in example directives
      wikipedia.yaml
      hn.yaml                 Hacker News (paginated)
      books.yaml              Books to Scrape (spider mode)
      github_trending.yaml    GitHub trending (all: true)
    scrapers/
      __init__.py             Pipeline dispatcher
      bs4_scraper.py          BeautifulSoup backend
      playwright_scraper.py   Playwright backend
      paginator.py            Pagination support
      spider.py               Spider / link-following
    transforms/
      __init__.py             Transform pipeline engine
    validators/
      __init__.py             Validation engine
    storage/
      mongo.py                MongoDB (lazy connect)
      json_file.py            JSON output
      csv_file.py             CSV output (append)
      sqlite.py               SQLite (zero-config)
      diff.py                 Change detection
    cache/
      __init__.py             HTTP cache with TTL
    notifications/
      __init__.py             Webhook notifications
    queue/
      producer.py             RabbitMQ producer
      consumer.py             RabbitMQ consumer
  output/                     Generated data (gitignored)
  .cache/                     HTTP cache (gitignored)
  requirements.txt
  .env

Hook System

from scraper import hooks

@hooks.on("after_scrape")
def log_result(result, dados):
    print(f"scraped {result['url']} — {len(result)} fields")

@hooks.on("on_change")
def alert(changes, result):
    print(f"change in {result['url']}: {list(changes.keys())}")

@hooks.on("on_error")
def handle_error(exc, dados):
    print(f"failed on {dados['site']}: {exc}")

Available events: before_scrape, after_scrape, on_error, on_save, on_change
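
A decorator-based registry like `hooks.on` is typically a small mapping from event names to callback lists. The sketch below shows one possible shape; `_registry`, `on`, and `fire` are illustrative and not necessarily how `scraper/hooks.py` is written.

```python
from collections import defaultdict
from typing import Any, Callable

_registry: dict[str, list[Callable[..., Any]]] = defaultdict(list)


def on(event: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a callback for an event name."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        _registry[event].append(func)
        return func  # leave the function usable on its own
    return decorator


def fire(event: str, *args: Any) -> None:
    """Call every callback registered for the event, in registration order."""
    for callback in _registry.get(event, []):
        callback(*args)


seen = []


@on("after_scrape")
def remember(result, directive):
    seen.append(result["url"])


fire("after_scrape", {"url": "https://example.com"}, {"site": "https://example.com"})
```

Firing an event with no registered callbacks is a no-op, so the scraping pipeline can emit every lifecycle event unconditionally.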

AI Agent Integrations

Scrapit ships integrations for MCP, the Anthropic SDK, the OpenAI SDK, LangChain / CrewAI / LangGraph, and LlamaIndex, so an agent can scrape the web on demand without extra glue code.

MCP Server (Claude Desktop, Cursor, Claude Code)

The fastest way to add Scrapit to Claude:

# Claude Code
claude mcp add scrapit -- python -m scraper.integrations.mcp

For Claude Desktop, add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "scrapit": {
      "command": "python",
      "args": ["-m", "scraper.integrations.mcp"],
      "cwd": "/path/to/scrapit"
    }
  }
}

After adding, Claude will have 4 web scraping tools available automatically.

Anthropic SDK (native tool use)

import anthropic
from scraper.integrations.anthropic import as_anthropic_tools, handle_tool_call

client = anthropic.Anthropic()
tools  = as_anthropic_tools()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What are the top posts on Hacker News?"}],
)

for block in response.content:
    if block.type == "tool_use":
        result = handle_tool_call(block.name, block.input)

# Or use the built-in agent loop:
from scraper.integrations.anthropic import ScrapitAnthropicAgent

agent = ScrapitAnthropicAgent(model="claude-opus-4-6")
answer = agent.run("Summarize the top 3 Hacker News posts today.")

LangChain / CrewAI / LangGraph

from scraper.integrations.langchain import ScrapitToolkit
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

tools = ScrapitToolkit().get_tools()
# → [ScrapitTool, ScrapitPageTool, ScrapitSelectorTool]

agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(model="gpt-4o"),
    agent=AgentType.OPENAI_FUNCTIONS,
)
agent.run("What does the Wikipedia article on Python say?")

Works with CrewAI — pass ScrapitToolkit().get_tools() to any Agent(tools=[...]).

OpenAI SDK (function calling)

from openai import OpenAI
from scraper.integrations.openai import as_openai_functions, handle_function_call

client = OpenAI()
tools  = as_openai_functions()

response = client.chat.completions.create(
    model="gpt-4o", tools=tools,
    messages=[{"role": "user", "content": "Scrape the top GitHub trending repos."}],
)

# Or use the built-in agent loop:
from scraper.integrations.openai import ScrapitOpenAIAgent

agent = ScrapitOpenAIAgent(model="gpt-4o")
answer = agent.run("What are the trending Python repos on GitHub today?")

LlamaIndex (RAG pipelines)

from scraper.integrations.llamaindex import ScrapitReader
from llama_index.core import VectorStoreIndex

reader = ScrapitReader()
docs   = reader.load_data(urls=["https://site1.com", "https://site2.com"])  # parallel

index  = VectorStoreIndex.from_documents(docs)
engine = index.as_query_engine()
response = engine.query("Summarize the main points.")

Quick programmatic API (no YAML needed)

from scraper.integrations import (
    scrape_url, scrape_page, scrape_with_selectors, scrape_many,
    scrape_directive,  # used at the end of this example
)

# Clean text — ready to feed to an LLM
text = scrape_url("https://news.ycombinator.com")

# Structured metadata: title, description, links, word_count
page = scrape_page("https://example.com")

# Agent-defined extraction with CSS selectors — no YAML needed
data = scrape_with_selectors(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000",
    selectors={"title": "h1", "price": "p.price_color"},
)

# Parallel scraping
pages = scrape_many(["https://a.com", "https://b.com"], mode="page")

# Run a directive and get structured data
data = scrape_directive("wikipedia")

Optional dependencies

All integration dependencies are lazy — Scrapit works without any of them installed. Install only what you need:

pip install anthropic          # Anthropic SDK integration
pip install openai             # OpenAI integration
pip install langchain-core     # LangChain / CrewAI / LangGraph
pip install llama-index-core   # LlamaIndex
pip install mcp                # MCP server (Claude Desktop / Cursor / Claude Code)

Async Queue (RabbitMQ)

Send a directive to the background queue:

from scraper.queue.producer import call_producer
call_producer("directives/wikipedia.yaml")

Start a consumer worker:

python -m scraper.queue.consumer

Workers scrape each received directive and save to MongoDB automatically.

Programmatic Usage

import asyncio
from scraper.scrapers import grab_elements_by_directive
from scraper.storage import json_file

result = asyncio.run(grab_elements_by_directive("scraper/directives/wikipedia.yaml"))
json_file.save(result, "wikipedia")

Contributing

Contributions are welcome, whether it's a bug fix, a new transform, a new storage backend, or just a directive YAML that works for a site you scraped.

See CONTRIBUTING.md for a full guide on how to get started.

Quick ways to contribute:

- Share a directive: open an issue with the "Share a Directive" template
- New transform: add a function to scraper/transforms/__init__.py and open a PR
- Bug report: use the bug report issue template

Requirements

- Python 3.10+
- requests, bs4, pyyaml: always required
- playwright: only for the playwright backend
- pymongo, python-dotenv: only for MongoDB
- pika: only for the RabbitMQ queue
- SQLite is included in Python's stdlib (no install needed)

License

MIT © João Benedet Machado

Proxy Configuration

You can use proxies in your directives to route requests through proxy servers.

Basic Usage

site: https://example.com/products
use: beautifulsoup

scrape:
  name:
    - '.product h3'
    - attr: text
      all: true

# Proxy configuration
proxy:
  enabled: true
  url: "http://proxy.example.com:8080"
  # Or use environment variables
  # url: "${HTTP_PROXY}"

Using with Environment Variables

proxy:
  enabled: true
  url: "${HTTP_PROXY}"  # Reads from HTTP_PROXY or HTTPS_PROXY env var
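
Placeholders such as `${HTTP_PROXY}` or `${PROXY_URL}` imply a substitution step when the directive is loaded. A minimal sketch of that step follows. `expand_env` is a hypothetical helper, and raising on an unset variable is a design choice made here, not something this README specifies.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ[NAME]; fail loudly if unset."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return _VAR.sub(substitute, value)


os.environ["PROXY_URL"] = "http://proxy:8080"
proxy_url = expand_env("${PROXY_URL}")
```

Failing on a missing variable avoids silently sending requests without the proxy the directive asked for.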

Proxies for Different Protocols

proxy:
  http: "http://http-proxy.example.com:8080"
  https: "https://https-proxy.example.com:8080"

Rotating Proxies

proxy:
  enabled: true
  rotate: true
  proxies:
    - "http://proxy1.example.com:8080"
    - "http://proxy2.example.com:8080"
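
Rotation usually means round-robin: each request takes the next proxy in the list and wraps around at the end. The sketch below shows that idea with `itertools.cycle`. `proxy_rotation` is a hypothetical helper; the yielded mapping uses the scheme-to-URL form that the `requests` library accepts for its `proxies` argument.

```python
from itertools import cycle
from typing import Iterator


def proxy_rotation(proxies: list[str]) -> Iterator[dict[str, str]]:
    """Yield one proxies mapping per request, cycling through the list."""
    for url in cycle(proxies):
        # Route both plain and TLS traffic through the same proxy.
        yield {"http": url, "https": url}


rotation = proxy_rotation([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])
first, second, third = next(rotation), next(rotation), next(rotation)
```

After two requests the rotation wraps, so `third` uses the same proxy as `first`.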

Common Proxy Providers

- SmartProxy: residential proxies
- Oxylabs: enterprise proxies
- ScraperAPI: API with proxy rotation
- ScrapingBee: headless browser with proxies

Best Practices

- Use residential proxies for sensitive sites
- Rotate proxies to avoid blocks
- Set an appropriate delay between requests
- Monitor proxy health and replace dead proxies

Project details

Maintainer: joaobenedetmachado

Release history

Version	Released
0.3.0	Mar 7, 2026
0.2.0	Mar 6, 2026
0.1.0 (this version)	Mar 5, 2026

Download files

Source Distribution

scrapit_scraper-0.1.0.tar.gz (42.0 kB view details)

Uploaded Mar 5, 2026 Source

Built Distribution

scrapit_scraper-0.1.0-py3-none-any.whl (59.2 kB view details)

Uploaded Mar 5, 2026 Python 3

File details

Details for the file scrapit_scraper-0.1.0.tar.gz.

File metadata

Download URL: scrapit_scraper-0.1.0.tar.gz
Upload date: Mar 5, 2026
Size: 42.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scrapit_scraper-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f06c9fde4d362395fa79dcc574f614fdd87c1186bc533890841fd7977a48b13d`
MD5	`774605b911d56897c9bcd5816802c37d`
BLAKE2b-256	`54b07d959ac13e1c795aadbcae9740c69c931f8be47c91adb276eb326d1bac9b`
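
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. The sketch below is one way to do that; `sha256_of` and `verify` are illustrative helper names, and the local file path is whatever you saved the download as.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for scrapit_scraper-0.1.0.tar.gz
EXPECTED = "f06c9fde4d362395fa79dcc574f614fdd87c1186bc533890841fd7977a48b13d"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED) -> bool:
    """Return True when the file's SHA256 digest matches the expected one."""
    return sha256_of(path) == expected.lower()
```

Calling `verify("scrapit_scraper-0.1.0.tar.gz")` on the downloaded file returns `True` only if its contents match the published digest.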

Provenance

The following attestation bundles were made for scrapit_scraper-0.1.0.tar.gz:

Publisher: publish.yml on joaobenedetmachado/scrapit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapit_scraper-0.1.0.tar.gz
- Subject digest: f06c9fde4d362395fa79dcc574f614fdd87c1186bc533890841fd7977a48b13d
- Sigstore transparency entry: 1042906573
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: joaobenedetmachado/scrapit@baaf33fef9e2b6d287edbc43bf134ba299e0f245
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/joaobenedetmachado
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@baaf33fef9e2b6d287edbc43bf134ba299e0f245
- Trigger Event: release

File details

Details for the file scrapit_scraper-0.1.0-py3-none-any.whl.

File metadata

Download URL: scrapit_scraper-0.1.0-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 59.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scrapit_scraper-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d9396cd3f06e85f6ebc13bdcfd3aab3eb7877f596697e4584610b72c6eebcbf`
MD5	`60a2a040385b18e8440c1ab42fe38406`
BLAKE2b-256	`0ac2c0259d71fc533a5b660e1dbb0385074f9bdb35a089dcf8613580c6553f4d`

Provenance

The following attestation bundles were made for scrapit_scraper-0.1.0-py3-none-any.whl:

Publisher: publish.yml on joaobenedetmachado/scrapit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapit_scraper-0.1.0-py3-none-any.whl
- Subject digest: 9d9396cd3f06e85f6ebc13bdcfd3aab3eb7877f596697e4584610b72c6eebcbf
- Sigstore transparency entry: 1042906622
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: joaobenedetmachado/scrapit@baaf33fef9e2b6d287edbc43bf134ba299e0f245
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/joaobenedetmachado
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@baaf33fef9e2b6d287edbc43bf134ba299e0f245
- Trigger Event: release
