Skip to main content

Scrape web pages with Playwright and export structured data.

Project description

SandPaper

SandPaper scrapes web pages with Playwright and writes structured data to CSV, JSON, Excel, Parquet, or SQLite. It ships with an interactive CLI, a non-interactive CLI for scripting, an HTTP API, a small web UI, a visual selector picker, and a session recorder that replays as recipes.

Docs: Getting started · User guide · Installation · Contributing

Install

From PyPI:

pip install sandpaper-py
playwright install

With optional features:

pip install "sandpaper-py[excel,parquet,api,schedule]"

From source:

git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e ".[dev,all]"
playwright install

playwright install downloads the browser binaries the loader needs. The first scrape after a fresh install will fail without it.

Quick start

Interactive mode:

sandpaper

Single-shot non-interactive scrape:

sandpaper run --url https://quotes.toscrape.com -o quotes.csv

Multi-page with a template:

sandpaper run \
  --template "https://quotes.toscrape.com/page/{page}/" \
  --pages 1-5 \
  -o quotes.csv

Auto-paginate by following rel="next" links:

sandpaper run --url https://quotes.toscrape.com --auto-paginate -o quotes.csv

CSS selector extraction:

sandpaper run \
  --url https://quotes.toscrape.com \
  --extractor selector \
  --selectors '{"quote": "span.text", "author": "small.author"}' \
  -o quotes.json -f json

Concurrent multi-page:

sandpaper run \
  --template "https://example.com/page/{page}" \
  --pages 1-50 \
  --concurrency 4 --rate 2 \
  -o data.parquet -f parquet

Dry run (preview without writing):

sandpaper run --url https://example.com --dry-run

Programmatic API

from sandpaper_py import scrape, ScrapeConfig

cfg = ScrapeConfig(
    page_template="https://quotes.toscrape.com/page/{page}/",
    pages="1-3",
    output="quotes.csv",
    format="csv",
    threshold=10,
)
result = scrape(cfg)
print(result.rows, result.columns)

Or use the helpers:

from sandpaper_py import scrape_url, scrape_urls

scrape_url("https://example.com", output="example.csv")
scrape_urls(["https://a.com", "https://b.com"], output="combined.json", format="json")

Commands

Command Purpose
sandpaper Interactive prompt-driven scrape.
sandpaper run Non-interactive scrape from flags.
sandpaper watch --every <sec> Re-run on a fixed interval.
sandpaper schedule --cron "..." Run on a cron schedule.
sandpaper pick <url> Open a visual selector picker.
sandpaper serve Start the HTTP API and web UI.
sandpaper preset {list,save,show,delete} Manage per-site presets.
sandpaper config {path,show,init} Manage the global config file.

Run sandpaper <command> --help for full options.

Output formats

-f csv (default), json, jsonl, excel, parquet, sqlite.

excel requires pip install "sandpaper-py[excel]". parquet requires [parquet].

Extractors

  • heuristic (default) groups text by tag-class and keeps any column with at least --threshold items and more than one unique value. Good for generic scrapes where the structure is unknown.
  • selector takes a JSON map of column to CSS selector and pulls one column per selector. Use sandpaper pick to build the map by clicking.

Presets

Saved per-site configurations live in the user config directory (see sandpaper config path). A preset name that matches a URL host is auto-suggested in interactive mode.

sandpaper preset save quotes.toscrape.com \
  --extractor selector \
  --selectors '{"quote": "span.text", "author": "small.author"}' \
  --threshold 1
sandpaper run --url https://quotes.toscrape.com --preset quotes.toscrape.com

Web UI and HTTP API

pip install "sandpaper-py[api]"
sandpaper serve

Open http://127.0.0.1:8000 for the form-based UI, or POST to /api/scrape:

curl -X POST http://127.0.0.1:8000/api/scrape \
  -H 'content-type: application/json' \
  -d '{"url": "https://quotes.toscrape.com", "format": "json"}'

Visual selector picker

sandpaper pick https://quotes.toscrape.com --save-preset quotes

A real browser opens with an overlay. Click any element inside one of the repeating items (a card, a row, a search result). SandPaper walks up the DOM, finds the row container, highlights every sibling row in green, and shows a side panel with &uarr; Broader / &darr; Narrower controls if it picked the wrong level.

Each subsequent click captures a field. The selector is computed relative to the row container, so the same field is automatically pulled out of every other row. The side panel shows a live 5-row preview table that updates as you click. Per-field undo buttons remove a column. Press Esc to finish.

Output is a complete preset (row_selector + selectors). With --save-preset NAME, sandpaper writes it directly to your preset library:

sandpaper run --url https://quotes.toscrape.com --preset quotes -o quotes.json -f json

Or use the row-scoped extraction directly without going through the picker:

sandpaper run --url https://quotes.toscrape.com \
  -e selector \
  --row-selector "div.quote" \
  --selectors '{"text": "span.text", "author": "small.author"}' \
  -o quotes.json -f json

Row-scoped extraction guarantees per-row alignment: each row in the output corresponds to exactly one container element, no positional zipping across mismatched groups.

Watch and schedule

sandpaper watch --url https://example.com --every 600 -o feed.csv
sandpaper schedule --url https://example.com --cron "*/15 * * * *" -o feed.csv

schedule requires pip install "sandpaper-py[schedule]".

Recipes (record once, replay anywhere)

When a scrape needs interaction (login, search, paginate, follow), record what you do once:

sandpaper record https://example.com/search --output search.recipe.json

A real Chromium opens with a recorder toolbar. Type, click, and navigate normally — every action is captured. Click "Capture extract" to launch the pattern picker mid-recording (point at the row pattern + each field). Click "Save & finish" to write the recipe.

Replay it any time, with optional parameter overrides:

sandpaper run-recipe search.recipe.json --param query=laptops -o results.json

Recipes are JSON, declarative, and shareable. Full schema and action vocabulary in the User guide.

Detail-page join

The most common scraping pattern: list page → click each item → scrape detail → merge. Express it with one flag.

sandpaper run \
  --url https://news.example.com/articles \
  -e selector \
  --row-selector "article.post" \
  --selectors '{"title": "h2.title", "summary": "p.lede", "url": "a.title-link@href"}' \
  --follow url \
  --follow-selectors '{"body": "div.body", "author": "span.author"}' \
  --follow-concurrency 4 \
  -o articles.json -f json

Each row from the list page is enriched with the detail-page fields, joined by the URL extracted via the selector@attr syntax. Output is one row per article with both summary and full body.

  • selector@attr extracts the named attribute (most often href or src) instead of text content.
  • Relative URLs are resolved against the list page's URL automatically; --follow-url-prefix https://example.com overrides the base.
  • --follow-concurrency N runs detail fetches in parallel with per-thread persistent loaders.
  • Failed detail pages are skipped silently by default. --follow-fail-on-error aborts the run on the first failure.

Cleaner output

The heuristic extractor is record-aware. On any page that contains a repeating row pattern (product cards, company listings, search results) or an HTML <table>, SandPaper finds the row container and pulls one row per repeating element. Each row is a clean {column: value} record. No positional zipping across mismatched groups, no tag-class key noise, no filter sidebar leaking into the data.

Default pipeline:

  1. Strip script, style, nav, header, footer, form, aside, and any element whose class/id/role/aria-label matches a noise keyword (cookie, popup, sidebar, subscribe, ...).
  2. If a <table> with >= threshold/2 rows exists, extract its headers and cells directly.
  3. Otherwise find the parent whose direct children form the largest, most homogeneous repeating group. Each child becomes one record.
  4. In each record, only true leaf tags publish text (so <a><h2>Name</h2></a> yields one column, not two).
  5. Post-clean: drop empty/single-value columns, merge columns with identical value sequences.

Knobs that control output quality:

--min-text-length 2 --max-text-length 800
--near-dup-ratio 0.9
--max-fields-per-record 20
--no-prefer-records          # fall back to the flat heuristic
--block-resources image,media,font   # default; speeds up scrapes
--no-dismiss-overlays        # disable cookie/popup auto-dismiss
sandpaper run --url https://example.com \
  --min-text-length 2 --max-text-length 800 --near-dup-ratio 0.9 \
  -o clean.json -f json --normalize-keys --quality-report

--normalize-keys slugifies JSON keys, --sort-keys makes output deterministic, --keep-empty-columns retains all-empty columns, --csv-safe neutralizes spreadsheet formula injection. --typed runs schema coercion before Parquet/Excel/SQLite export.

Data quality controls

Post-processing runs after extraction and before export. Use these to shape the output without writing Python:

sandpaper run --url ... \
  --rename-columns '{"raw_name":"name","raw_price":"price"}' \
  --keep-columns name,price,url \
  --required-columns name,price \
  --schema-lock-after 5 \
  --null-policy null \
  -o out.json -f json
  • --rename-columns JSON — pre-export rename map. Collisions get a numeric suffix.
  • --keep-columns / --drop-columns — whitelist or blacklist (whitelist wins).
  • --required-columns — drop rows that are missing a value in any of these columns.
  • --no-trim — disable whitespace trimming on cell values (default on).
  • --schema-lock-after N — infer schema from the first N rows; drop columns absent there from the rest.
  • --null-policy {empty,null,skip} — JSON-only: empty keeps "", null writes null, skip omits the key from the row.

In-memory results

from sandpaper_py import scrape_url

result = scrape_url("https://quotes.toscrape.com")
df = result.to_pandas(typed=True)
rows = result.records()

result.to_polars() is also available if Polars is installed.

Caching, throttling, anti-block

sandpaper run --url https://example.com \
  --cache-dir .sp-cache --cache-ttl 3600 \
  --rotate-user-agents --random-delay 800 --rate 2

The cache is a per-URL SHA1-keyed local HTML store; warm hits skip the browser entirely. UA rotation picks from a built-in pool. --random-delay adds 0..N ms uniform jitter before each request.

Provenance and quality report

--provenance writes a <output>.meta.json sidecar with source URLs, timestamps, options, and selectors. --quality-report writes <output>.quality.json with row counts, inferred types, null ratios, and per-column samples.

Plugin system

Custom exporters, extractors, and loaders are picked up from entry points:

[project.entry-points."sandpaper.exporters"]
yaml = "your_pkg.yaml_exporter:YAMLExporter"

[project.entry-points."sandpaper.extractors"]
microdata = "your_pkg.microdata:MicrodataExtractor"

Implement the matching protocol (Exporter, Extractor, or PageLoader from sandpaper_py) and the format becomes available as -f yaml, -e microdata, etc.

Project layout

src/sandpaper_py/
  cli.py               CLI entry (Click)
  menu.py              Interactive prompts
  core.py              scrape() and helpers
  config.py            Config file handling
  presets.py           Per-site presets
  pagination.py        URL templates and next-link detection
  schema.py            Type inference and column stats
  provenance.py        Sidecar metadata writer
  plugins.py           Entry-point lookup
  visual.py            Visual selector picker
  api.py               HTTP API and web UI server
  watch.py             Watch and schedule modes
  loaders/             Page loaders (Playwright)
  extractors/          Heuristic and selector extractors
  exporters/           CSV, JSON, JSONL, Excel, Parquet, SQLite
tests/
.github/workflows/     CI and release

Development

pip install -e ".[dev,all]"
playwright install
pre-commit install
ruff check src tests
ruff format src tests
mypy src/sandpaper_py
pytest -v

License

MIT, see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sandpaper_py-0.7.2.tar.gz (74.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sandpaper_py-0.7.2-py3-none-any.whl (77.0 kB view details)

Uploaded Python 3

File details

Details for the file sandpaper_py-0.7.2.tar.gz.

File metadata

  • Download URL: sandpaper_py-0.7.2.tar.gz
  • Upload date:
  • Size: 74.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sandpaper_py-0.7.2.tar.gz
Algorithm Hash digest
SHA256 5390baca879752c456960a63b7d464b0349e5cadd7470f4120327f09d14c47ba
MD5 c9cb2fbe0150e4d4ebec31b8c4d2d777
BLAKE2b-256 0de2eb7b41d8b87d745541e114913264dfeef1bacd089411d913c3509af31bdc

See more details on using hashes here.

Provenance

The following attestation bundles were made for sandpaper_py-0.7.2.tar.gz:

Publisher: release.yml on Aaryan-Dadu/SandPaper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sandpaper_py-0.7.2-py3-none-any.whl.

File metadata

  • Download URL: sandpaper_py-0.7.2-py3-none-any.whl
  • Upload date:
  • Size: 77.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sandpaper_py-0.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d7a205045195ae180392c26ab81d6ed6c9c4b035213db26ae70ec1d9ce8e2fd6
MD5 28d9df99d17c624271b33f76807ff03e
BLAKE2b-256 f7b189220e3e08ec56c35fc9b54408cbf599d2d6b808e677e214369bc3c17002

See more details on using hashes here.

Provenance

The following attestation bundles were made for sandpaper_py-0.7.2-py3-none-any.whl:

Publisher: release.yml on Aaryan-Dadu/SandPaper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page