Scrape web pages with Playwright and export structured data.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

SandPaper

SandPaper scrapes web pages with Playwright and writes structured data to CSV, JSON, Excel, Parquet, or SQLite. It ships with an interactive CLI, a non-interactive CLI for scripting, an HTTP API, a small web UI, a visual selector picker, and a session recorder that replays as recipes.

Docs: Getting started · User guide · Installation · Contributing

Install

From PyPI:

pip install sandpaper-py
playwright install

With optional features:

pip install "sandpaper-py[excel,parquet,api,schedule]"

From source:

git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e ".[dev,all]"
playwright install

playwright install downloads the browser binaries the loader needs. The first scrape after a fresh install will fail without it.

Quick start

Interactive mode:

sandpaper

Single-shot non-interactive scrape:

sandpaper run --url https://quotes.toscrape.com -o quotes.csv

Multi-page with a template:

sandpaper run \
  --template "https://quotes.toscrape.com/page/{page}/" \
  --pages 1-5 \
  -o quotes.csv

Auto-paginate by following rel="next" links:

sandpaper run --url https://quotes.toscrape.com --auto-paginate -o quotes.csv

CSS selector extraction:

sandpaper run \
  --url https://quotes.toscrape.com \
  --extractor selector \
  --selectors '{"quote": "span.text", "author": "small.author"}' \
  -o quotes.json -f json

Concurrent multi-page:

sandpaper run \
  --template "https://example.com/page/{page}" \
  --pages 1-50 \
  --concurrency 4 --rate 2 \
  -o data.parquet -f parquet

Dry run (preview without writing):

sandpaper run --url https://example.com --dry-run

Programmatic API

from sandpaper_py import scrape, ScrapeConfig

cfg = ScrapeConfig(
    page_template="https://quotes.toscrape.com/page/{page}/",
    pages="1-3",
    output="quotes.csv",
    format="csv",
    threshold=10,
)
result = scrape(cfg)
print(result.rows, result.columns)

Or use the helpers:

from sandpaper_py import scrape_url, scrape_urls

scrape_url("https://example.com", output="example.csv")
scrape_urls(["https://a.com", "https://b.com"], output="combined.json", format="json")

Commands

Command	Purpose
`sandpaper`	Interactive prompt-driven scrape.
`sandpaper run`	Non-interactive scrape from flags.
`sandpaper watch --every <sec>`	Re-run on a fixed interval.
`sandpaper schedule --cron "..."`	Run on a cron schedule.
`sandpaper pick <url>`	Open a visual selector picker.
`sandpaper serve`	Start the HTTP API and web UI.
`sandpaper preset {list,save,show,delete}`	Manage per-site presets.
`sandpaper config {path,show,init}`	Manage the global config file.

Run sandpaper <command> --help for full options.

Output formats

-f csv (default), json, jsonl, excel, parquet, sqlite.

excel requires pip install "sandpaper-py[excel]". parquet requires [parquet].

Extractors

heuristic (default) groups text by tag-class and keeps any column with at least --threshold items and more than one unique value. Good for generic scrapes where the structure is unknown.
selector takes a JSON map of column to CSS selector and pulls one column per selector. Use sandpaper pick to build the map by clicking.

Presets

Saved per-site configurations live in the user config directory (see sandpaper config path). A preset name that matches a URL host is auto-suggested in interactive mode.

sandpaper preset save quotes.toscrape.com \
  --extractor selector \
  --selectors '{"quote": "span.text", "author": "small.author"}' \
  --threshold 1
sandpaper run --url https://quotes.toscrape.com --preset quotes.toscrape.com

Web UI and HTTP API

pip install "sandpaper-py[api]"
sandpaper serve

Open http://127.0.0.1:8000 for the form-based UI, or POST to /api/scrape:

curl -X POST http://127.0.0.1:8000/api/scrape \
  -H 'content-type: application/json' \
  -d '{"url": "https://quotes.toscrape.com", "format": "json"}'

Visual selector picker

sandpaper pick https://quotes.toscrape.com --save-preset quotes

A real browser opens with an overlay. Click any element inside one of the repeating items (a card, a row, a search result). SandPaper walks up the DOM, finds the row container, highlights every sibling row in green, and shows a side panel with ↑ Broader / ↓ Narrower controls if it picked the wrong level.

Each subsequent click captures a field. The selector is computed relative to the row container, so the same field is automatically pulled out of every other row. The side panel shows a live 5-row preview table that updates as you click. Per-field undo buttons remove a column. Press Esc to finish.

Output is a complete preset (row_selector + selectors). With --save-preset NAME, sandpaper writes it directly to your preset library:

sandpaper run --url https://quotes.toscrape.com --preset quotes -o quotes.json -f json

Or use the row-scoped extraction directly without going through the picker:

sandpaper run --url https://quotes.toscrape.com \
  -e selector \
  --row-selector "div.quote" \
  --selectors '{"text": "span.text", "author": "small.author"}' \
  -o quotes.json -f json

Row-scoped extraction guarantees per-row alignment: each row in the output corresponds to exactly one container element, no positional zipping across mismatched groups.

Watch and schedule

sandpaper watch --url https://example.com --every 600 -o feed.csv
sandpaper schedule --url https://example.com --cron "*/15 * * * *" -o feed.csv

schedule requires pip install "sandpaper-py[schedule]".

Recipes (record once, replay anywhere)

When a scrape needs interaction (login, search, paginate, follow), record what you do once:

sandpaper record https://example.com/search --output search.recipe.json

A real Chromium opens with a recorder toolbar. Type, click, and navigate normally — every action is captured. Click "Capture extract" to launch the pattern picker mid-recording (point at the row pattern + each field). Click "Save & finish" to write the recipe.

Replay it any time, with optional parameter overrides:

sandpaper run-recipe search.recipe.json --param query=laptops -o results.json

Recipes are JSON, declarative, and shareable. Full schema and action vocabulary in the User guide.

Detail-page join

The most common scraping pattern: list page → click each item → scrape detail → merge. Express it with one flag.

sandpaper run \
  --url https://news.example.com/articles \
  -e selector \
  --row-selector "article.post" \
  --selectors '{"title": "h2.title", "summary": "p.lede", "url": "a.title-link@href"}' \
  --follow url \
  --follow-selectors '{"body": "div.body", "author": "span.author"}' \
  --follow-concurrency 4 \
  -o articles.json -f json

Each row from the list page is enriched with the detail-page fields, joined by the URL extracted via the selector@attr syntax. Output is one row per article with both summary and full body.

selector@attr extracts the named attribute (most often href or src) instead of text content.
Relative URLs are resolved against the list page's URL automatically; --follow-url-prefix https://example.com overrides the base.
--follow-concurrency N runs detail fetches in parallel with per-thread persistent loaders.
Failed detail pages are skipped silently by default. --follow-fail-on-error aborts the run on the first failure.

Cleaner output

The heuristic extractor is record-aware. On any page that contains a repeating row pattern (product cards, company listings, search results) or an HTML <table>, SandPaper finds the row container and pulls one row per repeating element. Each row is a clean {column: value} record. No positional zipping across mismatched groups, no tag-class key noise, no filter sidebar leaking into the data.

Default pipeline:

Strip script, style, nav, header, footer, form, aside, and any element whose class/id/role/aria-label matches a noise keyword (cookie, popup, sidebar, subscribe, ...).
If a <table> with >= threshold/2 rows exists, extract its headers and cells directly.
Otherwise find the parent whose direct children form the largest, most homogeneous repeating group. Each child becomes one record.
In each record, only true leaf tags publish text (so <a><h2>Name</h2></a> yields one column, not two).
Post-clean: drop empty/single-value columns, merge columns with identical value sequences.

Knobs that control output quality:

--min-text-length 2 --max-text-length 800
--near-dup-ratio 0.9
--max-fields-per-record 20
--no-prefer-records          # fall back to the flat heuristic
--block-resources image,media,font   # default; speeds up scrapes
--no-dismiss-overlays        # disable cookie/popup auto-dismiss

sandpaper run --url https://example.com \
  --min-text-length 2 --max-text-length 800 --near-dup-ratio 0.9 \
  -o clean.json -f json --normalize-keys --quality-report

--normalize-keys slugifies JSON keys, --sort-keys makes output deterministic, --keep-empty-columns retains all-empty columns, --csv-safe neutralizes spreadsheet formula injection. --typed runs schema coercion before Parquet/Excel/SQLite export.

Data quality controls

Post-processing runs after extraction and before export. Use these to shape the output without writing Python:

sandpaper run --url ... \
  --rename-columns '{"raw_name":"name","raw_price":"price"}' \
  --keep-columns name,price,url \
  --required-columns name,price \
  --schema-lock-after 5 \
  --null-policy null \
  -o out.json -f json

--rename-columns JSON — pre-export rename map. Collisions get a numeric suffix.
--keep-columns / --drop-columns — whitelist or blacklist (whitelist wins).
--required-columns — drop rows that are missing a value in any of these columns.
--no-trim — disable whitespace trimming on cell values (default on).
--schema-lock-after N — infer schema from the first N rows; drop columns absent there from the rest.
--null-policy {empty,null,skip} — JSON-only: empty keeps "", null writes null, skip omits the key from the row.

In-memory results

from sandpaper_py import scrape_url

result = scrape_url("https://quotes.toscrape.com")
df = result.to_pandas(typed=True)
rows = result.records()

result.to_polars() is also available if Polars is installed.

Caching, throttling, anti-block

sandpaper run --url https://example.com \
  --cache-dir .sp-cache --cache-ttl 3600 \
  --rotate-user-agents --random-delay 800 --rate 2

The cache is a per-URL SHA1-keyed local HTML store; warm hits skip the browser entirely. UA rotation picks from a built-in pool. --random-delay adds 0..N ms uniform jitter before each request.

Provenance and quality report

--provenance writes a <output>.meta.json sidecar with source URLs, timestamps, options, and selectors. --quality-report writes <output>.quality.json with row counts, inferred types, null ratios, and per-column samples.

Plugin system

Custom exporters, extractors, and loaders are picked up from entry points:

[project.entry-points."sandpaper.exporters"]
yaml = "your_pkg.yaml_exporter:YAMLExporter"

[project.entry-points."sandpaper.extractors"]
microdata = "your_pkg.microdata:MicrodataExtractor"

Implement the matching protocol (Exporter, Extractor, or PageLoader from sandpaper_py) and the format becomes available as -f yaml, -e microdata, etc.

Project layout

src/sandpaper_py/
  cli.py               CLI entry (Click)
  menu.py              Interactive prompts
  core.py              scrape() and helpers
  config.py            Config file handling
  presets.py           Per-site presets
  pagination.py        URL templates and next-link detection
  schema.py            Type inference and column stats
  provenance.py        Sidecar metadata writer
  plugins.py           Entry-point lookup
  visual.py            Visual selector picker
  api.py               HTTP API and web UI server
  watch.py             Watch and schedule modes
  loaders/             Page loaders (Playwright)
  extractors/          Heuristic and selector extractors
  exporters/           CSV, JSON, JSONL, Excel, Parquet, SQLite
tests/
.github/workflows/     CI and release

Development

pip install -e ".[dev,all]"
playwright install
pre-commit install
ruff check src tests
ruff format src tests
mypy src/sandpaper_py
pytest -v

License

MIT, see LICENSE.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

aaryandadu

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.7.3

Apr 29, 2026

This version

0.7.2

Apr 29, 2026

0.7.1

Apr 29, 2026

0.1.0

Aug 23, 2025

0.0.5

Aug 23, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sandpaper_py-0.7.2.tar.gz (74.7 kB view details)

Uploaded Apr 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sandpaper_py-0.7.2-py3-none-any.whl (77.0 kB view details)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file sandpaper_py-0.7.2.tar.gz.

File metadata

Download URL: sandpaper_py-0.7.2.tar.gz
Upload date: Apr 29, 2026
Size: 74.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sandpaper_py-0.7.2.tar.gz
Algorithm	Hash digest
SHA256	`5390baca879752c456960a63b7d464b0349e5cadd7470f4120327f09d14c47ba`
MD5	`c9cb2fbe0150e4d4ebec31b8c4d2d777`
BLAKE2b-256	`0de2eb7b41d8b87d745541e114913264dfeef1bacd089411d913c3509af31bdc`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sandpaper_py-0.7.2.tar.gz:

Publisher: release.yml on Aaryan-Dadu/SandPaper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sandpaper_py-0.7.2.tar.gz
- Subject digest: 5390baca879752c456960a63b7d464b0349e5cadd7470f4120327f09d14c47ba
- Sigstore transparency entry: 1403314570
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: Aaryan-Dadu/SandPaper@4d11d30a4bc474985c35873fa0bdb971221ba3fd
- Branch / Tag: refs/tags/v0.7.2
- Owner: https://github.com/Aaryan-Dadu
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4d11d30a4bc474985c35873fa0bdb971221ba3fd
- Trigger Event: push

File details

Details for the file sandpaper_py-0.7.2-py3-none-any.whl.

File metadata

Download URL: sandpaper_py-0.7.2-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 77.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sandpaper_py-0.7.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d7a205045195ae180392c26ab81d6ed6c9c4b035213db26ae70ec1d9ce8e2fd6`
MD5	`28d9df99d17c624271b33f76807ff03e`
BLAKE2b-256	`f7b189220e3e08ec56c35fc9b54408cbf599d2d6b808e677e214369bc3c17002`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sandpaper_py-0.7.2-py3-none-any.whl:

Publisher: release.yml on Aaryan-Dadu/SandPaper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sandpaper_py-0.7.2-py3-none-any.whl
- Subject digest: d7a205045195ae180392c26ab81d6ed6c9c4b035213db26ae70ec1d9ce8e2fd6
- Sigstore transparency entry: 1403314662
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: Aaryan-Dadu/SandPaper@4d11d30a4bc474985c35873fa0bdb971221ba3fd
- Branch / Tag: refs/tags/v0.7.2
- Owner: https://github.com/Aaryan-Dadu
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4d11d30a4bc474985c35873fa0bdb971221ba3fd
- Trigger Event: push

sandpaper-py 0.7.2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

SandPaper

Install

Quick start

Programmatic API

Commands

Output formats

Extractors

Presets

Web UI and HTTP API

Visual selector picker

Watch and schedule

Recipes (record once, replay anywhere)

Detail-page join

Cleaner output

Data quality controls

In-memory results

Caching, throttling, anti-block

Provenance and quality report

Plugin system

Project layout

Development

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance