Scrape web pages with Playwright and export structured data.
Project description
SandPaper
SandPaper scrapes web pages with Playwright and writes structured data to CSV, JSON, Excel, Parquet, or SQLite. It ships with an interactive CLI, a non-interactive CLI for scripting, an HTTP API, a small web UI, a visual selector picker, and a session recorder that replays as recipes.
Docs: Getting started · User guide · Installation · Contributing
Install
From PyPI:
pip install sandpaper-py
playwright install
With optional features:
pip install "sandpaper-py[excel,parquet,api,schedule]"
From source:
git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e ".[dev,all]"
playwright install
playwright install downloads the browser binaries the loader needs. The first scrape after a fresh install will fail without it.
Quick start
Interactive mode:
sandpaper
Single-shot non-interactive scrape:
sandpaper run --url https://quotes.toscrape.com -o quotes.csv
Multi-page with a template:
sandpaper run \
--template "https://quotes.toscrape.com/page/{page}/" \
--pages 1-5 \
-o quotes.csv
Auto-paginate by following rel="next" links:
sandpaper run --url https://quotes.toscrape.com --auto-paginate -o quotes.csv
CSS selector extraction:
sandpaper run \
--url https://quotes.toscrape.com \
--extractor selector \
--selectors '{"quote": "span.text", "author": "small.author"}' \
-o quotes.json -f json
Concurrent multi-page:
sandpaper run \
--template "https://example.com/page/{page}" \
--pages 1-50 \
--concurrency 4 --rate 2 \
-o data.parquet -f parquet
Dry run (preview without writing):
sandpaper run --url https://example.com --dry-run
Programmatic API
from sandpaper_py import scrape, ScrapeConfig
cfg = ScrapeConfig(
page_template="https://quotes.toscrape.com/page/{page}/",
pages="1-3",
output="quotes.csv",
format="csv",
threshold=10,
)
result = scrape(cfg)
print(result.rows, result.columns)
Or use the helpers:
from sandpaper_py import scrape_url, scrape_urls
scrape_url("https://example.com", output="example.csv")
scrape_urls(["https://a.com", "https://b.com"], output="combined.json", format="json")
Commands
| Command | Purpose |
|---|---|
sandpaper |
Interactive prompt-driven scrape. |
sandpaper run |
Non-interactive scrape from flags. |
sandpaper watch --every <sec> |
Re-run on a fixed interval. |
sandpaper schedule --cron "..." |
Run on a cron schedule. |
sandpaper pick <url> |
Open a visual selector picker. |
sandpaper serve |
Start the HTTP API and web UI. |
sandpaper preset {list,save,show,delete} |
Manage per-site presets. |
sandpaper config {path,show,init} |
Manage the global config file. |
Run sandpaper <command> --help for full options.
Output formats
-f csv (default), json, jsonl, excel, parquet, sqlite.
excel requires pip install "sandpaper-py[excel]". parquet requires [parquet].
Extractors
heuristic(default) groups text bytag-classand keeps any column with at least--thresholditems and more than one unique value. Good for generic scrapes where the structure is unknown.selectortakes a JSON map of column to CSS selector and pulls one column per selector. Usesandpaper pickto build the map by clicking.
Presets
Saved per-site configurations live in the user config directory (see sandpaper config path). A preset name that matches a URL host is auto-suggested in interactive mode.
sandpaper preset save quotes.toscrape.com \
--extractor selector \
--selectors '{"quote": "span.text", "author": "small.author"}' \
--threshold 1
sandpaper run --url https://quotes.toscrape.com --preset quotes.toscrape.com
Web UI and HTTP API
pip install "sandpaper-py[api]"
sandpaper serve
Open http://127.0.0.1:8000 for the form-based UI, or POST to /api/scrape:
curl -X POST http://127.0.0.1:8000/api/scrape \
-H 'content-type: application/json' \
-d '{"url": "https://quotes.toscrape.com", "format": "json"}'
Visual selector picker
sandpaper pick https://quotes.toscrape.com --save-preset quotes
A real browser opens with an overlay. Click any element inside one of the repeating items (a card, a row, a search result). SandPaper walks up the DOM, finds the row container, highlights every sibling row in green, and shows a side panel with ↑ Broader / ↓ Narrower controls if it picked the wrong level.
Each subsequent click captures a field. The selector is computed relative to the row container, so the same field is automatically pulled out of every other row. The side panel shows a live 5-row preview table that updates as you click. Per-field undo buttons remove a column. Press Esc to finish.
Output is a complete preset (row_selector + selectors). With --save-preset NAME, sandpaper writes it directly to your preset library:
sandpaper run --url https://quotes.toscrape.com --preset quotes -o quotes.json -f json
Or use the row-scoped extraction directly without going through the picker:
sandpaper run --url https://quotes.toscrape.com \
-e selector \
--row-selector "div.quote" \
--selectors '{"text": "span.text", "author": "small.author"}' \
-o quotes.json -f json
Row-scoped extraction guarantees per-row alignment: each row in the output corresponds to exactly one container element, no positional zipping across mismatched groups.
Watch and schedule
sandpaper watch --url https://example.com --every 600 -o feed.csv
sandpaper schedule --url https://example.com --cron "*/15 * * * *" -o feed.csv
schedule requires pip install "sandpaper-py[schedule]".
Recipes (record once, replay anywhere)
When a scrape needs interaction (login, search, paginate, follow), record what you do once:
sandpaper record https://example.com/search --output search.recipe.json
A real Chromium opens with a recorder toolbar. Type, click, and navigate normally — every action is captured. Click "Capture extract" to launch the pattern picker mid-recording (point at the row pattern + each field). Click "Save & finish" to write the recipe.
Replay it any time, with optional parameter overrides:
sandpaper run-recipe search.recipe.json --param query=laptops -o results.json
Recipes are JSON, declarative, and shareable. Full schema and action vocabulary in the User guide.
Detail-page join
The most common scraping pattern: list page → click each item → scrape detail → merge. Express it with one flag.
sandpaper run \
--url https://news.example.com/articles \
-e selector \
--row-selector "article.post" \
--selectors '{"title": "h2.title", "summary": "p.lede", "url": "a.title-link@href"}' \
--follow url \
--follow-selectors '{"body": "div.body", "author": "span.author"}' \
--follow-concurrency 4 \
-o articles.json -f json
Each row from the list page is enriched with the detail-page fields, joined by the URL extracted via the selector@attr syntax. Output is one row per article with both summary and full body.
selector@attrextracts the named attribute (most oftenhreforsrc) instead of text content.- Relative URLs are resolved against the list page's URL automatically;
--follow-url-prefix https://example.comoverrides the base. --follow-concurrency Nruns detail fetches in parallel with per-thread persistent loaders.- Failed detail pages are skipped silently by default.
--follow-fail-on-erroraborts the run on the first failure.
Cleaner output
The heuristic extractor is record-aware. On any page that contains a repeating row pattern (product cards, company listings, search results) or an HTML <table>, SandPaper finds the row container and pulls one row per repeating element. Each row is a clean {column: value} record. No positional zipping across mismatched groups, no tag-class key noise, no filter sidebar leaking into the data.
Default pipeline:
- Strip
script,style,nav,header,footer,form,aside, and any element whose class/id/role/aria-label matches a noise keyword (cookie,popup,sidebar,subscribe, ...). - If a
<table>with>= threshold/2rows exists, extract its headers and cells directly. - Otherwise find the parent whose direct children form the largest, most homogeneous repeating group. Each child becomes one record.
- In each record, only true leaf tags publish text (so
<a><h2>Name</h2></a>yields one column, not two). - Post-clean: drop empty/single-value columns, merge columns with identical value sequences.
Knobs that control output quality:
--min-text-length 2 --max-text-length 800
--near-dup-ratio 0.9
--max-fields-per-record 20
--no-prefer-records # fall back to the flat heuristic
--block-resources image,media,font # default; speeds up scrapes
--no-dismiss-overlays # disable cookie/popup auto-dismiss
sandpaper run --url https://example.com \
--min-text-length 2 --max-text-length 800 --near-dup-ratio 0.9 \
-o clean.json -f json --normalize-keys --quality-report
--normalize-keys slugifies JSON keys, --sort-keys makes output deterministic, --keep-empty-columns retains all-empty columns, --csv-safe neutralizes spreadsheet formula injection. --typed runs schema coercion before Parquet/Excel/SQLite export.
Data quality controls
Post-processing runs after extraction and before export. Use these to shape the output without writing Python:
sandpaper run --url ... \
--rename-columns '{"raw_name":"name","raw_price":"price"}' \
--keep-columns name,price,url \
--required-columns name,price \
--schema-lock-after 5 \
--null-policy null \
-o out.json -f json
--rename-columns JSON— pre-export rename map. Collisions get a numeric suffix.--keep-columns / --drop-columns— whitelist or blacklist (whitelist wins).--required-columns— drop rows that are missing a value in any of these columns.--no-trim— disable whitespace trimming on cell values (default on).--schema-lock-after N— infer schema from the first N rows; drop columns absent there from the rest.--null-policy {empty,null,skip}— JSON-only:emptykeeps"",nullwritesnull,skipomits the key from the row.
In-memory results
from sandpaper_py import scrape_url
result = scrape_url("https://quotes.toscrape.com")
df = result.to_pandas(typed=True)
rows = result.records()
result.to_polars() is also available if Polars is installed.
Caching, throttling, anti-block
sandpaper run --url https://example.com \
--cache-dir .sp-cache --cache-ttl 3600 \
--rotate-user-agents --random-delay 800 --rate 2
The cache is a per-URL SHA1-keyed local HTML store; warm hits skip the browser entirely. UA rotation picks from a built-in pool. --random-delay adds 0..N ms uniform jitter before each request.
Provenance and quality report
--provenance writes a <output>.meta.json sidecar with source URLs, timestamps, options, and selectors. --quality-report writes <output>.quality.json with row counts, inferred types, null ratios, and per-column samples.
Plugin system
Custom exporters, extractors, and loaders are picked up from entry points:
[project.entry-points."sandpaper.exporters"]
yaml = "your_pkg.yaml_exporter:YAMLExporter"
[project.entry-points."sandpaper.extractors"]
microdata = "your_pkg.microdata:MicrodataExtractor"
Implement the matching protocol (Exporter, Extractor, or PageLoader from sandpaper_py) and the format becomes available as -f yaml, -e microdata, etc.
Project layout
src/sandpaper_py/
cli.py CLI entry (Click)
menu.py Interactive prompts
core.py scrape() and helpers
config.py Config file handling
presets.py Per-site presets
pagination.py URL templates and next-link detection
schema.py Type inference and column stats
provenance.py Sidecar metadata writer
plugins.py Entry-point lookup
visual.py Visual selector picker
api.py HTTP API and web UI server
watch.py Watch and schedule modes
loaders/ Page loaders (Playwright)
extractors/ Heuristic and selector extractors
exporters/ CSV, JSON, JSONL, Excel, Parquet, SQLite
tests/
.github/workflows/ CI and release
Development
pip install -e ".[dev,all]"
playwright install
pre-commit install
ruff check src tests
ruff format src tests
mypy src/sandpaper_py
pytest -v
License
MIT, see LICENSE.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sandpaper_py-0.7.2.tar.gz.
File metadata
- Download URL: sandpaper_py-0.7.2.tar.gz
- Upload date:
- Size: 74.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5390baca879752c456960a63b7d464b0349e5cadd7470f4120327f09d14c47ba
|
|
| MD5 |
c9cb2fbe0150e4d4ebec31b8c4d2d777
|
|
| BLAKE2b-256 |
0de2eb7b41d8b87d745541e114913264dfeef1bacd089411d913c3509af31bdc
|
Provenance
The following attestation bundles were made for sandpaper_py-0.7.2.tar.gz:
Publisher:
release.yml on Aaryan-Dadu/SandPaper
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sandpaper_py-0.7.2.tar.gz -
Subject digest:
5390baca879752c456960a63b7d464b0349e5cadd7470f4120327f09d14c47ba - Sigstore transparency entry: 1403314570
- Sigstore integration time:
-
Permalink:
Aaryan-Dadu/SandPaper@4d11d30a4bc474985c35873fa0bdb971221ba3fd -
Branch / Tag:
refs/tags/v0.7.2 - Owner: https://github.com/Aaryan-Dadu
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@4d11d30a4bc474985c35873fa0bdb971221ba3fd -
Trigger Event:
push
-
Statement type:
File details
Details for the file sandpaper_py-0.7.2-py3-none-any.whl.
File metadata
- Download URL: sandpaper_py-0.7.2-py3-none-any.whl
- Upload date:
- Size: 77.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d7a205045195ae180392c26ab81d6ed6c9c4b035213db26ae70ec1d9ce8e2fd6
|
|
| MD5 |
28d9df99d17c624271b33f76807ff03e
|
|
| BLAKE2b-256 |
f7b189220e3e08ec56c35fc9b54408cbf599d2d6b808e677e214369bc3c17002
|
Provenance
The following attestation bundles were made for sandpaper_py-0.7.2-py3-none-any.whl:
Publisher:
release.yml on Aaryan-Dadu/SandPaper
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sandpaper_py-0.7.2-py3-none-any.whl -
Subject digest:
d7a205045195ae180392c26ab81d6ed6c9c4b035213db26ae70ec1d9ce8e2fd6 - Sigstore transparency entry: 1403314662
- Sigstore integration time:
-
Permalink:
Aaryan-Dadu/SandPaper@4d11d30a4bc474985c35873fa0bdb971221ba3fd -
Branch / Tag:
refs/tags/v0.7.2 - Owner: https://github.com/Aaryan-Dadu
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@4d11d30a4bc474985c35873fa0bdb971221ba3fd -
Trigger Event:
push
-
Statement type: