SandPaper is a package for scraping web pages with Playwright and exporting structured data to CSV.

SandPaper-py

SandPaper is a command-line web scraping tool that extracts structured data from web pages and exports it to CSV. Its interactive CLI supports single-page and multi-page scraping, with pagination driven by path variables, query parameters, or a custom URL list. Pages are loaded through Playwright browser automation, with automatic scrolling, custom headers, and encoding options, so data on dynamic, JavaScript-rendered websites can be collected into organized datasets.

Features

Interactive CLI: Terminal prompts that guide you through the scraping process
Single & Multi-Page Scraping: Extract data from an individual page or from multiple pages with pagination support
Flexible URL Formats: Support for path variables, query parameters, and custom URL lists
Browser Automation: Uses Playwright for JavaScript-rendered content scraping
Automatic Scrolling: Handles infinite scroll and dynamic content loading
Custom Headers: Configure request headers for different websites
Encoding Support: Handle various character encodings (UTF-8, ISO-8859-1, etc.)
Data Filtering: Filter data based on minimum element thresholds
CSV Export: Clean, organized data export with customizable filenames
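
The filtering and export features above can be pictured with a small standard-library sketch. It is not SandPaper's implementation (the package lists pandas as its dependency for this step); the function names `filter_columns` and `write_csv` and the sample data are hypothetical, and "minimum element threshold" is assumed to mean the number of non-empty values a column must hold to be kept.

```python
import csv
import io
from itertools import zip_longest


def filter_columns(columns, threshold=10):
    """Keep only columns holding at least `threshold` non-empty values."""
    return {
        name: values
        for name, values in columns.items()
        if sum(1 for v in values if v not in (None, "")) >= threshold
    }


def write_csv(columns, stream):
    """Write columns of unequal length as CSV rows, padding short ones with ''."""
    writer = csv.writer(stream)
    writer.writerow(columns.keys())
    writer.writerows(zip_longest(*columns.values(), fillvalue=""))


# Hypothetical scraped data: two well-populated columns and one sparse one.
scraped = {
    "quote": [f"quote {i}" for i in range(12)],
    "author": [f"author {i}" for i in range(11)],
    "banner": ["Top ten quotes"],
}

kept = filter_columns(scraped, threshold=10)  # drops the sparse "banner" column
buffer = io.StringIO()
write_csv(kept, buffer)
print(buffer.getvalue())
```

Under this reading, a higher threshold discards one-off elements such as banners and navigation labels, while a lower one keeps sparse columns at the cost of noisier output.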

Installation

From PyPI (Recommended)

pip install sandpaper-py

From Source

git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e .

Quick Start

Command Line Usage

After installation, launch the interactive CLI:

sandpaper

This will start an interactive session that guides you through the scraping process.

Programmatic Usage

from sandpaper_py import scraper

# Single page scraping
result = scraper(
    mode="Single Web Page",
    filename="output.csv",
    base_url="https://example.com",
    headers="Default",
    encoding="utf-8",
    filter_threshold=10,
    intial_page=1,
    final_page=1,
    url_list=[]
)

Usage Examples

1. Single Page Scraping

sandpaper
# Choose: Single Web Page
# Enter URL: https://quotes.toscrape.com
# Output: quotes.csv

2. Multi-Page Scraping with Path Variables

sandpaper
# Choose: Multiple Web Pages
# URL Format: Path Variable
# Base URL: https://quotes.toscrape.com/page/{page}/
# Pages: 1 to 5
# Output: quotes_pages.csv

3. Multi-Page Scraping with Query Parameters

sandpaper
# Choose: Multiple Web Pages
# URL Format: Query Param
# Base URL: https://example.com/search?q=books&page={page}
# Pages: 1 to 10
# Output: search_results.csv

4. Custom URL List

sandpaper
# Choose: Multiple Web Pages
# URL Format: Custom List
# URLs: https://site1.com,https://site2.com,https://site3.com
# Output: custom_sites.csv

Configuration Options

Option	Description	Default
Mode	Single page or multiple page scraping	-
URL Format	Path variable, query param, or custom list	-
Headers	Default or custom JSON headers	Default
Encoding	Character encoding for the page	utf-8
Filter Threshold	Minimum elements per column to keep	10
Output Filename	Custom CSV filename	`{domain}.csv`
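
As a rough illustration of how the Headers and Output Filename defaults in the table might be resolved, here is a minimal sketch. The helpers `parse_headers` and `default_filename` are hypothetical names, not part of the SandPaper API, and the domain is derived with `urllib.parse` as a simplification of what tldextract does.

```python
import json
from urllib.parse import urlparse


def parse_headers(raw):
    """Return None for "Default", else a dict parsed from a JSON object string."""
    if raw.strip().lower() == "default":
        return None  # caller falls back to its built-in headers
    headers = json.loads(raw)
    if not isinstance(headers, dict):
        raise ValueError("custom headers must be a JSON object")
    return {str(k): str(v) for k, v in headers.items()}


def default_filename(url):
    """Build '<domain>.csv' from the URL's host name.

    Simplification: takes the second-to-last host label, which is wrong for
    suffixes such as .co.uk; tldextract handles those cases properly.
    """
    host = urlparse(url).hostname or "output"
    labels = host.split(".")
    domain = labels[-2] if len(labels) >= 2 else labels[0]
    return f"{domain}.csv"


print(parse_headers('{"User-Agent": "my-bot/1.0"}'))
print(default_filename("https://quotes.toscrape.com/page/2/"))
```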

URL Format Examples

Path Variable

https://example.com/products/{page}
https://blog.example.com/posts/{page}

Query Parameter

https://example.com/search?q=books&page={page}
https://shop.example.com/category/electronics?page={page}&sort=price

Custom URL List

https://example.com/page1,https://example.com/page2,https://example.com/page3
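
The three URL formats above all reduce to a list of concrete URLs to visit. The sketch below shows one way that expansion could work; `build_urls` and its parameter names are hypothetical and are not taken from SandPaper's code. Path-variable and query-parameter URLs are treated identically, since both only need the `{page}` placeholder replaced.

```python
def build_urls(url_format, base_url="", first_page=1, last_page=1, url_list=""):
    """Expand a URL format into the list of page URLs to scrape."""
    if url_format == "Custom List":
        # Comma-separated input; surrounding whitespace and empty items are dropped.
        return [u.strip() for u in url_list.split(",") if u.strip()]
    if "{page}" not in base_url:
        raise ValueError("base_url must contain a {page} placeholder")
    # str.replace is used instead of str.format so other braces in the URL are left alone.
    return [
        base_url.replace("{page}", str(n))
        for n in range(first_page, last_page + 1)
    ]


print(build_urls("Path Variable", "https://example.com/products/{page}", 1, 3))
print(build_urls("Query Param", "https://example.com/search?q=books&page={page}", 1, 2))
print(build_urls("Custom List", url_list="https://example.com/page1, https://example.com/page2"))
```

One limitation of the comma-separated form, at least in this sketch, is that a URL containing a literal comma would be split in two.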

Project Structure

SandPaper/
├── src/
│   └── sandpaper-py/
│       ├── __init__.py
│       ├── menu.py          # Interactive CLI interface
│       ├── sandpaper.py     # Main scraping logic
│       ├── scraper.py       # Web scraping utilities
│       ├── extractor.py     # Data extraction
│       └── exporter.py      # CSV export functionality
├── tests/                   # Tests
├── pyproject.toml           # Package configuration
├── README.md                # Readme
└── LICENSE                  # License

Dependencies

playwright - Browser automation and JavaScript rendering
rich - Beautiful terminal output and formatting
questionary - Interactive CLI prompts
pandas - Data manipulation and CSV export
tldextract - URL domain extraction
requests - HTTP requests
beautifulsoup4 - HTML parsing

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Commit your changes (git commit -m 'Add new feature')
Push to the branch (git push origin feature/new-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version History

v0.0.2

Initial release
Single and multi-page scraping
Interactive CLI interface
CSV export functionality
Browser automation with Playwright

Release history

Version	Release date
0.7.3	Apr 29, 2026
0.7.2	Apr 29, 2026
0.7.1	Apr 29, 2026
0.1.0	Aug 23, 2025
0.0.5 (this version)	Aug 23, 2025
