SandPaper is a package for scraping web pages with Playwright and exporting structured data to CSV.
SandPaper-py
SandPaper is a command-line tool for web scraping that extracts structured data from web pages and exports it to CSV. Its interactive CLI supports single-page and multi-page scraping, with pagination through path variables, query parameters, or custom URL lists. It uses Playwright for browser automation, with automatic scrolling, custom headers, and encoding options, which makes it suited to collecting data from dynamic websites and turning it into organized datasets.
Features
- Interactive CLI: Terminal prompts that guide you through configuring a scrape
- Single & Multi-Page Scraping: Extract data from an individual page or from multiple pages with pagination support
- Flexible URL Formats: Support for path variables, query parameters, and custom URL lists
- Browser Automation: Uses Playwright for JavaScript-rendered content scraping
- Automatic Scrolling: Handles infinite scroll and dynamic content loading
- Custom Headers: Configure request headers for different websites
- Encoding Support: Handle various character encodings (UTF-8, ISO-8859-1, etc.)
- Data Filtering: Filter data based on minimum element thresholds
- CSV Export: Clean, organized data export with customizable filenames
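The data filtering feature can be pictured with a minimal sketch. It assumes that a "minimum element threshold" means dropping any column with fewer than that many non-empty values; the function name `filter_columns` and the dict-of-lists layout are hypothetical and are not taken from SandPaper's code, which lists pandas as its data dependency.

```python
from typing import Dict, List, Optional

Column = List[Optional[str]]


def filter_columns(columns: Dict[str, Column], threshold: int = 10) -> Dict[str, Column]:
    """Keep only the columns that hold at least `threshold` non-empty values.

    Hypothetical helper: an assumption about how a minimum element
    threshold could work, not SandPaper's actual implementation.
    """
    kept: Dict[str, Column] = {}
    for name, values in columns.items():
        # Count values that are neither None nor blank strings.
        non_empty = [v for v in values if v is not None and v.strip() != ""]
        if len(non_empty) >= threshold:
            kept[name] = values
    return kept


# Made-up sample data: "banner" has only one usable value.
scraped = {
    "quote": ["a", "b", "c"],
    "author": ["x", None, "y"],
    "banner": ["Sale!", None, ""],
}

filtered = filter_columns(scraped, threshold=2)
print(sorted(filtered))
```

With a threshold of 2, the sparse `banner` column is dropped while `quote` and `author` survive. A higher threshold is stricter and removes more columns.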
Installation
From PyPI (Recommended)
pip install sandpaper-py
From Source
git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e .
Quick Start
Command Line Usage
After installation, launch the interactive CLI:
sandpaper
This will start an interactive session that guides you through the scraping process.
Programmatic Usage
from sandpaper_py import scraper

# Single page scraping
result = scraper(
    mode="Single Web Page",
    filename="output.csv",
    base_url="https://example.com",
    headers="Default",
    encoding="utf-8",
    filter_threshold=10,
    intial_page=1,
    final_page=1,
    url_list=[]
)
Usage Examples
1. Single Page Scraping
sandpaper
# Choose: Single Web Page
# Enter URL: https://quotes.toscrape.com
# Output: quotes.csv
2. Multi-Page Scraping with Path Variables
sandpaper
# Choose: Multiple Web Pages
# URL Format: Path Variable
# Base URL: https://quotes.toscrape.com/page/{page}/
# Pages: 1 to 5
# Output: quotes_pages.csv
3. Multi-Page Scraping with Query Parameters
sandpaper
# Choose: Multiple Web Pages
# URL Format: Query Param
# Base URL: https://example.com/search?q=books&page={page}
# Pages: 1 to 10
# Output: search_results.csv
4. Custom URL List
sandpaper
# Choose: Multiple Web Pages
# URL Format: Custom List
# URLs: https://site1.com,https://site2.com,https://site3.com
# Output: custom_sites.csv
Configuration Options
| Option | Description | Default |
|---|---|---|
| Mode | Single page or multiple page scraping | - |
| URL Format | Path variable, query param, or custom list | - |
| Headers | Default or custom JSON headers | Default |
| Encoding | Character encoding for the page | utf-8 |
| Filter Threshold | Minimum elements per column to keep | 10 |
| Output Filename | Custom CSV filename | {domain}.csv |
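As an illustration of the Headers option, here is a minimal sketch of turning the "Default or custom JSON" value into a header mapping. The names `parse_headers` and `DEFAULT_HEADERS`, and the fallback header value itself, are assumptions made for this example; the table does not say what SandPaper sends by default or how it validates input.

```python
import json
from typing import Dict

# Hypothetical fallback: the tool's real default headers are not documented here.
DEFAULT_HEADERS: Dict[str, str] = {"User-Agent": "Mozilla/5.0"}


def parse_headers(raw: str) -> Dict[str, str]:
    """Return a header dict from either the word 'Default' or a JSON object string."""
    if raw.strip().lower() == "default":
        # Copy so callers cannot mutate the shared default.
        return dict(DEFAULT_HEADERS)
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on malformed input.
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("custom headers must be a JSON object")
    # HTTP header names and values are strings, so coerce both.
    return {str(key): str(value) for key, value in parsed.items()}


custom = parse_headers('{"Accept-Language": "en-US", "X-Retry": 3}')
print(custom)
```

Rejecting anything that is not a JSON object early gives a clearer error than letting a list or number reach the browser layer.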
URL Format Examples
Path Variable
https://example.com/products/{page}
https://blog.example.com/posts/{page}
Query Parameter
https://example.com/search?q=books&page={page}
https://shop.example.com/category/electronics?page={page}&sort=price
Custom URL List
https://example.com/page1,https://example.com/page2,https://example.com/page3
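The three URL formats above can be sketched as plain string handling. This assumes `{page}` is replaced by each page number in an inclusive range and that a custom list is comma-separated, as the examples suggest; `expand_template` and `split_url_list` are hypothetical names, not functions exported by SandPaper.

```python
from typing import List


def expand_template(template: str, first_page: int, last_page: int) -> List[str]:
    """Substitute each page number in [first_page, last_page] for `{page}`."""
    if "{page}" not in template:
        raise ValueError("template must contain a {page} placeholder")
    if first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    # str.replace is used instead of str.format so other braces in the URL are left alone.
    return [template.replace("{page}", str(n)) for n in range(first_page, last_page + 1)]


def split_url_list(raw: str) -> List[str]:
    """Split a comma-separated URL string, dropping blanks and surrounding whitespace."""
    return [url.strip() for url in raw.split(",") if url.strip()]


# Path variable and query parameter templates expand the same way.
path_urls = expand_template("https://example.com/products/{page}", 1, 3)
query_urls = expand_template("https://example.com/search?q=books&page={page}", 1, 2)
custom_urls = split_url_list("https://example.com/page1, https://example.com/page2")

print(path_urls)
print(query_urls)
print(custom_urls)
```

Under this reading, path variables and query parameters differ only in where the placeholder sits in the URL, so a single expansion function covers both.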
Project Structure
SandPaper/
├── src/
│   └── sandpaper-py/
│       ├── __init__.py
│       ├── menu.py        # Interactive CLI interface
│       ├── sandpaper.py   # Main scraping logic
│       ├── scraper.py     # Web scraping utilities
│       ├── extractor.py   # Data extraction
│       └── exporter.py    # CSV export functionality
├── tests/                 # Tests
├── pyproject.toml         # Package configuration
├── README.md              # Readme
└── LICENSE                # License
Dependencies
- playwright - Browser automation and JavaScript rendering
- rich - Beautiful terminal output and formatting
- questionary - Interactive CLI prompts
- pandas - Data manipulation and CSV export
- tldextract - URL domain extraction
- requests - HTTP requests
- beautifulsoup4 - HTML parsing
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/new-feature)
- Commit your changes (git commit -m 'Add new feature')
- Push to the branch (git push origin feature/new-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Version History
v0.1.0 (Current)
- Initial release
- Single and multi-page scraping
- Interactive CLI interface
- CSV export functionality
- Browser automation with Playwright