
SandPaper is a package for scraping web pages with Playwright and exporting structured data to CSV.

SandPaper-py

SandPaper is a command-line tool for web scraping that extracts structured data from web pages and exports it to CSV. Its interactive CLI supports single-page and multi-page scraping, with pagination through path variables, query parameters, or custom URL lists. It uses Playwright for browser automation, with automatic scrolling, custom headers, and encoding options, which makes it suited to collecting data from dynamic websites and turning it into organized datasets.

Features

  • Interactive CLI: Terminal prompts guide you through each step of a scrape
  • Single & Multi-Page Scraping: Extract data from an individual page or from multiple pages with pagination support
  • Flexible URL Formats: Support for path variables, query parameters, and custom URL lists
  • Browser Automation: Uses Playwright for JavaScript-rendered content scraping
  • Automatic Scrolling: Handles infinite scroll and dynamic content loading
  • Custom Headers: Configure request headers for different websites
  • Encoding Support: Handle various character encodings (UTF-8, ISO-8859-1, etc.)
  • Data Filtering: Filter data based on minimum element thresholds
  • CSV Export: Clean, organized data export with customizable filenames
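
The automatic scrolling feature can be pictured as a loop that keeps scrolling until the page stops growing. The following is a minimal sketch of that idea, not SandPaper's actual implementation: the function name `scroll_until_stable`, its parameters, and the `max_rounds` cap are all hypothetical, and the browser is abstracted behind two callables so the logic can be read on its own.

```python
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    max_rounds: int = 20,
) -> int:
    """Scroll until the page height stops changing or max_rounds is reached.

    Hypothetical helper for illustration only. Returns the number of
    scroll actions that were performed.
    """
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_down()
        rounds += 1
        new_height = get_height()
        if new_height == last_height:
            # No new content appeared after this scroll, so stop.
            break
        last_height = new_height
    return rounds
```

With Playwright's synchronous API, `get_height` could wrap `page.evaluate("document.body.scrollHeight")` and `scroll_down` could call `page.mouse.wheel(0, 10000)` followed by a short `page.wait_for_timeout(500)` so lazy-loaded content has time to arrive. The cap on rounds guards against pages that never stop loading.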

Installation

From PyPI (Recommended)

pip install sandpaper-py

From Source

git clone https://github.com/Aaryan-Dadu/SandPaper
cd SandPaper
pip install -e .

Quick Start

Command Line Usage

After installation, launch the interactive CLI:

sandpaper

This will start an interactive session that guides you through the scraping process.

Programmatic Usage

from sandpaper_py import scraper

# Single page scraping
result = scraper(
    mode="Single Web Page",
    filename="output.csv",
    base_url="https://example.com",
    headers="Default",
    encoding="utf-8",
    filter_threshold=10,
    intial_page=1,
    final_page=1,
    url_list=[]
)

Usage Examples

1. Single Page Scraping

sandpaper
# Choose: Single Web Page
# Enter URL: https://quotes.toscrape.com
# Output: quotes.csv

2. Multi-Page Scraping with Path Variables

sandpaper
# Choose: Multiple Web Pages
# URL Format: Path Variable
# Base URL: https://quotes.toscrape.com/page/{page}/
# Pages: 1 to 5
# Output: quotes_pages.csv

3. Multi-Page Scraping with Query Parameters

sandpaper
# Choose: Multiple Web Pages
# URL Format: Query Param
# Base URL: https://example.com/search?q=books&page={page}
# Pages: 1 to 10
# Output: search_results.csv

4. Custom URL List

sandpaper
# Choose: Multiple Web Pages
# URL Format: Custom List
# URLs: https://site1.com,https://site2.com,https://site3.com
# Output: custom_sites.csv

Configuration Options

Option            Description                                  Default
Mode              Single page or multiple page scraping        -
URL Format        Path variable, query param, or custom list   -
Headers           Default or custom JSON headers               Default
Encoding          Character encoding for the page              utf-8
Filter Threshold  Minimum elements per column to keep          10
Output Filename   Custom CSV filename                          {domain}.csv
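
To make the Filter Threshold option concrete, here is one plausible reading of it as a sketch: a column is kept only if it holds at least the threshold number of non-empty values. This is an assumption about the behavior, not code from SandPaper, and the function name `filter_columns` and the dict-of-lists data shape are hypothetical.

```python
def filter_columns(
    columns: dict[str, list[str]], threshold: int = 10
) -> dict[str, list[str]]:
    """Keep only columns with at least `threshold` non-empty values.

    Hypothetical illustration of a minimum-element filter; the real
    tool may count or filter differently.
    """
    kept = {}
    for name, values in columns.items():
        # Count values that contain something other than whitespace.
        non_empty = [v for v in values if v and v.strip()]
        if len(non_empty) >= threshold:
            kept[name] = values
    return kept
```

Under this reading, a column of 12 scraped quotes survives the default threshold of 10, while a column with only 3 stray banner strings is dropped before export.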

URL Format Examples

Path Variable

https://example.com/products/{page}
https://blog.example.com/posts/{page}

Query Parameter

https://example.com/search?q=books&page={page}
https://shop.example.com/category/electronics?page={page}&sort=price
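
Both the path-variable and query-parameter formats rely on the same `{page}` placeholder, so expanding either into concrete URLs can be sketched with one helper. The function below is an illustration under that assumption; `expand_page_urls` and its parameter names are hypothetical and not part of SandPaper's documented API.

```python
def expand_page_urls(template: str, first_page: int, last_page: int) -> list[str]:
    """Substitute each page number into a URL template containing {page}.

    Hypothetical helper for illustration. Uses str.replace rather than
    str.format so other braces in the URL are left untouched.
    """
    if "{page}" not in template:
        raise ValueError("template must contain the {page} placeholder")
    if first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    return [
        template.replace("{page}", str(number))
        for number in range(first_page, last_page + 1)
    ]
```

For example, `expand_page_urls("https://quotes.toscrape.com/page/{page}/", 1, 3)` yields the URLs for pages 1 through 3, and the same call works unchanged for a template such as `https://example.com/search?q=books&page={page}`.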

Custom URL List

https://example.com/page1,https://example.com/page2,https://example.com/page3
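
A comma-separated list like the one above has to be split and checked before scraping. The sketch below shows one way that could be done with the standard library; `parse_url_list` is a hypothetical name, and the validation rules (http or https scheme, non-empty host) are assumptions rather than SandPaper's documented behavior.

```python
from urllib.parse import urlparse


def parse_url_list(raw: str) -> list[str]:
    """Split a comma-separated string into a list of validated URLs.

    Hypothetical helper for illustration. Surrounding whitespace and
    empty entries are ignored; anything that is not an http(s) URL
    with a host raises ValueError.
    """
    urls = []
    for part in raw.split(","):
        url = part.strip()
        if not url:
            continue  # Skip empty entries such as a trailing comma.
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a valid http(s) URL: {url!r}")
        urls.append(url)
    return urls
```

One limitation of this simple split is that a URL containing a literal comma would be broken in two, so such URLs would need to be percent-encoded first.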

Project Structure

SandPaper/
├── src/
│   └── sandpaper-py/
│       ├── __init__.py
│       ├── menu.py          # Interactive CLI interface
│       ├── sandpaper.py     # Main scraping logic
│       ├── scraper.py       # Web scraping utilities
│       ├── extractor.py     # Data extraction
│       └── exporter.py      # CSV export functionality
├── tests/                   # Tests
├── pyproject.toml           # Package configuration
├── README.md                # Readme
└── LICENSE                  # License

Dependencies

  • playwright - Browser automation and JavaScript rendering
  • rich - Beautiful terminal output and formatting
  • questionary - Interactive CLI prompts
  • pandas - Data manipulation and CSV export
  • tldextract - URL domain extraction
  • requests - HTTP requests
  • beautifulsoup4 - HTML parsing

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -m 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version History

v0.1.0 (Current)

  • Initial release
  • Single and multi-page scraping
  • Interactive CLI interface
  • CSV export functionality
  • Browser automation with Playwright
