A versatile web scraping tool with options for Selenium or Playwright, featuring OpenAI-powered data extraction and formatting.

PAR Scrape

Runs on Linux | macOS | Windows. Architectures: x86-64 | ARM | Apple Silicon

About

PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring OpenAI-powered data extraction and formatting.

Features

  • Web scraping using Selenium or Playwright
  • OpenAI-powered data extraction and formatting
  • Supports multiple output formats (JSON, Excel, CSV, Markdown)
  • Customizable field extraction
  • Token usage and cost estimation
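The cost estimation feature can be pictured as simple arithmetic over token counts. The sketch below is an illustration only, not PAR Scrape's actual implementation: the `ModelPricing` class, the `estimate_cost` function, and the per-million-token rates are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Prices in USD per one million tokens (placeholder values, not real rates)."""

    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: ModelPricing) -> float:
    """Return the estimated USD cost of a single request."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Hypothetical rates for illustration; look up current prices for the model you use.
example_pricing = ModelPricing(input_per_million=1.0, output_per_million=2.0)
cost = estimate_cost(input_tokens=12_000, output_tokens=800, pricing=example_pricing)
print(f"Estimated cost: ${cost:.4f}")
```

Input and output tokens are priced separately because providers commonly charge different rates for each.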

Installation

To install PAR Scrape, make sure you have Python 3.11 or higher and uv installed.
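If it helps, the two prerequisites can be checked with a short script before installing. This is a minimal sketch using only the Python standard library; the function names `python_version_ok` and `missing_prerequisites` are made up for this example and are not part of PAR Scrape.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the installation notes


def python_version_ok(version: tuple) -> bool:
    """Return True if the (major, minor, ...) tuple meets the minimum version."""
    return tuple(version[:2]) >= MIN_PYTHON


def missing_prerequisites() -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not python_version_ok(tuple(sys.version_info)):
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("uv") is None:
        problems.append("uv was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print(f"Missing prerequisite: {problem}")
```

The script prints nothing when both prerequisites are satisfied.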

Installation From Source

To install from source, follow these steps:

  1. Clone the repository:

    git clone https://github.com/paulrobello/par_scrape.git
    cd par_scrape
    
  2. Install the package dependencies using uv:

    uv sync
    

Installation From PyPI

To install PAR Scrape from PyPI, run either of the following commands:

uv tool install par_scrape
pipx install par_scrape

Usage

Run PAR Scrape from the command line. The basic invocation depends on how you installed it:

Running from source

uv run par_scrape --url "https://openai.com/api/pricing/" --fields "Model" --fields "Pricing Input" --fields "Pricing Output" --scraper selenium --model gpt-4o-mini --display-output md

Running if installed from PyPI

par_scrape --url "https://openai.com/api/pricing/" --fields "Model" "Pricing Input" "Pricing Output" --scraper selenium --model gpt-4o-mini --display-output md

Options

  • --url: The URL to scrape (default: "https://openai.com/api/pricing/")
  • --fields: Fields to extract from the webpage (default: ["Model", "Pricing Input", "Pricing Output"])
  • --scraper: Scraper to use: 'selenium' or 'playwright' (default: "selenium")
  • --remove-output: Remove output folder before running
  • --headless: Run in headless mode (for Selenium) (default: True)
  • --model: OpenAI model to use for processing (default: "gpt-4o-mini")
  • --display-output: Display output in terminal (md, csv, or json)
  • --output-folder: Specify the location of the output folder (default: "./output")
  • --silent: Run in silent mode, suppressing output
  • --run-name: Specify a name for this run
  • --cleanup: Remove output folder before exiting

Examples

  1. Basic usage with default options:

    par_scrape --url "https://openai.com/api/pricing/" --fields "Model" "Pricing Input" "Pricing Output"

  2. Using Playwright and displaying JSON output:

    par_scrape --url "https://example.com" --fields "Title" "Description" "Price" --scraper playwright --display-output json

  3. Specifying a custom model and output folder:

    par_scrape --url "https://example.com" --fields "Title" "Description" "Price" --model gpt-4 --output-folder ./custom_output

  4. Running in silent mode with a custom run name:

    par_scrape --url "https://example.com" --fields "Title" "Description" "Price" --silent --run-name my_custom_run

  5. Using the cleanup option to remove the output folder after scraping:

    par_scrape --url "https://example.com" --fields "Title" "Description" "Price" --cleanup

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Paul Robello - probello@gmail.com
