# PAR Scrape

## About

PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring OpenAI-powered data extraction and formatting.
## Features
- Web scraping using Selenium or Playwright
- OpenAI-powered data extraction and formatting
- Supports multiple output formats (JSON, Excel, CSV, Markdown)
- Customizable field extraction
- Token usage and cost estimation
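
The JSON output is straightforward to post-process. Below is a minimal sketch that assumes the JSON file holds a list of objects keyed by the requested fields (this layout is an assumption; the exact schema may differ by version), and converts the same records to CSV to mirror the tool's multi-format output:

```python
import csv
import json
from io import StringIO

# Assumed shape: a list of objects keyed by the requested fields.
# (The sample record below is illustrative, not real scraped data.)
data = json.loads(
    '[{"Model": "gpt-4o-mini", "Pricing Input": "$0.15", "Pricing Output": "$0.60"}]'
)

# Re-emit the same records as CSV.
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=["Model", "Pricing Input", "Pricing Output"])
writer.writeheader()
writer.writerows(data)
print(buf.getvalue())
```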
## Installation

To install PAR Scrape, make sure you have Python 3.11 or higher and uv installed.

### Installation From Source

Then, follow these steps:

1. Clone the repository:

```shell
git clone https://github.com/paulrobello/par_scrape.git
cd par_scrape
```

2. Install the package dependencies using uv:

```shell
uv sync
```
### Installation From PyPI

To install PAR Scrape from PyPI, run either of the following commands:

```shell
uv tool install par-scrape
```

```shell
pipx install par-scrape
```
## Usage

Run PAR Scrape from the command line with the options described below. Before running, ensure your OPENAI_API_KEY is set in your environment. Alternatively, store the key in the file ~/.par-scrape.env as follows:

```shell
OPENAI_API_KEY=your_api_key
```
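
For reference, both approaches can be set up from the shell; a quick sketch (`your_api_key` is a placeholder for a real key):

```shell
# Set the key for the current shell session only.
export OPENAI_API_KEY=your_api_key

# Or persist it in the env file that par_scrape reads.
printf 'OPENAI_API_KEY=%s\n' "your_api_key" > "$HOME/.par-scrape.env"
```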
### Running from source

```shell
uv run par_scrape --url "https://openai.com/api/pricing/" --fields "Model" --fields "Pricing Input" --fields "Pricing Output" --scraper selenium --model gpt-4o-mini --display-output md
```
### Running if installed from PyPI

```shell
par_scrape --url "https://openai.com/api/pricing/" --fields "Title" --fields "Number of Points" --fields "Creator" --fields "Time Posted" --fields "Number of Comments" --scraper selenium --model gpt-4o-mini --display-output md
```
## Options

- `--url`, `-u`: The URL to scrape (default: "https://openai.com/api/pricing/")
- `--fields`, `-f`: Fields to extract from the webpage (default: ["Model", "Pricing Input", "Pricing Output"])
- `--scraper`: Scraper to use: 'selenium' or 'playwright' (default: "selenium")
- `--remove-output`, `-r`: Remove output folder before running
- `--headless`, `-h`: Run in headless mode (for Selenium) (default: False)
- `--sleep-time`, `-t`: Time to sleep (in seconds) before scrolling and closing the browser (default: 5)
- `--pause`, `-p`: Wait for user input before closing the browser
- `--model`, `-m`: OpenAI model to use for processing (default: "gpt-4o-mini")
- `--display-output`, `-d`: Display output in the terminal (md, csv, or json)
- `--output-folder`, `-o`: Specify the location of the output folder (default: "./output")
- `--silent`, `-s`: Run in silent mode, suppressing output
- `--run-name`, `-n`: Specify a name for this run
- `--version`, `-v`: Show the version and exit
- `--cleanup`, `-c`: Remove the output folder before exiting
## Examples

1. Basic usage with default options:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output"
```

2. Using Playwright and displaying JSON output:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --scraper playwright -d json
```

3. Specifying a custom model and output folder:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4 --output-folder ./custom_output
```

4. Running in silent mode with a custom run name:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --silent --run-name my_custom_run
```

5. Using the cleanup option to remove the output folder after scraping:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --cleanup
```

6. Using the pause option to wait for user input before closing the browser:

```shell
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --pause
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Author

Paul Robello - probello@gmail.com