A versatile web scraping tool with options for Selenium or Playwright, featuring OpenAI-powered data extraction and formatting.
PAR Scrape
About
PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring AI-powered data extraction and formatting.
Features
- Web scraping using Selenium or Playwright
- AI-powered data extraction and formatting
- Supports multiple output formats (JSON, Excel, CSV, Markdown)
- Customizable field extraction
- Token usage and cost estimation
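The cost estimation feature can be pictured with a small sketch. It is not PAR Scrape's own code. It only shows the usual arithmetic for per-token billing, and the `ModelPrice` class, the `estimate_cost` function, and the prices used are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical per-million-token prices in US dollars."""

    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, price: ModelPrice) -> float:
    """Return the estimated dollar cost of one LLM call."""
    input_cost = input_tokens / 1_000_000 * price.input_per_million
    output_cost = output_tokens / 1_000_000 * price.output_per_million
    return input_cost + output_cost


# Placeholder prices for illustration only; check your provider's current rates.
example_price = ModelPrice(input_per_million=1.0, output_per_million=2.0)
print(f"${estimate_cost(12_000, 800, example_price):.6f}")
```

Input and output tokens are priced separately because most providers charge different rates for each.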
Known Issues
- Silent mode on Windows still shows a message about a websocket. There is no simple way to suppress it.
- Providers other than OpenAI have not been tested; your mileage may vary.
Installation
To install PAR Scrape, make sure you have Python 3.11 or higher and uv installed.
Installation From Source
Then, follow these steps:
- Clone the repository:
  git clone https://github.com/paulrobello/par_scrape.git
  cd par_scrape
- Install the package dependencies using uv:
  uv sync
Installation From PyPI
To install PAR Scrape from PyPI, run any of the following commands:
uv tool install par-scrape
pipx install par-scrape
Usage
To use PAR Scrape, run it from the command line with the options described below. First, make sure the API key for your AI provider is set in your environment. The key names for the supported providers are as follows:
- OpenAI: OPENAI_API_KEY
- Anthropic: ANTHROPIC_API_KEY
- Google: GOOGLE_API_KEY
- Groq: GROQ_API_KEY
- Ollama: Not needed
You can also store your key in the file ~/.par-scrape.env as follows:
OPENAI_API_KEY=your_api_key
ANTHROPIC_API_KEY=your_api_key
GOOGLE_API_KEY=your_api_key
GROQ_API_KEY=your_api_key
Running from source
uv run par_scrape --url "https://openai.com/api/pricing/" --fields "Model" --fields "Pricing Input" --fields "Pricing Output" --scraper selenium --model gpt-4o-mini --display-output md
Running if installed from PyPI
par_scrape --url "https://openai.com/api/pricing/" --fields "Title" --fields "Number of Points" --fields "Creator" --fields "Time Posted" --fields "Number of Comments" --scraper selenium --model gpt-4o-mini --display-output md
Options
- --url, -u: The URL to scrape (default: "https://openai.com/api/pricing/")
- --fields, -f: Fields to extract from the webpage (default: ["Model", "Pricing Input", "Pricing Output"])
- --scraper: Scraper to use: 'selenium' or 'playwright' (default: "selenium")
- --headless, -h: Run in headless mode (for Selenium) (default: False)
- --sleep-time, -t: Time to sleep (in seconds) before scrolling and closing browser (default: 5)
- --pause, -p: Wait for user input before closing browser
- --ai-provider, -a: AI provider to use for processing (default: "OpenAI")
- --model, -m: AI model to use for processing. If not specified, a default model will be used based on the provider.
- --pricing: Enable pricing summary display (default: False)
- --display-output, -d: Display output in terminal (md, csv, or json)
- --output-folder, -o: Specify the location of the output folder (default: "./output")
- --silent, -s: Run in silent mode, suppressing output
- --run-name, -n: Specify a name for this run
- --version, -v: Show the version and exit
- --cleanup, -c: [none|before|after|both] If and when to remove the output folder (default: none)
Examples
- Basic usage with default options:
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" --pricing
- Using Playwright and displaying JSON output:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --scraper playwright -d json --pricing
- Specifying a custom model and output folder:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4 --output-folder ./custom_output --pricing
- Running in silent mode with a custom run name:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --silent --run-name my_custom_run --pricing
- Using the cleanup option to remove the output folder after scraping:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --cleanup after --pricing
- Using the pause option to wait for user input before scrolling:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --pause --pricing
What's New
- Version 0.4.1:
  - Minor bug fixes for pricing summary.
  - Default model for Google changed to "gemini-1.5-pro-exp-0827", which is free and usually works well.
- Version 0.4.0:
  - Added support for Anthropic, Google, Groq, and Ollama. (Not well tested with any providers other than OpenAI.)
  - Added a flag for displaying the pricing summary. Defaults to False.
  - Added pricing data for Anthropic.
  - Better error handling for LLM calls.
  - Updated the cleanup flag to handle both before and after cleanup. Removed the --remove-output-folder flag.
- Version 0.3.1:
  - Added pause and sleep-time options to control the browser and scraping delays.
  - Default headless mode to False so you can interact with the browser.
- Version 0.3.0:
  - Fixed location of config.json file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Paul Robello - probello@gmail.com
Hashes for par_scrape-0.4.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 297367f1861c018486ab46c226d1f2b3911f84b0af7e890663088b6b9e321875
MD5 | c82952f3b593804414dd034d0d8064b3
BLAKE2b-256 | 7a0a21e7904ed0cb64bcab38a466963394c37a41a8924c3d628f9dce2404aa07