PAR Scrape
PAR Scrape is a versatile web scraping tool with options for Selenium or Playwright, featuring AI-powered data extraction and formatting.
Features
- Web scraping using Playwright or Selenium
- AI-powered data extraction and formatting
- Can be used to crawl and extract clean markdown without AI
- Supports multiple output formats (JSON, Excel, CSV, Markdown)
- Customizable field extraction
- Token usage and cost estimation
- Prompt cache for Anthropic provider
- Uses my PAR AI Core
Known Issues
- Selenium silent mode on Windows still prints a message about the websocket. There is no simple way to suppress this.
- Providers other than OpenAI are hit-or-miss depending on the provider, model, and data being extracted.
Prompt Cache
- OpenAI automatically caches prompts that are over 1024 tokens.
- Anthropic only caches prompts if you specify the --prompt-cache flag. Because cache writes cost more, enable this only if you intend to run multiple scrape jobs against the same URL. The cache also goes stale within a couple of minutes, so run your jobs as close together as possible to reduce cost.
How it works
- Data is fetched from the site using either Selenium or Playwright
- HTML is converted to clean markdown
- If you specify an output format other than markdown, the following steps kick in:
- A pydantic model is constructed from the fields you specify
- The markdown is sent to the AI provider with the pydantic model as the required output
- The structured output is saved in the specified formats
- If crawling mode is enabled, this process is repeated for each page in the queue until the specified maximum number of pages is reached
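The model-construction step above can be sketched with pydantic's create_model. This is an illustrative sketch, not PAR Scrape's actual code; the build_model helper and the all-string field types are assumptions:

```python
from pydantic import BaseModel, create_model


def build_model(field_names: list[str]) -> type[BaseModel]:
    """Build a pydantic model with one required string field per requested name."""
    fields = {name.lower().replace(" ", "_"): (str, ...) for name in field_names}
    return create_model("ExtractedItem", **fields)


# Fields passed on the command line with -f become model fields the
# AI provider must populate for each extracted row.
Item = build_model(["Title", "Description", "Price"])
row = Item(title="gpt-4o-mini", description="Small model", price="$0.15 / 1M tokens")
print(row.model_dump())
```

Structured-output support in the AI provider then guarantees each extracted row validates against this schema before it is written to JSON, CSV, or Excel.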
Site Crawling
Crawling currently comes in three modes:
- Single page, which is the default
- Single level, which crawls all links on the first page and adds them to the queue. Links from any pages after the first are not added to the queue
- Domain, which crawls all links on all pages as long as they belong to the same top-level domain (TLD)
- Paginated mode will be added soon
Crawling progress is stored in a SQLite database, and all pages are tagged with the run name, which can be specified with the --run-name / -n flag.
You can resume a crawl by specifying the same run name again.
The option --scrape-max-parallel / -P can be used to increase scraping speed by running multiple scrapes in parallel.
The option --crawl-batch-size / -B should be set at least as high as the scrape max parallel option to ensure that the queue is always full.
The option --crawl-max-pages / -M can be used to limit the total number of pages crawled in a single run.
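The queue behavior described above can be sketched with a small SQLite table. This is a simplified illustration under assumed names (the real PAR Scrape schema and functions may differ); it shows how tagging rows with a run name makes crawls resumable, and how claiming a batch inside one transaction keeps parallel scrapers from grabbing the same URL:

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS queue (
               url TEXT, run_name TEXT, status TEXT DEFAULT 'queued',
               PRIMARY KEY (url, run_name))"""
    )


def add_to_queue(conn: sqlite3.Connection, run_name: str, urls: list[str]) -> None:
    # INSERT OR IGNORE so re-running the same run name resumes rather than duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO queue (url, run_name) VALUES (?, ?)",
        [(u, run_name) for u in urls],
    )


def get_next_urls(conn: sqlite3.Connection, run_name: str, batch_size: int) -> list[str]:
    # Claim a batch atomically so parallel workers never process the same URL twice.
    with conn:  # one transaction for the SELECT + UPDATE pair
        rows = conn.execute(
            "SELECT url FROM queue WHERE run_name = ? AND status = 'queued' LIMIT ?",
            (run_name, batch_size),
        ).fetchall()
        urls = [r[0] for r in rows]
        conn.executemany(
            "UPDATE queue SET status = 'scraping' WHERE url = ? AND run_name = ?",
            [(u, run_name) for u in urls],
        )
    return urls


conn = sqlite3.connect(":memory:")
init_db(conn)
add_to_queue(conn, "my_run", ["https://example.com/a", "https://example.com/b"])
print(get_next_urls(conn, "my_run", 1))  # claims one queued URL for this run
```

Resuming simply means calling add_to_queue and get_next_urls again with the same run name: already-claimed or finished pages are skipped.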
Prerequisites
To install PAR Scrape, make sure you have Python 3.11 or higher. Python 3.14 is the default and recommended version; Python 3.11 through 3.14 are supported.
uv is recommended
Linux and Mac
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Installation
Installation From Source
Then, follow these steps:
1. Clone the repository:
git clone https://github.com/paulrobello/par_scrape.git
cd par_scrape
2. Install the package dependencies using uv:
uv sync
Installation From PyPI
To install PAR Scrape from PyPI, run one of the following commands:
uv tool install par_scrape
pipx install par_scrape
Playwright Installation
To use playwright as a scraper, you must install it and its browsers using the following commands:
uv tool install playwright
playwright install chromium
Usage
To use PAR Scrape, you can run it from the command line with various options. Here's a basic example:
Ensure the API key for your AI provider is set in your environment.
You can also store your API keys in the file ~/.par_scrape.env as follows:
# AI API KEYS
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GROQ_API_KEY=
XAI_API_KEY=
GOOGLE_API_KEY=
MISTRAL_API_KEY=
GITHUB_TOKEN=
OPENROUTER_API_KEY=
DEEPSEEK_API_KEY=
# Used by Bedrock
AWS_PROFILE=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
# Tracing (optional)
LANGCHAIN_TRACING_V2=false
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=
LANGCHAIN_PROJECT=par_scrape
AI API KEYS
- ANTHROPIC_API_KEY is required for Anthropic. Get a key from https://console.anthropic.com/
- OPENAI_API_KEY is required for OpenAI. Get a key from https://platform.openai.com/account/api-keys
- GITHUB_TOKEN is required for GitHub Models. Get a free key from https://github.com/marketplace/models
- GOOGLE_API_KEY is required for Google Models. Get a free key from https://console.cloud.google.com
- XAI_API_KEY is required for XAI. Get a free key from https://x.ai/api
- GROQ_API_KEY is required for Groq. Get a free key from https://console.groq.com/
- MISTRAL_API_KEY is required for Mistral. Get a free key from https://console.mistral.ai/
- OPENROUTER_API_KEY is required for OpenRouter. Get a key from https://openrouter.ai/
- DEEPSEEK_API_KEY is required for Deepseek. Get a key from https://platform.deepseek.com/
- AWS_PROFILE or AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are used for Bedrock authentication. The environment must already be authenticated with AWS.
- No key is required for Ollama, LlamaCpp, or LiteLLM.
OpenAI Compatible Providers
If a specific provider is not listed but has an OpenAI-compatible endpoint, you can use the following combination of variables:
- PARAI_AI_PROVIDER=OpenAI
- PARAI_MODEL=Your selected model
- PARAI_AI_BASE_URL=The provider's OpenAI-compatible endpoint URL
Running from source
uv run par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" -f "Cache Price" --model gpt-4o-mini --display-output md
Running if installed from PyPI
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" -f "Cache Price" --model gpt-4o-mini --display-output md
Options
--url -u TEXT URL to scrape [default: https://openai.com/api/pricing/]
--output-format -O [md|json|csv|excel] Output format for the scraped data [default: md]
--fields -f TEXT Fields to extract from the webpage
[default: Model, Pricing Input, Pricing Output, Cache Price]
--scraper -s [selenium|playwright] Scraper to use: 'selenium' or 'playwright' [default: playwright]
--retries -r INTEGER Retry attempts for failed scrapes [default: 3]
--scrape-max-parallel -P INTEGER Max parallel fetch requests [default: 1]
--wait-type -w [none|pause|sleep|idle|selector|text] Method to use for page content load waiting [default: sleep]
--wait-selector -i TEXT Selector or text to use for page content load waiting. [default: None]
--headless -h Run in headless mode (for Selenium)
--sleep-time -t INTEGER Time to sleep before scrolling (in seconds) [default: 2]
--ai-provider -a [Ollama|LlamaCpp|OpenRouter|OpenAI|Gemini|Github|XAI|Anthropic|Groq|Mistral|Deepseek|LiteLLM|Bedrock] AI provider to use for processing [default: OpenAI]
--model -m TEXT AI model to use for processing. If not specified, a default model will be used. [default: None]
--ai-base-url -b TEXT Override the base URL for the AI provider. [default: None]
--prompt-cache Enable prompt cache for Anthropic provider
--reasoning-effort [low|medium|high] Reasoning effort level to use for o1 and o3 models. [default: None]
--reasoning-budget INTEGER Maximum context size for reasoning. [default: None]
--display-output -d [none|plain|md|csv|json] Display output in terminal (md, csv, or json) [default: None]
--output-folder -o PATH Specify the location of the output folder [default: output]
--silent -q Run in silent mode, suppressing output
--run-name -n TEXT Specify a name for this run. Can be used to resume a crawl. Defaults to YYYYmmdd_HHMMSS
--pricing -p [none|price|details] Enable pricing summary display [default: details]
--cleanup -c [none|before|after|both] How to handle cleanup of output folder [default: none]
--extraction-prompt -e PATH Path to the extraction prompt file [default: None]
--crawl-type -C [single_page|single_level|domain] Enable crawling mode [default: single_page]
--crawl-max-pages -M INTEGER Maximum number of pages to crawl this session [default: 100]
--crawl-batch-size -B INTEGER Maximum number of pages to load from the queue at once [default: 1]
--respect-rate-limits Whether to use domain-specific rate limiting [default: True]
--respect-robots Whether to respect robots.txt
--crawl-delay INTEGER Default delay in seconds between requests to the same domain [default: 1]
--version -v
--help Show this message and exit.
Examples
- Basic usage with default options:
par_scrape --url "https://openai.com/api/pricing/" -f "Model" -f "Pricing Input" -f "Pricing Output" -O json -O csv --pricing details --display-output csv
- Using Playwright, displaying JSON output and waiting for text gpt-4o to be in page before continuing:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --scraper playwright -O json -O csv -d json --pricing details -w text -i gpt-4o
- Specifying a custom model and output folder:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --model gpt-4 --output-folder ./custom_output -O json -O csv --pricing details -w text -i gpt-4o
- Running in silent mode with a custom run name:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --silent --run-name my_custom_run --pricing details -O json -O csv -w text -i gpt-4o
- Using the cleanup option to remove the output folder after scraping:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" --cleanup after --pricing details -O json -O csv
- Using the pause wait type to wait for user input before scrolling:
par_scrape --url "https://openai.com/api/pricing/" -f "Title" -f "Description" -f "Price" -w pause --pricing details -O json -O csv
- Using Anthropic provider with prompt cache enabled and detailed pricing breakdown:
par_scrape -a Anthropic --prompt-cache -d csv -p details -f "Title" -f "Description" -f "Price" -f "Cache Price" -O json -O csv
- Crawling single level and only outputting markdown (No LLM or cost):
par_scrape --url "https://openai.com/api/pricing/" -O md --crawl-batch-size 5 --scrape-max-parallel 5 --crawl-type single_level
Roadmap
- API Server
- More crawling options
- Paginated Listing crawling
What's New
- Version 0.8.3
- Updated dependencies and ensured Python 3.14 compatibility
- Python 3.14 is now the default and recommended version
- Maintains backward compatibility with Python 3.11, 3.12, and 3.13
- Updated Pyright configuration to target Python 3.14
- Updated all CI/CD workflows to use Python 3.14
- Version 0.8.2
- Updated all dependencies to latest versions (anthropic, chromadb, fastapi, selenium, and more)
- Fixed critical race conditions in database operations:
  - get_next_urls(): Now uses atomic transactions to prevent duplicate URL processing in concurrent scenarios
  - add_to_queue(): Made INSERT and UPDATE operations atomic
  - ROBOTS_PARSERS: Added thread-safe locking for concurrent access
- Fixed logic error in crawl_delay initialization that affected all domains instead of just the target
- Improved error handling for file operations with proper UTF-8 encoding
- Enhanced concurrency safety for multi-threaded/multi-process crawling
- Version 0.8.1
- Updated dependencies (ruff 0.14.2, pyright 1.1.407)
- Ensured compatibility with Python 3.13 (now the default version)
- Maintains backward compatibility with Python 3.11 and 3.12
- Version 0.8.0
- Update deps and ci/cd workflows
- Version 0.7.1
- Updated par-ai-core and other deps
- Version 0.7.0
- Major overhaul and fixing of crawling features.
- added --respect-robots flag to check robots.txt before scraping
- added --respect-rate-limits to respect rate limits for domains
- added --reasoning-effort and --reasoning-budget for o1/o3 and Sonnet 3.7
- updated dependencies
- Version 0.6.1
- Updated ai-core
- Version 0.6.0
- Fixed bug where images were being stripped from markdown output
- Now uses par_ai_core for url fetching and markdown conversion
- New Features:
  - Basic site crawling
  - Retry failed fetches
  - HTTP authentication
  - Proxy settings
- BREAKING CHANGES:
  - New option -O to specify desired output formats, which defaults to markdown only and does not require AI
- BEHAVIOR CHANGES:
  - Now retries 3 times on failed scrapes
  - Updated system prompt for better results
- Version 0.5.1
- Update ai-core and dependencies
- Now supports Deepseek, XAI and LiteLLM
- Better pricing data
- Version 0.5.0
- Update ai-core and dependencies
- Now supports OpenRouter
- Version 0.4.9
- Updated to use new par-ai-core
- Now supports LlamaCPP and XAI Grok
- Better cost tracking
- Updated pricing data
- Better error handling
- Now supports Python 3.10
- Version 0.4.8:
- Added Anthropic prompt cache option.
- Version 0.4.7:
- BREAKING CHANGE: --pricing cli option now takes a string value of 'details', 'cost', or 'none'.
- Added pool of user agents that gets randomly pulled from.
- Updating pricing data.
- Pricing token capture and compute now much more accurate.
- Version 0.4.6:
- Minor bug fixes.
- Updating pricing data.
- Added support for Amazon Bedrock
- Removed some unnecessary dependencies.
- Code cleanup.
- Version 0.4.5:
- Added new option --wait-type that allows you to specify the type of wait to use such as pause, sleep, idle, text or selector.
- Removed --pause option as it is no longer needed with --wait-type option.
- Playwright scraping now honors the headless mode.
- Playwright is now the default scraper as it is much faster.
- Version 0.4.4:
- Better Playwright scraping.
- Version 0.4.3:
- Added option to override the base URL for the AI provider.
- Version 0.4.2:
- The url parameter can now point to a local rawData_*.md file for easier testing of different models without having to re-fetch the data.
- Added ability to specify file with extraction prompt.
- Tweaked extraction prompt to work with Groq and Anthropic. Google still does not work.
- Remove need for ~/.par-scrape-config.json
- Version 0.4.1:
- Minor bug fixes for pricing summary.
- Default model for google changed to "gemini-1.5-pro-exp-0827" which is free and usually works well.
- Version 0.4.0:
- Added support for Anthropic, Google, Groq, and Ollama. (Not well tested with any providers other than OpenAI)
- Add flag for displaying pricing summary. Defaults to False.
- Added pricing data for Anthropic.
- Better error handling for llm calls.
- Updated cleanup flag to handle both before and after cleanup. Removed --remove-output-folder flag.
- Version 0.3.1:
- Add pause and sleep-time options to control the browser and scraping delays.
- Default headless mode to False so you can interact with the browser.
- Version 0.3.0:
- Fixed location of config.json file.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
Paul Robello - probello@gmail.com
File details
Details for the file par_scrape-0.8.3.tar.gz.
File metadata
- Download URL: par_scrape-0.8.3.tar.gz
- Upload date:
- Size: 25.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5df25bf7f7bea2e0276ad385fba967f4ce2dad80250740614ea5de741331e609 |
| MD5 | 38d1345bf62b6bffe9da5a91f72b9b63 |
| BLAKE2b-256 | 6ee8f005ec1404509c41409f5d73430d2d845cf530c527bfa2513cbf6511aac7 |
Provenance
The following attestation bundles were made for par_scrape-0.8.3.tar.gz:
Publisher: publish.yml on paulrobello/par_scrape
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: par_scrape-0.8.3.tar.gz
- Subject digest: 5df25bf7f7bea2e0276ad385fba967f4ce2dad80250740614ea5de741331e609
- Sigstore transparency entry: 702511574
- Sigstore integration time:
- Permalink: paulrobello/par_scrape@de43e3895413bffc199eef244b895a93728de273
- Branch / Tag: refs/heads/main
- Owner: https://github.com/paulrobello
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@de43e3895413bffc199eef244b895a93728de273
- Trigger Event: workflow_dispatch
File details
Details for the file par_scrape-0.8.3-py3-none-any.whl.
File metadata
- Download URL: par_scrape-0.8.3-py3-none-any.whl
- Upload date:
- Size: 27.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1903014340807622c62cc25f89d1347d876430f6fa135a1a3656af484757fa5 |
| MD5 | e4402237d32203a03d1465b37e46f2f9 |
| BLAKE2b-256 | d7520289b71aac5f345834e46e0df0c4562d079696e2b827cd4d97e89bfa6552 |
|
Provenance
The following attestation bundles were made for par_scrape-0.8.3-py3-none-any.whl:
Publisher: publish.yml on paulrobello/par_scrape
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: par_scrape-0.8.3-py3-none-any.whl
- Subject digest: d1903014340807622c62cc25f89d1347d876430f6fa135a1a3656af484757fa5
- Sigstore transparency entry: 702511604
- Sigstore integration time:
- Permalink: paulrobello/par_scrape@de43e3895413bffc199eef244b895a93728de273
- Branch / Tag: refs/heads/main
- Owner: https://github.com/paulrobello
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@de43e3895413bffc199eef244b895a93728de273
- Trigger Event: workflow_dispatch