Tools for conducting, collecting, and parsing web search

WebSearcher

Tools for conducting and parsing web searches

This package provides tools for conducting algorithm audits of web search, including tools for geolocating, conducting, and saving searches that are built around patchright. It also includes a modular parser built on selectolax for quickly decomposing a SERP into a list of components with categorical classifications and position-based specifications.

Recent Changes

0.11.5: Breaking logging -- import WebSearcher no longer configures logging as a side effect (no root-logger handler or forced DEBUG level on import), so an application's own logging.basicConfig now takes effect; parse-only use is silent until the application configures logging, and crawl-time logging via SearchEngine is unchanged
0.11.4: Recover the result-count/time estimate on result pages where the stats element is injected client-side and absent from the static markup, via an inline <script> fallback; plus a byte-identical classifier parse micro-optimization (union css_first probes)
0.11.3: Classify knowledge-panel and image related-entity carousels ("Search instead for", "Other people search", "You can also search for", "People also search in Images") as searches_related instead of unknown; additive heading labels, existing corpus snapshots unchanged
0.11.2: Parse the no-results and 32-word query-truncation cards as notice components (no_results / query_truncated) and capture host-group sub-results nested in a main result. Breaking output -- dropped the notice_no_results / notice_shortened_query feature flags (now notices), renamed notice_server_error to server_error, and renamed the general sub_type subresult to indented
0.11.1: Broad parser-coverage pass classifying most previously-unknown components -- new types (gallery, places_nearby, datasets, refine_by, shopping_ideas, articles), extended knowledge / recipes / products / images / top_stories coverage, modern ad sitelinks, and the AI-overview unavailable banner
0.11.0: Breaking -- dropped the selenium, zendriver, and playwright backends; patchright is now the default and drives an installed Google Chrome (patchright install chrome if missing). Crawl logs are now JSON Lines only, and SearchEngine gained close() and context-manager teardown
0.10.0: Reliable /sorry/ CAPTCHA detection, an automated weekly geotargets refresh, and richer two-tier parsed output (breaking output)
0.9.0: Breaking -- rewrote the parser onto selectolax for ~2x faster parsing (dropping BeautifulSoup + lxml) and shipped in-package demos via ws-demo

See CHANGELOG.md for a longer history of changes by version.

Getting Started

# Install from PyPI
pip install WebSearcher

# Or install with uv
uv add WebSearcher

# Install development version from GitHub
pip install git+https://github.com/gitronald/WebSearcher@dev

The default patchright browser backend drives Google Chrome (channel="chrome"), which pip can't install automatically. If Chrome isn't already installed, run this once after installing:

patchright install chrome

Or use patchright's bundled Chromium instead: run patchright install chromium and pass patchright_config={"channel": "chromium"}.

Usage

Example Search Script

WebSearcher ships runnable demos inside the package, so they work straight after pip install WebSearcher. Search and parse a query with ws-demo search, passing the query as the first argument:

uv run ws-demo search "election news"

This collects the SERP, parses it, and saves the outputs (described below). The other demos run the same way: ws-demo parse <file> (offline parse of one HTML file), ws-demo searches (a battery of queries spanning component types), ws-demo headers <query> (custom request headers), and ws-demo locations <query> (localized search). Search results change constantly, especially for news, but you can review the parsed components of any saved query with ws-demo show (add --details for a details column, --list to enumerate saved queries):

uv run ws-demo show "election news"

WebSearcher v0.11.5 | qry='election news' | 15 components

type              title                                                         url
----------------  ------------------------------------------------------------  ------------------------------------------------------------
top_stories       Jack Smith says he's 'very concerned what's going to happ...  https://www.cnbc.com/2026/07/02/jack-smith-trump-intervie...
top_stories       Trump Is Getting Tired of Losing Election Cases               https://www.theatlantic.com/politics/2026/07/trump-electi...
top_stories       Trump Promises Republicans They ‘Will Not Lose An Electio...  https://www.huffpost.com/entry/trump-republicans-election...
top_stories       Trump Targets Not Just Georgia’s Vote, but Also Trust in ...  https://www.nytimes.com/2026/07/03/us/politics/trump-geor...
top_stories       Keiko Fujimori declared winner of razor-edge Peru election    https://www.cnn.com/2026/07/03/americas/fujimori-wins-per...
general           Governor Gavin Newsom marks Fourth of July with a call fo...  https://www.gov.ca.gov/2026/07/04/governor-gavin-newsom-m...
general           Elections                                                     https://www.npr.org/sections/elections/
general           Ballotpedia.org                                               https://ballotpedia.org/Main_Page
general           Newsom to unveil felony penalties for election interferen...  https://www.abc10.com/article/news/politics/newsom-to-unv...
general           EAC News & Events | U.S. Election Assistance Commission       https://www.eac.gov/news-and-events
general           'It's going to be a battle': How Dems plan to combat Trum...  https://www.youtube.com/watch?v=1-H7R4f_ZoE
general           Election Night Results | 2026 Primary Election | Californ...  https://electionresults.sos.ca.gov/
general           Election News, Polls and Results - 270toWin                   https://www.270towin.com/news/
general           2026 Election Results: California and Bay Area Primary ...    https://www.kqed.org/voterguide
searches_related

By default, the demo saves its outputs to a directory (data/demo-ws-v{version}/) as JSON Lines files: serps.json (the HTML plus search metadata), parsed.json (the parsed results and features), and searches.json (the search metadata only, excluding HTML).
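Because each output is a JSON Lines file, it can be read back one record per line with the standard library. The sketch below is a minimal illustration, not part of WebSearcher: `read_jsonl` is a hypothetical helper, and the two records written to a temporary file are made up for the demonstration rather than real parsed.json rows.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return a list with one decoded object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demonstration on a throwaway file; the records are made up.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "parsed.json"
    sample.write_text(
        '{"qry": "election news", "type": "general"}\n'
        '{"qry": "election news", "type": "top_stories"}\n',
        encoding="utf-8",
    )
    rows = read_jsonl(sample)

print(len(rows), rows[0]["type"])
```

Reading line by line also means a large collection can be streamed instead of loaded at once, by yielding each record rather than appending it to a list.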

Step by Step

Example search and parse pipeline:

import WebSearcher as ws
se = ws.SearchEngine()                     # 1. Initialize collector
se.search('election news')                 # 2. Conduct a search
se.parse_serp()                            # 3. Parse search results
se.save_serp(append_to='serps.json')       # 4. Save HTML and metadata
se.save_parsed(append_to='parsed.json')    # 5. Save parsed results
se.close()                                 # 6. Close the browser

1. Initialize Collector

import WebSearcher as ws

# Initialize collector with method and other settings.
# `patchright` is the default browser backend; it drives your installed
# Google Chrome (channel="chrome").
se = ws.SearchEngine(
    method="patchright",
    patchright_config={
        "headless": False,
        "channel": "chrome",
        "user_data_dir": "",  # a temp profile is created when empty
    },
)

2. Conduct a Search

Logs are emitted as JSON Lines -- one structured object per line, with only the keys that apply to the event:

se.search('election news')
# {"timestamp": "2026-07-04T13:37:12.399-07:00", "pid": 62981, "level": "INFO", "event": "search", "response_code": 200, "qry": "election news", "loc": ""}
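Since each log line is a standalone JSON object, it can be decoded with `json.loads`. This is a minimal sketch using the line shown above. Which keys appear on other events is not specified here, so the code reads optional keys with `.get` instead of indexing.

```python
import json
from datetime import datetime

# The example log line from above, split across string literals.
line = (
    '{"timestamp": "2026-07-04T13:37:12.399-07:00", "pid": 62981, '
    '"level": "INFO", "event": "search", "response_code": 200, '
    '"qry": "election news", "loc": ""}'
)

entry = json.loads(line)

# An ISO 8601 timestamp with a UTC offset parses into an aware datetime.
when = datetime.fromisoformat(entry["timestamp"])

# Only keys that apply to the event are present, so use .get for the rest.
ok = entry.get("event") == "search" and entry.get("response_code") == 200

print(when.isoformat(), entry["qry"], ok)
```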

3. Parse Search Results

The example below is primarily for parsing search results as you collect HTML. See ws.parse_serp(html) for parsing existing HTML data.

se.parse_serp()

# Show first result
se.parsed.results[0]
{'section': 'main',
 'cmpt_rank': 0,
 'sub_rank': 0,
 'type': 'top_stories',
 'sub_type': None,
 'title': "Jack Smith says he's 'very concerned what's going to happen next election' under Trump",
 'url': 'https://www.cnbc.com/2026/07/02/jack-smith-trump-interview-doj.html',
 'text': None,
 'cite': None,
 'details': None,
 'serp_rank': 0}
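Because results are plain dicts, SERP composition can be summarized with the standard library. The sketch below counts result rows and distinct components per type. The rows are abbreviated stand-ins modeled on the fields shown above, not real output. `count_results` and `count_components` are hypothetical helpers, and the second one assumes that rows from the same component share a cmpt_rank.

```python
from collections import Counter

# Abbreviated stand-in rows with only the fields used below.
results = [
    {"type": "top_stories", "cmpt_rank": 0, "sub_rank": 0, "serp_rank": 0},
    {"type": "top_stories", "cmpt_rank": 0, "sub_rank": 1, "serp_rank": 1},
    {"type": "general", "cmpt_rank": 1, "sub_rank": 0, "serp_rank": 2},
    {"type": "general", "cmpt_rank": 2, "sub_rank": 0, "serp_rank": 3},
]


def count_results(rows):
    """Count result rows per component type."""
    return Counter(row["type"] for row in rows)


def count_components(rows):
    """Count distinct components per type.

    Assumes rows from the same component share a cmpt_rank.
    """
    seen = {(row["type"], row["cmpt_rank"]) for row in rows}
    return Counter(cmpt_type for cmpt_type, _ in seen)


print(count_results(results))
print(count_components(results))
```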

Result schema

Every result shares the same lean core fields (type, sub_type, title, url, text, cite, plus the section / cmpt_rank / sub_rank / serp_rank rank metadata). Anything extra lives in details, which is either None (a clean row) or a dict that always carries a type:

# clean row -- nothing extra
{..., 'details': None}

# typed content payload (a specific label)
{..., 'details': {'type': 'ratings', 'rating': '4.6', 'n_reviews': '6.3K'}}
{..., 'details': {'type': 'hyperlinks', 'items': [{'url': '...', 'text': '...'}]}}

# metadata-only row (generic 'item' type): a parse error, a hidden
# carousel-tail card, an extracted timestamp/thumbnail, etc.
{..., 'details': {'type': 'item', 'error': 'no subcomponents parsed'}}
{..., 'details': {'type': 'item', 'visible': False, 'heading': 'What people are saying'}}
{..., 'details': {'type': 'item', 'timestamp': '2 hours ago', 'img_url': 'https://...'}}

The reserved metadata keys (error, visible, timestamp, img_url) are recorded only when they carry information — visible only when False, the others when present — so the common case keeps details as None.
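In practice this means downstream code has to handle both shapes of details. A minimal sketch, using rows copied from the examples above (with a hypothetical title added to tell them apart); `details_type` and `is_hidden` are hypothetical helpers, not part of the package.

```python
def details_type(result):
    """Return the details type, or None for a clean row."""
    details = result.get("details")
    return None if details is None else details["type"]


def is_hidden(result):
    """True only when details explicitly records visible=False."""
    details = result.get("details") or {}
    return details.get("visible") is False


rows = [
    {"title": "a", "details": None},
    {"title": "b", "details": {"type": "ratings", "rating": "4.6", "n_reviews": "6.3K"}},
    {"title": "c", "details": {"type": "item", "visible": False,
                               "heading": "What people are saying"}},
]

print([details_type(r) for r in rows])
print([r["title"] for r in rows if not is_hidden(r)])
```

Checking `is False` instead of general truthiness matters here: visible is recorded only when it is False, so a missing key has to count as visible.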

4. Save HTML and Metadata

Recommended: Append the HTML and metadata as lines to a JSON Lines file. This suits larger or ongoing collections.

se.save_serp(append_to='serps.json')

Alternative: Save individual HTML files in a directory, each named by a provided or (by default) generated serp_id. This is useful for smaller qualitative explorations where you want to quickly look at what is showing up. No metadata is saved, but timestamps could be recovered from the files themselves.

se.save_serp(save_dir='./serps')
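Recovering timestamps from the files can be sketched with file modification times. This is a rough illustration under two assumptions: that the modification time reflects when the SERP was saved (true only if the files were not edited or copied afterwards), and that the saved files use an .html extension, which is not stated above. `file_timestamps` is a hypothetical helper.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_timestamps(save_dir, pattern="*.html"):
    """Map each matching file name to its modification time in UTC."""
    return {
        path.name: datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        for path in sorted(Path(save_dir).glob(pattern))
    }


# Demonstration on a throwaway directory with one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.html").write_text("<html></html>", encoding="utf-8")
    stamps = file_timestamps(tmp)

print(stamps)
```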

5. Save Parsed Results

Save to a JSON Lines file.

se.save_parsed(append_to='parsed.json')

6. Close the Browser

The browser window stays open until the engine is closed -- close it explicitly when done, or use the engine as a context manager to close it automatically:

se.close()

# or scope the whole pipeline:
with ws.SearchEngine() as se:
    se.search('election news')
    ...
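To see why the context-manager form is the safer default, the sketch below uses a stand-in class (`FakeEngine`, hypothetical, not WebSearcher's SearchEngine) whose `__exit__` calls `close()`. It shows the general Python pattern: teardown runs even if a step inside the block raises. How SearchEngine implements this internally is not shown in this README.

```python
class FakeEngine:
    """Stand-in with the same close()/context-manager shape; no browser."""

    def __init__(self):
        self.closed = False

    def search(self, qry):
        if not qry:
            raise ValueError("empty query")

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # always runs, with or without an exception
        return False   # do not swallow the exception


engine = FakeEngine()
try:
    with engine as se:
        se.search("")  # raises inside the block
except ValueError:
    pass

print(engine.closed)
```

With an explicit `se.close()` at the end of a script, the same failure would skip the call and leave the browser open unless the pipeline is wrapped in try/finally.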

Localization

To conduct localized searches (from a location of your choice), you only need
one additional data point: the "Canonical Name" of each location.

The latest dataset is shipped in this repository at
data/locations/geotargets.csv. An accompanying data/locations/ledger.csv records the upstream release each refresh pulled. The committed copies of these two files are kept current automatically by a weekly workflow. Details on this are available in the GitHub Actions section ("Update locations") below. You can also fetch the most recent version yourself by using the built-in ws.download_locations().

A brief guide on how to select a canonical name and use it to conduct a
localized search is available in a Jupyter notebook here.
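As a rough sketch of picking a canonical name, the code below filters rows by place name with the standard `csv` module. The column names (Name, Canonical Name, Country Code, Target Type) follow Google's published geotargets CSV format and are an assumption about data/locations/geotargets.csv. The two sample rows are illustrative placeholders with made-up IDs, and `find_canonical_names` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows only; the IDs are placeholders, not real criteria IDs.
SAMPLE = '''Criteria ID,Name,Canonical Name,Parent ID,Country Code,Target Type,Status
1,Boston,"Boston,Massachusetts,United States",10,US,City,Active
2,Boston,"Boston,England,United Kingdom",20,GB,City,Active
'''


def find_canonical_names(csv_file, name, country_code=None):
    """Return canonical names whose Name matches, optionally by country."""
    matches = []
    for row in csv.DictReader(csv_file):
        if row["Name"].lower() != name.lower():
            continue
        if country_code and row["Country Code"] != country_code:
            continue
        matches.append(row["Canonical Name"])
    return matches


matches = find_canonical_names(io.StringIO(SAMPLE), "boston", country_code="US")
print(matches)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object. Filtering by country is useful because place names are often shared across countries.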

Running on a headless server (Xvfb)

The patchright backend (the default) drives a real, visible Chrome: Chrome's own --headless mode can be reliably blocked, so the browser must run headed. On a server, CI runner, or container with no display ($DISPLAY unset), a headed Chrome has nothing to attach to and won't launch.

The fix is Xvfb, an in-memory X display server: it lets Chrome run genuinely headed -- no headless code path, no monitor, no GPU. This applies to Linux only (macOS Chrome uses the native window server, not X11). Install it (Debian/Ubuntu):

sudo apt-get install -y xvfb

Then wrap your collection command with xvfb-run:

env -u DISPLAY xvfb-run -a --server-args="-screen 0 1920x1080x24" \
  python your_collection_script.py

env -u DISPLAY removes any inherited display so the run can't silently fall back to a real one (e.g. an X-forwarded SSH session) -- the display Xvfb creates is then the only one in scope.
xvfb-run -a auto-picks a free display number, so concurrent jobs don't collide.
-screen 0 1920x1080x24 gives a realistic window geometry. The 1920x1080x24 is width x height x depth -- a 1920x1080 framebuffer at 24-bit (true-color) depth, i.e. a standard 1080p desktop.

The collection code itself is unchanged:

import WebSearcher as ws

se = ws.SearchEngine()
se.search("immigration news")
se.parse_serp()
se.save_serp(append_to="serps.json")

If you parallelize collection across processes, one shared Xvfb covers them all. Child workers inherit the parent's DISPLAY, so wrap the top-level command once rather than starting an Xvfb per worker.
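A collection script can fail fast with a clear message instead of a browser launch error by checking for a display first. The sketch below is a hypothetical preflight helper, not part of WebSearcher. It assumes X11 only (it ignores Wayland sessions) and treats non-Linux platforms as always having a native window server.

```python
import os
import sys


def display_available(environ=None, platform=None):
    """Return True if a headed browser has a display to attach to.

    Only Linux is checked for X11's DISPLAY; other platforms are
    assumed to provide a native window server.
    """
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    if not platform.startswith("linux"):
        return True
    return bool(environ.get("DISPLAY"))


# Explicit arguments make the check easy to exercise without a server.
print(display_available({"DISPLAY": ":99"}, "linux"))
print(display_available({}, "linux"))
```

Called with no arguments at the top of a script, it reads the real environment, so under xvfb-run it sees the display that Xvfb created.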

Contributing

Happy to have help! If you see a component that we aren't covering yet, please add it using the process below. If you aren't sure about how to write a parser, you can also create an issue and I'll try to check it out. When creating that type of issue, providing the query that produced the new component and the time it was seen are essential, a screenshot of the component would be helpful, and the HTML would be ideal. Feel free to reach out if you have questions or need help.

Repair or Enhance a Parser

1. Look up the parser name in /parsers/components/__init__.py.
2. Open the parser file at /parsers/components/{cmpt_name}.py.

Add a Parser

1. Register the component type in parsers/component_types.py -- the single source of truth for name, label, sections, and (for header-text classification) header_texts. Dispatch and classification are derived from this registry.
2. Add a classifier to classifiers/{main,footer,headers}.py for structural signals (header-text matches instead go in the registry's header_texts).
3. Add the parser as a new file in /parsers/components.
4. Add the new parser to the imports and the PARSERS catalogue in /parsers/components/__init__.py (its section dispatch and label are derived by joining this against the registry, so the name must match step 1).
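The requirement that the names match can be illustrated with a generic sketch of joining a registry against a parser catalogue. Everything here is hypothetical (the dict layouts, `build_dispatch`, the labels, and the section values), and it is not the package's actual code. It only shows the general idea of deriving dispatch from a single source of truth and failing loudly on a mismatched name.

```python
# Hypothetical registry: one entry per component type.
REGISTRY = {
    "general": {"label": "General", "sections": ("main",)},
    "searches_related": {"label": "Related Searches", "sections": ("main", "footer")},
}


def parse_general(cmpt):
    return [{"type": "general"}]


def parse_searches_related(cmpt):
    return [{"type": "searches_related"}]


# Hypothetical catalogue: parser functions keyed by component name.
PARSERS = {
    "general": parse_general,
    "searches_related": parse_searches_related,
}


def build_dispatch(registry, parsers):
    """Join parsers to registry metadata, rejecting unregistered names."""
    missing = sorted(set(parsers) - set(registry))
    if missing:
        raise KeyError(f"parsers without a registry entry: {missing}")
    return {
        name: {"parser": parsers[name], **meta}
        for name, meta in registry.items()
        if name in parsers
    }


dispatch = build_dispatch(REGISTRY, PARSERS)
print(sorted(dispatch))
```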

Testing

Run tests:

uv run pytest tests/ -q

Update snapshots:

uv run pytest tests/ --snapshot-update

Show snapshot diffs with -vv:

uv run pytest tests/ -vv

Run a specific snapshot test by serp_id prefix:

uv run pytest tests/ -k "4f4d0fed0592"

Test Fixtures

Tests load from the consolidated compressed corpus tests/fixtures/serps.json.bz2. After adding or updating records, refresh the snapshots:

uv run pytest tests/ --snapshot-update
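To inspect records in a compressed corpus outside of pytest, the standard `bz2` module can stream it in text mode. The sketch below assumes the corpus is bz2-compressed JSON Lines and that each record has a serp_id field; neither is spelled out above. It runs against a small throwaway file with made-up records rather than the real fixture, and `iter_bz2_jsonl` is a hypothetical helper.

```python
import bz2
import json
import tempfile
from pathlib import Path


def iter_bz2_jsonl(path):
    """Yield one decoded object per non-empty line of a .bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Build a throwaway corpus with two made-up records, then filter it.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "serps.json.bz2"
    with bz2.open(corpus, mode="wt", encoding="utf-8") as f:
        f.write(json.dumps({"serp_id": "aaa111", "qry": "first"}) + "\n")
        f.write(json.dumps({"serp_id": "bbb222", "qry": "second"}) + "\n")
    wanted = [r for r in iter_bz2_jsonl(corpus) if r["serp_id"].startswith("bbb")]

print(wanted)
```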

GitHub Actions

Tests (.github/workflows/test.yml)
Runs on every push and pull request to dev, master, and feature/** branches, across a Python 3.12 / 3.13 / 3.14 matrix: ruff check, ruff format --check, pyrefly check, then pytest with coverage.

Publish (.github/workflows/publish.yml)
Triggered by pushing a v* tag. Builds the package with uv build and publishes to PyPI via trusted publishing (no API tokens). It only runs when the repository variable PUBLISH_ENABLED is "true"; otherwise both jobs skip. For instructions on how to set this, see: Enable or disable PyPI publishing.

Update locations (.github/workflows/update-locations.yml)
Weekly cron (Mondays 06:00 UTC) plus manual dispatch. Refreshes the geotargets CSV (python -m WebSearcher.locations) and opens a PR only when the data changed.

Renovate (.github/workflows/renovate.yml)
Weekly cron plus manual dispatch. Self-hosted Renovate opens dependency-update PRs (config in .github/renovate.json).

To release a new version:

Tag a vX.Y.Z release on master.
Pushing the tag runs the publish workflow, which builds and uploads to PyPI (when PUBLISH_ENABLED is "true").

Similar Packages

Many of the packages I've found for collecting web search data via Python are no longer maintained, but others are still ongoing and interesting or useful. The primary strengths of WebSearcher are its parser, which provides a level of detail that enables examinations of SERP composition by recording the type and position of each result, and its modular design, which has allowed us to (intermittently) maintain it for so long and to cover such a wide array of component types (currently 46 registered in parsers/component_types.py, before counting sub_types). Feel free to add to the list of packages or services through a pull request if you are aware of others:

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Release history

This version

0.11.5

Jul 12, 2026

0.11.4

Jul 11, 2026

0.11.3

Jul 9, 2026

0.11.2

Jul 9, 2026

0.11.1

Jul 8, 2026

0.11.0

Jul 4, 2026

0.10.2

Jun 21, 2026

0.10.1

Jun 21, 2026

0.10.0

Jun 21, 2026

0.9.0

Jun 6, 2026

0.8.6

May 26, 2026

0.8.5

May 26, 2026

0.8.4

May 25, 2026

0.8.3

May 25, 2026

0.8.2

May 24, 2026

0.8.1

May 24, 2026

0.8.0

May 11, 2026

0.7.2

May 10, 2026

0.7.1

May 3, 2026

0.7.0

Mar 16, 2026

0.6.10a0 pre-release

Mar 16, 2026

0.6.9

Feb 23, 2026

0.6.8

Feb 21, 2026

0.6.7

Feb 6, 2026

0.6.6

Dec 5, 2025

0.6.5

Dec 5, 2025

0.5.2

Mar 9, 2025

0.5.1

Mar 7, 2025

0.5.0

Feb 3, 2025

0.4.6

Nov 21, 2024

0.4.5

Nov 12, 2024

0.4.4

Nov 12, 2024

0.4.3

Nov 11, 2024

0.4.1

Aug 26, 2024

0.4.0

May 28, 2024

0.3.12

May 9, 2024

0.3.11

May 8, 2024

0.3.10

May 6, 2024

0.3.9

Feb 26, 2024

0.3.8

Feb 13, 2024

0.3.7

Feb 9, 2024

0.3.6

Dec 8, 2023

0.3.5

Nov 20, 2023

0.3.4

Nov 20, 2023

0.2.9

May 19, 2021

0.2.8

Mar 8, 2021

0.2.7

Nov 30, 2020

0.2.5

Jul 24, 2020

0.2.4

May 22, 2020

0.2.1

Mar 10, 2020

0.2.0

Mar 9, 2020

0.1.14

Dec 10, 2019

0.1.12

Oct 28, 2019

0.1.11

Oct 28, 2019

0.1.10

Oct 4, 2019

0.1.9

Oct 4, 2019

0.1.8

Sep 30, 2019

0.1.7

Sep 27, 2019

Download files

Source Distribution

websearcher-0.11.5.tar.gz (144.1 kB view details)

Uploaded Jul 12, 2026 Source

Built Distribution

websearcher-0.11.5-py3-none-any.whl (171.8 kB view details)

Uploaded Jul 12, 2026 Python 3

File details

Details for the file websearcher-0.11.5.tar.gz.

File metadata

Download URL: websearcher-0.11.5.tar.gz
Upload date: Jul 12, 2026
Size: 144.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for websearcher-0.11.5.tar.gz
Algorithm	Hash digest
SHA256	`ad421c2b3781bd4030353103c8349aba7ca5aa34b00bf5a9b41fc51f4d399d5c`
MD5	`c0dcd8f666ffe7dd23a8bbe7f589da1e`
BLAKE2b-256	`5aff1e6c5eb327c266b8501322843be9b87ea3ca4cd323f68cf70cb6e36867bc`

Provenance

The following attestation bundles were made for websearcher-0.11.5.tar.gz:

Publisher: publish.yml on gitronald/WebSearcher

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: websearcher-0.11.5.tar.gz
- Subject digest: ad421c2b3781bd4030353103c8349aba7ca5aa34b00bf5a9b41fc51f4d399d5c
- Sigstore transparency entry: 2148058596
- Sigstore integration time: Jul 12, 2026
Source repository:
- Permalink: gitronald/WebSearcher@c52325a73fc3d9849d07fea82e6a21075d7d87b5
- Branch / Tag: refs/tags/v0.11.5
- Owner: https://github.com/gitronald
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c52325a73fc3d9849d07fea82e6a21075d7d87b5
- Trigger Event: push

File details

Details for the file websearcher-0.11.5-py3-none-any.whl.

File metadata

Download URL: websearcher-0.11.5-py3-none-any.whl
Upload date: Jul 12, 2026
Size: 171.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for websearcher-0.11.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`78f042e4b42597ae2d4f6b9867e9608122d4a5dc2d1ecb35b2a75ac7d5966975`
MD5	`9b3c7fa2764f1f6797c2ded4cb3fac38`
BLAKE2b-256	`f15c16badec64dae0c9629ca7a3d80c6598a77928b2faa777ebb2134370f7311`

Provenance

The following attestation bundles were made for websearcher-0.11.5-py3-none-any.whl:

Publisher: publish.yml on gitronald/WebSearcher

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: websearcher-0.11.5-py3-none-any.whl
- Subject digest: 78f042e4b42597ae2d4f6b9867e9608122d4a5dc2d1ecb35b2a75ac7d5966975
- Sigstore transparency entry: 2148058606
- Sigstore integration time: Jul 12, 2026
Source repository:
- Permalink: gitronald/WebSearcher@c52325a73fc3d9849d07fea82e6a21075d7d87b5
- Branch / Tag: refs/tags/v0.11.5
- Owner: https://github.com/gitronald
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c52325a73fc3d9849d07fea82e6a21075d7d87b5
- Trigger Event: push
