Tools for conducting, collecting, and parsing web search

WebSearcher

Tools for conducting and parsing web searches

This package provides tools for conducting algorithm audits of web search. It includes a scraper built on Selenium with tools for geolocating, conducting, and saving searches, and a modular parser built on BeautifulSoup that decomposes a SERP into a list of components with categorical classifications and position-based specifications.

Recent Changes

  • 0.8.5: Minor updates to packaging for pypi, demo scripts, and documentation
  • 0.8.4: Reclassified shopping/commercial blocks that previously emitted hollow general rows (29 -> 0) into new component types — products (grid/brands), promo (shopping deals banner), most_read_articles, and buying_guide — plus a general image_strip sub_type
  • 0.8.3: Recovered parser coverage for historical/edge layouts — legacy 2024-SGE ai_overview content + unavailable state, a new recipes parser, empty knowledge (featured_results/dictionary/panel_rhs) extraction, twitter_cards card titles, and modern shopping_ads PLA cards
  • 0.8.2: Parse pipeline optimization — ~24% faster per-SERP parse_serp (dropped whole-document str(soup), classifier signal preconditions, lazy SearchEngine import); fixed the dormant is_valid hidden-survey filter
  • 0.8.1: Breaking — ai_overview promoted to a top-level component type with a section-aware parser, restructured details.sources, and section/lede citations; security and dependency bumps
  • 0.8.0: Added jobs, flights, videos, and knowledge_subcard parsers/classifiers; expanded local_results details; modernized available_on, perspectives, searches_related, and rating-widget selectors; added inspection scripts
  • 0.7.1: Added component type registry and pyrefly type checking; refreshed CI/tooling (lint, format, type-check, tag-based publish); bumped Python floor to 3.12
  • 0.7.0: Breaking changes, standardized data models on Pydantic, typed details field, and removed DetailsItem/DetailsList

See CHANGELOG.md for a longer history of changes by version.

Getting Started

# Install from PyPI
pip install WebSearcher

# Or install with uv
uv add WebSearcher

# Install development version from GitHub
pip install git+https://github.com/gitronald/WebSearcher@dev

Usage

Example Search Script

There's an example search script that can be run from the command line with uv, passing the search query as the first argument.

uv run demo-search "election news"

This collects the SERP, parses it, and saves the outputs (described below). Search results change constantly, especially for news, but you can review the parsed components of any saved query with show_parsed.py:

uv run python scripts/show_parsed.py "election news" --cat-width 12
qry='election news', components=23

┌──────────────┬────────────────────────────────────────────────────┬────────────────────────────────────────────────────┐
│ type         ┆ title                                              ┆ url                                                │
╞══════════════╪════════════════════════════════════════════════════╪════════════════════════════════════════════════════╡
│ ad           ┆ Latest Election News                               ┆ https://www.election-integrity.org/news            │
│ top_stories  ┆ 2026 Texas primary runoff election results         ┆ https://www.cbsnews.com/texas/live-updates/2026-t… │
│ top_stories  ┆ Texas runoff election live updates: Cornyn vs. Pa… ┆ https://www.usatoday.com/story/news/politics/elec… │
│ top_stories  ┆ Texas’ raucous primary runoffs end today. Here’s … ┆ https://www.texastribune.org/2026/05/26/texas-pri… │
│ top_stories  ┆ Where to vote in El Paso, what time do polls open… ┆ https://www.elpasotimes.com/story/news/politics/e… │
│ top_stories  ┆ Texas voters head to polls today for primary runo… ┆ https://www.audacy.com/krld/news/local/texas-prim… │
│ top_stories  ┆ Texas elections live updates: Trump-backed Ken Pa… ┆ https://www.nbcnews.com/politics/2026-election/li… │
│ top_stories  ┆ Trump claims 2020 election 'rigged' at least 107 … ┆ https://www.reuters.com/world/us/trump-claims-202… │
│ local_news   ┆ Get up to speed fast on the California election w… ┆ https://www.mv-voice.com/calmatters/2026/05/26/ge… │
│ local_news   ┆ Column: My pick for California governor is ... I'… ┆ https://www.latimes.com/california/newsletter/202… │
│ local_news   ┆ Voter turnout remains low in CA primary as electi… ┆ https://www.cbs8.com/video/news/local/voter-turno… │
│ local_news   ┆ California gubernatorial election: Matt Mahan fac… ┆ https://abc7news.com/post/california-gubernatoria… │
│ general      ┆ Last-minute voter guide for California governor e… ┆ https://calmatters.org/politics/elections/2026/05… │
│ general      ┆ Matt Mahan facing campaign questions, political j… ┆ https://abc7news.com/post/california-gubernatoria… │
│ general      ┆ Election 2026: Results, news and analysis          ┆ https://www.cnn.com/election/2026                  │
│ general      ┆ Ballotpedia.org                                    ┆ https://ballotpedia.org/Main_Page                  │
│ videos       ┆ Voter turnout remains low in CA primary as electi… ┆ https://www.youtube.com/watch?v=UnJEjKYuXCI        │
│ videos       ┆ Thomas Massie files statement of candidacy for 20… ┆ https://www.youtube.com/watch?v=tLu_eWYW8Pc        │
│ videos       ┆ Breaking down the Democrats' 2024 election autops… ┆ https://www.youtube.com/watch?v=exTN-Jgb6Vo        │
│ general      ┆ Elections 2026                                     ┆ https://www.npr.org/sections/elections/            │
│ general      ┆ Department of Elections                            ┆ https://www.sf.gov/departments--department-electi… │
│ general      ┆ Everything You Need to Vote - Vote.org             ┆ https://www.vote.org/                              │
│ searches_re… ┆ -                                                  ┆ -                                                  │
└──────────────┴────────────────────────────────────────────────────┴────────────────────────────────────────────────────┘

By default, that script will save the outputs to a directory (data/demo-ws-{version}/) as JSON lines files: serps.json (the HTML plus search metadata), parsed.json (the parsed results and features), and searches.json (the search metadata only, excluding HTML).

ls -hal data/demo-ws-v0.8.4/
total 1020K
drwxr-xr-x 2 user user 4.0K 2024-11-11 10:55 ./
drwxr-xr-x 8 user user 4.0K 2024-11-11 10:54 ../
-rw-r--r-- 1 user user  16K 2024-11-11 10:55 parsed.json
-rw-r--r-- 1 user user 2.0K 2024-11-11 10:55 searches.json
-rw-r--r-- 1 user user 990K 2024-11-11 10:55 serps.json
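
Because each of these files stores one JSON object per line, they can be read back with the standard library alone. The sketch below is a minimal reader and is not part of WebSearcher: the helper name read_json_lines is hypothetical, and it makes no assumption about which keys each record contains.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Read a JSON lines file into a list of Python objects (hypothetical helper)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```

For example, read_json_lines("data/demo-ws-v0.8.4/parsed.json") would return one object per saved line. Inspect the first record to see which fields your version writes.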

Step by Step

Example search and parse pipeline (via requests):

import WebSearcher as ws
se = ws.SearchEngine()                     # 1. Initialize collector
se.search('election news')                 # 2. Conduct a search
se.parse_serp()                            # 3. Parse search results
se.save_serp(append_to='serps.json')       # 4. Save HTML and metadata
se.save_parsed(append_to='parsed.json')    # 5. Save parsed results

1. Initialize Collector

import WebSearcher as ws

# Initialize collector with method and other settings
se = ws.SearchEngine(
    method="selenium",
    selenium_config={
        "headless": False,
        "use_subprocess": False,
        "driver_executable_path": "",
        "version_main": None,  # auto-detected from installed Chrome when None
    },
)

2. Conduct a Search

se.search('election news')
# 2026-05-26 09:14:22.318 | INFO | WebSearcher.searchers | 200 | election news

3. Parse Search Results

The example below is primarily for parsing search results as you collect HTML. See ws.parse_serp(html) for parsing existing HTML data.

se.parse_serp()

# Show first result
se.parsed.results[0]
{'section': 'main',
 'cmpt_rank': 0,
 'sub_rank': 0,
 'type': 'ad',
 'sub_type': 'standard',
 'title': 'Latest Election News',
 'url': 'https://www.election-integrity.org/news',
 'text': 'Latest Election News',
 'cite': 'https://www.election-integrity.org',
 'details': None,
 'error': None,
 'serp_rank': 0}
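
Since every result records a type and a serp_rank, the composition of a SERP can be summarized in a few lines. This is a rough sketch that assumes each result is a plain dict shaped like the one shown above. The helper names and the sample records are illustrative and not part of WebSearcher, so adapt the field access if your version returns model objects instead.

```python
from collections import Counter


def count_types(results):
    """Count how many results fall under each component type."""
    return Counter(r["type"] for r in results)


def first_rank_by_type(results):
    """Map each component type to the lowest serp_rank at which it appears."""
    first = {}
    for r in results:
        rank = r["serp_rank"]
        if r["type"] not in first or rank < first[r["type"]]:
            first[r["type"]] = rank
    return first


# Illustrative records with only the fields used here (not real output)
sample = [
    {"type": "ad", "serp_rank": 0},
    {"type": "top_stories", "serp_rank": 1},
    {"type": "top_stories", "serp_rank": 2},
    {"type": "general", "serp_rank": 3},
]
print(count_types(sample))
print(first_rank_by_type(sample))
```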

4. Save HTML and Metadata

Recommended: Append HTML and metadata as lines to a JSON lines file for larger or ongoing collections.

se.save_serp(append_to='serps.json')

Alternative: Save individual HTML files in a directory, named by a provided or (default) generated serp_id. Useful for smaller qualitative explorations where you want to quickly look at what is showing up. No metadata is saved, but timestamps could be recovered from the files themselves.

se.save_serp(save_dir='./serps')
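
One way to recover those timestamps is to read each file's last-modified time. The sketch below uses only the standard library. The helper name file_timestamps is hypothetical, and the "*.html" pattern is an assumption about how the saved files are named, so adjust it to match your directory. A modification time only approximates collection time if the files have not been touched since they were saved.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_timestamps(save_dir, pattern="*.html"):
    """Map each matching file name to its last-modified time in UTC."""
    stamps = {}
    for path in sorted(Path(save_dir).glob(pattern)):
        mtime = path.stat().st_mtime  # seconds since the epoch
        stamps[path.name] = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamps
```

Calling file_timestamps("./serps") returns a dict keyed by file name, which can be joined back to a serp_id if the file names are derived from it.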

5. Save Parsed Results

Save to a JSON lines file.

se.save_parsed(append_to='parsed.json')

Localization

To conduct localized searches (from a location of your choice), you only need
one additional data point: the "Canonical Name" of each location. These are
available online and can be downloaded using a built-in function
(ws.download_locations()), which checks for the most recent version.

A brief guide on how to select a canonical name and use it to conduct a
localized search is available in a Jupyter notebook here.
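
As a rough illustration of selecting a canonical name, the sketch below filters a small in-memory CSV by place name and country. The column names ("Name", "Canonical Name", "Country Code") are an assumption about the layout of the downloaded locations file, the rows and IDs are made up for the example, and find_canonical_names is a hypothetical helper rather than a WebSearcher function.

```python
import csv
import io

# Illustrative rows only; the IDs and layout here are assumptions for the sketch
SAMPLE_CSV = '''Criteria ID,Name,Canonical Name,Country Code,Target Type
0000001,Boston,"Boston,Massachusetts,United States",US,City
0000002,Boston,"Boston,England,United Kingdom",GB,City
0000003,Austin,"Austin,Texas,United States",US,City
'''


def find_canonical_names(csv_file, name, country_code=None):
    """Return canonical names whose Name column matches, optionally filtered by country."""
    matches = []
    for row in csv.DictReader(csv_file):
        if row["Name"].lower() != name.lower():
            continue
        if country_code and row["Country Code"] != country_code:
            continue
        matches.append(row["Canonical Name"])
    return matches


print(find_canonical_names(io.StringIO(SAMPLE_CSV), "boston", country_code="US"))
```

Filtering by country matters because the same place name can appear under several canonical names. Check the actual header of the downloaded file before relying on these column names.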


Contributing

Happy to have help! If you see a component that we aren't covering yet, please add it using the process below. If you aren't sure about how to write a parser, you can also create an issue and I'll try to check it out. When creating that type of issue, providing the query that produced the new component and the time it was seen are essential, a screenshot of the component would be helpful, and the HTML would be ideal. Feel free to reach out if you have questions or need help.

Repair or Enhance a Parser

  1. Examine parser names in /component_parsers/__init__.py
  2. Find parser file as /component_parsers/{cmpt_name}.py.

Add a Parser

  1. Add classifier to classifiers/{main,footer,headers}.py
  2. Add parser as new file in /component_parsers
  3. Add new parser to imports and catalogue in /component_parsers/__init__.py

Testing

Run tests:

uv run pytest tests/ -q

Update snapshots:

uv run pytest tests/ --snapshot-update

Show snapshot diffs with -vv:

uv run pytest tests/ -vv

Run a specific snapshot test by serp_id prefix:

uv run pytest tests/ -k "45b6e019bfa2"

Test Fixtures

Tests load from compressed fixtures in tests/fixtures/. To update fixtures after collecting new demo data:

uv run python scripts/condense_fixtures.py 0.6.7
uv run pytest tests/ --snapshot-update

GitHub Actions

Test Workflow (.github/workflows/test.yml): Runs the test suite on every push to dev.

Release Workflow (.github/workflows/publish.yml): Publishes to PyPI when a pull request is merged into master:

  • Builds the package using uv
  • Publishes using trusted publishing (no API tokens required)

To release a new version:

  1. Merge dev into master via PR
  2. Once merged, the package is automatically published to PyPI

Similar Packages

Many of the packages I've found for collecting web search data via Python are no longer maintained, but others are still ongoing and interesting or useful. The primary strengths of WebSearcher are its parser, which records the type and position of each result and so enables detailed examinations of SERP composition, and its modular design, which has allowed us to (intermittently) maintain it for so long and to cover such a wide array of component types (currently 45 without considering sub_types). Feel free to add to the list of packages or services through a pull request if you are aware of others:


License

Copyright (C) 2017-2026 Ronald E. Robertson rer@acm.org

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
