CLI tool to compare two URLs and generate a topic & internal linking gap report as CSV.

page-gap-scanner

page-gap-scanner is a small, focused SEO CLI tool that compares two URLs from the same site and generates a topic & internal-link gap report as CSV.

The idea is simple:

  • You have two pages that might overlap in intent.
  • One of them should be the hero/winner page.
  • The other should probably support & link to it.
  • This tool compares both pages, extracts topics/headings, and shows you what the supporter page is missing. It also suggests anchors and writes a ready-made CSV you can hand over to content / devs.

Designed for:

  • SEOs who want fast, opinionated insights
  • Internal linking & topical cluster work
  • Pre-work before consolidation / canonical decisions

Installation

Install from PyPI:

pip install page-gap-scanner

For local development (from source):

git clone <your-repo-url>
cd page-gap-scanner
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Usage

Basic usage (Mode 2 – two URLs):

page-gap-scanner scan https://example.com/page-a https://example.com/page-b --output gaps.csv

Arguments:

  • url1 – first URL to compare (either page may end up as winner or supporter)
  • url2 – second URL to compare, from the same site
  • --output – path for the CSV file (default: gaps.csv)

Example:

page-gap-scanner scan \
  https://example.com/credit-card-guide \
  https://example.com/credit-card-fees \
  --output cc_gaps.csv

This will:

  1. Fetch both URLs.
  2. Extract:
    • Page title
    • H1–H3 headings
    • Basic topic/phrase candidates from visible text.
  3. Decide which URL is the winner (more depth & structure).
  4. Find topics the winner has that the supporter does not.
  5. Generate a CSV suggesting how the supporter should link to the winner.
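
Step 4 can be pictured as a set difference over normalised topics. The following is a minimal sketch, not the tool's actual code: the function names normalise and find_missing_topics are hypothetical, and the input lists stand in for whatever the extraction step produces.

```python
def normalise(topic: str) -> str:
    """Lowercase and collapse whitespace so equivalent topics compare equal."""
    return " ".join(topic.lower().split())


def find_missing_topics(winner_topics, supporter_topics):
    """Return winner topics that the supporter lacks, keeping winner order."""
    supporter_seen = {normalise(t) for t in supporter_topics}
    missing = []
    emitted = set()
    for topic in winner_topics:
        key = normalise(topic)
        # Skip empty strings, topics the supporter already has, and repeats.
        if key and key not in supporter_seen and key not in emitted:
            missing.append(key)
            emitted.add(key)
    return missing


# Illustrative inputs only; real topics come from the fetched pages.
winner = ["Annual fee waiver", "International transaction charges", "Reward points"]
supporter = ["reward  points", "Late payment fees"]
print(find_missing_topics(winner, supporter))
```

Keeping the winner's order means the first rows of the report follow the structure of the winner page, which tends to make the CSV easier to review by hand.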

Output CSV

Each row represents one missing topic that the supporter page could cover and link from.

Columns:

  • missing_topic – topic/phrase found on winner page but not on supporter page.
  • winner_page – URL that should receive internal links / authority.
  • supporter_page – URL that should add the link.
  • recommended_change – human-readable suggestion.
  • suggested_anchor – example anchor text.
  • relevance_score – rough 1–100 score (higher = more important topic).
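
A report with these columns could be produced with the standard csv module. This is a sketch under assumptions: build_row and write_report are hypothetical names, and the wording templates for recommended_change and suggested_anchor are illustrative rather than the tool's real templates.

```python
import csv
import io

# Column order as documented above.
COLUMNS = [
    "missing_topic",
    "winner_page",
    "supporter_page",
    "recommended_change",
    "suggested_anchor",
    "relevance_score",
]


def build_row(topic, winner_url, supporter_url, score):
    """Assemble one report row; the text templates are assumptions."""
    return {
        "missing_topic": topic,
        "winner_page": winner_url,
        "supporter_page": supporter_url,
        "recommended_change": (
            f"Add a short section about '{topic}' on the supporter page "
            "and link to the winner page."
        ),
        "suggested_anchor": f"learn more about {topic}",
        "relevance_score": score,
    }


def write_report(rows, handle):
    """Write the header and all rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example free of file-system side effects.
buffer = io.StringIO()
write_report(
    [
        build_row(
            "international transaction charges",
            "https://example.com/credit-card-guide",
            "https://example.com/credit-card-fees",
            84,
        )
    ],
    buffer,
)
print(buffer.getvalue())
```

To write a real file instead, pass a handle opened with open("gaps.csv", "w", newline="", encoding="utf-8") so the csv module controls line endings.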

Sample:

missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78

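Once the report exists, it can be filtered and sorted with a few lines of Python before handing it over. The sketch below reuses the two sample rows above; the helper name top_gaps and the min_score threshold are assumptions for illustration, not part of the tool.

```python
import csv
import io

# The two sample rows from above, embedded so the example is self-contained.
SAMPLE = '''missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78
'''


def top_gaps(handle, min_score=80):
    """Return rows at or above min_score, highest relevance first."""
    rows = [
        row for row in csv.DictReader(handle)
        if int(row["relevance_score"]) >= min_score
    ]
    return sorted(rows, key=lambda row: int(row["relevance_score"]), reverse=True)


for row in top_gaps(io.StringIO(SAMPLE)):
    print(row["relevance_score"], row["missing_topic"], "->", row["winner_page"])
```

For a report on disk, replace io.StringIO(SAMPLE) with open("gaps.csv", newline="", encoding="utf-8").
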
How the “winner” page is chosen

Right now, the logic is intentionally simple and transparent:

  • Fetch HTML for both URLs.
  • Extract visible text and headings.
  • Compute a basic content score per page:
    • more words → higher score
    • more H1/H2/H3 headings → higher score

The page with the higher score is treated as the winner.
The other becomes the supporter.

In other words: the deeper, better-structured page should typically be your hero page.

You can change this logic later (e.g., integrate crawl data, link counts, or external metrics).
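
The scoring idea can be sketched as follows. The exact formula is not documented here, so the additive score and the HEADING_WEIGHT value are assumptions, as are the names PageStats, content_score, and pick_winner.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical container for the numbers the score is based on."""
    url: str
    word_count: int
    heading_count: int  # total number of H1/H2/H3 headings


# Assumption: one heading counts as much as 50 words. The real weighting
# is not stated in this README.
HEADING_WEIGHT = 50


def content_score(page: PageStats) -> int:
    """More words and more headings both raise the score."""
    return page.word_count + HEADING_WEIGHT * page.heading_count


def pick_winner(a: PageStats, b: PageStats):
    """Return (winner, supporter); on a tie the first page wins."""
    if content_score(b) > content_score(a):
        return b, a
    return a, b


page_a = PageStats("https://example.com/page-a", word_count=1800, heading_count=12)
page_b = PageStats("https://example.com/page-b", word_count=600, heading_count=4)
winner, supporter = pick_winner(page_a, page_b)
print("Winner selected:", winner.url)
print("Supporter:      ", supporter.url)
```

Swapping in other signals (crawl data, link counts, external metrics) would only require changing content_score, which is one reason to keep the score in a single small function.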


Topic extraction (lightweight)

To avoid heavy NLP dependencies, page-gap-scanner uses a lightweight approach:

  • Collects:
    • <title>
    • <h1>, <h2>, <h3>
    • Some visible text snippets
  • Splits text into word phrases.
  • Filters out:
    • very short tokens
    • common stopwords
  • Normalises to lowercase and de-duplicates.

This keeps the tool:

  • Fast
  • Easy to install
  • Safe to run in simple environments or CI
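
A dependency-free version of this approach can be sketched with the standard library's html.parser. This is an illustration, not the contents of extract.py: the class and function names, the tiny stopword set, and the minimum token length are all assumptions, and body text snippets are left out to keep the example short.

```python
import re
from html.parser import HTMLParser

# Assumption: a deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "on", "is", "your", "with"}
TOPIC_TAGS = {"title", "h1", "h2", "h3"}


class HeadingCollector(HTMLParser):
    """Collect the text inside <title>, <h1>, <h2>, and <h3> elements."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag currently being captured, if any
        self._buffer = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in TOPIC_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append(text)
            self._current = None


def extract_topics(html, min_length=3):
    """Return lowercase, de-duplicated topic phrases in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    topics = []
    seen = set()
    for heading in parser.headings:
        words = re.findall(r"[a-z0-9]+", heading.lower())
        # Drop very short tokens and common stopwords.
        kept = [w for w in words if len(w) >= min_length and w not in STOPWORDS]
        phrase = " ".join(kept)
        if phrase and phrase not in seen:
            seen.add(phrase)
            topics.append(phrase)
    return topics


PAGE = """<html><head><title>Credit Card Guide</title></head>
<body><h1>Credit Card Guide</h1>
<h2>Annual fee waiver</h2>
<h3>International transaction charges</h3>
<p>Body text is ignored in this sketch.</p></body></html>"""
print(extract_topics(PAGE))
```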

Example console output

When you run the command, you’ll see something like:

Scanning:
  Winner candidate A: https://example.com/page-a
  Winner candidate B: https://example.com/page-b

Winner selected: https://example.com/page-a
Supporter:       https://example.com/page-b

Found 17 missing topics on supporter page.
CSV written to: gaps.csv

Project structure

page-gap-scanner/
  pyproject.toml
  README.md
  LICENSE
  page_gap_scanner/
    __init__.py
    cli.py
    compare.py
    fetch.py
    extract.py
    export.py
    utils.py

Key modules:

  • cli.py – defines the Typer-based CLI (page-gap-scanner).
  • fetch.py – fetches HTML safely.
  • extract.py – extracts headings & topics.
  • compare.py – core gap logic, winner/supporter decision.
  • export.py – writes the CSV file.
  • utils.py – small helpers.
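
What "safely" means for fetch.py is not spelled out here. One plausible reading, sketched below with the standard library only, is to accept http/https URLs only, set a timeout, and cap the number of bytes read. The names validate_url and fetch_html, the limits, and the User-Agent string are all assumptions.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Assumed limits; the real tool may use different values or a different client.
MAX_BYTES = 2_000_000
TIMEOUT_SECONDS = 10


def validate_url(url):
    """Reject anything that is not an absolute http(s) URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Unsupported URL: {url!r}")
    return url


def fetch_html(url):
    """Fetch a page with a timeout and a size cap, returning decoded text."""
    request = Request(
        validate_url(url),
        headers={"User-Agent": "page-gap-scanner-sketch"},  # hypothetical value
    )
    with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read(MAX_BYTES).decode(charset, errors="replace")


# Only the validation step runs here, so the example needs no network access.
print(validate_url("https://example.com/page-a"))
```

Validating the scheme before opening the URL matters because urlopen also accepts schemes such as file:, which a CLI that takes arbitrary input should not follow.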

Development & contribution

  1. Clone the repository.
  2. Create and activate a virtual environment.
  3. Install dependencies in editable mode:
pip install -e ".[dev]"
  4. Run the CLI locally:
page-gap-scanner scan https://example.com/a https://example.com/b

Author

Name: Amal Alexander
Email: amalalex95@gmail.com

Feel free to fork, tweak, and adapt this tool into your own SEO workflow.


License

This project is licensed under the MIT License. See the LICENSE file for details.
