
CLI tool to compare two URLs and generate a topic & internal linking gap report as CSV.

page-gap-scanner

page-gap-scanner is a small, focused SEO CLI tool that compares two URLs from the same site and generates a topic & internal-link gap report as CSV.

The idea is simple:

  • You have two pages that might overlap in intent.
  • One of them should be the hero/winner page.
  • The other should probably support & link to it.
  • This tool compares both pages, extracts topics and headings, and shows what the supporter page is missing, along with suggested anchors and a ready-made CSV you can hand to content writers or developers.

Designed for:

  • SEOs who want fast, opinionated insights
  • Internal linking & topical cluster work
  • Pre-work before consolidation / canonical decisions

Installation

Install from PyPI:

pip install page-gap-scanner

For local development (from source):

git clone <your-repo-url>
cd page-gap-scanner
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Usage

Basic usage (Mode 2 – two URLs):

page-gap-scanner scan https://example.com/page-a https://example.com/page-b --output gaps.csv

Arguments:

  • url1 – first URL to compare
  • url2 – second URL to compare (the tool decides which of the two is the winner and which is the supporter)
  • --output – path for the CSV file (default: gaps.csv)

Example:

page-gap-scanner scan \
  https://example.com/credit-card-guide \
  https://example.com/credit-card-fees \
  --output cc_gaps.csv

This will:

  1. Fetch both URLs.
  2. Extract:
    • Page title
    • H1–H3 headings
    • Basic topic/phrase candidates from visible text.
  3. Decide which URL is the winner (more depth & structure).
  4. Find topics the winner has that the supporter does not.
  5. Generate a CSV suggesting how the supporter should link to the winner.
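
Step 4 above amounts to a set difference over the extracted topics. The following is a minimal sketch, assuming each page's topics are already available as lists of strings. The name `find_gaps` is hypothetical and is not necessarily what compare.py uses.

```python
def find_gaps(winner_topics, supporter_topics):
    """Return topics found on the winner but not on the supporter.

    Comparison is case-insensitive; the winner's original order is kept
    and duplicates are dropped.
    """
    supporter = {topic.lower() for topic in supporter_topics}
    seen = set()
    gaps = []
    for topic in winner_topics:
        key = topic.lower()
        if key not in supporter and key not in seen:
            seen.add(key)
            gaps.append(key)
    return gaps


if __name__ == "__main__":
    winner = ["Annual Fee Waiver", "Interest Rates", "annual fee waiver"]
    supporter = ["interest rates"]
    print(find_gaps(winner, supporter))
```

Keeping the winner's order means topics that appear earlier on the winner page come first in the report, which may or may not match how the real tool orders its rows.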

Output CSV

Each row represents one missing topic that the supporter page could cover and then link from to the winner page.

Columns:

  • missing_topic – topic/phrase found on winner page but not on supporter page.
  • winner_page – URL that should receive internal links / authority.
  • supporter_page – URL that should add the link.
  • recommended_change – human-readable suggestion.
  • suggested_anchor – example anchor text.
  • relevance_score – rough 1–100 score (higher = more important topic).
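
To make the column layout concrete, here is a rough sketch of how rows with these six columns could be written using Python's standard `csv` module. The helper names `build_row` and `rows_to_csv` are hypothetical, the suggestion wording is borrowed from the sample rows in this README, and the score is supplied by the caller because this README does not specify how it is computed.

```python
import csv
import io

COLUMNS = [
    "missing_topic",
    "winner_page",
    "supporter_page",
    "recommended_change",
    "suggested_anchor",
    "relevance_score",
]


def build_row(topic, winner_url, supporter_url, score):
    """Build one report row (hypothetical helper, illustrative wording)."""
    return {
        "missing_topic": topic,
        "winner_page": winner_url,
        "supporter_page": supporter_url,
        "recommended_change": (
            f"Add a short section about '{topic}' on the supporter page "
            "and link to the winner page."
        ),
        "suggested_anchor": f"learn more about {topic}",
        "relevance_score": score,
    }


def rows_to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    row = build_row(
        "annual fee waiver",
        "https://example.com/credit-card-guide",
        "https://example.com/credit-card-fees",
        78,
    )
    print(rows_to_csv([row]))
```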

Sample:

missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78

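Because the report is plain CSV, it is easy to post-process. As an illustration only, the sketch below loads the two sample rows above and keeps the rows at or above a score threshold, highest first. The function name `top_gaps` and the threshold of 80 are assumptions made for this example.

```python
import csv
import io

# The sample rows from this README, embedded so the example is self-contained.
SAMPLE = '''missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78
'''


def top_gaps(csv_text, min_score=80):
    """Return rows whose relevance_score is >= min_score, highest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if int(row["relevance_score"]) >= min_score]
    return sorted(rows, key=lambda row: int(row["relevance_score"]), reverse=True)


if __name__ == "__main__":
    for row in top_gaps(SAMPLE):
        print(row["relevance_score"], row["missing_topic"])
```

To work on a real report, read the file contents (for example `open("gaps.csv", encoding="utf-8").read()`) and pass them in place of `SAMPLE`.
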
How the “winner” page is chosen

Right now, the logic is intentionally simple and transparent:

  • Fetch HTML for both URLs.
  • Extract visible text and headings.
  • Compute a basic content score per page:
    • more words → higher score
    • more H1/H2/H3 headings → higher score

The page with the higher score is treated as the winner.
The other becomes the supporter.

In other words: the deeper, better-structured page should typically be your hero page.

You can change this logic later (e.g., integrate crawl data, link counts, or external metrics).
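
The scoring idea can be sketched with the standard library alone. This is an approximation under stated assumptions, not the code in compare.py: the names `content_score` and `pick_winner`, the heading weight of 50, and the tie-break in favour of the first URL are all chosen for illustration.

```python
import re
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Counts visible words and H1-H3 headings in an HTML document."""

    def __init__(self):
        super().__init__()
        self.word_count = 0
        self.heading_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.heading_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.word_count += len(re.findall(r"\w+", data))


def content_score(html, heading_weight=50):
    """More words and more headings both raise the score (assumed weighting)."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    return stats.word_count + heading_weight * stats.heading_count


def pick_winner(url_a, html_a, url_b, html_b):
    """Return (winner_url, supporter_url); ties go to the first URL."""
    if content_score(html_a) >= content_score(html_b):
        return url_a, url_b
    return url_b, url_a
```

Swapping in crawl data or link counts would only require changing `content_score`; `pick_winner` stays the same.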


Topic extraction (lightweight)

To avoid heavy NLP dependencies, page-gap-scanner uses a lightweight approach:

  • Collects:
    • <title>
    • <h1>, <h2>, <h3>
    • Some visible text snippets
  • Splits text into word phrases.
  • Filters out:
    • very short tokens
    • common stopwords
  • Normalises to lowercase and de-duplicates.

This keeps the tool:

  • Fast
  • Easy to install
  • Safe to run in simple environments or CI
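
A rough sketch of this kind of lightweight extraction follows. It assumes the title, headings, and text snippets have already been pulled out of the HTML as plain strings. The stopword list, the minimum token length of 3, and the rule that a filtered token ends the current phrase are illustrative choices, and `extract_topics` is a hypothetical name rather than the function in extract.py.

```python
import re

# A deliberately tiny stopword list for illustration.
STOPWORDS = {
    "a", "an", "and", "are", "for", "in", "is", "of",
    "on", "or", "the", "to", "with", "your",
}


def extract_topics(snippets, min_token_len=3):
    """Turn text snippets into lowercase, de-duplicated topic phrases.

    Each snippet is tokenised; stopwords and very short tokens act as
    phrase boundaries, so the remaining runs of tokens become phrases.
    """
    topics = []
    seen = set()
    for snippet in snippets:
        run = []
        # The trailing "" is a sentinel that flushes the final run.
        for token in re.findall(r"[a-z0-9]+", snippet.lower()) + [""]:
            if len(token) >= min_token_len and token not in STOPWORDS:
                run.append(token)
                continue
            phrase = " ".join(run)
            run = []
            if phrase and phrase not in seen:
                seen.add(phrase)
                topics.append(phrase)
    return topics


if __name__ == "__main__":
    print(extract_topics([
        "Annual Fee Waiver",
        "The Annual Fee Waiver for Cards",
        "Fees on international transactions",
    ]))
```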

Example console output

When you run the command, you’ll see something like:

Scanning:
  Winner candidate A: https://example.com/page-a
  Winner candidate B: https://example.com/page-b

Winner selected: https://example.com/page-a
Supporter:       https://example.com/page-b

Found 17 missing topics on supporter page.
CSV written to: gaps.csv

Project structure

page-gap-scanner/
  pyproject.toml
  README.md
  LICENSE
  page_gap_scanner/
    __init__.py
    cli.py
    compare.py
    fetch.py
    extract.py
    export.py
    utils.py

Key modules:

  • cli.py – defines the Typer-based CLI (page-gap-scanner).
  • fetch.py – fetches HTML safely.
  • extract.py – extracts headings & topics.
  • compare.py – core gap logic, winner/supporter decision.
  • export.py – writes the CSV file.
  • utils.py – small helpers.

Development & contribution

  1. Clone the repository.
  2. Create and activate a virtual environment.
  3. Install dependencies in editable mode:
pip install -e ".[dev]"
  4. Run the CLI locally:
page-gap-scanner scan https://example.com/a https://example.com/b

Author

Name: Amal Alexander
Email: amalalex95@gmail.com

Feel free to fork, tweak, and adapt this tool into your own SEO workflow.


License

This project is licensed under the MIT License. See the LICENSE file for details.
