CLI tool to compare two URLs and generate a topic & internal linking gap report as CSV.

page-gap-scanner

page-gap-scanner is a small, focused SEO CLI tool that compares two URLs from the same site and generates a topic & internal-link gap report as CSV.

The idea is simple:

You have two pages that might overlap in intent.
One of them should be the hero/winner page.
The other should probably support & link to it.
This tool compares both pages, extracts topics and headings, and shows what the supporter page is missing, along with suggested anchors and a ready-made CSV you can hand over to content or dev teams.

Designed for:

SEOs who want fast, opinionated insights
Internal linking & topical cluster work
Pre-work before consolidation / canonical decisions

Installation

Install from PyPI:

pip install page-gap-scanner

For local development (from source):

git clone <your-repo-url>
cd page-gap-scanner
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

CLI Usage

Basic usage (Mode 2 – two URLs):

page-gap-scanner scan https://example.com/page-a https://example.com/page-b --output gaps.csv

Arguments:

url1 – first URL to compare (may end up as either winner or supporter)
url2 – second URL to compare (same site as url1)
--output – path for the CSV file (default: gaps.csv)

Example:

page-gap-scanner scan \
  https://example.com/credit-card-guide \
  https://example.com/credit-card-fees \
  --output cc_gaps.csv

This will:

Fetch both URLs.
Extract:
- Page title
- H1–H3 headings
- Basic topic/phrase candidates from visible text.
Decide which URL is the winner (more depth & structure).
Find topics the winner has that the supporter does not.
Generate a CSV suggesting how the supporter should link to the winner.
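
The gap-finding step above can be sketched as a set difference between the two topic lists. This is a minimal illustration, not the tool's actual code: the function name `find_gaps` is hypothetical, the dictionary keys borrow the column names from the sample CSV in this README, and the relevance score is left out because its formula is not described here.

```python
def find_gaps(winner_url, supporter_url, winner_topics, supporter_topics):
    """Return one report row per topic found on the winner but not the supporter.

    Hypothetical helper: comparison is case-insensitive and de-duplicated.
    """
    supporter_set = {topic.lower() for topic in supporter_topics}
    seen = set()
    rows = []
    for topic in winner_topics:
        key = topic.lower()
        if key in supporter_set or key in seen:
            continue  # already covered by the supporter, or already reported
        seen.add(key)
        rows.append({
            "missing_topic": key,
            "winner_page": winner_url,
            "supporter_page": supporter_url,
            "recommended_change": (
                f"Add a short section about '{key}' on the supporter page "
                "and link to the winner page."
            ),
            "suggested_anchor": f"learn more about {key}",
        })
    return rows


rows = find_gaps(
    "https://example.com/credit-card-guide",
    "https://example.com/credit-card-fees",
    ["Annual Fee Waiver", "late fees", "annual fee waiver"],
    ["Late Fees"],
)
for row in rows:
    print(row["missing_topic"], "->", row["suggested_anchor"])
```

Keeping the first occurrence of each topic preserves the winner page's own ordering, which tends to follow its heading structure.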

Output CSV

Each row represents one topic that the supporter page is missing and could cover with a link to the winner page.

Columns:

missing_topic – topic/phrase found on winner page but not on supporter page.
winner_page – URL that should receive internal links / authority.
supporter_page – URL that should add the link.
recommended_change – human-readable suggestion.
suggested_anchor – example anchor text.
relevance_score – rough 1–100 score (higher = more important topic).

Sample:

missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78
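
Because the report is plain CSV, it can be post-processed with the standard library. The sketch below reads a report and keeps only rows at or above a score threshold, sorted from most to least relevant. The helper name `load_gaps` and the `min_score` parameter are assumptions for illustration; they are not part of the tool.

```python
import csv
import io

# Two rows shaped like the sample above (recommended_change shortened).
SAMPLE = '''missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention it and link to the full guide.","full guide on annual fee waivers",78
'''


def load_gaps(handle, min_score=0):
    """Read report rows from an open file, filter by score, sort descending."""
    rows = []
    for row in csv.DictReader(handle):
        row["relevance_score"] = int(row["relevance_score"])
        if row["relevance_score"] >= min_score:
            rows.append(row)
    rows.sort(key=lambda r: r["relevance_score"], reverse=True)
    return rows


for gap in load_gaps(io.StringIO(SAMPLE), min_score=80):
    print(gap["relevance_score"], gap["missing_topic"])
```

For a real report, pass `open("gaps.csv", newline="", encoding="utf-8")` instead of the in-memory sample.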

How the “winner” page is chosen

Right now, the logic is intentionally simple and transparent:

Fetch HTML for both URLs.
Extract visible text and headings.
Compute a basic content score per page:
- more words → higher score
- more H1/H2/H3 headings → higher score

The page with the higher score is treated as the winner.
The other becomes the supporter.

In other words: the deeper, better-structured page should typically be your hero page.

You can change this logic later (e.g., integrate crawl data, link counts, or external metrics).
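
A scoring rule of this kind might look like the sketch below. The weights, the function names `content_score` and `pick_winner`, and the tie-break (the first URL wins a tie) are all assumptions chosen for illustration; the README only states that more words and more headings raise the score.

```python
import re


def content_score(visible_text, headings, word_weight=1.0, heading_weight=50.0):
    """Score a page from its word count and its number of H1-H3 headings.

    The weights are illustrative placeholders, not the tool's real values.
    """
    words = re.findall(r"\w+", visible_text)
    return word_weight * len(words) + heading_weight * len(headings)


def pick_winner(page_a, page_b):
    """Each page is a (url, visible_text, headings) tuple.

    Returns (winner_url, supporter_url); a tie goes to page_a.
    """
    score_a = content_score(page_a[1], page_a[2])
    score_b = content_score(page_b[1], page_b[2])
    if score_b > score_a:
        return page_b[0], page_a[0]
    return page_a[0], page_b[0]


guide = ("https://example.com/page-a", "word " * 500, ["Guide", "Fees", "Rewards"])
fees = ("https://example.com/page-b", "word " * 100, ["Fees"])
winner, supporter = pick_winner(guide, fees)
print("Winner selected:", winner)
print("Supporter:      ", supporter)
```

Exposing the weights as parameters makes it easy to experiment before swapping in crawl data or link counts.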

Topic extraction (lightweight)

To avoid heavy NLP dependencies, page-gap-scanner uses a lightweight approach:

Collects:
- <title>
- <h1>, <h2>, <h3>
- Some visible text snippets
Splits text into word phrases.
Filters out:
- very short tokens
- common stopwords
Normalises to lowercase and de-duplicates.

This keeps the tool:

Fast
Easy to install
Safe to run in simple environments or CI
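
The extraction steps above can be approximated with the standard library alone. This sketch covers only the title and H1–H3 headings (not the visible-text snippets), and the stopword list, the minimum token length, and the names `HeadingCollector` and `extract_topics` are assumptions for illustration rather than the tool's real implementation.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on", "with"}


class HeadingCollector(HTMLParser):
    """Collect the text inside <title>, <h1>, <h2> and <h3> elements."""

    TAGS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._current = None
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._current = tag
            self.snippets.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.snippets[-1] += data


def extract_topics(html, min_len=3):
    """Lowercase, drop short tokens and stopwords, and de-duplicate phrases."""
    parser = HeadingCollector()
    parser.feed(html)
    topics = []
    for snippet in parser.snippets:
        tokens = [
            token
            for token in re.findall(r"[a-z0-9]+", snippet.lower())
            if len(token) >= min_len and token not in STOPWORDS
        ]
        phrase = " ".join(tokens)
        if phrase and phrase not in topics:
            topics.append(phrase)
    return topics


page = (
    "<html><head><title>Credit Card Guide</title></head><body>"
    "<h1>Credit Card Guide</h1><h2>The Annual Fee Waiver</h2>"
    "<p>Body text is ignored in this sketch.</p>"
    "<h3>International Transaction Charges</h3></body></html>"
)
print(extract_topics(page))
```

Treating each heading as one phrase keeps the output readable as anchor-text candidates; splitting body text into n-grams would be a natural extension.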

Example console output

When you run the command, you’ll see something like:

Scanning:
  Winner candidate A: https://example.com/page-a
  Winner candidate B: https://example.com/page-b

Winner selected: https://example.com/page-a
Supporter:       https://example.com/page-b

Found 17 missing topics on supporter page.
CSV written to: gaps.csv

Project structure

page-gap-scanner/
  pyproject.toml
  README.md
  LICENSE
  page_gap_scanner/
    __init__.py
    cli.py
    compare.py
    fetch.py
    extract.py
    export.py
    utils.py

Key modules:

cli.py – defines the Typer-based CLI (page-gap-scanner).
fetch.py – fetches HTML safely.
extract.py – extracts headings & topics.
compare.py – core gap logic, winner/supporter decision.
export.py – writes the CSV file.
utils.py – small helpers.
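
What "safely" means for fetch.py is not spelled out in this README. One plausible reading, sketched below with the standard library only, is to accept only http(s) URLs, require both URLs to share a host, set a timeout, and cap the number of bytes read. The names `validate_pair`, `fetch_html`, and `MAX_BYTES`, along with the specific limits, are hypothetical, and whether the real tool enforces any of these checks is an assumption.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

MAX_BYTES = 2_000_000  # illustrative cap on how much of a response is read


def validate_pair(url1, url2):
    """Raise ValueError unless both URLs are http(s) and on the same host."""
    parts = [urlparse(url) for url in (url1, url2)]
    for part in parts:
        if part.scheme not in ("http", "https") or not part.hostname:
            raise ValueError(f"Not an http(s) URL: {part.geturl()}")
    if parts[0].hostname != parts[1].hostname:
        raise ValueError("Both URLs must be on the same host")


def fetch_html(url, timeout=10.0):
    """Fetch a page with a timeout and a size cap, returning decoded text."""
    request = Request(url, headers={"User-Agent": "page-gap-scanner-sketch"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read(MAX_BYTES)
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Validation needs no network access; fetch_html would be called afterwards.
validate_pair("https://example.com/page-a", "https://example.com/page-b")
```

Validating before fetching means a typo in either URL fails fast, before any request is sent.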

Development & contribution

Clone the repository.
Create and activate a virtual environment.
Install dependencies in editable mode:

pip install -e ".[dev]"

Run the CLI locally:

page-gap-scanner scan https://example.com/a https://example.com/b

Author

Name: Amal Alexander
Email: amalalex95@gmail.com

Feel free to fork, tweak, and adapt this tool into your own SEO workflow.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Release history

0.1.3 – Nov 30, 2025
0.1.2 – Nov 30, 2025

Download files

Source Distribution

page_gap_scanner-0.1.2.tar.gz (8.7 kB)

Uploaded Nov 30, 2025 Source

Built Distribution

page_gap_scanner-0.1.2-py3-none-any.whl (9.1 kB)

Uploaded Nov 30, 2025 Python 3

File details

Details for the file page_gap_scanner-0.1.2.tar.gz.

File metadata

Download URL: page_gap_scanner-0.1.2.tar.gz
Upload date: Nov 30, 2025
Size: 8.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for page_gap_scanner-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`9d0625e6b4e531890605d8af09841f8abc4856381c347e428d3ffd1cd03e71ea`
MD5	`b73154ddcc2076ae64676aa3f4713601`
BLAKE2b-256	`4790b0d0fcf3a5795584101d2b6c8ed7ed1f1471aface6b93317c78a54de08bb`

File details

Details for the file page_gap_scanner-0.1.2-py3-none-any.whl.

File metadata

Download URL: page_gap_scanner-0.1.2-py3-none-any.whl
Upload date: Nov 30, 2025
Size: 9.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for page_gap_scanner-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09038cc3c7e525a81777488388f2698b851a051a81d12148abf0e7c3dfd273f0`
MD5	`f2f4c3a20cc095a91b5e7c693c5154df`
BLAKE2b-256	`0eda2dc34107ac5b211405c8517ea3fd90517ac34affeb946cf16c1ff72bec01`
