CLI tool to compare two URLs and generate a topic & internal linking gap report as CSV.
Project description
page-gap-scanner
page-gap-scanner is a small, focused SEO CLI tool that compares two URLs from the same site and generates a topic & internal-link gap report as CSV.
The idea is simple:
- You have two pages that might overlap in intent.
- One of them should be the hero/winner page.
- The other should probably support & link to it.
- This tool compares both pages, extracts topics/headings, and shows you what the supporter page is missing — along with suggested anchors and a ready-made CSV you can hand over to content / devs.
Designed for:
- SEOs who want fast, opinionated insights
- Internal linking & topical cluster work
- Pre-work before consolidation / canonical decisions
Installation
Install from PyPI:
pip install page-gap-scanner
For local development (from source):
git clone <your-repo-url>
cd page-gap-scanner
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e .
CLI Usage
Basic usage (Mode 2 – two URLs):
page-gap-scanner scan https://example.com/page-a https://example.com/page-b --output gaps.csv
Arguments:
- url1 – first URL (candidate winner/supporter)
- url2 – second URL
- --output – path for the CSV file (default: gaps.csv)
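To make the argument layout concrete, here is a minimal sketch of the same interface. The tool itself is Typer-based; this sketch swaps in the standard-library argparse purely to stay dependency-free, and `build_parser` is a hypothetical name, not part of the package.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented `scan` arguments."""
    parser = argparse.ArgumentParser(prog="page-gap-scanner")
    subcommands = parser.add_subparsers(dest="command", required=True)

    scan = subcommands.add_parser("scan", help="compare two URLs")
    scan.add_argument("url1", help="first URL (candidate winner/supporter)")
    scan.add_argument("url2", help="second URL")
    # Default taken from the argument list above.
    scan.add_argument("--output", default="gaps.csv", help="path for the CSV file")
    return parser


args = build_parser().parse_args(
    ["scan", "https://example.com/page-a", "https://example.com/page-b"]
)
print(args.url1, args.url2, args.output)
```

When `--output` is omitted, `args.output` falls back to `gaps.csv`, matching the documented default.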
Example:
page-gap-scanner scan \
https://example.com/credit-card-guide \
https://example.com/credit-card-fees \
--output cc_gaps.csv
This will:
- Fetch both URLs.
- Extract:
  - Page title
  - H1–H3 headings
  - Basic topic/phrase candidates from visible text.
- Decide which URL is the winner (more depth & structure).
- Find topics the winner has that the supporter does not.
- Generate a CSV suggesting how the supporter should link to the winner.
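The extraction step (page title plus H1–H3 headings) can be approximated with the standard library alone. This is a sketch, not the package's actual extract.py; the class name `HeadingCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of <title>, <h1>, <h2> and <h3> elements."""

    TRACKED = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.found = []       # (tag, text) pairs in document order
        self._current = None  # tracked tag we are currently inside, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from nested markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.found.append((tag, text))
            self._current = None


sample_html = (
    "<html><head><title>Card Guide</title></head><body>"
    "<h1>Credit Card Guide</h1><p>Intro text</p>"
    "<h2>Annual <em>fee</em> waiver</h2></body></html>"
)
collector = HeadingCollector()
collector.feed(sample_html)
print(collector.found)
```

Text inside nested inline tags such as `<em>` is kept as part of the heading, while paragraph text outside the tracked tags is ignored.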
Output CSV
Each row represents one missing topic that the supporter page could cover and link for.
Columns:
- missing_topic – topic/phrase found on winner page but not on supporter page.
- winner_page – URL that should receive internal links / authority.
- supporter_page – URL that should add the link.
- recommended_change – human-readable suggestion.
- suggested_anchor – example anchor text.
- relevance_score – rough 1–100 score (higher = more important topic).
Sample:
missing_topic,winner_page,supporter_page,recommended_change,suggested_anchor,relevance_score
"international transaction charges","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Add a short section about 'international transaction charges' on the supporter page and link to the winner page.","learn more about international transaction charges",84
"annual fee waiver","https://example.com/credit-card-guide","https://example.com/credit-card-fees","Mention 'annual fee waiver' and link to the full guide for details.","full guide on annual fee waivers",78
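As a rough illustration of how rows like these could be produced, the sketch below takes the set difference between the two pages' topics and writes it with `csv.DictWriter`. The function names, the suggestion wording, and the fixed `relevance_score` of 50 are all assumptions for the example; the tool's real scoring is not documented here.

```python
import csv
import io

COLUMNS = [
    "missing_topic", "winner_page", "supporter_page",
    "recommended_change", "suggested_anchor", "relevance_score",
]


def build_gap_rows(winner_url, supporter_url, winner_topics, supporter_topics):
    """Return one row per topic found on the winner but not on the supporter."""
    rows = []
    for topic in sorted(set(winner_topics) - set(supporter_topics)):
        rows.append({
            "missing_topic": topic,
            "winner_page": winner_url,
            "supporter_page": supporter_url,
            "recommended_change": f"Mention '{topic}' and link to the winner page.",
            "suggested_anchor": f"learn more about {topic}",
            "relevance_score": 50,  # placeholder, not the tool's real score
        })
    return rows


def write_gap_csv(rows, handle):
    """Write the rows, header first, to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


rows = build_gap_rows(
    "https://example.com/credit-card-guide",
    "https://example.com/credit-card-fees",
    ["annual fee waiver", "late payment fee"],
    ["late payment fee"],
)
buffer = io.StringIO()
write_gap_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing a handle from `open(path, "w", newline="")` would write the same content to disk.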
How the “winner” page is chosen
Right now, the logic is intentionally simple and transparent:
- Fetch HTML for both URLs.
- Extract visible text and headings.
- Compute a basic content score per page:
  - more words → higher score
  - more H1/H2/H3 headings → higher score
The page with the higher score is treated as the winner.
The other becomes the supporter.
In other words: the deeper, better-structured page should typically be your hero page.
You can change this logic later (e.g., integrate crawl data, link counts, or external metrics).
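A minimal sketch of this kind of scoring is shown below. The exact formula is not stated above, so the `heading_weight` of 50 and the tie-breaking rule are assumptions, and both function names are hypothetical.

```python
def content_score(word_count, heading_count, heading_weight=50):
    """More words and more H1-H3 headings both raise the score.

    heading_weight is an assumed value, not taken from the tool.
    """
    return word_count + heading_weight * heading_count


def pick_winner(page_a, page_b):
    """Return (winner, supporter) for two page dicts.

    Each dict carries 'url', 'word_count' and 'heading_count'.
    Ties go to page_a, which is an arbitrary choice for this sketch.
    """
    score_a = content_score(page_a["word_count"], page_a["heading_count"])
    score_b = content_score(page_b["word_count"], page_b["heading_count"])
    return (page_a, page_b) if score_a >= score_b else (page_b, page_a)


guide = {"url": "https://example.com/credit-card-guide", "word_count": 1800, "heading_count": 12}
fees = {"url": "https://example.com/credit-card-fees", "word_count": 900, "heading_count": 4}

winner, supporter = pick_winner(fees, guide)
print("Winner:", winner["url"])
print("Supporter:", supporter["url"])
```

With these made-up numbers the guide scores 2400 against 1100 for the fees page, so it is selected as the winner regardless of argument order.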
Topic extraction (lightweight)
To avoid heavy NLP dependencies, page-gap-scanner uses a lightweight approach:
- Collects:
  - <title>
  - <h1>, <h2>, <h3>
  - Some visible text snippets
- Splits text into word phrases.
- Filters out:
  - very short tokens
  - common stopwords
- Normalises to lowercase and de-duplicates.
This keeps the tool:
- Fast
- Easy to install
- Safe to run in simple environments or CI
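The steps above can be sketched in a few lines of standard-library Python. The stopword list, the minimum token length of 3, and the choice of two-word phrases are assumptions for illustration; the package may use different values.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is", "with", "your"}


def extract_topics(text, min_length=3, phrase_size=2):
    """Return de-duplicated lowercase phrases in first-seen order."""
    # Normalise to lowercase and split into alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Drop very short tokens and common stopwords.
    kept = [w for w in words if len(w) >= min_length and w not in STOPWORDS]

    topics = []
    seen = set()
    # Slide a window of phrase_size words over the remaining tokens.
    for i in range(len(kept) - phrase_size + 1):
        phrase = " ".join(kept[i:i + phrase_size])
        if phrase not in seen:
            seen.add(phrase)
            topics.append(phrase)
    return topics


topics = extract_topics("Annual fee waiver and the annual fee waiver rules")
print(topics)
```

Because filtering happens before phrases are formed, a phrase can bridge a removed stopword. That is a deliberate simplification in this sketch and one source of noisy candidates.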
Example console output
When you run the command, you’ll see something like:
Scanning:
Winner candidate A: https://example.com/page-a
Winner candidate B: https://example.com/page-b
Winner selected: https://example.com/page-a
Supporter: https://example.com/page-b
Found 17 missing topics on supporter page.
CSV written to: gaps.csv
Project structure
page-gap-scanner/
  pyproject.toml
  README.md
  LICENSE
  page_gap_scanner/
    __init__.py
    cli.py
    compare.py
    fetch.py
    extract.py
    export.py
    utils.py
Key modules:
- cli.py – defines the Typer-based CLI (page-gap-scanner).
- fetch.py – fetches HTML safely.
- extract.py – extracts headings & topics.
- compare.py – core gap logic, winner/supporter decision.
- export.py – writes the CSV file.
- utils.py – small helpers.
Development & contribution
- Clone the repository.
- Create and activate a virtual environment.
- Install dependencies in editable mode:
pip install -e ".[dev]"
- Run the CLI locally:
page-gap-scanner scan https://example.com/a https://example.com/b
Author
Name: Amal Alexander
Email: amalalex95@gmail.com
Feel free to fork, tweak, and adapt this tool into your own SEO workflow.
License
This project is licensed under the MIT License. See the LICENSE file for details.