Maintainers

a98884865

site-context-pipeline

Convert website crawls, URL inventories, and editorial notes into structured context packs for human-reviewed, LLM-assisted content workflows.

site-context-pipeline is a small, dependency-free Python CLI that turns the boring-but-essential facts about a website into a stable, machine- and human-readable digest. The digest is the artifact you hand to a language model (or to a human writer) before they touch a brief or a draft.

The 0.x core is intentionally small: it reads a CSV/JSON URL list, classifies pages, builds a simple internal-link graph, optionally folds in keyword and search-performance data from local CSV exports, and emits an aggregated agent context pack plus a content opportunities report. The core schemas, artifacts, and pipeline are vendor-neutral and have no required external API dependency. Optional provider adapters may carry vendor-specific names (e.g. google-ads, google-search-console) — see Provider philosophy and docs/providers.md for the rules.

Documentation: Tutorial · Architecture · Providers · Artifacts · Roadmap · Changelog

What this project is

A CLI toolkit for assembling structured context about a single site.
A deterministic pipeline: same input, same output. Every artifact records where its facts came from.
A safe foundation for LLM-assisted workflows: humans (and models) consume the pack, but the pack is built without calling an LLM.
An opinionated layout: every site lives in its own clients/<name>/{input,config,data,output,logs} workspace, so several sites can coexist without contaminating each other.

What this project is not

It is not a one-click SEO article generator. There is no built-in prompt that says "write me a 1500-word article ranked #1 on Google."
It is not a Yandex-only or Google-only tool. The base package works without any search vendor at all.
It is not a crawler. You bring a CSV of URLs (export from Screaming Frog, your CMS, a sitemap parser, etc.). 0.x does not fetch pages.
It is not a SERP scraper, keyword scraper, or link-building automator.
It is not a CMS publisher. Outputs are local files; pushing them to WordPress or anywhere else is your responsibility.
It is not a black-hat SEO toolkit. If your goal is to generate doorway pages or scaled spam, this isn't your tool.

Why structured context matters

Asking an LLM "write a blog post about local delivery" without context produces text that:

duplicates pages already on your site,
targets keywords that don't match your real services,
recommends links that don't exist,
invents facts that conflict with your live copy.

Hand the same model a stable digest of the site (page inventory, link graph, classification reasons, project notes, real keyword volumes, real Search-Console performance) and the failure modes shrink. You also get something every author and reviewer needs: an auditable trail showing where each claim came from. The pack is designed for human review first; LLM consumption is a side benefit.

Installation

Requires Python ≥ 3.11. The core has zero runtime dependencies.

pip install site-context-pipeline

Or, from a clone:

git clone https://github.com/OtShelniko/site-context-pipeline.git
cd site-context-pipeline
pip install -e ".[dev]"

Quickstart

The shipped demo uses synthetic example.com data — no real sites or keywords.

# 1. Initialise an empty client workspace.
site-context-pipeline init --client demo --write

# 2. Build the inventory from a URL CSV.
site-context-pipeline build-inventory \
    --client demo \
    --source examples/demo-client/input/urls.csv \
    --write

# 2a. (alternative) Or feed a sitemap.xml — same command, different format.
#     Auto-detection picks "sitemap" from the .xml extension; --format
#     sitemap forces it explicitly.
# site-context-pipeline build-inventory \
#     --client demo \
#     --source path/to/sitemap.xml \
#     --format sitemap \
#     --write

# 3. Build the internal link graph from an edge CSV.
site-context-pipeline build-link-graph \
    --client demo \
    --source examples/demo-client/input/links.csv \
    --write

# 4. (optional) Import keyword volume data from a local CSV.
site-context-pipeline import-keywords \
    --client demo \
    --provider local-csv \
    --source examples/demo-client/input/keyword_metrics.csv \
    --write

# 5. (optional) Import per-query performance from a Search-Console-style CSV.
site-context-pipeline import-search-performance \
    --client demo \
    --provider local-gsc-csv \
    --source examples/demo-client/input/search_console.csv \
    --write

# 6. Aggregate everything into the agent context pack.
site-context-pipeline build-context-pack --client demo --write

# 7. See what's there.
site-context-pipeline inspect --client demo

After step 6 you will have:

clients/demo/
├── data/
│   ├── content_inventory.json
│   ├── internal_link_graph.json
│   ├── keyword_metrics.json          # only if step 4 ran
│   └── search_performance.json       # only if step 5 ran
└── output/
    ├── agent_context_pack.json
    ├── agent_context_pack.md
    └── content_opportunities.md

Steps 4 and 5 are optional. The context pack works without them; if both artifacts are missing, the pack records a missing_keyword_data warning so reviewers know the demand and performance sections were not filled in.

CLI commands

Every command takes --client <id> and an optional --workspace <path> (defaults to the current directory). Every command supports --write; without it, the command runs as a dry-run and prints the planned writes.

Command	What it does	Reads	Writes
`init`	Creates the `clients/<id>/` directory tree and seed files.	—	`clients/<id>/{input,config,data,output,logs}/`, `input/{urls.csv,links.csv,project.md}` placeholders
`build-inventory --source PATH`	Normalises URLs, classifies each as `home`/`service`/`blog`/`category`/`landing`/`other`, records the rule that fired. Accepts CSV, JSON, sitemap XML, or Screaming Frog `internal_*.csv` via `--format auto|csv|json|sitemap|screaming-frog`.	CSV, JSON, sitemap.xml, or Screaming Frog inventory CSV	`data/content_inventory.json`
`build-link-graph --source PATH`	Joins an edge list with the inventory; tags commercial pages with low blog inlinks. Accepts CSV, JSON, or Screaming Frog `all_inlinks.csv` via `--format`.	CSV, JSON, or Screaming Frog link CSV	`data/internal_link_graph.json`
`import-keywords --provider NAME --source PATH`	Reads keyword metrics from a provider into a normalised artifact.	provider-specific	`data/keyword_metrics.json`
`import-search-performance --provider NAME --source PATH`	Reads per-query performance data into a normalised artifact.	provider-specific	`data/search_performance.json`
`list-providers`	Lists available keyword and search-performance providers and whether each is live in this release.	—	nothing
`build-context-pack`	Aggregates inventory, link graph, project notes, keywords, and performance into one digest. No LLM, no network.	The JSON artifacts above + project notes	`output/agent_context_pack.json`, `output/agent_context_pack.md`, `output/content_opportunities.md`
`inspect`	Reports which expected files exist. Useful for CI scripts.	The whole workspace	nothing

All commands print one JSON document on stdout, so you can pipe them.
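
Because each command prints a single JSON document, a wrapper script can parse the output directly. The sketch below is one possible way to do that from Python. It assumes the CLI is installed and on PATH, and that the document is a JSON object; `parse_cli_output` and `run_command` are hypothetical helper names, not part of the package.

```python
import json
import subprocess


def parse_cli_output(stdout_text: str) -> dict:
    """Parse the single JSON document a command prints on stdout."""
    payload = json.loads(stdout_text)
    if not isinstance(payload, dict):
        # Assumption: the top-level document is a JSON object.
        raise ValueError("expected a JSON object on stdout")
    return payload


def run_command(args: list[str]) -> dict:
    """Run the CLI (assumed to be on PATH) and return its parsed output."""
    result = subprocess.run(
        ["site-context-pipeline", *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_cli_output(result.stdout)
```

With the package installed, `run_command(["inspect", "--client", "demo"])` would return the parsed report as a dictionary, ready for a CI script to check.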

Looking for a longer walkthrough? See docs/tutorial.md — a 10-minute end-to-end tutorial that goes from "I have a sitemap" to a finished context pack, with explanations for every step.

Provider philosophy

Providers are how external data — keyword volume, search performance, SERP rows — gets into the pipeline. The toolkit follows four rules:

Providers are optional. The base package works without any of them. The core artifacts (inventory, link graph, context pack) never touch the network.
Providers convert external data into normalised local artifacts. A provider's job is to read a CSV (today) or call a vendor API (in the future) and emit data/keyword_metrics.json or data/search_performance.json in a stable, vendor-independent shape. Every row carries a source field so you can tell which provider produced it.
The core pipeline reads normalised artifacts only. Once a provider has written the artifact, no other code in the pipeline cares which provider produced it. This prevents vendor lock-in and keeps the context pack reproducible from a single workspace directory.
Vendor-specific names live in providers, never in the core. The schemas, artifact field names, and CLI core commands stay vendor-neutral. A provider identifier like google-ads may be vendor-specific by design — that is what tells the user which API the future live adapter will call. Vendor-specific providers must remain optional adapters and never become core dependencies.

Providers listed in this release:

Provider name	Kind	Status	Notes
`local-csv`	keyword	live	Read keyword metrics from any local CSV (Google Ads export, Ahrefs / Semrush export, hand-curated research). Offline.
`google-ads`	keyword	stub	Returns `not_configured`. Live Google Ads Keyword Planner support is on the roadmap behind an optional extra.
`local-gsc-csv`	search_performance	live	Read per-query performance from a Google Search Console Performance CSV export. Offline.
`google-search-console`	search_performance	stub	Returns `not_configured`. Live Search Console API access is on the roadmap behind an optional extra.

Why not hardcode Yandex or Google?

Different markets use different search engines. Yandex still leads in some regions; Google leads in others; Baidu, Naver, DuckDuckGo, and vertical search matter for specific niches. Hardcoding any single vendor would push the toolkit toward one market and against another.
OSS users should be able to bring their own data. The pipeline cannot tell whether your keyword_metrics.csv came from Google Ads, Yandex Wordstat, Ahrefs, Semrush, an internal database, or a hand-curated spreadsheet — and it does not need to. Every row is treated the same way.
Local CSV imports are the stable baseline. Vendors change auth flows, schemas, and access tiers. Files do not. Building the data contract around CSV/JSON keeps the pipeline working when an API changes overnight.
API adapters should never be required for core usage. When a live adapter ships, it lives behind an optional extra (e.g. pip install site-context-pipeline[gsc]) and the rest of the pipeline stays dependency-free.

If you need a Yandex-specific or Google-specific adapter, add it as a new provider that produces the same KeywordMetric rows the rest of the toolkit already understands. No core changes required.
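
To make the adapter idea concrete, here is a minimal sketch of the conversion step such a provider might perform. It is not the package's provider interface, which this README does not document. The vendor column names (`phrase`, `shows`), the provider id `my-vendor-csv`, and the function name are all hypothetical, and the CSV values are synthetic.

```python
import csv
import io

# Hypothetical vendor export. The column names "phrase" and "shows" are
# invented for this example and do not describe any real vendor format.
VENDOR_CSV = """phrase,shows
local delivery service,3600
delivery cost guide,"1,200"
"""


def vendor_rows_to_keyword_metrics(text: str, source: str) -> list[dict]:
    """Map vendor-specific columns onto a vendor-neutral row shape."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                "query": record["phrase"].strip(),
                "source": source,  # every row names the provider that wrote it
                "avg_monthly_searches": int(record["shows"].replace(",", "")),
                # Assumption: this sketch keeps all original columns in raw.
                "raw": dict(record),
            }
        )
    return rows


items = vendor_rows_to_keyword_metrics(VENDOR_CSV, source="my-vendor-csv")
print(items[0]["query"], items[0]["avg_monthly_searches"])
```

The point of the design is that everything after this step sees only `query`, `source`, and the numeric fields, so nothing downstream needs to know which vendor produced the file.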

Demo client

Run site-context-pipeline init --client demo --write to start a fresh workspace, or use the synthetic fixtures in examples/demo-client/ directly. The fixtures contain:

8 pages on a fictional example.com (home, services, blog posts, pricing, about).
6 internal links between them.
6 fake search queries with synthetic volumes (local delivery planning, delivery cost guide, same day delivery checklist, business delivery pricing, warehouse delivery service, local delivery service).
6 fake Search-Console rows with impressions, clicks, CTR, and average position.
A short project.md describing the imaginary business.
config/commercial_urls.json promoting one URL to landing.
config/classifier.json showing how to override the default page-pattern rules.

The fixtures are intentionally tiny and language-neutral. They are not copied from any real site or client.

Generated artifacts

`data/content_inventory.json`

A list of objects, one per page:

{
  "url": "https://example.com/blog/how-to-plan-delivery/",
  "path": "/blog/how-to-plan-delivery/",
  "page_type": "blog",
  "classification_reason": "matched_pattern:*/blog/*",
  "title": "How to plan a delivery",
  "h1": "How to plan a delivery",
  "status_code": 200,
  "word_count": 1100,
  "inlinks_count": 2,
  "outlinks_count": 3,
  "source": "csv"
}
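
The `classification_reason` value above suggests glob-style matching on the URL path. The following sketch shows how such a classifier could work. Only the `*/blog/*` pattern appears in this README; the other patterns, the reason strings `root_path` and `no_pattern_matched`, and the function name are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule table: (glob pattern, page type). Only "*/blog/*" is
# taken from this README; the remaining patterns are illustrative.
RULES = [
    ("*/blog/*", "blog"),
    ("*/services/*", "service"),
    ("*/category/*", "category"),
]


def classify(url: str) -> tuple[str, str]:
    """Return (page_type, classification_reason) for a URL."""
    path = urlsplit(url).path or "/"
    if path == "/":
        return "home", "root_path"
    for pattern, page_type in RULES:
        # fnmatchcase is case-sensitive on every platform, so the result
        # does not depend on the operating system.
        if fnmatchcase(path, pattern):
            return page_type, f"matched_pattern:{pattern}"
    return "other", "no_pattern_matched"


print(classify("https://example.com/blog/how-to-plan-delivery/"))
```

Recording the pattern that fired alongside the page type is what lets a reviewer audit why a page was classified the way it was.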

`data/internal_link_graph.json`

{
  "nodes": [{"url": "...", "page_type": "service", "blog_inlink_count": 1, "is_commercial_target": true, "...": "..."}],
  "edges": [{"source_url": "...", "target_url": "...", "anchor_text": "..."}],
  "commercial_pages_low_blog_inlinks": [],
  "blog_pages_low_inlinks": [],
  "warnings": []
}
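
As a rough illustration of how `commercial_pages_low_blog_inlinks` might be derived from nodes and edges, consider the sketch below. Which page types count as commercial and the threshold for "low" are assumptions here, since neither is specified in this README, and the function is hypothetical rather than the package's own.

```python
from collections import Counter

# Assumptions for this sketch: the set of commercial page types and the
# minimum number of blog inlinks below which a page is flagged.
COMMERCIAL_TYPES = {"service", "landing"}
MIN_BLOG_INLINKS = 1


def commercial_pages_low_blog_inlinks(
    pages: dict[str, str], edges: list[dict]
) -> list[str]:
    """pages maps url -> page_type; edges use source_url / target_url keys."""
    blog_inlinks = Counter(
        edge["target_url"]
        for edge in edges
        if pages.get(edge["source_url"]) == "blog"
    )
    # Sorting keeps the output deterministic for identical input.
    return sorted(
        url
        for url, page_type in pages.items()
        if page_type in COMMERCIAL_TYPES and blog_inlinks[url] < MIN_BLOG_INLINKS
    )


pages = {
    "https://example.com/blog/a/": "blog",
    "https://example.com/services/x/": "service",
    "https://example.com/services/y/": "service",
}
edges = [
    {
        "source_url": "https://example.com/blog/a/",
        "target_url": "https://example.com/services/x/",
    }
]
print(commercial_pages_low_blog_inlinks(pages, edges))
```

In this synthetic example only `/services/y/` is flagged, because `/services/x/` already receives a link from a blog page.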

`data/keyword_metrics.json` (optional)

Produced by import-keywords. Every row carries a source field identifying the provider that wrote it.

{
  "schema_version": 1,
  "provider": "local-csv",
  "items_count": 6,
  "metadata": {"source_path": "examples/demo-client/input/keyword_metrics.csv", "row_count": 6, "items_count": 6},
  "warnings": [],
  "items": [
    {
      "query": "local delivery service",
      "source": "local-csv",
      "avg_monthly_searches": 3600,
      "competition": "HIGH",
      "geo": "US",
      "language": "en",
      "source_url": "https://example.com/services/local-delivery/",
      "raw": {}
    }
  ]
}

`data/search_performance.json` (optional)

Produced by import-search-performance. It has the same shape as keyword_metrics.json, but its rows usually populate impressions, clicks, ctr, and position rather than avg_monthly_searches.

`output/agent_context_pack.json`

{
  "schema_version": 1,
  "generated_at": "2026-05-31T00:00:00+00:00",
  "client": "demo",
  "summary": {
    "page_count": 8,
    "edge_count": 6,
    "node_count": 8,
    "keyword_metrics_count": 6,
    "search_performance_rows": 6,
    "page_type_counts": {"blog": 2, "home": 1, "...": "..."}
  },
  "classification": {"reasons": {"...": "..."}},
  "pages": {"home": [], "blog": [], "...": []},
  "opportunities": {
    "commercial_pages_low_blog_inlinks": [],
    "blog_pages_low_inlinks": [],
    "top_keywords": [],
    "weak_ctr_pages": [],
    "ranked_but_unsupported": []
  },
  "search_performance_summary": {"rows": 6, "total_clicks": 174, "total_impressions": 8620, "average_ctr": 0.0218, "average_position": 13.62},
  "providers": {"keyword_metrics": {}, "search_performance": {}},
  "project_notes": "...",
  "sources": {"...": "..."},
  "warnings": []
}

output/agent_context_pack.md is the same content as a Markdown document, with sections for top keyword opportunities, weak-CTR pages, and pages that already rank but receive no internal support.

output/content_opportunities.md is a deterministic shortlist of gaps: commercial pages without blog inlinks, orphan blog posts, weak-CTR queries, and ranked-but-unsupported URLs. It is a prompt for human review, not a ranking.
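
For readers who want to reproduce figures like those in `search_performance_summary`, the sketch below aggregates per-query rows into totals and averages. How the package actually computes `average_ctr` and `average_position` is not stated in this README; the sketch assumes unweighted means over rows, and the rows are synthetic rather than taken from the demo fixtures.

```python
from statistics import fmean


def summarise(rows: list[dict]) -> dict:
    """Aggregate per-query rows into totals and unweighted averages."""
    if not rows:
        return {
            "rows": 0,
            "total_clicks": 0,
            "total_impressions": 0,
            "average_ctr": None,
            "average_position": None,
        }
    return {
        "rows": len(rows),
        "total_clicks": sum(row["clicks"] for row in rows),
        "total_impressions": sum(row["impressions"] for row in rows),
        # Assumption: plain mean of row values, not weighted by impressions.
        "average_ctr": round(fmean([row["ctr"] for row in rows]), 4),
        "average_position": round(fmean([row["position"] for row in rows]), 2),
    }


# Synthetic rows for illustration only.
rows = [
    {"query": "local delivery service", "clicks": 40, "impressions": 2000,
     "ctr": 0.02, "position": 8.0},
    {"query": "delivery cost guide", "clicks": 10, "impressions": 1000,
     "ctr": 0.01, "position": 14.0},
]
print(summarise(rows))
```

An unweighted mean treats every query equally; weighting by impressions would be an equally reasonable choice and would give a different average.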

Provider input formats

`local-csv` (keyword data)

Recognised columns (case-insensitive, _/-/space treated as equivalent — Search Volume, search_volume, and search-volume all match):

Required (one of)	Optional
`query`, `keyword`, `search_term`	`avg_monthly_searches` (`search_volume`, `monthly_searches`, `volume`, `searches`)
	`impressions`, `clicks`, `ctr`, `position` (`average_position`, `rank`)
	`competition`, `locale`, `geo` (`country`, `location`), `language`, `source_url`

Numeric values handle thousand separators ("1,234" → 1234) and CTR percentages ("12.3%" → 0.123). Unknown columns are preserved in each row's raw dict so no information is lost.
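
The normalisation rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the package's own code; the function names are hypothetical.

```python
import re


def normalise_header(name: str) -> str:
    """Lower-case a header and treat '_', '-', and spaces as equivalent."""
    return re.sub(r"[\s_-]+", "_", name.strip().lower())


def parse_int(value: str) -> int:
    """Parse an integer that may contain thousand separators."""
    return int(value.replace(",", "").strip())


def parse_ctr(value: str) -> float:
    """Parse a CTR given as a fraction ('0.123') or a percentage ('12.3%')."""
    text = value.strip()
    if text.endswith("%"):
        # Rounding removes floating-point noise introduced by the division.
        return round(float(text[:-1]) / 100, 6)
    return float(text)


print(normalise_header("Search Volume"), normalise_header("search-volume"))
print(parse_int("1,234"), parse_ctr("12.3%"))
```

Normalising headers before lookup is what allows one alias table to cover exports from different tools.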

`local-gsc-csv` (search-performance data)

Accepts any CSV that uses Google Search Console-style headers. The same column-normalisation rules apply.

Required	Optional
`query` (also `top queries`, `search_term`)	`page` / `landing_page` / `url`, `clicks`, `impressions`, `ctr`, `position`, `country`, `device`, `date`

device and date are preserved in each row's raw dict so future filters can use them.

Dry-run / write principle

Every command runs as a dry-run by default: the exit code is 0 and the JSON payload lists planned_writes, but nothing is created on disk. Re-run with --write to materialise the artifacts. This makes the pipeline safe to integrate into CI, code reviews, and PR previews.
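
The dry-run pattern itself is simple to reproduce in your own scripts. The sketch below plans the client directory tree and creates it only when asked. The `planned_writes` key comes from this README; the `dry_run` key and the function name are assumptions made for this example.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("input", "config", "data", "output", "logs")


def init_workspace(workspace: Path, client: str, write: bool = False) -> dict:
    """Plan the client directory tree; create it only when write is True."""
    planned = [workspace / "clients" / client / name for name in SUBDIRS]
    if write:
        for directory in planned:
            directory.mkdir(parents=True, exist_ok=True)
    return {
        "dry_run": not write,  # hypothetical key, for illustration
        "planned_writes": [path.as_posix() for path in planned],
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    report = init_workspace(root, "demo")  # dry run: nothing is created
    print(report["dry_run"], (root / "clients").exists())
    init_workspace(root, "demo", write=True)  # now the tree is created
    print((root / "clients" / "demo" / "data").is_dir())
```

Computing the plan first and writing second means the dry-run and the real run share one code path, so the preview cannot drift from what is actually written.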

Data safety

This toolkit is designed for public, source-backed processing. A few ground rules:

Use synthetic data in public examples. Never check real client domains, keyword lists, briefs, or scraped HTML into a public repo. See examples/demo-client/ for the bar.
Keep secrets out of the workspace. No API keys are needed for the 0.x core. When live network adapters are added (see Roadmap), they will read keys from environment variables and .env files that are gitignored; keys must never be written into artifacts.
Treat input files as untrusted data. The CLI never executes anything from your CSV/JSON; it only reads fields it knows about. Unknown columns are preserved in raw but never executed.
Path traversal is rejected. Client identifiers are validated against a strict pattern; --client ../etc exits with an error.
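
A strict identifier check of the kind described in the last rule can be sketched as follows. The actual pattern the CLI uses is not documented in this README, so the regular expression and the function name here are assumptions.

```python
import re

# Hypothetical pattern: lowercase letters, digits, '_' and '-', starting
# with a letter or digit, at most 64 characters. No '/', '\\' or '.'.
CLIENT_ID = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")


def validate_client_id(value: str) -> str:
    """Return the id unchanged, or raise ValueError if it is unsafe."""
    if not CLIENT_ID.fullmatch(value):
        raise ValueError(f"invalid client id: {value!r}")
    return value


print(validate_client_id("demo"))
try:
    validate_client_id("../etc")
except ValueError as error:
    print(error)
```

Allow-listing the permitted characters is safer than trying to strip `..` segments, because anything not explicitly allowed is rejected before a path is ever built.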

If you find a security issue, please follow SECURITY.md.

Architecture overview

┌──────────────────────┐     ┌─────────────────────────┐
│ input/urls.csv       │ ──► │ inventory.py            │
│ input/links.csv      │     │  classify URLs          │
│ input/project.md     │     │  build content_inventory│
└──────────────────────┘     └────────────┬────────────┘
                                          │
                                          ▼
                             ┌─────────────────────────┐
                             │ link_graph.py           │
                             │  join inventory + edges │
                             │  flag opportunities     │
                             └────────────┬────────────┘
                                          │
   ┌──────────────────────┐               │
   │ keyword_metrics.csv  │ ──┐           │
   │ search_console.csv   │ ──┤           │
   │ ...                  │   │           │
   └──────────────────────┘   ▼           │
                          ┌─────────────────────────┐
                          │ providers/              │
                          │  local-csv              │
                          │  local-gsc-csv          │
                          │  google-ads (stub)      │
                          │  google-search-console  │
                          │      (stub)             │
                          └────────────┬────────────┘
                                       │ data/keyword_metrics.json
                                       │ data/search_performance.json
                                       ▼
                             ┌─────────────────────────┐
                             │ context_pack.py         │
                             │  aggregate everything   │
                             │  emit pack.{json,md}    │
                             └─────────────────────────┘

The pipeline is intentionally one-way: each step reads from the previous step's artifact on disk. This means you can run any step independently and re-run cheaply when an upstream input changes.

Roadmap

The 0.x core is offline-only on purpose. Future versions will add optional adapters behind explicit opt-in flags and [extras]. The planned additions:

crawl adapter — wrap an external crawler (SiteOne, Screaming Frog, a sitemap parser) so users do not have to assemble the input CSV by hand. The adapter must be opt-in and never crawl by default.
Live keyword providers — Google Ads Keyword Planner, plus future adapters for any vendor (DataForSEO, Yandex Wordstat, SerpApi, Ahrefs, Semrush) where the user has credentials. Each lives behind its own optional extra and produces the same normalised KeywordMetric rows.
Live search-performance providers — Google Search Console Search Analytics API, plus future adapters for any equivalent service. Same opt-in pattern.
Search evidence providers — read top-N organic rows for a query from a search API. Behind an explicit --allow-external flag and one of several pluggable backends.
llm-brief adapter — feed the context pack to an LLM to produce a brief (not a draft). Output must be reviewable JSON, not free-form prose, and every claim must cite a source field from the pack.
yoast-style-qa module — deterministic, offline content QA over Markdown drafts (keyphrase distribution, internal-link sanity, slug checks). No LLM involvement.
schema-org module — generate JSON-LD Article / FAQPage / BreadcrumbList from a draft + the context pack. Validation against Google's required-property checklist.
WordPress publish — explicitly out of scope until everything above is stable. When added, it will be a separate package.

Items intentionally not on the roadmap:

A built-in "write me an article" command.
Bulk content generation across many sites in one run.
Anything that touches a live site without an explicit opt-in flag.
Hardcoded support for any single search vendor in the core. Vendors are providers; providers are optional.

Using this with OpenAI Codex (or any coding assistant)

The agent context pack is designed to be a stable input for a coding or content assistant. A typical loop:

Run the pipeline locally and review agent_context_pack.md by eye.
Paste the pack (or attach agent_context_pack.json) into the assistant's context window.
Ask the assistant to draft a brief, an outline, or a code change that cites the pack's sources and pages fields.
Verify the assistant's references against the live site before acting on the output.

The pack's schema_version field lets you write a small validator in your own codebase to refuse drafts that drift from the agreed schema.
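
A validator of that kind could look like the sketch below. The key names are taken from the example pack in this README; which keys to treat as required, and the function name, are assumptions you would adapt to your own workflow.

```python
import json

EXPECTED_SCHEMA_VERSION = 1
# Top-level keys copied from the example pack shown in this README.
REQUIRED_KEYS = {
    "schema_version",
    "client",
    "summary",
    "pages",
    "opportunities",
    "sources",
    "warnings",
}


def check_pack(pack: dict) -> list[str]:
    """Return a list of problems; an empty list means the pack looks usable."""
    problems = []
    version = pack.get("schema_version")
    if version != EXPECTED_SCHEMA_VERSION:
        problems.append(f"unsupported schema_version: {version!r}")
    missing = sorted(REQUIRED_KEYS - set(pack))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    return problems


pack = json.loads('{"schema_version": 2, "client": "demo"}')
print(check_pack(pack))
```

In practice you would load `clients/demo/output/agent_context_pack.json` with `json.load` and refuse to hand the pack to an assistant whenever `check_pack` returns a non-empty list.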

Development

git clone <this repo>
cd site-context-pipeline
python -m venv .venv
. .venv/Scripts/activate     # Windows (Git Bash); on Linux/macOS use: . .venv/bin/activate
pip install -e ".[dev]"
ruff check .
pytest

CI runs the same commands on Python 3.11 and 3.12.

License

MIT.

Code of conduct

By participating you agree to the Contributor Covenant.

Release history

0.5.0 (Jun 1, 2026)
0.4.0 (May 31, 2026)
0.3.0 (May 31, 2026)
0.2.0 (May 31, 2026), this version
0.1.1 (May 31, 2026)

Download files

Source Distribution

site_context_pipeline-0.2.0.tar.gz (56.2 kB)

Uploaded May 31, 2026 Source

Built Distribution

site_context_pipeline-0.2.0-py3-none-any.whl (53.8 kB)

Uploaded May 31, 2026 Python 3

File details

Details for the file site_context_pipeline-0.2.0.tar.gz.

File metadata

Download URL: site_context_pipeline-0.2.0.tar.gz
Upload date: May 31, 2026
Size: 56.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for site_context_pipeline-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`4cc4bf0e842691b4a63ea5590450e36f59f069fc99ee0366c55afbc62af1a662`
MD5	`9240c7779d1eda596820698c957d541b`
BLAKE2b-256	`7401405a52cfb4f3d0130ddf57c341ee0841aa5ec7600aedd0148086a7482263`

Provenance

The following attestation bundles were made for site_context_pipeline-0.2.0.tar.gz:

Publisher: release.yml on OtShelniko/site-context-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: site_context_pipeline-0.2.0.tar.gz
- Subject digest: 4cc4bf0e842691b4a63ea5590450e36f59f069fc99ee0366c55afbc62af1a662
- Sigstore transparency entry: 1684919404
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: OtShelniko/site-context-pipeline@76a9246072887dc6d4dadb9ea6f783b44e23fd34
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/OtShelniko
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76a9246072887dc6d4dadb9ea6f783b44e23fd34
- Trigger Event: push

File details

Details for the file site_context_pipeline-0.2.0-py3-none-any.whl.

File metadata

Download URL: site_context_pipeline-0.2.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 53.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for site_context_pipeline-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f451012a9d6ac4c4b60a631104e68f220de8ba48306dc7f47fb1ff1d8d657a0`
MD5	`602e53f2f32a989200872674f8bcc708`
BLAKE2b-256	`346bf79ba2c74548dd540d2c455f56f4ac0d6ce0380fa56010fee2f594989909`

Provenance

The following attestation bundles were made for site_context_pipeline-0.2.0-py3-none-any.whl:

Publisher: release.yml on OtShelniko/site-context-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: site_context_pipeline-0.2.0-py3-none-any.whl
- Subject digest: 5f451012a9d6ac4c4b60a631104e68f220de8ba48306dc7f47fb1ff1d8d657a0
- Sigstore transparency entry: 1684919552
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: OtShelniko/site-context-pipeline@76a9246072887dc6d4dadb9ea6f783b44e23fd34
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/OtShelniko
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@76a9246072887dc6d4dadb9ea6f783b44e23fd34
- Trigger Event: push
