Probe before extract. Classify how a site serves data and recommend the optimal extraction strategy.

These details have not been verified by PyPI

Project links

Project description

cartograph

Probe before extract. Give cartograph a URL. It tells you how the site serves data and recommends the optimal extraction strategy. Claude is the intelligence layer. CLI and Python library, equal citizens.

What it solves

Figuring out how a site serves data is the work nobody tooled. You do it by hand, every time, for every site. WordPress here, Algolia search there, embedded hydration JSON on the next one, form-gated bulk CSV on the one after. Open devtools. Watch the network tab. View source. Guess. Build the extraction. Discover the architecture changed last quarter. Repeat.

cartograph automates the discovery half. Claude reads the probe output and tells you what kind of site it is, where the data lives, and what to do next. The actual extraction is still your code (and that's the right division of labor: Claude is expensive for bulk extraction, cheap for one-shot classification).

The value isn't probing one URL. The value is doing this 20 or 200 times against sources that all behave differently, without re-doing the same detective work every time.

Quickstart

pip install cartograph-ai  # Requires Python 3.11+
export ANTHROPIC_API_KEY=your-key
cartograph-ai https://sasaki.com/projects

sasaki.com/projects
└── Algolia search API (confidence: high)
    All 90 projects accessible without a browser.
    Estimated effort: half a day for a developer.
    Recommended: GET against app AHNZ21XTZ6, index prod_projects.
    Run with --json for machine output.

Same probe as a Python call:

from cartograph_ai import probe

result = probe("https://sasaki.com/projects")
print(result.classification)        # "algolia_search_api"
print(result.confidence)            # 0.94
print(result.extraction_strategy)   # structured dict

Or the full JSON output (--json from CLI, result.model_dump() from library):

{
  "url": "https://sasaki.com/projects",
  "model": "claude-sonnet-4-6",
  "classification": {
    "category": "direct_api",
    "subcategory": "algolia_search",
    "confidence": 0.94
  },
  "extraction_strategy": {
    "method": "algolia_search",
    "requires_browser": false,
    "estimated_requests": 2,
    "recommended_tool": "requests",
    "specifics": {"app_id": "AHNZ21XTZ6", "index": "prod_projects"}
  }
}

About 15 seconds per probe. ~$0.015 in tokens at current Sonnet rates. You're done with the first hour of detective work that usually eats the front of every scraping project.

Three ways to use it

Ad hoc. One URL list, one-time research, no engineering team. Run the CLI, get an answer you can hand to a developer.

Ongoing. Continuous signal pipeline. Probe new sources as they appear. Re-validate existing sources on a schedule. Detect when an architecture changes.

Embedded. Another tool calls cartograph with a URL and gets back a classification. A dashboard accepting URL input. An agent encountering a new web source mid-task. The caller doesn't need to know anything about web scraping.

The pattern is the same across all three modes. The deeper you integrate it, the more value compounds.

Economics

cartograph's real competitor isn't compute cost. It's human time. Two ways the math lands.

Batch cost comparison (50 URLs from your typical source list):

Approach	Time	Cost
cartograph (stages 1+2+4)	~12 minutes	~$0.75 in tokens
Headless browser per URL	~15-25 minutes	~$0 compute, ~$50-150 in developer time
Manual devtools inspection	~25-75 hours	~$2,500-7,500 in labor

Token economics for downstream LLM consumers:

If you're piping into another LLM (agent, RAG, summarizer), the structured probe result is a few KB of clean JSON. Throwing raw HTML at the model instead burns 80-95% of the context window on DOM noise.

Input to your LLM	Typical size	Input tokens (~)	Cost at Sonnet rates
Raw HTML (typical page)	200-500 KB	50K-125K	$0.15-0.38
cartograph probe result	2-5 KB	1,500-2,500	$0.005-0.008

Roughly 99% less spend on downstream input tokens. The probe already knows what to fetch and how, so the model doesn't have to figure it out from the raw DOM.

Real-world payloads run larger than initial back-of-envelope estimates; the cost math still favors cartograph by 10-1000x over alternatives. Numbers measured against a 15-URL benchmark set (2026-05-28, commit c1f8c15); claude-sonnet-4-6 pinned. Median probe cost: $0.015. Full results in bench/results.json.

Harder example: enterprise site

Big enterprise sites often look intimidating but reveal their architecture quickly. cartograph's job is to give you an honest fingerprint and point you at where the data actually lives.

cartograph-ai https://ford.com

ford.com
└── Adobe Experience Manager (confidence: high)
    Heavily server-rendered HTML. Content available in the initial response.
    Asset paths follow AEM conventions: /content/dam/ for the DAM, dedicated
    assets origin at assets.ford.com with Adobe Dynamic Media renditions.
    No client-side state blob or JSON API surface detected in the served HTML.
    Multi-subdomain topology suggests product data lives elsewhere:
      shop.ford.com (vehicle catalog and configurator)
      owner.ford.com (account and ownership data)
      fordpro.com (commercial fleet)
    Recommended: parse the server-rendered HTML directly for content on ford.com.
    Probe shop.ford.com separately for product data; the architecture there may
    differ (configurator likely needs the browser extra).

No Chromium downloaded. No Playwright on disk. cartograph identified the platform, named the asset patterns, and pointed you at the right subdomain to probe next. When you eventually need a browser, you opt in.

What it isn't

cartograph tells you which scraper to use; it doesn't do the scraping. It's a CLI and Python library that runs on your machine with your API key, outputs JSON, and never phones home. The full anti-positioning argument (vs Firecrawl, Apify, manual investigation, and what cartograph deliberately doesn't try to do) is in /docs/why-this-exists.md.

How it works

Four stages, progressively escalating from cheap to expensive: HTTP probe, HTML analysis, optional JS execution via Playwright (browser extra), and Claude classification. Most sites stop at stage 2. The pinned model is claude-sonnet-4-6. Full architecture, the published prompt, the output schema, and named failure modes live in /docs/how-it-works.md.

Honest limits

Every probe returns a confidence score. When cartograph can't get a clean read, it says so. That matters more than the classifications it gets right, because a confidently wrong probe wastes real time downstream.

Phase 1 covers ~75% of public sites without a browser. Auth-walled sites, anti-bot defenses, and genuinely novel architectures get reported honestly as limitations. The --strict flag makes cartograph refuse to recommend a strategy when confidence drops below threshold. See how it works for the full failure-mode taxonomy.

Roadmap

Phase 1 (active development). Stages 1, 2, 4. No browser. Covers most public sites. CLI, library, JSON + rich terminal output, full prompt published.

Phase 2. Stage 3 via pip install cartograph-ai[browser]. Playwright as optional extra. Uses system Chrome or Edge if available, falls back to Chromium only as last resort. Plus: caching layer, source-change detection, diff output for ongoing-mode users.

Phase 3 (stretch, may not happen). Probes against authenticated sites: login flows, API keys, paid subscriptions. Per-source encrypted credential storage.

Out of scope: crawler, anti-bot bypass, persistent storage, hosted SaaS.

Contribute

Issues welcome. Failed probes especially welcome. They're the input loop that improves the tool. Pull requests for framework fingerprints, prompt improvements, or test URLs against new patterns are all real contribution paths. See /CONTRIBUTING.md.

Why this exists: the pattern, the anti-positioning, how this project came together
How it works: architecture, the Claude prompt, failure modes, the structured output schema
Contributing: workshop principle, how to help
Changelog: what changed, when

Built with Claude Code.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 28, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cartograph_ai-0.1.0.tar.gz (68.4 kB view details)

Uploaded May 28, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cartograph_ai-0.1.0-py3-none-any.whl (36.1 kB view details)

Uploaded May 28, 2026 Python 3

File details

Details for the file cartograph_ai-0.1.0.tar.gz.

File metadata

Download URL: cartograph_ai-0.1.0.tar.gz
Upload date: May 28, 2026
Size: 68.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cartograph_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f0d5e20df535e328b0d61936fa4046af4d9fca5b2024cc44655fcbd2c59bab24`
MD5	`61948c290bb024306116fc0842d9d4de`
BLAKE2b-256	`dcc10bcc74c3c712b5534905470d276369c0bc4cab67b376e06ebad39f0b775b`

See more details on using hashes here.

File details

Details for the file cartograph_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: cartograph_ai-0.1.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 36.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cartograph_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3c522a564d5fa52d1c92b0fe01c7f82a20936f9a0bb3795f932948e06ba33fa`
MD5	`710c76143fe5eb28e27097be86d0bcf7`
BLAKE2b-256	`284bac175fff78b63413d3b5c0d34e78f9a64577a52cb4b74cdf1e19c97cd0ef`

See more details on using hashes here.

cartograph-ai 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

cartograph

What it solves

Quickstart

Three ways to use it

Economics

Harder example: enterprise site

What it isn't

How it works

Honest limits

Roadmap

Contribute

More

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes