Probe before extract. Classify how a site serves data and recommend the optimal extraction strategy.
cartograph
Probe before extract. Give cartograph a URL. It tells you how the site serves data and recommends the optimal extraction strategy. Claude is the intelligence layer. CLI and Python library, equal citizens.
What it solves
Figuring out how a site serves data is the work nobody tooled. You do it by hand, every time, for every site. WordPress here, Algolia search there, embedded hydration JSON on the next one, form-gated bulk CSV on the one after. Open devtools. Watch the network tab. View source. Guess. Build the extraction. Discover the architecture changed last quarter. Repeat.
cartograph automates the discovery half. Claude reads the probe output and tells you what kind of site it is, where the data lives, and what to do next. The actual extraction is still your code (and that's the right division of labor: Claude is expensive for bulk extraction, cheap for one-shot classification).
The value isn't probing one URL. The value is doing this 20 or 200 times against sources that all behave differently, without re-doing the same detective work every time.
Quickstart
pip install cartograph-ai # Requires Python 3.11+
export ANTHROPIC_API_KEY=your-key
cartograph-ai https://sasaki.com/projects
sasaki.com/projects
└── Algolia search API (confidence: high)
All 90 projects accessible without a browser.
Estimated effort: half a day for a developer.
Recommended: GET against app AHNZ21XTZ6, index prod_projects.
Run with --json for machine output.
Same probe as a Python call:
from cartograph_ai import probe
result = probe("https://sasaki.com/projects")
print(result.classification.subcategory) # "algolia_search"
print(result.classification.confidence) # 0.94
print(result.extraction_strategy) # structured dict
Or the full JSON output (--json from CLI, result.model_dump() from library):
{
  "url": "https://sasaki.com/projects",
  "model": "claude-sonnet-4-6",
  "classification": {
    "category": "direct_api",
    "subcategory": "algolia_search",
    "confidence": 0.94
  },
  "extraction_strategy": {
    "method": "algolia_search",
    "requires_browser": false,
    "estimated_requests": 2,
    "recommended_tool": "requests",
    "specifics": {"app_id": "AHNZ21XTZ6", "index": "prod_projects"}
  }
}
About 15 seconds per probe, at roughly $0.015 in tokens at current Sonnet rates. That covers the first hour of detective work that usually eats the front of every scraping project.
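One way to act on the JSON output is to route each URL to an extraction path. The sketch below only reads the fields shown in the sample payload above. The `summarize` helper and its `min_confidence` cutoff are hypothetical and not part of cartograph.

```python
import json

# Sample payload copied from the JSON output above, trimmed to the fields used here.
SAMPLE = """
{
  "url": "https://sasaki.com/projects",
  "classification": {
    "category": "direct_api",
    "subcategory": "algolia_search",
    "confidence": 0.94
  },
  "extraction_strategy": {
    "method": "algolia_search",
    "requires_browser": false,
    "estimated_requests": 2,
    "recommended_tool": "requests",
    "specifics": {"app_id": "AHNZ21XTZ6", "index": "prod_projects"}
  }
}
"""


def summarize(probe_json: str, min_confidence: float = 0.8) -> str:
    """Turn one probe result into a one-line routing decision.

    min_confidence is a hypothetical cutoff chosen for this sketch.
    """
    data = json.loads(probe_json)
    cls = data["classification"]
    strategy = data["extraction_strategy"]
    if cls["confidence"] < min_confidence:
        return f"{data['url']}: low confidence ({cls['confidence']}), inspect manually"
    # Fall back to the recommended HTTP tool when no browser is required.
    runner = "browser" if strategy["requires_browser"] else strategy["recommended_tool"]
    return (
        f"{data['url']}: {cls['subcategory']} via {runner}, "
        f"~{strategy['estimated_requests']} requests"
    )


print(summarize(SAMPLE))
# https://sasaki.com/projects: algolia_search via requests, ~2 requests
```

Because it works on the serialized JSON, the same helper should apply to `--json` output saved from the CLI.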
Three ways to use it
Ad hoc. One URL list, one-time research, no engineering team. Run the CLI, get an answer you can hand to a developer.
Ongoing. Continuous signal pipeline. Probe new sources as they appear. Re-validate existing sources on a schedule. Detect when an architecture changes.
Embedded. Another tool calls cartograph with a URL and gets back a classification. A dashboard accepting URL input. An agent encountering a new web source mid-task. The caller doesn't need to know anything about web scraping.
The pattern is the same across all three modes. The deeper you integrate it, the more value compounds.
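The ongoing mode comes down to re-probing known sources and comparing classifications. Below is a minimal sketch of that comparison, not cartograph's own change detection. The probe function is injected so the example runs offline; `detect_changes`, the example URL, and the `"cms"` / `"wordpress"` labels are all hypothetical.

```python
from typing import Callable

# (category, subcategory), mirroring the two classification fields in the JSON output.
Label = tuple[str, str]


def detect_changes(
    urls: list[str],
    previous: dict[str, Label],
    probe_fn: Callable[[str], Label],
) -> dict[str, tuple[Label, Label]]:
    """Re-probe each URL and report the ones whose classification changed.

    `previous` is updated in place so the next run compares against fresh state.
    """
    changed: dict[str, tuple[Label, Label]] = {}
    for url in urls:
        current = probe_fn(url)
        before = previous.get(url)
        if before is not None and before != current:
            changed[url] = (before, current)
        previous[url] = current
    return changed


# Offline stand-in for a real probe call.
fake_results = {"https://example.com/a": ("direct_api", "algolia_search")}
known = {"https://example.com/a": ("cms", "wordpress")}

changes = detect_changes(["https://example.com/a"], known, lambda u: fake_results[u])
print(changes)
```

In real use, `probe_fn` would wrap the library call and `previous` would be loaded from wherever you keep state between runs.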
Economics
cartograph's real competitor isn't compute cost. It's human time. Two ways the math lands.
Batch cost comparison (50 URLs from your typical source list):
| Approach | Time | Cost |
|---|---|---|
| cartograph (stages 1+2+4) | ~12 minutes | ~$0.75 in tokens |
| Headless browser per URL | ~15-25 minutes | ~$0 compute, ~$50-150 in developer time |
| Manual devtools inspection | ~25-75 hours | ~$2,500-7,500 in labor |
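The cartograph row follows from the per-probe figures quoted in this README, assuming probes run one after another. A quick check of the arithmetic, including the per-URL time and hourly rate that the manual row implies:

```python
URLS = 50
SECONDS_PER_PROBE = 15   # "About 15 seconds per probe"
COST_PER_PROBE = 0.015   # median probe cost in USD

# Sequential batch: total wall-clock time and token spend.
batch_minutes = URLS * SECONDS_PER_PROBE / 60
batch_cost = URLS * COST_PER_PROBE
print(f"cartograph: {batch_minutes:.1f} min, ${batch_cost:.2f}")

# What the manual row implies per URL and per hour.
manual_hours = (25, 75)
manual_cost = (2_500, 7_500)
minutes_per_url = tuple(h * 60 / URLS for h in manual_hours)
hourly_rate = tuple(c / h for c, h in zip(manual_cost, manual_hours))
print(f"manual: {minutes_per_url[0]:.0f}-{minutes_per_url[1]:.0f} min per URL, "
      f"${hourly_rate[0]:.0f}/hour")
```

The manual estimate works out to 30-90 minutes of inspection per URL at $100 per hour.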
Token economics for downstream LLM consumers:
If you're piping into another LLM (agent, RAG, summarizer), the structured probe result is a few KB of clean JSON. Throwing raw HTML at the model instead burns 80-95% of the context window on DOM noise.
| Input to your LLM | Typical size | Input tokens (~) | Cost at Sonnet rates |
|---|---|---|---|
| Raw HTML (typical page) | 200-500 KB | 50K-125K | $0.15-0.38 |
| cartograph probe result | 2-5 KB | 1,500-2,500 | $0.005-0.008 |
Roughly 95-99% less spend on downstream input tokens, depending on which ends of the two ranges you compare. The probe already knows what to fetch and how, so the model doesn't have to figure it out from the raw DOM.
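The cost column can be reproduced from the token counts. The rate of $3 per million input tokens is an assumption inferred from the table itself ($0.15 for 50K tokens), not quoted from a price list.

```python
RATE_PER_MTOK = 3.00  # USD per million input tokens; inferred from the table, an assumption


def input_cost(tokens: int) -> float:
    """Input-token cost in USD at the assumed rate."""
    return tokens * RATE_PER_MTOK / 1_000_000


raw_tokens = (50_000, 125_000)   # raw HTML row
probe_tokens = (1_500, 2_500)    # cartograph probe result row

# Bracket the savings by comparing opposite ends of the two ranges.
worst = 1 - input_cost(probe_tokens[1]) / input_cost(raw_tokens[0])
best = 1 - input_cost(probe_tokens[0]) / input_cost(raw_tokens[1])

print(f"raw HTML: ${input_cost(raw_tokens[0]):.2f}-${input_cost(raw_tokens[1]):.3f}")
print(f"probe:    ${input_cost(probe_tokens[0]):.4f}-${input_cost(probe_tokens[1]):.4f}")
print(f"savings:  {worst:.1%} to {best:.1%}")
```

On these figures the savings fall between about 95% and 99% of input-token spend.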
Real-world payloads run larger than initial back-of-envelope estimates; the cost math still favors cartograph by 10-1000x over alternatives. Numbers measured against a 15-URL benchmark set (2026-05-28, commit c1f8c15); claude-sonnet-4-6 pinned. Median probe cost: $0.015. Full results in bench/results.json.
Harder example: enterprise site
Big enterprise sites often look intimidating but reveal their architecture quickly. cartograph's job is to give you an honest fingerprint and point you at where the data actually lives.
cartograph-ai https://ford.com
ford.com
└── Adobe Experience Manager (confidence: high)
Heavily server-rendered HTML. Content available in the initial response.
Asset paths follow AEM conventions: /content/dam/ for the DAM, dedicated
assets origin at assets.ford.com with Adobe Dynamic Media renditions.
No client-side state blob or JSON API surface detected in the served HTML.
Multi-subdomain topology suggests product data lives elsewhere:
shop.ford.com (vehicle catalog and configurator)
owner.ford.com (account and ownership data)
fordpro.com (commercial fleet)
Recommended: parse the server-rendered HTML directly for content on ford.com.
Probe shop.ford.com separately for product data; the architecture there may
differ (configurator likely needs the browser extra).
No Chromium downloaded. No Playwright on disk. cartograph identified the platform, named the asset patterns, and pointed you at the right subdomain to probe next. When you eventually need a browser, you opt in.
What it isn't
cartograph tells you which scraper to use; it doesn't do the scraping. It's a CLI and Python library that runs on your machine with your API key, outputs JSON, and never phones home. The full anti-positioning argument (vs Firecrawl, Apify, manual investigation, and what cartograph deliberately doesn't try to do) is in /docs/why-this-exists.md.
How it works
Four stages, escalating from cheap to expensive: (1) HTTP probe, (2) HTML analysis, (3) optional JS execution via Playwright (browser extra), and (4) Claude classification. Most sites need nothing beyond stages 1 and 2 before classification, so stage 3 is usually skipped. The pinned model is claude-sonnet-4-6. Full architecture, the published prompt, the output schema, and named failure modes live in /docs/how-it-works.md.
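The escalation idea can be sketched as a pipeline that runs the cheap stages unconditionally and the browser stage only on demand. This illustrates the pattern and is not cartograph's implementation: `Evidence`, `run_pipeline`, and every stage callable here are hypothetical stand-ins, and the signals are made up so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Evidence:
    """Everything gathered about one URL so far."""
    url: str
    signals: list[str] = field(default_factory=list)
    stages_run: list[int] = field(default_factory=list)


def run_pipeline(
    url: str,
    *,
    http_probe: Callable[[Evidence], None],
    html_analysis: Callable[[Evidence], None],
    js_execution: Callable[[Evidence], None],
    needs_browser: Callable[[Evidence], bool],
    classify: Callable[[Evidence], str],
    browser_enabled: bool = False,
) -> tuple[str, Evidence]:
    ev = Evidence(url)
    http_probe(ev)              # stage 1: cheap, always runs
    ev.stages_run.append(1)
    html_analysis(ev)           # stage 2: cheap, always runs
    ev.stages_run.append(2)
    if browser_enabled and needs_browser(ev):
        js_execution(ev)        # stage 3: expensive, opt-in only
        ev.stages_run.append(3)
    label = classify(ev)        # stage 4: classification over collected evidence
    ev.stages_run.append(4)
    return label, ev


label, ev = run_pipeline(
    "https://example.com",
    http_probe=lambda e: e.signals.append("status:200"),
    html_analysis=lambda e: e.signals.append("algolia-script-tag"),
    js_execution=lambda e: e.signals.append("xhr-log"),
    needs_browser=lambda e: len(e.signals) < 2,
    classify=lambda e: "direct_api" if "algolia-script-tag" in e.signals else "unknown",
)
print(label, ev.stages_run)  # direct_api [1, 2, 4]
```

Keeping stage 3 behind an explicit flag mirrors the opt-in browser extra: without it, the pipeline never touches a browser.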
Honest limits
Every probe returns a confidence score. When cartograph can't get a clean read, it says so. That matters more than the classifications it gets right, because a confidently wrong probe wastes real time downstream.
Phase 1 covers ~75% of public sites without a browser. Auth-walled sites, anti-bot defenses, and genuinely novel architectures are reported as limitations rather than guessed at. The --strict flag makes cartograph refuse to recommend a strategy when confidence drops below a threshold. See /docs/how-it-works.md for the full failure-mode taxonomy.
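The strict behaviour described above amounts to a confidence gate. A minimal sketch of that gate follows. The threshold value and the `recommend` helper are assumptions for illustration, since this README does not state the actual cutoff or how the flag is implemented.

```python
from typing import Optional

STRICT_THRESHOLD = 0.7  # hypothetical value; the real threshold is not stated here


def recommend(
    strategy: dict,
    confidence: float,
    strict: bool = False,
    threshold: float = STRICT_THRESHOLD,
) -> Optional[dict]:
    """Return the strategy, or None when strict mode rejects a low-confidence read."""
    if strict and confidence < threshold:
        return None
    return strategy


strategy = {"method": "algolia_search", "requires_browser": False}

print(recommend(strategy, confidence=0.94, strict=True))   # strategy is returned
print(recommend(strategy, confidence=0.40, strict=True))   # None
print(recommend(strategy, confidence=0.40, strict=False))  # strategy is returned
```

Returning `None` instead of a best guess forces the caller to handle the uncertain case explicitly.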
Roadmap
Phase 1 (active development). Stages 1, 2, 4. No browser. Covers most public sites. CLI, library, JSON + rich terminal output, full prompt published.
Phase 2. Stage 3 via pip install cartograph-ai[browser]. Playwright as optional extra. Uses system Chrome or Edge if available, falls back to Chromium only as last resort. Plus: caching layer, source-change detection, diff output for ongoing-mode users.
Phase 3 (stretch, may not happen). Probes against authenticated sites: login flows, API keys, paid subscriptions. Per-source encrypted credential storage.
Out of scope: crawler, anti-bot bypass, persistent storage, hosted SaaS.
Contribute
Issues welcome. Failed probes especially welcome. They're the input loop that improves the tool. Pull requests for framework fingerprints, prompt improvements, or test URLs against new patterns are all real contribution paths. See /CONTRIBUTING.md.
More
- Why this exists: the pattern, the anti-positioning, how this project came together
- How it works: architecture, the Claude prompt, failure modes, the structured output schema
- Contributing: workshop principle, how to help
- Changelog: what changed, when
Built with Claude Code.