LLM-driven self-healing API discovery for undocumented SaaS portals via CDP

site-mapper-agents

LLM-once API discovery + self-healing extraction for any browser-accessible portal.

Burst-record CDP network traffic from a portal you have a browser session on, hand it to a three-agent team, get back a typed schema + signatures you can extract from forever — with auto-repair when the portal's API shape drifts.


The problem

Every SaaS portal has a different API. Writing extractors for each is a treadmill — and the schemas change without warning, so your extractors silently break.

Pre-built connectors only cover the top 20 platforms. For everything else (internal CRMs, niche-vertical tools, undocumented partner portals) you either pay someone to reverse-engineer the API, or you give up and scrape the DOM.

This library is the third option.

What this solves

  1. Onboarding: you point the system at a portal you have a real browser session on. It records a burst of CDP network traffic while you click around, then asks an LLM once to classify which endpoints carry the data you want and to map response JSON keys to your fields. Output is a typed SiteSchema and a list of NetworkSignature patterns.
  2. Extraction: from that point forward, every CDP event is matched against the saved signatures with pure Pydantic validation — sub-millisecond, no LLM calls, no cost.
  3. Self-healing: when the portal changes its response shape, an ExtractionFailed event fires. The Healer compares the old key map against the new response, fixes what it can deterministically, and asks the LLM to semantically match the rest. Confident patches auto-apply. Borderline patches surface for human review.
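To make the key-map idea concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that a signature's key map stores one dotted path per target field. `KEY_MAP`, `resolve_path`, and `extract` are hypothetical names and not part of the library, and the real Eavesdropper validates with Pydantic rather than a hand-written walker.

```python
from typing import Any

# Hypothetical key map: target field name -> dotted path into the response JSON.
KEY_MAP = {
    "account_id": "data.client.id",
    "email": "data.client.email",
}

_MISSING = object()  # sentinel, so a legitimate None value is not treated as absent


def resolve_path(body: Any, dotted_path: str) -> Any:
    """Walk a dotted path through nested dicts; return _MISSING if any hop is absent."""
    current = body
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def extract(body: dict, key_map: dict[str, str]) -> tuple[dict[str, Any], list[str]]:
    """Return (extracted fields, names of fields whose path no longer resolves)."""
    found: dict[str, Any] = {}
    missing: list[str] = []
    for field, path in key_map.items():
        value = resolve_path(body, path)
        if value is _MISSING:
            missing.append(field)
        else:
            found[field] = value
    return found, missing


ok_body = {"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}}
drifted_body = {"data": {"customer": {"id": "acct_42", "email": "ada@example.com"}}}

print(extract(ok_body, KEY_MAP))       # every field found, nothing missing
print(extract(drifted_body, KEY_MAP))  # nothing found, both fields reported missing
```

The second call is the situation the Healer exists for: the data is still in the response, but the path to it has moved.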

The three agents

                   ┌──────────────────────────────────────────────┐
                   │  Browser session → CDP forwarder → events    │
                   └──────────────────────┬───────────────────────┘
                                          │
   ┌──────────────────────┐               │              ┌───────────────────────┐
   │     Architect        │ ◀── once ──── │ ──── live ──▶│     Eavesdropper      │
   │  (LLM classifies     │               │              │  (Pydantic only,      │
   │   endpoints, builds  │               │              │   sub-ms hot path)    │
   │   SiteSchema +       │               │              │                       │
   │   signatures)        │               │              │ emits ExtractionResult│
   └──────────────────────┘               │              │   or ExtractionFailed │
              │                           │              └───────────┬───────────┘
              ▼                           │                          │
   ╔══════════════════════╗               │              ┌───────────▼───────────┐
   ║   MappedSite +       ║◀──── heals ───┼──────────────│       Healer          │
   ║   NetworkSignatures  ║               │              │  (LLM re-maps stale   │
   ╚══════════════════════╝               │              │   keys, auto-applies  │
                                          │              │   confident patches)  │
                                          │              └───────────────────────┘
  • Architect — runs once. Expensive. Produces the schema.
  • Eavesdropper — runs on every event. Free. Pure validation.
  • Healer — runs only on failures. Costs nothing when nothing breaks.

Install

pip install site-mapper-agents

For the runnable examples you'll also want a pydantic-ai provider:

pip install 'pydantic-ai[anthropic]'   # or [openai], [ollama], ...

Quickstart

import asyncio
from pydantic_ai.models.test import TestModel

from site_mapper_agents import (
    Architect,
    CDPNetworkEvent,
    Eavesdropper,
    TargetField,
    UserIntent,
)

# 1. Tell the system what you want to extract.
intent = UserIntent(
    description="Customer account details",
    target_fields=[
        TargetField(name="account_id", description="Account UUID"),
        TargetField(name="email", description="Primary contact email"),
    ],
)

# 2. Construct the Architect. Replace TestModel with a real provider.
architect = Architect(model=TestModel())  # or AnthropicModel("claude-sonnet-4-5")

# 3. Feed it a burst of CDP traffic (your forwarder produced these).
architect.record_traffic(CDPNetworkEvent(
    request_id="r1",
    url="https://crm.example.com/api/v2/accounts/42",
    method="GET",
    body={"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}},
))

# 4. Ask the Architect to propose a schema.
async def onboard():
    proposal = await architect.propose(
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    site = architect.build_mapped_site(
        proposal=proposal,
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    return site

site = asyncio.run(onboard())

# 5. From now on, every live CDP event runs through the Eavesdropper.
eaves = Eavesdropper()
result, event = eaves.ingest(
    CDPNetworkEvent(
        request_id="r2",
        url="https://crm.example.com/api/v2/accounts/99",
        method="GET",
        body={"data": {"client": {"id": "acct_99", "email": "g@example.com"}}},
    ),
    sites=[site],
)
print(result.data_payload if result else "no match")

API reference

Architect(model=None, vocabulary=None, policy=DEFAULT_ONBOARDING_POLICY, model_settings=None)

The onboarding agent. LLM-once.

Parameter       Type                        Notes
model           pydantic_ai.Model | None    Any pydantic-ai model. None → heuristic.
vocabulary      list[EndpointType] | None   Caller-supplied classifications. See below.
policy          OnboardingPolicy            Sample-count thresholds.
model_settings  ModelSettings | None        max_tokens, temperature, etc.

Methods:

  • record_traffic(event) — buffer a CDP event during onboarding.
  • record_click() — mark that the user clicked something.
  • has_enough_samples() → bool — policy check.
  • detect_endpoints() → list[DetectedEndpoint] — deterministic pre-processing.
  • await propose(*, target_url, user_intent, llm_classify=None) → ArchitectProposal — the main entry point.
  • build_mapped_site(*, proposal, target_url, user_intent) → MappedSite — promote an approved proposal to an active site.
  • emit_event(site, *, success=True, reason="") → SiteMapped | OnboardingFailed.
  • reset() — clear buffers for the next onboarding session.

Eavesdropper(policy=DEFAULT_EXTRACTION_POLICY)

The runtime agent. No LLM. Pure Pydantic validation.

Methods:

  • ingest(event, sites) → (ExtractionResult | None, ExtractionSucceeded | ExtractionFailed | None).
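The reference above does not spell out how a signature's URL pattern is expressed or matched. As one possible approach, the sketch below generalises a single observed URL into a regular expression by wildcarding path segments that look like record ids. `url_to_pattern` is a hypothetical helper written for this illustration, and the "digits or UUID" heuristic is an assumption, not the library's rule.

```python
import re
from urllib.parse import urlsplit

# Path segments that look like record ids: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def url_to_pattern(url: str) -> re.Pattern:
    """Generalise one observed URL into a regex that ignores id-like segments and the query string."""
    parts = urlsplit(url)
    segments = [
        r"[^/]+" if _ID_SEGMENT.match(seg) else re.escape(seg)
        for seg in parts.path.split("/")
    ]
    origin = re.escape(f"{parts.scheme}://{parts.netloc}")
    return re.compile("^" + origin + "/".join(segments) + r"(\?.*)?$")


pattern = url_to_pattern("https://crm.example.com/api/v2/accounts/42")

print(bool(pattern.match("https://crm.example.com/api/v2/accounts/99")))  # same endpoint, other record
print(bool(pattern.match("https://crm.example.com/api/v2/invoices/99")))  # different endpoint
```

Matching on a compiled pattern like this stays cheap per event, which fits the "no LLM on the hot path" design. The trade-off is that opaque string ids such as acct_42 would not be wildcarded by this heuristic.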

Healer(model=None, policy=DEFAULT_HEALING_POLICY, model_settings=None)

The self-healing agent.

Methods:

  • await diagnose(*, site, failed_event, new_response_body=None, llm_semantic_match=None) → HealerPatch.
  • apply_patch(site, patch) → (bool, SchemaHealed | HealingFailed | SiteDegraded).
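The Healer is described as fixing what it can deterministically before asking the LLM. A rough sketch of what such a structural pass could look like follows. It assumes dotted-path key maps and re-anchors a stale path only when exactly one path in the new response ends in the same leaf key. `iter_leaf_paths` and `structural_repath` are hypothetical names, and the library's actual matching rules may differ.

```python
from typing import Any, Iterator


def iter_leaf_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every non-dict value in a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from iter_leaf_paths(child, path)
    else:
        yield prefix


def structural_repath(
    old_key_map: dict[str, str], new_body: dict
) -> tuple[dict[str, str], list[str]]:
    """Return (patched key map, fields that still need semantic matching)."""
    new_paths = list(iter_leaf_paths(new_body))
    patched: dict[str, str] = {}
    unresolved: list[str] = []
    for field, old_path in old_key_map.items():
        if old_path in new_paths:
            patched[field] = old_path  # path still valid, nothing to heal
            continue
        leaf = old_path.rsplit(".", 1)[-1]
        candidates = [p for p in new_paths if p.rsplit(".", 1)[-1] == leaf]
        if len(candidates) == 1:
            patched[field] = candidates[0]  # same key, new location
        else:
            unresolved.append(field)  # renamed or ambiguous: left for the LLM step
    return patched, unresolved


old_map = {"account_id": "data.client.id", "email": "data.client.email"}
new_body = {"data": {"customer": {"id": "acct_42", "contact_email": "ada@example.com"}}}

print(structural_repath(old_map, new_body))
```

In this example `account_id` is resolved without any model call because the `id` key merely moved, while `email` was renamed to `contact_email` and is left as unresolved for semantic matching.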

Models

Class              Purpose
CDPNetworkEvent    One captured network response. Library input.
TargetField        One data point the caller wants extracted.
UserIntent         A bundle of target fields with a human description.
EndpointType       One entry in the Architect's classification vocabulary.
DetectedEndpoint   Pre-LLM view of a unique endpoint.
NetworkSignature   URL pattern + JSON-key map. Saved per site.
SiteSchema         The extraction contract for one intent.
ArchitectProposal  Architect's structured output before user confirms.
HealerPatch        Healer's structured output for one repair attempt.
MappedSite         Aggregate root: schemas + signatures + status.
ExtractionResult   Eavesdropper's output for one matched event.

Domain events

SiteMapped, OnboardingFailed, ExtractionSucceeded, ExtractionFailed, SchemaHealed, HealingFailed, SiteDegraded.

All extend AutomationEvent (frozen Pydantic model).

Endpoint vocabularies

The Architect's LLM prompt embeds a list of EndpointType definitions that tell the model "you may only classify endpoints into one of these categories". The default vocabulary covers generic CRUD shapes:

name            what it means
list_records    Paginated list of records (grid/table views).
detail_view     One record's full detail (after click-through).
search          Filtered records based on user query.
create_record   POST/PUT that creates a new record.
update_record   PATCH/PUT that mutates an existing record.
delete_record   DELETE.
reference_data  Lookup / enum / config data.
metrics         Dashboard counts/aggregates.
unknown         Fallback when nothing fits.

You'll usually want to extend this with site-specific categories:

from site_mapper_agents import (
    Architect,
    default_vocabulary,
    define_endpoint_type,
    merge_vocabularies,
)

vocab = merge_vocabularies(
    default_vocabulary(),
    [
        define_endpoint_type(
            name="invoice_pdf_download",
            description="Streaming download of a generated invoice PDF",
            expected_fields=["invoice_id", "pdf_url"],
        ),
        define_endpoint_type(
            name="webhook_subscription",
            description="Webhook registration endpoint that returns the subscription id",
            expected_fields=["subscription_id", "target_url", "events"],
        ),
    ],
)

architect = Architect(model=my_model, vocabulary=vocab)

LLM providers

The library binds to any provider pydantic-ai supports — just pass a Model instance (or its name) to the agent constructor:

# Anthropic
from pydantic_ai.models.anthropic import AnthropicModel
architect = Architect(model=AnthropicModel("claude-sonnet-4-5"))

# OpenAI
from pydantic_ai.models.openai import OpenAIModel
architect = Architect(model=OpenAIModel("gpt-4o"))

# Ollama (or any OpenAI-compatible local server)
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
architect = Architect(model=OpenAIModel(
    "llama3.1:8b",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
))

# Deterministic stub for tests
from pydantic_ai.models.test import TestModel
architect = Architect(model=TestModel())

CDP burst format

CDPNetworkEvent is the only input shape the library cares about:

CDPNetworkEvent(
    request_id="<unique-id>",
    url="https://...",
    method="GET",
    status_code=200,
    headers={"content-type": "application/json"},
    body={"data": {"...": "..."}},   # parsed JSON
    frame_origin=None,                # set for iframe traffic
    target_id=None,                   # CDP target id, for multi-frame disambiguation
    timestamp=1715760000.0,
)

The library does not capture CDP traffic itself. Use a sibling tool — e.g. axumquant/cdp-network-interceptor — or your own Chrome extension / Puppeteer / Playwright session that emits this shape.
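For a forwarder built directly on CDP, the fields above can be assembled from three protocol payloads: the params of `Network.requestWillBeSent`, the params of `Network.responseReceived`, and the result of `Network.getResponseBody`. The sketch below is one way to do that with plain dicts. `to_event_fields` is a hypothetical helper, and it returns keyword arguments rather than constructing a CDPNetworkEvent so that it runs without the library installed.

```python
import base64
import json
from typing import Any, Optional


def to_event_fields(
    request_will_be_sent: dict[str, Any],
    response_received: dict[str, Any],
    get_response_body_result: dict[str, Any],
) -> Optional[dict[str, Any]]:
    """Combine three raw CDP payloads into keyword arguments for the shape shown above.

    Returns None when the body is not JSON, since the documented shape expects parsed JSON.
    """
    raw = get_response_body_result["body"]
    if get_response_body_result.get("base64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return None

    response = response_received["response"]
    return {
        "request_id": response_received["requestId"],
        "url": response["url"],
        # The HTTP method lives on the request payload, not the response.
        "method": request_will_be_sent["request"]["method"],
        "status_code": response["status"],
        # Lower-case header names so lookups do not depend on server casing.
        "headers": {k.lower(): v for k, v in response["headers"].items()},
        "body": body,
        # wallTime is seconds since the epoch; the events' `timestamp` field is monotonic.
        "timestamp": request_will_be_sent["wallTime"],
    }


sent = {"requestId": "r1", "wallTime": 1715760000.0, "request": {"method": "GET"}}
received = {
    "requestId": "r1",
    "response": {
        "url": "https://crm.example.com/api/v2/accounts/42",
        "status": 200,
        "headers": {"Content-Type": "application/json"},
    },
}
body_result = {"body": '{"data": {"id": "acct_42"}}', "base64Encoded": False}

fields = to_event_fields(sent, received, body_result)
print(fields)
```

This sketch leaves out `frame_origin` and `target_id`, because deriving them depends on how the forwarder tracks frames and targets. A forwarder that handles iframe traffic would need to fill them in before passing `fields` on as `CDPNetworkEvent(**fields)`.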

Self-healing flow

When does the Healer fire?

  1. The Eavesdropper validates an incoming event and detects missing fields against a registered signature.
  2. It emits ExtractionFailed and returns it from ingest().
  3. Your orchestrator passes the failed event (plus the raw response body) to Healer.diagnose().
  4. The Healer runs structural matching first (same key still exists? then we just need a path tweak). If everything resolves structurally, no LLM call happens.
  5. Otherwise the Healer calls its pydantic-ai Agent with the old key map + new available keys + unresolved field names.
  6. The returned HealerPatch has an aggregate confidence:
    • ≥ auto_approve_above (default 0.90) → apply_patch() succeeds, emits SchemaHealed, signature is replaced in-place.
    • [min_semantic_confidence, require_human_review_below) (default 0.70–0.75) → apply_patch() returns HealingFailed with the reason "requires human review". Surface this to the user.
    • < min_semantic_confidence → site is marked DEGRADED, retried up to max_attempts times, then marked BROKEN.
  7. Persistence is the caller's job — the library mutates the MappedSite aggregate in memory but doesn't write it anywhere.
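The confidence bands in step 6 can be read as a small routing function. The sketch below mirrors the three default thresholds quoted above. `HealingThresholds` and `route_patch` are hypothetical names, not the library's policy object. The text does not say what happens between require_human_review_below and auto_approve_above, so the sketch reports that band as unspecified rather than guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealingThresholds:
    """Hypothetical container mirroring the three defaults quoted in step 6."""

    min_semantic_confidence: float = 0.70
    require_human_review_below: float = 0.75
    auto_approve_above: float = 0.90


def route_patch(confidence: float, t: HealingThresholds = HealingThresholds()) -> str:
    """Map a patch's aggregate confidence to the outcome described in step 6."""
    if confidence >= t.auto_approve_above:
        return "auto_apply"      # SchemaHealed, signature replaced
    if confidence < t.min_semantic_confidence:
        return "degrade"         # site marked DEGRADED, retried, eventually BROKEN
    if confidence < t.require_human_review_below:
        return "human_review"    # HealingFailed, surfaced to the user
    # Assumption: the documented bands leave this range open, so it is flagged explicitly.
    return "unspecified"


for score in (0.95, 0.80, 0.72, 0.40):
    print(score, route_patch(score))
```

An orchestrator would normally persist the MappedSite only after the auto_apply outcome, since the library mutates the aggregate in memory and leaves storage to the caller.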

Use cases

  • Salesforce custom-object extraction — Salesforce's API surface is huge and per-tenant. Onboard once against the tenant you have a session on, extract from then on.
  • HubSpot scraping — undocumented internal endpoints powering the UI.
  • Internal CRM discovery — your customer is on some no-name CRM you've never seen. Onboarding takes minutes.
  • Pre-acquisition portal audits — point it at a target's admin portal, get back a structured map of their data surface.
  • Partner integrations with companies who refuse to ship an API.

Pitfalls

  • The Architect costs money — it's an LLM call with a non-trivial prompt + context. Budget for one call per site you map. The Eavesdropper is free; the Healer only fires when something breaks.
  • Schema drift is real — sites change shapes monthly. Wire the Healer or you'll be debugging in production.
  • Auth-protected endpoints — the library never authenticates for you. You drive a real browser session; the CDP forwarder captures authenticated traffic. The library only sees the resulting bodies.
  • Rate limits — your scraping cadence is your problem. Polite pacing is on you.
  • Iframe traffic — the library handles frame_origin matching correctly, but your CDP forwarder MUST populate it. Without frame_origin, iframe responses match parent-frame signatures, which produces garbage extractions.
  • The vocabulary matters — generic CRUD works for most sites, but niche portals benefit a lot from a custom vocabulary that names the domain entities (e.g. invoice_line_items vs generic list_records).
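The iframe pitfall is easier to see with a toy matcher. The sketch below assumes a signature is identified by a URL prefix plus the frame origin it was recorded in, with None standing for the top-level frame. `Sig` and `pick_signature` are hypothetical stand-ins for illustration and do not reflect the library's internal matching code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sig:
    """Hypothetical stand-in for a saved signature."""

    name: str
    url_prefix: str
    frame_origin: Optional[str] = None  # None is taken to mean the top-level frame


def pick_signature(url: str, frame_origin: Optional[str], sigs: list[Sig]) -> Optional[Sig]:
    """Return the first signature whose URL prefix and frame origin both match."""
    for sig in sigs:
        if url.startswith(sig.url_prefix) and sig.frame_origin == frame_origin:
            return sig
    return None


sigs = [
    Sig("parent_items", "https://crm.example.com/api/items"),
    Sig("widget_items", "https://crm.example.com/api/items",
        frame_origin="https://widgets.example.net"),
]
url = "https://crm.example.com/api/items/7"

# Forwarder populated frame_origin: the iframe response reaches the iframe signature.
print(pick_signature(url, "https://widgets.example.net", sigs).name)

# Forwarder left frame_origin unset: the same response falls through to the parent signature.
print(pick_signature(url, None, sigs).name)
```

The second call shows the failure mode described above: nothing errors, the response is simply validated against the wrong key map.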

License

MIT — see LICENSE.
