Run automated tests against Salesforce Agentforce agents (External + Internal Copilot) and score them into evidence — fully local, privacy-first, no API key required.

agentforce-probe

A local, privacy-first CLI to run automated tests against Salesforce Agentforce agents — and score the results into evidence.

agentforce-probe auto-detects the agent type and picks the right path:

  • ExternalCopilot (customer/service agents) → drives sf agent test create/run/results. Salesforce Testing Center provides the LLM judge (output_validation) for you — no extra setup.
  • InternalCopilot (employee agents) → this is the tool's core value. Testing Center cannot run employee agents, so agentforce-probe walks the headless path instead: External Client App → Client Credentials mint (JWT) → Agent API headless session → one message per utterance → a configurable LLM-as-judge scores each response.

Both paths emit one unified evidence markdown report (per case: utterance / topic / agent response / each assertion), using the same assertion-filtering rules.
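The dispatch between the two paths can be sketched as below. This is an illustrative sketch only, not the project's code: the function name `choose_path` and the `_PATHS` mapping are hypothetical, while the type names and the `internal|external` override values are taken from this README.

```python
from typing import Optional

# Assumed mapping from the agent type names used in this README to a path label.
_PATHS = {"ExternalCopilot": "external", "InternalCopilot": "internal"}


def choose_path(bot_type: str, force_type: Optional[str] = None) -> str:
    """Return "external" or "internal" for an agent (hypothetical helper).

    force_type mirrors the --force-type internal|external override: when it is
    given, it wins over whatever type was detected in the org.
    """
    if force_type is not None:
        if force_type not in ("internal", "external"):
            raise ValueError(f"unsupported force type: {force_type!r}")
        return force_type
    if bot_type not in _PATHS:
        raise ValueError(f"unrecognized agent type: {bot_type!r}")
    return _PATHS[bot_type]
```

Raising on an unrecognized type, instead of guessing a path, keeps a detection misfire visible so the override can be applied deliberately.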

Why this exists

Salesforce's built-in Testing Center (sf agent test) only runs ExternalCopilot agents — the customer-facing ones that have a Bot User to impersonate. InternalCopilot (employee/internal) agents have no run-as Bot User, so the Testing Center judge never fires and you simply cannot get an automated test score for them through the supported tooling.

That's a real product gap. agentforce-probe closes it: for Internal agents it bypasses the Testing Center and drives the headless Agent API directly, replaying each utterance through a real session and grading the responses with an LLM-as-judge. One command, one evidence report, both agent types.

Privacy

Everything runs on your machine. The only outbound network calls are:

  1. to the target Salesforce org (sf CLI + the Agent API), and
  2. (InternalCopilot path only, and only if you opt into a live API-key judge) to the judge LLM you configure.

No telemetry, no third parties. Secrets (ECA consumer key/secret, judge API key) are read from a gitignored .env (or env vars), held in memory only, and are never printed, logged, written to evidence, or passed through a shell. Token diagnostics only ever expose length + JWT segment count — never bytes.
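A token diagnostic that exposes only length and segment count might look like the sketch below. The helper name `token_shape` and the returned keys are hypothetical; the one fact relied on is that a signed JWT in compact form has three dot-separated segments.

```python
def token_shape(token: str) -> dict:
    """Describe a token without revealing any of its characters (hypothetical helper)."""
    segments = token.split(".")
    return {
        "length": len(token),
        "segments": len(segments),
        # A signed JWT in compact form has three non-empty dot-separated segments.
        "looks_like_jwt": len(segments) == 3 and all(segments),
    }
```

Because the result holds only integers and a boolean, it is safe to print or log even when the token itself must stay in memory.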

Verification status & known limitations

Read this before you trust a score in anger. The tool is deliberately honest about what has and hasn't been validated.

What's verified

  • Logic layer — fully tested. 207 unit tests, 100% line coverage across all modules. Spec loading, assertion filtering, scoring, evidence rendering, the judge contract, token-shape validation, and the Agent API error ladder are all exercised — with the network and the sf CLI mocked.

What's not yet verified (the real gap)

  • 100% coverage is not the same as "proven against a live org." Every test mocks the network and sf. The genuine end-to-end paths — sf agent test against a real External agent, and ECA mint → JWT → live Agent API session against a real Internal agent — have not been re-run against a live Salesforce org in this open-source extraction. The InternalCopilot gotchas baked into the code (opaque-token 404, 412 config errors, bypassUser handling) were learned from real-world use, but treat your first live run as the first true end-to-end validation and sanity-check the evidence by hand.

Known limitations

  • Internal path needs a one-time manual UI step. The External Client App must have isNamedUserJwtEnabled on, or the mint returns an opaque token and the session endpoint 404s. The tool detects and reports this, but cannot fix it for you — see the ECA prerequisite below. This is the most common place to get stuck.
  • Agent-type detection relies on a live org query (BotDefinition.Type). If your org's metadata shape differs, auto-detection can misfire; override with --force-type internal|external (and --bot-id for the Internal path).
  • The handoff judge is an LLM, so verdicts are not perfectly reproducible. Two graders (or the same grader twice) may disagree on a borderline case. The score is a well-evidenced judgment, not a deterministic measurement, so always read the captured agent responses instead of rubber-stamping the result.
  • Single-turn only. Each utterance runs in its own fresh session; the tool does not test multi-turn context or memory.
  • endSession is best-effort and silently ignores failures, so an unreachable org could leave a dangling session server-side (low risk, no effect on the score).
  • --from-results accepts External-shaped payloads only (offline re-scoring of sf agent test results); there's no offline replay for the Internal path.
  • No PyPI package yet — install from source (pip install -e .).

Install

From source (until published to PyPI):

git clone https://github.com/raykuonz/agentforce-probe
cd agentforce-probe
pip install -e .

This installs the agentforce-probe console command. You can also run it as a module:

agentforce-probe --help
python3 -m agentforce_probe --help

The only runtime dependency is pyyaml.

Configure secrets (.env)

Only the InternalCopilot path needs secrets. Copy the template into the directory you run agentforce-probe from and fill it in (the file is gitignored):

cp .env.example .env
# then edit .env:
#   AGENTPROBE_SF_CONSUMER_KEY=...      (Internal path: ECA consumer key)
#   AGENTPROBE_SF_CONSUMER_SECRET=...   (Internal path: ECA consumer secret)
#   AGENTPROBE_ANTHROPIC_API_KEY=...    (only if you use a live API-key judge)
#   AGENTPROBE_OPENAI_API_KEY=...       (only if you use a live API-key judge)

Environment variables take precedence over .env. The ExternalCopilot path needs none of these (Testing Center judges for you). You can also point at a specific file with AGENTPROBE_ENV_FILE=/path/to/.env.
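The precedence rule can be illustrated with a small loader. This is a sketch under assumptions, not the project's config.py: the function names are hypothetical and the .env parsing is deliberately minimal (plain KEY=VALUE lines, # comments, optional surrounding quotes).

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(
    environ: Optional[Mapping[str, str]] = None, default_path: str = ".env"
) -> Dict[str, str]:
    """Merge file values with environment variables (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # AGENTPROBE_ENV_FILE points at a specific file; otherwise use ./.env.
    path = Path(environ.get("AGENTPROBE_ENV_FILE", default_path))
    merged = parse_env_text(path.read_text()) if path.is_file() else {}
    # Environment variables take precedence over values read from the file.
    merged.update({k: v for k, v in environ.items() if k.startswith("AGENTPROBE_")})
    return merged
```

Passing `environ` explicitly makes the precedence easy to test without touching the real process environment.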

Prerequisite for the Internal path — the External Client App

The InternalCopilot path needs an External Client App (ECA) configured for the Client Credentials flow. To get its consumer key/secret:

  1. Setup → App Manager (or External Client App Manager).
  2. Find your ECA → row dropdown → View / Manage Consumer Details (you may be asked to verify your identity).
  3. Copy the Consumer Key and Consumer Secret into .env.
  4. Confirm the ECA has Client Credentials enabled, a Run-As user (clientCredentialsFlowUser), and isNamedUserJwtEnabled ON — otherwise the mint returns an opaque token instead of a JWT and the Agent API session endpoint 404s. (agentforce-probe detects this and tells you.)

That's the only step that requires the Salesforce UI. Everything else is CLI.

Usage

doctor — preflight (local + read-only)

agentforce-probe doctor --org my-org

Reports whether sf is installed, whether the org connects, whether External Client Apps are present, whether the ECA secrets and judge keys are configured, and where .env is located. It never spends Einstein credits, and secrets are shown only as present/absent.

Run an ExternalCopilot agent (Testing Center)

agentforce-probe run \
  --org my-org \
  --agent Support_Concierge \
  --spec examples/specs/Support_Concierge-testSpec.yaml \
  --out support_concierge-evidence.md

Run an InternalCopilot agent (headless Agent API + judge)

agentforce-probe run \
  --org my-org \
  --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md
  # --judge handoff is the default (grade with Claude Code, no API key)

--judge selects the Internal-path judge:

  • handoff (default): no API key needed. Grade with Claude Code via a file-handoff protocol. See Judge via Claude Code.
  • openai:<model> / anthropic:<model> — grade live in one step using a raw LLM API key from .env.
  • mock — offline heuristic (no network), for dry runs / smoke tests.

⚠️ Running a real test (sf agent test run or a live Internal Agent API session) spends Einstein credits. doctor, --dry-run, --from-results, and --from-verdicts are all free / offline.

Judge via Claude Code (no API key needed)

The InternalCopilot path needs an LLM to grade each agent response PASS/FAIL. If your team has Claude Code (or a similar coding agent) open in the editor but no raw LLM API key, use the default handoff judge — a three-step file protocol where Claude Code is the judge runtime and agentforce-probe just defines the contract. No secret ever leaves your machine; the handoff files contain only test data.

Step ① — produce the judge task package (replays the agent; contacts no LLM):

agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md            # --judge handoff is the default

This mints the token, opens the headless Agent API session, sends every utterance, captures response / topic / invokedActions, and writes two files next to --out, then exits:

  • IT_Helpdesk_Assistant-judge-task.json — the grading materials (schema below).
  • IT_Helpdesk_Assistant-JUDGING.md — a block you paste into Claude Code.

Step ② — grade in Claude Code. Open Claude Code in this repo and paste the block from *-JUDGING.md. It instructs Claude Code to read the task package, apply the rubric, and write *-judge-verdicts.json (verdict is strictly PASS/FAIL, one entry per case id, no skips).

Step ③ — collect the verdicts into evidence (offline; no org/LLM call):

agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --from-verdicts IT_Helpdesk_Assistant-judge-verdicts.json \
  --out it_helpdesk-evidence.md

agentforce-probe reads the verdicts back, aligns them to the task package by id, recomputes topic/actions (from the recorded live values + the spec), uses each verdict as the output signal, applies the same assertion-filtering rules, and writes the unified evidence markdown. It validates that every case id has a verdict (missing ids = error) and that each verdict is PASS/FAIL.
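The two validation rules in this step (every case id has a verdict, and every verdict is PASS or FAIL) can be sketched as follows. The function name `align_verdicts` is hypothetical; the field names follow the schemas documented in this README, and both arguments are assumed to be the already-parsed JSON files.

```python
from typing import Dict


def align_verdicts(task: dict, verdicts: dict) -> Dict[int, dict]:
    """Map each case id in the task package to its verdict entry, or raise."""
    by_id = {}
    for entry in verdicts["verdicts"]:
        # Verdicts are strictly PASS or FAIL; anything else is rejected.
        if entry.get("verdict") not in ("PASS", "FAIL"):
            raise ValueError(f"case {entry.get('id')}: verdict must be PASS or FAIL")
        by_id[entry["id"]] = entry
    case_ids = [case["id"] for case in task["cases"]]
    # A case without a verdict is an error, not a silent skip.
    missing = [cid for cid in case_ids if cid not in by_id]
    if missing:
        raise ValueError(f"missing verdicts for case ids: {missing}")
    return {cid: by_id[cid] for cid in case_ids}
```

Failing fast on a missing id stops a partially graded run from producing an evidence report that looks complete.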

Schemas

<agent>-judge-task.json (agentforce-probe/judge-task@1):

{
  "schema": "agentforce-probe/judge-task@1",
  "agent": "IT_Helpdesk_Assistant", "org": "my-org",
  "rubric": "<strict-QA-grader rubric>",
  "instructions": "For each case, decide if actual_response satisfies expected_outcome. Write {id,verdict,reason} for every case into the verdicts file. Do not skip any case.",
  "cases": [
    {"id": 1, "utterance": "...", "expected_outcome": "...",
     "actual_response": "...", "actual_topic": "...", "actual_actions": ["..."]}
  ]
}

<agent>-judge-verdicts.json (agentforce-probe/judge-verdicts@1):

{"schema": "agentforce-probe/judge-verdicts@1",
 "agent": "IT_Helpdesk_Assistant",
 "verdicts": [{"id": 1, "verdict": "PASS", "reason": "..."}]}

All handoff files (*-judge-task.json, *-judge-verdicts.json, *-JUDGING.md) are run artifacts (test data) and are gitignored.

Test spec format (*.yaml)

name: "My Suite"
subjectType: AGENT
subjectName: IT_Helpdesk_Assistant
testCases:
  - utterance: "..."             # required
    expectedTopic: account_help  # optional (= subagent / topic name)
    expectedActions: [foo, bar]  # optional (Level-2 invocation names)
    expectedOutcome: "..."       # used by the judge; almost always present

See examples/specs/ for complete, runnable examples (using fictional demo data).
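A structural check on a spec might look like the sketch below. It assumes the YAML has already been parsed into a dict (for example with pyyaml's `yaml.safe_load`); the function name `spec_problems` and the error wording are hypothetical, and the real loader may enforce more or different rules.

```python
from typing import List


def spec_problems(spec: dict) -> List[str]:
    """Return human-readable problems found in a parsed test spec (hypothetical helper)."""
    cases = spec.get("testCases")
    if not isinstance(cases, list) or not cases:
        return ["testCases must be a non-empty list"]
    problems = []
    for index, case in enumerate(cases, start=1):
        if not isinstance(case, dict):
            problems.append(f"case {index}: must be a mapping")
            continue
        # utterance is the only required field per case.
        utterance = case.get("utterance")
        if not isinstance(utterance, str) or not utterance.strip():
            problems.append(f"case {index}: utterance is required")
        # expectedActions is optional, but must be a list when present.
        actions = case.get("expectedActions")
        if actions is not None and not isinstance(actions, list):
            problems.append(f"case {index}: expectedActions must be a list")
    return problems
```

An empty list means the spec has the expected shape; it says nothing about whether the named topics or actions exist in the org.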

Scoring rules (assertion filtering)

  • topic_assertion is scored only if the case declares expectedTopic.
  • actions_assertion is scored only if the case declares expectedActions.
  • output_validation (LLM-as-judge) is the primary behavioral signal and is scored for every case.
  • A dimension with no declared expectation renders as - and never counts against the score.

A topic FAIL with an output PASS usually means the agent behaved correctly even though single-turn routing picked a semantically adjacent topic — look at the primary output signal first.
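The filtering rules can be sketched in a few lines. This is an illustration under stated assumptions, not the project's scorer.py: the function names are hypothetical, and treating expectedActions as "every expected action must appear among the invoked ones" is an assumption about how actions are compared.

```python
from typing import Dict, List, Optional, Tuple


def score_case(
    case: dict, actual_topic: str, actual_actions: List[str], output_passed: bool
) -> Dict[str, Optional[bool]]:
    """Score one case; None marks a dimension with no declared expectation."""
    result: Dict[str, Optional[bool]] = {
        "topic_assertion": None,
        "actions_assertion": None,
        # output_validation is scored for every case.
        "output_validation": output_passed,
    }
    if "expectedTopic" in case:
        result["topic_assertion"] = case["expectedTopic"] == actual_topic
    if "expectedActions" in case:
        # Assumption: all expected actions must be among the invoked actions.
        result["actions_assertion"] = set(case["expectedActions"]) <= set(actual_actions)
    return result


def render(value: Optional[bool]) -> str:
    """Render a dimension as PASS, FAIL, or - when it was not declared."""
    return "-" if value is None else ("PASS" if value else "FAIL")


def tally(result: Dict[str, Optional[bool]]) -> Tuple[int, int]:
    """Return (passed, scored), counting only declared dimensions."""
    scored = [value for value in result.values() if value is not None]
    return sum(scored), len(scored)
```

For example, a case that declares only expectedTopic, routes to a different topic, and passes the judge yields one pass out of two scored dimensions, with the actions column rendered as -.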

Module layout

file            responsibility
cli.py          argparse entrypoint; dispatches run / doctor
config.py       reads secrets from .env / env; never exposes values
doctor.py       local + read-only preflight checks
agent_meta.py   resolves BotDefinition.Type/Id (Internal vs External)
sf_external.py  ExternalCopilot path via sf agent test
agent_api.py    InternalCopilot mint + headless Agent API (urllib, token-safe)
sf_internal.py  Internal path orchestration (session → judge → score)
judge.py        configurable judge: handoff (default) + live openai/anthropic/mock
scorer.py       spec loading + assertion-filtering scorer
evidence.py     unified evidence markdown generator
sfcli.py        sf CLI wrapper + banner-tolerant JSON parsing
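One way to read "banner-tolerant JSON parsing" is that the CLI may print warnings or notices around its JSON output. Under that assumption, a tolerant parser can be sketched with the standard library as below; the function name is hypothetical and this is not necessarily how sfcli.py does it.

```python
import json


def parse_json_with_banner(output: str):
    """Return the first JSON object or array found in output, ignoring other text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(output):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index and ignores the rest.
            value, _ = decoder.raw_decode(output, index)
        except json.JSONDecodeError:
            continue
        return value
    raise ValueError("no JSON value found in CLI output")
```

A banner that itself contains valid JSON, such as a bare `[1]`, would be returned first, so a production parser may want to check for the expected top-level keys as well.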

InternalCopilot Agent API — gotchas baked in

These are battle-tested; the code enforces them so you don't re-learn them:

  1. Mint: POST grant_type=client_credentials to {instance}/services/oauth2/token; read access_token and api_instance_url.
  2. Token must be a JWT (~1700 chars, 3 dot segments). An opaque token → 404 → isNamedUserJwtEnabled is off. The tool refuses to proceed on an opaque token.
  3. Host = api_instance_url from the mint response (sandbox/scratch = https://test.api.salesforce.com). Never hardcoded.
  4. Session: POST .../einstein/ai-agent/v1/agents/{0Xx...}/sessions with bypassUser:false (true → 400 "Invalid user ID"). Run-as = the ECA's clientCredentialsFlowUser; no userId in the body.
  5. Message: POST .../sessions/{id}/messages with {"message":{"sequenceId":N,"type":"Text","text":"..."}}, N increments.
  6. Error ladder: 404 empty = wrong host / opaque token; 400 "Invalid user ID" = use bypassUser:false; 412 "Invalid Config" = auth OK but planner config broken (usually an action missing its inputs block).
  7. Bearer hygiene: the auth header is built at runtime from an in-memory variable (never a source literal, never echo'd) to dodge both shell-quoting and log-redaction traps.
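The request shapes and the error ladder from the list above can be sketched as plain functions. All names here are hypothetical and nothing is sent over the network; the payload fields and status meanings are taken from the list, and the header layout is the standard HTTP Bearer scheme.

```python
from typing import Dict


def auth_headers(token: str) -> Dict[str, str]:
    """Build the auth header at runtime from an in-memory token."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def session_body() -> dict:
    """Body for the session POST: bypassUser is false and no userId is sent.

    Only bypassUser is stated in the list above; a real request may need more fields.
    """
    return {"bypassUser": False}


def message_body(sequence_id: int, text: str) -> dict:
    """Body for the messages POST; sequence_id increments with each message."""
    return {"message": {"sequenceId": sequence_id, "type": "Text", "text": text}}


def diagnose(status: int, body: str) -> str:
    """Translate an Agent API failure into the likely cause, per the error ladder."""
    if status == 404 and not body.strip():
        return "wrong host or opaque token (check isNamedUserJwtEnabled)"
    if status == 400 and "Invalid user ID" in body:
        return "use bypassUser:false"
    if status == 412 and "Invalid Config" in body:
        return "auth OK but planner config broken (often an action missing its inputs block)"
    return f"unclassified HTTP {status}"
```

Keeping the diagnosis in a pure function means the ladder can be unit-tested with mocked responses, with no live org involved.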

Development

pip install -e ".[dev]"
pre-commit install   # gate every commit/push on the same checks CI runs
pytest               # run the test suite (pure logic; no network, no secrets)
ruff check .         # lint

Pre-commit / pre-push gate

The repo ships a pre-commit config with local hooks (no external hook repos, works offline). After pre-commit install:

  • on every commit — a privacy/hygiene scan (scripts/check-secrets.sh: no secrets, JWTs, org IDs, customer data, agent footprints, or run artifacts) plus ruff check and ruff format --check.
  • on every push — the full pytest suite with the 100% coverage gate.

So a commit that would leak a secret, or a push that would break a test or drop coverage, is blocked locally before it ever reaches GitHub. CI re-runs the same checks, so green-local means green-pipeline.

License

MIT.
