Run automated tests against Salesforce Agentforce agents (External + Internal Copilot) and score them into evidence — fully local, privacy-first, no API key required.

agentforce-probe

A local, privacy-first CLI to run automated tests against Salesforce Agentforce agents — and score the results into evidence.

TL;DR — Salesforce's Testing Center can score your customer-facing agents but silently can't touch your employee-facing ones. agentforce-probe tests both from one command and hands you a single evidence report. It runs entirely on your machine, sends nothing to third parties, and needs no API key to get started.

Do I need to set anything up? (the 30-second version)

| You're testing… | What you need | Headless API / ECA? |
| --- | --- | --- |
| ExternalCopilot (customer/service agents) | just an sf-authenticated org | ❌ none: Testing Center judges for you, zero secrets |
| InternalCopilot (employee agents) | the above + a one-time External Client App (consumer key/secret in .env) | ✅ yes: this is the headless path Testing Center can't do |
| Optional: grade with a live LLM judge | an OpenAI/Anthropic API key | the default judge is a no-key Claude Code handoff |

So: External agents work out of the box. The only real setup is a one-time ECA for the Internal path — and even then the judge needs no API key by default. Full steps are in Configure secrets.

agentforce-probe auto-detects the agent type and picks the right path:

  • ExternalCopilot (customer/service agents) → drives sf agent test create/run/results. Salesforce Testing Center provides the LLM judge (output_validation) for you — no extra setup.
  • InternalCopilot (employee agents) → this is the tool's core value. Testing Center cannot run employee agents, so agentforce-probe walks the headless path instead: External Client App → Client Credentials mint (JWT) → Agent API headless session → one message per utterance → a configurable LLM-as-judge scores each response.

Both paths emit one unified evidence markdown report (per case: utterance / topic / agent response / each assertion), using the same assertion-filtering rules.

Why this exists

Salesforce's built-in Testing Center (sf agent test) only runs ExternalCopilot agents — the customer-facing ones that have a Bot User to impersonate. InternalCopilot (employee/internal) agents have no run-as Bot User, so the Testing Center judge never fires and you simply cannot get an automated test score for them through the supported tooling.

That's a real product gap. agentforce-probe closes it: for Internal agents it bypasses the Testing Center and drives the headless Agent API directly, replaying each utterance through a real session and grading the responses with an LLM-as-judge. One command, one evidence report, both agent types.

Privacy

Everything runs on your machine. The only outbound network calls are:

  1. to the target Salesforce org (sf CLI + the Agent API), and
  2. (InternalCopilot path only, and only if you opt into a live API-key judge) to the judge LLM you configure.

No telemetry, no third parties. Secrets (ECA consumer key/secret, judge API key) are read from a gitignored .env (or env vars), held in memory only, and are never printed, logged, written to evidence, or passed through a shell. Token diagnostics only ever expose length + JWT segment count — never bytes.
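
As an illustration of the "length + JWT segment count" diagnostics described above, here is a minimal sketch. The function name `token_diagnostics` and the returned keys are hypothetical and not taken from the project's source; the only point is that the shape of a token can be reported without echoing any of its characters.

```python
def token_diagnostics(token: str) -> dict:
    """Describe a token's shape without revealing any of its characters."""
    segments = token.split(".") if token else []
    return {
        "length": len(token),
        "segments": len(segments),
        # Three non-empty, dot-separated segments is the shape of a JWT.
        "looks_like_jwt": len(segments) == 3 and all(segments),
    }
```

A caller could log the returned dict safely, since it contains counts and a boolean but never the token itself.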

Verification status & known limitations

Read this before you rely on a score for a real decision. The tool is deliberately honest about what has and hasn't been validated.

What's verified

  • Logic layer — fully tested. 207 unit tests, 100% line coverage across all modules. Spec loading, assertion filtering, scoring, evidence rendering, the judge contract, token-shape validation, and the Agent API error ladder are all exercised — with the network and the sf CLI mocked.

What's not yet verified (the real gap)

  • 100% coverage is not the same as "proven against a live org." Every test mocks the network and sf. The genuine end-to-end paths — sf agent test against a real External agent, and ECA mint → JWT → live Agent API session against a real Internal agent — have not been re-run against a live Salesforce org in this open-source extraction. The InternalCopilot gotchas baked into the code (opaque-token 404, 412 config errors, bypassUser handling) were learned from real-world use, but treat your first live run as the first true end-to-end validation and sanity-check the evidence by hand.

Known limitations

  • Internal path needs a one-time manual UI step. The External Client App must have isNamedUserJwtEnabled on, or the mint returns an opaque token and the session endpoint 404s. The tool detects and reports this, but cannot fix it for you — see the ECA prerequisite below. This is the most common place to get stuck.
  • Agent-type detection relies on a live org query (BotDefinition.Type). If your org's metadata shape differs, auto-detection can misfire; override with --force-type internal|external (and --bot-id for the Internal path).
  • The handoff judge is an LLM, so verdicts are not perfectly reproducible. Two graders (or the same grader twice) may disagree on a borderline case. The score is a well-evidenced judgment, not a deterministic measurement, so always read the captured agent responses rather than rubber-stamping the verdict.
  • Single-turn only. Each utterance runs in its own fresh session; the tool does not test multi-turn context or memory.
  • endSession is best-effort and silently ignores failures, so an unreachable org could leave a dangling session server-side (low risk, no effect on the score).
  • --from-results accepts External-shaped payloads only (offline re-scoring of sf agent test results); there's no offline replay for the Internal path.
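
The detection-plus-override behavior mentioned in the list above can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's code: `resolve_path` and `PATH_BY_TYPE` are hypothetical names, and the only grounded facts are the two `BotDefinition.Type` values and the `internal|external` override values named in this README.

```python
from typing import Optional

# Type values named in this README; anything else is treated as unknown.
PATH_BY_TYPE = {"ExternalCopilot": "external", "InternalCopilot": "internal"}


def resolve_path(queried_type: Optional[str], force_type: Optional[str] = None) -> str:
    """Pick 'external' or 'internal'; an explicit override always wins."""
    if force_type is not None:
        if force_type not in ("internal", "external"):
            raise ValueError(f"force_type must be internal or external, got {force_type!r}")
        return force_type
    try:
        return PATH_BY_TYPE[queried_type]
    except KeyError:
        # Fail loudly instead of guessing when the org returns an unexpected value.
        raise ValueError(
            f"unrecognized agent type {queried_type!r}; pass --force-type internal|external"
        ) from None
```

Failing on an unknown type, rather than defaulting to one path, is a design choice made for this sketch; it keeps a misdetected agent from silently spending credits on the wrong path.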

Install

From PyPI:

pip install agentforce-probe

This installs the agentforce-probe console command. You can also run it as a module:

agentforce-probe --help
python3 -m agentforce_probe --help

The only runtime dependency is pyyaml.

To install from source instead (e.g. to track main or hack on it):

git clone https://github.com/raykuonz/agentforce-probe
cd agentforce-probe
pip install -e .

Install the Claude skill (no CLI needed)

This repo ships a Claude Code skill (probe-agentforce-agents) that teaches an agent when and how to drive agentforce-probe. Install it into your agent in one command with vercel-labs/skills. The skill itself needs no clone and no separate install step, just npx:

# Preview the skill without installing
npx skills add raykuonz/agentforce-probe --list

# Install it globally into Claude Code
npx skills add raykuonz/agentforce-probe -g -a claude-code -y

It also works with Cursor, Codex, OpenCode, and 50+ other agents — drop the -a claude-code flag to pick interactively. The skill assumes the agentforce-probe CLI is installed (see above).

Configure secrets (.env)

Only the InternalCopilot path needs secrets. Copy the template into the directory you run agentforce-probe from and fill it in (the file is gitignored):

cp .env.example .env
# then edit .env:
#   AGENTPROBE_SF_CONSUMER_KEY=...      (Internal path: ECA consumer key)
#   AGENTPROBE_SF_CONSUMER_SECRET=...   (Internal path: ECA consumer secret)
#   AGENTPROBE_ANTHROPIC_API_KEY=...    (only if you use a live API-key judge)
#   AGENTPROBE_OPENAI_API_KEY=...       (only if you use a live API-key judge)

Environment variables take precedence over .env. The ExternalCopilot path needs none of these (Testing Center judges for you). You can also point at a specific file with AGENTPROBE_ENV_FILE=/path/to/.env.
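
The precedence rule above (environment variables over .env, with AGENTPROBE_ENV_FILE selecting the file) could be implemented along these lines. This is a minimal sketch: `parse_env_file` and `load_settings` are hypothetical helpers, and the simple KEY=VALUE parsing is an assumption about the file format rather than a description of what config.py actually does.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Merge .env and the environment; environment variables win."""
    environ = os.environ if environ is None else environ
    env_file = Path(environ.get("AGENTPROBE_ENV_FILE", ".env"))
    merged = parse_env_file(env_file)
    # Applied last, so any AGENTPROBE_* variable overrides the file value.
    merged.update({k: v for k, v in environ.items() if k.startswith("AGENTPROBE_")})
    return merged
```

Passing `environ` explicitly makes the precedence easy to check in a unit test without touching the real process environment.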

Prerequisite for the Internal path — the External Client App

The InternalCopilot path needs an External Client App (ECA) configured for the Client Credentials flow. To get its consumer key/secret:

  1. Setup → App Manager (or External Client App Manager).
  2. Find your ECA → row dropdown → View / Manage Consumer Details (you may be asked to verify your identity).
  3. Copy the Consumer Key and Consumer Secret into .env.
  4. Confirm the ECA has Client Credentials enabled, a Run-As user (clientCredentialsFlowUser), and isNamedUserJwtEnabled ON — otherwise the mint returns an opaque token instead of a JWT and the Agent API session endpoint 404s. (agentforce-probe detects this and tells you.)

That's the only step that requires the Salesforce UI. Everything else is CLI.

Usage

doctor — preflight (local + read-only)

agentforce-probe doctor --org my-org

Reports: is sf installed, does the org connect, are External Client Apps present, are ECA secrets + judge keys configured, where is .env. Never spends Einstein credits; secrets shown only as present/absent.

Run an ExternalCopilot agent (Testing Center)

agentforce-probe run \
  --org my-org \
  --agent Support_Concierge \
  --spec examples/specs/Support_Concierge-testSpec.yaml \
  --out support_concierge-evidence.md

Run an InternalCopilot agent (headless Agent API + judge)

agentforce-probe run \
  --org my-org \
  --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md
  # --judge handoff is the default (grade with Claude Code, no API key)

--judge selects the Internal-path judge:

  • handoff (default): no API key needed. Grade with Claude Code via a file-handoff protocol. See Judge via Claude Code.
  • openai:<model> / anthropic:<model>: grade live in one step using a raw LLM API key from .env.
  • mock: offline heuristic (no network), for dry runs / smoke tests.

⚠️ Running a real test (sf agent test run or a live Internal Agent API session) spends Einstein credits. doctor, --dry-run, --from-results, and --from-verdicts are all free / offline.

Judge via Claude Code (no API key needed)

The InternalCopilot path needs an LLM to grade each agent response PASS/FAIL. If your team has Claude Code (or a similar coding agent) open in the editor but no raw LLM API key, use the default handoff judge — a three-step file protocol where Claude Code is the judge runtime and agentforce-probe just defines the contract. No secret ever leaves your machine; the handoff files contain only test data.

Step ① — produce the judge task package (replays the agent; contacts no LLM):

agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md            # --judge handoff is the default

This mints the token, opens the headless Agent API session, sends every utterance, captures response / topic / invokedActions, and writes two files next to --out, then exits:

  • IT_Helpdesk_Assistant-judge-task.json — the grading materials (schema below).
  • IT_Helpdesk_Assistant-JUDGING.md — a block you paste into Claude Code.

Step ② — grade in Claude Code. Open Claude Code in this repo and paste the block from *-JUDGING.md. It instructs Claude Code to read the task package, apply the rubric, and write *-judge-verdicts.json (verdict is strictly PASS/FAIL, one entry per case id, no skips).

Step ③ — collect the verdicts into evidence (offline; no org/LLM call):

agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --from-verdicts IT_Helpdesk_Assistant-judge-verdicts.json \
  --out it_helpdesk-evidence.md

agentforce-probe reads the verdicts back, aligns them to the task package by id, recomputes topic/actions (from the recorded live values + the spec), uses each verdict as the output signal, applies the same assertion-filtering rules, and writes the unified evidence markdown. It validates that every case id has a verdict (missing ids = error) and that each verdict is PASS/FAIL.

Schemas

<agent>-judge-task.json (agentforce-probe/judge-task@1):

{
  "schema": "agentforce-probe/judge-task@1",
  "agent": "IT_Helpdesk_Assistant", "org": "my-org",
  "rubric": "<strict-QA-grader rubric>",
  "instructions": "For each case, decide if actual_response satisfies expected_outcome. Write {id,verdict,reason} for every case into the verdicts file. Do not skip any case.",
  "cases": [
    {"id": 1, "utterance": "...", "expected_outcome": "...",
     "actual_response": "...", "actual_topic": "...", "actual_actions": ["..."]}
  ]
}

<agent>-judge-verdicts.json (agentforce-probe/judge-verdicts@1):

{"schema": "agentforce-probe/judge-verdicts@1",
 "agent": "IT_Helpdesk_Assistant",
 "verdicts": [{"id": 1, "verdict": "PASS", "reason": "..."}]}

All handoff files (*-judge-task.json, *-judge-verdicts.json, *-JUDGING.md) are run artifacts (test data) and are gitignored.
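
The two checks described for Step ③ (every case id has a verdict, and each verdict is PASS or FAIL) can be sketched against the schemas above. The function name `align_verdicts` is hypothetical, and the real tool's error messages and return shape may differ.

```python
from typing import Dict, List

ALLOWED_VERDICTS = {"PASS", "FAIL"}


def align_verdicts(task: dict, verdicts: dict) -> Dict[int, dict]:
    """Return verdicts keyed by case id, or raise ValueError on a contract breach."""
    by_id: Dict[int, dict] = {}
    for entry in verdicts.get("verdicts", []):
        if "id" not in entry:
            raise ValueError("verdict entry is missing its case id")
        if entry.get("verdict") not in ALLOWED_VERDICTS:
            raise ValueError(f"case {entry['id']}: verdict must be PASS or FAIL")
        by_id[entry["id"]] = entry
    # Every case in the task package must have been graded; no skips allowed.
    missing: List[int] = [c["id"] for c in task.get("cases", []) if c["id"] not in by_id]
    if missing:
        raise ValueError(f"missing verdicts for case ids: {missing}")
    return by_id
```

Both arguments are the parsed JSON documents, so a caller would `json.load` the task and verdict files first.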

Test spec format (*.yaml)

name: "My Suite"
subjectType: AGENT
subjectName: IT_Helpdesk_Assistant
testCases:
  - utterance: "..."             # required
    expectedTopic: account_help  # optional (= subagent / topic name)
    expectedActions: [foo, bar]  # optional (Level-2 invocation names)
    expectedOutcome: "..."       # used by the judge; almost always present

See examples/specs/ for complete, runnable examples (using fictional demo data).

Scoring rules (assertion filtering)

  • topic_assertion is scored only if the case declares expectedTopic.
  • actions_assertion is scored only if the case declares expectedActions.
  • output_validation (LLM-as-judge) is the primary behavioral signal and is scored for every case.
  • A dimension with no declared expectation renders as - and never counts against the score.

A topic FAIL with an output PASS usually means the agent behaved correctly even though single-turn routing picked a semantically adjacent topic — look at the primary output signal first.
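
The filtering rules above can be made concrete with a short sketch. Two things here are assumptions rather than documented behavior: actions are compared as "every expected action must appear" (extras tolerated), and `pass_rate` aggregates as the fraction of scored dimensions that passed. The names `score_case` and `pass_rate` are hypothetical.

```python
from typing import Dict, List, Optional

SKIPPED = "-"


def score_case(case: dict, actual_topic: Optional[str],
               actual_actions: List[str], output_verdict: str) -> Dict[str, str]:
    """Score one case; dimensions with no declared expectation render as '-'."""
    result = {"topic_assertion": SKIPPED, "actions_assertion": SKIPPED}
    if "expectedTopic" in case:
        matched = actual_topic == case["expectedTopic"]
        result["topic_assertion"] = "PASS" if matched else "FAIL"
    if "expectedActions" in case:
        # Assumption: every expected action must appear; extra actions are tolerated.
        ok = set(case["expectedActions"]) <= set(actual_actions)
        result["actions_assertion"] = "PASS" if ok else "FAIL"
    result["output_validation"] = output_verdict  # scored for every case
    return result


def pass_rate(results: List[Dict[str, str]]) -> float:
    """Fraction of scored (non '-') dimensions that passed; aggregation is assumed."""
    scored = [v for r in results for v in r.values() if v != SKIPPED]
    return sum(v == "PASS" for v in scored) / len(scored) if scored else 0.0
```

With a case that declares only expectedTopic, a wrong topic plus a PASS verdict yields one FAIL, one PASS, and one skipped dimension, which matches the "topic FAIL with an output PASS" situation described above.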

Module layout

| file | responsibility |
| --- | --- |
| cli.py | argparse entrypoint; dispatches run / doctor |
| config.py | reads secrets from .env / env; never exposes values |
| doctor.py | local + read-only preflight checks |
| agent_meta.py | resolves BotDefinition.Type/Id (Internal vs External) |
| sf_external.py | ExternalCopilot path via sf agent test |
| agent_api.py | InternalCopilot mint + headless Agent API (urllib, token-safe) |
| sf_internal.py | Internal path orchestration (session → judge → score) |
| judge.py | configurable judge: handoff (default) + live openai/anthropic/mock |
| scorer.py | spec loading + assertion-filtering scorer |
| evidence.py | unified evidence markdown generator |
| sfcli.py | sf CLI wrapper + banner-tolerant JSON parsing |
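
"Banner-tolerant JSON parsing" is not spelled out in this README. One plausible reading is that CLI output may carry non-JSON lines before the payload, and the parser skips them. The sketch below shows that idea using the standard library's `json.JSONDecoder.raw_decode`; the function name is hypothetical and this is not necessarily how sfcli.py does it.

```python
import json


def parse_json_with_banner(output: str):
    """Return the first JSON object or array found in mixed CLI output."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(output):
        if char not in "{[":
            continue
        try:
            value, _ = decoder.raw_decode(output, index)
        except json.JSONDecodeError:
            continue  # e.g. a '[' inside a banner line; keep scanning
        return value
    raise ValueError("no JSON document found in CLI output")
```

A known limitation of this approach: a banner fragment that happens to be valid JSON on its own (for example `[1]`) would be returned instead of the real payload.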

InternalCopilot Agent API — gotchas baked in

These are battle-tested; the code enforces them so you don't re-learn them:

  1. Mint: grant_type=client_credentials → {instance}/services/oauth2/token; read access_token and api_instance_url.
  2. Token must be a JWT (~1700 chars, 3 dot segments). An opaque token → 404 → isNamedUserJwtEnabled is off. The tool refuses to proceed on an opaque token.
  3. Host = api_instance_url from the mint response (sandbox/scratch = https://test.api.salesforce.com). Never hardcoded.
  4. Session: POST .../einstein/ai-agent/v1/agents/{0Xx...}/sessions with bypassUser:false (true → 400 "Invalid user ID"). Run-as = the ECA's clientCredentialsFlowUser; no userId in the body.
  5. Message: POST .../sessions/{id}/messages with {"message":{"sequenceId":N,"type":"Text","text":"..."}}, N increments.
  6. Error ladder: 404 empty = wrong host / opaque token; 400 "Invalid user ID" = use bypassUser:false; 412 "Invalid Config" = auth OK but planner config broken (usually an action missing its inputs block).
  7. Bearer hygiene: the auth header is built at runtime from an in-memory variable (never a source literal, never echo'd) to dodge both shell-quoting and log-redaction traps.
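
Steps 5 and 6 above translate fairly directly into code. The sketch below builds the message body exactly as written in step 5 and maps the three documented failures to their likely causes. The function names are hypothetical, no network call is made, and the matching on response text is an assumption about how the error bodies look.

```python
import json


def build_message_body(sequence_id: int, text: str) -> bytes:
    """JSON body for POST .../sessions/{id}/messages, as described in step 5."""
    payload = {"message": {"sequenceId": sequence_id, "type": "Text", "text": text}}
    return json.dumps(payload).encode("utf-8")


def explain_agent_api_error(status: int, body: str) -> str:
    """Map an HTTP failure to the most likely cause from the error ladder."""
    if status == 404 and not body.strip():
        return "wrong host or opaque token (check api_instance_url and isNamedUserJwtEnabled)"
    if status == 400 and "Invalid user ID" in body:
        return "session was opened with bypassUser:true; use bypassUser:false"
    if status == 412 and "Invalid Config" in body:
        return "auth OK but planner config is broken (often an action missing its inputs block)"
    return f"unrecognized Agent API error (HTTP {status})"
```

Keeping the explanation separate from the HTTP call means the ladder can be unit-tested with plain status/body pairs, in line with the mocked-network testing approach described earlier.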

Development

pip install -e ".[dev]"
pre-commit install   # gate every commit/push on the same checks CI runs
pytest               # run the test suite (pure logic; no network, no secrets)
ruff check .         # lint

Pre-commit / pre-push gate

The repo ships a pre-commit config with local hooks (no external hook repos, works offline). After pre-commit install:

  • on every commit — a privacy/hygiene scan (scripts/check-secrets.sh: no secrets, JWTs, org IDs, customer data, agent footprints, or run artifacts) plus ruff check and ruff format --check.
  • on every push — the full pytest suite with the 100% coverage gate.

So a commit that would leak a secret, or a push that would break a test or drop coverage, is blocked locally before it ever reaches GitHub. CI re-runs the same checks, so green-local means green-pipeline.

License

MIT.
