Run automated tests against Salesforce Agentforce agents (External + Internal Copilot) and score them into evidence — fully local, privacy-first, no API key required.
agentforce-probe
A local, privacy-first CLI to run automated tests against Salesforce Agentforce agents — and score the results into evidence.
TL;DR — Salesforce's Testing Center can score your customer-facing agents but silently can't touch your employee-facing ones.
agentforce-probe tests both from one command and hands you a single evidence report. It runs entirely on your machine, sends nothing to third parties, and needs no API key to get started.
Do I need to set anything up? (the 30-second version)
| You're testing… | What you need | headless API / ECA? |
|---|---|---|
| ExternalCopilot (customer/service agents) | just an sf-authenticated org | ❌ none: Testing Center judges for you, zero secrets |
| InternalCopilot (employee agents) | the above + a one-time External Client App (consumer key/secret in .env) | ✅ yes: this is the headless path Testing Center can't do |
| Optional: grade with a live LLM judge | an OpenAI/Anthropic API key | the default judge is a no-key Claude Code handoff |
So: External agents work out of the box. The only real setup is a one-time ECA for the Internal path — and even then the judge needs no API key by default. Full steps are in Configure secrets.
agentforce-probe auto-detects the agent type and picks the right path:
- ExternalCopilot (customer/service agents) → drives sf agent test create/run/results. Salesforce Testing Center provides the LLM judge (output_validation) for you, with no extra setup.
- InternalCopilot (employee agents) → this is the tool's core value. Testing Center cannot run employee agents, so agentforce-probe walks the headless path instead: External Client App → Client Credentials mint (JWT) → Agent API headless session → one message per utterance → a configurable LLM-as-judge scores each response.
Both paths emit one unified evidence markdown report (per case: utterance / topic / agent response / each assertion), using the same assertion-filtering rules.
Why this exists
Salesforce's built-in Testing Center (sf agent test) only runs
ExternalCopilot agents — the customer-facing ones that have a Bot User to
impersonate. InternalCopilot (employee/internal) agents have no run-as Bot
User, so the Testing Center judge never fires and you simply cannot get an
automated test score for them through the supported tooling.
That's a real product gap. agentforce-probe closes it: for Internal agents it
bypasses the Testing Center and drives the headless Agent API directly,
replaying each utterance through a real session and grading the responses with
an LLM-as-judge. One command, one evidence report, both agent types.
Privacy
Everything runs on your machine. The only outbound network calls are:
- to the target Salesforce org (sf CLI + the Agent API), and
- (InternalCopilot path only, and only if you opt into a live API-key judge) to the judge LLM you configure.
No telemetry, no third parties. Secrets (ECA consumer key/secret, judge API key)
are read from a gitignored .env (or env vars), held in memory only, and are
never printed, logged, written to evidence, or passed through a shell. Token
diagnostics only ever expose length + JWT segment count — never bytes.
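As an illustration of the "length + segment count" rule, here is a minimal sketch of a token diagnostic that never returns any token characters. The function name token_shape and the returned keys are hypothetical and are not taken from the agentforce-probe source.

```python
def token_shape(token: str) -> dict:
    """Describe a token's shape without exposing any of its characters."""
    segments = token.split(".") if token else []
    return {
        "length": len(token),
        "segments": len(segments),
        # A JWT has exactly three non-empty dot-separated segments.
        "looks_like_jwt": len(segments) == 3 and all(segments),
    }


if __name__ == "__main__":
    # Safe to print: only counts and a boolean, never token bytes.
    print(token_shape("header.payload.signature"))
```

A diagnostic shaped like this can be logged or shown in an error message without weakening the no-secrets guarantee described above.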
Verification status & known limitations
Read this before you trust a score in anger. The tool is deliberately honest about what has and hasn't been validated.
What's verified
- Logic layer: fully tested. 207 unit tests, 100% line coverage across all
modules. Spec loading, assertion filtering, scoring, evidence rendering, the
judge contract, token-shape validation, and the Agent API error ladder are all
exercised, with the network and the sf CLI mocked.
What's not yet verified (the real gap)
- 100% coverage is not the same as "proven against a live org." Every test
mocks the network and sf. The genuine end-to-end paths (sf agent test against a real External agent, and ECA mint → JWT → live Agent API session against a real Internal agent) have not been re-run against a live Salesforce org in this open-source extraction. The InternalCopilot gotchas baked into the code (opaque-token 404, 412 config errors, bypassUser handling) were learned from real-world use, but treat your first live run as the first true end-to-end validation and sanity-check the evidence by hand.
Known limitations
- Internal path needs a one-time manual UI step. The External Client App must have isNamedUserJwtEnabled on, or the mint returns an opaque token and the session endpoint 404s. The tool detects and reports this, but cannot fix it for you; see the ECA prerequisite below. This is the most common place to get stuck.
- Agent-type detection relies on a live org query (BotDefinition.Type). If your org's metadata shape differs, auto-detection can misfire; override with --force-type internal|external (and --bot-id for the Internal path).
- The handoff judge is an LLM, so verdicts are not perfectly reproducible. Two graders (or the same grader twice) may disagree on a borderline case. The score is a well-evidenced judgment, not a deterministic measurement. Always read the captured agent responses; don't rubber-stamp.
- Single-turn only. Each utterance runs in its own fresh session; the tool does not test multi-turn context or memory.
- endSession is best-effort and silently ignores failures, so an unreachable org could leave a dangling session server-side (low risk, no effect on the score).
- --from-results accepts External-shaped payloads only (offline re-scoring of sf agent test results); there's no offline replay for the Internal path.
Install
From PyPI:
pip install agentforce-probe
This installs the agentforce-probe console command. You can also run it as a
module:
agentforce-probe --help
python3 -m agentforce_probe --help
The only runtime dependency is pyyaml.
To install from source instead (e.g. to track main or hack on it):
git clone https://github.com/raykuonz/agentforce-probe
cd agentforce-probe
pip install -e .
Install the Claude skill (no CLI needed)
This repo ships a Claude Code
skill (probe-agentforce-agents) that teaches an agent when and how to drive
agentforce-probe. Install it into your agent in one command with
vercel-labs/skills — no clone, no
install, just npx:
# Preview the skill without installing
npx skills add raykuonz/agentforce-probe --list
# Install it globally into Claude Code
npx skills add raykuonz/agentforce-probe -g -a claude-code -y
It also works with Cursor, Codex, OpenCode, and
50+ other agents — drop
the -a claude-code flag to pick interactively. The skill assumes the
agentforce-probe CLI is installed (see above).
Configure secrets (.env)
Only the InternalCopilot path needs secrets. Copy the template into the
directory you run agentforce-probe from and fill it in (the file is
gitignored):
cp .env.example .env
# then edit .env:
# AGENTPROBE_SF_CONSUMER_KEY=... (Internal path: ECA consumer key)
# AGENTPROBE_SF_CONSUMER_SECRET=... (Internal path: ECA consumer secret)
# AGENTPROBE_ANTHROPIC_API_KEY=... (only if you use a live API-key judge)
# AGENTPROBE_OPENAI_API_KEY=... (only if you use a live API-key judge)
Environment variables take precedence over .env. The ExternalCopilot path
needs none of these (Testing Center judges for you). You can also point at a
specific file with AGENTPROBE_ENV_FILE=/path/to/.env.
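The precedence rules above can be sketched as follows. This is an illustrative stdlib-only example, not the project's config.py: the helper names parse_env_text, find_env_file, and resolve are hypothetical, and the quote-stripping behavior is an assumption.

```python
from pathlib import Path
from typing import Mapping, Optional


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: surrounding quotes are optional and stripped.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_env_file(environ: Mapping[str, str], cwd: str) -> Path:
    """AGENTPROBE_ENV_FILE wins; otherwise use .env in the working directory."""
    override = environ.get("AGENTPROBE_ENV_FILE")
    return Path(override) if override else Path(cwd) / ".env"


def resolve(name: str, environ: Mapping[str, str], file_values: Mapping[str, str]) -> Optional[str]:
    """Environment variables take precedence over values from .env."""
    return environ.get(name) or file_values.get(name)
```

Keeping the lookup as a pure function of two mappings makes the precedence easy to unit-test without touching the real environment or a real secrets file.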
Prerequisite for the Internal path — the External Client App
The InternalCopilot path needs an External Client App (ECA) configured for the Client Credentials flow. To get its consumer key/secret:
- Setup → App Manager (or External Client App Manager).
- Find your ECA → row dropdown → View / Manage Consumer Details (you may be asked to verify your identity).
- Copy the Consumer Key and Consumer Secret into .env.
- Confirm the ECA has Client Credentials enabled, a Run-As user (clientCredentialsFlowUser), and isNamedUserJwtEnabled ON; otherwise the mint returns an opaque token instead of a JWT and the Agent API session endpoint 404s. (agentforce-probe detects this and tells you.)
That's the only step that requires the Salesforce UI. Everything else is CLI.
Usage
doctor — preflight (local + read-only)
agentforce-probe doctor --org my-org
Reports: is sf installed, does the org connect, are External Client Apps
present, are ECA secrets + judge keys configured, where is .env. Never spends
Einstein credits; secrets shown only as present/absent.
Run an ExternalCopilot agent (Testing Center)
agentforce-probe run \
--org my-org \
--agent Support_Concierge \
--spec examples/specs/Support_Concierge-testSpec.yaml \
--out support_concierge-evidence.md
Run an InternalCopilot agent (headless Agent API + judge)
agentforce-probe run \
--org my-org \
--agent IT_Helpdesk_Assistant \
--spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
--out it_helpdesk-evidence.md
# --judge handoff is the default (grade with Claude Code, no API key)
--judge selects the Internal-path judge:
- handoff (default): no API key needed. Grade with Claude Code via a file-handoff protocol. See Judge via Claude Code.
- openai:<model> / anthropic:<model>: grade live in one step using a raw LLM API key from .env.
- mock: offline heuristic (no network), for dry runs / smoke tests.
⚠️ Running a real test (sf agent test run or a live Internal Agent API session) spends Einstein credits. doctor, --dry-run, --from-results, and --from-verdicts are all free / offline.
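To make the accepted --judge values concrete, here is a small sketch of how such a value could be split into a judge kind and an optional model name. The function parse_judge is hypothetical; the real CLI may parse or validate this flag differently.

```python
from typing import Optional, Tuple

LIVE_PROVIDERS = ("openai", "anthropic")


def parse_judge(value: str = "handoff") -> Tuple[str, Optional[str]]:
    """Return (kind, model); model is None for handoff and mock."""
    if value in ("handoff", "mock"):
        return value, None
    provider, sep, model = value.partition(":")
    if provider in LIVE_PROVIDERS and sep and model:
        return provider, model
    raise ValueError(
        "judge must be handoff, mock, openai:<model>, or anthropic:<model>"
    )
```

Only the two live providers would need an API key from .env; handoff and mock never require one.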
Judge via Claude Code (no API key needed)
The InternalCopilot path needs an LLM to grade each agent response PASS/FAIL.
If your team has Claude Code (or a similar coding agent) open in the editor
but no raw LLM API key, use the default handoff judge — a three-step file
protocol where Claude Code is the judge runtime and agentforce-probe just
defines the contract. No secret ever leaves your machine; the handoff files
contain only test data.
Step ① — produce the judge task package (replays the agent; contacts no LLM):
agentforce-probe run \
--org my-org --agent IT_Helpdesk_Assistant \
--spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
--out it_helpdesk-evidence.md # --judge handoff is the default
This mints the token, opens the headless Agent API session, sends every
utterance, captures response / topic / invokedActions, and writes two
files next to --out, then exits:
- IT_Helpdesk_Assistant-judge-task.json: the grading materials (schema below).
- IT_Helpdesk_Assistant-JUDGING.md: a block you paste into Claude Code.
Step ② — grade in Claude Code. Open Claude Code in this repo and paste the
block from *-JUDGING.md. It instructs Claude Code to read the task package,
apply the rubric, and write *-judge-verdicts.json (verdict is strictly
PASS/FAIL, one entry per case id, no skips).
Step ③ — collect the verdicts into evidence (offline; no org/LLM call):
agentforce-probe run \
--org my-org --agent IT_Helpdesk_Assistant \
--spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
--from-verdicts IT_Helpdesk_Assistant-judge-verdicts.json \
--out it_helpdesk-evidence.md
agentforce-probe reads the verdicts back, aligns them to the task package by
id, recomputes topic/actions (from the recorded live values + the spec), uses
each verdict as the output signal, applies the same assertion-filtering
rules, and writes the unified evidence markdown. It validates that every case id
has a verdict (missing ids = error) and that each verdict is PASS/FAIL.
Schemas
<agent>-judge-task.json (agentforce-probe/judge-task@1):
{
"schema": "agentforce-probe/judge-task@1",
"agent": "IT_Helpdesk_Assistant", "org": "my-org",
"rubric": "<strict-QA-grader rubric>",
"instructions": "For each case, decide if actual_response satisfies expected_outcome. Write {id,verdict,reason} for every case into the verdicts file. Do not skip any case.",
"cases": [
{"id": 1, "utterance": "...", "expected_outcome": "...",
"actual_response": "...", "actual_topic": "...", "actual_actions": ["..."]}
]
}
<agent>-judge-verdicts.json (agentforce-probe/judge-verdicts@1):
{"schema": "agentforce-probe/judge-verdicts@1",
"agent": "IT_Helpdesk_Assistant",
"verdicts": [{"id": 1, "verdict": "PASS", "reason": "..."}]}
All handoff files (*-judge-task.json, *-judge-verdicts.json, *-JUDGING.md)
are run artifacts (test data) and are gitignored.
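The verdict checks described in step ③ (every case id has a verdict, each verdict is PASS or FAIL) can be sketched against the two schemas above. The function name align_verdicts and the error messages are hypothetical; this is not the project's actual implementation.

```python
VALID_VERDICTS = {"PASS", "FAIL"}


def align_verdicts(task: dict, verdicts_doc: dict) -> dict:
    """Map each case id in the task package to its PASS/FAIL verdict."""
    by_id = {}
    for entry in verdicts_doc.get("verdicts", []):
        verdict = entry.get("verdict")
        if verdict not in VALID_VERDICTS:
            raise ValueError(
                f"case {entry.get('id')}: verdict must be PASS or FAIL, got {verdict!r}"
            )
        by_id[entry.get("id")] = verdict

    # Every case in the task package needs a verdict; no skips allowed.
    missing = [case["id"] for case in task["cases"] if case["id"] not in by_id]
    if missing:
        raise ValueError(f"missing verdicts for case ids: {missing}")

    return {case["id"]: by_id[case["id"]] for case in task["cases"]}
```

Aligning by id instead of by list position means a grader can write the verdicts in any order without corrupting the evidence report.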
Test spec format (*.yaml)
name: "My Suite"
subjectType: AGENT
subjectName: IT_Helpdesk_Assistant
testCases:
- utterance: "..." # required
expectedTopic: account_help # optional (= subagent / topic name)
expectedActions: [foo, bar] # optional (Level-2 invocation names)
expectedOutcome: "..." # used by the judge; almost always present
See examples/specs/ for complete, runnable examples (using
fictional demo data).
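As a rough sketch of the spec shape above, the following checks a spec that has already been parsed into a dict (for example by pyyaml's safe_load). The function check_test_cases is hypothetical, and the rule that expectedActions must be a list is an assumption drawn from the example, not a documented constraint.

```python
def check_test_cases(spec: dict) -> list:
    """Return a list of human-readable problems; empty means the spec looks usable."""
    problems = []
    cases = spec.get("testCases") or []
    if not cases:
        problems.append("testCases is empty or missing")
    for index, case in enumerate(cases, start=1):
        utterance = case.get("utterance")
        # utterance is the only field the format marks as required.
        if not isinstance(utterance, str) or not utterance.strip():
            problems.append(f"case {index}: utterance is required")
        actions = case.get("expectedActions")
        # Assumption: expectedActions, when present, is a list of names.
        if actions is not None and not isinstance(actions, list):
            problems.append(f"case {index}: expectedActions must be a list")
    return problems
```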
Scoring rules (assertion filtering)
- topic_assertion is scored only if the case declares expectedTopic.
- actions_assertion is scored only if the case declares expectedActions.
- output_validation (LLM-as-judge) is the primary behavioral signal and is scored for every case.
- A dimension with no declared expectation renders as - and never counts against the score.

A topic FAIL with an output PASS usually means the agent behaved correctly even though single-turn routing picked a semantically adjacent topic; look at the primary output signal first.
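The filtering rules above can be illustrated with a minimal sketch. The names score_case and tally are hypothetical, and the way actions are compared (every expected action must appear among the invoked ones) is an assumption; the real scorer may compare differently.

```python
def score_case(case: dict, actual_topic: str, actual_actions: list, output_verdict: str) -> dict:
    """Score one case; undeclared dimensions render as '-'."""
    result = {"topic": "-", "actions": "-", "output": output_verdict}
    if case.get("expectedTopic") is not None:
        result["topic"] = "PASS" if actual_topic == case["expectedTopic"] else "FAIL"
    if case.get("expectedActions") is not None:
        # Assumption: all expected actions must be among the invoked actions.
        ok = set(case["expectedActions"]) <= set(actual_actions)
        result["actions"] = "PASS" if ok else "FAIL"
    return result


def tally(results: list) -> tuple:
    """Count (passed, scored); '-' dimensions never count against the score."""
    scored = [v for r in results for v in r.values() if v != "-"]
    return sum(v == "PASS" for v in scored), len(scored)
```

For a case that declares only expectedTopic, a wrong topic with a passing judge verdict yields one FAIL and one PASS out of two scored dimensions, with actions left as "-".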
Module layout
| file | responsibility |
|---|---|
| cli.py | argparse entrypoint; dispatches run / doctor |
| config.py | reads secrets from .env / env; never exposes values |
| doctor.py | local + read-only preflight checks |
| agent_meta.py | resolves BotDefinition.Type/Id (Internal vs External) |
| sf_external.py | ExternalCopilot path via sf agent test |
| agent_api.py | InternalCopilot mint + headless Agent API (urllib, token-safe) |
| sf_internal.py | Internal path orchestration (session → judge → score) |
| judge.py | configurable judge: handoff (default) + live openai/anthropic/mock |
| scorer.py | spec loading + assertion-filtering scorer |
| evidence.py | unified evidence markdown generator |
| sfcli.py | sf CLI wrapper + banner-tolerant JSON parsing |
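"Banner-tolerant JSON parsing" is not spelled out here, so the following is only one plausible approach, not necessarily what sfcli.py does: scan the CLI output for the first position where a JSON object decodes cleanly, ignoring any warning or banner text around it. The function name parse_json_with_banner is hypothetical.

```python
import json


def parse_json_with_banner(stdout: str) -> dict:
    """Return the first JSON object embedded in otherwise noisy CLI output."""
    decoder = json.JSONDecoder()
    start = stdout.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            value, _end = decoder.raw_decode(stdout, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = stdout.find("{", start + 1)
    raise ValueError("no JSON object found in CLI output")
```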
InternalCopilot Agent API — gotchas baked in
These are battle-tested; the code enforces them so you don't re-learn them:
- Mint: grant_type=client_credentials → {instance}/services/oauth2/token; read access_token and api_instance_url.
- Token must be a JWT (~1700 chars, 3 dot segments). An opaque token → 404 → isNamedUserJwtEnabled is off. The tool refuses to proceed on an opaque token.
- Host = api_instance_url from the mint response (sandbox/scratch = https://test.api.salesforce.com). Never hardcoded.
- Session: POST .../einstein/ai-agent/v1/agents/{0Xx...}/sessions with bypassUser:false (true → 400 "Invalid user ID"). Run-as = the ECA's clientCredentialsFlowUser; no userId in the body.
- Message: POST .../sessions/{id}/messages with {"message":{"sequenceId":N,"type":"Text","text":"..."}}, N increments.
- Error ladder: 404 empty = wrong host / opaque token; 400 "Invalid user ID" = use bypassUser:false; 412 "Invalid Config" = auth OK but planner config broken (usually an action missing its inputs block).
- Bearer hygiene: the auth header is built at runtime from an in-memory variable (never a source literal, never echo'd) to dodge both shell-quoting and log-redaction traps.
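Two of the items above, the message body and the error ladder, can be sketched offline without any network call. The function names message_body and diagnose are hypothetical, and the hint strings are paraphrases of the list above, not output copied from the tool.

```python
import json


def message_body(sequence_id: int, text: str) -> bytes:
    """Build the JSON body for POST .../sessions/{id}/messages."""
    payload = {"message": {"sequenceId": sequence_id, "type": "Text", "text": text}}
    return json.dumps(payload).encode("utf-8")


def diagnose(status: int, body: str) -> str:
    """Map an Agent API error response to a likely cause."""
    if status == 404 and not body.strip():
        return "wrong host or opaque token (check api_instance_url and isNamedUserJwtEnabled)"
    if status == 400 and "Invalid user ID" in body:
        return "send bypassUser:false in the session request"
    if status == 412 and "Invalid Config" in body:
        return "auth OK but planner config broken (often an action missing its inputs block)"
    return f"unrecognized error (HTTP {status})"
```

Checking both the status code and the body text matters because the same status can have other causes; anything outside the known ladder falls through to the generic message.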
Development
pip install -e ".[dev]"
pre-commit install # gate every commit/push on the same checks CI runs
pytest # run the test suite (pure logic; no network, no secrets)
ruff check . # lint
Pre-commit / pre-push gate
The repo ships a pre-commit config with local
hooks (no external hook repos, works offline). After pre-commit install:
- on every commit: a privacy/hygiene scan (scripts/check-secrets.sh: no secrets, JWTs, org IDs, customer data, agent footprints, or run artifacts) plus ruff check and ruff format --check.
- on every push: the full pytest suite with the 100% coverage gate.
So a commit that would leak a secret, or a push that would break a test or drop coverage, is blocked locally before it ever reaches GitHub. CI re-runs the same checks, so green-local means green-pipeline.
License
MIT.