Declarative Linux PMU observation on top of perf stat
perf-skill
perf-skill is a Linux CLI that turns a short declarative statement into a
ready-to-run perf stat session. It resolves the target process, enables sane
defaults for PMU collection, and streams a small terminal dashboard with IPC and
recent history charts.
What it does
- Parses statements such as trace comm=python pid=4242 inst cycles
- Resolves the target process from pid, comm, or both
- Expands event aliases such as inst -> instructions
- Always injects instructions and cycles so IPC can be derived alongside any extra events
- Auto-completes missing paired counters such as branches + branch-misses and cache-references + cache-misses
- Auto-groups related events into perf groups so IPC, branch, and cache counters stay aligned
- Auto-splits groups against a PMU slot limit, with local hardware hints and vendor fallbacks
- Automatically retries with smaller groups when perf reports retryable grouped-event failures
- Starts perf stat with interval sampling and parses the live CSV output
- Can switch to perf record and write a renamed .data artifact when requested
- Can parse an existing .data file via perf script -i
- Can auto-clone Brendan Gregg's FlameGraph repository and render FlameGraph SVGs from recorded .data files
- Can continue .data analysis with perf report --stdio and perf annotate --stdio
- Can launch stress-ng or ab through the exercise subcommand and observe either the resolved target or the load process itself during the run
- Can generate Python-side summaries with trends, miss ratios, expert diagnosis, and top perf.data hotspots
- Can export those summaries as structured JSON for later automation
- Can export time-series samples as CSV and stacked SVG charts
- Renders a rolling terminal dashboard with current counters and ASCII charts
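To make the "parses the live CSV output" item concrete, here is a minimal sketch of reading one line of perf stat interval output in CSV mode (perf stat -I <ms> -x,). The field order follows the CSV format described in the perf-stat man page (timestamp, counter value, unit, event name, then run time and percentage). The names StatSample and parse_stat_line are hypothetical and are not taken from perf-skill's source, whose parser may work differently.

```python
import csv
from typing import NamedTuple, Optional


class StatSample(NamedTuple):
    timestamp: float        # seconds since perf stat started
    event: str
    value: Optional[float]  # None for "<not counted>" / "<not supported>"


def parse_stat_line(line: str) -> Optional[StatSample]:
    """Parse one `perf stat -I ... -x,` line; return None for non-data lines."""
    fields = next(csv.reader([line]), [])
    if len(fields) < 4:
        return None  # comment lines, blank lines, truncated output
    ts, raw_value, _unit, event = fields[:4]
    try:
        timestamp = float(ts)
    except ValueError:
        return None
    try:
        value = float(raw_value)
    except ValueError:
        value = None  # perf printed a placeholder instead of a number
    return StatSample(timestamp, event, value)


sample = parse_stat_line("1.000964533,1203456,,instructions,1000123456,100.00,,")
print(sample)
```

Keeping placeholder values as None, rather than dropping the line, lets a caller notice that a counter was not scheduled in that interval.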
Quick start
python -m venv .venv
source .venv/bin/activate
pip install .
perf-skill observe "trace comm=python pid=4242 inst cycles"
For editable local development, use pip install -e .[dev] instead.
Use --dry-run first if you want a simulated preview of the resolved request and
generated perf command without attaching to the process. This preview is
implemented by perf-skill; native perf does not provide a --dry-run
option.
perf-skill observe "trace python 4242 inst" --dry-run
By default, the CLI uses --group-mode auto and emits perf stat -e groups such
as {instructions,cycles,cache-misses} or
{instructions,cycles},{branches,branch-misses}. This keeps related counters in
the same perf group without forcing everything into one oversized event set.
You can stop a live run after either a fixed sample count or a fixed duration:
perf-skill observe "trace pid=4242 inst cycles" --samples 10 --plain
perf-skill observe "trace pid=4242 inst cycles" --seconds 5 --plain
The statement parser also understands short natural-language hints such as
for 5 seconds, 10 samples, 10秒, 采样10次, 持续 30 秒, or
采 20 个样本.
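As an illustration of how such hints could be recognized, the sketch below extracts a duration and a sample count with two regular expressions. It covers only the six hint forms listed above; parse_limits and both patterns are hypothetical and are not the parser that ships with perf-skill.

```python
import re
from typing import Optional, Tuple

# "5 seconds", "5 sec", "10s", "10秒", "持续 30 秒"
_SECONDS = re.compile(r"(\d+)\s*(?:seconds?|secs?|s(?![a-z])|秒)", re.IGNORECASE)
# "10 samples", "采样10次", "采 20 个样本"
_SAMPLES = re.compile(r"(\d+)\s*samples?|采样?\s*(\d+)\s*(?:次|个样本)", re.IGNORECASE)


def parse_limits(statement: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (seconds, samples); either value is None when no hint is present."""
    sec = _SECONDS.search(statement)
    smp = _SAMPLES.search(statement)
    seconds = int(sec.group(1)) if sec else None
    samples = int(smp.group(1) or smp.group(2)) if smp else None
    return seconds, samples


print(parse_limits("trace pid=4242 inst cycles for 5 seconds"))
print(parse_limits("追踪 node 持续 30 秒,采 20 个样本"))
```

The negative lookahead after the bare s keeps "10 samples" from being read as a duration.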
If the statement asks to generate an image or chart, the CLI also enables SVG
export automatically and picks a default path under out/, for example:
perf-skill observe "探测20秒node的cycles并生成图像"
perf-skill observe "生成10s内node的branchs的图像"
If the statement asks for perf.data, the CLI switches to perf record and
auto-picks a renamed output path like
out/node_targetpid4242_cycles_data_20260519T120000.data when you did not pass
--data-out explicitly:
perf-skill observe "追踪 node 的 cycles 并输出 perf.data" --seconds 10
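The example path above suggests a naming pattern of comm, target PID, event, and a compact timestamp. The sketch below reproduces that pattern as inferred from the single example; the function name default_data_path and the use of "-" to join several events are assumptions, not documented behavior.

```python
from datetime import datetime
from pathlib import Path
from typing import Sequence


def default_data_path(comm: str, pid: int, events: Sequence[str],
                      now: datetime, out_dir: str = "out") -> Path:
    """Build a path shaped like the README example (pattern inferred, not confirmed)."""
    stamp = now.strftime("%Y%m%dT%H%M%S")
    event_part = "-".join(events)  # assumption: how multiple events are joined
    return Path(out_dir) / f"{comm}_targetpid{pid}_{event_part}_data_{stamp}.data"


print(default_data_path("node", 4242, ["cycles"], datetime(2026, 5, 19, 12, 0, 0)))
```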
If the statement asks for a FlameGraph, or if you pass --flamegraph-out, the
CLI also switches to perf record -g, bootstraps
https://github.com/brendangregg/FlameGraph.git under
~/.openclaw/perf-skill/FlameGraph on first use, and writes a FlameGraph SVG:
perf-skill observe "追踪 node 的 cycles 并生成火焰图" --seconds 10
perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --flamegraph-out out/node-flamegraph.svg
If you want to parse an existing .data file, the CLI can proxy perf script:
perf-skill observe "解析 out/node_targetpid4242_cycles_data_20260519T120000.data"
perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data
For a second analysis step after the Python-side .data summary, append
perf report --stdio output with --report-stdio, or run perf annotate --stdio
on the hottest parsed symbol with --annotate-top (or on a named symbol with
--annotate-symbol):
perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --summary --report-stdio
perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --annotate-top
perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --annotate-symbol 'v8::Function+0x10'
If you want analysis that goes beyond a single perf command, enable the
Python summary layer. It can compute post-run averages, peaks, trends, derived
ratios such as branch-miss rate and cache-miss rate, automatically flag anomaly
points such as sudden IPC drops or miss-rate surges, turn those signals into
expert diagnosis plus next-step recommendations, and summarize .data
artifacts by top events, threads, callchains, comms, symbols, and ranked hotspots:
perf-skill observe "trace pid=4242 inst cycles" --summary
perf-skill observe "trace pid=4242 branches branch-misses" --summary-out out/summary.json
perf-skill observe "解析 out/node_targetpid4242_cycles_data_20260519T120000.data" --summary
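The derived ratios mentioned above are simple quotients of paired counters. The following sketch shows the arithmetic under the usual definitions (IPC = instructions / cycles, miss rate = misses / total). The function names are illustrative, and the summary layer may compute or name these values differently.

```python
from typing import Dict, Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None when either side is missing or zero."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


def derive_ratios(counters: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Compute IPC and miss rates from one interval's raw counter values."""
    get = counters.get
    return {
        "ipc": safe_ratio(get("instructions"), get("cycles")),
        "branch_miss_rate": safe_ratio(get("branch-misses"), get("branches")),
        "cache_miss_rate": safe_ratio(get("cache-misses"), get("cache-references")),
    }


print(derive_ratios({"instructions": 2000, "cycles": 1000,
                     "branches": 400, "branch-misses": 20}))
```

This is also why the paired counters matter: without branches or cache-references there is no denominator, and the corresponding rate stays None.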
For live observations, anomaly lines are emitted in the summary with the sample
timestamp where the deviation was detected, and the summary now also emits
insight and next-step lines so the result is easier to act on.
Plain and dashboard output now also mark those anomalies as they happen.
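One simple way to flag a sudden IPC drop is to compare each sample against the mean of the preceding window. The sketch below does that. The window size, the 0.7 threshold, and the function name are assumptions chosen for illustration; the detection rule perf-skill uses is not described here.

```python
from collections import deque
from typing import Iterable, List, Tuple


def find_ipc_drops(samples: Iterable[Tuple[float, float]],
                   window: int = 5,
                   drop_ratio: float = 0.7) -> List[Tuple[float, float, float]]:
    """Return (timestamp, ipc, baseline) for samples below drop_ratio * trailing mean."""
    history: deque = deque(maxlen=window)
    anomalies = []
    for timestamp, ipc in samples:
        if len(history) == window:  # only judge once a full baseline exists
            baseline = sum(history) / window
            if ipc < baseline * drop_ratio:
                anomalies.append((timestamp, ipc, baseline))
        history.append(ipc)
    return anomalies


series = [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (4.0, 0.5), (5.0, 2.0)]
print(find_ipc_drops(series, window=3))
```

Because the check runs per sample as it arrives, the same rule can mark anomalies during a live run and again in the post-run summary.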
For .data parsing, thread-level aggregation is shown as top-thread entries
keyed by comm pid/tid, top-callchain entries summarize stacked perf script
frames, top-callchain[event] further breaks those stacks down per event such
as cycles or sched:sched_switch, and hotspot lines highlight the symbols
with the largest sample share. The dashboard also keeps an alert summary with a
total anomaly count, a recent time-window count, and the last alert timestamp.
If you want novice-friendly metaphors or term translation, keep that in the AI
layer or skill response instead of expecting the CLI summary itself to render it.
If you want a single command for load generation plus observation, use the new
exercise subcommand. It launches stress-ng or ab, then observes either the
resolved target or, when no target is given, the load process itself. The final
output includes both the load-tool result and the perf summary:
perf-skill exercise stress-ng --load-args "--cpu 4 --timeout 10" --summary
perf-skill exercise ab "trace comm=nginx cache-misses" --load-args "-n 1000 -c 50 http://127.0.0.1:8080/" --summary
exercise is best for load generation + live perf stat. If you really want
hotspots, symbols, FlameGraphs, whole-machine observation, page-fault ranking,
or discovery before choosing a target, do not force the whole workflow into the
Python CLI. It is usually better to use pgrep, ps, free, vmstat,
smem, or native perf for the discovery and recording steps, then hand the
resolved target or recorded .data artifact back to perf-skill for summary,
reporting, or rendering.
AI/agent routing quick reference
This table describes how an AI or human operator should route the workflow. It
is broader than the narrow statement parser. Use perf-skill when it fits, and
use shell plus native perf when the workflow needs discovery, system-wide
scope, or a recording step that the current CLI does not cover end-to-end.
| User intent | Preferred path | Keep outside the Python CLI |
|---|---|---|
| Known target, counters, IPC, or summary | perf-skill observe | Nothing special |
| Known target under generated load, live counters only | perf-skill exercise | Let the AI decide the stress-ng or ab arguments |
| Hotspots, symbols, callchains, FlameGraphs, or hotspot-style images | perf record -g + perf-skill observe --data-in | Recording and multi-step orchestration |
| Whole-machine branch, cache, or page-fault observation | native perf ... -a | System-wide scope |
| Find the process with the most page faults | native perf discovery first, then perf-skill summary | Offender discovery |
| Memory stays high and no target is known yet | free, vmstat, ps, smem, then perf stat or perf record -g | Baseline and target discovery |
| One comm matches multiple PIDs | pgrep -ax or ps -C first, then ask the user | PID disambiguation |
Scenario-driven workflows
These workflows are meant for both humans and agents. The goal is to keep the workflow realistic instead of pretending every request should be compressed into one parser statement.
- Generate CPU50 load, then probe node branch for 10s and tell me the result
  This is a load + known process + summary counters flow. Resolve the node PID first. If there is only one match, exercise is the simplest path.
    pgrep -x node
    perf-skill exercise stress-ng "trace pid=4242 branches branch-misses" \
      --load-args "--cpu 1 --cpu-load 50 --timeout 10" \
      --seconds 10 --summary
- Generate CPU50 load, then probe node branch for 10s and tell me hotspots or symbols
  This is a load + known process + hotspots flow, so switch to perf record -g. exercise does not cover that chain today. Start the load with shell, record with native perf, then hand the .data file back to perf-skill for summary, perf report --stdio, or perf annotate.
    pgrep -x node
    stress-ng --cpu 1 --cpu-load 50 --timeout 12 &
    perf record -g -o out/node-branches.data -e branches,branch-misses -p 4242 -- sleep 10
    perf-skill observe --data-in out/node-branches.data --summary --report-stdio
    perf-skill observe --data-in out/node-branches.data --annotate-top
- Generate CPU50 load, then probe node branch for 10s and generate an image
  Decide whether the user wants a trend chart or a hotspot picture. For branch trends, export a timeline SVG. For hotspot pictures, record .data and emit a FlameGraph.
  Trend chart:
    pgrep -x node
    perf-skill exercise stress-ng "trace pid=4242 branches branch-misses" \
      --load-args "--cpu 1 --cpu-load 50 --timeout 10" \
      --seconds 10 --svg-out out/node-branches.svg
  FlameGraph:
    stress-ng --cpu 1 --cpu-load 50 --timeout 12 &
    perf record -g -o out/node-branches.data -e branches,branch-misses -p 4242 -- sleep 10
    perf-skill observe --data-in out/node-branches.data --flamegraph-out out/node-branches-flamegraph.svg
- Generate CPU50 load, then probe whole-machine branch for 10s
  This is system-wide observation. Do not guess a PID first.
    stress-ng --cpu 1 --cpu-load 50 --timeout 12 &
    perf stat -a -e branches,branch-misses -- sleep 10
- Find the program with the most page faults
  This is offender discovery. Run a short system-wide recording first, then use the Python summary to inspect top-comm, top-thread, and hotspots.
    perf record -a -g -o out/pagefaults.data -e page-faults -- sleep 10
    perf-skill observe --data-in out/pagefaults.data --summary
- Memory usage stays high. How should I test it?
  Start with a baseline before deciding whether this is even a perf question. Separate whole-machine memory pressure, swap activity, page-fault pressure, and a single process with outsized RSS. Only then choose perf stat or perf record -g.
    free -h
    vmstat 1 5
    ps -eo pid,comm,rss,%mem --sort=-rss | head
    smem -rk | head
    perf stat -p 4242 -e page-faults,minor-faults,major-faults -- sleep 10
- I said node, but the machine has multiple node processes
  Do not let the CLI fail on ambiguity and do not pick a PID silently. Show the candidates and ask once.
    pgrep -ax node
    ps -C node -o pid,cmd
- I already have perf.data. Continue with hotspots, FlameGraph, or symbols
  This is a .data second-hop analysis flow. Reuse the artifact instead of attaching again.
    perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --summary --report-stdio
    perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --annotate-top
    perf-skill observe --data-in out/node_targetpid4242_cycles_data_20260519T120000.data --flamegraph-out out/node-flamegraph.svg
- The server feels slow, but I do not yet know which process to inspect
  This is target discovery. Inspect CPU, memory, and listeners first. Only attach once the target is clear.
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head
    ps -eo pid,comm,rss,%mem --sort=-rss | head
    ss -lntp | head
    perf-skill events branch
If you only want to inspect available events, use perf list through the CLI:
perf-skill events
perf-skill events cache
Use --group-mode off if you want the raw ungrouped event list, or
--group-mode always if you want every event list chunked into groups.
Use --pmu-slots auto to keep the default hardware grouping budget at 4
slots. Cache and branch families stay grouped when possible, while software
events such as cpu-clock and tracepoints such as sched:sched_switch do not
consume those hardware slots. You can still override the budget with an
explicit integer such as --pmu-slots 4.
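To illustrate the slot budget, the sketch below chunks hardware events into groups of at most four and leaves software events and tracepoints ungrouped, then renders the result in perf's -e group syntax. It is deliberately naive: it does not keep counter families together, it treats any name containing ":" as a tracepoint (so modifiers such as cycles:u would need extra handling), and SOFTWARE_EVENTS is a small hand-picked set. All names here are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a short, non-exhaustive list of perf software events.
SOFTWARE_EVENTS = {"cpu-clock", "task-clock", "page-faults", "context-switches"}


def uses_hw_slot(event: str) -> bool:
    """Software events and tracepoints (subsystem:name) do not occupy PMU counters."""
    return event not in SOFTWARE_EVENTS and ":" not in event


def split_by_slots(events: Sequence[str], slots: int = 4) -> Tuple[List[List[str]], List[str]]:
    """Chunk hardware events into groups of at most `slots`; return (groups, ungrouped)."""
    groups: List[List[str]] = []
    current: List[str] = []
    ungrouped: List[str] = []
    for event in events:
        if not uses_hw_slot(event):
            ungrouped.append(event)
            continue
        current.append(event)
        if len(current) == slots:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups, ungrouped


def to_event_arg(groups: List[List[str]], ungrouped: List[str]) -> str:
    """Render the value for perf's -e option, using {a,b} for each group."""
    return ",".join(["{" + ",".join(g) + "}" for g in groups] + list(ungrouped))


groups, loose = split_by_slots(
    ["instructions", "cycles", "branches", "branch-misses",
     "cache-references", "cache-misses", "cpu-clock", "sched:sched_switch"])
print(to_event_arg(groups, loose))
```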
The parser also accepts these event names directly in natural-language statements, for example:
perf-skill observe "trace node cpu-clock sched:sched_switch" --plain
perf-skill observe "追踪 node 的 cpu-clock 和 sched:sched_switch" --plain
If grouped collection fails with retryable perf diagnostics such as
<not counted> or grouped counter scheduling errors, the CLI now retries with
smaller pmu-slots values and finally falls back to ungrouped collection unless
you disable that behavior with --no-group-retry. Successful groups keep their
current layout while only the failing group is split further.
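The retry behaviour can be pictured as follows: groups that perf counted are kept as they are, and only a failing group is split and tried again until single events remain. This sketch halves the failing group on each retry, which is an assumption; the text above only says that smaller pmu-slots values are tried. run_group is a hypothetical callback standing in for one perf stat attempt.

```python
from typing import Callable, List


def retry_groups(groups: List[List[str]],
                 run_group: Callable[[List[str]], bool]) -> List[List[str]]:
    """Split only the groups that fail; groups that succeed keep their layout."""
    final: List[List[str]] = []
    pending = [list(g) for g in groups]
    while pending:
        group = pending.pop(0)
        if run_group(group) or len(group) == 1:
            # Counted fine, or already a single event (the ungrouped fallback).
            final.append(group)
        else:
            mid = len(group) // 2
            pending[:0] = [group[:mid], group[mid:]]  # retry the halves next, in order
    return final


# Fake runner for demonstration: pretend the PMU can only schedule two events at once.
layout = retry_groups([["a", "b", "c", "d"], ["e", "f"]], lambda g: len(g) <= 2)
print(layout)
```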
You can inspect the full CLI reference with:
perf-skill --help
perf-skill observe --help
perf-skill exercise --help
Supported statement forms
The parser is intentionally narrow and predictable. The AI/agent routing quick reference and Scenario-driven workflows sections above can be broader than
the parser itself, because workflow orchestration is allowed to combine shell,
native perf, and perf-skill instead of forcing every request into one
statement.
- trace comm=python pid=4242 inst cycles
- observe python 4242 instructions
- observe node instructions
- observe node cpu-clock sched:sched_switch
- 追踪 node 的 cycles 并输出 perf.data
- 追踪 node 的 cycles 并生成火焰图
- 解析 out/node_targetpid4242_cycles_data_20260519T120000.data
- 解析 out/node_targetpid4242_cycles_data_20260519T120000.data 并生成火焰图
- trace pid=4242 inst cycles summary
- trace pid=4242 inst cycles for 5 seconds
- observe pid=4242 cache-misses 10 samples
- 追踪 comm=nginx pid=31337 inst cycles
- 追踪 node 的 指令 和 周期
- 我要追踪node20秒内的cycles
- 探测20秒node的cycles并生成图像
- 生成10s内node的branchs的图像
- 追踪 pid=31337 的 inst 和 cycles,采样10次
- 追踪 node 持续 30 秒,采 20 个样本
- watch pid 9001 events=inst,cycles,cache-misses
Event listing is intentionally explicit now. Use perf-skill events, perf-skill events cache, or let the skill route a natural-language event-listing request to that subcommand instead of widening the parser again.
Recognized target keys:
- pid, pid=1234
- comm, comm=python
Recognized event aliases:
- inst, instruction, instructions
- cycle, cycles
- branch-misses, branches
- cache-misses, cache-references
Even if you request only cache-misses or branches, the tool still keeps
instructions and cycles in the perf event set so IPC remains available.
Likewise, if you request only branch-misses or cache-misses, the CLI adds the
paired counter (branches or cache-references) so the timeline is easier to
interpret.
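The alias, injection, and pairing rules can be summarized in a few lines. The sketch below is an illustration only: the tables contain just the aliases and pairs named in this section, the pairing is applied in one direction (a miss counter pulls in its total), and resolve_events is a hypothetical name.

```python
from typing import Iterable, List

ALIASES = {"inst": "instructions", "instruction": "instructions", "cycle": "cycles"}
# Assumption: a miss counter pulls in the counter it is measured against.
PAIRED_WITH = {"branch-misses": "branches", "cache-misses": "cache-references"}


def resolve_events(requested: Iterable[str]) -> List[str]:
    """Expand aliases, always keep instructions and cycles, and add paired counters."""
    resolved = ["instructions", "cycles"]  # always present so IPC can be derived
    for name in requested:
        event = ALIASES.get(name, name)
        for candidate in (PAIRED_WITH.get(event), event):
            if candidate and candidate not in resolved:
                resolved.append(candidate)
    return resolved


print(resolve_events(["inst", "cache-misses"]))
print(resolve_events(["branches"]))
```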
Auto grouping rules:
- instructions and cycles stay in the same core group
- branches and branch-misses are grouped together when both are present
- cache-references and cache-misses are grouped together when both are present
- Names that share a strong prefix, suffix, or namespace are preferred in the same group when there is a choice
- Single leftover events are merged into an existing group when there is room
Exporting traces
Write CSV samples during collection:
perf-skill observe "trace pid=4242 inst cycles cache-misses" \
--samples 10 --plain --csv-out out/samples.csv
Write both CSV and SVG artifacts:
perf-skill observe "trace pid=4242 inst cycles branches" \
--samples 20 --plain --csv-out out/samples.csv --svg-out out/timeline.svg
The CSV contains one row per interval sample. The SVG is a stacked time-series report with one panel per metric plus IPC when available.
SVG charts are rendered with matplotlib instead of hand-written XML, so the output is easier to read and closer to a normal plotting workflow.
Use --no-svg-legend if you want a more compact SVG without the color legend.
Write a renamed perf.data artifact with perf record:
perf-skill observe "trace pid=4242 inst cycles" --data-out out/python_targetpid4242_cycles_data_20260519T120000.data --seconds 10
Parse a recorded .data artifact with perf script:
perf-skill observe --data-in out/python_targetpid4242_cycles_data_20260519T120000.data
Write a Python-generated JSON summary:
perf-skill observe "trace pid=4242 inst cycles" --summary-out out/summary.json
perf-skill observe --data-in out/python_targetpid4242_cycles_data_20260519T120000.data --summary-out out/data-summary.json
Packaging and releases
Build a local wheel and sdist:
python -m pip install -e .[dev]
python -m build
Install the generated wheel locally:
pip install dist/perf_skill-*.whl
This repository includes a tag-driven GitHub Actions workflow at
.github/workflows/release.yml. Pushing a tag such as v1.0.0 builds the wheel
and sdist, validates that the tag matches the package version, generates a
changelog from commits since the previous tag, uploads the built artifacts, and
attaches them to a GitHub release.
The workflow now supports a two-stage publish path:
- Push test-v1.0.0 to publish the build to TestPyPI, create a GitHub prerelease, and run a smoke-install from TestPyPI.
- After that succeeds, push v1.0.0 on the same commit to create the formal GitHub release, publish to PyPI, and run a smoke-install from PyPI.
The release workflow uses:
- scripts/release/validate_tag.py to assert vX.Y.Z matches perf_skill.__version__
- scripts/release/generate_changelog.py to build release notes from the git history between tags
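The tag check amounts to comparing the version embedded in the tag with the package version. Here is a minimal sketch of that comparison. It is not the contents of validate_tag.py; accepting the test- prefix in the same function is an assumption, and tag_matches_version is a hypothetical name.

```python
import re


def tag_matches_version(tag: str, version: str) -> bool:
    """True when the tag is vX.Y.Z (optionally prefixed with test-) and X.Y.Z equals version."""
    match = re.fullmatch(r"(?:test-)?v(\d+\.\d+\.\d+)", tag)
    return match is not None and match.group(1) == version


print(tag_matches_version("v1.0.0", "1.0.0"))
print(tag_matches_version("v1.0.1", "1.0.0"))
```

In a release job the version argument would come from the installed package, for example perf_skill.__version__, and a False result would fail the workflow before anything is published.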
If you want to bump the package version references before tagging, use:
python3 scripts/release/bump_version.py 0.6.0 --dry-run
python3 scripts/release/bump_version.py 0.6.0
Configure trusted publishers for this repository on both TestPyPI and PyPI
before pushing release tags. test-vX.Y.Z tags publish to TestPyPI first, and
vX.Y.Z tags publish the same version to PyPI after the rehearsal is complete.
The skill's bundled .github/skills/hardware-event-observe/package-requirement.txt
keeps the runtime installer pinned to that released PyPI version when no local
checkout is available.
You can also run the release helpers locally:
PYTHONPATH=src python3 scripts/release/validate_tag.py v1.0.0
PYTHONPATH=src python3 scripts/release/generate_changelog.py --tag v1.0.0 --output /tmp/release-notes.md
Notes
- Linux only. The tool shells out to perf.
- You may need lower kernel.perf_event_paranoid or elevated privileges.
- If comm matches multiple processes, the tool asks you to pin a pid.
- The terminal dashboard is ASCII only and works best in an interactive TTY.
Development
Run the unit tests:
python -m unittest discover -s tests
IDE usage
This repository also includes a Copilot Skill at
.github/skills/hardware-event-observe/ so you can trigger the local CLI from
VS Code chat with a natural-language request.
Example invocations:
/hardware-event-observe 追踪 comm=node pid=16874 的 inst 和 cycles
/hardware-event-observe observe pid=16874 cache-misses branches
/hardware-event-observe 解析 out/node_targetpid16874_cycles_data_20260519T120000.data 并生成火焰图
/hardware-event-observe observe pid=16874 branch-misses --samples 10 --csv-out out/node.csv --svg-out out/node.svg
The skill delegates to the local helper script:
bash .github/skills/hardware-event-observe/scripts/run-observe.sh \
"trace pid=16874 inst cycles" --samples 5 --plain
If the script can see the local repository checkout, it installs that checkout
in editable mode so later source changes are picked up immediately. If the
skill is installed under a workspace ./skills/ directory, it defaults to a
runtime under ./.openclaw/perf-skill/venv; if it is installed globally under
~/.openclaw/skills/, it defaults to ~/.openclaw/perf-skill/venv. When no
local checkout is visible, it falls back to the pinned PyPI requirement bundled
with the skill. FlameGraph rendering also auto-clones Brendan Gregg's
FlameGraph repository under the active PERF_SKILL_HOME on first use. You can
override the shared install path with OPENCLAW_HOME or PERF_SKILL_HOME, the
virtual environment path with PERF_SKILL_VENV_DIR, the FlameGraph checkout
path with PERF_SKILL_FLAMEGRAPH_DIR, and the fallback Python package source
with PERF_SKILL_PACKAGE_SOURCE.