Skip to main content

DAfny Resource Usage Measurement

Project description

DAfny Resource Usage Measurement

A set of tools to detect and help diagnose brittle verification.

Darum? Ach, warum!

What does Darum do?

It analyzes Dafny's verification costs to find brittleness and helps the user diagnose it.

Brittleness?

Dafny verifies code by translating it internally into assertions, that then are verified by the Z3 solver.

For a long time, the common advice to help Dafny code verify successfully was to add context information to the code through manual assertions, so that Z3 would have enough information to find a solution. This works well for small projects. However, the solver can eventually have too much information and get lost following unproductive rabbitholes while searching for a solution. This causes long verification times or timeouts, which bogs down development. Worse, the verification of Dafny code depends on a measure of randomness, which causes Z3 to follow changing paths when searching for proofs, even for semantically unchanged Dafny code. This causes variability in verification times, possibly spanning multiple orders of magnitude. This is known as "verification brittleness".

Since recently, Dafny has been adding functionalities to help the user control what information reaches Z3. The problem then is that one rarely knows what allowed Z3 find a proof in the past, or what is confusing it now. Indeed, at the Dafny level, the only thing we get back is the result (verification successful, failure or timeout), plus how costly was it for the solver to reach that result.

At Consensys, we found that the distribution of verification costs in bigger codebases turns heavily multimodal. Hence, the standard statistics (average, covariance) typically used to report costs obscure important information; namely, that cost variation is not smooth, but can change drastically. And because of the inherent randomness, these variations happen for no reason discernible to the user.

Darum analyzes the solver costs, to find what parts of the code show the highest variability. It's at these points where applying Dafny's brittleness control functionalities will have the biggest effect.

What exactly is Darum?

Darum consists of 3 loosely coupled tools:

  • dafny_measure: a wrapper around dafny measure-complexity for easier management of the iterative verification process. It captures the logs and augments them with transient information for easier reference:

    • The timestamp and arguments used in this verification run
    • Dafny's stdout and stderr
    • The input file's contents and a hash, to avoid confusion when comparing successive versions of the code

    Additionally, dafny_measure warns when some Dafny toolchain bugs are detected, like when it gets stuck (#5316) or z3 processes are leaked (#5616).

  • plot_distribution: a tool to analyze and present the logs generated by dafny_measure.

    • It runs some tests on the verification results contained in the log
    • Scores the results heuristically for their potential for improvement
    • Presents the results in summary tables
    • Plots the most interesting results for further analysis.
  • compare_distribution: a tool to compare verification runs at different Assertion Batch granularity, i.e. with and without "Isolate Assertions" mode.

Installation

Darum's tools are written in Python and available in Pypi.

Probably the easiest way to install DARUM is using pipx, which should be available in all common package managers, like brew in macOS.

$ brew install pipx
...
$ pipx install darum

This will make DARUM's tools available as common CLI commands.

Eventually you can upgrade DARUM with

$ pipx upgrade darum

Usage

Each of the tools has a --help argument that lists the available options.

In general, the workflow will be:

  1. Run dafny_measure
  2. Plot the logfile with plot_distribution (happens automatically when running dafny_measure)
  3. If some member looks interesting/suspicious, run dafny_measure again in IA mode, possibly with --filter-symbol to focus only on that member
  4. Plot the new logfile with plot_distribution
  5. And/or compare both logfiles with compare_distribution.
$ dafny_measure myfile.dfy
...
Generated logfile XYZ.log

$ plot_distribution XYZ.log
...

$ dafny_measure myfile.dfy --isolate-assertions --extra "--filter-symbol expensiveFunction"
...

$ compare_distribution XYZ.log -i XYZ_IA.log
...

For further details about how Darum works and usage strategies, please see the file Details.md.

How many iterations to run with dafny_measure? (-i argument)

The default is 10, which in practice seem to work well. Bigger numbers (100 iterations or more) might be interesting to get more detail on how the distribution really looks like in badly behaved code: what are its modes, and how extreme it can get.

Note that a higher number of iterations can trigger bugs in Dafny, and fail midway without producing a log (Dafny issue #5316). Plan accordingly. To work around this, there's some functionality in Darum to analyze multiple small logfiles, which can be more reliable than trying to generate a big logfile at once. (Darum issue #1)

Interpreting the results

Take a look at the walkthrough

The plots

Plain plots

What is an acceptable span?

Badly behaving members seem to blow up their span rather abruptly, so probably a single run will quickly give an idea of which members need work. However, as a rule of thumb: in IA mode, spans seem to grow abruptly once they go over 3-5%. In default mode, this happens over 10%.

Worst offenders

ABs are scored according to their characteristics, including the fact that a non-successful AB makes subsequent ABs in the same member unreliable.

The top N ABs are plotted. For plots with failures/OoRs, these are plotted aside for attention.

The plot starts in transparent mode to make it easier to see where bars overlap. Clicking on the legend makes the corresponding plot easier to see.

Verification results that happen rarely are specially important. Hence, the Y axis is logarithmic to better show those single, rare results.

Comparative plots

This type of plot compares RUs of members verified in default mode (plotted with X symbols) against the RU total of those same members verified in IA mode (plotted as spikes). The comparisons are always member-wise: member-wide single AB vs sum of a member's ABs. Note the Log X axis.

Things to notice in each member:

  • Ideal behavior happens when a member verifies robustly and, ideally, cheaply. In the plot, this translates to Xs clustered together (meaning stability) and much farther to the left (meaning cheaper) than the spikes, which will also be clustered together (meaning that even the DAs verify stably for this member). This is the typical situation for short, simple members.
  • Wide X dispersion means brittleness in default mode. If the spikes are still clustered, this means that the DAs are still stable.
    • Maybe Z3 is finding a proof following different paths, or the search space is just too big. Reducing it to force Z3 into the fast path would help.
    • Notably, if some of the X show higher cost than the spikes, this is a clear indication that plainly breaking down the AB into smaller batches would help by trading happenstance speed for stability.
  • If even the spikes are dispersed, this means that even the individual DAs show brittleness.
    • It's likely that some DA can't verify by itself; the member-level AB verified successfully because it discovered extra context while proving previous DAs. Hence, it'd be helpful to make the discovered context explicit. Plotting individually the IA mode log will show what DA is failing.
  • Robust member verification (clustered X) with brittle IA verification (disperse spikes) might point to brittleness waiting to happen!
    • Lucky, side-effect discoveries while proving early DAs in the AB are allowing subsequent DAs to be proven. If the lucky discovery stops happening, suddenly there are failures in the AB.

The table/s

Default mode distribution plots contain a table analyzing the logs. For convenience, the column "diag" in the table summarizes the situation for each row with a series of icons:

  • 📊 : This element was plotted because was among the top scores.
  • ⌛️ : This element finished as Out of Resources.
  • ❌ : This element explicitly failed verification.
  • ❗️ : This element had both successful and failed/OoR results. Note that purely successful results are not highlighted because they're the hoped default; but fliflopping results merit extra attention. According to the Dafny developers, as long as there is a successful result, the goal should be to stabilize it to avoid the failures – as opposed to discarding the success because of the failures.
  • ❓ : This element had a single success across all iterations. It's probably prioritary to stabilize its success.
  • ⛔️ : This element was excluded from the plot on request.

IA mode distribution plots contain 2 tables. The first one is equivalent to the one just described, only applied to the individual ABs. The second table shows the total costs at the member level, but still in IA mode.

Comments

If interesting/atypical situations were detected while preparing the plots, they will be listed in a Comments section at the bottom of the plot page.

Pitfalls

Start in standard mode, dig into IA mode once a problem is apparent

As mentioned, a member can verify stably in default mode, but in IA mode present ABs that are brittle or even fail. The significance of this situation is still unclear, therefore:

  1. Probably there's no immediate harm in leaving the member as it is. However, you might still want to keep an eye on it in case that any small change in the member triggers brittleness.
  2. More importantly, It's probably best to focus on fixing problems that can be first be found at the member level with standard verification mode. On the contrary, starting by looking for problems at the IA mode might cause you to spend effort on trouble that maybe isn't really there. Dafny's --filter-symbol is a great way to avoid the temptation of fishing for unnecessary trouble in IA mode.

Keep in mind that, in IA mode, ABs are "chained". If one fails, the rest are removed.

To minimize noise, once an AB fails in IA mode, we ignore subsequent ABs in the same iteration. In such a case, the tables will show that some ABs have a smaller number of results than previous ones. In the extreme case of only 1 result remaining for a given AB, it will be tagged with the ❗️ icon.

Some remedies to keep in mind

Dafny standard library

Since version 4.4, Dafny includes a standard library that provides pre-verified code. This library includes helper lemmas for things like bitwise manipulation and non-linear arithmetic, which are typical verification pain points. Using this library instead of implementing one's own version might save much work; or, if a reimplementation is needed, can at least offer examples of how things were implemented by Dafny's own developers.

Section on Verification debugging in Ref Manual

See here.

Hacking

Clone the repo to your system and install it in editable mode with pipx, poetry or similar tools.

cd darum
pipx install -e .

DARUM's plots use HoloViz, a Python library that configures and deploys Javascript libraries into a standalone, interactive HTML page. This complexity combined with Python's own packaging complexities mean that if you try to run DARUM's scripts directly, the generated HTML pages will fail to work.

Hence, remember to use the user-facing scripts installed by tools like pipx and poetry instead.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

darum-1.tar.gz (43.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

darum-1-py3-none-any.whl (42.9 kB view details)

Uploaded Python 3

File details

Details for the file darum-1.tar.gz.

File metadata

  • Download URL: darum-1.tar.gz
  • Upload date:
  • Size: 43.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for darum-1.tar.gz
Algorithm Hash digest
SHA256 f8b7e0d1adfd19c2a6a877e106fbd8d153207ac66026b9ef5013d1e7e0b68e6a
MD5 0e5489e88f795613a5681aeaba7827fe
BLAKE2b-256 e80bb8909e5366e42cf0f305e2a21925c684498ba933aecd8a4cfa83fcb10255

See more details on using hashes here.

File details

Details for the file darum-1-py3-none-any.whl.

File metadata

  • Download URL: darum-1-py3-none-any.whl
  • Upload date:
  • Size: 42.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for darum-1-py3-none-any.whl
Algorithm Hash digest
SHA256 b3e6d12e48108c038ec0eca037bfb5b34e768487db3b243237fae0fcfc62e5f2
MD5 60c2d5f6b3d50856249a13d73447f021
BLAKE2b-256 f17fb517d0763a69e2fe9041e488ae1a1c219dda814027a5a72272514d59e23b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page