Research project to analyze the OSV database

osv analysis

This repository mines data from vulnerability databases in the OSV format, looking to map vulnerabilities to software versions they affect.

The main features are:

  • computing stats
  • exhaustively listing which revisions are affected by a vulnerability, and vice versa
  • an experimental implementation of the SZZ algorithm, see "Computing introducing commits" below.

Preliminaries

  • Install dependencies, get data files, and index them:

    pip3 install -e .
    make all.sqlite
    
  • Run swh-graph-grpc-serve on localhost (or port-forward it, e.g. ssh maxxi.internal.softwareheritage.org -L 50091:localhost:50091).

Generate all reports:

make

Data sources

Stats

Publications per year start at 120 in 2003 and increase exponentially to 10k in 2024.

Reports last modified in a given year are few and erratic from 2003 to 2012-2013, then increase exponentially from 100 to 10k in 2024.

10k reports are published and last modified on the same day. From that value, the count decreases exponentially, down to 100 reports last modified 8000 days after their publication. A small number of reports are also last modified before their publication, with an exponential decrease to 100 reports at 100 days.

'affected' items

Each OSV document lists what software packages (which can be real packages, VCS repositories, or projects in the abstract) are affected.

Exponential decrease from 10k reports with 1 affected item to single reports with 1000 affected items. A couple thousand reports don't fit the regression and have between 100 and 115 affected items per report.

Each 'affected' entry lists events for that affected package, which roughly map to when the vulnerability was introduced and fixed. As VCS commits and versions are not linear, there can be many of these.

Event types per affected entry

Number of event types per database

Event types per database

Events are grouped per range (usually only two per range), and ranges can have three types:

  • GIT when events are associated to Git commits

  • SEMVER when they are associated to regular X.Y.Z versions with the expected semantics

  • ECOSYSTEM for everything else, with no defined semantics. Due to the lack of semantics, the OSV spec recommends that reports explicitly list all affected versions in this case. The graph below breaks ECOSYSTEM ranges into two, based on whether they follow this recommendation.

Not very readable chart, we see 40k GIT ranges from each of NVD and CVE DB, 40k ECOSYSTEM with versions from Ubuntu, 110k ECOSYSTEM without version across NVD CVE / CGA / CVE / and a couple others. 30k SEMVER in total, mostly from the MAL database
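
The breakdown by range type can be sketched as below. The field names (affected, ranges, type, versions) come from the OSV schema; the function name, the made-up document, and the rule used to split ECOSYSTEM ranges (whether the enclosing affected entry has a non-empty versions list) are assumptions for illustration, not necessarily how this repository computes the chart.

```python
from collections import Counter


def count_range_types(osv_document: dict) -> Counter:
    """Count ranges by type, splitting ECOSYSTEM by whether versions are listed."""
    counts = Counter()
    for affected in osv_document.get("affected", []):
        # Assumption: "follows the recommendation" means a non-empty `versions` list.
        has_versions = bool(affected.get("versions"))
        for rng in affected.get("ranges", []):
            range_type = rng.get("type")
            if range_type == "ECOSYSTEM":
                suffix = "with versions" if has_versions else "without versions"
                counts[f"ECOSYSTEM ({suffix})"] += 1
            else:
                counts[range_type] += 1
    return counts


# Made-up document, only meant to exercise the three cases.
example = {
    "id": "EXAMPLE-0001",
    "affected": [
        {"ranges": [{"type": "GIT", "repo": "https://example.org/repo.git",
                     "events": [{"introduced": "0"}, {"fixed": "0123abc"}]}]},
        {"ranges": [{"type": "ECOSYSTEM",
                     "events": [{"introduced": "0"}, {"fixed": "1.2"}]}],
         "versions": ["1.0", "1.1"]},
        {"ranges": [{"type": "ECOSYSTEM", "events": [{"introduced": "0"}]}]},
    ],
}
print(count_range_types(example))
```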

Identifiers

The OSV spec describes identifiers as "a string of the format <DB>-<ENTRYID>, where DB names the database and ENTRYID is in the format used by the database".
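
A minimal sketch of splitting an identifier into its two parts follows. Splitting at the first hyphen is an assumption that holds when the database prefix itself contains no hyphen; the function name is hypothetical.

```python
def split_osv_id(identifier: str) -> tuple[str, str]:
    """Split '<DB>-<ENTRYID>' at the first hyphen (assumes DB contains no hyphen)."""
    db, sep, entry_id = identifier.partition("-")
    if not sep or not db or not entry_id:
        raise ValueError(f"not of the form <DB>-<ENTRYID>: {identifier!r}")
    return db, entry_id


print(split_osv_id("CVE-2021-25313"))
```

The entry id keeps any further hyphens, since its format is left to each database.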

::include{file=output/identifiers.md}

Mapping software packages to SWH objects

Mapping package names to SWH origins

Each OSV document lists what software packages (which can be real packages, VCS repositories, or projects in the abstract) are affected.

Each of these software packages is identified by an ecosystem (eg. OSS-Fuzz, npm, PyPI, or Ubuntu).

Mapping GIT affected entries to origins and SWHIDs in SWH

Each software package is associated with a list of version ranges that are affected by the vulnerability, and each version range is made of events (that mark when a vulnerability is introduced and then fixed).

Version ranges can be associated to Git commits (for software packages of type GIT), version numbers (for software packages of type SEMVER), or opaque/ecosystem-specific strings (type ECOSYSTEM).

In this section, we look only at software packages of type GIT and count how many of them we can find in SWH (using the origin URL matching described below), and how many of their commits were in the 2025-05-18 graph.

::include{file=output/git_origins.md}

Mapping non-GIT affected entries to origins in SWH

For non-GIT affected packages, we currently only try to map to an origin URL.

This relies on OSV's specified ecosystems and knowledge of SWH's idiosyncrasies. See osv/map_packages.py for details.

::include{file=output/packages.md}

Worth noting are:

  • thousands of packages on NPM are not in SWH and don't seem to exist. All but a handful come from the MAL database. They are marked as "hallucinated" above.
  • the OSV spec says about Maven: "The ecosystem string might optionally have a :<REMOTE-REPO-URL> suffix to denote the remote repository URL that best represents the source of truth for this package, without a trailing slash (e.g. Maven:https://maven.google.com). If this is omitted, this is assumed to be the Maven Central repository (https://repo.maven.apache.org/maven2).". Not a single report uses this suffix, even though between a quarter and a half of Maven packages are not in Maven Central but in other repositories.

Cherry-picks stats

2024-05-16-history-hosting graph

247970180 commits are in the 7079 connected components that contain any of the introduced, fixed, last_affected, or limit commits mentioned by vulnerability reports.

Of the commits mentioned by vulnerability reports:

  • 243008413 are not deemed to be cherry-picks, though 300632 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
  • 4961689 have at least one valid "cherry picked from commit" stanza, 236022 have at least two. Of those stanzas:
    • 293315 reference unknown commits and don't have a repo URL
    • 0 reference unknown commits and a non-UTF8 repo URL
    • 0 reference unknown repo URLs (ie. these were origins unknown to SWH at the time of graph export)
    • 1640 reference unknown commits and a known repo URL (ie. these commits were unknown at the time of the export, but probably will be in the future)
    • 55449 claim to be cherry-picks of commits in different connected components
    • 5005620 reference known commits
  • 4989284 are cherry-picks of known commits

Of the 4700546 cherry-pick commits, 166743 are cherry-picks of multiple commits. Of the latter, 141972 can be reduced to cherry-picks of a single commit, which is itself a cherry-pick of one (or more). 805 can only be partially reduced through that process.

2025-05-18 graph

293340976 commits are in the 7767 connected components that contain any of the introduced, fixed, last_affected, or limit commits mentioned by vulnerability reports.

Of the commits mentioned by vulnerability reports:

  • 287043476 are not deemed to be cherry-picks, though 346120 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
  • 6296974 have at least one valid "cherry picked from commit" stanza, 303125 have at least two. Of those stanzas:
    • 348192 reference unknown commits and don't have a repo URL
    • 0 reference unknown commits and a non-UTF8 repo URL
    • 0 reference unknown repo URLs (ie. these were origins unknown to SWH at the time of graph export)
    • 1998 reference unknown commits and a known repo URL (ie. these commits were unknown at the time of the export, but probably will be in the future)
    • 70117 claim to be cherry-picks of commits in different connected components
    • 6372854 reference known commits
  • 6354701 are cherry-picks of known commits
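
The difference between a commit that merely mentions the keywords and one that carries a stanza can be illustrated as follows. This sketch assumes a "valid" stanza is a line of its own in the form written by git cherry-pick -x; the exact rule used for the counts above, and the handling of stanzas that carry a repo URL, are not covered here. All names are hypothetical.

```python
import re

KEYWORDS = "cherry picked from commit"

# Line written by `git cherry-pick -x`: alone on its line, hexadecimal commit id.
STANZA_RE = re.compile(
    r"^\(cherry picked from commit ([0-9a-f]{7,40})\)$", re.MULTILINE
)


def classify(message: str) -> dict:
    """Report whether a commit message mentions the keywords and which stanzas it has."""
    return {
        "mentions_keywords": KEYWORDS in message,
        "stanza_targets": STANZA_RE.findall(message),
    }


sha = "0123456789abcdef0123456789abcdef01234567"  # made-up commit id
picked = f"Fix overflow in parser\n\n(cherry picked from commit {sha})\n"
quoted = (
    'Revert "Fix overflow in parser"\n\n'
    f"The reverted message said: (cherry picked from commit {sha})\n"
)
print(classify(picked))
print(classify(quoted))
```

The second message mentions the keywords but has no stanza of its own, which is the situation described in the first bullet of each list above.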

Algorithms

Origin URL matching

Origin URLs in SWH are case-sensitive, but in OSV reports they are usually lowercased.

Given a Git repository URL from an OSV report, we try to match it to a SWH origin URL this way:

  1. if there is an exact match in swh-graph, return it
  2. Remove .git at the end. If there is an exact match in swh-graph, return it (this catches 20% of the origins that failed the exact match)
  3. Send a request to https://archive.softwareheritage.org/api/1/origin/search/{url}?limit=10. If any matches the URL (case-insensitive and with .git suffix stripped from both), return it
  4. fail
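
The four steps above can be sketched as follows. The set known_origins stands in for the swh-graph lookup and the search callable stands in for the HTTP request of step 3; both, like the function names, are placeholders for illustration rather than the repository's actual code.

```python
from typing import Callable, Iterable, Optional


def strip_git_suffix(url: str) -> str:
    """Remove a trailing '.git', if any."""
    return url[:-4] if url.endswith(".git") else url


def match_origin(
    url: str,
    known_origins: set[str],
    search: Callable[[str], Iterable[str]],
) -> Optional[str]:
    # Step 1: exact match.
    if url in known_origins:
        return url
    # Step 2: exact match after removing the .git suffix.
    stripped = strip_git_suffix(url)
    if stripped in known_origins:
        return stripped
    # Step 3: ask the search backend, then compare case-insensitively
    # with the .git suffix stripped from both sides.
    wanted = strip_git_suffix(url.lower())
    for candidate in search(url):
        if strip_git_suffix(candidate.lower()) == wanted:
            return candidate
    # Step 4: fail.
    return None


known = {"https://github.com/Foo/Bar"}  # made-up origin


def fake_search(url: str) -> list[str]:
    return ["https://github.com/Foo/Bar", "https://github.com/foo/barbaz"]


print(match_origin("https://github.com/foo/bar.git", known, fake_search))
```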

Other origins are normalized similarly, with one addition for PyPI projects: OSV reports frequently mangle PyPI project names by using dots, underscores, and dashes interchangeably, so in step 3 we consider those characters equivalent (as PyPI does).
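
For reference, the name normalization that PyPI itself applies (PEP 503) can be written as below. Whether this repository uses exactly this expression is an assumption; the function name is hypothetical.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """PEP 503 normalization: runs of '-', '_' and '.' become one '-', lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Two spellings of the same project compare equal once normalized.
print(normalize_pypi_name("Foo.Bar_baz") == normalize_pypi_name("foo-bar-baz"))
```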

Computing introducing commits

We implement the SZZ algorithm to compute, from a known commit fixing a vulnerability, which commits may have introduced it. Fundamentally, it works this way:

  1. Look up the parent of the fixing commit, and compute the diff between it and the fixing commit
  2. Compute a git-blame of the parent of the fixing commit
  3. For each line removed (or modified) by the fixing commit, look up in the git-blame which commit introduced it
  4. Take the union of the commits introducing these lines
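
The four steps above can be illustrated on a toy model: a single file with a linear history held in memory, using Python's difflib for both the diff and the blame. This is only a sketch of the idea under those assumptions, not the implementation in this repository, which works on the Software Heritage graph. All names and the sample history are made up.

```python
from difflib import SequenceMatcher


def blame(history):
    """history: list of (commit_id, lines), oldest first.

    Returns, for each line of the last version, the commit that introduced it.
    """
    first_id, first_lines = history[0]
    owners = [first_id] * len(first_lines)
    for (_, old), (new_id, new) in zip(history, history[1:]):
        new_owners = []
        matcher = SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                new_owners.extend(owners[i1:i2])  # unchanged lines keep their owner
            elif tag in ("replace", "insert"):
                new_owners.extend([new_id] * (j2 - j1))  # new or modified lines
            # "delete": the lines disappear, nothing to record
        owners = new_owners
    return owners


def szz(history_up_to_parent, fixed_lines):
    """Return the commits that introduced lines removed or modified by the fix."""
    parent_lines = history_up_to_parent[-1][1]
    owners = blame(history_up_to_parent)  # step 2: blame of the parent
    introducing = set()
    # Step 1: diff between the parent and the fixing commit.
    matcher = SequenceMatcher(None, parent_lines, fixed_lines, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            introducing.update(owners[i1:i2])  # steps 3 and 4
    return introducing


history = [
    ("c1", ["def check(user):", "    return True"]),
    ("c2", ["def check(user):", "    log(user)", "    return True"]),
]
fixed = ["def check(user):", "    log(user)", "    return user.is_admin"]
print(szz(history, fixed))
```

Here the fix only modifies the return line, which dates back to c1, so c2 is not reported even though it is the most recent commit to touch the file.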

To compute all recommended variants, use:

make data/szz.tar     # standard SZZ, fast
make data/vszz90.tar  # V-SZZ with similarity ratio 90%
make data/vszz75.tar  # V-SZZ with similarity ratio 75%, slow

Minimally it can be run with:

cargo run --release --bin szz -- \
    --graph "$GRAPH_PATH" \
    --digestmap "$DIGESTMAP_PATH" \
    --url gzip:https://softwareheritage.s3.amazonaws.com/content/{sha1} \
    --url plain:https://archive.softwareheritage.org/api/1/content/sha1:{sha1}/raw/ \
    --db all.sqlite \
    --content-cache data/contents.rocksdb/ \
    --max-tree-diff-items 100 \
    --vuln-out-dir data/szz/

which produces in data/szz/ ndjson (newline-delimited json) files in this format:

pub struct SzzOutputRecord {
    /// Filename of the vulnerability report in the OSV database
    pub vuln_filename: String,
    /// Revisions that are a fix according to the vulnerability report
    pub fix_revs: Vec<SWHID>,
    /// Union of tag names of each revision in `fix_revs_id`/`fix_revs_swhid`.
    ///
    /// For each revision in `fix_revs_swhid`, takes the name of any release that points
    /// **directly** to it, or the name of a `refs/tags/` branch that points **directly** to it.
    pub fix_revs_tag_names: Vec<String>,
    /// Whether the vulnerability report claims the vulnerability is present since the beginning of
    /// the project
    ///
    /// ie. the event `{ "introduced": "0" }` is present.
    ///
    /// This is usually a lie, and actually means the author of the vulnerability report does not
    /// know when the vulnerability was introduced.
    pub introduced_at_zero: bool,
    /// Revisions that introduced the vulnerability according to the vulnerability report
    pub known_introduction_revs: Vec<SWHID>,
    /// Union of tag names of each revision in
    /// `known_introduction_revs_id`/`known_introduction_revs_swhid`.
    pub known_introduction_revs_tag_names: Vec<String>,
    /// Revisions that introduced the vulnerability according to SZZ,
    /// and for each path, which lines in it are considered to be the introduction
    pub computed_introduction_revs: HashMap<SWHID, IntroductionFilesRecord>,
    /// Union of tag names of each revision in
    /// `computed_introduction_revs_id`/`computed_introduction_revs_swhid`.
    pub computed_introduction_revs_tag_names: Vec<String>,
}

For example:

{"vuln_filename":"osv-output/CVE-2021-25313.json","fix_revs":["swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274"],"fix_revs_tag_names":["v2.5.6","v2.5.6-rc9"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":765,"end":766},{"start":759,"end":760}]}},"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":589,"end":590}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-45099.json","fix_revs":["swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce"],"fix_revs_tag_names":["v10.0.0"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":94,"end":95}]}},"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":97,"end":98}]}},"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":96,"end":97}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-36179.json","fix_revs":["swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3"],"fix_revs_tag_names":["jackson-databind-2.6.7.5"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:e8df0987e3034d102ee6d704d30a05a2e3ac7089"],"known_introduction_revs_tag_names":["jackson-databind-2.0.0"],"computed_introduction_revs":{"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c":{"pom.xml":{"line_ranges":[{"start":23,"end":24},{"start":12,"end":13}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2022-27818.json","fix_revs":["swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c":{"src/daemon.rs":{"line_ranges":[{"start":31,"end":34}]}},"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8":{"src/daemon.rs":{"line_ranges":[{"start":30,"end":32}]}},"swh:1:rev:6097674e18e2e34f68b340a40d36dcd23c258d00":{"src/daemon.rs":{"line_ranges":[{"start":307,"end":308}]},"src/server.rs":{"line_ranges":[{"start":14,"end":15}]}},"swh:1:rev:b4e6dc76f4845ab03104187a42ac6d1bbc1e0021":{"src/daemon.rs":{"line_ranges":[{"start":408,"end":409},{"start":404,"end":405}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2018-16846.json","fix_revs":["swh:1:rev:b10be4d44915a4d78a8e06aa31919e74927b142e"],"fix_revs_tag_names":["v13.2.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279":{"CMakeLists.txt":{"line_ranges":[{"start":3,"end":4}]}}},"computed_introduction_revs_tag_names":["v13.2.3"]}
{"vuln_filename":"osv-output/CVE-2020-4071.json","fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-44108.json","fix_revs":["swh:1:rev:d919b2744cd05abae043490f0a3dd1946c1ccb8c"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:235a041b8d7638db931114ace49e4f771508830f":{"src/amf/namf-handler.c":{"line_ranges":[{"start":202,"end":204}]}},"swh:1:rev:d0673e3066ff14ce2d965b436ccb9b3646a38705":{"lib/sbi/message.c":{"line_ranges":[{"start":467,"end":468}]}},"swh:1:rev:c9363b132093581b6fd2ce794aa63cd597bf83a6":{"src/amf/namf-handler.c":{"line_ranges":[{"start":172,"end":173}]}},"swh:1:rev:dbee687a75797e0be5f8484030d11ea22e18b63c":{"lib/sbi/message.c":{"line_ranges":[{"start":1325,"end":1326},{"start":1391,"end":1392},{"start":1499,"end":1501},{"start":1328,"end":1330},{"start":1357,"end":1358},{"start":1334,"end":1336}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2016-6306.json","fix_revs":["swh:1:rev:848d650dade802c835b4b3a1e29c7581e79494ed"],"fix_revs_tag_names":["v0.10.47"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:163ca274230fce536afe76c64676c332693ad7c1"],"known_introduction_revs_tag_names":["v0.10.0"],"computed_introduction_revs":{"swh:1:rev:3e711f14ae7db34350fcc5b1d7ffd4a8cfc2daef":{"src/node_version.h":{"line_ranges":[{"start":28,"end":29}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-25706.json","fix_revs":["swh:1:rev:39458efcd5286d50e6b7f905fedcdc1059354e6e"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:20a1b073420eeecab9eca2f1a8d86f30f81c2f23":{"lib/import.php":{"line_ranges":[{"start":983,"end":984}]}},"swh:1:rev:0ba5711f09338a7019ed5622701a7effd83ba701":{"lib/import.php":{"line_ranges":[{"start":1756,"end":1757}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2024-43440.json","fix_revs":["swh:1:rev:aea9770cfc7d003a737f4899489d1e3982efe9ac"],"fix_revs_tag_names":["v4.1.12"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:44305df587ad156e6ccc8495bfbcd45e45370c23":{"version.php":{"line_ranges":[{"start":31,"end":32},{"start":34,"end":35}]}}},"computed_introduction_revs_tag_names":[]}
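
Records in this format can be consumed with any JSON library. The sketch below flattens the computed_introduction_revs field of each record into one tuple per line range; the function name is hypothetical, and the sample is one of the records shown above.

```python
import json


def computed_introductions(ndjson_text: str):
    """Yield (vuln_filename, revision, path, start, end) for every computed line range."""
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for rev, files in record["computed_introduction_revs"].items():
            for path, details in files.items():
                for line_range in details["line_ranges"]:
                    yield (
                        record["vuln_filename"],
                        rev,
                        path,
                        line_range["start"],
                        line_range["end"],
                    )


sample = '{"vuln_filename":"osv-output/CVE-2020-4071.json","fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},"computed_introduction_revs_tag_names":[]}'

for row in computed_introductions(sample):
    print(row)
```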

We also implement some variants of it:

  • Ignoring all whitespace at the beginning or end of a line, with --tokenizer trimmed-line
  • Ignoring all whitespace, with --tokenizer whitespace-stripped-line
  • Considering all lines changed with less than N% edit distance to be the same (similar to V-SZZ), with --min-line-similarity-ratio 0.75 (for 75%, like V-SZZ)
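
The effect of these variants on line comparison can be sketched as follows. The similarity measure used here is difflib's ratio, which stands in for the edit-distance-based ratio of the actual tool and may give different values; the function names simply mirror the option values and are otherwise hypothetical.

```python
from difflib import SequenceMatcher


def trimmed_line(line: str) -> str:
    """Ignore whitespace at the beginning and end of a line."""
    return line.strip()


def whitespace_stripped_line(line: str) -> str:
    """Ignore all whitespace in a line."""
    return "".join(line.split())


def similar_enough(a: str, b: str, min_ratio: float = 0.75) -> bool:
    """Treat two lines as the same when their similarity ratio reaches min_ratio."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= min_ratio


before = 'SET (CARES_LIB_VERSIONINFO "7:0:5")'
after = 'SET (CARES_LIB_VERSIONINFO "7:1:5")'
print(trimmed_line("\tfoo(bar)  ") == trimmed_line("foo(bar)"))
print(whitespace_stripped_line("foo( bar )") == whitespace_stripped_line("foo(bar)"))
print(similar_enough(before, after))
```

With a similarity threshold, a version bump like the one above counts as a non-significant change, so the blame continues past the commit that made it.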

We also provide optional verbose output. --line-details-out-dir data/szz-line-details/ enables ndjson output of the provenance of each line in this format:

pub struct SzzLineDetailsRecord {
    pub fix_rev: RevisionRecord,
    pub path: String,
    pub vulnerable_hunk_before_fix: HunkRecord,
    /// Introduction rev computed by SZZ
    pub introduction_rev: RevisionRecord,
    pub vulnerable_hunk_after_introduction: HunkRecord,
    pub hunk_before_introduction: Option<HunkRecord>,
    pub file_creation: RapidHashSet<RevisionRecord>,
    /// number of blame steps from the fix rev to find the introduction rev
    ///
    /// It is found through a DFS or BFS, so it is greater or equal to `intro_to_fix_rev_distance`.
    pub intro_to_fix_rev_num_blame_steps: u64,
    /// number of revisions between introduction_rev and fix_rev (shortest path)
    pub intro_to_fix_rev_distance: u64,
    /// number of revisions between file_creation_rev and introduction_rev (shortest path)
    pub creation_to_intro_rev_distance: u64,
    /// number of revisions between file_creation_rev and fix_rev (shortest path, usually equal
    /// to `intro_to_fix_rev_distance + creation_to_intro_rev_distance` but may be smaller)
    pub creation_to_fix_rev_distance: u64,
}

pub struct HunkRecord {
    pub hunk: String,
    pub line_range: LineRange,
}

pub struct RevisionRecord {
    pub swhid: SWHID,
    pub author_timestamp: Option<i64>,
    pub committer_timestamp: Option<i64>,
}

For example:

{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":752,"end":753}},"introduction_rev":{"swhid":"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967","author_timestamp":1599023421,"committer_timestamp":1599023421},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":589,"end":590}},"hunk_before_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equals\",\n","line_range":{"start":589,"end":590}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":146,"creation_to_intro_rev_distance":31,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":758,"end":759}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":759,"end":760}},"hunk_before_introduction":{"hunk":"","line_range":{"start":759,"end":759}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":764,"end":765}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":765,"end":766}},"hunk_before_introduction":{"hunk":"","line_range":{"start":765,"end":765}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":95,"end":96}},"introduction_rev":{"swhid":"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c","author_timestamp":1581777236,"committer_timestamp":1581777236},"vulnerable_hunk_after_introduction":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":94,"end":95}},"hunk_before_introduction":{"hunk":"# Home Assistant CLI\n","line_range":{"start":94,"end":95}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":237,"creation_to_intro_rev_distance":399,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":96,"end":97}},"introduction_rev":{"swhid":"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13","author_timestamp":1543786701,"committer_timestamp":1543786701},"vulnerable_hunk_after_introduction":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":97,"end":98}},"hunk_before_introduction":{"hunk":"","line_range":{"start":97,"end":97}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":475,"creation_to_intro_rev_distance":161,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"\n","line_range":{"start":97,"end":98}},"introduction_rev":{"swhid":"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f","author_timestamp":1577619683,"committer_timestamp":1577619683},"vulnerable_hunk_after_introduction":{"hunk":"\n","line_range":{"start":96,"end":97}},"hunk_before_introduction":{"hunk":"","line_range":{"start":96,"end":96}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":248,"creation_to_intro_rev_distance":388,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":"  <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":"  <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"hunk_before_introduction":{"hunk":"  <version>2.6.7.4</version>\n","line_range":{"start":12,"end":13}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":"    <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":"    <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"hunk_before_introduction":{"hunk":"    <tag>jackson-databind-2.6.7.4</tag>\n","line_range":{"start":23,"end":24}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":"        if !config_file_path.exists() {\n            log::error!(\"{:#?} doesn't exist\", config_file_path);\n            exit(1);\n","line_range":{"start":96,"end":99}},"introduction_rev":{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411},"vulnerable_hunk_after_introduction":{"hunk":"    if !config_file_path.exists() {\n        log::error!(\"{:#?} doesn't exist\", config_file_path);\n        exit(1);\n","line_range":{"start":31,"end":34}},"hunk_before_introduction":null,"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":156,"creation_to_intro_rev_distance":0,"creation_to_fix_rev_distance":156}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":"        }\n\n","line_range":{"start":99,"end":101}},"introduction_rev":{"swhid":"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8","author_timestamp":1644118333,"committer_timestamp":1644118333},"vulnerable_hunk_after_introduction":{"hunk":"    }\n\n","line_range":{"start":30,"end":32}},"hunk_before_introduction":{"hunk":"","line_range":{"start":30,"end":30}},"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":107,"creation_to_intro_rev_distance":51,"creation_to_fix_rev_distance":156}

and when --min-line-similarity-ratio is given, --middle-revisions-out-dir enables CSV output listing all revisions that made a non-significant change to a line, in this format:

pub struct MiddleCommitRecord {
    /// parent revision of `predecessor_rev`
    pub rev: SWHID,
    pub rev_author_timestamp: Option<i64>,
    pub rev_committer_timestamp: Option<i64>,
    /// middle rev touching the line
    pub predecessor_rev: SWHID,
    pub predecessor_rev_author_timestamp: Option<i64>,
    pub predecessor_rev_committer_timestamp: Option<i64>,
    pub hunk_id: String,
    /// fix rev
    pub ancestor_rev: SWHID,
    pub ancestor_rev_author_timestamp: Option<i64>,
    pub ancestor_rev_committer_timestamp: Option<i64>,
    /// always "V-SZZ_middle_revision"
    pub tag: String,
    pub code_before_id: SWHID,
    pub code_after_id: SWHID,
    pub file_path: String,
    /// entire file at 'rev'
    pub code_before: String,
    /// entire file at 'predecessor_rev'
    pub code_after: String,
}

For example:

rev,rev_author_timestamp,rev_committer_timestamp,predecessor_rev,predecessor_rev_author_timestamp,predecessor_rev_committer_timestamp,hunk_id,ancestor_rev,ancestor_rev_author_timestamp,ancestor_rev_committer_timestamp,tag,code_before,code_after
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line,"
","
"
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line,"	fmt.Fprintf(w, `
","	fmt.Fprintf(w, `
"
swh:1:rev:71adb3c4170dc47f71c21bf8d95ed7ddd640819e,1635286588,1635286588,swh:1:rev:d9492ec19b76aca2b13e18131fe46078810984af,1635287082,1635287082,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""7:0:5"")
","SET (CARES_LIB_VERSIONINFO ""7:1:5"")
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.18.0"" )
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""6:3:4"")
","SET (CARES_LIB_VERSIONINFO ""7:0:5"")
"
swh:1:rev:11a2bf8efd88d961f3b2c5dea04b09b4af247bce,1625070329,1625070329,swh:1:rev:fe282cf172c63f2bca21e8fda50a318cad4a7c69,1626972694,1626972694,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.0"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
"

SZZ-related diffs

We also have support for producing diffs of all revisions mentioned by any of SZZ's outputs. The corresponding diffs for the recommended SZZ variants can be computed with:

make data/szz-diffs.tar.zst
make data/vszz90-diffs.tar.zst
make data/vszz75-diffs.tar.zst

See the Makefile for details.

Customizing SZZ

The SZZ implementation (SzzProcessor) is parametrized by multiple types, which can be provided by users:

  • StrategyFactory, which returns instances of Strategy, which themselves compute:
    • given a version range from an OSV document the fix revision to start from (NaiveStrategy returns the "fix" events)
    • from a list of diffs, the set of vulnerable hunks (NaiveStrategy returns all deleted/modified hunks)
  • RevisionSkipper which takes as input a revision and its parent (and the two different versions of a file in each), and returns, if the revision should be skipped, a mapping from lines in the revision to lines in the parent. RevisionSkipper never returns anything (ie. it skips no revision)
  • Tokenizer which takes as input a version of a file, and returns its lines with customizable comparison implementations.

Data formats

This package produces files in various formats:

  • all.sqlite: a database with verbatim OSV documents plus some indexes, and an integer id for each document. See swh/osv/to_sqlite.py for the exact schema
  • connected_components.wccs: a renumbering of revisions connected to any vulnerable commit. This allows revisions to be identified by a small integer (in the range [0; 300M]) instead of being a sparse subset of all node ids in the graph ([0; 60G]). This is an epserde serialization of swh_graph_stdlib::connectivity::SubgraphWccs, which is based on an Elias-Fano sequence. It also identifies which connected component a revision belongs to, which is useful to identify cherry-picks.
  • commit2vuln_without_cherrypicks.*: a map from small revision id to id of a document in sqlite, using the BVGraph format (note: this is not actually a graph, it just reuses BVGraph as a generic map from integers to set of integers). It is built directly from the introduced and fixed information in OSV documents using graph traversals
  • commit2vuln_with_cherrypicks.*: same as commit2vuln_without_cherrypicks.*, but enriches the sets of introduced and fixed events by mining commit messages for cherry-pick information. It does so by considering that any cherry-pick of an introducing (resp. fixing) commit is also an introducing (resp. fixing) commit, transitively.
  • commit2vuln_without_cherrypicks/*.parquet and commit2vuln_using_cherrypicks/*.parquet: same as above, but designed for portability at the expense of query time and file size
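
The transitive enrichment described for commit2vuln_with_cherrypicks.* can be sketched on plain Python sets. The input mapping (from a cherry-pick commit to the commits it was picked from) and all names are assumptions for illustration; the actual files use the BVGraph format and small integer ids.

```python
from collections import deque


def enrich_with_cherry_picks(seed_commits, cherry_pick_of):
    """Add every commit that is, transitively, a cherry-pick of a seed commit.

    cherry_pick_of maps a cherry-pick commit to the commits it was picked from.
    """
    # Invert the mapping: original commit -> commits that are cherry-picks of it.
    picked_as = {}
    for pick, originals in cherry_pick_of.items():
        for original in originals:
            picked_as.setdefault(original, set()).add(pick)

    result = set(seed_commits)
    queue = deque(result)
    while queue:
        commit = queue.popleft()
        for pick in picked_as.get(commit, ()):
            if pick not in result:
                result.add(pick)
                queue.append(pick)
    return result


# Made-up example: b is a cherry-pick of a, c is a cherry-pick of b.
cherry_pick_of = {"b": ["a"], "c": ["b"], "d": ["x"]}
print(sorted(enrich_with_cherry_picks({"a"}, cherry_pick_of)))
```

The same function applies to introducing and to fixing commits, each with its own seed set.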
