Research project to analyze the OSV database
osv analysis
This repository mines data from vulnerability databases in the OSV format, looking to map vulnerabilities to software versions they affect.
The main features are:
- computing stats
- exhaustively listing which revisions are affected by a vulnerability, and vice versa
- an experimental implementation of the SZZ algorithm, see "Computing introducing commits" below.
Preliminaries
- Install dependencies, get data files, and index them:
  pip3 install -e .
  make all.sqlite
- Run swh-graph-grpc-serve on localhost (or port-forward it, e.g. ssh maxxi.internal.softwareheritage.org -L 50091:localhost:50091).
Generate all reports:
make
Data sources
- https://osv-vulnerabilities.storage.googleapis.com/all.zip
- https://storage.googleapis.com/cve-osv-conversion/index.html?prefix=osv-output/
Stats
Publications per year start at 120 in 2003 and increase exponentially to 10k in 2024.
The number of reports last modified in a given year is low and erratic from 2003 to 2012-2013, then increases exponentially from 100 to 10k in 2024.
10k reports are published and last modified on the same day. The count then decreases exponentially from that value, down to 100 reports last modified 8000 days after their publication. A small number of reports are also last modified before their publication, with an exponential decrease to 100 in 100 days.
'affected' items
Each OSV document lists what software packages (which can be real packages, VCS repositories, or projects in the abstract) are affected.
Each 'affected' entry lists events for that affected package, which roughly map to when the vulnerability was introduced and fixed. As VCS commits and versions are not linear, there can be many of these.
Number of event types per database
Events are grouped by range (usually only two per range), and ranges can have three types:
- GIT when events are associated with Git commits
- SEMVER when they are associated with regular X.Y.Z versions with the expected semantics
- ECOSYSTEM for everything else, with no defined semantics. Due to the lack of semantics, the OSV spec recommends that reports explicitly list all affected versions in this case. The graph below breaks ECOSYSTEM ranges into two, based on whether they follow this recommendation
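As an illustration of the breakdown described above, here is a minimal sketch that tallies the ranges of one OSV document by type. The field names (`affected`, `ranges`, `type`, `versions`) follow the OSV schema; the function name, the `ECOSYSTEM_with_versions` / `ECOSYSTEM_without_versions` labels, and the sample document are hypothetical and are not taken from this repository.

```python
from collections import Counter


def count_range_types(osv_document):
    """Count ranges by type, splitting ECOSYSTEM ranges on whether the
    enclosing 'affected' entry enumerates its affected versions."""
    counts = Counter()
    for affected in osv_document.get("affected", []):
        has_versions = bool(affected.get("versions"))
        for version_range in affected.get("ranges", []):
            range_type = version_range.get("type", "UNKNOWN")
            if range_type == "ECOSYSTEM":
                # Hypothetical labels, used only for this illustration.
                suffix = "with_versions" if has_versions else "without_versions"
                range_type = f"ECOSYSTEM_{suffix}"
            counts[range_type] += 1
    return counts


# Made-up document with three 'affected' entries.
sample = {
    "affected": [
        {
            "package": {"ecosystem": "PyPI", "name": "example"},
            "ranges": [{"type": "ECOSYSTEM",
                        "events": [{"introduced": "0"}, {"fixed": "1.2.0"}]}],
            "versions": ["1.0.0", "1.1.0"],
        },
        {
            "ranges": [{"type": "GIT", "repo": "https://example.org/repo.git",
                        "events": [{"introduced": "0"}]}],
        },
        {
            "package": {"ecosystem": "npm", "name": "example"},
            "ranges": [{"type": "ECOSYSTEM",
                        "events": [{"introduced": "0"}]}],
        },
    ]
}

print(dict(count_range_types(sample)))
```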
Identifiers
The OSV spec describes identifiers as
"a string of the format <DB>-<ENTRYID>, where DB names the database and ENTRYID is in the format used by the database".
::include{file=output/identifiers.md}
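A minimal sketch of splitting an identifier along the format quoted above. It assumes the database name is everything before the first dash, which may not hold for databases whose prefix itself contains a dash; the function name is hypothetical.

```python
def split_identifier(identifier):
    """Split '<DB>-<ENTRYID>' at the first dash (an assumption, see above)."""
    db, separator, entry_id = identifier.partition("-")
    if not separator or not db or not entry_id:
        raise ValueError(f"not of the form <DB>-<ENTRYID>: {identifier!r}")
    return db, entry_id


print(split_identifier("CVE-2021-25313"))
```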
Mapping software packages to SWH objects
Mapping package names to SWH origins
Each OSV document lists what software packages (which can be real packages, VCS repositories, or projects in the abstract) are affected.
Each of these software packages is identified by an ecosystem (e.g. OSS-Fuzz, npm, PyPI, or Ubuntu).
Mapping GIT affected entries to origins and SWHIDs in SWH
Each software package is associated with a list of version ranges that are affected by the vulnerability, and each version range is made of events (that mark when a vulnerability is introduced and then fixed).
Version ranges can be associated with Git commits (for software packages of type GIT),
version numbers (for software packages of type SEMVER),
or opaque/ecosystem-specific strings (type ECOSYSTEM).
In this section, we look only at software packages of type GIT and count how many of them we can find in SWH (using the origin URL matching described below), and how many of their commits were in the 2025-05-18 graph.
::include{file=output/git_origins.md}
Mapping non-GIT affected entries to origins in SWH
For non-GIT affected packages, we currently only try to map to an origin URL.
This relies on OSV's specified ecosystems and knowledge of SWH's idiosyncrasies. See osv/map_packages.py for details.
::include{file=output/packages.md}
Worth noting are:
- thousands of packages on NPM are not in SWH and don't seem to exist. All but a handful come from the MAL database. They are marked as "hallucinated" above.
- the OSV spec says about Maven: "The ecosystem string might optionally have a :<REMOTE-REPO-URL> suffix to denote the remote repository URL that best represents the source of truth for this package, without a trailing slash (e.g. Maven:https://maven.google.com). If this is omitted, this is assumed to be the Maven Central repository (https://repo.maven.apache.org/maven2).". Not a single report uses this suffix, even though between a quarter and a half of Maven packages are not in Maven Central but in other repositories.
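For reference, the Maven rule quoted above can be sketched as follows. This only illustrates the wording of the spec; the function name is hypothetical and this is not the code in osv/map_packages.py.

```python
MAVEN_CENTRAL = "https://repo.maven.apache.org/maven2"


def maven_repository(ecosystem):
    """Return the remote repository URL implied by a Maven ecosystem string."""
    name, separator, suffix = ecosystem.partition(":")
    if name != "Maven":
        raise ValueError(f"not a Maven ecosystem string: {ecosystem!r}")
    # Without a suffix, the spec says to assume Maven Central.
    return suffix if separator and suffix else MAVEN_CENTRAL


print(maven_repository("Maven:https://maven.google.com"))
print(maven_repository("Maven"))
```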
Cherry-picks stats
2024-05-16-history-hosting graph
247970180 commits are in the 7079 connected components that contain any of the introduced, fixed, last_affected, or limit commits mentioned by a vulnerability report.
Of the commits mentioned by vulnerability reports:
- 243008413 are not deemed to be cherry-picks, though 300632 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
- 4961689 have at least one valid "cherry picked from commit" stanza, 236022 have at least two. Of those stanzas:
  - 293315 reference unknown commits and don't have a repo URL
  - 0 reference unknown commits and a non-UTF8 repo URL
  - 0 reference unknown repo URLs (ie. these were origins unknown to SWH at the time of graph export)
  - 1640 reference unknown commits and a known repo URL (ie. these commits were unknown at the time of the export, but probably will be in the future)
  - 55449 claim to be cherry-picks of commits in different connected components
  - 5005620 reference known commits
  - 4989284 are cherry-picks of known commits
Of the 4700546 cherry-pick commits, 166743 are cherry-picks of multiple commits. Of the latter, 141972 can be reduced to cherry-picks of a single commit, which is itself a cherry-pick of one (or more). 805 can only be partially reduced through that process.
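To make the terminology concrete, here is a minimal sketch of mining "cherry picked from commit" stanzas and following chains of cherry-picks back to their source. It assumes that a valid stanza sits alone on its line and names a 40-hex-digit commit id (the form written by `git cherry-pick -x`); the criteria actually used in this repository may differ, and both function names are hypothetical.

```python
import re

# Assumption: a valid stanza occupies a whole line of the commit message.
STANZA = re.compile(r"^\(cherry picked from commit ([0-9a-f]{40})\)$", re.MULTILINE)


def cherry_pick_sources(message):
    """Return the commit ids named by the stanzas of a commit message."""
    return STANZA.findall(message)


def resolve_original(commit, sources_by_commit):
    """Follow cherry-picks of a single commit until reaching a commit that is
    not a cherry-pick, or is a cherry-pick of several commits."""
    seen = set()
    while commit not in seen:  # guards against cycles in malformed data
        seen.add(commit)
        sources = sources_by_commit.get(commit, [])
        if len(sources) != 1:
            break
        commit = sources[0]
    return commit


a, b, c = "a" * 40, "b" * 40, "c" * 40
message = f"Fix overflow\n\n(cherry picked from commit {a})\n"
print(cherry_pick_sources(message))
print(resolve_original(c, {c: [b], b: [a]}) == a)
```

A message that merely quotes a stanza in the middle of a line (for example in the subject of a revert) is not matched by this pattern, which mirrors the distinction made above between mentioning the keywords and having a valid stanza.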
2025-05-18 graph
293340976 commits are in the 7767 connected components that contain any of the introduced, fixed, last_affected, or limit commits mentioned by a vulnerability report.
Of the commits mentioned by vulnerability reports:
- 287043476 are not deemed to be cherry-picks, though 346120 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
- 6296974 have at least one valid "cherry picked from commit" stanza, 303125 have at least two. Of those stanzas:
  - 348192 reference unknown commits and don't have a repo URL
  - 0 reference unknown commits and a non-UTF8 repo URL
  - 0 reference unknown repo URLs (ie. these were origins unknown to SWH at the time of graph export)
  - 1998 reference unknown commits and a known repo URL (ie. these commits were unknown at the time of the export, but probably will be in the future)
  - 70117 claim to be cherry-picks of commits in different connected components
  - 6372854 reference known commits
  - 6354701 are cherry-picks of known commits
Algorithms
Origin URL matching
Origin URLs in SWH are case-sensitive, but in OSV reports they are usually lowercased.
Given a Git repository URL from an OSV report, we try to match it to a SWH origin URL this way:
- if there is an exact match in swh-graph, return it
- remove .git at the end. If there is an exact match in swh-graph, return it (this catches 20% of the origins that failed the exact match)
- send a request to https://archive.softwareheritage.org/api/1/origin/search/{url}?limit=10. If any result matches the URL (case-insensitive and with the .git suffix stripped from both), return it
- fail
Other origins are normalized similarly, with an addition for PyPI projects: OSV reports frequently mangle PyPI project names by using dots, underscores, and dashes interchangeably, so we consider those equivalent (like PyPI does) in step 3.
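The matching steps above can be sketched as follows. This is an illustration only: `known_origins` stands in for the lookup in swh-graph and `search` for the call to the origin search API, both passed in as plain Python objects so that the example runs offline. All names are hypothetical; the PyPI helper uses the name normalization rule from PEP 503.

```python
import re


def strip_git_suffix(url):
    return url[:-4] if url.endswith(".git") else url


def match_origin(url, known_origins, search):
    """Return the matching origin URL, or None if all steps fail."""
    # Step 1: exact match.
    if url in known_origins:
        return url
    # Step 2: exact match after removing a trailing ".git".
    stripped = strip_git_suffix(url)
    if stripped in known_origins:
        return stripped
    # Step 3: compare search results case-insensitively, ignoring ".git".
    wanted = strip_git_suffix(url.lower())
    for candidate in search(url):
        if strip_git_suffix(candidate.lower()) == wanted:
            return candidate
    # Step 4: fail.
    return None


def normalize_pypi_name(name):
    """Treat runs of dots, underscores, and dashes as equivalent (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


known = {"https://github.com/Foo/Bar"}
fake_search = lambda query: ["https://github.com/foo/barbaz", "https://github.com/Foo/Bar"]
print(match_origin("https://github.com/foo/bar.git", known, fake_search))
print(normalize_pypi_name("Foo_Bar.baz"))
```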
(computing-introducing-commits)=
Computing introducing commits
We implement the SZZ algorithm to compute, from a known commit fixing a vulnerability, which commits may have introduced it. Fundamentally, it works this way:
- Look up the parent of the fixing commit, and compute the diff between it and the fixing commit
- Compute a git-blame of the parent of the fixing commit
- For each line removed (or modified) by the fixing commit, look up in the git-blame which commit introduced it
- Take the union of the commits introducing these lines
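The four steps above can be illustrated on a toy example. The sketch below is not the Rust implementation from this repository: it assumes a linear history of a single file, represented as a list of `(commit_id, lines)` pairs whose last entry is the fixing commit, and it uses Python's `difflib` for both the diff and the blame. All names are hypothetical.

```python
from difflib import SequenceMatcher


def blame(history):
    """For each line of the last version, return the commit that introduced it."""
    owners, previous = [], []
    for commit_id, lines in history:
        new_owners = [commit_id] * len(lines)
        matcher = SequenceMatcher(a=previous, b=lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep the commit that first introduced them.
                new_owners[j1:j2] = owners[i1:i2]
        owners, previous = new_owners, lines
    return owners


def szz(history):
    """Return the commits blamed for the lines removed or modified by the fix."""
    *before_fix, (_, fixed_lines) = history
    parent_lines = before_fix[-1][1]
    parent_blame = blame(before_fix)          # blame of the fix's parent
    introducing = set()
    matcher = SequenceMatcher(a=parent_lines, b=fixed_lines, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("delete", "replace"):      # lines removed or modified by the fix
            introducing.update(parent_blame[i1:i2])
    return introducing


history = [
    ("c1", ["a = read()", "use(a)"]),
    ("c2", ["a = read()", "b = a * 2", "use(b)"]),
    ("c3", ["a = read()", "log(a)", "b = a * 2", "use(b)"]),
    ("fix", ["a = read()", "log(a)", "b = check(a) * 2", "use(b)"]),
]
print(szz(history))
```

Here the fix only modifies the line `b = a * 2`, which was added by `c2`, so `c2` is the only commit reported.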
To compute all recommended variants, use:
make data/szz.tar     # standard SZZ, fast
make data/vszz90.tar  # V-SZZ with similarity ratio 90%
make data/vszz75.tar  # V-SZZ, slow
Minimally it can be run with:
cargo run --release --bin szz -- \
--graph $(GRAPH_PATH) \
--digestmap $(DIGESTMAP_PATH) \
--url gzip:https://softwareheritage.s3.amazonaws.com/content/{sha1} \
--url plain:https://archive.softwareheritage.org/api/1/content/sha1:{sha1}/raw/ \
--db all.sqlite \
--content-cache data/contents.rocksdb/ \
--max-tree-diff-items 100 \
--vuln-out-dir data/szz/
which produces in data/szz/ ndjson (newline-delimited json) files in this format:
pub struct SzzOutputRecord {
    /// Filename of the vulnerability report in the OSV database
    pub vuln_filename: String,
    /// Revisions that are a fix according to the vulnerability report
    pub fix_revs: Vec<SWHID>,
    /// Union of tag names of each revision in `fix_revs_id`/`fix_revs_swhid`.
    ///
    /// For each revision in `fix_revs_swhid`, takes the name of any release that points
    /// **directly** to it, or the name of a `refs/tags/` branch that points **directly** to it.
    pub fix_revs_tag_names: Vec<String>,
    /// Whether the vulnerability report claims the vulnerability is present since the beginning of
    /// the project
    ///
    /// ie. the event `{ "introduced": "0" }` is present.
    ///
    /// This is usually a lie, and actually means the author of the vulnerability report does not
    /// know when the vulnerability was introduced.
    pub introduced_at_zero: bool,
    /// Revisions that introduced the vulnerability according to the vulnerability report
    pub known_introduction_revs: Vec<SWHID>,
    /// Union of tag names of each revision in
    /// `known_introduction_revs_id`/`known_introduction_revs_swhid`.
    pub known_introduction_revs_tag_names: Vec<String>,
    /// Revisions that introduced the vulnerability according to SZZ,
    /// and for each path, which lines in it are considered to be the introduction
    pub computed_introduction_revs: HashMap<SWHID, IntroductionFilesRecord>,
    /// Union of tag names of each revision in
    /// `computed_introduction_revs_id`/`computed_introduction_revs_swhid`.
    pub computed_introduction_revs_tag_names: Vec<String>,
}
For example:
{"vuln_filename":"osv-output/CVE-2021-25313.json","fix_revs":["swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274"],"fix_revs_tag_names":["v2.5.6","v2.5.6-rc9"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":765,"end":766},{"start":759,"end":760}]}},"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":589,"end":590}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-45099.json","fix_revs":["swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce"],"fix_revs_tag_names":["v10.0.0"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":94,"end":95}]}},"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":97,"end":98}]}},"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":96,"end":97}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-36179.json","fix_revs":["swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3"],"fix_revs_tag_names":["jackson-databind-2.6.7.5"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:e8df0987e3034d102ee6d704d30a05a2e3ac7089"],"known_introduction_revs_tag_names":["jackson-databind-2.0.0"],"computed_introduction_revs":{"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c":{"pom.xml":{"line_ranges":[{"start":23,"end":24},{"start":12,"end":13}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2022-27818.json","fix_revs":["swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c":{"src/daemon.rs":{"line_ranges":[{"start":31,"end":34}]}},"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8":{"src/daemon.rs":{"line_ranges":[{"start":30,"end":32}]}},"swh:1:rev:6097674e18e2e34f68b340a40d36dcd23c258d00":{"src/daemon.rs":{"line_ranges":[{"start":307,"end":308}]},"src/server.rs":{"line_ranges":[{"start":14,"end":15}]}},"swh:1:rev:b4e6dc76f4845ab03104187a42ac6d1bbc1e0021":{"src/daemon.rs":{"line_ranges":[{"start":408,"end":409},{"start":404,"end":405}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2018-16846.json","fix_revs":["swh:1:rev:b10be4d44915a4d78a8e06aa31919e74927b142e"],"fix_revs_tag_names":["v13.2.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279":{"CMakeLists.txt":{"line_ranges":[{"start":3,"end":4}]}}},"computed_introduction_revs_tag_names":["v13.2.3"]}
{"vuln_filename":"osv-output/CVE-2020-4071.json","fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-44108.json","fix_revs":["swh:1:rev:d919b2744cd05abae043490f0a3dd1946c1ccb8c"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:235a041b8d7638db931114ace49e4f771508830f":{"src/amf/namf-handler.c":{"line_ranges":[{"start":202,"end":204}]}},"swh:1:rev:d0673e3066ff14ce2d965b436ccb9b3646a38705":{"lib/sbi/message.c":{"line_ranges":[{"start":467,"end":468}]}},"swh:1:rev:c9363b132093581b6fd2ce794aa63cd597bf83a6":{"src/amf/namf-handler.c":{"line_ranges":[{"start":172,"end":173}]}},"swh:1:rev:dbee687a75797e0be5f8484030d11ea22e18b63c":{"lib/sbi/message.c":{"line_ranges":[{"start":1325,"end":1326},{"start":1391,"end":1392},{"start":1499,"end":1501},{"start":1328,"end":1330},{"start":1357,"end":1358},{"start":1334,"end":1336}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2016-6306.json","fix_revs":["swh:1:rev:848d650dade802c835b4b3a1e29c7581e79494ed"],"fix_revs_tag_names":["v0.10.47"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:163ca274230fce536afe76c64676c332693ad7c1"],"known_introduction_revs_tag_names":["v0.10.0"],"computed_introduction_revs":{"swh:1:rev:3e711f14ae7db34350fcc5b1d7ffd4a8cfc2daef":{"src/node_version.h":{"line_ranges":[{"start":28,"end":29}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-25706.json","fix_revs":["swh:1:rev:39458efcd5286d50e6b7f905fedcdc1059354e6e"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:20a1b073420eeecab9eca2f1a8d86f30f81c2f23":{"lib/import.php":{"line_ranges":[{"start":983,"end":984}]}},"swh:1:rev:0ba5711f09338a7019ed5622701a7effd83ba701":{"lib/import.php":{"line_ranges":[{"start":1756,"end":1757}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2024-43440.json","fix_revs":["swh:1:rev:aea9770cfc7d003a737f4899489d1e3982efe9ac"],"fix_revs_tag_names":["v4.1.12"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:44305df587ad156e6ccc8495bfbcd45e45370c23":{"version.php":{"line_ranges":[{"start":31,"end":32},{"start":34,"end":35}]}}},"computed_introduction_revs_tag_names":[]}
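A minimal sketch of consuming these files from Python, assuming only the record layout shown above. The function name is hypothetical, the sample record is copied from the example output, and the sketch does not assume whether the `end` of a line range is inclusive or exclusive.

```python
import json


def introductions(ndjson_lines):
    """Yield (report, introducing revision, path, start, end) tuples."""
    for line in ndjson_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for rev, files in record["computed_introduction_revs"].items():
            for path, details in files.items():
                for line_range in details["line_ranges"]:
                    yield (record["vuln_filename"], rev, path,
                           line_range["start"], line_range["end"])


RECORD = (
    '{"vuln_filename":"osv-output/CVE-2020-4071.json",'
    '"fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],'
    '"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,'
    '"known_introduction_revs":[],"known_introduction_revs_tag_names":[],'
    '"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":'
    '{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},'
    '"computed_introduction_revs_tag_names":[]}'
)

for row in introductions([RECORD]):
    print(row)
```

With real output, the same function can be fed an open file object, since iterating over a text file yields one line at a time.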
We also implement some variants of it:
- Ignoring all whitespace at the beginning or end of a line, with --tokenizer trimmed-line
- Ignoring all whitespace, with --tokenizer whitespace-stripped-line
- Considering all lines changed with less than N% edit distance to be the same (similar to V-SZZ), with --min-line-similarity-ratio 0.75 (for 75%, like V-SZZ)
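The effect of these variants on line comparison can be illustrated as follows. This is only a sketch: the helper names are hypothetical, and `difflib.SequenceMatcher.ratio()` is used as a stand-in similarity measure, which is not necessarily the edit-distance ratio computed by the Rust implementation.

```python
from difflib import SequenceMatcher

ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def trimmed_key(line):
    """Comparison key ignoring leading and trailing ASCII whitespace."""
    return line.strip(ASCII_WHITESPACE)


def stripped_key(line):
    """Comparison key ignoring all ASCII whitespace."""
    return "".join(ch for ch in line if ch not in ASCII_WHITESPACE)


def similar(old_line, new_line, min_ratio=0.75):
    """Stand-in for the similarity test: True when the two lines are close
    enough to be treated as the same line."""
    ratio = SequenceMatcher(None, old_line, new_line, autojunk=False).ratio()
    return ratio >= min_ratio


print(trimmed_key("  foo(bar) \n") == trimmed_key("foo(bar)"))
print(stripped_key("foo( bar )") == stripped_key("foo(bar)"))
print(similar('SET (CARES_LIB_VERSIONINFO "7:0:5")',
              'SET (CARES_LIB_VERSIONINFO "7:1:5")'))
```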
We also provide optional verbose output. --line-details-out-dir data/szz-line-details/ enables ndjson output of the provenance of each line in this format:
pub struct SzzLineDetailsRecord {
    pub fix_rev: RevisionRecord,
    pub path: String,
    pub vulnerable_hunk_before_fix: HunkRecord,
    /// Introduction rev computed by SZZ
    pub introduction_rev: RevisionRecord,
    pub vulnerable_hunk_after_introduction: HunkRecord,
    pub hunk_before_introduction: Option<HunkRecord>,
    pub file_creation: RapidHashSet<RevisionRecord>,
    /// number of blame steps from the fix rev to find the introduction rev
    ///
    /// It is found through a DFS or BFS, so it is greater or equal to `intro_to_fix_rev_distance`.
    pub intro_to_fix_rev_num_blame_steps: u64,
    /// number of revisions between introduction_rev and fix_rev (shortest path)
    pub intro_to_fix_rev_distance: u64,
    /// number of revisions between file_creation_rev and introduction_rev (shortest path)
    pub creation_to_intro_rev_distance: u64,
    /// number of revisions between file_creation_rev and fix_rev (shortest path, usually equal
    /// to `intro_to_fix_rev_distance + creation_to_intro_rev_distance` but may be smaller)
    pub creation_to_fix_rev_distance: u64,
}

pub struct HunkRecord {
    pub hunk: String,
    pub line_range: LineRange,
}

pub struct RevisionRecord {
    pub swhid: SWHID,
    pub author_timestamp: Option<i64>,
    pub committer_timestamp: Option<i64>,
}
For example:
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":752,"end":753}},"introduction_rev":{"swhid":"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967","author_timestamp":1599023421,"committer_timestamp":1599023421},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":589,"end":590}},"hunk_before_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equals\",\n","line_range":{"start":589,"end":590}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":146,"creation_to_intro_rev_distance":31,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":758,"end":759}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":759,"end":760}},"hunk_before_introduction":{"hunk":"","line_range":{"start":759,"end":759}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":764,"end":765}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":765,"end":766}},"hunk_before_introduction":{"hunk":"","line_range":{"start":765,"end":765}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":95,"end":96}},"introduction_rev":{"swhid":"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c","author_timestamp":1581777236,"committer_timestamp":1581777236},"vulnerable_hunk_after_introduction":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":94,"end":95}},"hunk_before_introduction":{"hunk":"# Home Assistant CLI\n","line_range":{"start":94,"end":95}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":237,"creation_to_intro_rev_distance":399,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":96,"end":97}},"introduction_rev":{"swhid":"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13","author_timestamp":1543786701,"committer_timestamp":1543786701},"vulnerable_hunk_after_introduction":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":97,"end":98}},"hunk_before_introduction":{"hunk":"","line_range":{"start":97,"end":97}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":475,"creation_to_intro_rev_distance":161,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"\n","line_range":{"start":97,"end":98}},"introduction_rev":{"swhid":"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f","author_timestamp":1577619683,"committer_timestamp":1577619683},"vulnerable_hunk_after_introduction":{"hunk":"\n","line_range":{"start":96,"end":97}},"hunk_before_introduction":{"hunk":"","line_range":{"start":96,"end":96}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":248,"creation_to_intro_rev_distance":388,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":" <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":" <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"hunk_before_introduction":{"hunk":" <version>2.6.7.4</version>\n","line_range":{"start":12,"end":13}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":" <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":" <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"hunk_before_introduction":{"hunk":" <tag>jackson-databind-2.6.7.4</tag>\n","line_range":{"start":23,"end":24}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":" if !config_file_path.exists() {\n log::error!(\"{:#?} doesn't exist\", config_file_path);\n exit(1);\n","line_range":{"start":96,"end":99}},"introduction_rev":{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411},"vulnerable_hunk_after_introduction":{"hunk":" if !config_file_path.exists() {\n log::error!(\"{:#?} doesn't exist\", config_file_path);\n exit(1);\n","line_range":{"start":31,"end":34}},"hunk_before_introduction":null,"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":156,"creation_to_intro_rev_distance":0,"creation_to_fix_rev_distance":156}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":" }\n\n","line_range":{"start":99,"end":101}},"introduction_rev":{"swhid":"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8","author_timestamp":1644118333,"committer_timestamp":1644118333},"vulnerable_hunk_after_introduction":{"hunk":" }\n\n","line_range":{"start":30,"end":32}},"hunk_before_introduction":{"hunk":"","line_range":{"start":30,"end":30}},"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":107,"creation_to_intro_rev_distance":51,"creation_to_fix_rev_distance":156}
and when --min-line-similarity-ratio is given, --middle-revisions-out-dir enables CSV output listing all revisions that made a non-significant change to a line, in this format:
pub struct MiddleCommitRecord {
    /// parent revision of `predecessor_rev`
    pub rev: SWHID,
    pub rev_author_timestamp: Option<i64>,
    pub rev_committer_timestamp: Option<i64>,
    /// middle rev touching the line
    pub predecessor_rev: SWHID,
    pub predecessor_rev_author_timestamp: Option<i64>,
    pub predecessor_rev_committer_timestamp: Option<i64>,
    pub hunk_id: String,
    /// fix rev
    pub ancestor_rev: SWHID,
    pub ancestor_rev_author_timestamp: Option<i64>,
    pub ancestor_rev_committer_timestamp: Option<i64>,
    /// always "V-SZZ_middle_revision"
    pub tag: String,
    pub code_before_id: SWHID,
    pub code_after_id: SWHID,
    pub file_path: String,
    /// entire file at 'rev'
    pub code_before: String,
    /// entire file at 'predecessor_rev'
    pub code_after: String,
}
For example:
rev,rev_author_timestamp,rev_committer_timestamp,predecessor_rev,predecessor_rev_author_timestamp,predecessor_rev_committer_timestamp,hunk_id,ancestor_rev,ancestor_rev_author_timestamp,ancestor_rev_committer_timestamp,tag,code_before,code_after
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line,"
","
"
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line," fmt.Fprintf(w, `
"," fmt.Fprintf(w, `
"
swh:1:rev:71adb3c4170dc47f71c21bf8d95ed7ddd640819e,1635286588,1635286588,swh:1:rev:d9492ec19b76aca2b13e18131fe46078810984af,1635287082,1635287082,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""7:0:5"")
","SET (CARES_LIB_VERSIONINFO ""7:1:5"")
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.18.0"" )
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""6:3:4"")
","SET (CARES_LIB_VERSIONINFO ""7:0:5"")
"
swh:1:rev:11a2bf8efd88d961f3b2c5dea04b09b4af247bce,1625070329,1625070329,swh:1:rev:fe282cf172c63f2bca21e8fda50a318cad4a7c69,1626972694,1626972694,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.0"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
"
SZZ-related diffs
We also have support for producing diffs of all revisions mentioned by any of SZZ's outputs. The corresponding diffs for the recommended SZZ variants can be computed with:
make data/szz-diffs.tar.zst
make data/vszz90-diffs.tar.zst
make data/vszz75-diffs.tar.zst
See the Makefile for details.
Customizing SZZ
The SZZ implementation (SzzProcessor) is parametrized by multiple types, which can be provided by users:
- StrategyFactory which returns instances of StrategyFactory which themselves compute:
  - given a version range from an OSV document the fix revision to start from (NaiveStrategy returns the "fix" events)
  - from a list of diffs, the set of vulnerable hunks (NaiveStrategy returns all deleted/modified hunks)
- RevisionSkipper which takes as input a revision and its parent (and the two different versions of a file in each), and returns, if the revision should be skipped, a mapping from lines in the revision to lines in the parent. RevisionSkipper never returns anything (ie. it skips no revision)
- Tokenizer which takes as input a version of a file, and returns its lines with customizable comparison implementations.
  - the default line_tokenizer returns lines as-is
  - TrimmedAsciiTokenizer returns lines whose comparisons are insensitive to leading and trailing ASCII spaces
  - StrippedAsciiWhitespaceTokenizer returns lines whose comparisons are insensitive to any ASCII spaces
Data formats
This package produces files in various formats:
- all.sqlite: a database with verbatim OSV documents plus some indexes, and an integer id for each document. See swh/osv/to_sqlite.py for the exact schema
- connected_components.wccs: a renumbering of revisions connected to any vulnerable commit. This allows revisions to be identified by a small integer (in the range [0; 300M]) instead of being a sparse subset of all node ids in the graph ([0; 60G]). This is an epserde serialization of swh_graph_stdlib::connectivity::SubgraphWccs, which is based on an Elias-Fano sequence. It also identifies which connected component a revision belongs to, which is useful to identify cherry-picks.
- commit2vuln_without_cherrypicks.*: a map from small revision id to id of a document in sqlite, using the BVGraph format (note: this is not actually a graph, it just reuses BVGraph as a generic map from integers to set of integers). It is built directly from the introduced and fixed information in OSV documents using graph traversals
- commit2vuln_with_cherrypicks.*: same as commit2vuln_without_cherrypicks.*, but enriches the sets of introduced and fixed events by mining commit messages for cherry-pick information. It does so by considering that any cherry-pick of an introducing (resp. fixing) commit is also an introducing (resp. fixing) commit, transitively.
- commit2vuln_without_cherrypicks/*.parquet and commit2vuln_using_cherrypicks/*.parquet: same as above, but designed for portability at the expense of query time and file size