CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first

dvc-utils

CLI for diffing DVC files, optionally passing both through another command first

Installation

pip install dvc-utils

Usage

dvc-utils --help
# Usage: dvc-utils [OPTIONS] COMMAND [ARGS]...
#
# Options:
#   --help  Show this message and exit.
#
# Commands:
#   diff  Diff a DVC-tracked file at two commits (or one commit vs. current
#         worktree), optionally passing both through another command first

The single subcommand, dvc-utils diff, is also exposed directly as dvc-diff:

dvc-diff

dvc-diff --help
# Usage: dvc-diff [OPTIONS] [exec_cmd...] <path>
#
#   Diff a file at two commits (or one commit vs. current worktree), optionally
#   passing both through `cmd` first
#
#   Examples:
#
#   dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc  # Compare the number of lines
#   (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs.
#   current commit (`HEAD^..HEAD`).
#
#   dvc-utils diff md5sum foo  # Diff the `md5sum` of `foo` (".dvc" extension is
#   optional) at HEAD (last committed value) vs. the current worktree content.
#
# Options:
#   -c, --color              Colorize the output
#   -r, --refspec TEXT       <commit 1>..<commit 2> (compare two commits) or
#                            <commit> (compare <commit> to the worktree)
#   -S, --no-shell           Don't pass `shell=True` to Python `subprocess`es
#   -U, --unified INTEGER    Number of lines of context to show (passes through
#                            to `diff`)
#   -v, --verbose            Log intermediate commands to stderr
#   -w, --ignore-whitespace  Ignore whitespace differences (pass `-w` to `diff`)
#   -x, --exec-cmd TEXT      Command(s) to execute before diffing; alternate
#                            syntax to passing commands as positional arguments
#   --help                   Show this message and exit.
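
The `-r/--refspec` option accepts either `<commit 1>..<commit 2>` or a single `<commit>`, and the second example in the help text compares HEAD to the worktree when no refspec is given. A minimal sketch of that interpretation follows; `parse_refspec` is a hypothetical helper written for illustration, not dvc-utils' actual code, and it uses `None` to stand for "the current worktree".

```python
from typing import Optional, Tuple


def parse_refspec(refspec: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Split a refspec into (before, after); after=None means "the worktree".

    Hypothetical helper that mirrors the documented `-r` forms.
    """
    if not refspec:
        # No -r given: compare HEAD to the worktree, as in the md5sum example.
        return "HEAD", None
    before, sep, after = refspec.partition("..")
    if not sep:
        # Single commit: compare that commit to the worktree.
        return refspec, None
    if not before or not after:
        raise ValueError(f"Expected <commit 1>..<commit 2>, got {refspec!r}")
    return before, after


print(parse_refspec("HEAD^..HEAD"))  # ('HEAD^', 'HEAD')
print(parse_refspec("c8ae28e"))      # ('c8ae28e', None)
print(parse_refspec())               # ('HEAD', None)
```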

Examples

Parquet file

See sample commands and output below for inspecting changes to a DVC-tracked Parquet file in a given commit.

Setup:

git clone https://github.com/hudcostreets/nj-crashes && cd nj-crashes # Clone + enter example repo
commit=c8ae28e  # Example commit that changed some DVC-tracked Parquet files
path=njdot/data/2001/NewJersey2001Accidents.pqt.dvc  # One of the changed files
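
The `$path` above is a `.dvc` pointer file, not the Parquet data itself: it is a small YAML file that records the tracked file's MD5, and DVC stores the content under that hash in its cache. The sketch below shows roughly how a pointer could be resolved to a cache path. The pointer contents are made up for illustration, `read_md5` and `cache_path` are hypothetical helpers, and the `.dvc/cache/files/md5` layout is an assumption based on DVC 3.x defaults (older versions and custom cache directories differ). It does not describe how dvc-utils resolves files internally.

```python
import re
from pathlib import PurePosixPath

# Made-up pointer-file contents for illustration only;
# the hash, size, and path are NOT taken from the example repo.
EXAMPLE_DVC_TEXT = """\
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1024
  hash: md5
  path: example.pqt
"""


def read_md5(dvc_text: str) -> str:
    """Return the first `md5:` value found in a .dvc file's text."""
    match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", dvc_text, re.MULTILINE)
    if match is None:
        raise ValueError("no md5 entry found")
    return match.group(1)


def cache_path(md5: str, cache_dir: str = ".dvc/cache/files/md5") -> PurePosixPath:
    """Build a cache path, assuming the layout <cache_dir>/<first 2 chars>/<rest>."""
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


print(cache_path(read_md5(EXAMPLE_DVC_TEXT)))
# .dvc/cache/files/md5/01/23456789abcdef0123456789abcdef
```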

Schema diff

Use parquet2json to observe schema changes to a Parquet file:

parquet_schema() {
    parquet2json "$1" schema
}
export -f parquet_schema
dvc-diff -r $commit^..$commit parquet_schema $path
Output:
2d1
<   OPTIONAL BYTE_ARRAY Year (STRING);
8,10d6
<   OPTIONAL BYTE_ARRAY Crash Date (STRING);
<   OPTIONAL BYTE_ARRAY Crash Day Of Week (STRING);
<   OPTIONAL BYTE_ARRAY Crash Time (STRING);
14,17c10,13
<   OPTIONAL BYTE_ARRAY Total Killed (STRING);
<   OPTIONAL BYTE_ARRAY Total Injured (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Killed (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Injured (STRING);
---
>   OPTIONAL INT64 Total Killed;
>   OPTIONAL INT64 Total Injured;
>   OPTIONAL INT64 Pedestrians Killed;
>   OPTIONAL INT64 Pedestrians Injured;
20,21c16,17
<   OPTIONAL BYTE_ARRAY Alcohol Involved (STRING);
<   OPTIONAL BYTE_ARRAY HazMat Involved (STRING);
---
>   OPTIONAL BOOLEAN Alcohol Involved;
>   OPTIONAL BOOLEAN HazMat Involved;
23c19
<   OPTIONAL BYTE_ARRAY Total Vehicles Involved (STRING);
---
>   OPTIONAL INT64 Total Vehicles Involved;
29c25
<   OPTIONAL BYTE_ARRAY Mile Post (STRING);
---
>   OPTIONAL DOUBLE Mile Post;
47,48c43,44
<   OPTIONAL BYTE_ARRAY Latitude (STRING);
<   OPTIONAL BYTE_ARRAY Longitude (STRING);
---
>   OPTIONAL DOUBLE Latitude;
>   OPTIONAL DOUBLE Longitude;
51a48
>   OPTIONAL INT64 Date (TIMESTAMP(MICROS,false));

Here we can see that various date/time columns were consolidated, and several stringly-typed columns were converted to ints, floats, and booleans.

Row diff

Diff the first row of the Parquet file above (pretty-printed as JSON using jq), before and after the given commit:

pretty_print_first_row() {
    # Print first row of Parquet file as JSON, pretty-print with jq
    parquet2json "$1" cat -l 1 | jq .
}
export -f pretty_print_first_row
dvc-diff -r $commit^..$commit pretty_print_first_row $path
Output:
2d1
<   "Year": "2001",
8,10d6
<   "Crash Date": "12/21/2001",
<   "Crash Day Of Week": "F",
<   "Crash Time": "1834",
14,17c10,13
<   "Total Killed": "0",
<   "Total Injured": "0",
<   "Pedestrians Killed": "0",
<   "Pedestrians Injured": "0",
---
>   "Total Killed": 0,
>   "Total Injured": 0,
>   "Pedestrians Killed": 0,
>   "Pedestrians Injured": 0,
20,21c16,17
<   "Alcohol Involved": "N",
<   "HazMat Involved": "N",
---
>   "Alcohol Involved": false,
>   "HazMat Involved": false,
23c19
<   "Total Vehicles Involved": "2",
---
>   "Total Vehicles Involved": 2,
29c25
<   "Mile Post": "",
---
>   "Mile Post": null,
47,48c43,44
<   "Latitude": "",
<   "Longitude": "",
---
>   "Latitude": null,
>   "Longitude": null,
51c47,48
<   "Reporting Badge No.": "830"
---
>   "Reporting Badge No.": "830",
>   "Date": "2001-12-21 18:34:00 +00:00"

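The row values reflect the schema changes above: string counts become integers, "N" flags become false, empty strings become null, and the separate date/time fields are replaced by a single Date timestamp.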
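
When a line-based diff of pretty-printed JSON gets noisy, a key-by-key summary can be easier to scan. The sketch below is one possible way to do that in Python, using a few fields copied from the output above. `summarize_row_changes` is a hypothetical helper and is not part of dvc-utils.

```python
from typing import Any, Dict, List


def summarize_row_changes(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Describe keys that were removed, changed, or added between two rows."""
    lines = []
    for key in before:
        if key not in after:
            lines.append(f"removed {key!r}")
        # Compare types too: in Python, 0 == False, but that is still a change.
        elif before[key] != after[key] or type(before[key]) is not type(after[key]):
            lines.append(f"changed {key!r}: {before[key]!r} -> {after[key]!r}")
    for key in after:
        if key not in before:
            lines.append(f"added {key!r}")
    return lines


# A few fields copied from the first-row diff above.
before = {
    "Year": "2001",
    "Total Killed": "0",
    "Alcohol Involved": "N",
    "Mile Post": "",
    "Reporting Badge No.": "830",
}
after = {
    "Total Killed": 0,
    "Alcohol Involved": False,
    "Mile Post": None,
    "Reporting Badge No.": "830",
    "Date": "2001-12-21 18:34:00 +00:00",
}

for line in summarize_row_changes(before, after):
    print(line)
# removed 'Year'
# changed 'Total Killed': '0' -> 0
# changed 'Alcohol Involved': 'N' -> False
# changed 'Mile Post': '' -> None
# added 'Date'
```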

Row count diff

parquet_row_count() {
    parquet2json "$1" rowcount
}
export -f parquet_row_count
dvc-diff -r $commit^..$commit parquet_row_count $path

This time we get no output; the given $commit didn't change the row count in the DVC-tracked Parquet file $path.
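
The pattern in all three examples is the same: run a command on each version of the file, then diff the two outputs, so identical outputs produce nothing. A rough Python sketch of that idea follows. It assumes the path is appended to the command as its final argument (consistent with the `"$1"` in the shell functions above) and uses `difflib` rather than the `diff` program, so it is only an approximation of the behavior and not dvc-utils' implementation. `run_cmd` and `diff_through_cmd` are hypothetical names.

```python
import difflib
import shlex
import subprocess
import tempfile
from pathlib import Path
from typing import List


def run_cmd(cmd: str, path: str) -> List[str]:
    """Run `cmd <path>` in a shell and return its stdout as a list of lines."""
    result = subprocess.run(
        f"{cmd} {shlex.quote(path)}",
        shell=True, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines(keepends=True)


def diff_through_cmd(cmd: str, before_path: str, after_path: str) -> str:
    """Diff the output of `cmd` on two files; "" means the outputs match."""
    before = run_cmd(cmd, before_path)
    after = run_cmd(cmd, after_path)
    return "".join(difflib.unified_diff(before, after, "before", "after"))


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp, "a.txt"), Path(tmp, "b.txt")
    a.write_text("x\ny\n")
    b.write_text("y\nx\n")
    # Same lines in a different order: sorted outputs match, so no diff.
    print(repr(diff_through_cmd("sort", str(a), str(b))))
    # First lines differ, so a diff is printed.
    print(diff_through_cmd("head -n 1", str(a), str(b)))
```

Quoting the path with `shlex.quote` keeps file names with spaces from being split by the shell.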

