Skip to main content

CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first

Project description

dvc-utils

Diff DVC files, optionally piping through other commands first.

Installation

pip install dvc-utils

Usage

dvc-utils --help
# Usage: dvc-utils [OPTIONS] COMMAND [ARGS]...
#
# Options:
#   --help  Show this message and exit.
#
# Commands:
#   diff  Diff a DVC-tracked file at two commits (or one commit vs. current
#         worktree), optionally passing both through another command first

The single subcommand, dvc-utils diff, is also exposed directly as dvc-dff:

dvc-diff

dvc-diff --help
# Usage: dvc-diff [OPTIONS] [exec_cmd...] <path>
#
#   Diff a file at two commits (or one commit vs. current worktree), optionally
#   passing both through `cmd` first
#
#   Examples:
#
#   dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc  # Compare the number of lines
#   (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs.
#   current commit (`HEAD^..HEAD`).
#
#   dvc-utils diff md5sum foo  # Diff the `md5sum` of `foo` (".dvc" extension is
#   optional) at HEAD (last committed value) vs. the current worktree content.
#
# Options:
#   -c, --color                  Colorize the output
#   -r, --refspec TEXT           <commit 1>..<commit 2> (compare two commits) or
#                                <commit> (compare <commit> to the worktree)
#   -s, --shell-executable TEXT  Shell to use for executing commands; defaults
#                                to $SHELL (/bin/bash)
#   -S, --no-shell               Don't pass `shell=True` to Python
#                                `subprocess`es
#   -U, --unified INTEGER        Number of lines of context to show (passes
#                                through to `diff`)
#   -v, --verbose                Log intermediate commands to stderr
#   -w, --ignore-whitespace      Ignore whitespace differences (pass `-w` to
#                                `diff`)
#   -x, --exec-cmd TEXT          Command(s) to execute before diffing; alternate
#                                syntax to passing commands as positional
#                                arguments
#   --help                       Show this message and exit.

Examples

Parquet

See sample commands and output below for inspecting changes to a DVC-tracked Parquet file in a given commit.

Setup:

git clone https://github.com/hudcostreets/nj-crashes && cd nj-crashes # Clone + enter example repo
commit=c8ae28e  # Example commit that changed some DVC-tracked Parquet files
path=njdot/data/2001/NewJersey2001Accidents.pqt.dvc  # One of the changed files

Schema diff

Use parquet2json to observe schema changes to a Parquet file:

parquet_schema() {
    parquet2json "$1" schema
}
export -f parquet_schema
dvc-diff -r $commit^..$commit parquet_schema $path
Output
2d1
<   OPTIONAL BYTE_ARRAY Year (STRING);
8,10d6
<   OPTIONAL BYTE_ARRAY Crash Date (STRING);
<   OPTIONAL BYTE_ARRAY Crash Day Of Week (STRING);
<   OPTIONAL BYTE_ARRAY Crash Time (STRING);
14,17c10,13
<   OPTIONAL BYTE_ARRAY Total Killed (STRING);
<   OPTIONAL BYTE_ARRAY Total Injured (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Killed (STRING);
<   OPTIONAL BYTE_ARRAY Pedestrians Injured (STRING);
---
>   OPTIONAL INT64 Total Killed;
>   OPTIONAL INT64 Total Injured;
>   OPTIONAL INT64 Pedestrians Killed;
>   OPTIONAL INT64 Pedestrians Injured;
20,21c16,17
<   OPTIONAL BYTE_ARRAY Alcohol Involved (STRING);
<   OPTIONAL BYTE_ARRAY HazMat Involved (STRING);
---
>   OPTIONAL BOOLEAN Alcohol Involved;
>   OPTIONAL BOOLEAN HazMat Involved;
23c19
<   OPTIONAL BYTE_ARRAY Total Vehicles Involved (STRING);
---
>   OPTIONAL INT64 Total Vehicles Involved;
29c25
<   OPTIONAL BYTE_ARRAY Mile Post (STRING);
---
>   OPTIONAL DOUBLE Mile Post;
47,48c43,44
<   OPTIONAL BYTE_ARRAY Latitude (STRING);
<   OPTIONAL BYTE_ARRAY Longitude (STRING);
---
>   OPTIONAL DOUBLE Latitude;
>   OPTIONAL DOUBLE Longitude;
51a48
>   OPTIONAL INT64 Date (TIMESTAMP(MICROS,false));

Here we can see that various date/time columns were consolidated, and several stringly-typed columns were converted to ints, floats, and booleans.

Row diff

Diff the first row of the Parquet file above (pretty-printed as JSON using jq), before and after the given commit:

pretty_print_first_row() {
    # Print first row of Parquet file as JSON, pretty-print with jq
    parquet2json "$1" cat -l 1 | jq .
}
export -f pretty_print_first_row
dvc-diff -r $commit^..$commit pretty_print_first_row $path
Output
2d1
<   "Year": "2001",
8,10d6
<   "Crash Date": "12/21/2001",
<   "Crash Day Of Week": "F",
<   "Crash Time": "1834",
14,17c10,13
<   "Total Killed": "0",
<   "Total Injured": "0",
<   "Pedestrians Killed": "0",
<   "Pedestrians Injured": "0",
---
>   "Total Killed": 0,
>   "Total Injured": 0,
>   "Pedestrians Killed": 0,
>   "Pedestrians Injured": 0,
20,21c16,17
<   "Alcohol Involved": "N",
<   "HazMat Involved": "N",
---
>   "Alcohol Involved": false,
>   "HazMat Involved": false,
23c19
<   "Total Vehicles Involved": "2",
---
>   "Total Vehicles Involved": 2,
29c25
<   "Mile Post": "",
---
>   "Mile Post": null,
47,48c43,44
<   "Latitude": "",
<   "Longitude": "",
---
>   "Latitude": null,
>   "Longitude": null,
51c47,48
<   "Reporting Badge No.": "830"
---
>   "Reporting Badge No.": "830",
>   "Date": "2001-12-21 18:34:00 +00:00"

This reflects the schema changes above.

Row count diff

parquet_row_count() {
    parquet2json "$1" rowcount
}
export -f parquet_row_count
dvc-diff -r $commit^..$commit parquet_row_count $path

This time we get no output; the given $commit didn't change the row count in the DVC-tracked Parquet file $path.

GZipped CSVs

Here's a "one-liner" I used in ctbk.dev, to normalize and compare headers of .csv.gz.dvc files between two commits:

# Save some `sed` substitution commands to file `seds`:
cat <<EOF >seds
s/station_//
s/latitude/lat/
s/longitude/lng/
s/starttime/started_at/
s/stoptime/ended_at/
s/usertype/member_casual/
EOF
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
r=c0..c1
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
gdno $r s3/ctbk/csvs/ | \
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
Aliases used in the pipeline:
  • gdno (git diff --name-only): list files changed in the given commit range and directory
  • pel: parallel alias that prepends an echo {} to the command
  • ddcr (dvc-diff -cr): colorized diff output, revision range $r
  • guc (gunzip -c): uncompress the .csv.gz files
  • h1 (head -n1): only examine each file's header line
  • spc (tr , $'\n'): split the header line by commas (so each column name will be on one line, for easier diffing below)
  • kq (tr -d '"'): kill quote characters (in this case, header-column name quoting changed, but I don't care about that)
  • kcr (tr -d '\r'): kill carriage returns (line endings also changed)
  • snc (sed -f 'snake_case.sed'): snake-case column names
  • sdf (sed -f): execute the sed substitution commands defined in the seds file above
  • sort: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)

Note:

  • Most of these are exported Bash functions, allowing them to be used inside the parallel command.
  • I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the seds commands).

Example output:

…
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration

This helped me see that the data update in question (c0..c1) dropped some fields (bikeid, birth_year, gender, tripduration) and added others (ride_id, rideable_type), for 202001 and later.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dvc-utils-0.0.9.tar.gz (8.0 kB view details)

Uploaded Source

Built Distribution

dvc_utils-0.0.9-py3-none-any.whl (8.8 kB view details)

Uploaded Python 3

File details

Details for the file dvc-utils-0.0.9.tar.gz.

File metadata

  • Download URL: dvc-utils-0.0.9.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for dvc-utils-0.0.9.tar.gz
Algorithm Hash digest
SHA256 4cd7769222c002dce4e49e18494128462d37c85efb164e7150dedf58e9396075
MD5 32d068396b3b80a582946989d07d92d3
BLAKE2b-256 d1479c9855b6d2c720ff83e3ec4a8649950a43cf53bb33e3c830e88b40641d58

See more details on using hashes here.

File details

Details for the file dvc_utils-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: dvc_utils-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.15

File hashes

Hashes for dvc_utils-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 3090ceb3a189a8f5d16e4d7ec528672ace8172df921ed126972945d0b54b8ecc
MD5 b52164235e8386376c44e7ac2eabec10
BLAKE2b-256 5783e54eaeb119380f784a74592b9791acd653b7cec7656da5a4990a7e7e95c0

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page