CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
Project description
dvc-utils
Diff DVC files, optionally piping through other commands first.
Installation
pip install dvc-utils
Usage
dvc-utils --help
# Usage: dvc-utils [OPTIONS] COMMAND [ARGS]...
#
# Options:
# --help Show this message and exit.
#
# Commands:
# diff Diff a DVC-tracked file at two commits (or one commit vs. current
# worktree), optionally passing both through another command first
The single subcommand, dvc-utils diff
, is also exposed directly as dvc-dff
:
dvc-diff
dvc-diff --help
# Usage: dvc-diff [OPTIONS] [exec_cmd...] <path>
#
# Diff a file at two commits (or one commit vs. current worktree), optionally
# passing both through `cmd` first
#
# Examples:
#
# dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc # Compare the number of lines
# (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs.
# current commit (`HEAD^..HEAD`).
#
# dvc-utils diff md5sum foo # Diff the `md5sum` of `foo` (".dvc" extension is
# optional) at HEAD (last committed value) vs. the current worktree content.
#
# Options:
# -c, --color Colorize the output
# -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits) or
# <commit> (compare <commit> to the worktree)
# -s, --shell-executable TEXT Shell to use for executing commands; defaults
# to $SHELL (/bin/bash)
# -S, --no-shell Don't pass `shell=True` to Python
# `subprocess`es
# -U, --unified INTEGER Number of lines of context to show (passes
# through to `diff`)
# -v, --verbose Log intermediate commands to stderr
# -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
# `diff`)
# -x, --exec-cmd TEXT Command(s) to execute before diffing; alternate
# syntax to passing commands as positional
# arguments
# --help Show this message and exit.
Examples
Parquet
See sample commands and output below for inspecting changes to a DVC-tracked Parquet file in a given commit.
Setup:
git clone https://github.com/hudcostreets/nj-crashes && cd nj-crashes # Clone + enter example repo
commit=c8ae28e # Example commit that changed some DVC-tracked Parquet files
path=njdot/data/2001/NewJersey2001Accidents.pqt.dvc # One of the changed files
Schema diff
Use parquet2json
to observe schema changes to a Parquet file:
parquet_schema() {
parquet2json "$1" schema
}
export -f parquet_schema
dvc-diff -r $commit^..$commit parquet_schema $path
Output
2d1
< OPTIONAL BYTE_ARRAY Year (STRING);
8,10d6
< OPTIONAL BYTE_ARRAY Crash Date (STRING);
< OPTIONAL BYTE_ARRAY Crash Day Of Week (STRING);
< OPTIONAL BYTE_ARRAY Crash Time (STRING);
14,17c10,13
< OPTIONAL BYTE_ARRAY Total Killed (STRING);
< OPTIONAL BYTE_ARRAY Total Injured (STRING);
< OPTIONAL BYTE_ARRAY Pedestrians Killed (STRING);
< OPTIONAL BYTE_ARRAY Pedestrians Injured (STRING);
---
> OPTIONAL INT64 Total Killed;
> OPTIONAL INT64 Total Injured;
> OPTIONAL INT64 Pedestrians Killed;
> OPTIONAL INT64 Pedestrians Injured;
20,21c16,17
< OPTIONAL BYTE_ARRAY Alcohol Involved (STRING);
< OPTIONAL BYTE_ARRAY HazMat Involved (STRING);
---
> OPTIONAL BOOLEAN Alcohol Involved;
> OPTIONAL BOOLEAN HazMat Involved;
23c19
< OPTIONAL BYTE_ARRAY Total Vehicles Involved (STRING);
---
> OPTIONAL INT64 Total Vehicles Involved;
29c25
< OPTIONAL BYTE_ARRAY Mile Post (STRING);
---
> OPTIONAL DOUBLE Mile Post;
47,48c43,44
< OPTIONAL BYTE_ARRAY Latitude (STRING);
< OPTIONAL BYTE_ARRAY Longitude (STRING);
---
> OPTIONAL DOUBLE Latitude;
> OPTIONAL DOUBLE Longitude;
51a48
> OPTIONAL INT64 Date (TIMESTAMP(MICROS,false));
Here we can see that various date/time columns were consolidated, and several stringly-typed columns were converted to ints, floats, and booleans.
Row diff
Diff the first row of the Parquet file above (pretty-printed as JSON using jq
), before and after the given commit:
pretty_print_first_row() {
# Print first row of Parquet file as JSON, pretty-print with jq
parquet2json "$1" cat -l 1 | jq .
}
export -f pretty_print_first_row
dvc-diff -r $commit^..$commit pretty_print_first_row $path
Output
2d1
< "Year": "2001",
8,10d6
< "Crash Date": "12/21/2001",
< "Crash Day Of Week": "F",
< "Crash Time": "1834",
14,17c10,13
< "Total Killed": "0",
< "Total Injured": "0",
< "Pedestrians Killed": "0",
< "Pedestrians Injured": "0",
---
> "Total Killed": 0,
> "Total Injured": 0,
> "Pedestrians Killed": 0,
> "Pedestrians Injured": 0,
20,21c16,17
< "Alcohol Involved": "N",
< "HazMat Involved": "N",
---
> "Alcohol Involved": false,
> "HazMat Involved": false,
23c19
< "Total Vehicles Involved": "2",
---
> "Total Vehicles Involved": 2,
29c25
< "Mile Post": "",
---
> "Mile Post": null,
47,48c43,44
< "Latitude": "",
< "Longitude": "",
---
> "Latitude": null,
> "Longitude": null,
51c47,48
< "Reporting Badge No.": "830"
---
> "Reporting Badge No.": "830",
> "Date": "2001-12-21 18:34:00 +00:00"
This reflects the schema changes above.
Row count diff
parquet_row_count() {
parquet2json "$1" rowcount
}
export -f parquet_row_count
dvc-diff -r $commit^..$commit parquet_row_count $path
This time we get no output; the given $commit
didn't change the row count in the DVC-tracked Parquet file $path
.
GZipped CSVs
Here's a "one-liner" I used in ctbk.dev, to normalize and compare headers of .csv.gz.dvc
files between two commits:
# Save some `sed` substitution commands to file `seds`:
cat <<EOF >seds
s/station_//
s/latitude/lat/
s/longitude/lng/
s/starttime/started_at/
s/stoptime/ended_at/
s/usertype/member_casual/
EOF
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
r=c0..c1
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
gdno $r s3/ctbk/csvs/ | \
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
Explanation of aliases
gdno
(git diff --name-only
): list files changed in the given commit range and directorypel
:parallel
alias that prepends anecho {}
to the commandddcr
(dvc-diff -cr
): colorizeddiff
output, revision range$r
guc
(gunzip -c
): uncompress the.csv.gz
filesh1
(head -n1
): only examine each file's header linespc
(tr , $'\n'
): split the header line by commas (so each column name will be on one line, for easierdiff
ing below)kq
(tr -d '"'
): kill quote characters (in this case, header-column name quoting changed, but I don't care about that)kcr
(tr -d '\r'
): kill carriage returns (line endings also changed)snc
(sed -f 'snake_case.sed'
): snake-case column namessdf
(sed -f
): execute thesed
substitution commands defined in theseds
file abovesort
: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
Note:
- Most of these are exported Bash functions, allowing them to be used inside the
parallel
command. - I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the
seds
commands).
Example output:
…
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
…
This helped me see that the data update in question (c0..c1
) dropped some fields (bikeid, birth_year
, gender
, tripduration
) and added others (ride_id
, rideable_type
), for 202001
and later.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dvc-utils-0.1.0.tar.gz
.
File metadata
- Download URL: dvc-utils-0.1.0.tar.gz
- Upload date:
- Size: 8.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 409f79c0d28b48c0c2a30089b29da7e0a14ae4d2a10506e8877ea3e8825236c0 |
|
MD5 | 180806c6e3681732bd6d1af9025411c9 |
|
BLAKE2b-256 | 0c46e6792493778aad92b09a783d765ff6ea5b42953091e7314297c60fba9587 |
File details
Details for the file dvc_utils-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: dvc_utils-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 907ece6e4dec46ec61320e0b3b60cbc285ef622e0e372b3aef5d5b9a0f6b8e81 |
|
MD5 | 0d0d42a2534fdb2742e28769572fce55 |
|
BLAKE2b-256 | aa98e0edd952a9f6aec46b1f3ace9b55478292e4a6b4f4d79f797a4afc12b06e |