Python CLI tool and library for diffing CSV and JSON files
csv-diff
Tool for viewing the difference between two CSV, TSV or JSON files. See Generating a commit log for San Francisco’s official list of trees (and the sf-tree-history repo commit log) for background information on this project.
Installation
pip install csv-diff
Usage
Consider two CSV files:
one.csv
id,name,age
1,Cleo,4
2,Pancakes,2
two.csv
id,name,age
1,Cleo,5
3,Bailey,1
csv-diff can show a human-readable summary of differences between the files:
$ csv-diff one.csv two.csv --key=id
1 row changed, 1 row added, 1 row removed

1 row changed

  Row 1
    age: "4" => "5"

1 row added

  id: 3
  name: Bailey
  age: 1

1 row removed

  id: 2
  name: Pancakes
  age: 2
The --key=id option means that the id column should be treated as the unique key, to identify which records have changed.
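To illustrate what treating a column as the unique key means, here is a minimal sketch in plain Python. It is not csv-diff's own implementation; the helper names index_by_key and keyed_diff are hypothetical, and the two CSV files from above are inlined as strings so the example is self-contained.

```python
import csv
import io

# The two example files from above, inlined as strings.
ONE = "id,name,age\n1,Cleo,4\n2,Pancakes,2\n"
TWO = "id,name,age\n1,Cleo,5\n3,Bailey,1\n"


def index_by_key(text, key):
    """Read CSV text and map each row's key value to the full row (hypothetical helper)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def keyed_diff(previous, current):
    """Classify rows as added, removed or changed by matching them on their key."""
    added = [current[k] for k in current if k not in previous]
    removed = [previous[k] for k in previous if k not in current]
    changed = {}
    for k in previous:
        if k in current and previous[k] != current[k]:
            # Record (old, new) for every field whose value differs.
            changed[k] = {
                field: (previous[k][field], current[k].get(field))
                for field in previous[k]
                if previous[k][field] != current[k].get(field)
            }
    return added, removed, changed


added, removed, changed = keyed_diff(index_by_key(ONE, "id"), index_by_key(TWO, "id"))
print(added)    # the row with id 3 (Bailey)
print(removed)  # the row with id 2 (Pancakes)
print(changed)  # {'1': {'age': ('4', '5')}}
```

Without a key there would be no way to tell that row 1 was edited rather than removed and replaced by a different row.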
The tool will automatically detect if your files are comma- or tab-separated. You can override this automatic detection and force the tool to use a specific format using --format=tsv or --format=csv.
You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use --format=json if your input files are JSON.
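To make the accepted JSON shape concrete, the sketch below converts the one.csv example into a JSON array of objects and checks that every object has the same keys. The function is_uniform_array_of_objects is a hypothetical helper written for this illustration. It is not part of csv-diff, and the tool's own validation may differ.

```python
import csv
import io
import json

ONE = "id,name,age\n1,Cleo,4\n2,Pancakes,2\n"

# One object per CSV row, all sharing the keys id, name and age.
rows = list(csv.DictReader(io.StringIO(ONE)))
as_json = json.dumps(rows, indent=2)
print(as_json)


def is_uniform_array_of_objects(text):
    """Return True if text is a JSON array of objects that all have the same keys."""
    data = json.loads(text)
    if not isinstance(data, list):
        return False
    if not all(isinstance(item, dict) for item in data):
        return False
    # Collect the distinct key sets; a uniform array has at most one.
    return len({frozenset(item) for item in data}) <= 1


print(is_uniform_array_of_objects(as_json))                      # True
print(is_uniform_array_of_objects('[{"id": 1}, {"name": "x"}]'))  # False
```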
Use --show-unchanged to include full details of the unchanged values for rows with at least one change in the diff output:
$ csv-diff one.csv two.csv --key=id --show-unchanged
1 row changed

  id: 1
    age: "4" => "5"

    Unchanged:
      name: "Cleo"
JSON output
You can use the --json option to get a machine-readable difference:
$ csv-diff one.csv two.csv --key=id --json
{
    "added": [
        {
            "id": "3",
            "name": "Bailey",
            "age": "1"
        }
    ],
    "removed": [
        {
            "id": "2",
            "name": "Pancakes",
            "age": "2"
        }
    ],
    "changed": [
        {
            "key": "1",
            "changes": {
                "age": [
                    "4",
                    "5"
                ]
            }
        }
    ],
    "columns_added": [],
    "columns_removed": []
}
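As a rough sketch of how this machine-readable output might be consumed, the following Python parses the JSON shown above and walks its structure. The summarize function is a hypothetical helper for this example, not something csv-diff provides. In practice the JSON text would come from the command's standard output rather than an inlined string.

```python
import json

# The --json output from the example above, inlined for a self-contained demo.
DIFF_JSON = """
{
    "added": [{"id": "3", "name": "Bailey", "age": "1"}],
    "removed": [{"id": "2", "name": "Pancakes", "age": "2"}],
    "changed": [{"key": "1", "changes": {"age": ["4", "5"]}}],
    "columns_added": [],
    "columns_removed": []
}
"""

diff = json.loads(DIFF_JSON)


def summarize(diff):
    """Build a one-line summary from the row counts in a diff dictionary."""
    parts = []
    for key in ("changed", "added", "removed"):
        count = len(diff[key])
        if count:
            plural = "" if count == 1 else "s"
            parts.append(f"{count} row{plural} {key}")
    return ", ".join(parts)


print(summarize(diff))  # 1 row changed, 1 row added, 1 row removed

# Each entry in "changed" maps a column name to an [old, new] pair.
for item in diff["changed"]:
    for column, (old, new) in item["changes"].items():
        print(f"key {item['key']}: {column} went from {old} to {new}")
```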
Adding templated extras
You can specify additional keys to be displayed in the human-readable format using the --extra option:
--extra name "Python format string with {id} for variables"
For example, to output a link to https://news.ycombinator.com/latest?id={id} for each item with an ID, you could use this:
csv-diff one.csv two.csv --key=id \
--extra latest "https://news.ycombinator.com/latest?id={id}"
These extras display something like this:
1 row changed

  id: 41459472
    points: "24" => "25"
    numComments: "5" => "6"
    extras:
      latest: https://news.ycombinator.com/latest?id=41459472
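The template passed to --extra is described above as a Python format string with column names as variables. A minimal sketch of that idea, assuming the substitution behaves like str.format called with the row's columns as keyword arguments, could look like this. The render_extras helper is hypothetical, and the row values are taken from the sample output above.

```python
def render_extras(row, templates):
    """Fill each named template with values from the row (hypothetical helper)."""
    return {name: template.format(**row) for name, template in templates.items()}


# Current values of the changed row shown in the sample output above.
row = {"id": "41459472", "points": "25", "numComments": "6"}
templates = {"latest": "https://news.ycombinator.com/latest?id={id}"}

for name, value in render_extras(row, templates).items():
    print(f"{name}: {value}")
# latest: https://news.ycombinator.com/latest?id=41459472
```

In this sketch a template that names a column missing from the row raises KeyError.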
As a Python library
You can also import the Python library into your own code like so:
from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)
diff will now contain the same data structure as the output in the --json example above.
If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.
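The behaviour described in the last sentence can be sketched in plain Python as follows. This is an illustration of the idea only, not the library's actual code: compare_shared_columns is a hypothetical function, and the weight column is invented for the example. Rows are compared only on columns present in both inputs, while added and removed columns are reported separately.

```python
def compare_shared_columns(previous, current):
    """Compare rows keyed by ID, looking only at columns present on both sides."""
    prev_cols = set().union(*(row.keys() for row in previous.values()))
    cur_cols = set().union(*(row.keys() for row in current.values()))
    shared = sorted(prev_cols & cur_cols)

    changed = []
    for key in previous:
        if key not in current:
            continue
        changes = {
            col: [previous[key].get(col), current[key].get(col)]
            for col in shared
            if previous[key].get(col) != current[key].get(col)
        }
        if changes:
            changed.append({"key": key, "changes": changes})

    return {
        "changed": changed,
        "columns_added": sorted(cur_cols - prev_cols),
        "columns_removed": sorted(prev_cols - cur_cols),
    }


# "weight" is a made-up column that exists only in the newer data.
previous = {"1": {"id": "1", "name": "Cleo", "age": "4"}}
current = {"1": {"id": "1", "name": "Cleo", "age": "5", "weight": "3"}}

result = compare_shared_columns(previous, current)
print(result["changed"])        # [{'key': '1', 'changes': {'age': ['4', '5']}}]
print(result["columns_added"])  # ['weight']
```

The new weight column does not appear in the row-level changes, so it does not cause every row to be reported as changed.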
As a Docker container
Build the image
$ docker build -t csvdiff .
Run the container
$ docker run --rm -v $(pwd):/files csvdiff
Suppose the current directory contains two CSV files, one.csv and two.csv:
$ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv
Alternatives
- csvdiff is a "fast diff tool for comparing CSV files" - you may get better results from this than from csv-diff against larger files.