Rich file comparison with a focus on structured and tabular data

sdiff

About

sdiff is a diff tool and a library. You can use it to build diffs and compare strings, sequences, arrays, nested sequences, matrices, texts, tables, files, etc. It runs the Myers diff algorithm under the hood and is implemented in Python and Cython.
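
To give a feel for what the Myers algorithm computes, here is a minimal pure-Python sketch of its greedy forward pass. It is not sdiff's own implementation, and the name myers_distance is made up for illustration. It returns the smallest number of single-element insertions and deletions needed to turn one sequence into the other.

```python
def myers_distance(a, b):
    """Smallest number of insertions plus deletions turning a into b.

    Illustrative sketch of the greedy Myers algorithm; not sdiff's code.
    """
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k = x - y
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # move down: take an element from b
            else:
                x = v[k - 1] + 1    # move right: drop an element from a
            y = x - k
            # follow the diagonal while elements match (free moves)
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    raise AssertionError("unreachable: d = n + m always suffices")


print(myers_distance("ABCABBA", "CBABAC"))  # 5
print(myers_distance(
    ["apples", "bananas", "carrots", "dill"],
    ["apples", "carrots", "dill", "eggplant"],
))  # 2
```

The second call corresponds to one deletion ('bananas') and one insertion ('eggplant').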

Features

sdiff is not a drop-in replacement for your diff tool, but it does some things nicely:

  • You can use it for text as usual.
  • It supports tables.
  • It is pretty fast.
  • It exposes a low-level Python API to compare and align arbitrary sequences.
  • The CLI sdiff tool can compare entire directories while discovering file types on the fly. It can be fine-tuned to include or exclude files, align file names through regexes, set various similarity measures, and produce colored reports in various formats.

Install

pip install sdiff

Install the latest git version

pip install git+https://github.com/pulkin/sdiff.git

Examples

CLI

> sdiff a.csv b.csv
comparing a.csv vs b.csv
  Country     Region Date       Kilotons of Co2 Metric Tons Per Capita
- ----------- ------ ---------- --------------- ----------------------
(3 row(s) match)
3 Afghanistan Asia   01-01-2019 6080            0.16                  
4 Afghanistan Asia   01-01-2018 6070            0.17                  
5 Afghanistan Asia   01-01-2013 ---5990---      0.19                  
                                +++6000+++                            
6 Afghanistan Asia   01-01-2015 5950            0.18                  
7 Afghanistan Asia   01-01-2016 5300            0.15                  
(1 row(s) match)

API

from sdiff.sequence import diff

print(diff(
  ['apples', 'bananas', 'carrots', 'dill'],
  ['apples', 'carrots', 'dill', 'eggplant']
).to_string())
a≈b (ratio=0.7500)
··a[0:1]=b[0:1]: ['apples'] = ['apples']
··a[1:2]≠b[1:1]: ['bananas'] ≠ []
··a[2:4]=b[1:3]: ['carrots', 'dill'] = ['carrots', 'dill']
··a[4:4]≠b[3:4]: [] ≠ ['eggplant']
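
The chunks and the ratio above can be cross-checked with difflib from the standard library. This is only a sanity check on this particular input: difflib uses a different matching heuristic than Myers, so the two are not guaranteed to agree in general. Reading the ratio as 2 * matches / (len(a) + len(b)) is an assumption that happens to reproduce the 0.7500 shown above.

```python
from difflib import SequenceMatcher

a = ['apples', 'bananas', 'carrots', 'dill']
b = ['apples', 'carrots', 'dill', 'eggplant']

matcher = SequenceMatcher(None, a, b)

# Each opcode is (tag, a_start, a_stop, b_start, b_stop),
# the same slices that appear in the sdiff output above.
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:7} a[{i1}:{i2}] b[{j1}:{j2}]")

# 3 matching elements out of 4 + 4 in total: 2 * 3 / 8
ratio = matcher.ratio()
print(ratio)  # 0.75
```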

More examples

Align and correspond nested sequences: strings inside a list inside another list

from sdiff.sequence import diff_nested

print(diff_nested(
  [["alice", "bob"], ["charlie", "dan"]],
  [0, 1, ["friends", "alice2", "bob2"], ["karen", "dan"]],
  min_ratio=0.5,
).to_string())
a≈b (ratio=0.6667)
··a[0:0]≠b[0:2]: [] ≠ [0, 1]
··a[0:2]≈b[2:4]: [['alice', 'bob'], ['charlie', 'dan']] ≈ [['friends', 'alice2', 'bob2'], ['karen', 'dan']]
····a[0]≈b[2] (ratio=0.8000)  # recognizes partially aligned ["alice", "bob"] and ["friends", "alice2", "bob2"]
(...)
········a[0][0]≈b[2][1] (ratio=0.9091)  # recognizes similarity between 'alice' and 'alice2'
··········a[0][0][0:5]=b[2][1][0:5]: 'alice' = 'alice'
··········a[0][0][5:5]≠b[2][1][5:6]: '' ≠ '2'
········a[0][1]≈b[2][2] (ratio=0.8571)  # recognizes similarity between 'bob' and 'bob2'
··········a[0][1][0:3]=b[2][2][0:3]: 'bob' = 'bob'
··········a[0][1][3:3]≠b[2][2][3:4]: '' ≠ '2'
(...)
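
The ratios in this output can be reproduced by hand with a small recursive sketch. It assumes (this is a guess consistent with the numbers above, not a description of sdiff internals) that two nested elements count as a match when their own ratio reaches min_ratio, and that a ratio is 2 * matches / (len(a) + len(b)). The helper names lcs_length and nested_ratio are hypothetical.

```python
def lcs_length(a, b, eq):
    """Length of the longest common subsequence of a and b under eq."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b):
            if eq(x, y):
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]


def nested_ratio(a, b, min_ratio=0.5):
    """Hypothetical similarity score: 2 * matches / (len(a) + len(b))."""
    if isinstance(a, str) and isinstance(b, str):
        # characters are compared exactly
        def eq(x, y):
            return x == y
    elif isinstance(a, list) and isinstance(b, list):
        # list items match if they are similar enough themselves
        def eq(x, y):
            return nested_ratio(x, y, min_ratio) >= min_ratio
    else:
        # anything else (numbers, mixed types) must be equal
        return 1.0 if a == b else 0.0
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b, eq) / total


a = [["alice", "bob"], ["charlie", "dan"]]
b = [0, 1, ["friends", "alice2", "bob2"], ["karen", "dan"]]

print(f"{nested_ratio('alice', 'alice2'):.4f}")  # 0.9091
print(f"{nested_ratio('bob', 'bob2'):.4f}")      # 0.8571
print(f"{nested_ratio(a[0], b[2]):.4f}")         # 0.8000
print(f"{nested_ratio(a, b):.4f}")               # 0.6667
```

This sketch only scores similarity; unlike diff_nested it does not return the aligned chunks.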

Align numpy matrices

Given two 2D arrays, compute aligned rows and columns

import numpy as np
from sdiff.numpy import diff_aligned_2d

a = np.array([[0, 1], [2, 3]])
b = np.array([[0, 1, 4], [7, 8, 9], [2, 3, 6]])
# a is a "sub-matrix" of b

d = diff_aligned_2d(a, b, -1, min_ratio=0.5)

Inflated versions of the two matrices (the -1 passed above is the fill value):

print(d.a)
print(d.b)
[[ 0  1 -1]
 [-1 -1 -1]  < an empty row needs to be added to a to align with b
 [ 2  3 -1]]
        ^^
# an empty column needs to be added as well
 
# inflated version of b is b itself in this case
[[0 1 4]
 [7 8 9]
 [2 3 6]]
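
Once inflated, the two matrices share a shape, so plain numpy operations are enough to locate the differences. The sketch below copies the two arrays by hand from the output above instead of calling sdiff (in practice they would be d.a and d.b). Detecting padding by looking for the fill value is an assumption that only works because -1 does not occur in the real data.

```python
import numpy as np

# Inflated matrices, copied from the printed output above
a_inflated = np.array([[0, 1, -1], [-1, -1, -1], [2, 3, -1]])
b_inflated = np.array([[0, 1, 4], [7, 8, 9], [2, 3, 6]])

# True where the aligned cells agree
equal = a_inflated == b_inflated
print(equal)

# Rows and columns of a that consist only of the fill value are padding
added_rows = np.flatnonzero((a_inflated == -1).all(axis=1))
added_cols = np.flatnonzero((a_inflated == -1).all(axis=0))
print(added_rows)  # [1]
print(added_cols)  # [2]
```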

License

LICENSE.md
