Sequence matcher with displacement detection.
mdiff
mdiff is a package for finding differences between two input sequences, with the ability to detect displaced sequence elements. Features:
- New SequenceMatcher class compatible with Python's built-in difflib.SequenceMatcher class.
- CLI for using the package as a tool for comparing files.
Installation
For the plain Python package (no additional dependencies):
pip install mdiff
For the additional CLI tool functionality (uses external packages such as colorama and Typer):
pip install mdiff[cli]
Usage
Sequence Matching
HeckelSequenceMatcher is a class for comparing pairs of sequences of any type, as long as their elements are comparable and hashable. Unlike the built-in difflib.SequenceMatcher, it detects and marks elements displaced between the sequences. The class provides a get_opcodes() method that returns a sequence of opcodes describing the differences between the sequences, as difflib.SequenceMatcher.get_opcodes() does, but with additional move and moved tags for displaced elements.
HeckelSequenceMatcher implements Paul Heckel's algorithm, described in the paper "A Technique for Isolating Differences Between Files".
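The core idea of Heckel's technique can be sketched in a few lines. The following is a simplified illustration, not mdiff's implementation: the function name heckel_anchor_map is hypothetical, and the sketch covers only the matching passes (elements that occur exactly once in both sequences become anchors, and matches are then extended to equal neighbours). It does not generate opcodes.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def heckel_anchor_map(a: Sequence[Hashable], b: Sequence[Hashable]) -> Dict[int, int]:
    """Map indices of `a` to indices of `b` (simplified Heckel-style matching)."""
    count_a, count_b = Counter(a), Counter(b)
    position_in_b = {item: j for j, item in enumerate(b)}

    # Step 1: elements occurring exactly once in both sequences are anchors.
    mapping = {
        i: position_in_b[item]
        for i, item in enumerate(a)
        if count_a[item] == 1 and count_b[item] == 1
    }
    used_b = set(mapping.values())

    def try_extend(i: int, step: int) -> None:
        """Match the neighbour of an already matched pair if the elements are equal."""
        ni, nj = i + step, mapping[i] + step
        if (0 <= ni < len(a) and 0 <= nj < len(b)
                and ni not in mapping and nj not in used_b
                and a[ni] == b[nj]):
            mapping[ni] = nj
            used_b.add(nj)

    # Step 2: grow matches forwards, then backwards, from every matched pair.
    for i in range(len(a)):
        if i in mapping:
            try_extend(i, +1)
    for i in range(len(a) - 1, -1, -1):
        if i in mapping:
            try_extend(i, -1)
    return mapping


a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
print(heckel_anchor_map(a, b))  # {0: 0, 1: 2, 2: 1, 3: 3}
```

Indices 1 and 2 map to swapped positions, which is the kind of information a displacement-aware matcher can report as a move instead of a delete/insert pair. Indices with no entry (here 4) have no counterpart in the other sequence.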
HeckelSequenceMatcher(a: Sequence[Any] = '', b: Sequence[Any] = '', replace_mode=True)
Initializes the sequence matcher object. Parameters:
- a - source (old) sequence.
- b - target (new) sequence.
- replace_mode - if True, consecutive pairs of insert and delete blocks are merged into a single replace operation; otherwise they remain separate insert and delete opcodes.
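To make the effect of replace_mode concrete, here is a minimal sketch of merging an adjacent delete/insert pair into one replace. It works on plain (tag, i1, i2, j1, j2) tuples; the helper name merge_into_replace is hypothetical, and how mdiff performs this step internally is an assumption, not something the description above specifies.

```python
from typing import List, Tuple

Op = Tuple[str, int, int, int, int]  # (tag, i1, i2, j1, j2)


def merge_into_replace(opcodes: List[Op]) -> List[Op]:
    """Merge each adjacent delete/insert pair (in either order) into one replace."""
    merged: List[Op] = []
    for op in opcodes:
        if merged:
            prev = merged[-1]
            # Adjacent means the second block starts where the first one ends.
            adjacent = prev[2] == op[1] and prev[4] == op[3]
            if {prev[0], op[0]} == {'delete', 'insert'} and adjacent:
                merged[-1] = ('replace', prev[1], op[2], prev[3], op[4])
                continue
        merged.append(op)
    return merged


separate = [
    ('equal', 0, 1, 0, 1),
    ('delete', 1, 2, 1, 1),
    ('insert', 2, 2, 1, 2),
    ('equal', 2, 3, 2, 3),
]
print(merge_into_replace(separate))
# [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2), ('equal', 2, 3, 2, 3)]
```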
get_opcodes() -> List[OpCode]
Returns a list of OpCode objects describing how to turn sequence a into b.
An OpCode has the attributes tag, i1, i2, j1 and j2, and can be unpacked like a tuple.
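An object with named attributes that also unpacks like a tuple behaves much like a NamedTuple. The stand-in below (SketchOpCode, a hypothetical name) only illustrates that behaviour; mdiff's actual OpCode class may be implemented differently.

```python
from typing import NamedTuple


class SketchOpCode(NamedTuple):
    """Illustrative stand-in for an opcode with five fields."""
    tag: str
    i1: int
    i2: int
    j1: int
    j2: int


op = SketchOpCode('replace', 4, 5, 4, 5)

# Attribute access and tuple unpacking refer to the same values.
tag, i1, i2, j1, j2 = op
print(op.tag, op.i1, op.i2)  # replace 4 5
print(tag, j1, j2)           # replace 4 5
```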
Usually the first tuple has i1 == j1 == 0, and each remaining tuple has i1 equal to the i2 of the preceding tuple and, likewise, j1 equal to the previous j2. However, this rule is broken when move and moved tags appear in the opcode list, because of the detection of displaced elements.
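The contiguity rule can be checked mechanically. The sketch below uses a hypothetical helper, is_contiguous, and compares opcodes from the standard library's difflib.SequenceMatcher with a list containing move and moved tags, written as plain tuples copied from the example output in the Examples part of this section.

```python
import difflib
from typing import Iterable, Tuple

Op = Tuple[str, int, int, int, int]  # (tag, i1, i2, j1, j2)


def is_contiguous(opcodes: Iterable[Op]) -> bool:
    """True if every opcode starts exactly where the previous one ended."""
    expected_i = expected_j = 0
    for _tag, i1, i2, j1, j2 in opcodes:
        if (i1, j1) != (expected_i, expected_j):
            return False
        expected_i, expected_j = i2, j2
    return True


a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']

standard = difflib.SequenceMatcher(None, a, b).get_opcodes()
with_moves = [
    ('equal', 0, 1, 0, 1),
    ('move', 1, 2, 2, 2),
    ('equal', 2, 3, 1, 2),
    ('moved', 1, 1, 2, 3),
    ('equal', 3, 4, 3, 4),
    ('replace', 4, 5, 4, 5),
]

print(is_contiguous(standard))    # True
print(is_contiguous(with_moves))  # False
```

Code that walks opcodes while assuming contiguous ranges, as is safe with difflib, therefore needs adjusting before it can consume a list that contains move and moved tags.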
The tags are strings, with these meanings:
- replace - a[i1:i2] should be replaced by b[j1:j2].
- delete - a[i1:i2] should be deleted. Note that j1 == j2 in this case.
- insert - b[j1:j2] should be inserted at a[i1:i1]. Note that i1 == i2 in this case.
- equal - a[i1:i2] == b[j1:j2].
- move - a[i1:i2] should be moved to position j1 in b. Note that j1 == j2 in this case.
- moved - the counterpart of move. It is not an operation needed to turn sequence a into b; it indicates that b[j1:j2] was moved from position i1 (or that b[j1:j2] should be moved back to a[i1:i2]). Note that i1 == i2 in this case. It can be used to visualise the movement of sequence elements.
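One way to read the tag semantics is to rebuild b from a and the opcodes. The sketch below is an illustration under stated assumptions: rebuild_target is a hypothetical helper, the opcodes are plain tuples copied from the example output in this section, and they are sorted by their position in b because move and moved tags break the usual ordering.

```python
from typing import List, Sequence, Tuple

Op = Tuple[str, int, int, int, int]  # (tag, i1, i2, j1, j2)


def rebuild_target(a: Sequence[str], b: Sequence[str], opcodes: List[Op]) -> List[str]:
    """Reassemble the target sequence by walking opcodes in target order."""
    result: List[str] = []
    for tag, i1, i2, j1, j2 in sorted(opcodes, key=lambda op: (op[3], op[4])):
        if tag == 'equal':
            result.extend(a[i1:i2])  # unchanged elements come from a
        elif tag in ('replace', 'insert', 'moved'):
            result.extend(b[j1:j2])  # new or relocated elements as they appear in b
        # 'delete' and 'move' have empty target ranges (j1 == j2): nothing to add
    return result


a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
opcodes = [
    ('equal', 0, 1, 0, 1),
    ('move', 1, 2, 2, 2),
    ('equal', 2, 3, 1, 2),
    ('moved', 1, 1, 2, 3),
    ('equal', 3, 4, 3, 4),
    ('replace', 4, 5, 4, 5),
]
print(rebuild_target(a, b, opcodes) == b)  # True
```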
Examples:
from mdiff import HeckelSequenceMatcher
a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
sm = HeckelSequenceMatcher(a, b)
opcodes = sm.get_opcodes()
for opcode in opcodes:
    print(opcode)
OpCode('equal', 0, 1, 0, 1)
OpCode('move', 1, 2, 2, 2)
OpCode('equal', 2, 3, 1, 2)
OpCode('moved', 1, 1, 2, 3)
OpCode('equal', 3, 4, 3, 4)
OpCode('replace', 4, 5, 4, 5)
Extracting changes from input sequences:
...
for tag, i1, i2, j1, j2 in opcodes:
    print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
equal   a[0:1] --> b[0:1] ['line1'] --> ['line1']
move    a[1:2] --> b[2:2] ['line2'] --> []
equal   a[2:3] --> b[1:2] ['line3'] --> ['line3']
moved   a[1:1] --> b[2:3]       [] --> ['line2']
equal   a[3:4] --> b[3:4] ['line4'] --> ['line4']
replace a[4:5] --> b[4:5] ['line5'] --> ['line6']
DisplacementSequenceMatcher
DisplacementSequenceMatcher is a variation of the HeckelSequenceMatcher class. The algorithm keeps track of every occurrence of each sequence element, which may give better results when both sequences share duplicated elements. It tries to detect all displaced elements, whereas HeckelSequenceMatcher might sometimes treat displaced elements as delete/insert. Use this class if finding every displacement is crucial.
Text diff
diff_lines_with_similarities
This function takes two input texts, splits them into lists of lines, generates opcodes for those lines, and tries to find character-level differences within similar lines.
Parameters:
- a: str - source text.
- b: str - target text.
- cutoff: float = 0.75 - value in the range [0, 1], where 0.0 means the lines are completely different and 1.0 means they are exactly the same. The line similarity cutoff determines whether sub-opcodes are generated for similar lines. If cutoff == 1, no in-line diff is generated.
- line_sm: SequenceMatcherBase = None - SequenceMatcher object used to find differences between the lines of the input texts. HeckelSequenceMatcher() is used if not specified.
- inline_sm: SequenceMatcherBase = None - SequenceMatcher object used to find differences within similar lines (e.g. use difflib.SequenceMatcher when in-line displacement detection is not desirable). difflib.SequenceMatcher() is used if not specified.
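The role of cutoff can be illustrated with the standard library. The sketch assumes that line similarity is measured like difflib.SequenceMatcher.ratio() and that a pair qualifies only when the ratio exceeds the cutoff; both are assumptions, since the exact measure mdiff uses is not stated here. The helper name similar_enough is hypothetical.

```python
import difflib


def similar_enough(line_a: str, line_b: str, cutoff: float = 0.75) -> bool:
    """Decide whether two lines are close enough to justify an in-line diff."""
    matcher = difflib.SequenceMatcher(None, line_a, line_b, autojunk=False)
    # ratio() is 2 * matches / (len(line_a) + len(line_b)), a value in [0, 1].
    return matcher.ratio() > cutoff


print(similar_enough('line5', 'line6'))              # True: 2 * 4 / 10 = 0.8 > 0.75
print(similar_enough('line5', 'line6', cutoff=1.0))  # False: nothing exceeds 1.0
print(similar_enough('line5', 'xyz'))                # False: no characters in common
```

With a strict comparison, a cutoff of 1 can never be exceeded, which agrees with the note that no in-line diff is generated in that case.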
Returns Tuple[a_lines: List[str], b_lines: List[str], opcodes: List[CompositeOpCode]], where:
- a_lines - list of lines from the a input text.
- b_lines - list of lines from the b input text.
- opcodes - list of CompositeOpCode objects, which behave the same way as OpCode (they have tag, i1, i2, j1, j2 fields and can be unpacked) but have an additional children_opcodes attribute that stores a list of nested opcodes with the SequenceMatcher result for similar lines. The list is empty if the lines were not similar enough. Note that opcodes for similar lines are generated only for replace tags, so children_opcodes is empty for every other tag.
Example
from mdiff import diff_lines_with_similarities, CompositeOpCode
a = 'line1\nline2\nline3\nline4\nline5'
b = 'line1\nline3\nline2\nline4\nline6'
a_lines, b_lines, opcodes = diff_lines_with_similarities(a, b, cutoff=0.75)
for opcode in opcodes:
    tag, i1, i2, j1, j2 = opcode
    print('{:7} a_lines[{}:{}] --> b_lines[{}:{}] {!r:>10} --> {!r}'.
          format(tag, i1, i2, j1, j2, a_lines[i1:i2], b_lines[j1:j2]))
    if isinstance(opcode, CompositeOpCode) and opcode.children_opcodes:
        for ltag, li1, li2, lj1, lj2 in opcode.children_opcodes:
            print('\t{:7} a_lines[{}][{}:{}] --> b_lines[{}][{}:{}] {!r:>10} --> {!r}'
                  .format(ltag, i1, li1, li2, j1, lj1, lj2, a_lines[i1][li1:li2], b_lines[j1][lj1:lj2]))
equal   a_lines[0:1] --> b_lines[0:1]  ['line1'] --> ['line1']
move    a_lines[1:2] --> b_lines[2:2]  ['line2'] --> []
equal   a_lines[2:3] --> b_lines[1:2]  ['line3'] --> ['line3']
moved   a_lines[1:1] --> b_lines[2:3]         [] --> ['line2']
equal   a_lines[3:4] --> b_lines[3:4]  ['line4'] --> ['line4']
replace a_lines[4:5] --> b_lines[4:5]  ['line5'] --> ['line6']
    equal   a_lines[4][0:4] --> b_lines[4][0:4]     'line' --> 'line'
    replace a_lines[4][4:5] --> b_lines[4][4:5]        '5' --> '6'
The indented tags show in-line differences; in this case, the strings line5 and line6 differ only in their last character.
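The two-level structure (line opcodes with nested character opcodes) can be approximated with the standard library alone. This is a rough sketch, not mdiff's implementation: two_level_diff is a hypothetical name, difflib.SequenceMatcher is used at both levels (so no move or moved tags appear), only one-to-one replace blocks get nested opcodes, and the similarity test is an assumed ratio() > cutoff check.

```python
import difflib
from typing import List, Tuple


def two_level_diff(a_text: str, b_text: str, cutoff: float = 0.75) -> Tuple[List[str], List[str], list]:
    """Line-level opcodes, each with a list of in-line opcodes for similar lines."""
    a_lines, b_lines = a_text.splitlines(), b_text.splitlines()
    line_sm = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in line_sm.get_opcodes():
        children = []
        # Only single-line replacements are compared character by character here.
        if tag == 'replace' and i2 - i1 == 1 and j2 - j1 == 1:
            inline_sm = difflib.SequenceMatcher(None, a_lines[i1], b_lines[j1], autojunk=False)
            if inline_sm.ratio() > cutoff:
                children = inline_sm.get_opcodes()
        result.append((tag, i1, i2, j1, j2, children))
    return a_lines, b_lines, result


a = 'line1\nline2\nline3\nline4\nline5'
b = 'line1\nline3\nline2\nline4\nline6'
a_lines, b_lines, opcodes = two_level_diff(a, b)
for tag, i1, i2, j1, j2, children in opcodes:
    print(tag, a_lines[i1:i2], '-->', b_lines[j1:j2])
    for ctag, ci1, ci2, cj1, cj2 in children:
        print('   ', ctag, repr(a_lines[i1][ci1:ci2]), '-->', repr(b_lines[j1][cj1:cj2]))
```

The final line-level opcode is a replace of line5 by line6, and its nested opcodes match the in-line result shown above: an equal block for 'line' followed by a replace of '5' by '6'.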
CLI Tool
mdiff also provides a CLI tool (available only if installed with pip install mdiff[cli]). For more information, run mdiff --help:
Usage: mdiff [OPTIONS] SOURCE_FILE TARGET_FILE
Reads 2 files from provided paths, compares their content and prints diff.
If compared lines in text files are similar enough (exceed cutoff) then
extracts in-line diff.
There are few possible strategies to choose to use independently in line-
level and in-line-level diff:
standard: uses built in python SequenceMatcher object to generate diff,
elements movement detection not supported.
heckel: detects elements movement in a human-readable form, might not
catch all of moves and differences.
displacement: detects all differences and movements, might not be very
useful when both input files contains many common lines (for example
many empty newlines).
Arguments:
SOURCE_FILE Source file path to compare. [required]
TARGET_FILE Target file path to compare. [required]
Options:
--line-sm [standard|heckel|displacement]
Choose sequence matching method to detect
differences between lines. [default:
heckel]
--inline-sm [standard|heckel|displacement]
Choose sequence matching method to detect
in-line differences between similar lines.
[default: heckel]
--cutoff FLOAT RANGE Line similarity ratio cutoff. If value
exceeded then finds in-line differences in
similar lines. [default: 0.75; 0.0<=x<=1.0]
--char-mode [utf8|ascii] Character set used when printing diff
result. [default: utf8]
--color-mode [fore|back] Color mode used when printing diff result.
[default: fore]
--install-completion [bash|zsh|fish|powershell|pwsh]
Install completion for the specified shell.
--show-completion [bash|zsh|fish|powershell|pwsh]
Show completion for the specified shell, to
copy it or customize the installation.
--help Show this message and exit.
Example
Sample output for the mdiff a.txt b.txt command (screenshot not reproduced here).