Sequence matcher with displacement detection.
Project description
mdiff
mdiff is a package, tool and application for comparing and generating diff info for input data. It has an ability to detect sequence elements displacements (i.e. line in text have been moved up or down).
The features are:
- New
SequenceMatcher
class generating opcodes with sequence elements displacement detection (new opcodes tags:move
andmoved
). - API for comparing and generating diff for text inputs.
- CLI for using package as a tool for comparing files.
- Simple Standalone GUI application for comparing text contents.
Links
Requirements
- Python 3.8+
Installation
For plain python package (no additional dependencies):
pip install mdiff
For additional CLI tool and GUI functionalities (uses external packages such as colorama, or Typer):
pip install mdiff[cli]
Examples
Generating opcodes for input sequences:
from mdiff import HeckelSequenceMatcher
a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
sm = HeckelSequenceMatcher(a, b)
opcodes = sm.get_opcodes()
for opcode in opcodes:
print(opcode)
OpCode('equal', 0, 1, 0, 1)
OpCode('move', 1, 2, 2, 2)
OpCode('equal', 2, 3, 1, 2)
OpCode('moved', 1, 1, 2, 3)
OpCode('equal', 3, 4, 3, 4)
OpCode('replace', 4, 5, 4, 5)
Extracting changes from input sequences using opcodes:
...
for tag, i1, i2, j1, j2 in opcodes:
print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
equal a[0:1] --> b[0:1] ['line1'] --> ['line1']
move a[1:2] --> b[2:2] ['line2'] --> []
equal a[2:3] --> b[1:2] ['line3'] --> ['line3']
moved a[1:1] --> b[2:3] [] --> ['line2']
equal a[3:4] --> b[3:4] ['line4'] --> ['line4']
replace a[4:5] --> b[4:5] ['line5'] --> ['line6']
Generating and printing diff for input strings:
from mdiff import diff_lines_with_similarities, CompositeOpCode
a = 'line1\nline2\nline3\nline4\nline5'
b = 'line1\nline3\nline2\nline4\nline6'
a_lines, b_lines, opcodes = diff_lines_with_similarities(a, b, cutoff=0.75)
for opcode in opcodes:
tag, i1, i2, j1, j2 = opcode
print('{:7} a_lines[{}:{}] --> b_lines[{}:{}] {!r:>10} --> {!r}'.
format(tag, i1, i2, j1, j2, a_lines[i1:i2], b_lines[j1:j2]))
if isinstance(opcode, CompositeOpCode) and opcode.children_opcodes:
for ltag, li1, li2, lj1, lj2 in opcode.children_opcodes:
print('\t{:7} a_lines[{}][{}:{}] --> b_lines[{}][{}:{}] {!r:>10} --> {!r}'
.format(ltag, i1, li1, li2, j1, lj1, lj2, a_lines[i1][li1:li2], b_lines[j1][lj1:lj2]))
equal a_lines[0:1] --> b_lines[0:1] ['line1'] --> ['line1']
move a_lines[1:2] --> b_lines[2:2] ['line2'] --> []
equal a_lines[2:3] --> b_lines[1:2] ['line3'] --> ['line3']
moved a_lines[1:1] --> b_lines[2:3] [] --> ['line2']
equal a_lines[3:4] --> b_lines[3:4] ['line4'] --> ['line4']
replace a_lines[4:5] --> b_lines[4:5] ['line5'] --> ['line6']
equal a_lines[4][0:4] --> b_lines[4][0:4] 'line' --> 'line'
replace a_lines[4][4:5] --> b_lines[4][4:5] '5' --> '6'
Indented tags shows in-line differences, in this case line5
and line6
strings have the only difference at last character.
API
General idea is to provide new SequenceMatcher
object that's compatible with built-in difflib.SequenceMatcher
class,
so they can be used interchangeably to generate opcodes.
HeckelSequenceMatcher
HeckelSequenceMatcher
is a class for comparing pairs of sequences of any type, as long as sequences
are comparable and hashable. Unlike builtin difflib.SequenceMatcher
, it detects and marks elements
displacement between sequences. This class provides get_opcodes()
method which returns Sequence of opcodes
with differences between sequences, similar as difflib.SequenceMatcher.get_opcodes()
does, but
with additional move
and moved
tags for displaced elements.
HeckelSequenceMatcher
implements Paul Heckel's algorithm described in
"A Technique for Isolating Differences Between Files" paper, which can be found
here.
HeckelSequenceMatcher(a: Sequence[Any] = '', b: Sequence[Any] = '', replace_mode=True)
Initialize sequence matcher object, parameters:
a
- source(old) sequence.b
- target(new) sequence.replace_mode
- if True: it merges consecutive pairs ofinsert
anddelete
blocks intoreplace
operation. Remainsinsert
anddelete
opcodes otherwise.
get_opcodes() -> List[OpCode]
Returns list of OpCode
objects describing how to turn sequence a
into b
.
OpCode
consists of attributes: tag
, i1
, i2
, j1
, j2
. OpCode
can be unpacked as tuple.
Usually the first tuple has i1 == j1 == 0
, and remaining tuples have i1
equal to the i2
from the preceding tuple, and, likewise, j1
equal to the previous j2
. However, this rule is broken when
move
and moved
tags appears in OpCodes list, due to sequence elements displacement detection.
The tags are strings, with these meanings:
replace
-a[i1:i2]
should be replaced byb[j1:j2]
delete
-a[i1:i2]
should be deleted. Note thatj1==j2
in this case.insert
-b[j1:j2]
should be inserted ata[i1:i1]
. Note thati1==i2
in this case.equal
-a[i1:i2] == b[j1:j2]
move
-a[i1:i2]
should be moved tob[j1:j2]
position. Note thatj1==j2
in this case.moved
- is opposite tag formove
. It's not an operation necessary for turning sequencea
intob
. It indicates thatb[j1:j2]
is moved fromi1
position (orb[j1:j2]
should be moved back toa[i1:i2]
). Note thati1==j2
in this case. It can be used for sequence elements movement visualisation.
DisplacementSequenceMatcher
DisplacementSequenceMatcher
is a variation of HeckelSequenceMatcher
class.
The algorithm keeps tracking of every sequence element occurrence, which might give better result when both sequences have common unique elements.
It tries to detect all sequences elements displacements, where HeckelSequenceMatcher
might sometimes treat displaced elements as delete/insert.
This SequenceMatcher
class might be useful when finding all sequences displacements is crucial
, however for ordinary text sequences it might produce human-unreadable diff.
This object has the same methods as HeckelSequenceMatcher
Generating text diff
diff_lines_with_similarities(...)
This function takes two input text sequences, turns them into lists of lines, generates opcodes for those lines and tries to find single characters differences in similar lines.
Parameters:
a: str
- source text.b: str
- target text.cutoff: float = 0.75
- value in range [0:1], where 0.0 means that lines are completely different and 1.0 means that lines are exactly the same. Line similarity cutoff is used to determine if sub opcodes for similar lines should be generated. Ifcutoff == 1
, then in-line diff won't be generated.line_sm: SequenceMatcherBase = None
- SequenceMatcher object used to find differences between input texts lines.HeckelSequenceMatcher()
will be used if not specified.inline_sm: SequenceMatcherBase = None
- SequenceMatcher object used to find differences between similar lines (i.e. usingdifflib.SequenceMatcher
when in-line diff displacement detection is not desirable).difflib.SequenceMatcher()
will be used if not specified.keepends = False
- Whether to keep newline characters when splitting input sequences.case_sensitive = True
- Whether to perform string case-sensitive comparison when generating diff.
Returns a tuple (a_lines: List[str], b_lines: List[str], opcodes: List[CompositeOpCode])
where:
a_lines
- is list of lines froma
input text sequence.b_lines
- is list of lines fromb
input text sequence.opcodes
- is list ofCompositeOpCode
which behave the same way asOpCode
(hastag i1 i2 j1 j2
fields and can be unpacked), but has additionalchildren_opcodes
which stores list of nested opcodes with SequenceMatcher result for similar lines. List is empty if lines were not similar enough. (note that similar lines opcodes are generated only forreplace
tags, so children_opcodes list will be empty for every other tag).
CLI Tool
mdiff also provides CLI tool (available only if installed using pip install mdiff[cli]
). For more information
type mdiff --help
Usage: mdiff [OPTIONS] SOURCE_FILE TARGET_FILE
Reads 2 files from provided paths, compares their content and prints diff.
If compared lines in text files are similar enough (exceed cutoff) then
extracts in-line diff.
There are few possible strategies to choose to use independently in line-
level and in-line-level diff:
standard: uses built in python SequenceMatcher object to generate diff,
elements movement detection not supported.
heckel: detects elements movement in a human-readable form, might not
catch all of moves and differences.
displacement: detects all differences and movements, might not be very
useful when both input files contains many common lines (for example
many empty newlines).
Arguments:
SOURCE_FILE Source file path to compare. [required]
TARGET_FILE Target file path to compare. [required]
Options:
--line-sm [standard|heckel|displacement]
Choose sequence matching method to detect
differences between lines. [default:
heckel]
--inline-sm [standard|heckel|displacement]
Choose sequence matching method to detect
in-line differences between similar lines.
[default: heckel]
--cutoff FLOAT RANGE Line similarity ratio cutoff. If value
exceeded then finds in-line differences in
similar lines. [default: 0.75; 0.0<=x<=1.0]
--gui / --no-gui Open GUI window with diff result. [default:
no-gui]
--case-sensitive / --no-case-sensitive
Whether diff is going to be case sensitive.
[default: case-sensitive]
--char-mode [utf8|ascii] Terminal character set used when printing
diff result. [default: utf8]
--color-mode [fore|back] Terminal color mode used when printing diff
result. [default: fore]
--install-completion [bash|zsh|fish|powershell|pwsh]
Install completion for the specified shell.
--show-completion [bash|zsh|fish|powershell|pwsh]
Show completion for the specified shell, to
copy it or customize the installation.
--help Show this message and exit.
Example
Sample output for mdiff a.txt b.txt
command:
Showing result in GUI mdiff a.txt b.txt --gui
:
Standalone GUI application
Standalone app provides simple text editor, which allows generating diff based on user input.
Use mdiff-gui
command to launch GUI application in interpreter mode, or use precompiled version from here.
Building Standalone GUI application
GUI application can be built directly from source by typing below commands:
> pip install -r requirements.txt
> ./pyinstaler_start.sh
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file mdiff-0.1.5.tar.gz
.
File metadata
- Download URL: mdiff-0.1.5.tar.gz
- Upload date:
- Size: 32.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 567fe1f0e1d61bddc33f60d1f4b5d342f518210120c7e7d4b2e12942157d8f7b |
|
MD5 | f1ca31d82cfd7588fffb010fa391c7de |
|
BLAKE2b-256 | 8f801b7c27efa983f510588665a2c7b7dab394c63886533b5ecb43e8d937f29f |