Skip to main content

Sequence matcher with displacement detection.

Project description

mdiff

mdiff is a package for finding difference between two input sequences with ability to detect sequence elements displacements. The features are:

  • New SequenceMatcher class compatible with python built-in difflib.SequenceMatcher class.
  • CLI for using package as a tool for comparing files.

Installation

For plain python package (no additional dependencies):

pip install mdiff

For additional CLI tool and GUI functionalities (uses external packages such as colorama, or Typer):

pip install mdiff[tools]

Usage

Sequence Matching

HeckelSequenceMatcher is a class for comparing pairs of sequences of any type, as long as sequences are comparable and hashable. Unlike builtin difflib.SequenceMatcher, it detects and marks elements displacement between sequences. This class provides get_opcodes() method which returns Sequence of opcodes with differences between sequences, similar as difflib.SequenceMatcher.get_opcodes() does, but with additional move and moved tags for displaced elements.

HeckelSequenceMatcher implements Paul Heckel's algorithm described in "A Technique for Isolating Differences Between Files" paper, which can be found here.

HeckelSequenceMatcher(a: Sequence[Any] = '', b: Sequence[Any] = '', replace_mode=True)

Initialize sequence matcher object, parameters:

  • a - source(old) sequence.
  • b - target(new) sequence.
  • replace_mode - if True: it merges consecutive pairs of insert and delete blocks into replace operation. Remains insert and delete opcodes otherwise.

get_opcodes() -> List[OpCode]

Returns list of OpCode objects describing how to turn sequence a into b. OpCode consists of attributes: tag, i1, i2, j1, j2. OpCode can be unpacked as tuple.

Usually the first tuple has i1 == j1 == 0, and remaining tuples have i1 equal to the i2 from the preceding tuple, and, likewise, j1 equal to the previous j2. However, this rule is broken when move and moved tags appears in OpCodes list, due to sequence elements displacement detection.

The tags are strings, with these meanings:

  • replace - a[i1:i2] should be replaced by b[j1:j2]
  • delete - a[i1:i2] should be deleted. Note that j1==j2 in this case.
  • insert - b[j1:j2] should be inserted at a[i1:i1]. Note that i1==i2 in this case.
  • equal - a[i1:i2] == b[j1:j2]
  • move - a[i1:i2] should be moved to b[j1:j2] position. Note that j1==j2 in this case.
  • moved - is opposite tag for move. It's not an operation necessary for turning sequence a into b. It indicates that b[j1:j2] is moved from i1 position (or b[j1:j2] should be moved back to a[i1:i2]). Note that i1==j2 in this case. It can be used for sequence elements movement visualisation.

Examples:

from mdiff import HeckelSequenceMatcher

a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
sm = HeckelSequenceMatcher(a, b)
opcodes = sm.get_opcodes()
for opcode in opcodes:
    print(opcode)
OpCode('equal', 0, 1, 0, 1)
OpCode('move', 1, 2, 2, 2)
OpCode('equal', 2, 3, 1, 2)
OpCode('moved', 1, 1, 2, 3)
OpCode('equal', 3, 4, 3, 4)
OpCode('replace', 4, 5, 4, 5)

Extracting changes from input sequences:

...
for tag, i1, i2, j1, j2 in opcodes:
    print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
equal     a[0:1] --> b[0:1]  ['line1'] --> ['line1']
move      a[1:2] --> b[2:2]  ['line2'] --> []
equal     a[2:3] --> b[1:2]  ['line3'] --> ['line3']
moved     a[1:1] --> b[2:3]         [] --> ['line2']
equal     a[3:4] --> b[3:4]  ['line4'] --> ['line4']
replace   a[4:5] --> b[4:5]  ['line5'] --> ['line6']

DisplacementSequenceMatcher

DisplacementSequenceMatcher is a variation of HeckelSequenceMatcher class. The algorithm keeps tracking of every sequence element occurrence, which might give better result when both sequences have common duplicated elements. It tries to detect all sequences elements displacements, where HeckelSequenceMatcher might sometimes treat displaced elements as delete/insert. Use this class if finding all sequences displacements is crucial.

Text diff

diff_lines_with_similarities

This function takes two input text sequences, turns them into lists of lines, generates opcodes for those lines and tries to find single characters differences in similar lines.

Parameters:

  • a: str - source text.
  • b: str - target text.
  • cutoff: float = 0.75 - value in range [0:1], where 0.0 means that lines are completely different and 1.0 means that lines are exactly the same. Line similarity cutoff is used to determine if sub opcodes for similar lines should be generated. If cutoff == 1, then in-line diff won't be generated.
  • line_sm: SequenceMatcherBase = None - SequenceMatcher object used to find differences between input texts lines. HeckelSequenceMatcher() will be used if not specified.
  • inline_sm: SequenceMatcherBase = None - SequenceMatcher object used to find differences between similar lines (i.e. using difflib.SequenceMatcher when in-line diff displacement detection is not desirable). difflib.SequenceMatcher() will be used if not specified.

Returns Tuple[a_lines: List[str], b_lines: List[str], opcodes: List[CompositeOpCode]] where:

  • a_lines - is list of lines from a input text sequence.
  • b_lines - is list of lines from b input text sequence.
  • opcodes - is list of CompositeOpCode which behave the same way as OpCode (has tag i1 i2 j1 j2 fields and can be unpacked), but has additional children_opcodes which stores list of nested opcodes with SequenceMatcher result for similar lines. List is empty if lines were not similar enough. (note that similar lines opcodes are generated only for replace tags, so children_opcodes list will be empty for every other tag).

Example

from mdiff import diff_lines_with_similarities, CompositeOpCode

a = 'line1\nline2\nline3\nline4\nline5'
b = 'line1\nline3\nline2\nline4\nline6'
a_lines, b_lines, opcodes = diff_lines_with_similarities(a, b, cutoff=0.75)

for opcode in opcodes:
    tag, i1, i2, j1, j2 = opcode
    print('{:7}   a_lines[{}:{}] --> b_lines[{}:{}] {!r:>10} --> {!r}'.
          format(tag, i1, i2, j1, j2, a_lines[i1:i2], b_lines[j1:j2]))
    if isinstance(opcode, CompositeOpCode) and opcode.children_opcodes:
        for ltag, li1, li2, lj1, lj2 in opcode.children_opcodes:
            print('\t{:7}   a_lines[{}][{}:{}] --> b_lines[{}][{}:{}] {!r:>10} --> {!r}'
                  .format(ltag, i1, li1, li2, j1, lj1, lj2, a_lines[i1][li1:li2], b_lines[j1][lj1:lj2]))
equal     a_lines[0:1] --> b_lines[0:1]  ['line1'] --> ['line1']
move      a_lines[1:2] --> b_lines[2:2]  ['line2'] --> []
equal     a_lines[2:3] --> b_lines[1:2]  ['line3'] --> ['line3']
moved     a_lines[1:1] --> b_lines[2:3]         [] --> ['line2']
equal     a_lines[3:4] --> b_lines[3:4]  ['line4'] --> ['line4']
replace   a_lines[4:5] --> b_lines[4:5]  ['line5'] --> ['line6']
	equal     a_lines[4][0:4] --> b_lines[4][0:4]     'line' --> 'line'
	replace   a_lines[4][4:5] --> b_lines[4][4:5]        '5' --> '6'

Indented tags shows in-line differences, in this case line5 and line6 strings have the only difference at last character.

CLI Tool

mdiff also provides CLI tool (available only if installed using pip install mdiff[tools]). For more information type mdiff --help

Usage: mdiff [OPTIONS] SOURCE_FILE TARGET_FILE

  Reads 2 files from provided paths, compares their content and prints diff.
  If compared lines in text files are similar enough (exceed cutoff) then
  extracts in-line diff.

  There are few possible strategies to choose to use independently in line-
  level and in-line-level diff:

      standard: uses built in python SequenceMatcher object to generate diff,
      elements movement detection not supported.

      heckel: detects elements movement in a human-readable form, might not
      catch all of moves and differences.

      displacement: detects all differences and movements, might not be very
      useful when both input files contains     many common lines (for example
      many empty newlines).

Arguments:
  SOURCE_FILE  Source file path to compare.  [required]
  TARGET_FILE  Target file path to compare.  [required]

Options:
  --line-sm [standard|heckel|displacement]
                                  Choose sequence matching method to detect
                                  differences between lines.  [default:
                                  heckel]
  --inline-sm [standard|heckel|displacement]
                                  Choose sequence matching method to detect
                                  in-line differences between similar lines.
                                  [default: heckel]
  --cutoff FLOAT RANGE            Line similarity ratio cutoff. If value
                                  exceeded then finds in-line differences in
                                  similar lines.  [default: 0.75; 0.0<=x<=1.0]
  --gui / --no-gui                Open GUI window with diff result.  [default:
                                  no-gui]
  --case-sensitive / --no-case-sensitive
                                  Whether diff is going to be case sensitive.
                                  [default: case-sensitive]
  --char-mode [utf8|ascii]        Terminal character set used when printing
                                  diff result.  [default: utf8]
  --color-mode [fore|back]        Terminal color mode used when printing diff
                                  result.  [default: fore]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.

Example

Sample output for mdiff a.txt b.txt command:

Showing result in GUI mdiff a.txt b.txt --gui:

GUI

Use mdiff-gui command to launch GUI application.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mdiff-0.1.1.tar.gz (29.8 kB view details)

Uploaded Source

File details

Details for the file mdiff-0.1.1.tar.gz.

File metadata

  • Download URL: mdiff-0.1.1.tar.gz
  • Upload date:
  • Size: 29.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for mdiff-0.1.1.tar.gz
Algorithm Hash digest
SHA256 6ffc9a56ae6714a40bae173334e14a8b671eb15aacaf36d08804facba12e6276
MD5 fe7a89d4031884e8494307bf8a486d05
BLAKE2b-256 3118714f38a10ec597f36d07952fc9f45877ade46c88e6cd5685cf7b6c47845f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page