Python wrapper for the similar Rust diffing library

rapidiff

A Python diff library backed by Rust. It provides a difflib.SequenceMatcher-style API with faster line, character, and word comparisons for common Python workloads.

Features

  • Multiple algorithms: Myers, Patience, LCS, and Google diff-match-patch
  • Fast Python API: public Python calls benchmark faster than difflib on the included correctness-gated scenarios
  • Multiple tokenization modes: lines, words, characters, graphemes, unicode words
  • Structured spans: get_diff() returns old/new spans with source-text positions and original text slices
  • Difflib compatibility checks: opcodes, ratios, unified/context diffs, and sequence setters are covered by regression tests
  • Rust 2024 extension: built with current PyO3 and maturin

Installation

From PyPI

pip install rapidiff

From Source

Requirements:

  • Python 3.10+
  • Rust 1.95+ recommended
  • maturin

git clone https://github.com/imvladikon/rapidiff
cd rapidiff

python -m venv venv
./venv/bin/python -m pip install "maturin[patchelf]" pytest hypothesis

VIRTUAL_ENV="$PWD/venv" ./venv/bin/python -m maturin develop --release

# Or build a wheel.
./venv/bin/python -m maturin build --release

Quick Start

Basic Usage

from rapidiff import SequenceMatcher

sm = SequenceMatcher(
    a="Hello World\nLine 2", 
    b="Hallo Welt\nLine 2", 
    algorithm="myers", 
    mode="chars"
)

print(f"Similarity: {sm.ratio():.2%}")

for opcode in sm.get_opcodes():
    print(opcode)
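
Each opcode follows the difflib convention of a (tag, i1, i2, j1, j2) tuple, where a[i1:i2] is the affected range in the old text and b[j1:j2] the range in the new text. The sketch below uses the standard library's difflib.SequenceMatcher as a stand-in so it runs without the extension installed; it assumes rapidiff's chars-mode opcodes index into the input strings the same way. The helper name apply_opcodes is illustrative and not part of rapidiff.

```python
from difflib import SequenceMatcher  # stand-in; assumption: rapidiff opcodes share this layout


def apply_opcodes(a: str, b: str, opcodes) -> str:
    """Rebuild b from a by walking (tag, i1, i2, j1, j2) opcodes in order."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.append(a[i1:i2])   # unchanged text comes from the old side
        elif tag in ("replace", "insert"):
            out.append(b[j1:j2])   # new text comes from the new side
        # "delete": a[i1:i2] is dropped, so nothing is emitted
    return "".join(out)


a = "Hello World\nLine 2"
b = "Hallo Welt\nLine 2"
opcodes = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
rebuilt = apply_opcodes(a, b, opcodes)
print(rebuilt == b)
```

If the opcodes are well formed, replaying them over the old text always yields the new text, which makes this a cheap sanity check for any diff backend.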

Algorithm Comparison

from rapidiff import SequenceMatcher

old_text = "The quick brown fox jumps over the lazy dog"
new_text = "The fast brown fox leaps over the lazy cat"

algorithms = ['myers', 'patience', 'lcs', 'google']

for algorithm in algorithms:
    sm = SequenceMatcher(a=old_text, b=new_text, algorithm=algorithm, mode='words')
    print(f"{algorithm:8s}: {sm.ratio():.3f}")

Different Tokenization Modes

from rapidiff import SequenceMatcher

text1 = "Hello world! 🌍"
text2 = "Hello earth! 🌎"

modes = ['chars', 'words', 'unicode_words', 'graphemes']

for mode in modes:
    sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode=mode)
    print(f"{mode:12s}: {sm.ratio():.3f}")

Large Text Performance

from rapidiff import SequenceMatcher
import time

# Load large texts (e.g., from Project Gutenberg)
with open('war_and_peace.txt', 'r', encoding='utf-8') as f:
    text1 = f.read()
with open('war_and_peace_modified.txt', 'r', encoding='utf-8') as f:
    text2 = f.read()

# perf_counter() is a monotonic, high-resolution clock suited to timing
start = time.perf_counter()
sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode='lines')
ratio = sm.ratio()
elapsed = time.perf_counter() - start

print(f"Processed {len(text1)} chars in {elapsed:.2f}s")
print(f"Similarity: {ratio:.1%}")

Advanced Usage

Working with Different Text Types

from rapidiff import SequenceMatcher

# Code diffing
code1 = "def hello():\n    print('world')"
code2 = "def hello():\n    print('universe')"
sm = SequenceMatcher(a=code1, b=code2, algorithm='patience', mode='lines')

# Document diffing  
doc1 = "The cat sat on the mat."
doc2 = "A cat was sitting on the rug."
sm = SequenceMatcher(a=doc1, b=doc2, algorithm='google', mode='words')

# Character-level analysis
text1 = "café"
text2 = "coffee" 
sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode='graphemes')
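
Why a grapheme mode matters for text like "café" can be seen with the standard library alone. The sketch below shows that the same visible word can be stored as four or five code points depending on normalization. It assumes that chars mode compares code points, which this README does not state explicitly, so treat it as background rather than a description of rapidiff internals.

```python
import unicodedata

composed = "caf\u00e9"                               # "é" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by a combining accent

print(len(composed), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed[-2:]])
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Normalizing both inputs to the same form (for example NFC) before diffing is one way to keep a code-point comparison from reporting changes that are invisible on screen.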

API Compatibility

from rapidiff import SequenceMatcher, unified_diff

sm = SequenceMatcher(None, "", "", algorithm="myers", mode="chars")

sm.set_seq1("hello")
sm.set_seq2("world")
sm.set_seqs("hello", "hello")

print(sm.ratio())
print(sm.quick_ratio())
print(sm.real_quick_ratio())

diff_results = sm.get_diff()
for result in diff_results:
    print(f"Old spans: {len(result.old_spans)}")
    print(f"New spans: {len(result.new_spans)}")

old_lines = ["one\n", "two\n"]
new_lines = ["one\n", "2\n"]
print("".join(unified_diff(old_lines, new_lines)))
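
For reference, difflib defines ratio() as 2*M/T, where M is the number of matched elements and T is the combined length of both sequences; quick_ratio() and real_quick_ratio() are progressively cheaper upper bounds on it. The sketch below checks those relationships using difflib itself as a stand-in; that rapidiff reproduces the same numbers for a given input is the compatibility goal described in this README, not something this snippet verifies.

```python
from difflib import SequenceMatcher  # stand-in used to illustrate the definitions

first, second = "hello", "world"
sm = SequenceMatcher(None, first, second, autojunk=False)

# M: total size of all matching blocks (the final block is a zero-size sentinel).
matched = sum(block.size for block in sm.get_matching_blocks())
# T: combined number of elements in both sequences.
total = len(first) + len(second)
manual_ratio = 2 * matched / total

print(manual_ratio, sm.ratio())
print(sm.ratio() <= sm.quick_ratio() <= sm.real_quick_ratio())
```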

Algorithms & Modes

Available Algorithms

Algorithm | Backend | Rust crate | Best For
myers | Myers diff through similar::TextDiff / capture_diff_slices | similar (docs.rs) | Default general-purpose text diffing
patience | Patience diff through similar::TextDiff / capture_diff_slices | similar (docs.rs) | Code or documents with many unique stable tokens
lcs | Longest Common Subsequence through similar::TextDiff / capture_diff_slices | similar (docs.rs) | Simple sequence similarity and compatibility checks
google, google_efficient | diff-match-patch efficient implementation | diff-match-patch-rs (docs.rs) | Web/content-style text with flexible cleanup heuristics

The Python extension layer is built with PyO3 and maturin. Public Python calls use the Rust extension module; the benchmark scripts report timings through that Python API rather than timing Rust functions directly.

For google, non-ASCII inputs fall back to the similar Myers backend internally so Unicode spans and opcodes remain valid Python strings. rapidiff also contains an internal wagner_fisher implementation for reference tests and diagnostics. It is intentionally not documented as a public algorithm because it uses dynamic programming and is not intended for large user inputs.
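
A likely motivation for this kind of fallback is the gap between byte offsets and Python string indices: UTF-8 encodes non-ASCII characters in more than one byte, so an offset counted in bytes does not index a Python str directly. The sketch below illustrates that gap and an ASCII-only dispatch rule. Both pick_backend and byte_to_char_offset are hypothetical names invented for this example; they are not rapidiff functions and do not describe its actual implementation.

```python
def pick_backend(a: str, b: str, algorithm: str) -> str:
    """Hypothetical dispatch: keep the google backend only for pure-ASCII input."""
    if algorithm == "google" and not (a.isascii() and b.isascii()):
        return "myers"
    return algorithm


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python str index.

    The offset must fall on a character boundary, otherwise decode() raises.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "caf\u00e9 au lait"
byte_pos = text.encode("utf-8").index(b" au")    # counted in bytes
char_pos = byte_to_char_offset(text, byte_pos)   # counted in code points
print(byte_pos, char_pos, pick_backend(text, "tea", "google"))
```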

Available Modes

Mode | Description | Use Case
lines | Split by newlines | File/document comparison
words | Split by whitespace | Text content analysis
chars | Character by character | Detailed text analysis
graphemes | Unicode grapheme clusters | International text
unicode_words | Whitespace word mode kept for API compatibility | Multi-language content
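
To see how the choice of mode changes the result, the sketch below approximates three of the modes with plain Python tokenizers and feeds the token lists to difflib as a stand-in. The tokenize helper is hypothetical, and rapidiff's exact splitting rules (for example whether line endings are kept on each line) are an assumption here. Grapheme segmentation is left out because the standard library has no grapheme splitter.

```python
from difflib import SequenceMatcher  # stand-in for comparing token lists


def tokenize(text: str, mode: str) -> list[str]:
    """Approximate tokenizers for illustration only."""
    if mode == "lines":
        return text.splitlines(keepends=True)
    if mode == "words":
        return text.split()
    if mode == "chars":
        return list(text)
    raise ValueError(f"unsupported mode in this sketch: {mode}")


old = "The quick brown fox\njumps over the lazy dog\n"
new = "The quick brown fox\nleaps over the lazy dog\n"

for mode in ("lines", "words", "chars"):
    a, b = tokenize(old, mode), tokenize(new, mode)
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    print(f"{mode:6s} tokens={len(a):3d} ratio={ratio:.3f}")
```

Coarser tokens make a single edit cost more: one changed word invalidates a whole line in lines mode but only one of nine tokens in words mode.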

Testing

Run the comprehensive test suite:

VIRTUAL_ENV="$PWD/venv" ./venv/bin/python -m maturin develop --release
./venv/bin/python -m pytest tests -q
cargo clippy -- -D warnings

Current QA includes:

  • Core functionality: all algorithms and modes
  • Difflib compatibility: ratios, opcodes, grouped opcodes, matching blocks, unified diff, context diff
  • Span invariants: get_diff() spans must match source-text slices by their positions
  • Tokenization edge cases: whitespace, tabs, Unicode, emojis, line endings
  • Large text stress: Project Gutenberg-style text with tail insertions
  • Edge cases: corruption scenarios, paraphrasing, empty sequences
  • Property-based testing: Hypothesis-powered ratio and span invariants
  • Examples validation: example code runs without errors
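
The span invariant listed above can be stated in a few lines: for every span, slicing the source text at the span's offsets must reproduce the span's text. The sketch below demonstrates the check with a whitespace tokenizer that records offsets. The Span dataclass, its field names, and both helper functions are hypothetical; the attribute names rapidiff uses on its own span objects are not documented in this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span record: half-open [start, end) offsets plus the slice."""
    start: int
    end: int
    text: str


def word_spans(source: str) -> list[Span]:
    """Tokenize on whitespace while keeping each word's position in source."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", source)]


def check_spans(source: str, spans: list[Span]) -> None:
    """Assert the invariant: every span's text equals the slice at its offsets."""
    for span in spans:
        assert source[span.start:span.end] == span.text, span


source = "alpha\t\tbeta   gamma\n"
spans = word_spans(source)
check_spans(source, spans)
print([(s.start, s.end, s.text) for s in spans])
```

Because the offsets refer to the untouched source, tabs and repeated spaces between words are never rewritten, which is the property the whitespace edge-case tests are meant to protect.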

Performance

Run the Python-level benchmark:

./venv/bin/python scripts/benchmark_rapidiff.py

The benchmark first verifies that rapidiff and difflib.SequenceMatcher(...) produce matching ratios and opcodes for each scenario. It then reports median timings and IQR. Latest local run on CPython 3.12.3:

Scenario | Action | rapidiff | difflib | Speedup
lines-2k-sparse | ratio | 1.425ms | 29.281ms | 20.55x
lines-2k-sparse | opcodes | 1.510ms | 28.906ms | 19.14x
lines-2k-sparse | ratio+opcodes | 1.524ms | 29.110ms | 19.10x
chars-3k-unique | ratio | 0.746ms | 13.890ms | 18.62x
chars-3k-unique | opcodes | 0.765ms | 14.018ms | 18.32x
chars-3k-unique | ratio+opcodes | 0.998ms | 13.890ms | 13.92x
words-4k-unique | ratio | 6.443ms | 37.737ms | 5.86x
words-4k-unique | opcodes | 6.520ms | 37.609ms | 5.77x
words-4k-unique | ratio+opcodes | 6.626ms | 37.591ms | 5.67x

The benchmark measures the public Python API, including matcher creation. ratio+opcodes benefits from cached comparison data inside SequenceMatcher.
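
The methodology described here (a correctness gate first, then median and IQR over repeated calls that include matcher construction) can be sketched as follows. This is not the project's benchmark script: time_call, reference, and candidate are hypothetical names, and candidate simply reuses the difflib reference so the example runs without the extension installed.

```python
import statistics
import time
from difflib import SequenceMatcher


def time_call(fn, repeats: int = 15) -> tuple[float, float]:
    """Return (median, IQR) of wall-clock seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return statistics.median(samples), q3 - q1


# Synthetic scenario: 500 lines with a sparse change every 50th line.
old = [f"line {i}\n" for i in range(500)]
new = [line if i % 50 else f"changed {i}\n" for i, line in enumerate(old)]


def reference():
    # Matcher creation is inside the timed call, as in the README's description.
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return sm.ratio(), sm.get_opcodes()


def candidate():
    # Placeholder: swap in the implementation under test here.
    return reference()


# Correctness gate: refuse to time a candidate whose output differs.
assert candidate() == reference()

median, iqr = time_call(candidate)
print(f"median={median * 1e3:.3f}ms IQR={iqr * 1e3:.3f}ms")
```

Reporting the median with IQR rather than the mean keeps a single slow run, for example one interrupted by the scheduler, from distorting the result.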

For scaling checks, run:

./venv/bin/python scripts/benchmark_scaling.py

This writes docs/performance_scaling.md and docs/performance_scaling.svg. The plot uses median Python-level ratio+opcodes time on the X axis and sequence length on the Y axis; length is line count for lines, character count for chars, and word count for words. Color groups each mode, while line style and marker shape distinguish rapidiff from builtin difflib. The timing script correctness-checks every scenario against difflib before measuring.

[Figure: rapidiff vs difflib scaling]

Latest local scaling run:

Mode | Length | rapidiff median | difflib median | Speedup
lines | 250 | 0.203ms | 2.148ms | 10.56x
lines | 500 | 0.377ms | 4.635ms | 12.28x
lines | 1,000 | 0.640ms | 9.385ms | 14.68x
lines | 2,000 | 1.834ms | 18.858ms | 10.28x
lines | 4,000 | 3.079ms | 38.686ms | 12.56x
chars | 500 | 0.199ms | 2.231ms | 11.20x
chars | 1,000 | 0.380ms | 4.342ms | 11.42x
chars | 2,000 | 0.754ms | 9.016ms | 11.95x
chars | 4,000 | 1.499ms | 18.307ms | 12.21x
chars | 8,000 | 3.217ms | 37.182ms | 11.56x
words | 500 | 0.851ms | 4.163ms | 4.89x
words | 1,000 | 1.563ms | 9.242ms | 5.91x
words | 2,000 | 2.994ms | 18.000ms | 6.01x
words | 4,000 | 5.957ms | 37.064ms | 6.22x
words | 8,000 | 12.121ms | 76.059ms | 6.27x

Fixed Issues

This version fixes several critical issues found in span/offset calculations:

  • Operation detection: correctly identifies replace/insert/delete operations
  • Similarity ratios: stay within the 0.0-1.0 range
  • Span calculations: span text is extracted from the original source text by position
  • Whitespace handling: word spans preserve original tabs and repeated spaces
  • Unicode handling: supports international text, emoji, and grapheme mode
  • Large text handling: stress-tested on a War and Peace-sized text with tail-only changes
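
Several of the properties in this list can be checked mechanically. The sketch below is a small validator for difflib-style opcodes: every tag must be one of the four known operations, consecutive opcodes must tile both sequences without gaps or overlaps, and each tag must have the range shape it implies. It is demonstrated on difflib output as a stand-in; check_opcodes is a hypothetical helper and not part of rapidiff's test suite.

```python
from difflib import SequenceMatcher  # stand-in producing opcodes to validate

VALID_TAGS = {"equal", "replace", "insert", "delete"}


def check_opcodes(a, b, opcodes) -> None:
    """Raise AssertionError unless opcodes tile a and b exactly once, in order."""
    pos_a = pos_b = 0
    for tag, i1, i2, j1, j2 in opcodes:
        assert tag in VALID_TAGS, tag
        assert (i1, j1) == (pos_a, pos_b), "gap or overlap between opcodes"
        if tag == "equal":
            assert a[i1:i2] == b[j1:j2]
        elif tag == "insert":
            assert i1 == i2 and j1 < j2
        elif tag == "delete":
            assert i1 < i2 and j1 == j2
        else:  # replace
            assert i1 < i2 and j1 < j2
        pos_a, pos_b = i2, j2
    assert (pos_a, pos_b) == (len(a), len(b)), "opcodes stop before the end"


a = "The cat sat on the mat.".split()
b = "A cat was sitting on the rug.".split()
sm = SequenceMatcher(None, a, b, autojunk=False)
check_opcodes(a, b, sm.get_opcodes())
assert 0.0 <= sm.ratio() <= 1.0
print(sm.get_opcodes())
```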

API Reference

SequenceMatcher Class

class SequenceMatcher:
    def __init__(
        self,
        isjunk=None,
        a: str = "",
        b: str = "",
        autojunk: bool = True,
        algorithm: str = "myers",
        mode: str = "lines",
    ):
        """Initialize sequence matcher."""
        
    def ratio(self) -> float:
        """Return similarity ratio between 0.0 and 1.0."""
        
    def quick_ratio(self) -> float:
        """Return upper bound estimate of ratio."""
        
    def get_opcodes(self) -> list[tuple[str, int, int, int, int]]:
        """Return list of diff operations."""
        
    def get_diff(self) -> list["DiffResult"]:
        """Return structured diff results with spans."""
        
    def set_seq1(self, a: str) -> None:
        """Set first sequence."""
        
    def set_seq2(self, b: str) -> None:
        """Set second sequence."""
        
    def set_seqs(self, a: str, b: str) -> None:
        """Set both sequences."""

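The setters make it possible to reuse one matcher across many comparisons. The sketch below shows the usual difflib pattern: fix the second sequence once, swap the first sequence per candidate, and use the cheap upper bounds to skip full comparisons that cannot win. It runs against difflib as a stand-in, where the second sequence is preprocessed and cached; whether rapidiff gains the same benefit from this ordering is an assumption. The best_match helper is hypothetical.

```python
from difflib import SequenceMatcher  # stand-in exposing the same setter names


def best_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to query together with its ratio."""
    sm = SequenceMatcher(None, "", query, autojunk=False)  # fix seq2 once
    best, best_ratio = "", -1.0
    for candidate in candidates:
        sm.set_seq1(candidate)
        # Cheap upper bounds first: skip ratio() when they cannot beat the best.
        if sm.real_quick_ratio() <= best_ratio or sm.quick_ratio() <= best_ratio:
            continue
        ratio = sm.ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best, best_ratio


print(best_match("appel", ["ape", "apple", "peach", "puppy"]))
```
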
License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (python -m pytest tests/ -v)
  4. Commit changes (git commit -m 'Add amazing feature')
  5. Push to branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Project Status

  • Version: 0.3.0
  • Status: Release candidate
  • Test Coverage: 85 tests passing, 6 benchmark/hyperfine tests skipped when unavailable
  • Installation: Local wheel and editable maturin install verified
  • Performance: Benchmarked against Python's difflib
  • Documentation: README, benchmark script, examples, and usage guides
  • Compatibility: Python 3.10+ on Linux, macOS, Windows
