Skip to main content

A Rust implementation of Python's difflib.unified_diff with PyO3 bindings

Project description

difflib-rs

A high-performance Rust implementation of Python's difflib.unified_diff function with PyO3 bindings.

Overview

This package provides a Rust-based implementation of the unified diff algorithm, offering significant performance improvements over Python's built-in difflib module while maintaining API compatibility.

Features

  • 🚀 3-5x Faster: Consistently outperforms Python's difflib across all file sizes and change patterns
  • 100% Compatible: Drop-in replacement for difflib.unified_diff with identical output
  • Thoroughly Tested: Comprehensive test suite ensuring byte-for-byte compatibility with Python's implementation
  • Easy to use: Simple Python API with PyO3 bindings

Performance

The Rust implementation consistently outperforms Python's built-in difflib module while producing identical output:

Benchmark Results (Baseline - HashMap Implementation)

Small to Medium Files (10% changes)

File Size Python Time Rust Time Speedup Output Lines
100 lines 0.0001s 0.00003s 2.96x 71
500 lines 0.0005s 0.00013s 3.89x 300
1,000 lines 0.0011s 0.00024s 4.65x 587
2,000 lines 0.0023s 0.00056s 4.14x 1,222

Files with Heavy Changes (50% changes)

File Size Python Time Rust Time Speedup Output Lines
100 lines 0.0001s 0.00002s 5.72x 131
500 lines 0.0009s 0.00019s 4.62x 655
1,000 lines 0.0017s 0.00036s 4.76x 1,285

Large Files with Few Changes

File Size Changes Python Time Rust Time Speedup Output Lines
5,000 lines 5 0.0025s 0.00073s 3.43x 47
10,000 lines 5 0.0051s 0.00148s 3.44x 47
20,000 lines 5 0.0086s 0.00295s 2.92x 47

Large Files with Medium Changes (5% changed)

File Size Changes Python Time Rust Time Speedup Output Lines
5,000 lines 250 0.0068s 0.00138s 4.91x 1,869
10,000 lines 500 0.0146s 0.00285s 5.12x 3,793
20,000 lines 1,000 0.0357s 0.00680s 5.25x 7,569

Special Cases

Test Case Python Time Rust Time Speedup
Identical sequences (5,000 lines) 0.00145s 0.00033s 4.43x
Completely different (1,000 lines) 0.00030s 0.00020s 1.50x

Key Optimizations

The performance improvements come from:

  • FxHashMap (Firefox's fast hash) instead of Python's dict for sparse representation
  • Efficient HashMap swapping to avoid allocations (using std::mem::swap)
  • Queue-based matching algorithm for better cache locality
  • Optimized string operations leveraging Rust's zero-cost abstractions
  • Popularity heuristic to skip overly common elements (matches Python's algorithm)

Installation

From pip

pip install difflib-rs

Build from source

# Clone the repository
git clone https://github.com/sweepai/difflib-rs.git
cd difflib-rs

# Set up virtual environment
python -m venv venv
source venv/bin/activate

# Install build dependencies
pip install maturin pytest

# Build and install
maturin develop --release

Usage

This is a drop-in replacement for Python's difflib.unified_diff. Simply replace your import:

- from difflib import unified_diff
+ from difflib_rs import unified_diff

# Compare two sequences of lines
a = ['line1', 'line2', 'line3']
b = ['line1', 'modified', 'line3']

diff = unified_diff(
    a, b,
    fromfile='original.txt',
    tofile='modified.txt',
    fromfiledate='2023-01-01',
    tofiledate='2023-01-02'
)

for line in diff:
    print(line, end='')

Note: Currently only unified_diff is supported. Other difflib functions are not implemented.

Extra: String-based API

For additional convenience, use unified_diff_str directly with (unsplit) strings:

from difflib_rs import unified_diff_str

# Compare two strings directly - no need to split first!
text_a = """line1
line2
line3"""

text_b = """line1
modified
line3"""

# The function handles splitting internally (more efficient)
diff = unified_diff_str(
    text_a, text_b,
    fromfile='original.txt',
    tofile='modified.txt',
    keepends=False  # Whether to keep line endings in the diff
)

for line in diff:
    print(line, end='')

The unified_diff_str function:

  • Takes strings directly instead of lists
  • Handles line splitting internally in Rust (faster than Python's splitlines())
  • Supports \n, \r\n, and \r line endings
  • Has a keepends parameter to preserve line endings in the output

String Splitting Performance

Performance comparison of unified_diff_str vs unified_diff with Python splitlines():

File Size Python split + Rust diff All Rust (unified_diff_str) Speedup
100 lines 0.0000s 0.0000s 1.88x
500 lines 0.0002s 0.0001s 1.39x
1000 lines 0.0003s 0.0003s 1.27x
2000 lines 0.0007s 0.0006s 1.22x

API

The unified_diff function accepts the same parameters as Python's difflib.unified_diff:

  • a, b: Sequences of lines to compare
  • fromfile, tofile: Filenames for the diff header
  • fromfiledate, tofiledate: File modification dates
  • n: Number of context lines (default: 3)
  • lineterm: Line terminator (default: '\n')

Development

# Activate virtual environment
source venv/bin/activate

# Run tests
python -m pytest tests/ -v

# Run benchmarks
python -m pytest tests/test_benchmark.py -s

# Build the package with optimizations
maturin develop --release

Author

Everything in this project was written by Sweep AI, an AI agent for Jetbrains IDEs.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (258.5 kB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (255.7 kB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

difflib_rs-0.1.1-cp310-abi3-macosx_11_0_arm64.whl (245.7 kB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

File details

Details for the file difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 de84c7622c7f5248181c49782575110262968fc854a50407a516a97e7ef7c488
MD5 ebee4c942c94914ef92436cced37cd83
BLAKE2b-256 47fbe0aae3ab5e11487b6a44c583acca395893c17c055e3336432d3176ec0e2a

See more details on using hashes here.

File details

Details for the file difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for difflib_rs-0.1.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 9bc80e6e325b67b8727d2b8358301a0911d6056ef60788429ee974bd3d723c01
MD5 bbfeabbb91707a606eea49f178179be5
BLAKE2b-256 ee154927523431c996422345d686a9bb3f8a09f16c5ea4b1c711c46f30223b4c

See more details on using hashes here.

File details

Details for the file difflib_rs-0.1.1-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for difflib_rs-0.1.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 986203a711871b9b93e3a883353ec944c885838f334e4a80b7343ec6fde530b0
MD5 790c0a5891df933fe7f9de313166d8f3
BLAKE2b-256 07a06bcb9fc1dd863ed72f97e7d552de1083adfa42d28e14f3714328e708175f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page