Fast large file synchronization inspired by rsync

The pdiffcopy program synchronizes large binary data files between Linux servers at blazing speeds by performing delta transfers and spreading its work over many CPU cores. It’s currently tested on Python 2.7, 3.5+ and PyPy (2.7) on Ubuntu Linux but is expected to work on most Linux systems.

Status

Although the first prototype of pdiffcopy was developed back in June 2019, it wasn’t until March 2020 that the first release was published as an open source project.

There are lots of features and improvements I’d love to add but more importantly the project needs to actually be used for a while before I’ll consider changing the alpha label to beta or mature.

Installation

The pdiffcopy package is available on PyPI which means installation should be as simple as:

$ pip install 'pdiffcopy[client,server]'

There’s actually a multitude of ways to install Python packages (e.g. the per user site-packages directory, virtual environments or just installing system wide) and I have no intention of getting into that discussion here, so if this intimidates you then read up on your options before returning to these instructions 😉.

The names between the square brackets (client and server) are called “extras” and they enable you to choose whether to install the client dependencies, server dependencies or both.

Command line

Usage: pdiffcopy [OPTIONS] [SOURCE, TARGET]

Synchronize large binary data files between Linux servers at blazing speeds by performing delta transfers and spreading the work over many CPU cores.

One of the SOURCE and TARGET arguments is expected to be the pathname of a local file and the other argument is expected to be a URL that provides the location of a remote pdiffcopy server and a remote filename. File data will be read from SOURCE and written to TARGET.

If no positional arguments are given the server is started.

Supported options:

Option                    Description
-b, --block-size=BYTES    Customize the block size of the delta transfer. Can be a plain integer number (bytes) or an expression like 5K, 1MiB, etc.
-m, --hash-method=NAME    Customize the hash method of the delta transfer (defaults to ‘sha1’ but supports all hash methods provided by the Python hashlib module).
-W, --whole-file          Disable the delta transfer algorithm (skips computing of hashing and downloads all blocks unconditionally).
-c, --concurrency=COUNT   Change the number of parallel block hash / copy operations.
-n, --dry-run             Scan for differences between the source and target file and report the similarity index, but don’t write any changed blocks to the target.
-B, --benchmark=COUNT     Evaluate the effectiveness of delta transfer by mutating the TARGET file (which must be a local file) and resynchronizing its contents. This process is repeated COUNT times, with varying similarity. At the end an overview is printed.
-l, --listen=ADDRESS      Listen on the specified IP:PORT or PORT.
-v, --verbose             Increase logging verbosity (can be repeated).
-q, --quiet               Decrease logging verbosity (can be repeated).
-h, --help                Show this message and exit.
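
To make the options above a bit more concrete, here is a minimal sketch of block-based delta detection. It is not pdiffcopy's actual implementation: the function names (block_hashes, changed_blocks, similarity_index) are hypothetical, the 1 MiB default block size is an assumption, and the way the similarity index is computed here (the fraction of source blocks that already match the target) is just one plausible definition.

```python
import hashlib


def block_hashes(path, block_size=1024 * 1024, hash_method="sha1"):
    """Return one hex digest per fixed-size block of the file at `path`."""
    digests = []
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            # hashlib.new() accepts any algorithm name that hashlib supports.
            digests.append(hashlib.new(hash_method, block).hexdigest())
    return digests


def changed_blocks(source_hashes, target_hashes):
    """Return the indexes of source blocks that differ from or are missing in the target."""
    return [
        index
        for index, digest in enumerate(source_hashes)
        if index >= len(target_hashes) or target_hashes[index] != digest
    ]


def similarity_index(source_hashes, target_hashes):
    """Fraction of source blocks that already match the target (assumed definition)."""
    if not source_hashes:
        return 1.0
    changed = len(changed_blocks(source_hashes, target_hashes))
    return 1.0 - float(changed) / len(source_hashes)
```

In this sketch a delta transfer would copy only the blocks returned by changed_blocks(), a dry run would stop after reporting similarity_index(), and a whole-file transfer would skip the hashing and copy every block.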

Benchmarks

The command line interface provides a simple way to evaluate the effectiveness of the delta transfer implementation and compare it against rsync. The tables in the following sections are based on that benchmark.

Low concurrency

Concurrency: 6 processes on 4 CPU cores
Disks: Magnetic storage (slow)
Filesize: 1.79 GiB

The following table shows the results of the benchmark on a 1.79 GiB datafile that’s synchronized between two bare metal servers that each have four CPU cores and spinning disks, where pdiffcopy was run with a concurrency of six [1]:

Delta   Data size   pdiffcopy       rsync
10%     183 MiB     3.20 seconds    38.55 seconds
20%     366 MiB     4.15 seconds    44.33 seconds
30%     549 MiB     5.17 seconds    49.63 seconds
40%     732 MiB     6.09 seconds    53.74 seconds
50%     916 MiB     6.99 seconds    57.49 seconds
60%     1.07 GiB    8.06 seconds    1 minute and 0.97 seconds
70%     1.25 GiB    9.06 seconds    1 minute and 2.38 seconds
80%     1.43 GiB    10.12 seconds   1 minute and 4.20 seconds
90%     1.61 GiB    10.89 seconds   1 minute and 3.80 seconds
100%    1.79 GiB    12.05 seconds   1 minute and 4.14 seconds

High concurrency

Concurrency: 10 processes on 48 CPU cores
Disks: NVMe (fast)
Filesize: 5.5 GiB

Here’s a benchmark based on a 5.5 GiB datafile that’s synchronized between two bare metal servers that each have 48 CPU cores and high-end NVMe disks, where pdiffcopy was run with a concurrency of ten:

Delta   Data size   pdiffcopy       rsync
10%     562 MiB     4.23 seconds    49.96 seconds
20%     1.10 GiB    6.76 seconds    1 minute and 2.38 seconds
30%     1.65 GiB    9.43 seconds    1 minute and 13.73 seconds
40%     2.20 GiB    12.41 seconds   1 minute and 19.67 seconds
50%     2.75 GiB    14.54 seconds   1 minute and 25.86 seconds
60%     3.29 GiB    17.21 seconds   1 minute and 26.97 seconds
70%     3.84 GiB    19.79 seconds   1 minute and 27.46 seconds
80%     4.39 GiB    23.10 seconds   1 minute and 26.15 seconds
90%     4.94 GiB    25.19 seconds   1 minute and 21.96 seconds
100%    5.43 GiB    27.82 seconds   1 minute and 19.17 seconds

This benchmark shows how well pdiffcopy can scale up its performance by running on a large number of CPU cores. Notice how the smaller the delta is, the bigger the edge is that pdiffcopy has over rsync? This is because pdiffcopy computes the differences between the local and remote file using many CPU cores at the same time. This operation requires only reading, and that parallelizes surprisingly well on modern NVMe disks.
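
The read-only nature of the hashing step is what makes it easy to parallelize, and the following minimal sketch illustrates the idea. Note the assumptions: the function names (hash_block, parallel_block_hashes) and the default values are hypothetical, and for the sake of a short, portable example it uses a thread pool instead of the worker processes that pdiffcopy itself uses. Threads still spread the hashing over several cores here because hashlib releases the GIL while hashing buffers larger than 2047 bytes.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def hash_block(path, offset, block_size, hash_method):
    """Hash one block; every task opens its own file handle and seeks to its offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read(block_size)
    return offset, hashlib.new(hash_method, data).hexdigest()


def parallel_block_hashes(path, block_size=1024 * 1024, hash_method="sha1", concurrency=4):
    """Return {offset: hex digest} for all blocks of `path`, hashed concurrently."""
    offsets = range(0, os.path.getsize(path), block_size)
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [
            executor.submit(hash_block, path, offset, block_size, hash_method)
            for offset in offsets
        ]
        # Collect the (offset, digest) pairs once all tasks have finished.
        return dict(future.result() for future in futures)
```

Because no task writes to the file and no task depends on another, the blocks can be hashed in any order, which is why this step scales with the number of workers as long as the disks can keep up.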

Silly concurrency

Concurrency: 20 processes on 48 CPU cores
Disks: NVMe (fast)
Filesize: 5.5 GiB

In case you looked at the high concurrency benchmark above, noticed the large number of CPU cores available and wondered whether increasing the concurrency further would make a difference, this section is for you 😉. Having taken the effort of developing pdiffcopy and enabling it to run on many CPU cores I was curious myself so I reran the high concurrency benchmark using 20 processes instead of 10. Here are the results:

Delta   Data size   pdiffcopy       rsync
10%     562 MiB     3.80 seconds    49.71 seconds
20%     1.10 GiB    6.25 seconds    1 minute and 3.37 seconds
30%     1.65 GiB    8.90 seconds    1 minute and 12.40 seconds
40%     2.20 GiB    11.44 seconds   1 minute and 19.57 seconds
50%     2.75 GiB    14.21 seconds   1 minute and 25.43 seconds
60%     3.29 GiB    16.45 seconds   1 minute and 28.12 seconds
70%     3.84 GiB    19.05 seconds   1 minute and 28.34 seconds
80%     4.39 GiB    21.95 seconds   1 minute and 25.49 seconds
90%     4.94 GiB    24.60 seconds   1 minute and 22.27 seconds
100%    5.43 GiB    26.42 seconds   1 minute and 18.73 seconds

As you can see, increasing the concurrency from 10 to 20 does make the benchmark a bit faster, but the margin is so small that it’s hardly worth bothering. I interpret this to mean that the NVMe disks on these servers can be more or less saturated using 8–12 writer processes.
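
To put a number on “a bit faster”, the relative gain can be worked out from the pdiffcopy columns of the two tables. The snippet below is only a worked calculation on three rows copied from those tables; the helper name relative_gain is made up for this example.

```python
# (delta, seconds with 10 processes, seconds with 20 processes),
# copied from the pdiffcopy columns of the two benchmark tables.
timings = [
    ("10%", 4.23, 3.80),
    ("50%", 14.54, 14.21),
    ("100%", 27.82, 26.42),
]


def relative_gain(before, after):
    """Percentage of run time saved by going from `before` to `after` seconds."""
    return (before - after) / before * 100


for delta, ten, twenty in timings:
    gain = relative_gain(ten, twenty)
    print("%s delta: %.1f%% faster with 20 processes" % (delta, gain))
```

For these three rows the gain works out to roughly 10%, 2% and 5% respectively, in exchange for doubling the number of processes.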

Limitations

While inspired by rsync, the goal definitely isn’t feature parity with rsync. Right now only single files can be transferred and only the file data is copied, not the metadata. It’s a proof of concept that works but is limited. While I’m tempted to add support for synchronization of directory trees and file metadata just because it’s convenient, it’s definitely not my intention to compete with rsync in the domain of synchronizing large directory trees, because I would most likely fail.

Error handling is currently very limited and interrupting the program using Control-C may get you stuck with an angry pool of multiprocessing workers that refuse to shut down 😝. In all seriousness, hitting Control-C a couple of times should break out of it, otherwise try Control-\ (that’s a backslash, it should send a QUIT signal).

History

In June 2019 I found myself in a situation where I wanted to quickly synchronize large binary datafiles (a small set of very large MySQL *.ibd files totaling several hundred gigabytes) using the abundant computing resources available to me (48 CPU cores, NVMe disks, bonded network interfaces, you name it 😉).

I spent quite a bit of time experimenting with running many rsync processes in parallel, but the small number of very large files was “clogging up the pipe” so to speak, no matter what I did. This was how I realized that rsync was a really poor fit, which was a disappointment for me because rsync has long been one of my go-to programs for ad hoc problem solving on Linux servers 🙂.

In any case I decided to prove to myself that the hardware available to me could do much more than what rsync was getting me, and after a weekend of hacking on a prototype I had something that could outperform rsync even though it was written in Python and used HTTP as a transport 😁. During this weekend I decided that my prototype was worthy of being published as an open source project, but it wasn’t until months later that I actually found the time to do so.

About the name

The name pdiffcopy is intended as a (possibly somewhat obscure) abbreviation of “Parallel Differential Copy”:

  • Parallel because it’s intended to run on many CPU cores.

  • Differential because of the delta transfer mechanism.

But mostly I just needed a short, unique name like rsync so that searching for this project will actually turn up this project instead of a dozen others 😇.

Contact

The latest version of pdiffcopy is available on PyPI and GitHub. The documentation is hosted on Read the Docs and includes a changelog. For bug reports please create an issue on GitHub. If you have questions, suggestions, etc. feel free to send me an e-mail at peter@peterodding.com.

License

This software is licensed under the MIT license.

© 2020 Peter Odding.
