Skip to main content
Donate to the Python Software Foundation or Purchase a PyCharm License to Benefit the PSF! Donate Now

Block-based backup and restore utility for virtual machine images

Project description

Overview

Backy is a block-based backup and restore utility for virtual machine images.

Backy is intended to be:

  • space-, time-, and network-efficient
  • trivial to restore
  • reliable.

To achieve this, we rely on:

  • space-efficient storages (CoW filesystems, content-hashed chunking)
  • using a snapshot-capable source for our volumes (i.e. Ceph RBD) that allows easy extraction of changes between snapshots,
  • leverage proven, existing low-level tools,
  • keep the code-base small, simple, and well-tested.

We also have a few ground rules for the implementation:

  • VM data is stored self-contained on the filesystem and can be moved between servers using regular FS tools like copy, rsync, or such.
  • No third party daemons are required to interact with backy: no database server. The scheduler daemon is only responsible for scheduling and simply calls regular CLI commands to perform a backup. Backy may interact with external daemons like Ceph or Consul, depending on the source storage implementation.

Operations

Full restore

Check which revision to restore:

$ backy -b /srv/backy/<vm> status

Maybe set up the Ceph environment - depending on your configuration:

$ export CEPH_ARGS="--id $HOSTNAME"

Restore the full image through a Pipe:

$ backy restore -r <revision> - | rbd import - <pool>/<rootimage>

Restoring individual files

Backy provides an NBD server to access backups through a mountable device:

$ cd /srv/backy/$vm
$ backy nbd-server

In a different shell you can now mount this:

$ nbd-client -N <revision> localhost 9000 /dev/nbd0
$ mkdir  -p /mnt/restore/<vm>
$ mount -r /dev/nbd0p1 /mnt/restore/<vm>

When done:

$ umount /mnt/restore/<vm>
$ nbd-client -d /dev/nbd0

Also stop the nbd-server with Ctrl-C.

Setting up backy

  1. Create a sufficiently large backup partition using a COW-capable filesystem like btrfs and mount it under /srv/backy.

  2. Create a configuration file at /etc/backy.conf. See man page for details.

  3. Start the scheduler with your favourite init system:

    backy -l /var/log/backy.log scheduler -c /path/to/backy.conf
    

    The scheduler runs in the foreground until it is shot by SIGTERM.

  4. Set up monitoring using backy check.

  5. Set up log rotation for /var/log/backy.conf and /srv/backy/*/backy.log.

The file paths given above match the built-in defaults, but paths are fully configurable.

Features

Telnet shell

Telnet into localhost port 6023 to get an interactive console. The console can currently be used to inspect the scheduler’s live status.

Self-check

Backy includes a self-checking facility. Invoke backy check to see if there is a recent revision present for all configured backup jobs:

$ backy check
OK: 9 jobs within SLA

Both output and exit code are suited for processing with Nagios-compatible monitoring systems.

Pluggable backup sources

Backy comes with a number of plug-ins which define block-file like sources:

  • file extracts data from simple image files living on a regular file system.
  • ceph-rbd pulls data from RBD images using Ceph features like snapshots.
  • flyingcircus is an extension to the ceph-rbd source which we use internally on the Flying Circus hosting platform. It uses advanced features like Consul integration.

It should be easy to write plug-ins for additional sources.

Adaptive verification

Backy always verifies freshly created backups. Verification scale depends on the source type: file-based sources get fully verified. Ceph-based sources are verified based on random samples for runtime reasons.

Zero-configuration scheduling

The backy scheduler is intended to run continuously. It will spread jobs according to the configured run intervals over the day. After resuming from an interruption, it will reschedule missed jobs so that SLAs are still kept if possible.

Backup jobs can be triggered at specific times as well: just invoke backy backup manually.

Performance

Backy is designed to use all of the available storage and network bandwidth by running several instances in parallel. The backing storage must be prepared for this kind of (mixed) load. Finding optimal settings needs a bit of experimentation given that hardware and load profiles differ from site to site. The following section contains a few points to start off.

Storage backend

If the backing storage is a RAID array, its stripe size should be aligned with the filesystem. We have made good experiences with 256k stripes. Also check for 512B/4K block misalignments on HDDs. We’re using it usually with RAID-6 and have seen reasonable performance with both hardware and software RAID.

Filesystem

We generally recommend XFS since it provides a high degree of parallelism and is able to handle very large directories well.

Note that the standard cfq I/O scheduler is not a good pick for highly parallel bulk I/O on multiple drives. Use deadline or noop.

Kernel

Since backy performs a lot of metadata operations, make sure that inodes and dentries are not evicted from the VFS cache too early. We found that lowering the vm.vfs_cache_pressure sysctl can make quite a difference in total backup performance. We’re currently getting good results setting it to 10. You may also want to increase vm.min_free_kbytes to avoid page allocation errors on 10 GbE network interfaces.

Authors

License

GPLv3

Changelog

2.4.3 (2019-04-17)

  • Avoid superfluous work when a source is missing, specifically avoid unnecessary Ceph/KVM interaction. (#27020)

2.4.2 (2019-04-17)

  • Optimize handling offline/deletion pending VMs: don’t just timeout. Take snapshots and make backups (as long as the image exists). (#22345)
  • Documentation update about performance implications.
  • Clean up build system instrumentation to get our Jenkins going again.

2.4.1 (2018-12-06)

  • Optimization: bundle ‘unlink’ calls to improve cache locality of VFS metadata
  • Optimization: load all known chunks at startup to avoid further seeky IO and large metadata parsing, this also speeds up purges.
  • Reduce OS VFS cache trashing by explicitly indicating data we don’t expect to read again.

2.4 (2018-11-30)

  • Add support for Ceph Jewel’s –whole-object diff export. (#24636)
  • Improve garbage collection of old snapshot requests. (#100024)
  • Switch to a new chunked store format: remove one level of directories to massively reduce seeky IO.
  • Reorder and improve potentially seeky IOPS in the per-chunk write effort. Do not create directories lazily.
  • Require Python 3.6.

2.3 (2018-05-16)

  • Add major operational support for handling inconsistencies.
  • Operators can mark revisions as ‘distrusted’, either all for a backup, or individual revisions or time ranges.
  • If the newest revision is distrusted then we always perform a full backup instead of a differential backup.
  • If there is any revision that is distrusted then every chunk will be written even if it exists (once during a backup, identical chunks within a backup will be trusted once written).
  • Implement an explicit verification procedure that will automatically trigger after a backup and will verify distrusted revisions and either delete them or mark them as verified.
  • Safety belt: always verify content hashes when reading chunks.
  • Improve status report logging.

2.2 (2017-08-28)

  • Introduce a new backend storage mechanism, independent of BTRFS: instead of using COW, use a directory with content-hashed 4MiB chunks of the file. Deduplication happens automatically based on the 4MiB chunks.

  • Made the use of fadvise features more opportunistic: don’t fail if they are not supported by the backend as they are only optimizations anyway.

  • Introduce an exponential backoff after a failed backup: instead of retrying quickly and thus hogging the queue (in the case that timeouts are involved) we now backoff exponentially starting with 2 minutes, then 4, then 8, … until we reach 6 hours as the biggest backoff time.

    You can still trigger an explicit run using the telnet “run” command for the affected backup. This will put the backup in the run queue immediately but will not reset the error counter or backoff interval in case it fails again.

  • Performance improvement on restore: don’t read the restore target. We don’t have to optimize CoW in this case. (#28268)

2.1.5 (2016-07-01)

  • Bugfix release: fix data corruption bug in the new full-always mode. (FC #21963)

2.1.4 (2016-06-20)

  • Add “full-always” flag to Ceph and Flyingcircus sources. (FC #21960)
  • Rewrite full backup code to make use of shallow copies to conserve disk space. (FC #21960)

2.1.3 (2016-06-09)

  • Fix new timeout to be 5 minutes by default, not 5 days.
  • Do not sort blocks any longer: we do not win much from seeking over volumes with random blocks anyway and this helps for a more even distribution with the new timeout over multiple runs.

2.1.2 (2016-06-09)

  • Fix backup of images containing holes (#33).
  • Introduce a (short) timeout for partial image verification. Especially very large images and images that are backed up frequently do not profit from running for hours to verify them, blocking further backups. (FC #21879)

2.1.1 (2016-01-15)

  • Fix logging bugs.
  • Shut down daemon loop cleanly on signal reception.

2.1 (2016-01-08)

  • Add optional regex filter to the jobs command in the telnet shell.
  • Provide list of failed jobs in check output, not only the total number.
  • Add status-interval, telnet-addrs, and telnet-port configuration options.
  • Automatically recover from missing/damaged last or last.rev symlinks (#19532).
  • Use {BASE_DIR}/.lock as daemon lock file instead of the status file.
  • Usability improvements: count jobs, more informative log output.
  • Support restoring to block special files like LVM volumes (#31).

2.0 (2015-11-06)

  • backy now accepts a -l option to specify a log file. If no such option is given, it logs to stdout.
  • Add backy find -r REVISION subcommand to query image paths from shell scripts.
  • Fix monitoring bug where partially written images made the check go green (#30).
  • Greatly improve error handling and detection of failed jobs.
  • Performance improvement: turn off line buffering in bulk file operations (#20).
  • The scheduler reports child failures (exit status > 0) now in the main log.
  • Fix fallocate() behaviour on 32 bit systems.
  • The flyingcircus source type now requires 3 arguments: vm, pool, image.

2.0b3 (2015-10-02)

  • Improve telnet console.
  • Provide Nix build script.
  • Generate requirements.txt automatically from buildout’s versions.cfg.

2.0b2 (2015-09-15)

  • Introduce scheduler and rework the main backup command. The backy command is now only responsible for dealing with individual backups.

    It does no longer care about scheduling.

    A new daemon and a central configuration file is responsible for that now. However, it simply calls out to the existing backy command so we can still manually interact with the system even if we do not use the daemon.

  • Add consul integration for backing up Flying Circus root disk images with clean snapshots (by asking fc.qemu to use fs-freeze before preparing a Ceph snapshot).

  • Switch to shorter UUIDs. Existing files with old UUIDs are compatible.

  • Turn the configuration format into YAML. Old files are still compatible. New configs will be generated as YAML.

  • Performance: defrag all new files automatically to avoid btrfs degrading extent performance. It appears this doesn’t completely duplicate all CoW data. Will have to monitor this in the future.

2.0b1 (2014-07-07)

  • Clean up docs.
  • Add classifiers in setup.py.
  • More or less complete rewrite expecting a copy-on-write filesystem as the target.
  • Flexible backup scheduling using free-form tags.
  • Compatible with Python 3.2-3.4.
  • Initial open source import as provided by Daniel Kraft (D9T).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
backy-2.4.3.tar.gz (74.8 kB) Copy SHA256 hash SHA256 Source None

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page