WAL-E is a program designed to perform continuous archiving of
PostgreSQL WAL files and base backups.
To correspond on using WAL-E or to collaborate on its development, do
not hesitate to send mail to the mailing list at
wal-e@googlegroups.com (archives and subscription settings).
Github issues are also currently being used to track known problems,
so please feel free to submit those.
Of these, the “push” operators send backup data to storage and “fetch”
operators get backup data from storage. “wal” operators send/get
write ahead log, and “backup” send/get a hot backup of the base
database that WAL segments can be applied to.
All of these operators work in a context of three environment-variable
based settings:
AWS_ACCESS_KEY_ID or WABS_ACCOUNT_NAME
AWS_SECRET_ACCESS_KEY or WABS_ACCESS_KEY
WALE_S3_PREFIX or WALE_WABS_PREFIX
For Swift the following environment variables are needed:
SWIFT_AUTHURL
SWIFT_TENANT
SWIFT_USER
SWIFT_PASSWORD
WALE_SWIFT_PREFIX
There are also these variables:
SWIFT_AUTH_VERSION which defaults to 2. Some object stores such as
Softlayer require version 1.
SWIFT_ENDPOINT_TYPE defaults to publicURL, this may be set to internalURL on
object stores like Rackspace Cloud Files in order to use the internal
network.
With the exception of AWS_SECRET_ACCESS_KEY and WABS_ACCESS_KEY, all
of these can be specified as arguments as well. The AWS_* variables
are the standard access-control keying system provided by Amazon,
where the WABS_* are the standard access credentials defined by
Windows Azure.
The WALE_S3_PREFIX, WALE_WABS_PREFIX and WALE_SWIFT_PREFIX (_PREFIX)
variables can be thought of as a context whereby this program operates
on a single database cluster at a time. Generally, for any one
database the _PREFIX will be the same between all four operators.
This context-driven approach attempts to help users avoid errors such
as one database overwriting the WAL segments of another, as long as
the _PREFIX is set uniquely for each database. Use whichever variable
is appropriate for the store you are using.
This software also has Python dependencies: installing with pip
will attempt to resolve them:
gevent>=0.13.1
boto>=2.6.0
azure>=0.7.0
python-swiftclient>=1.8.0
python-keystoneclient>=0.4.2
argparse, if not on Python 2.7
It is possible to use WAL-E without the dependencies of back-end
storage one does not use installed: the imports for those are only
performed if the storage configuration demands their use.
The storage backend is determined by the defined _PREFIX. Prefixes with the
scheme s3 will be directed towards S3, those with the scheme wabs will
be directed towards Windows Azure Blob Service, and those with the scheme
swift will be directed towards an OpenStack Swift installation.
It is generally recommended that one use some sort of environment
variable management with WAL-E: working with it this way is less verbose,
less prone to error, and less likely to expose secret information in
logs.
At this time, AWS_SECRET_ACCESS_KEY and WABS_ACCESS_KEY are the only
secret values, and recording it frequently in logs is not recommended.
The tool has never and should never accept secret information in argv
to avoid process table security problems. However, the user running
PostgreSQL (typically ‘postgres’) must be able to run a program that
can access this secret information, as part of its archive_command.
envdir, part of the daemontools package is one recommended approach
to setting environment variables. One can prepare an
envdir-compatible directory like so:
# Assumption: the group is trusted to read secret information
# S3 Setup
$ umask u=rwx,g=rx,o=
$ mkdir -p /etc/wal-e.d/env
$ echo "secret-key-content" > /etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
$ echo "access-key" > /etc/wal-e.d/env/AWS_ACCESS_KEY_ID
$ echo 's3://some-bucket/directory/or/whatever' > \
/etc/wal-e.d/env/WALE_S3_PREFIX
$ chown -R root:postgres /etc/wal-e.d
# Assumption: the group is trusted to read secret information
# WABS Setup
$ umask u=rwx,g=rx,o=
$ mkdir -p /etc/wal-e.d/env
$ echo "secret-key-content" > /etc/wal-e.d/env/WABS_ACCESS_KEY
$ echo "access-key" > /etc/wal-e.d/env/WABS_ACCOUNT_NAME
$ echo 'wabs://some-container/directory/or/whatever' > \
/etc/wal-e.d/env/WALE_WABS_PREFIX
$ chown -R root:postgres /etc/wal-e.d
After having done this preparation, it is possible to run WAL-E
commands much more simply, with less risk of accidentally using
incorrect values:
envdir is conveniently combined with the archive_command functionality
used by PostgreSQL to enable continuous archiving. To enable
continuous archiving, one needs to edit postgresql.conf and
restart the server. The important settings to enable continuous
archiving are related here:
wal_level = archive # hot_standby in 9.0 is also acceptable
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
archive_timeout = 60
Every segment archived will be noted in the PostgreSQL log.
A base backup (via backup-push) can be uploaded at any time, but
this must be done at least once in order to perform a restoration. It
must be done again if any WAL segment was not correctly uploaded:
point in time recovery will not be able to continue if there are any
gaps in the WAL segments.
This command makes use of the “LATEST” pseudo-name for a backup, which
queries S3 to find the latest complete backup. Otherwise, a real name
can be used:
One can find the name of available backups via the experimental
backup-list operator, or using one’s remote data store browsing
program of choice, by looking at the PREFIX/basebackups_NNN/...
directory.
It is also likely one will need to provide a recovery.conf file,
as documented in the PostgreSQL manual, to recover the base backup, as
WAL files will need to be downloaded to make the hot-backup taken with
backup-push. The WAL-E’s wal-fetch subcommand is designed to be
useful for this very purpose, as it may be used in a recovery.conf
file like this:
backup-push, backup-fetch, wal-push, wal-fetch represent
the primary functionality of WAL-E and must reside on the database machine.
Unlike wal-push and wal-fetch commands, which function as described
above, the backup-push and backup-fetch require a little additional
explanation.
By default backup-push will include all user defined tablespaces in
the database backup. please see the backup-fetch section below for
WAL-E’s tablespace restoration behavior.
If the archived database did not contain any user defined tablespaces
at the time of backup it is safe to execute backup-fetch with no
additional work by following previous examples.
WAL-E expects that tablespace symlinks will be in place prior to a
backup-fetch run. This means prepare your target path by insuring
${PG_CLUSTER_DIRECTORY}/pg_tblspc contains all required symlinks
before restoration time. If any expected symlink does not exist
backup-fetch will fail.
If you are unable to reproduce tablespace storage structures prior to
running backup-fetch you can set the option flag --blind-restore.
This will direct WAL-E to skip the symlink verification process and place
all data directly in the ${PG_CLUSTER_DIRECTORY}/pg_tblspc path.
Optionally, you can provide a restoration specification file to WAL-E
using the backup-fetch--restore-spec RESTORE_SPEC option.
This spec must be valid JSON and contain all contained tablespaces
as well as the target storage path they require, and the symlink
postgres expects for the tablespace. Here is an example for a
cluster with a single tablespace:
These are commands that are not used expressly for backup or WAL
pushing and fetching, but are important to the monitoring or
maintenance of WAL-E archived databases. Unlike the critical four
operators for taking and restoring backups (backup-push,
backup-fetch, wal-push, wal-fetch) that must reside on the
database machine, these commands can be productively run from any
computer with the appropriate _PREFIX set and the necessary credentials to
manipulate or read data there.
backup-list is useful for listing base backups that are complete for a
given WAL-E context. Some fields are only filled in when the
--detail option is passed to backup-list[1].
Firstly, the fields that are filled in regardless of if --detail
is passed or not:
Header in CSV
Meaning
name
The name of the backup, which can be
passed to the delete and
backup-fetch commands.
last_modified
The date and time the backup was
completed and uploaded, rendered in
an ISO-compatible format with
timezone information.
wal_segment_backup_start
The wal segment number. It is a
24-character hexadecimal number.
This information identifies the
timeline and relative ordering of
various backups.
wal_segment_offset_backup_start
The offset in the WAL segment that
this backup starts at. This is
mostly to avoid ambiguity in event
of backups that may start in the
same WAL segment.
Secondly, the fields that are filled in only when --detail is
passed:
Header in CSV
Meaning
expanded_size_bytes
The decompressed size of the backup
in bytes.
wal_segment_backup_stop
The last WAL segment file required
to bring this backup into a
consistent state, and thus available
for hot-standby.
wal_segment_offset_backup_stop
The offset in the last WAL segment
file required to bring this backup
into a consistent state.
delete contains additional subcommands that are used for deleting
data from storage for various reasons. These commands are organized
separately because the delete subcommand itself takes options that
apply to any subcommand that does deletion, such as --confirm.
All deletions are designed to be reentrant and idempotent: there are
no negative consequences if one runs several deletions at once or if
one resubmits the same deletion command several times, with or without
canceling other deletions that may be concurrent.
These commands have a dry-run mode that is the default. The
command is basically optimized for not deleting data except in a very
specific circumstance to avoid operator error. Should a dry-run be
performed, wal-e will instead simply report every key it would
otherwise delete if it was not running in dry-run mode, along with
prominent HINT-lines for every key noting that nothing was actually
deleted from the blob store.
To actually delete any data, one must pass --confirm to wal-e delete. If one passes both --dry-run and --confirm, a dry
run will be performed, regardless of the order of options passed.
Currently, these kinds of deletions are supported. Examples omit
environment variable configuration for clarity:
before: Delete all backups and wal segment files before the
given base-backup name. This does not include the base backup
passed: it will remain a viable backup.
Example:
$ wal-e delete [--confirm] before base_00000004000002DF000000A6_03626144
retain: Leave the given number of backups in place, and delete
all base backups and wal segment files older than them.
Example:
$ wal-e delete [--confirm] retain 5
old-versions: Delete all backups and wal file segments with an
older format. This is only intended to be run after a major WAL-E
version upgrade and the subsequent base-backup. If no base backup
is successfully performed first, one is more exposed to data loss
until one does perform a base backup.
Example:
$ wal-e delete [--confirm] old-versions
everything: Delete all backups and wal file segments in the
context. This is appropriate if one is decommissioning a database
and has no need for its archives.
All assets pushed to storage are run through the program “lzop” which
compresses the object using the very fast lzo compression algorithm.
It takes roughly 2 CPU seconds to compress a gigabyte, which when
sending things to storage at about 25MB/s occupies about 5% CPU time.
Compression ratios are expected to make file sizes 50% or less of the
original file size in most cases, making backups and restorations
considerably faster.
Because storage services generally require the Content-Length header
of a stored object to be set up-front, it is necessary to completely
finish compressing an entire input file and storing the compressed
output in a temporary file. Thus, the temporary file directory needs
to be big enough and fast enough to support this, although this tool
is designed to avoid calling fsync(), so some memory can be leveraged.
Base backups first have their files consolidated into disjoint tar
files of limited length to avoid the relatively large per-file transfer
overhead. This has the effect of making base backups and restores
much faster when many small relations and ancillary files are
involved.
To encrypt backups as well as compress them, first generate a key pair
using gpg --gen-key. You don’t need the private key on the machine
to back up, but you will need it to restore. The private key may have
a password, but to restore, the password should be present in GPG
agent. WAL-E does not support entering GPG passwords via a tty device.
Once this is done, set the WALE_GPG_KEY_ID environment variable or
the --gpg-key-id command line option to the ID of the secret key
for backup and restore commands.
Here’s an example of how you can restore with a private key that has a
password, by forcing decryption of an arbitrary file with the correct
key to unlock the GPG keychain:
# This assumes you have "keychain" gpg-agent installed.
eval $( keychain --eval --agents gpg )
# If you want default gpg-agent, use this instead
# eval $( gpg-agent --daemon )
# Force storing the private key password in the agent. Here you
# will need to enter the key password.
export TEMPFILE=`tempfile`
gpg --recipient "$WALE_GPG_KEY_ID" --encrypt "$TEMPFILE"
gpg --decrypt "$TEMPFILE".gpg || exit 1
rm "$TEMPFILE" "$TEMPFILE".gpg
unset TEMPFILE
# Now use wal-e to fetch the backup.
wal-e backup-fetch [...]
# If you have WAL segments encrypted, don't forget to add
# restore_command to recovery.conf, e.g.
#
# restore_command = 'wal-e wal-fetch "%f" "%p"'
# Start the restoration postgres server in a context where you have
# gpg-agent's environment variables initialized, such as the current
# shell.
pg_ctl -D [...] start
To reduce the read load on base backups, they are sent through the
tool pv first. To use this rate-limited-read mode, use the option
--cluster-read-rate-limit as seen in wal-ebackup-push.
In certain situations, the wal-push process can take long enough
that it can’t keep up with WAL segments being produced by Postgres,
which can lead to unbounded disk usage and an eventual crash of the
database.
One can instruct WAL-E to pool WAL segments together and send them in
groups by passing the --pool-size parameter to wal-push. This
can increase throughput significantly.
Storing credentials on AWS EC2 instances has usability and security
drawbacks. When using WAL-E with AWS S3 and AWS EC2, most uses of
WAL-E would benefit from use with the AWS Instance Profile feature,
which automatically generates and rotates credentials on behalf of an
instance.
To instruct WAL-E to use these credentials for access to S3, pass the
--aws-instance-profile flag.
Instance profiles may not be preferred in more complex scenarios
when one has multiple AWS IAM policies written for multiple programs
run on an instance, or an existing key management infrastructure.
If one wishes to target WAL-E against an alternate S3 endpoint
(e.g. Ceph RADOS), one can set the WALE_S3_ENDPOINT environment
variable. This can also be used take fine-grained control over
endpoints and calling conventions with AWS.
The format is that of:
protocol+convention://hostname:port
Where valid protocols are http and https, and conventions are
path, virtualhost, and subdomain.
Example:
# Turns off encryption and specifies us-west-1 endpoint.
WALE_S3_ENDPOINT=http+path://s3-us-west-1.amazonaws.com:80
# For radosgw.
WALE_S3_ENDPOINT=http+path://hostname
# As seen when using Deis, which uses radosgw.
WALE_S3_ENDPOINT=http+path://deis-store-gateway:8888
Development is heavily reliant on the tool tox being existent within
the development environment. All additional dependencies of WAL-E are
managed by tox. In addition, the coding conventions are checked by
the tox configuration included with WAL-E.
To run the tests, one need only run:
$ tox -e py27
There are a variety of other environments tested by tox handling
old and new library versions, but -e py27 is normally the
environment one should iterate with.
To run a somewhat more lengthy suite of integration tests that
communicate with AWS S3, one might run tox like this:
Looking carefully at the above, notice the -n 8 added the tox
invocation. This -n 8 is after a -- that indicates to tox
that the subsequent arguments are for the underlying test program
pytest.
This is to enable parallel test execution, which makes the integration
tests complete a small fraction of the time it would take otherwise.
It is a design requirement of new tests that parallel execution not be
sacrificed.
The above invocation tests WAL-E with every test environment
defined in tox.ini. When iterating, testing all of those is
typically not a desirable use of time, so one can restrict the
integration test to one virtual environment, in a combination of
features seen in all the previous examples:
Coverage testing can be used by combining any of these using
pytest-cov, e.g.: tox ----cov wal_e and
tox ----cov wal_e --cov-report html; see htmlcov/index.html.
Finally, the test framework used is pytest. If possible, do not
submit Python unittest style tests: those tend to be more verbose and
anemic in power; however, any automated testing is better than a lack
thereof, so if you are familiar with unittest, do not let the
preference for pytest idiom be an impediment to submitting code.