Backing up to Amazon S3 from a CentOS Droplet with Duply and Duplicity

Set up the Amazon S3 Bucket:

To be written.

About Duply and Duplicity:

Duplicity is a Python-based command-line application that makes encrypted, incremental backups to remote storage locations. Duply is a frontend wrapper for Duplicity, designed to simplify setting up, managing, and running server backup and recovery activities.

FYI, Duplicity can create full and incremental backups; it cannot, however, create differential backups. This can result in some potentially long recovery chains, since recovery requires that every increment made since the last full backup is available. Such long recovery chains translate into more potential recovery errors and much slower recoveries.

This means that Duply and Duplicity might not be the ideal solution, but my backup needs are modest, so I’m okay with this situation for now. To get around ‘the long chain problem’, I will force a full backup fairly frequently.
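
(For reference, once the backup profile described below is in place, forcing a full backup is a single command – ‘profile_name’ here is the profile we’ll create later…)

# duply profile_name full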

Install Duply and Duplicity:

First, let’s install the necessary software onto our server (all commands are run as the root user):

# yum install duplicity duply python-boto

Generate the GPG Encryption Keys:

Next, we’ll need to generate the GPG keys so that Duply/Duplicity can encrypt and sign the backups. If you don’t already have a key, the next few commands will get you set up.

First, install rng-tools, a daemon that feeds the kernel’s entropy pool, to make sure there are enough random bytes from which to generate a key. (FYI, this is needed on CentOS 6+ servers but might not be required for others.)

Type…

# yum install rng-tools

When installation is complete, open ‘/etc/sysconfig/rngd’ and add…

EXTRAOPTIONS="-r /dev/urandom"

Now start the random number generator service…

# service rngd start
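
If you want to confirm that the entropy pool is actually filling up, you can check the kernel’s counter – a value in the low thousands is plenty for key generation…

# cat /proc/sys/kernel/random/entropy_avail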

From this point forward, you should NOT be logged in as root and you should NOT use sudo to execute any of the GPG commands. GPG behaves strangely when sessions are nested, such as when you log in as userX and use sudo or sudo -s. (More on this below)

Next, make sure that gpg-agent is running. Type…

$ gpg-agent -s --daemon --write-env-file --use-standard-socket

After a moment or two, you will see something like the following, where username is your username and Machine is the name of your system…

GPG_AGENT_INFO=/N/u/username/Machine/.gnupg/S.gpg-agent:22743:1; export GPG_AGENT_INFO;
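
Note that this output is itself shell code – for the GPG_AGENT_INFO variable to take effect in your current session, either paste the printed line back into your shell, or wrap the original command in eval from the start…

$ eval "$(gpg-agent -s --daemon --write-env-file --use-standard-socket)"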

Now generate the key…

$ gpg --gen-key

After a few moments, something like the following will appear…

  gpg (GnuPG) 2.0.14; Copyright (C) 2009 Free Software Foundation, Inc.
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.

  gpg: keyring `/N/u/username/Machine/.gnupg/secring.gpg' created
  gpg: keyring `/N/u/username/Machine/.gnupg/pubring.gpg' created
  Please select what kind of key you want:
     (1) RSA and RSA (default)
     (2) DSA and Elgamal
     (3) DSA (sign only)
     (4) RSA (sign only)
  Your selection?

Enter 1 to select the default key.

Next, GPG will prompt you to choose a keysize (in bits). Enter 2048.

You will see…

  Requested keysize is 2048 bits
  Please specify how long the key should be valid.
           0 = key does not expire
        <n>  = key expires in n days
        <n>w = key expires in n weeks
        <n>m = key expires in n months
        <n>y = key expires in n years
  Key is valid for? (0)

Enter the value for how long the key should remain valid. GPG will prompt for confirmation – enter y or n as appropriate.

GPG now asks for information to be used to construct a user ID that will identify the key. At the prompts, enter a name, email address, and a comment.

GPG will now prompt to confirm or correct the information provided…

  You selected this USER-ID:
      "Full Name (comment) <username@iu.edu>"

  Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit?

Enter o to accept the user ID; to correct errors or quit the process, enter the appropriate alternative (n, c, e, or q).

If the user ID is accepted, GPG will prompt for a passphrase. Choose a strong one. Once the passphrase has been entered and confirmed, GPG will begin generating the key. You’ll see…

  We need to generate a lot of random bytes. It is a good idea to
  perform some other action (type on the keyboard, move the mouse,
  utilize the disks) during the prime generation; this gives the
  random number generator a better chance to gain enough entropy.

This process may take a moment to wrap up, but when it’s done you’ll see something like…

  gpg: key 09D2B839 marked as ultimately trusted
  public and secret key created and signed.

  gpg: checking the trustdb
  gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
  gpg: depth: 0  valid:   4  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 4u
  gpg: next trustdb check due at <expiration_date>
  pub   2048R/09D2B839 2015-06-25 [expires: <expiration_date>]
        Key fingerprint = 6AB2 7763 0378 9F7E 6242  77D5 F158 CDE5 09D2 B839
  uid                  Full Name (comment) <username@example.com>
  sub   2048R/7098E4C2 2015-06-25 [expires: <expiration_date>]

In this example, 09D2B839 is the ID of the main signing key, and 7098E4C2 is its encryption subkey. These two keys are enough to create encrypted and signed backups.

Make a copy of this data as well as the passphrase and store it somewhere safe. You won’t be able to decrypt the backups without these.
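
One way to keep a safe copy is to export both keys to ASCII-armored files and move them off the server – the file names here are just examples, and 09D2B839 should be replaced with your own key ID…

$ gpg --armor --export 09D2B839 > duply-public.asc
$ gpg --armor --export-secret-keys 09D2B839 > duply-secret.asc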

Create a Revocation Certificate:

Next, we’ll create a way of invalidating the GPG key pair in case of a security breach or loss of the secret key.

This should be done as soon as the key pair is made, not later. Keep the revocation key in a secure, separate location in case the server is compromised or becomes inoperable.

(FYI: This won’t work if you created the key-pair as ‘root’, such as when using ‘sudo’ to execute GPG. GPG won’t prompt for the required password if you’re in a ‘nested session’. In short, you should create the keys and the revocation certificate when logged in as a non-root user – some folks even recommend creating a specific user just for this purpose.)

Type…

$ gpg --gen-revoke your_email@address.com

Choose whichever of the available options fits – since this is being done ahead of time, some specifics won’t be known yet. You will then be asked to provide a comment and, finally, to confirm the selections.

A revocation certificate will be printed to the screen…

Revocation certificate created.

Please move it to a medium which you can hide away; if Mallory gets
access to this certificate he can use it to make your key unusable.
It is smart to print this certificate and store it away, just in case
your media become unreadable.  But have some caution:  The print system of
your machine might store the data and make it available to others!
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: A revocation certificate should follow

iQIfBCABAgAJBQJSTxNSAh0AAAoJEIKHahUxGx+E15EP/1BL2pCTqSG9IYbz4CMN
bCW9HgeNpb24BK9u6fAuyH8aieLVD7It80LnSg/+PgG9t4KlzUky5sOoo54Qc3rD
H+JClu4oaRpq25vWd7+Vb2oOwwd/27Y1KRt6TODwK61z20XkGPU2NJ/ATPn9yIR9
4B10QxqqQSpQeB7rr2+Ahsyl5jefswwXmduDziZlZqf+g4lv8lZlJ8C3+GKv06fB
FJwE6XO4Y69LNAeL+tzSE9y5lARKVMfqor/wS7lNBdFzo3BE0w68HN6iD+nDbo8r
xCdQ9E2ui9os/5yf9Y3Uzky1GTLmBhTqPnl8AOyHHLTqqOT47arpwRXXDeNd4B7C
DiE0p1yevG6uZGfhVAkisNfi4VrprTx73NGwyahCc3gO/5e2GnKokCde/NhOknci
Wl4oSL/7a3Wx8h/XKeNvkiurInuZugFnZVKbW5kvIbHDWJOanEQnLJp3Q2tvebrr
BBHyiVeQiEwOpFRvBuZW3znifoGrIc7KMmuEUPvA243xFcRTO3G1D1X9B3TTSlc/
o8jOlv6y2pcdBfp4aUkFtunE4GfXmIfCF5Vn3TkCyBV/Y2aW/fpA3Y+nUy5hPhSt
tprTYmxyjzSvaIw5tjsgylMZ48+qp/Awe34UWL9AWk3DvmydAerAxLdiK/80KJp0
88qdrRRgEuw3qfBJbNZ7oM/o
=isbs
-----END PGP PUBLIC KEY BLOCK-----

Copy and paste this key to a secure location, or print it for later use.
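
Alternatively, instead of copying from the screen, GPG can write the certificate straight to a file that you can then move to offline storage…

$ gpg --output revoke-cert.asc --gen-revoke your_email@address.com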

Create & Customize Backup Config Files:

Next, we need to create a ‘configuration profile’ (aka. ‘backup config’) for each backup job that we want done.

By default, a backup config file is stored as ‘~/.duply/profile_name’, where ~ is the current user’s home directory. Instead, we want to store these in the folder ‘/etc/duply’. If this folder exists before a config file is created, Duply will automatically create profiles for the superuser root there instead of in the user’s home folder.

Check to see if this folder already exists. If not, create it…

# mkdir /etc/duply/

Now we can go ahead and create a new backup config profile for each backup job we want to run…

# duply profile_name create

This will create a generic backup config file at ‘/etc/duply/profile_name/conf’ that can now be tweaked. (It also creates a file called ‘exclude’ which I’ll discuss later.)

Type…

# cd /etc/duply/profile_name
# nano conf

Now look for the statement GPG_KEY, and add your key ID and passphrase…

GPG_KEY='_KEY_ID_'
GPG_PW='_GPG_PASSWORD_'
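
Since the conf file now holds your passphrase in plain text, it’s good practice to lock the profile down so only root can read it…

# chmod 700 /etc/duply/profile_name
# chmod 600 /etc/duply/profile_name/conf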

Next, look for GPG_OPTS and set the compression defaults….

GPG_OPTS='--compress-algo=bzip2 --bzip2-compress-level=9'

Next, search for TARGET=’scheme://user[:password]@host[:port]/[/]path’, and modify it to point to your S3 bucket’s ‘endpoint’, like so…

TARGET='s3://host/bucket_name[/prefix]'

Although you can include the user and password in the S3 URL, this is not considered safe. Instead, uncomment the lines for TARGET_USER and TARGET_PASS, and modify like so…

TARGET_USER='your_AWS_ACCESS_KEY_ID'
TARGET_PASS='your_AWS_SECRET_ACCESS_KEY'

Now look for the line, SOURCE=’/path/to/source’, and change it as follows…

SOURCE='/'

Next search for and uncomment the line, #MAX_AGE=1M, and set the interval value (to be used by the purge command) like so…

MAX_AGE=2M

You can use any of several formats to specify MAX_AGE:

  1. The string “now” (refers to the current time)
  2. A sequence of digits, like “123456890” (indicating the time in seconds after the epoch)
  3. A string like “2002-01-25T07:00:00+02:00” in datetime format
  4. An interval, which is a number followed by one of the characters s, m, h, D, W, M, or Y (indicating seconds, minutes, hours, days, weeks, months, or years respectively), or a series of such pairs.  In this case the string refers to the time that preceded the current time by the length of the interval.  For instance, “1h78m” indicates the time that was one hour and 78 minutes ago. The calendar here is unsophisticated: a month is always 30 days, a year is always 365 days, and a day is always 86400 seconds.
  5. A date format of the form YYYY/MM/DD, YYYY-MM-DD, MM/DD/YYYY, or MM-DD-YYYY, which indicates midnight on the day in question, relative to the current time zone settings.  For instance, “2002/3/5”, “03-05-2002”, and “2002-3-05” all mean March 5th, 2002.

Next, search for and uncomment the line, #MAX_FULL_BACKUPS=1, and set the maximum number of full backups to keep…

MAX_FULL_BACKUPS=2

Now find and uncomment the statements ‘#VOLSIZE=50’ and ‘#DUPL_PARAMS="$DUPL_PARAMS --volsize $VOLSIZE "’, then set VOLSIZE to something larger than the 25MB default, like this…

VOLSIZE=50
DUPL_PARAMS="$DUPL_PARAMS --volsize $VOLSIZE "

Create PRE and POST Scripts:

Duply lets you use ‘pre’ and ‘post’ scripts when doing backups. The pre script is executed just before the backup, while the post script executes immediately after the backup completes.

These scripts allow you to do such things as create MySQL database dumps that can be included in the backup. The pre and post files must be placed in the ‘/etc/duply/profile_name/’ directory together with the conf and exclude files.

The following example pre and post scripts create MySQL database dumps, place them in the /tmp folder before the backup procedure starts, and then delete the dump files once the backup is finished.

… Pre Script Example

/usr/bin/mysqldump --all-databases -u root -l > /tmp/sqldump-$(date '+%F')

-- or --

/usr/bin/mysqldump --databases blog engine live mysql -u root -l > /tmp/sqldump-$(date '+%F')

-- or --

/usr/bin/mysqldump --databases blog -u root -l > /tmp/sqldump-blog--$(date '+%F')
/usr/bin/mysqldump --databases engine -u root -l > /tmp/sqldump-engine-$(date '+%F')
/usr/bin/mysqldump --databases live -u root -l > /tmp/sqldump-live-$(date '+%F')
/usr/bin/mysqldump --databases mysql -u root -l > /tmp/sqldump-mysql-$(date '+%F')

In the first example, all databases are dumped into a single file. In the second example, only the selected databases are dumped into a single file. In the third, each selected database is dumped into its own file. In all the examples, the ‘-l’ parameter locks the database tables while the dump is being performed.
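
Note that these examples assume the root MySQL user can connect without a password. If a password is required, avoid putting it on the command line, where it would be visible in the process list; a safer pattern is MySQL’s standard option file, readable only by root (the values below are placeholders)…

# /root/.my.cnf – chmod 600
[mysqldump]
user=root
password=your_password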

… Post Script Example

/bin/rm /tmp/sqldump-$(date '+%F')

-- or --

/bin/rm /tmp/sqldump*

In the above examples, the dump files are deleted from the ‘/tmp’ directory.

It’s also important to back up the configuration for your profile – it’s needed in order to recover the backup files. Technically this only needs to be done once, usually right after the configuration file is finished.

Some schools of thought hold that it’s best to back up the conf file each time a backup is performed. You can do this automatically by adding the following commands to the post script to create a tar of the profile immediately after each backup finishes…

#!/bin/bash

# Name of this profile, taken from the CONFDIR variable that duply exports.
profile_name=$(basename "$CONFDIR")

timestamp=$(date +%s)
backup_file="/etc/duply/$profile_name/duply-$profile_name-$timestamp.tar.gz"

# Archive the profile directory under /etc/duply, skipping archives left
# over from previous runs, then restrict access to the result.
tar --exclude='*.tar.gz' -cvzf "$backup_file" -C /etc/duply "$profile_name"
chmod 600 "$backup_file"

You will need to copy the *.tar.gz file to a secure storage location, preferably portable offline storage such as a CD, DVD, or USB thumb-drive. This way backups can be recovered from another machine or server in the event the current unit is lost or destroyed.

(NOTE: The --exclude flag in the tar command above keeps the archives left over from previous runs out of each new one. Even so, the intended workflow is to run this once each time the configuration changes, copy the tar file off the server, and then delete it from the profile_name directory. Once you’re done, just comment out this script.)

Create the Backup Whitelist:

Next, we want to tell Duply which folders and files should be backed up. We do this by editing the file, /etc/duply/profile_name/exclude. Here’s an example:

+ /tmp/sqldump*
+ /var/www/blog
+ /var/www/site/app/webroot/resources
- **

The format is simple. Duplicity checks each file against the rules in this file, starting from the top, until it finds a rule that matches. If the rule is preceded by a ‘+’ the file will be backed up, and if it is preceded by a ‘-‘ it will be ignored. The last rule ‘- **’ tells duplicity to ignore all files that didn’t match an earlier rule.

Don’t forget the ‘- **’ at the bottom – without it, Duplicity will back up everything under SOURCE rather than just the paths you listed.
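
Because the first matching rule wins, order matters when you want to exclude something inside an included folder – the exclusion has to come first. For example, to back up a site but skip its cache directory (the path here is hypothetical)…

- /var/www/blog/cache
+ /var/www/blog
- **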

Let’s Go!

Okay, we’re finally ready to test Duply!

At the command prompt, type…

# duply profile_name backup

FYI, if you have a large amount of data, this could take a while to run.

Once the backup has completed, go to the AWS S3 control panel to check that files have in fact been uploaded to the right place.

To make sure things worked properly from the server side, type this at the prompt to get a complete listing of all the files and folders that were backed up…

# duply profile_name list
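
You can also ask Duply for a summary of the backup sets and chains that exist on the remote end…

# duply profile_name status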

Backup & File Recovery:

Naturally, the point of doing a backup in the first place is so that files and folders can be recovered if for any reason the originals are corrupted or lost.

There are two ways to do this. The first is to ‘fetch’ an individual file; the second is to ‘restore’ a complete backup:

  • Typing ‘duply profile_name fetch <src_path> <target_path> [<age>]’ will restore a single file/folder from the backup [as it was at <age>].
  • Typing ‘duply profile_name restore <target_path> [<age>]’ will restore the complete backup to <target_path> [as it was at <age>].

Here’s an example of how to fetch a specific file:

# duply profile_name fetch var/www/live/app/webroot/resources/95/work/opening.jpg /mnt/restore/opening.jpg

And here’s an example of how to restore a complete backup:

# duply profile_name restore /mnt/restore

For either of these to work, you’ll need to make sure the target destination (e.g. /mnt/restore) already exists – Duply won’t create new directories for you.
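
Both commands also accept the optional <age> argument, which uses the same formats as MAX_AGE above. For example, to fetch the same file as it existed three days ago…

# duply profile_name fetch var/www/live/app/webroot/resources/95/work/opening.jpg /mnt/restore/opening.jpg 3D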

Automate Backups:

All that’s left now is to set up a cron job to execute the backup task automatically. But first, we need to take a deeper look at Duply’s parameters.

Duply uses a variety of command-line parameters for backup, maintenance, and recovery of data. (Check out Duply’s man page for details.) You can ‘chain’ several parameters within a single command by separating them with an underscore (_).

For example, duply /root/.duply/test full_verify_purge --force creates a full backup, verifies it, and deletes any outdated backups: backups where MAX_AGE is exceeded are listed by purge and actually deleted via the additional option --force.

Scheduling the backup involves editing the ‘/etc/crontab’ file (or whatever file you use to manage crontabs) and adding…

#
15 4 * * * root /usr/bin/duply profile_name backup_cleanup_purge --force > /dev/null
#

This example will run the profile_name backup job at 4:15am every day. It will also clean up and delete any backup files that have passed their expiry date.

Note that it’s important to make sure there’s a line return after the last cron job in the list – I put a comment hash to make sure this is the case.
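
As an aside, this is also the natural place to implement the ‘force a full backup fairly frequently’ plan from earlier. A second crontab line along these lines (the schedule is just an example) would run a full backup every Saturday at 2:15am…

#
15 2 * * 6 root /usr/bin/duply profile_name full > /dev/null
#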

That’s it!

Resources:

https://wiki.archlinux.org/index.php/Backup_programs

http://mindfsck.net/incremental-backups-amazon-s3-centos-using-duplicity/

http://duplicity.nongnu.org/

http://duply.net/

https://zetta.io/en/help/articles-tutorials/backup-linux-duply/

https://andsk.se/2014/06/12/how-to-set-up-a-simple-backup-system-using-duplyduplicity/

https://wiki.archlinux.org/index.php/Duply

https://www.thomas-krenn.com/en/wiki/Backup_on_Linux_with_duply

http://serverfault.com/questions/471412/gpg-gen-key-hangs-at-gaining-enough-entropy-on-centos-6

https://kb.iu.edu/d/awio

https://www.digitalocean.com/community/tutorials/how-to-use-gpg-to-encrypt-and-sign-messages-on-an-ubuntu-12-04-vps

http://bitflop.com/tutorials/gnupg-tutorial.html

http://www.cyberciti.biz/faq/define-cron-crond-and-cron-jobs/

https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-autotasks.html

http://typesofbackup.com/incremental-vs-differential-vs-full-backup/

https://www.youtube.com/watch?v=4Icg3MYZZqI

https://www.digitalocean.com/community/tutorials/how-to-schedule-routine-tasks-with-cron-and-anacron-on-a-vps