Backup MongoDB to AWS S3

April 2020

Backing up to S3

I don't think I need to tell you how important it is to back up your production database. That said, a replica set is NOT what I mean by "backing up". A replica set is always synced, so if something bad happens to your data, the damage is synced to the replicas as well. What you need is a scheduled backup for the worst-case scenario. And what could be better suited to storing a database dump than S3? So here we're going to learn how to back up a MongoDB database to S3, but you could use the same approach to back up almost any database (MySQL, Postgres, ...).

This article is aimed at people who host their database themselves: people who rent servers, install MongoDB on them and maintain the setup on their own. There are also managed MongoDB offerings, such as Atlas, that take care of these concerns for you.

The backup script

Let's start with the final script and then take it apart.

#!/bin/bash

# Set up necessary variables
backup_name="backup_$(date +%Y-%m-%d-%H%M)"
backup_path="$HOME/s3-backups/$backup_name"
log_path="$HOME/s3-backups/$backup_name.log"
s3_location="s3://my-backups/$backup_name"

# Make sure the local directory exists before the log file is written to it
mkdir -p "$HOME/s3-backups"

# Dump the database
mongodump --out "$backup_path" &> "$log_path"

# Upload to S3
aws s3 cp "$backup_path" "$s3_location" --recursive &>> "$log_path"

# Send parts of the logs by email to check if everything went well
grep -n "done dumping" "$log_path" | mail -s "Backup Status: Dumped Collections" youremail@example.com
aws s3 ls "$s3_location" --recursive | wc -l | mail -s "Backup Status: Upload" youremail@example.com

# Cleanup
rm -rf "$backup_path"

Here is what the script does:

  1. Set up the variables we need.
  2. Run mongodump. If you're using another database system, this will be a different command.
  3. Upload the dump to S3.
  4. Send an email to verify that everything worked (optional).
  5. Remove the local dump, so you don't run out of disk space.
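One thing the script does not do is check whether the dump and the upload succeeded before it deletes the local copy. The following is a minimal sketch of such a check, not part of the original script. The function names dump_mongo, upload_s3 and backup_and_cleanup are hypothetical helpers introduced here for illustration, and the sketch assumes the variables from the script (such as s3_location) are already set.

```shell
#!/bin/bash

# Hypothetical helpers wrapping the two commands used in the backup script.
dump_mongo() { mongodump --out "$1"; }
upload_s3()  { aws s3 cp "$1" "$s3_location" --recursive; }

# Run the dump and the upload, and delete the local dump only if both succeeded.
# $1: local dump directory, $2: dump command, $3: upload command
backup_and_cleanup() {
  local backup_path="$1" dump_cmd="$2" upload_cmd="$3"

  if ! "$dump_cmd" "$backup_path"; then
    echo "Dump failed, nothing to upload" >&2
    return 1
  fi

  if ! "$upload_cmd" "$backup_path"; then
    echo "Upload failed, keeping $backup_path for a retry" >&2
    return 2
  fi

  rm -rf -- "$backup_path"
}
```

In the script, the dump, upload and cleanup steps could then be replaced by a single call such as backup_and_cleanup "$backup_path" dump_mongo upload_s3, with its output redirected to the log file as before.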

The steps needed to make the script work

To make this work, we still have a few missing parts. You will need to:

  1. Create an S3 bucket "my-backups".
  2. Create a Lifecycle Rule for your bucket (optional but recommended). You can read more about creating lifecycle rules at: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html. I've created a rule that archives backups to S3 Glacier Deep Archive after one week and permanently deletes them after two months.
  3. In AWS IAM, create a new policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::my-backups"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::my-backups/*"
                ]
            }
        ]
    }
    Also create a user in IAM and attach the created policy to that user.
  4. Install the AWS CLI. On Ubuntu you can run sudo apt-get install awscli to do so. Then run aws configure to give the script access to AWS, using the Access Key ID and Secret Access Key of the user you created in the previous step.
  5. Install the email client (optional). I've described how this works in a separate article here: Sending Emails from Ubuntu
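Before the first run, it can help to confirm that the tools the script relies on are actually installed. The following is a small sketch of such a check; missing_tools is a hypothetical helper name introduced here, not something provided by any of the tools.

```shell
#!/bin/bash

# Print every command from the argument list that cannot be found on the PATH.
# Returns 0 if all commands were found, 1 otherwise.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For this setup you would call it as missing_tools mongodump aws mail. If it prints nothing, all three commands are available.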

You can now check whether everything works by running your script with ./backup-s3.sh (after you've run chmod +x ./backup-s3.sh).

The cron job

Now there is one missing piece of the puzzle: your script needs to be scheduled! This is where a cron job comes in handy. You can set one up by running crontab -e and inserting the following lines:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

0 * * * * ./backup-s3.sh

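The first line is important: cron runs jobs with a minimal PATH, so without it the cron job may not find the aws command. The second line schedules an hourly backup, at minute 0 of every hour. Note that ./backup-s3.sh is a relative path. Cron usually starts jobs in your home directory, so use an absolute path if the script lives somewhere else. You can adapt the schedule to your needs.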
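If a cron job fails silently, the PATH is a common suspect. The sketch below checks whether a command can be found with a given PATH and an otherwise empty environment, which is roughly what a cron job sees. The function name found_with_path is a hypothetical helper, and the empty environment is only an approximation of cron's actual environment.

```shell
#!/bin/bash

# Check whether a command can be found using only the given PATH,
# with the rest of the environment cleared.
# $1: command name, $2: PATH value to test
found_with_path() {
  env -i PATH="$2" /bin/sh -c 'command -v "$1" > /dev/null 2>&1' _ "$1"
}
```

For example, found_with_path aws "/usr/bin:/bin" || echo "aws not found" tells you whether the aws command would be found with only those two directories on the PATH.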

Well, that's it! Now you have backups, and you'll also be informed about their status. Of course, a status update once an hour might get a bit annoying. You can change the backup script so that it only sends the mails once a day:

if [[ "$backup_name" == *"0600" ]]; then
  # send the mail
fi

This would only send the emails for the backup created at 6 am.
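The same check can also be written as a small function, which keeps the condition in one place. This is a sketch under the assumption that backup names keep the backup_YYYY-MM-DD-HHMM format from the script; is_daily_report_run is a hypothetical name introduced here.

```shell
#!/bin/bash

# Return success only for the backup taken at 06:00.
# $1: backup name in the format produced by the script,
#     for example backup_2020-04-01-0600
is_daily_report_run() {
  case "$1" in
    *-0600) return 0 ;;
    *)      return 1 ;;
  esac
}
```

In the backup script, the two mail commands would then go inside if is_daily_report_run "$backup_name"; then ... fi.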

Final Notes

You should run this script on a replica server and not on the primary database server. This takes the load off the critical server. See for example this Stack Overflow discussion.
