DRBD on Amazon’s Linux

Note: this was done more as an experiment than as something I intended to use in production – so consider it a compilation of notes rather than a complete procedure.

DRBD – Distributed Replicated Block Device – is a kernel-level storage system that replicates data across a network. It uses TCP, typically on ports starting at 7788. A typical setup will pair DRBD with Heartbeat/Corosync, so that in the event of a node failure the other node can be promoted to primary (or a dual-primary setup will be used, together with a cluster filesystem, so that both nodes can access the data simultaneously).

The setup described below will only allow one node to access the data at any given time and requires a manual failover to promote the secondary node to primary.

For the following, I am using two up-to-date instances running Amazon’s Linux AMI 2011.09 (ami-31814f58) – which is derived from CentOS/RHEL. Both are in the same security group, and they are the only two instances in that security group. Also, the hostnames of both instances are unchanged from their defaults – this is only relevant if you try to use the script included below; if you manually set up the configuration, the hostnames can be whatever you wish.

I have attached one EBS volume to each instance (in addition to the root volume), at /dev/sdf (which is actually /dev/xvdf on Linux).

Install DRBD

Note: all steps in this section are to be performed on both nodes

This AMI already includes the DRBD kernel module in its default kernel. You can verify this with the following:

modprobe -l | grep drbd
kernel/drivers/block/drbd/drbd.ko

Likewise, to find the version of the kernel module, you can use:

modinfo drbd | grep version
version:        8.3.8

It is typically preferable to have the version of the kernel module match the version of the userland binaries. DRBD is no longer included in the CentOS 6 repository – and is not in either the amzn or EPEL repositories. The remaining options are therefore to use another repository or to build from source – I’d favour the former.

ELRepo – which primarily contains hardware-related packages – maintains up-to-date DRBD binaries for CentOS and its derivatives. We can either install a specific RPM or simply use the latest copy from the repository.

rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org

From RPM (for 32-bit version):

rpm -Uvh http://elrepo.org/linux/elrepo/el5/i386/RPMS/drbd83-utils-8.3.8-3.el5.elrepo.i386.rpm

From the repository (current version 8.3.12 – which doesn’t match the kernel module version, 8.3.8):

rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
yum install drbd83-utils
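
If you want to confirm which userland version was actually installed, you can query the package:

rpm -q drbd83-utils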

Load the kernel module with:

modprobe -v drbd
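
To confirm that the module is loaded, lsmod should now list it (and /proc/drbd will exist, showing the version but no configured devices yet):

lsmod | grep drbd
cat /proc/drbd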

Set up meta-data storage

Note: all steps in this section are to be performed on both nodes

DRBD can store meta-data internally or externally. Internal storage tends to be easier to recover, while external storage tends to offer better latency. Moreover, for EBS volumes using an XFS filesystem with existing data, external meta-data is required (since there is typically no place to store the meta-data on the disk – XFS can’t shrink, and an EBS volume can’t be enlarged directly).

According to the DRBD User Guide, meta-data size, in sectors, can be calculated with:

echo $(((`blockdev --getsz /dev/xvdf`/32768)+72))
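
For a sense of scale, a 1GiB volume (which is what the status output further down suggests is in use here) is 2097152 512-byte sectors, so the calculation comes out to just a couple of hundred sectors:

echo $(((2097152/32768)+72))
# 136 sectors – roughly 68KiB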

However, for external meta-data disks, it appears that you need 128MiB per index (disk). Creating a smaller disk will result in the error “Meta device too small”.

To create our meta-data storage (/var/drbd-meta – change as desired) – initially zeroed out – we will use dd, with /dev/zero as the input source, and then attach the file to a loopback device.

dd if=/dev/zero of=/var/drbd-meta bs=1M count=128
losetup /dev/loop0 /var/drbd-meta
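
Note that the loopback association does not survive a reboot; since DRBD is started manually in this setup, one simple approach (assuming the stock /etc/rc.local is in place) is to re-create it at boot:

echo "losetup /dev/loop0 /var/drbd-meta" >> /etc/rc.local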

Configure DRBD

The default DRBD install creates /etc/drbd.conf – which includes /etc/drbd.d/global_common.conf and /etc/drbd.d/*.res. You will want to make some changes to global_common.conf for performance and error handling, but for now I am just using the defaults.
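
Purely as an illustration of the sort of tweaks you might make later (this is not what I used – DRBD 8.3 syntax, and it assumes the split-brain handler script shipped with drbd83-utils is present):

global {
    usage-count no;
}
common {
    protocol C;
    syncer {
        rate 50M;                 # cap resync bandwidth
    }
    disk {
        on-io-error detach;       # detach from the backing device on I/O errors
    }
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";   # notify root on split-brain
    }
}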

You will need to know the hostname and IP address of both instances in your cluster to set up a resource file. It is important to note that DRBD uses the IP address of the local machine to determine which interface to bind to – therefore, you must use the private IP address for the local machine.

You can, of course, use an Elastic IP as the public IP address. The default port used by DRBD is 7788, and I have used the same below – you need to open this port (TCP) in your security group.
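
For reference, opening the port to the other member of the group, using the same EC2 API tools as the script further down, looks something like the following – the group name (‘default’) and account ID are placeholders, so substitute your own:

ec2-authorize default -P tcp -p 7788 -o default -u 123456789012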

Set up a resource file /etc/drbd.d/drbd_res0.res.tmpl (on both nodes):

resource drbd_res0 {
    syncer     { rate 50M; }
    device     /dev/drbd0;
    disk       /dev/xvdf;
    meta-disk  /dev/loop0[0];
    on @@LOCAL_HOSTNAME@@ {
        address    @@LOCAL_IP@@:7788;
    }
    on @@REMOTE_HOSTNAME@@ {
        address    @@REMOTE_IP@@:7788;
    }
}

The above ‘resource’ defines the basic information about the disk and the instances. Note: you should change ‘disk’ to match the device name you attached your EBS volume as, and ‘meta-disk’ should correspond to the loopback device set up above (or use internal).

If you manually replace the template placeholders above, you must use the private IP address for LOCAL_IP; for REMOTE_IP, however, you can use either the public or the private IP. The LOCAL_HOSTNAME and REMOTE_HOSTNAME values should match the output of the hostname command on each system. Keep in mind that if you use a public IP address you may incur data transfer charges (also keep in mind that an Elastic IP maps to the private IP address when resolved from within EC2, which will save on data transfer charges). Finally, the file extension should be .res (not .tmpl) if you make the replacement manually.
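
For example, with hypothetical values filled in (EC2’s default hostnames take the form ip-10-1-2-3, matching the private IP), the final drbd_res0.res might look like:

resource drbd_res0 {
    syncer     { rate 50M; }
    device     /dev/drbd0;
    disk       /dev/xvdf;
    meta-disk  /dev/loop0[0];
    on ip-10-1-2-3 {
        address    10.1.2.3:7788;
    }
    on ip-10-4-5-6 {
        address    10.4.5.6:7788;
    }
}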

A typical setup would have identical resource files on both the local and remote machines. If we wish to use the public IP addresses, this is not possible (since the public IP is not associated with an interface in EC2). Therefore, I used the following script to fill in the correct values in the above file (note: you need to set up your private key and certificate in order to use the API tools):

#!/bin/sh

# Credentials for the EC2 API tools
export EC2_PRIVATE_KEY=/path/to/pk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem
export EC2_CERT=/path/to/cert-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem

# Find the other running instance in this instance's security group, and pull out
# its short hostname (the private DNS name with the domain stripped) and IP address
REMOTE_INFO=$(ec2-describe-instances --filter instance-state-name=running --filter group-name=$(curl -s http://169.254.169.254/latest/meta-data/security-groups) | grep INSTANCE | grep -v $(curl -s http://169.254.169.254/latest/meta-data/instance-id) | awk '{sub(/\..*/, "", $5);print $5, $14}')

REMOTE_HOSTNAME=$(echo $REMOTE_INFO | cut -d ' ' -f1)
REMOTE_IP=$(echo $REMOTE_INFO | cut -d ' ' -f2)
LOCAL_HOSTNAME=$(hostname)
# The private IP address of this instance (taken from eth0)
LOCAL_IP=$(ifconfig eth0 | grep "inet addr" | cut -d':' -f2 | cut -d' ' -f1)

# Fill in the template placeholders and write the final resource file
sed -e "s/@@LOCAL_HOSTNAME@@/$LOCAL_HOSTNAME/g" \
-e "s/@@LOCAL_IP@@/$LOCAL_IP/g" \
-e "s/@@REMOTE_HOSTNAME@@/$REMOTE_HOSTNAME/g" \
-e "s/@@REMOTE_IP@@/$REMOTE_IP/g" \
/etc/drbd.d/drbd_res0.res.tmpl > /etc/drbd.d/drbd_res0.res

Of course, there are a few shortcomings to the above – it will only handle two instances (the local and one remote) in the group, and it expects the hostname to be unchanged (i.e. the value derived from ec2-describe-instances). The script uses the security group to determine the servers in the cluster; as such, it requires both instances to be in the same security group and will only work if that security group has exactly two instances in it. (It would be trivial to modify it to use something other than the security group – for instance a specific tag, as sketched below – but handling more than two instances matching the criteria would take a bit more effort.)
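
For what it’s worth, switching the selection criteria to a tag would be a one-line change to the script – a sketch, assuming a (hypothetical) tag named drbd-cluster with the value res0 has been applied to both instances:

REMOTE_INFO=$(ec2-describe-instances --filter instance-state-name=running --filter "tag:drbd-cluster=res0" | grep INSTANCE | grep -v $(curl -s http://169.254.169.254/latest/meta-data/instance-id) | awk '{sub(/\..*/, "", $5);print $5, $14}')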

At this point you should have an /etc/drbd.d/drbd_res0.res file on both nodes, with the appropriate information filled in (either manually or using the script) – it is worth mentioning that the filename doesn’t actually matter (as long as it ends in .res – which is what /etc/drbd.conf is set up to look for).

Final steps

We are just about done at this point – everything is configured, and DRBD is set up on each instance. We now need to initialize the meta-data for our specific resource (run on both nodes):

drbdadm create-md drbd_res0

Finally, we start DRBD (on both nodes):

service drbd start

We can find the status of our nodes using service drbd status, drbd-overview, or cat /proc/drbd:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1048576

At this point, we have not actually defined which node is to be the primary node – both are therefore classed as secondary, something we will resolve momentarily.

Up until this point, all steps have been done on both instances. Without a dual-primary/cluster-filesystem setup, the DRBD data will only be accessible to one instance at a time. The primary node will be able to read and write to the volume, but the secondary node will not. In a failover scenario, we would promote the secondary node to primary, and it would then have full access to the volume.

We must now promote one node to primary. It is important to note that you cannot promote a node to primary while the nodes are inconsistent (see the status above) – for the initial promotion, you will need to use the --overwrite-data-of-peer option. Be careful, as this option will completely overwrite the data on the other node:

drbdadm -- --overwrite-data-of-peer primary drbd_res0

If the nodes are UpToDate, you can use:

drbdadm primary drbd_res0

Checking the status of our nodes will now reveal that one is primary and, if necessary, that a sync is in progress:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:88968 nr:0 dw:0 dr:97432 al:0 bm:5 lo:5 pe:17 ua:248 ap:0 ep:1 wo:b oos:960128
        [>...................] sync'ed:  9.0% (960128/1048576)K delay_probe: 0
        finish: 0:00:32 speed: 29,480 (29,480) K/sec
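
To follow the progress, it is easiest to simply watch the proc file until oos reaches 0:

watch -n2 cat /proc/drbd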

Wait for the sync to finish before proceeding – at which point there should be nothing out of sync (oos:0), and both nodes should be UpToDate:

version: 8.3.8 (api:88/proto:86-94)
srcversion: 299AFE04D7AFD98B3CA0AF9
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:1048576 nr:0 dw:0 dr:1049240 al:0 bm:64 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Filesystem and Mounting

At this point, we are ready to use our DRBD device. We start by setting up a filesystem. My preference is XFS:

yum install xfsprogs
mkfs.xfs /dev/drbd0

(Note: both nodes should have xfsprogs installed if you use XFS as your filesystem – but you will only format the device on the primary node).

We now create a mountpoint and mount the device (again, only on the primary node):

mkdir /data
mount /dev/drbd0 /data

Hopefully, at this point everything is set up and operational – any data we save to /data should now be replicated over the network to our secondary node.

A Quick Test

The most basic test involves the following – create a test file on the primary node, manually fail over, and check for the file on what was originally the secondary node:

On the primary node:

echo "This is a test" > /data/test.txt
umount /data
drbdadm secondary drbd_res0

On the secondary node:

drbdadm primary drbd_res0
mkdir /data
mount /dev/drbd0 /data
cat /data/test.txt

To be able to simultaneously access the data on both nodes, we need to set up both nodes as primary and use a cluster filesystem – such as OCFS2 or GFS2 (instead of XFS) – in order to minimize the risk of inconsistencies. That, however, is an experiment for a future date. (Of course, there are other alternatives to DRBD – my personal preference being GlusterFS on EC2, which, while having a bit of additional overhead, is simpler to set up and has quite a few more features.)
