The definitive guide to growing your EBS volumes

This week my Graphite instance was falling apart.

Every query I ran and every dashboard I loaded got stuck; I couldn’t get anything done.

Now, I use Graphite every day. Every technical decision I make is based on a graph, and for every problem or alert on the servers I look at the graphs first, so naturally this was not a good place to be.

From the server’s collectd stats, I saw that the EBS drive had filled up: usage spiked to 100% and it was slowing everything down.

Now, I won’t go into Graphite, Carbon or any of that here. I just want to go through how I solved it step by step, since every single post I read about it was incomplete or inaccurate.

First, let’s set out the goals for replacing the drive on your EC2 instance:

  1. Minimal downtime
  2. Minimal data loss
  3. Fast (no copying data)

OK, so let’s start…

First, you need to snapshot the drive.

Snapshot the drive

You don’t have to stop the instance or any service; the server can keep running while this is happening.

For a full (100%) 500G drive, it took Amazon around 2 hours, which was agonizingly slow, but the server kept running and collecting stats, so I didn’t really mind.

After you have the snapshot, you just create a new drive from it.
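If you prefer the command line to the console, the same step can be sketched with the AWS CLI. This is only a rough sketch: it assumes the aws CLI is installed and configured, the volume ID in the usage comment is a placeholder, and the function names are made up for illustration. It polls the snapshot state in a loop, since a big snapshot can take hours.

```shell
# Start a snapshot of the given EBS volume and print the new snapshot ID.
start_snapshot() {
  local volume_id="$1"
  aws ec2 create-snapshot \
    --volume-id "$volume_id" \
    --description "pre-resize snapshot of $volume_id" \
    --query SnapshotId --output text
}

# Poll until the snapshot is completed; give up if AWS reports an error.
# POLL_SECONDS is a hypothetical knob for this sketch (default: 30).
wait_for_snapshot() {
  local snapshot_id="$1"
  local state
  while true; do
    state="$(aws ec2 describe-snapshots \
      --snapshot-ids "$snapshot_id" \
      --query 'Snapshots[0].State' --output text)" || return 1
    case "$state" in
      completed) return 0 ;;
      error) echo "snapshot $snapshot_id failed" >&2; return 1 ;;
    esac
    sleep "${POLL_SECONDS:-30}"
  done
}

# Usage (vol-0123456789abcdef0 is a placeholder, not a real volume):
#   snap_id="$(start_snapshot vol-0123456789abcdef0)" && wait_for_snapshot "$snap_id"
```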

Create drive from snapshot

You can configure everything just like you would a brand new drive: the IOPS, the size (this is where you set the new, larger size) and the availability zone.

The filesystem is already there, your data is intact and the sun keeps shining :)

Keep in mind, the drive HAS to be in the same availability zone as your instance (same region is not enough), or you will not be able to attach it.

It takes around 30s-1m for the drive to become available, then you just need to attach it to your machine.
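For the command-line crowd, creating the drive from the snapshot might look roughly like this. Again a sketch under assumptions: the aws CLI is configured, the function name is hypothetical, and the snapshot ID, availability zone and size in the usage comment are placeholders you would replace with your own.

```shell
# Create a volume from a snapshot in the given availability zone,
# wait until it is available, then print the new volume ID.
create_volume_from_snapshot() {
  local snapshot_id="$1"
  local zone="$2"       # must match the instance's availability zone
  local size_gib="$3"   # new size in GiB, not smaller than the snapshot
  local volume_id
  volume_id="$(aws ec2 create-volume \
    --snapshot-id "$snapshot_id" \
    --availability-zone "$zone" \
    --size "$size_gib" \
    --query VolumeId --output text)" || return 1
  aws ec2 wait volume-available --volume-ids "$volume_id" || return 1
  echo "$volume_id"
}

# Usage (all three values are placeholders):
#   create_volume_from_snapshot snap-0123456789abcdef0 us-east-1a 1000
```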

Attaching the drive

Then you need to select where you want to attach it.

Select attach point
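The attach step can also be done from the terminal. A minimal sketch, assuming a configured aws CLI; the function name and the IDs in the usage comment are placeholders. Note that on many Linux instances a device requested as /dev/sdp shows up inside the machine as /dev/xvdp, so check with lsblk after attaching.

```shell
# Attach a volume to an instance at the requested device name and
# wait until AWS reports the volume as in use.
attach_volume() {
  local volume_id="$1"
  local instance_id="$2"
  local device="$3"
  aws ec2 attach-volume \
    --volume-id "$volume_id" \
    --instance-id "$instance_id" \
    --device "$device" >/dev/null || return 1
  aws ec2 wait volume-in-use --volume-ids "$volume_id"
}

# Usage (placeholder IDs), followed by a look at the block devices
# from inside the instance:
#   attach_volume vol-0123456789abcdef0 i-0123456789abcdef0 /dev/sdp
#   lsblk
```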

At this point, every other post I read failed to explain it clearly, so I will try really hard to be clearer.

Now you have two drives: the old one at, for example, /dev/xvdl, and the new one at /dev/xvdp. /dev/xvdl is mounted to /mnt; the new one is not mounted yet, it’s just attached to the server.

Now, you have two options:

Option #1

sudo vim /etc/fstab

You will see this line:

/dev/xvdl /mnt xfs noatime,nobootwait 0 2

As you can see, /dev/xvdl is mounted to /mnt, like I said earlier. Just replace it with /dev/xvdp.

Your new line should look like this:

/dev/xvdp /mnt xfs noatime,nobootwait 0 2

Then reboot the machine so the new drive gets mounted to /mnt.
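If you would rather not edit the file by hand, the fstab change can be scripted. Here is a small sketch (the function name is hypothetical) that keeps a .bak copy and only touches lines starting with the old device. It assumes GNU sed, and it works on whatever file you pass in, so you can try it on a copy before pointing it at /etc/fstab as root.

```shell
# Replace the device at the start of an fstab line, keeping a backup
# of the original file next to it as <file>.bak.
swap_fstab_device() {
  local old_dev="$1"
  local new_dev="$2"
  local fstab="$3"
  cp "$fstab" "$fstab.bak" || return 1
  # Match the old device only at the start of a line, followed by whitespace,
  # and keep that whitespace character as it was.
  sed -i -E "s|^${old_dev}([[:space:]])|${new_dev}\1|" "$fstab"
}

# Try it on a copy first, then compare:
#   cp /etc/fstab /tmp/fstab.test
#   swap_fstab_device /dev/xvdl /dev/xvdp /tmp/fstab.test
#   diff /etc/fstab /tmp/fstab.test
```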

Option #2

Stop all services that write to this disk

This is a super important step: you HAVE to stop all services that write to this mounted drive, or it just won’t work. Linux won’t let you unmount a drive while processes are still reading from or writing to it.

I just stopped Graphite and the relevant services and then ran:

sudo umount /mnt

This will unmount /mnt so you can continue.

After the old drive is unmounted, run sudo mount /dev/xvdp /mnt and then edit the /etc/fstab file, just like in option #1, so the new drive is the one that gets mounted after the next reboot.

Then start the services again.
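Put together, the unmount and mount part of option #2 can be sketched as two small helpers. The function names are made up for illustration, and stopping and starting your own services is left out because the service names depend on your setup. If umount complains that the target is busy, the first helper shows what is still holding the drive (fuser usually comes with the psmisc package).

```shell
# Show which processes still have files open on the mount point.
who_uses() {
  sudo fuser -vm "$1"
}

# Unmount whatever is mounted at the mount point and mount the new
# device in its place. Stops early if the unmount fails.
swap_mount() {
  local new_dev="$1"
  local mount_point="$2"
  if ! sudo umount "$mount_point"; then
    echo "umount failed; something is probably still using $mount_point" >&2
    return 1
  fi
  sudo mount "$new_dev" "$mount_point"
}

# Usage, with the device names from this post:
#   who_uses /mnt
#   swap_mount /dev/xvdp /mnt
```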

Next step

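Now, if you followed the steps, you are probably telling yourself it didn’t work, because the drive still shows up as 500G at 100%. That’s because the filesystem that came with the snapshot is still the old size, even though the drive underneath it is bigger.

This is where you just need to run sudo xfs_growfs -d /mnt, which grows the filesystem to the drive’s full capacity.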
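To see that the grow actually did something, you can print the size and usage before and after. A minimal sketch with hypothetical helper names; the only command taken from this post is xfs_growfs -d, and the rest is plain df and awk.

```shell
# Print the size and use% of the filesystem at a mount point,
# e.g. "500G 100%". -P keeps each filesystem on a single line.
fs_usage() {
  df -Ph "$1" | awk 'NR==2 {print $2, $5}'
}

# Grow the XFS filesystem at the mount point to fill its device and
# show the size and usage before and after.
grow_xfs() {
  local mount_point="$1"
  echo "before: $(fs_usage "$mount_point")"
  sudo xfs_growfs -d "$mount_point" >/dev/null || return 1
  echo "after:  $(fs_usage "$mount_point")"
}

# Usage:
#   grow_xfs /mnt
```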

Summing up

When I did it, I had about 10 minutes of downtime on my stats machine. It didn’t take anything else down, since everything writes to it over UDP, so for this sort of maintenance it seemed acceptable.

I didn’t lose any data except those 10 minutes where the stats server was down.

Feel free to comment or give feedback on anything.