If you are reading this, you are an unwitting participant in my latest experiment: clustering my blog on Amazon EC2 – thanks! You will be connecting to my blog on one of several Amazon EC2 micro instances, cobbled together in a quick and dirty solution that was more a knee-jerk reaction to some downtime than a well-thought-out project.
This post serves as a chance for me to test if the cluster works, and a summary of the architecture I have set up using several EC2 micro instances and WordPress. It’s a quick and dirty little WordPress cluster using spot-request micro instances at $0.007/hour – how fun!
A bit of background
I’m pretty sure I’m not the only computer nerd who wakes up in the morning and checks their emails first thing. This weekend just gone I had a rather rude awakening because my site was down and I had a slew of down alerts in my inbox. There is little more embarrassing for a computer nerd than a dead blog. Not to mention the implications of a search engine bot finding a broken site.
The problem with a single micro instance was that my site went down whenever it came under sustained load on seldom-visited pages. The CPU gets capped after some (short) amount of time. This hardly ever happens when normal humans are visiting, because they tend to visit pages that other people visit, so they get served a fast cached version. But when an (arguably badly behaved) crawler starts hitting really old posts from the archive, one after the other, several times per second, things get bogged down, and once the CPU cap comes down there’s no recovery as the requests just pile up.
This is not the first time it’s happened and I really wanted to fix it in a future-proof, albeit quick and dirty, manner.
In hindsight the cheapest option would have been to just bump the size of my instance type up to the next level and pay a lot more money each month, given how long I spent mucking around with this. But I wanted to try and keep using the micro instances because I really like the concept – and I had wanted to try the spot instance requests for a while too. So this was a chance to do all of that.
The requirements were quite simple really:
- Keep the filesystem in sync, for example when I upload a screenshot, or change a stylesheet
- Keep the DB in sync: when a user posts a comment on node 1, people on nodes 2, 3, …, N need to see it
- Allow nodes to be added/removed without affecting the site
- Fall back to the status quo, or a bigger instance, if it fails entirely
- Tolerate the spot-instances terminating if the price gets high
- Update: Basic PHP session support – some plugins start PHP sessions, even though core WordPress doesn’t use them.
Things I decided I did not need:
- Very high fault tolerance; down time is OK, data loss of a few hours would be OK, so no need for master/slaving MySQL.
- Complete automation (at least not initially); I need to get a feel for how stable it is and where it breaks first.
- Redundant points of failure; I’m hosting it 100% on Amazon, if they let me down, they let me down.
- Sessions; WP doesn’t use them as such, so no need to replicate them (unlike Magento).
So here’s what I did:
Before I started I drew a crude picture of how the cluster will look:
It helped me work out which technologies would be required, and it might help you to understand the basic layout of the cluster.
This is all possible thanks to the Amazon ELB. It balances traffic evenly across the nodes in the cluster, and if it detects unhealthy ones, it will move them out of the cluster until they return to health. This is the vital part of it, because it means if one of the micro instances suffers the CPU cap of doom, it will get taken out of the cluster, where it can recover with no traffic for a minute and then automatically be put back in.
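To make the in-and-out behaviour concrete, here is a minimal sketch of threshold-based health tracking of the kind a load balancer performs. It is a toy model, not the ELB's actual implementation: the class name `HealthTracker` and the threshold values are assumptions chosen purely for illustration.

```python
class HealthTracker:
    """Toy model of a node moving in and out of service on health checks."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 3):
        # Hypothetical thresholds: consecutive failures needed to pull a node,
        # and consecutive successes needed to put it back.
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_service = True
        self._streak = 0  # consecutive checks that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return whether the node is in service."""
        if check_passed == self.in_service:
            # The check agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= needed:
                self.in_service = check_passed
                self._streak = 0
        return self.in_service


# A node that hits the CPU cap fails two checks, then recovers.
tracker = HealthTracker()
history = [tracker.record(ok) for ok in (False, False, True, True, True)]
```

With these assumed thresholds the node stays in service after the first failed check, is pulled after the second, and only returns once three checks in a row have passed.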
Sharing the database turned out to be very easy. I changed MySQL to bind to the internal interface of the master node (instead of localhost), but locked down access so that only EC2 instances in the cluster can connect. All the slave nodes simply point their WordPress install at the public DNS name of the master node’s elastic IP; when an EC2 server resolves that name, it gets the internal network address, not the external one.
Update: Heiko quite rightly noted this section is a bit vague, here’s some more info:
Each elastic IP in AWS has its public IP and also a private internal network IP.
If I resolve the IP from my mac laptop at the office:
(Note: we’ll use ping, as dig output is too messy for my blog post!)
~ ashley$ ping ec2-50-17-231-95.compute-1.amazonaws.com
PING ec2-50-17-231-95.compute-1.amazonaws.com (220.127.116.11): 56 data bytes
But if I resolve it from an AWS server:
~$ ping ec2-50-17-231-95.compute-1.amazonaws.com
PING ec2-50-17-230-94.compute-1.amazonaws.com (10.202.51.192) 56(84) bytes of data.
So knowing that, and knowing that the elastic DNS name never changes, even if we stop and start and fiddle with the master node, in our WordPress DB config, when it asks for the host name of the DB server, we use the elastic DNS name instead of the local IP 10.202.51.192.
Your WordPress config then might look like this:
//<snip>
define('DB_NAME', 'dbname'); // The name of the database
define('DB_USER', 'dbuser'); // Your MySQL username
define('DB_PASSWORD', 'dbpasswd'); // ...and password
define('DB_HOST', 'ec2-50-17-231-95.compute-1.amazonaws.com'); // 99% chance you won't need to change this value
//</snip>
Now note this is the config in the master node’s web root, and because that web root is exported to all the child nodes, all copies of the WordPress site use the same config. Change it on the master and they all update.
The same config setup works for EC2 sites running Magento, or any other app too. Hope that makes this section a bit clearer.
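If you want to check which kind of address a node actually gets back, a small sketch like the following could help. The function names are made up for illustration, and the second example address is an assumption read off the host name used above; the only library calls are Python's standard `socket.gethostbyname` and the `ipaddress` module.

```python
import ipaddress
import socket


def is_internal(address: str) -> bool:
    """Return True if the address falls in a private (non-routable) range."""
    return ipaddress.ip_address(address).is_private


def resolve(hostname: str) -> str:
    """Resolve a host name to an IPv4 address using the system resolver."""
    return socket.gethostbyname(hostname)


# The internal address seen from inside EC2 in the example above.
seen_from_ec2 = is_internal("10.202.51.192")

# Assumption: the public address implied by the host name ec2-50-17-231-95.
seen_from_outside = is_internal("50.17.231.95")
```

Running `is_internal(resolve("ec2-50-17-231-95.compute-1.amazonaws.com"))` on a slave node and again on a machine outside AWS should show the difference described above, assuming the name still resolves.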
It is important that each slave node in the cluster is looking at the same files, for example if I added a new screenshot through the admin while on one of the nodes, the others would all need to see it (at least until CloudFront adds it to the CDN network I use).
My solution here is NFS. I know that’s not the highest performance option, but it really made things simple, and seeing as the goal was quick and dirty, I went with it. Security is provided by the EC2 security groups, so only nodes in the ELB can connect to the NFS server.
I suggest using the soft mount option on the clients, so that if the master hiccups, the clients pick up the connection easily once it’s responding again. I had trouble with hard mounting the NFS share: the clients became unresponsive and stayed that way, even when the server was running properly again.
I use Eric Hammond’s awesome ec2-consistent-snapshot as a cron job; it takes a snapshot of the master node every N hours (where N hours is the amount of data I’m willing to lose). This way the most recent snapshot can simply be used to start up a replacement master node. Yes, initially that would cause downtime, but it _could_ be scripted if it happens regularly (it shouldn’t).
The spot instances can disappear if the spot price rises above my bid, so I needed the cluster to be able to handle that. The load balancer will stop routing traffic to the nodes if they go down, but it meant I couldn’t rely only on spot instances: if they all went away there would be none left in the cluster.
My solution was to stump up the 2c/hour for one on-demand slave node, think of it as the ‘last stand’. If the spot price goes too high and all the spot instances terminate, this one lone ranger has to hold out until I either bid more, or the price comes back down and the spot instances restart.
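For a rough feel of what that costs, here is a back-of-the-envelope sketch using the two hourly prices quoted in this post. The node counts and the 730-hour month are assumptions for illustration only, and the figure ignores the master node, storage, the load balancer and any price changes.

```python
SPOT_PRICE = 0.007      # USD per hour for a spot micro instance, as quoted in this post
ON_DEMAND_PRICE = 0.02  # USD per hour for the on-demand node ("2c/hour")
HOURS_PER_MONTH = 730   # assumption: average number of hours in a month


def monthly_cost(spot_nodes: int, on_demand_nodes: int, hours: int = HOURS_PER_MONTH) -> float:
    """Estimate the monthly cost in USD of the slave nodes only."""
    hourly = spot_nodes * SPOT_PRICE + on_demand_nodes * ON_DEMAND_PRICE
    return round(hourly * hours, 2)


# Hypothetical cluster: three spot nodes plus the one on-demand 'last stand' node.
estimate = monthly_cost(spot_nodes=3, on_demand_nodes=1)
```

Under those assumptions the slave nodes come to about $30 a month, of which the lone on-demand node accounts for roughly half.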
Update: Oops, it seems that although core WordPress does not need session support, some plugins, e.g. wp-ecommerce, do. So if you use any such plugins, or just don’t want to risk it, it’s best to also allow sessions to be shared among the nodes. Thankfully this is easy.
PHP by default stores its sessions on disk, in /tmp. We simply tell it, on a per-site basis or globally, to store its sessions inside our NFS-mounted area, like so.
On a per-site basis use the .htaccess file:
php_value session.save_path "/path/to/your/nfs-share/tmp"
Or for all sites on the server use the php.ini:
session.save_path = "/path/to/your/nfs-share/tmp"
The master node is a slightly specialized version of the slave nodes, so it does have Apache on it. If it all turns to custard – I’ll simply stop the master, change it to a large instance and fire it back up behind the ELB – I’ll be back to where I started just with a bigger instance type.
I’m monitoring the micro-cluster closely over the next few days, and will update here with what I find. I created a couple of CloudWatch alarms so I can check on things. If these fire too often I’ll need to add autoscaling to terminate and restart the spot instances if they become unhealthy. For now I’m hoping that will happen infrequently or not at all.
I know the NFS connection is flakey, but it’s a very simple way of keeping things like css/plugins/cached files in sync across the nodes. If anyone has a better suggestion, I’d like to hear it.
For performance testing, I used a slightly modified version of my Magento performance testing site to test the new WordPress cluster set-up. It occurred to me that the performance testing I do on magespeedtest.com could be good for other CMSs. Any feedback on the idea of a wordpress/joomla/$cms speed test tool?
Anyway, the results were pleasing:
129.98 trans/sec @ 0.28 secs each
If I get that much real traffic, I’ll be signing up for adwords before fixing my blog!
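To put that number in perspective, the arithmetic can be sketched as below. The two constants are the figures from the test result; the function names are made up, and the in-flight estimate is just Little's law (average requests in flight = throughput × time per request), which only holds as an average over the test run.

```python
TRANSACTIONS_PER_SECOND = 129.98  # from the test result above
SECONDS_PER_TRANSACTION = 0.28    # from the test result above


def requests_per_day(rate_per_second: float) -> float:
    """Requests served in 24 hours if the rate were sustained all day."""
    return rate_per_second * 24 * 60 * 60


def average_in_flight(rate_per_second: float, seconds_each: float) -> float:
    """Little's law: average number of requests being served at any moment."""
    return rate_per_second * seconds_each


per_day = requests_per_day(TRANSACTIONS_PER_SECOND)
in_flight = average_in_flight(TRANSACTIONS_PER_SECOND, SECONDS_PER_TRANSACTION)
```

Sustained for a whole day, that rate works out to roughly 11 million requests, with about 36 requests in flight at any moment.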
In conclusion, I hope this little guide to WordPress clustering on EC2 is interesting. I can’t with a straight face recommend you go and implement this yourself just yet, but once I have more experience with it, I’ll let you know. I would love any ideas or feedback on improving this setup.
One thing has become clear: EC2 is a powerful infrastructure building block, which really underpins this entire project. It has blazed a trail, and I look forward to more innovation from the AWS team.
First experience update – after 1 day: Well, I lost one node on the first day. A random patch of NFS flakiness took it right out mid-way through reading: Apache stopped responding, reboots would not complete, it was well dead. Solution: terminate it and remove it from the cluster; the spot request will fire up a new one, which gets added to the cluster.
If this happens frequently I will write a short script to automatically terminate long-unhealthy instances, and add their replacements to the cluster.
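The decision part of such a script could be as small as the sketch below. It is not the script itself: the function name, the ten-minute grace period and the instance IDs are all hypothetical, and actually terminating instances and registering replacements (via the AWS tools) is left out.

```python
from typing import Dict, List, Optional


def long_unhealthy(
    unhealthy_since: Dict[str, Optional[float]],
    now: float,
    grace_seconds: float = 600.0,  # assumption: give a node ten minutes to recover
) -> List[str]:
    """Return the IDs of instances that have been unhealthy longer than the grace period.

    unhealthy_since maps an instance ID to the timestamp (in seconds) at which it
    was first seen unhealthy, or None if it is currently healthy.
    """
    return sorted(
        instance_id
        for instance_id, since in unhealthy_since.items()
        if since is not None and now - since > grace_seconds
    )


# Hypothetical snapshot of the cluster at time 2000 (seconds).
state = {"i-aaa": None, "i-bbb": 1000.0, "i-ccc": 1900.0}
to_terminate = long_unhealthy(state, now=2000.0)
```

Here only `i-bbb` is selected: it has been unhealthy for 1000 seconds, while `i-ccc` has only just dropped out and may still recover on its own.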
Second experience update – after 1 month: Geez, has it been a whole month?! The cluster has been rock solid the whole time, no downtime, not even a single node failure since that first one the day after I launched it. I checked the CPU usage across the cluster and it hardly ever spikes for more than a few seconds. There is a lot of network traffic between the nodes due to the NFS, but with EC2 the data transfer between instances is free and fast. So far, so good.
Months later update: I wrote a follow-up to this, autoscaling the cluster. It makes the cluster a lot more reliable, and entirely hands off. You should read it.