I’ve had all of our servers running beneath the Xen hypervisor for a little over two years now, and all in all, I’ve got to say I don’t think I’d ever consider going back to the old way. Xen makes it easy to add new VM’s, and moving servers between physical hardware nodes isn’t too difficult either. Being able to boot a virtual server from the SAN means that you can boot that server on any node that can get to the SAN. This has enormous benefits from a management perspective.
Of course, virtualization isn’t all rosy. The configuration process is much more complex, system failure is harder to detect and troubleshoot, and performance is slightly lower than it would be on actual hardware. In my mind, these are all reasonable trade-offs once you consider the vastly increased flexibility you’ll achieve with a virtualized operating environment.
With virtualization finally taking off, and the relative familiarity with it in the industry, perhaps it’s time to take another look at the world of thin clients.
Over the past day or so, I’ve been working on the absolute worst kind of tech problem, intermittent. Basically, some time ago I deployed another web server, and recently my monitoring service started notifying me that about once an hour it would be inaccessible from the internet for a few minutes. The strange part was, I couldn’t detect any problem accessing the site locally, and most of the time it was accessible from outside the network.
I initially suspected that the culprit was my Mongrel installation, so I got to work troubleshooting that. I spent over an hour working on mongrel before I realized that Mongrel probably wasn’t the problem. Indeed it wasn’t, my first mistake was not attacking the problem at the beginning. This server is deployed behind a Linux bridging firewall, and is physically a Sun Microsystems box running Xen. This particular server is virtualized on that box along with several other systems, and runs an Apache load balancer for the mongrel cluster. In other words, there are many layers where something could go wrong. The correct approach would have been to verify network connectivity internally and externally during an outage period with a tool like ping. After I realized that Mongrel was running just fine, that’s what I did.
After I enabled ICMP packets to the host in the firewalls, I was able to determine that nothing was getting to that host intermittently. I inspected the routing table of the router, and was able to get packets to flow again whenever a disruption started. Obviously I couldn’t manually reset the routing whenever the connection was lost, so I set out to find the real problem.
After examining my xen configuration files, I started to look into the mac addresses that the bridge was seeing. I quickly discovered that my servers mac address wasn’t listed because after periods of inactivity it would effectively time out. Upon closer inspection of the servers network configuration files, I was finally able to determine that while the server had static public IP addresses and a gateway set, it was also pulling a private IP and a gateway from the DHCP server. This interface was acting as the primary interface, which was causing the communications errors. Once I disabled DHCP and configured a static private IP address, I no longer experienced issues.
When dealing with complex environments and intermittent problems like this, it is important to isolate the problem and address it from a bottom up manner. If I had started at the base of the problem and checked the network connectivity to begin with instead of assuming Mongrel was at fault, I would have saved quite a bit of time. Additionally, when dealing with intermittent problems it is important to track down the source and either fix it outright, or force it to fail totally so that the issue can actually be addressed. Once I realized the issue was a routing problem somewhere along the chain, and was able to make it fail predictably, isolating the cause of the issue was much easier, because I no longer had to second guess if my actions were only making the problem worse.
If you follow these two rules when dealing with complex and intermittent issues, you’ll save yourself a great deal of time and trouble and ultimately come up with an exact solution. In other words, handle bugs in complex environments as orderly as possible, otherwise your actions may exacerbate the situation.