Over the past day or so, I’ve been working on the absolute worst kind of tech problem, intermittent. Basically, some time ago I deployed another web server, and recently my monitoring service started notifying me that about once an hour it would be inaccessible from the internet for a few minutes. The strange part was, I couldn’t detect any problem accessing the site locally, and most of the time it was accessible from outside the network.
I initially suspected that the culprit was my Mongrel installation, so I got to work troubleshooting that. I spent over an hour working on mongrel before I realized that Mongrel probably wasn’t the problem. Indeed it wasn’t, my first mistake was not attacking the problem at the beginning. This server is deployed behind a Linux bridging firewall, and is physically a Sun Microsystems box running Xen. This particular server is virtualized on that box along with several other systems, and runs an Apache load balancer for the mongrel cluster. In other words, there are many layers where something could go wrong. The correct approach would have been to verify network connectivity internally and externally during an outage period with a tool like ping. After I realized that Mongrel was running just fine, that’s what I did.
After I enabled ICMP packets to the host in the firewalls, I was able to determine that nothing was getting to that host intermittently. I inspected the routing table of the router, and was able to get packets to flow again whenever a disruption started. Obviously I couldn’t manually reset the routing whenever the connection was lost, so I set out to find the real problem.
After examining my xen configuration files, I started to look into the mac addresses that the bridge was seeing. I quickly discovered that my servers mac address wasn’t listed because after periods of inactivity it would effectively time out. Upon closer inspection of the servers network configuration files, I was finally able to determine that while the server had static public IP addresses and a gateway set, it was also pulling a private IP and a gateway from the DHCP server. This interface was acting as the primary interface, which was causing the communications errors. Once I disabled DHCP and configured a static private IP address, I no longer experienced issues.
When dealing with complex environments and intermittent problems like this, it is important to isolate the problem and address it from a bottom up manner. If I had started at the base of the problem and checked the network connectivity to begin with instead of assuming Mongrel was at fault, I would have saved quite a bit of time. Additionally, when dealing with intermittent problems it is important to track down the source and either fix it outright, or force it to fail totally so that the issue can actually be addressed. Once I realized the issue was a routing problem somewhere along the chain, and was able to make it fail predictably, isolating the cause of the issue was much easier, because I no longer had to second guess if my actions were only making the problem worse.
If you follow these two rules when dealing with complex and intermittent issues, you’ll save yourself a great deal of time and trouble and ultimately come up with an exact solution. In other words, handle bugs in complex environments as orderly as possible, otherwise your actions may exacerbate the situation.