Linuxjournal.com Down: Incident Report
Linuxjournal.com has experienced a few growing pains. This newly deployed PHPNuke site encountered some hours of operation at reduced capacity and some downtime after having a story linked by Slashdot. Changes are now in place that, it is hoped, will allow the site to weather similar loads in the future. Some hardware has been upgraded, some software optimizations have been installed, and further hardware upgrades are being considered.
At 3:00 yesterday afternoon, as I was waiting in line at the Post Office, a call on the cell phone alerted me to problems on our web site. Returning to the office, I found my colleagues poring over MySQL errors.
A quick look revealed that the site was running its maximum allowed number of Apache and MySQL processes, had too much memory paged out to swap, and had up to 50 processes in backlog. In a word, thrashing.
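The symptoms above (heavy swap use combined with a long process backlog) can be turned into a crude check. The following is only a sketch, not the procedure we used: it assumes /proc/meminfo-style text with "SwapTotal" and "SwapFree" lines in kB, and the function names and both thresholds are hypothetical values chosen for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key: value kB' lines into a dict of ints (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Key: value' pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def looks_like_thrashing(meminfo, run_queue,
                         swap_fraction_limit=0.5, run_queue_limit=20):
    """Crude heuristic: heavy swap use AND a long process backlog.

    Both limits are hypothetical, not tuned values. Swap occupancy alone
    does not prove thrashing; the rate of paging matters as well.
    """
    total = meminfo["SwapTotal"]
    if total == 0:
        return False  # no swap configured, so nothing can be paged out to it
    swap_fraction = swap_used_kb(meminfo) / total
    return swap_fraction > swap_fraction_limit and run_queue > run_queue_limit


# Made-up sample figures, not measurements from our server.
SAMPLE = "MemTotal: 261000 kB\nSwapTotal: 500000 kB\nSwapFree: 100000 kB\n"
info = parse_meminfo(SAMPLE)
thrashing = looks_like_thrashing(info, run_queue=50)
```

On a live system the text would come from reading /proc/meminfo, and the backlog figure from the load average or a process listing.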
Linux isn't famous for thrashing conditions. On the test bench I've tried many times to provoke these by pushing a system to its limits, with not much luck. Notwithstanding, I had to deal with the evidence at hand, and thrashing is what it looked like.
Hoping to avoid at least the MySQL errors, we shut down the Apache and MySQL daemons, then reduced the number of Apache processes allowed to run concurrently. The plan was to keep the server out of its thrashing condition, even at the cost of failing to satisfy some incoming requests. Initially this looked good. At 5:00 in the afternoon the system was serving pages at a rapid clip, while keeping performance indicators such as process backlog within acceptable levels.
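The reasoning behind capping the process count is simple arithmetic: the cap should be low enough that all web server processes fit in physical memory alongside everything else. Here is a minimal sketch of that calculation. The function name and every figure in it are hypothetical; they are not the values from our server.

```python
def max_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Estimate how many web server processes fit in RAM without swapping.

    total_ram_mb:  physical memory in the machine
    reserved_mb:   memory set aside for the database, kernel and other daemons
    per_worker_mb: typical resident size of one web server process
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable = total_ram_mb - reserved_mb
    return max(usable // per_worker_mb, 0)


# Hypothetical figures: 512 MB of RAM, 192 MB reserved, 8 MB per process.
cap = max_workers(512, 192, 8)  # (512 - 192) // 8 = 40
```

In Apache, a cap of this kind is typically applied through the MaxClients directive; the database's connection limit deserves the same arithmetic.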
The system was still showing significant swap activity, but we judged it stable and adjourned to the bar.
Unfortunately, something happened late in the evening that upset the web server. Even with the new, lower maximums for web and SQL server processes, the system did indeed thrash, serving up more MySQL errors than pages. This condition persisted through the low-demand period of the night, and I became aware of it through an early-morning phone call.
The next obvious thing to try was a RAM upgrade. A phone call to Rackspace.com, our co-location host, set that one in motion.
After upgrading to the full rated amount of RAM the server would no longer boot. The symptoms suggested hard drive failure. Some time was spent testing and substituting components, including the hard drive and the RAM, before discovering a solution: installing less additional RAM. At this point the BIOS was able to recognize the hard drive and the server was able to boot successfully again. Rackspace technicians indicated this could represent a motherboard defect. Hardware changes were suggested.
Up and running with twice its previous amount of RAM, the server now chugged happily along with 0 KB swapped. Load, as measured in pages viewed per minute, was comparable to that observed yesterday.
Back at the ranch, the gang was busy looking through PHPNuke. Before the site rollout we'd added code to implement partial page caching, yielding a large reduction in the number of SQL queries per page. Now we reviewed session management, which produced further improvements. Overall, in the last couple of days we've managed to reduce the number of queries involved in delivering a page by a little less than half. On top of the previous page caching arrangement, this leaves us with at least some hope that the site may be stable under heavy load.
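To illustrate the idea behind partial page caching, here is a minimal sketch in Python. It is not our PHPNuke code: the class name, the time-to-live policy and the in-memory dictionary are all assumptions made for the example. A page fragment is rendered (with whatever SQL queries that takes) at most once per interval, and every other request gets the stored copy.

```python
import time


class FragmentCache:
    """Cache rendered page fragments for a fixed time-to-live (hypothetical design)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expiry time, rendered HTML)

    def get(self, key, render):
        """Return the cached fragment, calling render() only on a miss or after expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no queries issued
        html = render()      # the expensive part: SQL queries plus templating
        self._store[key] = (now + self.ttl, html)
        return html


queries_run = []


def render_story_list():
    # Stand-in for the database work behind one block of the front page.
    queries_run.append("SELECT ... FROM stories")
    return "<ul><li>story</li></ul>"


cache = FragmentCache(ttl_seconds=60)
for _ in range(100):
    cache.get("front-page-stories", render_story_list)
# 100 page views within the interval, but the query ran only once.
```

The trade-off is staleness: a fragment may lag the database by up to the time-to-live, which is usually acceptable for a story list and less so for anything tied to a user's session.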
Some of our changes are LJ specific. We are investigating the possibility of submitting the remainder to the PHPNuke project.
Bottom line: for what we're doing with this site, we might well be using a heavier server. For at least the next few days we won't be. We've upgraded memory, and are now concentrating on further optimizing PHPNuke. MySQL parameters will also come in for close inspection.
I remain unsatisfied with the thrashing hypothesis. The site went into an error condition some time last night and stayed there through several hours of reduced load. That is not very Linux-like. It doesn't add up. We're still looking for a missing piece.
To be continued.