the pulse

You've bought a bunch of servers and they're all set up at a facility, but how are they doing? How do you know whether your application is slow because there are lots of users, because you've got a disk going bad, or because your indexes no longer fit into memory? How do you know when Apache has died?

The questions above fall into two separate types:

know when things die, break, or are close to breaking
get the pulse of our systems to see trends, problems, bottlenecks

First, we need a tool to notify a responsible party when services die, disks get full, databases stop accepting connections, server load spikes, or a switch starts dropping packets. We've chosen Nagios to fill this void. Nagios allows us to monitor all of the above on each of our servers through plugins, and it builds redundancy into notification: alerts can be sent to a secondary party if the first neither fixes a problem nor acknowledges it within a certain amount of time. In addition to a highly flexible set of bundled plugins, it's easy to add new plugins to monitor custom application services and verify things are in working order.
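To give a feel for how small a custom plugin can be, here is a rough sketch in Python. It relies only on the standard Nagios plugin convention: print a single status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The function names, the "DISK" label, and the 80/90 percent thresholds are made up for illustration; this is not one of the bundled plugins and not what we actually run.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a status code (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # If the check itself cannot run, report UNKNOWN rather than guess.
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {path} is {percent:.1f}% full"


def main():
    """Print the one-line status and return the exit code for Nagios."""
    code, line = check_disk("/")
    print(line)
    return code
```

Ending the script with `sys.exit(main())` (after `import sys`) would make it usable as a plugin, since Nagios reads the exit code to decide whether to alert. Keeping the threshold logic in `classify` makes it easy to test without touching a real disk.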

Second, we need a tool that lets us see trends in load, memory, disk space, network traffic, database queries, mail queues, and application metrics in graphical format. At the workshop in New York, I was turned on to Ganglia, which monitors and graphs metrics just like these using RRDTool (by the author of MRTG). Ganglia monitors clustered systems by using multicast to communicate among the servers. We now track trends across our web, database, mail, and supporting servers in a nice web interface. In addition, I threw together a PHP script to monitor MySQL metrics in a matter of a few hours.
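The MySQL script mentioned above is PHP and isn't reproduced here, but the general idea can be sketched in Python: read MySQL status counters and hand each one to Ganglia's `gmetric` command-line tool. This sketch assumes the status text comes from something like `mysql --batch --skip-column-names -e "SHOW GLOBAL STATUS"` (tab-separated name and value per line) and that `gmetric` is on the path. The `mysql_` metric prefix, the choice of counters, and the function names are all illustrative.

```python
import subprocess


def parse_status(text):
    """Parse tab-separated `name<TAB>value` lines, keeping only numeric values."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            metrics[name] = float(value)
        except ValueError:
            # Skip non-numeric status variables (e.g. empty or ON/OFF values).
            continue
    return metrics


def gmetric_args(name, value, units="count"):
    """Build the gmetric command line for one metric (prefix is an assumption)."""
    return [
        "gmetric",
        f"--name=mysql_{name.lower()}",
        f"--value={value}",
        "--type=double",
        f"--units={units}",
    ]


def publish(metrics, wanted=("Threads_connected", "Questions")):
    """Send the selected metrics to Ganglia by invoking gmetric once per metric."""
    for name in wanted:
        if name in metrics:
            subprocess.run(gmetric_args(name, metrics[name]), check=True)
```

Run from cron every minute or so, something like this would make the chosen counters show up as graphs alongside the built-in metrics. Note that counters such as `Questions` only ever increase, so a real script would probably publish the difference between runs rather than the raw value.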

Not only can we get immediate notification (to a pager) when things break, but we can now diagnose more abstract problems like bottlenecks and hardware problems before they become critical.

[tags]monitoring, nagios, ganglia, linux[/tags]