We run bitly split across two data centers: one is a managed environment with DELL hardware, and the other is Amazon EC2.
Fork Rate. A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second.
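One way to watch the fork rate is sketched below. It assumes a Linux host, where the `processes` line of `/proc/stat` holds the cumulative number of forks since boot. The function names and the threshold of 100 forks per second are illustrative assumptions, not bitly's actual check.

```python
def parse_fork_count(stat_text: str) -> int:
    """Return the cumulative fork count from /proc/stat-formatted text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(previous: int, current: int, interval_seconds: float) -> float:
    """Forks per second between two samples taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current - previous) / interval_seconds


def is_fork_rate_abnormal(rate: float, threshold: float = 100.0) -> bool:
    """Flag rates above the threshold (an assumed, tunable value)."""
    return rate > threshold


# Typical use on a Linux host (not executed here):
#   first = parse_fork_count(open("/proc/stat").read())
#   time.sleep(10)
#   second = parse_fork_count(open("/proc/stat").read())
#   alert = is_fork_rate_abnormal(fork_rate(first, second, 10))
```

Sampling the counter twice and dividing by the interval gives a rate that can be graphed or alerted on, which is what separates the expected 1-10/second from several hundred a second.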
Flow Control Packets. If your network configuration honors flow control packets and they have not been explicitly disabled, traffic can be dropped temporarily.
Swap In/Out Rate. Measure the right thing. It is the rate at which memory is swapped in and out that hurts performance, not the amount of swap in use.
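A minimal sketch of measuring swap rate rather than swap usage follows. It assumes a Linux host, where `/proc/vmstat` exposes the cumulative counters `pswpin` and `pswpout` (pages swapped in and out since boot). The function names are illustrative.

```python
def parse_swap_counters(vmstat_text: str) -> tuple:
    """Return (pswpin, pswpout) page counters from /proc/vmstat-formatted text."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    if len(counters) != 2:
        raise ValueError("pswpin/pswpout not found")
    return counters["pswpin"], counters["pswpout"]


def swap_rates(previous: tuple, current: tuple, interval_seconds: float) -> tuple:
    """Pages swapped in and out per second between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rate_in = (current[0] - previous[0]) / interval_seconds
    rate_out = (current[1] - previous[1]) / interval_seconds
    return rate_in, rate_out
```

A host can hold a large amount of data in swap and still show rates near zero, which is usually harmless. A sustained nonzero rate is the signal worth alerting on.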
Server Boot Notification. Use an init script to send a notification every time a server boots, so you can see when servers have died and restarted. Servers do die, but are they dying too often?
NTP Clock Offset. If you are not checking, one of your servers is probably not properly time synced.
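As a sketch of what such a check computes, the standard NTP offset estimate uses four timestamps from one request/response exchange. The function names and the 100 ms alert threshold below are assumptions for illustration. In practice the timestamps would come from an NTP client or an existing monitoring plugin.

```python
def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Estimate how far the server clock is ahead of the local clock, in seconds.

    t1: local time when the request was sent
    t2: server time when the request was received
    t3: server time when the reply was sent
    t4: local time when the reply was received
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def offset_exceeds(offset_seconds: float, max_offset_seconds: float = 0.1) -> bool:
    """True when the absolute offset is beyond the (assumed) tolerance."""
    return abs(offset_seconds) > max_offset_seconds
```

Averaging the two one-way differences cancels the network delay when it is roughly symmetric, leaving only the clock difference.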
DNS Resolutions. This is a key part of your infrastructure that often goes unchecked, and it can be the source of a lot of latency and availability problems. For internal DNS, check query volume, latency, and availability. For external DNS, verify that the servers are available and give the correct answers.
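A rough sketch of a latency and correctness probe is shown below. It times a lookup through the system resolver using Python's `socket.getaddrinfo`, so it reflects whatever resolver and caching the host is configured with. Querying one specific DNS server would need a dedicated DNS library, which is not shown. The function names and the injectable `resolver` parameter are illustrative assumptions.

```python
import socket
import time


def time_resolution(hostname: str, resolver=socket.getaddrinfo):
    """Resolve hostname and return (ok, elapsed_seconds, sorted_addresses)."""
    start = time.monotonic()
    try:
        results = resolver(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False, time.monotonic() - start, []
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    addresses = sorted({entry[4][0] for entry in results})
    return True, elapsed, addresses


def answers_match(addresses, expected) -> bool:
    """Check that the resolved addresses are exactly the expected set."""
    return set(addresses) == set(expected)
```

Recording `ok` and `elapsed` over time covers availability and latency, while `answers_match` covers the "correct answers" part of the check.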
SSL Expiration. Don't let those certificates expire. Set up an expiration check.
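An expiration check can be sketched with Python's standard `ssl` module, as below. The function names are illustrative, and the host being probed and the number of warning days are left to the reader.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float = None) -> float:
    """Days remaining before a certificate's notAfter time.

    not_after uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host and return the notAfter field of its certificate.

    With the default context, an already-expired certificate fails the
    handshake with ssl.SSLCertVerificationError, which is itself an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring job could call `days_until_expiry(fetch_not_after(host))` for each public endpoint and warn once the result drops below a chosen number of days.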
DELL OpenManage Server Administrator (OMSA). Monitor the outputs from OMSA to know when failures have occurred.
Connection Limits. Do you know how close you are to your connection limits?
Load Balancer Status. Make your load balancer's health stats visible so that you always know its status.