I am trying to define a site/network status dashboard. It should show a heatmap of the health of the systems at a site or in the whole network. But what defines a healthy or “sick” system? What (small) set of metrics would be sufficient?
USE/RED:
SLO:
This is a really difficult question. Some people like to see their sites idling, at less than 10% resource usage; others like to see their money working, at 70-80% usage. It depends on your philosophy. In most cases the hard ceiling of a resource is what anchors the metric.
For example, if you have a CPU, 100% utilization is a bottleneck. Think of an engine with a rev counter (tachometer, or tach). The green band on the tach tells you the engine is operating normally, the yellow band shows it is revving higher, and when it is “redlining” it is running well above its normal range. In a daily driver you want the engine in the green for fuel efficiency, longevity and so on, but a racecar driver might want the engine in the red most of the time, so it depends on your use case. Either way, the maximum revs the engine can reach without exploding is your limit, and you set your other thresholds below that: for example 90-100% CPU in the red, 70-90% in the orange, 60-70% in the yellow, and below that green, again depending on your use case. That way, when you check your dashboard, the colours show you what is operating outside your acceptable thresholds. The same goes for networks: if you have a 1 Gbps internet line, 100% saturation should show red, and you can scale everything down from there.
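The banding described above could be sketched roughly like this. It is only an illustration: the function names and the `BANDS` table are made up for the example, the orange band is assumed to run from 70% up to 90%, and the 1 Gbps capacity is just the example line speed from above.

```python
# Hypothetical colour bands, checked from the ceiling downwards.
# Each entry is (lower bound in percent, colour); tune these to your use case.
BANDS = [
    (90.0, "red"),
    (70.0, "orange"),
    (60.0, "yellow"),
    (0.0, "green"),
]


def utilization_colour(utilization_pct: float) -> str:
    """Map a utilization percentage (0-100) to a dashboard colour."""
    if not 0.0 <= utilization_pct <= 100.0:
        raise ValueError("utilization must be between 0 and 100 percent")
    for lower_bound, colour in BANDS:
        if utilization_pct >= lower_bound:
            return colour
    return "green"  # not reached: the last band starts at 0


def link_utilization_pct(bits_per_second: float,
                         capacity_bps: float = 1_000_000_000) -> float:
    """Express link throughput as a percentage of capacity (1 Gbps assumed)."""
    return 100.0 * bits_per_second / capacity_bps


if __name__ == "__main__":
    print(utilization_colour(95.0))
    print(utilization_colour(link_utilization_pct(650_000_000)))
```

Because everything is expressed as a percentage of the ceiling, the same bands work for CPU, disk throughput or a network link.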
The four golden signals are latency, traffic, errors and saturation.
Latency: how long does it take to do something, such as signing on to an application, consuming a web service or downloading a file? Gather these metrics over a period of time to get a baseline or benchmark, then configure your thresholds to show red when latency is higher than normal and green when it is normal or below.
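As a rough sketch of the baseline idea, one option is to take a high percentile of the collected latencies as "normal" and compare new measurements against it. Using the 95th percentile is my assumption here, not a rule, and the function names are invented for the example.

```python
from statistics import quantiles


def latency_baseline(samples_ms):
    """Return the 95th percentile of historical latencies (assumed baseline)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    # "inclusive" treats the samples as the full observed range.
    return quantiles(samples_ms, n=20, method="inclusive")[18]


def latency_colour(current_ms, baseline_ms):
    """Red when slower than the baseline, green when at or below it."""
    return "red" if current_ms > baseline_ms else "green"


if __name__ == "__main__":
    history = [100, 120, 110, 130, 105, 115, 125, 140, 135, 150]
    baseline = latency_baseline(history)
    print(latency_colour(120, baseline))
    print(latency_colour(400, baseline))
```

Recomputing the baseline on a schedule (say weekly) keeps "normal" in step with how the system actually behaves.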
Traffic: how much demand is there on the system? Think reads and writes on your network switches, reads and writes on your disks, how many people are logged in, how many transactions per second, and so on. Get a feel for what is normal and flag when you are above or below the norm.
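One simple way to flag "above or below the norm" is to compare the current value with the mean of recent history, allowing a few standard deviations of slack. This is only a sketch: the three-sigma tolerance and the function name are assumptions, and real traffic often has daily or weekly cycles that a plain mean ignores.

```python
from statistics import mean, stdev


def traffic_status(current, history, tolerance_sigmas=3.0):
    """Classify current traffic as 'above', 'below' or 'normal'."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    centre = mean(history)
    spread = stdev(history)
    if current > centre + tolerance_sigmas * spread:
        return "above"
    if current < centre - tolerance_sigmas * spread:
        return "below"
    return "normal"


if __name__ == "__main__":
    transactions_per_second = [100, 102, 98, 101, 99]
    print(traffic_status(100, transactions_per_second))
    print(traffic_status(200, transactions_per_second))
    print(traffic_status(10, transactions_per_second))
```

Flagging "below" matters as much as "above": a sudden drop in traffic is often the first sign that something upstream has broken.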
Errors: how many errors are “normal”, how many can you tolerate, what classifies as an error, and are errors being tracked, recorded or actioned in any way? You can monitor switches, app servers, applications, OS logs etc. to collect the errors.
Saturation: how full is your system, in terms of resource utilization? Is it running at 100% CPU, 100% disk throughput, 100% network usage? Also track how much disk space is being used and, based on the current usage trend, alert when it is filling up to 80% or more, depending on the disk size.
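The "based on the current trend" part can be sketched by fitting a straight line through recent disk usage samples and estimating when it crosses the alert level. Linear growth is an assumption that real disks do not always follow, and the function name and the (day, used percent) sample format are invented for the example.

```python
def days_until_threshold(samples, threshold_pct=80.0):
    """Estimate days until disk usage reaches threshold_pct.

    samples: list of (day, used_pct) tuples, oldest first.
    Returns 0.0 if already at or above the threshold, and None if usage
    is flat or shrinking (the threshold is never reached on this trend).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    day_variance = sum((day - mean_day) ** 2 for day, _ in samples)
    if day_variance == 0:
        raise ValueError("samples must span more than one point in time")
    # Least-squares slope: percentage points of disk consumed per day.
    slope = sum((day - mean_day) * (used - mean_used)
                for day, used in samples) / day_variance
    _, latest_used = samples[-1]
    if latest_used >= threshold_pct:
        return 0.0
    if slope <= 0:
        return None
    return (threshold_pct - latest_used) / slope


if __name__ == "__main__":
    usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(days_until_threshold(usage))
```

Alerting on "days left" rather than on the raw percentage gives you the same lead time on a small disk as on a large one.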
Going back to the car: the engine is both very complicated and very simple at the same time, but the driver is only presented with a speedometer, a rev counter, an oil pressure gauge, an engine temperature gauge and a check engine light, which is a catch-all for a lot of other things. So with just a handful of gauges you can determine the health of a very complicated system. Which gauges and thresholds you pick really depends on your use case.
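For the heatmap itself, one possible way to turn a handful of gauges into a single cell per system is to show the worst colour among them. This "worst gauge wins" rule is my assumption rather than the only option, and the names below are hypothetical.

```python
from typing import Mapping

# Hypothetical severity ranking; higher means worse.
SEVERITY = {"green": 0, "yellow": 1, "orange": 2, "red": 3}


def system_colour(gauges: Mapping[str, str]) -> str:
    """Roll several gauge colours up into one heatmap cell (worst wins)."""
    if not gauges:
        raise ValueError("at least one gauge is required")
    return max(gauges.values(), key=lambda colour: SEVERITY[colour])


if __name__ == "__main__":
    web_server = {
        "latency": "green",
        "traffic": "yellow",
        "errors": "green",
        "saturation": "orange",
    }
    print(system_colour(web_server))
```

The same roll-up can be applied again one level up, so a site takes the worst colour of its systems and the network takes the worst colour of its sites.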