Kubernetes restartCount

Hi,

We’re collecting Kubernetes stats via Telegraf and InfluxDB. One of the key metrics I’m trying to visualise is the number of container restarts for each pod. I can see there’s a tag key named “io.kubernetes.container.restartCount”, but because it’s a tag rather than a field I can’t figure out how to plot the restart count for each pod.

Can anyone provide me with some pointers, please? This strikes me as one of the most important metrics available in kube and I’ve so far been unable to get it graphed, either via Prometheus or InfluxDB.

Thanks for any help you can offer!

Rich

kube-state-metrics exposes the restart count for the containers in a pod, if you are using Prometheus: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/pod-metrics.md
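For what it's worth, here is a rough sketch of pulling per-pod restart counts out of the text that kube-state-metrics serves on its metrics endpoint. It assumes the metric is named kube_pod_container_status_restarts_total and carries namespace, pod and container labels (check the linked doc for your version, older releases may name it differently). The sample text and the helper name restarts_per_pod are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed metric name; verify against the kube-state-metrics docs for your release.
METRIC = "kube_pod_container_status_restarts_total"

# One sample line looks like: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def restarts_per_pod(exposition_text):
    """Sum the per-container restart counters for each (namespace, pod)."""
    totals = defaultdict(int)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != METRIC:
            continue  # not the metric we care about
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        totals[key] += int(float(match.group("value")))
    return dict(totals)


# Made-up sample of what the endpoint might return.
SAMPLE = """\
# TYPE kube_pod_container_status_restarts_total counter
kube_pod_container_status_restarts_total{namespace="default",pod="web-1",container="nginx"} 3
kube_pod_container_status_restarts_total{namespace="default",pod="web-1",container="sidecar"} 1
kube_pod_container_status_restarts_total{namespace="kube-system",pod="dns-0",container="dns"} 0
kube_pod_status_ready{namespace="default",pod="web-1",condition="true"} 1
"""

if __name__ == "__main__":
    for (namespace, pod), count in sorted(restarts_per_pod(SAMPLE).items()):
        print(f"{namespace}/{pod}: {count}")
```

The point is that the restart count arrives as a numeric sample value with the pod name as a label, which is the shape you need for graphing.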

Unfortunately, I haven’t seen anything similar for InfluxDB. You are correct in thinking that it is one of the most important metrics to monitor.

Hi, thanks for the response. When we were first using kube we had Prom installed and still couldn’t get these metrics out. Either they weren’t there or we weren’t trying hard enough. We actually found Prom pretty difficult to get anything useful from, which is why we switched to InfluxDB, having used that in the past.

I have just found the kube-state-metrics container from the kube project which promises to provide the stats we’re looking for, which we can collect via the Prometheus input.

So far we’ve only succeeded in upsetting the influx process by trying it, but I’ll let everyone know how we get on.

Yes, monitoring Kubernetes without kube-state-metrics is quite meaningless. I wrote a version of it for our Graphite monitoring of our Kubernetes clusters. Unless something similar now exists for InfluxDB, I think you would be better off going back to Prometheus (in my opinion). The standard metrics from Heapster were quite useless (again, my opinion).

I plan on using the Telegraf Prometheus input to scrape the metrics from kube-state-metrics and write them to InfluxDB.
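To illustrate why this should get around the tag problem, here is a minimal sketch of how one scraped sample could look once written as InfluxDB line protocol: the labels become tags and the restart count becomes a field. This is not Telegraf's code. Using the metric name as the measurement and calling the field counter are assumptions on my part, so check what your Telegraf version actually writes. The helper names are hypothetical.

```python
def escape_tag(text):
    """Escape the characters that are special in line-protocol tag keys and values."""
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def escape_measurement(text):
    """Escape the characters that are special in a line-protocol measurement name."""
    for char in (",", " "):
        text = text.replace(char, "\\" + char)
    return text


def to_line_protocol(measurement, tags, field_name, value, timestamp_ns):
    """Build one line: measurement,tag=value,... field=<int>i timestamp."""
    tag_part = ",".join(
        f"{escape_tag(key)}={escape_tag(val)}" for key, val in sorted(tags.items())
    )
    head = escape_measurement(measurement)
    if tag_part:
        head = f"{head},{tag_part}"
    # The trailing "i" marks the field as an integer in line protocol.
    return f"{head} {field_name}={int(value)}i {timestamp_ns}"


if __name__ == "__main__":
    # Made-up sample values for illustration.
    print(
        to_line_protocol(
            "kube_pod_container_status_restarts_total",
            {"namespace": "default", "pod": "web-1", "container": "nginx"},
            "counter",  # assumed field name
            3,
            1500000000000000000,
        )
    )
```

With the count stored as a field, it can be aggregated and grouped by the pod tag, which is what wasn't possible when restartCount was only a tag.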

I really don’t want to use prom again, for so many reasons.

Hi richarww.

I came to the same conclusion after using Prom and Telegraf.
It seems more convenient to use Telegraf to collect the Prom data.
Do you have any documentation for doing this, or a source that could guide me?

Best regards