Grafana agent health check

kennytungph · September 6, 2023, 12:36pm

Grafana OSS v10.1, Loki, and Prometheus are installed and running properly.
The basic settings (data sources, dashboards, alert rules, etc.) are complete and working normally.
Grafana Agent is installed on more than 1,000 servers (Windows and Linux), and they report metrics and logs to Prometheus and Loki.
However, I have run into a problem monitoring the health of Grafana Agent itself.
When Grafana Agent stops or crashes unexpectedly, for whatever reason, I don't seem to have any way to detect it.
What is the proper way to monitor agent health?

integrations:
  node_exporter:
    enabled: true
    scrape_interval: 60s
    scrape_timeout: 30s
    disable_collectors:
      - ipvs
      - btrfs
      - infiniband
    netclass_ignored_devices: "^(veth.*|cali.*|[a-f0-9]{15})$"
    netdev_device_exclude: "^(veth.*|cali.*|[a-f0-9]{15})$"
    filesystem_fs_types_exclude: "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    metric_relabel_configs:
      - action: drop
        regex: node_scrape_collector_.+
        source_labels: [__name__]
    relabel_configs:
      - replacement: tbchostname.rhidm.net
        target_label: instance
  prometheus_remote_write:
    - url: https://kenntun-monitor.asuscomm.com:9090/api/v1/write
      remote_timeout: 30s
      basic_auth:
        username: padmin
        password: XXX
      tls_config:
        insecure_skip_verify: true
      queue_config:
        batch_send_deadline: 60s
  agent:
    enabled: true
    relabel_configs:
      - action: replace
        source_labels:
          - agent_hostname
        target_label: instance
      - action: replace
        target_label: job
        replacement: "integrations/agent"
    metric_relabel_configs:
      - action: keep
        regex: (prometheus_target_.*|prometheus_sd_discovered_targets|agent_build.*|agent_wal_samples_appended_total|process_start_time_seconds)
        source_labels:
          - __name__

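To make the goal concrete, the kind of check I am after might look like the sketch below. It is untested and only an assumption on my side: it relies on the `agent_build_info` metric, which I expect to pass the `agent_build.*` keep rule above, and on the `job="integrations/agent"` label set by the relabel rule. The group name, alert name, severity label, and the 15-minute offset are hypothetical values chosen for illustration. The idea is to compare the instances that reported 15 minutes ago with the instances that report now, since remote-written series simply disappear when an agent stops.

```yaml
groups:
  - name: grafana-agent-health            # hypothetical group name
    rules:
      - alert: GrafanaAgentStoppedReporting   # hypothetical alert name
        # Left side: instances that had an agent_build_info sample 15 minutes ago.
        # Right side: instances that have one now.
        # "unless" keeps only the instances missing from the right side.
        expr: |
          group by (instance) (agent_build_info{job="integrations/agent"} offset 15m)
          unless
          group by (instance) (agent_build_info{job="integrations/agent"})
        for: 5m
        labels:
          severity: warning               # hypothetical label value
        annotations:
          summary: "Grafana Agent on {{ $labels.instance }} has stopped reporting"
```

One limitation I can already see: once the old samples also fall outside the offset window, both sides are empty and the alert clears by itself, so the offset would probably need to be much longer (or the check done differently) to keep a dead agent visible. The same expression could presumably be used in a Grafana-managed alert rule instead of a Prometheus rule file.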