Hi All,
We have been using Grafana for the past two years, and we also use it for alerting, which normally works perfectly fine.
At some customers we have some really old hardware that only supports SNMP traps. As Prometheus does not support these directly, I thought: why not build an snmptrapd service and scrape its logs with Promtail? There I already filter and format the logs by timestamp, hostname and trap message, so it is easier to find the alerts for specific devices.
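For context, the split that Promtail performs on each trap entry is roughly the following. This is only a Python sketch of the idea, not my actual Promtail pipeline config. The line layout is an assumption based on the labels visible in the Grafana log further down, the sample entry is shortened, and `TRAP_LINE` and `parse_trap_line` are names made up for this illustration.

```python
import re

# Assumed layout of one entry: "<date> <time> <hostname> <rest of trap>".
TRAP_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<message>.*)$",
    re.DOTALL,  # the trap body can continue on following lines
)


def parse_trap_line(line: str) -> dict:
    """Split one snmptrapd log entry into timestamp, hostname and message."""
    match = TRAP_LINE.match(line)
    if match is None:
        raise ValueError(f"unexpected snmptrapd line: {line!r}")
    return match.groupdict()


# Shortened sample entry, built from the values shown in the Grafana log.
sample = (
    "2024-09-02 06:13:19 testhost06.testdomain.lab "
    "[UDP: [192.168.64.14]:36807->[192.168.64.2]:162]:\n"
    '.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.1.0 = STRING: "TST001"'
)
fields = parse_trap_line(sample)
print(fields["timestamp"], fields["hostname"])
```

The three fields correspond to the timestamp, hostname and trap message values that end up on the log stream.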
Now to the issue. I have the following query (example for an iDRAC device in my test lab at home):
count_over_time({filename="/var/log/snmptrapd/snmptrapd.log"} |= "674.10892.5.3.2.5.0"
| pattern "<_> <message_part> duration<_>"
[5m])
The alert fires as soon as the query finds a log line containing the string (Pending = None).
Notifications go through a notification policy that I built specifically for the SNMP job. The policy details are:
Contact point: my email address
Continue matching subsequent sibling nodes = yes
Group by = Disabled (…)
Override general timings
Group Wait = 10s
Group interval = 30s
Repeat interval = 5m
As you can see, I already chose very low timings because I thought that these, together with Disabled (…), would fix the issue, but they did not.
(The main issue is that the alert notification is sent exactly once and never again after that.)
If I check the logs I see the following:
logger=ngalert.state.manager rule_uid=edwe0v4h1xibkd org_id=1 t=2024-09-02T08:24:47.150341639+02:00 level=info msg="Detected stale state entry" cacheID="[["alert_rule_namespace_uid","admes6eidqb5sb"],["alert_rule_uid","edwe0v4h1xibkd"],["alertname","IDRAC SNMP Test Alert"],["filename","/var/log/snmptrapd/snmptrapd.log"],["grafana_folder","Sysman"],["hostname","testhost06.testdomain.lab"],["job","snmptrapd_logs"],["message_part",".iso.org.dod.internet.mgmt.mib-2.system.sysUpTime.sysUpTimeInstance = Timeticks: (146287383) 16 days, 22:21:13.83\t.iso.org.dod.internet.snmpV2.snmpModules.snmpMIB.snmpMIBObjects.snmpTrap.snmpTrapOID.0 = OID: .iso.org.dod.internet.private.enterprises.674.10892.5.3.2.5.0.10395\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.1.0 = STRING: \"TST001\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.2.0 = STRING: \"The iDRAC generated a test trap event in response to a user request.\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.3.0 = INTEGER: 3\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.4.0 = STRING: \"11234568\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.5.0 = STRING: \"testhost06.testdomain.lab\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.6.0 = \"\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.7.0 = STRING: \"N/A\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.8.0 = \"\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.9.0 = STRING: \"11234568\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.10.0 = STRING: \"Main System Chassis\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.11.0 = STRING: \"testhost06\""],["message_part_extracted","06:13:19 testhost06.testdomain.lab [UDP: [192.168.64.14]:36807-\u003e[192.168.64.2]:162]:\n.iso.org.dod.internet.mgmt.mib-2.system.sysUpTime.sysUpTimeInstance = Timeticks: (146287383) 16 days, 22:21:13.83\t.iso.org.dod.internet.snmpV2.snmpModules.snmpMIB.snmpMIBObjects.snmpTrap.snmpTrapOID.0 = OID: .iso.org.dod.internet.private.enterprises.674.10892.5.3.2.5.0.10395\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.1.0 = STRING: \"TST001\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.2.0 = STRING: \"The iDRAC generated a test trap event in response to a user request.\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.3.0 = INTEGER: 3\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.4.0 = STRING: \"11234568\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.5.0 = STRING: \"testhost06.testdomain.lab\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.6.0 = \"\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.7.0 = STRING: \"N/A\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.8.0 = \"\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.9.0 = STRING: \"11234568\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.10.0 = STRING: \"Main System Chassis\"\t.iso.org.dod.internet.private.enterprises.674.10892.5.3.1.11.0 = STRING: \"testhost06\""],["snmp","snmp"],["timestamp","2024-09-02 06:13:19"]]" state=Normal reason=NoData
What I see is the stale state, which I don't understand: I do have a log message, and this message changes every time because of the timestamp, which makes it unique. As these are SNMP traps, I only get a log entry when something is broken, which means no value = OK.
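To show what I mean by unique, here is a small Python sketch. It assumes that Grafana identifies each alert instance by its complete label set, which is what the cacheID in the log above looks like to me, but I have not confirmed this. `instance_key` is a made-up helper, the label values are shortened, and the second trap is hypothetical.

```python
def instance_key(labels: dict) -> tuple:
    """Build a hashable identity from a full label set (assumed behaviour)."""
    return tuple(sorted(labels.items()))


# First trap: values shortened from the log above.
first_trap = {
    "hostname": "testhost06.testdomain.lab",
    "job": "snmptrapd_logs",
    "message_part": "sysUpTimeInstance = Timeticks: (146287383) TST001",
    "timestamp": "2024-09-02 06:13:19",
}
# Hypothetical second trap from the same host five minutes later:
# only the uptime inside the message and the timestamp differ.
second_trap = dict(
    first_trap,
    message_part="sysUpTimeInstance = Timeticks: (146317383) TST001",
    timestamp="2024-09-02 06:18:19",
)

# With every label included, each trap gets its own identity.
keys = {instance_key(first_trap), instance_key(second_trap)}
print(len(keys))

# With only the labels that never change, both traps share one identity.
stable = {
    instance_key({"hostname": trap["hostname"], "job": trap["job"]})
    for trap in (first_trap, second_trap)
}
print(len(stable))
```

Under that assumption, the identity would only stay the same between traps if the changing labels were left out.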
Is there anything I am doing wrong?