Possible False Alarms with Synthetics

I seem to be getting the odd false positive from Synthetics. I currently have some simple Synthetics checks pointing at both my personal website and my employer’s (as a demo for my boss).

Last night (at 15:25 UTC on the 18th) I got an alert saying both were down for a few minutes. On checking, that doesn’t appear to have been the case, and it is unlikely that both my site and my employer’s would be down at exactly the same time. They are on completely different infrastructure and hosting companies.

Also, I can’t see any sign of an error in the default Synthetic Dashboard. I’ve been getting similar errors every week or so. I am not sure how to find the alert history in the alert interface.

Just wondering if anybody is seeing something similar or could suggest how to trace the problem further?

Hi @simona1d6!

There are a few ways to troubleshoot something like this. You’ve already stated that each site is on different infra and hosting companies, so it’s probably not an issue close to them. How many probes are you using to monitor each site? If it is only one, there may have been a transit issue early on in the route.

Something else to throw out there: Traceroute checks were just released earlier this week. Those could be used to troubleshoot any potential routing issues.

So here is what happened. I have 3 synthetic monitors pointing at each site:

Work - London, North California, Sydney
Home - Amsterdam, North California, Sydney

Interval is 5 minutes

At 15:25 on the 18th (UTC) four out of the six went down for a single check interval.

Amsterdam → Home
Sydney → Home
London → Work
North California → Work

While these stayed up:

Sydney → Work
North California → Home
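
Tallying the six results by probe and by target makes the pattern easier to see. This is only a rough sketch in Python using the results listed above; the `tally` helper is a made-up name for illustration and not part of any Synthetics tooling.

```python
from collections import defaultdict

# (probe, target, succeeded) for the 15:25 UTC check on the 18th,
# copied from the lists above.
results = [
    ("Amsterdam", "Home", False),
    ("Sydney", "Home", False),
    ("London", "Work", False),
    ("North California", "Work", False),
    ("Sydney", "Work", True),
    ("North California", "Home", True),
]


def tally(rows, key_index):
    """Count up/down results grouped by probe (index 0) or target (index 1)."""
    counts = defaultdict(lambda: {"up": 0, "down": 0})
    for row in rows:
        counts[row[key_index]]["up" if row[2] else "down"] += 1
    return dict(counts)


by_probe = tally(results, 0)
by_target = tally(results, 1)

for name, count in by_probe.items():
    print(f"probe  {name}: {count}")
for name, count in by_target.items():
    print(f"target {name}: {count}")
```

Sydney and North California each show one failure and one success, and both sites show failures alongside a success, so neither a single probe nor a single site accounts for all four failures.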

I am wondering if perhaps there was a reboot or something at some of the Synthetic endpoints?

I also need to tune my alerts so they don’t go off on a single error. I think I set them that way for testing.
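
The idea is roughly to alert only after several failed checks in a row. A minimal sketch in Python, assuming check results arrive as a list of booleans with the newest last; `should_alert` and `min_consecutive_failures` are hypothetical names for illustration and not the actual alert configuration.

```python
def should_alert(check_results, min_consecutive_failures=2):
    """Return True only if the most recent checks all failed.

    check_results: list of booleans, oldest first, True meaning the check passed.
    min_consecutive_failures: how many failures in a row are needed to alert.
    """
    if min_consecutive_failures < 1:
        raise ValueError("min_consecutive_failures must be at least 1")
    recent = check_results[-min_consecutive_failures:]
    # Not enough history yet, or at least one recent check passed: no alert.
    return len(recent) == min_consecutive_failures and not any(recent)


# A single failed interval, like the one at 15:25, would not alert.
print(should_alert([True, True, False]))
# Two failed intervals in a row would.
print(should_alert([True, False, False]))
```

With a 5 minute interval, requiring two consecutive failures means an outage takes up to about 10 minutes to alert, which is the trade-off for less noise.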

Hiya @simona1d6

I don’t see that we had any issues with those probes at that time. I think one of the factors @itsjustjoe mentioned is probably more likely.

This is happening to us consistently. Every so often we will get an alert that will be resolved after exactly 5 minutes (which is unrelated to any of the query parameters we are using).

We don’t need multiple geographic locations, so we are mostly using Frankfurt, with multiple alerts on multiple targets hosted on separate infrastructure.

When looking at the data backing the alert after the alert closes, there does not seem to be any downtime at all. It seems like there might be some kind of data delay?

This makes the alerts too noisy for us to use right now :confused: