Alert after N periods
In order to avoid false alarms, it would be great to have a per-check variable to define how many failed pings to wait before alerting.
We already run two double checks before notifying, which is enough to eliminate false positives. If you want to ignore real but short downtimes, you should increase the check period.
Scott Wheeler commented
Just jumping in as a company that's currently evaluating updown, and is switching away from Pingdom. We have on-server monitoring systems that try to restart things when there's a problem. Our preferred behavior is getting email alerts on single failures, but waiting 3 minutes (i.e. three checks) before getting SMSes. Basically, we treat text alerts as "Stop what you're doing, or wake up and fix this..." vs. the lower priority notifications being something that lets us keep an eye on general app health. A drop down next to the different notification types with a threshold for each notification is something we'd consider useful.
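To make the request above concrete, here is a minimal sketch of per-channel thresholds, assuming each notification type fires once when the number of consecutive failed checks reaches its threshold. The names `THRESHOLDS` and `channels_to_notify` are hypothetical and are not part of updown's settings or API.

```python
# Hypothetical per-channel thresholds: how many consecutive failed checks
# must occur before that channel is notified (not an actual updown setting).
THRESHOLDS = {"email": 1, "sms": 3}


def channels_to_notify(consecutive_failures, thresholds=THRESHOLDS):
    """Return the channels whose threshold is reached exactly at this failure count."""
    return sorted(
        channel
        for channel, needed in thresholds.items()
        if needed == consecutive_failures
    )


print(channels_to_notify(1))  # ['email']
print(channels_to_notify(2))  # []
print(channels_to_notify(3))  # ['sms']
```

Matching on equality rather than "at least" means each channel is notified once per outage instead of on every subsequent failed check.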
You're totally right! That's why our double checks are not performed at the same time. When your check frequency is 30s, after a first down check, the two double checks will be performed at t+15s (from another server) and t+30s (from any server). This way you're sure to have a downtime confirmed from at least two locations and lasting at least 30s. That's why I said that if you want to be more tolerant of short downtimes (say 2 min), you have to set a check frequency of at least 2 min. And that will also save you a lot of money ;)
We think it's our job to provide the fastest and most accurate monitoring, and that's why we don't allow you to tweak the settings but instead give you the best setup, so you don't have to worry about this.
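The timing described above can be sketched roughly as follows. This is only an illustration: the reply gives the 30s case (t+15s and t+30s), and the generalization to half a period and one full period for other frequencies is an assumption, as are the names `ScheduledCheck`, `confirmation_schedule` and `should_alert`.

```python
from dataclasses import dataclass


@dataclass
class ScheduledCheck:
    offset_s: float  # seconds after the first failed check
    source: str      # which server is allowed to run this check


def confirmation_schedule(period_s):
    """Follow-up checks after a first failure.

    Assumption: offsets scale with the check period (period/2 and period),
    which matches the 30s example of t+15s and t+30s.
    """
    return [
        ScheduledCheck(period_s / 2, "another server"),
        ScheduledCheck(period_s, "any server"),
    ]


def should_alert(followups_failed):
    """Alert only if both confirmation checks also failed."""
    return len(followups_failed) == 2 and all(followups_failed)


for check in confirmation_schedule(30):
    print(f"t+{check.offset_s:g}s from {check.source}")
```

Under this assumption, a 2-minute check frequency would confirm a failure at t+60s and t+120s, so only downtimes lasting at least 2 minutes would trigger an alert.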
Miguel Medina commented
Thanks Adrien for your quick answer. In my experience, a double check from another location is not the same as waiting for a second or third failed check. The best example I can find is a timeout due to temporarily slow performance. Sometimes apps fail to answer a ping because there's a usage peak at the exact moment of the check. Depending on the SLA, that could be perfectly fine as long as it's OK again after 2 minutes.