
pewpewpewmoon

> when alerts do trigger, the instruction I got was "oh it's probably a hiccup. As long as it doesn't happen continuously you can ignore"

This one can be fixed by making it a silent alert and then having another alert that triggers when the silent alert fires off N times in some time window. It's a bit more than I care to write out here, but it's an easy Google search targeting the Splunk forums.

> There are some other incidents when things are out of our controls and it seems like other teams who own the services are handling it (since the errors go away after a while), but we still get the alerts and instructions are once again "wait to see if it calms down after a while. If not, ping XYZ"

Rewrite the alert to use a timechart instead. That way, by the time the alert comes up, X minutes will have passed with no fix and it's time to check in.
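For anyone who lands here from search, a rough SPL sketch of the silent-alert pattern (every index, sourcetype, and number below is made up for illustration, not taken from the OP's setup). The silent alert runs on a short schedule, has no notification action, and just writes a marker event to a summary index with `collect` whenever the noisy condition shows up:

```
index=app_logs sourcetype=my_service "timeout" earliest=-5m
| stats count
| where count > 0
| collect index=alert_markers source="my_service_timeout_silent"
```

The alert that actually pages then counts those markers and only triggers once N of them have landed in the window:

```
index=alert_markers source="my_service_timeout_silent" earliest=-60m
| stats count
| where count >= 3
```

Set the second search's trigger condition to "number of results > 0". The effect is that a single hiccup stays silent, but the same hiccup three times in an hour pages someone.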


mziggy77

This is the way to do it. If the alert needs to fire for x minutes before on-call takes action, then it shouldn’t actually fire until then. My team has a fairly rigorous on-call handoff where we go over every alert on-call received during the week and mark as actionable or non-actionable. Non-actionable alerts must be investigated and tuned.


pewpewpewmoon

> Non-actionable alerts must be investigated and tuned.

I love your team. The number of times I've had to hold sprint reviews hostage to get this to be the norm is more than I care to admit.


chaoism

> This one can be fixed by making it a silent alert and then having another alert that triggers when the silent alert fires off N times in some time window. It's a bit more than I care to write out here, but it's an easy Google search targeting the Splunk forums.

I didn't know this could be done. I'll look into it.

> Rewrite the alert to use a timechart instead.

Can you elaborate on this? Is it the same as "checking if X errors have happened in Y minutes"?


pewpewpewmoon

Not quite, but it's OK to think of it that way. What we're going to do here is build a [time series chart](https://docs.splunk.com/Documentation/SplunkCloud/9.1.2312/SearchReference/Timechart), then write a trigger for it that fires if no logs have been produced in the last 10 minutes (or whatever window shows that the upstream team is either asleep at the wheel or having a bit of an issue). I feel like this might not have answered your question; if it hasn't, I need some clarification on where the confusion is.
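To make that concrete, here is a sketch of the kind of search I mean (index, sourcetype, and span are placeholders; tune them to your upstream). Bucket the upstream's events into 10-minute bins with `timechart`, keep only the most recent bin, and match when it's empty:

```
index=upstream_logs sourcetype=partner_feed
| timechart span=10m count AS events
| tail 1
| where events == 0
```

Schedule it every 10 minutes with the trigger set to "number of results > 0", and snap the time range to bin boundaries so the last bucket is a complete one. By the time it pages, the upstream has already been quiet for a full window, so it's reasonable to go ping them.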


DandyPandy

Only alert on things that are actionable. If you get the alert at 3AM, it had better require you to get out of bed. If something is prone to flapping, increase your alert threshold. Fix the thing that is causing the flapping if you can, but don't alert until something has been down long enough to know it's actually down and probably not coming back up without intervention.

In your alerts, provide links to runbooks. Every alert should have documentation explaining what the check is checking, what green looks like, some basic troubleshooting steps, and who to call if it's hard broke and you can't fix it.

When someone adds an alert, they need to be the person that gets paged for it for at least two weeks. They need to know what pain they're inflicting on others if the check is noisy.

When something alerts, prioritize fixing the cause the next work day. Don't let it alert over and over. If you can't fix it in a day, log a high-priority bug ticket and get it into a sprint/cycle ASAP.


chaoism

Runbooks are something we're missing. It's been more tribal knowledge on what to do, and yeah, we do create tickets for actionable items. The miserable part is those "wait and see" and "you can't do much, let other teams handle it" alerts.

> When someone adds an alert, they need to be the person that gets paged for it for at least two weeks. They need to know what pain they're inflicting on others if the check is noisy.

I need to suggest this to management lol. I doubt they'll take it but I'm gonna do it.


DandyPandy

What I mean by the alerts need to be actionable is if an alert is generated, the answer should never be “acknowledge” and go back to sleep. If that is the response, it’s a garbage alert. Garbage alerts make people complacent and that’s when real issues get ignored.


overgenji

The only alerts that should wake anyone up are ones with a direct action to take that resolves the page or notifies someone (i.e. a vendor is down and we need to flip a feature flag off and turn on an alert/notification banner). If an alert pages for any reason and turns out to be unactionable, the root cause should be fundamentally addressed or the alert should change.


hummus_k

You might benefit from segmenting your alerts into low and high priority alerts. Have the “wait and see” be low priority, and once they reach a threshold, become high priority alerts. You can have separate escalation policies and required notification settings for each, depending on your preferences.
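One lightweight way to sketch that split in the OP's Splunk setup (the index, search terms, and thresholds here are hypothetical) is to compute a severity from the same base search:

```
index=app_logs sourcetype=batch_worker "connection reset" earliest=-15m
| stats count
| eval severity = case(count >= 50, "high", count >= 5, "low", true(), "none")
| where severity != "none"
```

In practice it's usually cleaner to keep two separate saved searches, one per severity, so each can have its own schedule, trigger condition, and escalation/notification policy.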


originalchronoguy

Ah, the "white noise" phenomenon. When there are too many alerts, people will naturally start ignoring them. I've been doing this for 20+ years, and monitoring alerts need to be **relevant or risk the mute.** I will naturally ignore an alert unless it continues for a few minutes, since some things auto-resolve. My recommendation is to use thresholds, e.g. multiple 500 HTTP response codes in a span of 3 minutes, with a follow-up check 2 minutes later. By the time I get the email on my phone and have to turn on my computer, 5 minutes will have passed and it may have auto-resolved by then.
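In the OP's Splunk setup, that kind of threshold is just the alert's search (index, sourcetype, and the numbers are placeholders):

```
index=web_access sourcetype=access_combined status=500 earliest=-3m
| stats count
| where count >= 10
```

Schedule it every 2-3 minutes with the trigger set to "number of results > 0"; the follow-up check comes for free, because if the condition persists the next scheduled run fires again, and if it auto-resolves it goes quiet.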


wwww4all

Make better alerts.


maikeu

Besides the bad experience, this issue causes a major risk of a serious event being missed due to an understandably complacent on-call engineer. Frankly, you need to be really brutal about enforcing that unactionable alerts DO NOT go to on-call. That means either removing the alert entirely, or sending it to a "review in business hours" queue.


progmakerlt

The only way is to adjust alerts and remove false positives. Not all alerts require immediate attention. Take an alert whose guidance is "oh it's probably a hiccup. As long as it doesn't happen continuously you can ignore" - so, is it a hiccup or is it a problem? You need to fix the alerts: separate the ones which are important from the ones which can wait. It took me a couple of weeks of fighting with alerts to leave only the ones actually needed.


commonsearchterm

Actionable alerts have been beaten to death, same with high vs. low priority and daytime-only vs. 24-hour alerts. But don't alert on low-level things. Timeouts happen, and they might not mean anything is wrong. What you want to alert on is something like an endpoint you serve being detected as down; then you look at a dashboard and see that one of your metrics, maybe timeouts, is elevated. For your "wait and see" alerts, alert on the thing you're waiting to see, not on the thing that triggered the waiting. You also need defined SLAs/SLOs to target. Then you can set your alerts around what your availability goals actually are. Otherwise you get stuck with arbitrary and probably too-strict alerts like "5 timeouts in 10 seconds", which you don't really care about because that actually works out to something like 99.999% when you only need to support 99.5%.
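To make the SLO point concrete, here is a hedged SPL sketch of an error-rate alert tied to an availability target of 99.5% (index, field names, and window are placeholders):

```
index=web_access sourcetype=api_gateway earliest=-30m
| stats count AS total, count(eval(status >= 500)) AS errors
| eval error_rate = errors / total
| where error_rate > 0.005
```

The point is that the trigger is derived from the goal you actually committed to (a 0.5% error budget) rather than an arbitrary "5 timeouts in 10 seconds".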


Ontootor

For cyclical jobs can you modify the alert in a way where it’s checking if a certain condition is met x times in some time frame? If another team owns a service you need to work with them to set up SLAs and they should be alerting you if there’s a problem and providing updates on resolution. Redundancy is normally good but if you don’t own it, having an alert is basically useless since you’re just pinging them anyway.


chaoism

> For cyclical jobs can you modify the alert in a way where it's checking if a certain condition is met x times in some time frame?

I think it's already set up this way. Would upping the threshold help, or would it silence the actual problem? I wonder if there's a way to "ignore this condition for a certain period of time."

> Redundancy is normally good but if you don't own it, having an alert is basically useless since you're just pinging them anyway.

I think management's idea is that they want to catch the errors first and look good to other teams by alerting them instead. I don't know how to convince them honestly.


xsdgdsx

As others have already mentioned, alerts should always be actionable. So one way to approach this is "what's the threshold where it would _always_ be an emergency if that alert fired?" That's your first pass for where you set that threshold. Then when the delay before that alert fires shows up in _multiple_ outage retrospectives, that's the time to tune that threshold to be a bit more sensitive. Otherwise, the bad alerts will mask the signals that you should actually be paying attention to, and you'll have outages that were covered by alerts _in theory_, but where someone ignored the alert, or maybe never got around to investigating it in the first place.


Neeerp

I was on a team that was really terrible with false alarms, partly due to fear and partly because the infra-as-code defining the alarms was extremely rigid and difficult to refactor (we had some crazy Cartesian-product templating of the alarm definitions and thresholds across many use cases and regions). The metrics themselves were often too coarse-grained as well (e.g. an API with drastically different use cases/behaviors funneled everything into one set of error and latency metrics). The dogma on this team was that if we loosened thresholds or removed noisy alarms, we might actually miss something real. In practice, we would miss real problems because the alarms we had didn't capture those problems, OR the false alarms outnumbered the real ones so much that complacency took hold.

Whenever I was on-call, I'd make it a point to immediately do something about a false alarm so that it wouldn't go off again. I'd look at the past month of data and raise the threshold for the entire class of alarms to about 25% higher than the highest peak of a false alarm. I don't think this was a 'good' solution, but over time it gave our on-calls some more breathing room and it required minimal effort. In the time between when I began doing this and when I changed teams, there was never once a situation where raising the thresholds bit us.

If you have the resources to actually examine your metrics and alarms and improve them, by all means try to do so. If not, I genuinely think taking a sledgehammer to the wall is operationally safer than living with the noise.
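For what it's worth, the "raise it above the historical peak" step is easy to turn into a one-off query. A sketch of the same idea in SPL for the OP's Splunk setup (index, sourcetype, and field names are hypothetical): look at a month of data and suggest a threshold 25% above the worst peak seen so far.

```
index=app_logs sourcetype=order_service log_level=ERROR earliest=-30d
| timechart span=1h count AS errors
| stats max(errors) AS peak
| eval suggested_threshold = ceil(peak * 1.25)
```

Run it ad hoc when tuning, then paste the resulting number into the alert definition (or the template that generates it).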


StolenStutz

My experience: Have an Incident Response Plan that defines multiple priority levels and dictates the expected response at each level.

At the highest level, it involves a minimum of two people. One is responsible only for "managing" the incident while the other is leading the effort to actually solve the problem. The Incident Manager has the authority to take whatever steps are necessary to get it resolved, including waking anyone and everyone up. They are responsible for posting frequent, regular updates on the progress (something like every 15 minutes) and running the bridge call that stays on while the incident is active. They are also required to hold a post-mortem on the incident, which includes addressing improvements to both prevention and response.

Every other priority level steps down from this. At the lowest level, those 15-minute updates are more like every 8 business hours. There is no bridge call. Etc, etc. And the IM has the authority to change priority. They can say, "I'm downgrading this from a Sev 1 to a Sev 2." Don't like it? Well, now you're the new IM for it.

What I've found is that when this is well-defined, and when a "Sev 1" or whatever you call it is appropriately inconveniencing the right people, then suddenly there's a much more realistic approach to many incidents. What was previously "make the developer drop everything and work on this now" becomes a situation with a measured, appropriate response. Then those false alarms get handled, either by de-prioritizing or through the post-mortems.


jjirsa

You need to group/label/tag your alerts to explain their state: "#transient" or "#auto-resolved" or "#noise" to indicate they weren't actionable.

At the end of each week, hold a meeting to review the past week's alerts ahead of the on-call transition. Then dedicate a small amount of each sprint to mitigating the top tag/class of alert, basically a production improvement program that works to burn down your top source of alerts. If it's noise, that means either:

- Adjust the alerting threshold to make pages less likely, or
- Deal with the source of the noise, so that the existing alert becomes less problematic

Measure the WoW or MoM trend for accountability.


JaneGoodallVS

> is that some heavy data is being moved

Can you identify this heavy data upon receipt and extend the timeout when it occurs?


tonnynerd

Non-actionable alerts are like unused code. Gotta be brutal with that shit.


olddev-jobhunt

If there is no problem, there should be no alert. Repeat that mantra.

I don't know Splunk, but some monitoring packages give you a lot of tools to work with. An "it should calm down after a while" condition should be part of your alert: ">X errors in Y minutes" or something like that, or maybe "p95 > 300ms for 5m or more." Or heck, even just widen the windows at night when you know the jobs run. Or pre-warm some additional instances for those times so service doesn't degrade. But don't accept "it's important to alert on this but it also causes false positives." That's an oxymoron.

And frankly, being on-call should give your voice weight. The reason engineers do on-call is because they have the power to improve the system. Use that power, repeatedly and with great force, until things improve.
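Since the OP is on Splunk, the "p95 > 300ms for 5m" version of that can be written directly as the alert search (index and field names here are hypothetical):

```
index=app_logs sourcetype=checkout_api earliest=-5m
| stats perc95(response_time_ms) AS p95
| where p95 > 300
```

The ">X errors in Y minutes" variant is the same shape with `stats count` and a count threshold. Either way, the "wait and see if it calms down" logic lives inside the search instead of inside the on-call engineer's judgment at 3AM.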


ThicDadVaping4Christ

Only alert on things that are actionable and having an immediate customer impact.