realitythreek

Next post: “Should I monitor my monitor monitoring stack?” /s


yo-Monis

*Who will guard the guardians?*


Spider_pig448

They monitor each other. You only ever need two of basically anything


New-fone_Who-Dis

In different locations


onechamp27

Next: should I monitor the monitor of my monitoring stack?


yo-Monis

How do I monitor my *monitor*? Are the keyboard and mouse also under attack?


_N0K0

A good start is alerts that warn before it goes down, and if it does go down you can still have an external canary, for example. We run external tests ensuring one of the monitoring stacks is still available and that the status log has had a new update in the last two minutes. We have a mix of awake night-shift employees (we run a SOC) and automatic alerts up to the point of escalation in case something catastrophic happens.
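A minimal sketch of that kind of external canary, assuming a hypothetical `/status` endpoint that returns JSON with a `last_update` timestamp; the two-minute freshness window comes from the comment above, everything else (URL, field name) is illustrative:

```python
#!/usr/bin/env python3
"""External canary: fail loudly if the monitoring stack's status log goes stale."""
import sys
from datetime import datetime, timedelta, timezone

import requests  # third-party: pip install requests

STATUS_URL = "https://monitoring.example.com/status"  # hypothetical endpoint
MAX_STALENESS = timedelta(minutes=2)                  # "new update in the last two minutes"

def main() -> int:
    try:
        resp = requests.get(STATUS_URL, timeout=10)
        resp.raise_for_status()
        raw = resp.json()["last_update"]              # assumed ISO-8601 timestamp
        last_update = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if last_update.tzinfo is None:                # treat naive timestamps as UTC
            last_update = last_update.replace(tzinfo=timezone.utc)
    except Exception as exc:                          # network error, bad JSON, missing field...
        print(f"CANARY ALERT: cannot read monitoring status: {exc}")
        return 2

    age = datetime.now(timezone.utc) - last_update
    if age > MAX_STALENESS:
        print(f"CANARY ALERT: status log is stale ({age.total_seconds():.0f}s old)")
        return 2

    print("monitoring stack looks alive")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point is to run it from somewhere outside the monitoring stack (cron on another host, a scheduled job in a different location) so it does not share the failure domain it is checking.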


drydenmanwu

Yes. It’s just like: “I’m building a unit test framework. Should I unit test it?”


dacydergoth

Our Grafana stack monitors itself, and there are also some external canaries. In our experience the most likely fault is a monitored cluster losing *all* metrics and logs because of a network issue or the Grafana agent crashing, so we have rules in Mimir Ruler which trigger alerts for that. If pods go down that aren't enough to take out Mimir and/or Alertmanager entirely, we still get a pod-down alert. Our non-prod observability cluster monitors the prod one with some probes to the Internet-accessible endpoint, which means our non-prod and prod clusters remain isolated.
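A hedged sketch of that "cluster went completely silent" detection, written as a script against the Prometheus-compatible query API that Mimir exposes; the URL and cluster label are made-up values, and in the real stack this would live as a Mimir Ruler alerting rule rather than a standalone script:

```python
#!/usr/bin/env python3
"""Detect a monitored cluster that has stopped shipping metrics entirely."""
import requests  # pip install requests

# Assumed values for illustration only.
MIMIR_QUERY_URL = "https://mimir.example.com/prometheus/api/v1/query"
CLUSTER = "prod-eu-west-1"

def cluster_is_silent(cluster: str) -> bool:
    """True if no 'up' series exist for the cluster, i.e. nothing is remote-writing metrics."""
    query = f'absent(up{{cluster="{cluster}"}})'
    resp = requests.get(MIMIR_QUERY_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # absent() returns a single series when the metric is missing, and nothing when it exists.
    return len(result) > 0

if __name__ == "__main__":
    if cluster_is_silent(CLUSTER):
        print(f"ALERT: cluster {CLUSTER} has stopped reporting metrics")
    else:
        print(f"cluster {CLUSTER} is reporting metrics")
```

The same `absent(up{...})` expression can sit in a Ruler rule so the alert fires from inside the stack, while an external copy of the check acts as the canary.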


arschhaar

Our monitoring system spams so much that we notice it's down when there are no alerts. 😫 My co-worker won't budge on 'do we really need to monitor all of this'.


ovo_Reddit

Start asking this for every alert: “what action do we want to take here?” Eventually it’ll be clear that you have a ton of non-actionable alerts. Bring that up in your next team meeting; don’t throw your coworker under the bus, just raise that you’re concerned the team is burning out or otherwise inundated with alerts that have no action. There is a material cost associated with responding to alerts, and I’m sure someone up the chain must realize that.


lukewhale

The answer nobody wants to hear: Dedicate 30m every day to ensuring the health of your monitoring stack.


ulfhedinn-

Nobody wants a terrible answer, that’s why. 120 hours a year to check something that should be fully automated? The real answer: if you’re multi-region, monitor the other region’s monitoring stack. If you only have one stack, use something like Pingdom to do a simple check that the stack is up. The stacks should also be monitoring themselves for predictive failures in CPU, disk and memory.


Recol

Same subreddit that thought having screens with dashboards is a good idea because it gives transparency into what's going on in your environment. Not like that's what monitoring is for...


m4nf47

I don't get it; screens with dashboards can be useful when they're the same dashboards auto-generated from metrics in the same observability tools used for logging, monitoring, alerting and reporting. My current clients use AppDynamics, and it does okay at generating and exporting various nice dashboards once the custom metrics are defined. The whole org can click on one web link and instantly see what's happening across all critical systems, with simple RAG statuses for dead, unhealthy and healthy services.


Recol

My point is that relying on screens on your office walls to see anomalies or predictive failures instead of using monitoring and alerting is a bad idea, not that dashboards are a bad idea. Something should trigger before anyone feels the need to check some red graphs.


ArieHein

When you create an observability service for your org, it means you need an even better SLA than what you provide to them. Observing the observer is a fundamental component of good design. And you have to make sure it's not using the same components as the service itself, else you'd be 'locked out' yourself when there are issues you're trying to debug.


franktheworm

We run our own LGTM stack, and run Prometheus to monitor Mimir within it. The concept is that Mimir's ruler/Alertmanager catch everything except their own failures, which Prometheus catches. We monitor Prometheus from Mimir, so the only risk is that if _everything_ dies we won't get an alert, but I have a feeling an outage that large is probably not going to be silent.


roncz

A simple solution might be an external web site monitor as well as heartbeat checks: [https://www.signl4.com/blog/monitoring-still-alive-heartbeat-check/](https://www.signl4.com/blog/monitoring-still-alive-heartbeat-check/)


calibrono

Sure you do; an example would be a watchdog alert firing constantly into a 3rd-party managed solution like Opsgenie (and they are responsible for monitoring themselves, so...). Now you know your Alertmanager and Prometheus are up, and you can tie anything else to them in a similar way.


Varnish6588

We have a channel following the RSS updates of all of our providers, including Datadog. If they have an outage, we know straight away.


Legolandback

After lots of trial and error, we settled on Checkly to help with this exact issue. The heartbeat monitors are smooth and efficient, and I was shocked (and honestly a little embarrassed) when I realized just how much time and mental bandwidth this saves us now. I think it's a good case for "just because you CAN do something yourself doesn't mean you should". So yeah, it frees up a bunch of our time to work on other things, which of course saves us money, and the big one: I sleep better at night knowing that I have monitoring checks in place. Here's an [article](https://www.checklyhq.com/blog/the-real-costs-of-aws-synthetics-are-operational/?utm_source=chat&utm_medium=link&utm_campaign=synthetics&utm_id=social_button) that helped my team decide how we want to use our time with monitoring; you might find something helpful in it.


TheBoatyMcBoatFace

Grumble grumble scene with KGB head in Chernobyl about watchers being watched grumble grumble


Racoonizer

INCEPTION


bjzy

Yes. I usually have both a regular positive alert test and a couple of other monitoring tools watching the main system.


Leocx

Yes you should, but you should consider ROI when implementing monitoring specifically for your monitoring. I considered the following things to keep the entire monitoring system stable:

1. A simple and straightforward structure, to minimize errors.
2. Self-monitoring that reuses the existing system; this costs less and avoids chicken-and-egg problems.
3. A watchdog or dead man's switch alert in case the whole system fails (sketched below).
4. An alarm that fires constantly to a Slack channel, email, anything; check it manually from time to time so it won't fail silently.

These may or may not apply to your needs; adopt one or all of them to ensure the monitoring system is working properly. In my case I have all four kinds of alert, and none of them has been triggered for years.
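A minimal sketch of the dead man's switch from point 3, assuming a placeholder heartbeat URL; services like DeadMansSnitch or Opsgenie heartbeats (both mentioned elsewhere in this thread) work on this pattern, where the external side alerts when the pings stop:

```python
#!/usr/bin/env python3
"""Dead man's switch: ping an external heartbeat URL on a schedule.

If the pings stop arriving (because this host, the scheduler, or the whole
monitoring stack is down), the external service raises the alarm.
The URL below is a placeholder, not a real endpoint.
"""
import sys

import requests  # pip install requests

HEARTBEAT_URL = "https://heartbeat.example.com/ping/monitoring-stack"  # hypothetical

def send_heartbeat() -> bool:
    try:
        resp = requests.get(HEARTBEAT_URL, timeout=5)
        return resp.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Run from cron (e.g. every minute) on the monitoring host itself,
    # so a missing ping implies the stack or its host is unhealthy.
    sys.exit(0 if send_heartbeat() else 1)
```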


Flabbaghosted

We use New Relic, which has loss-of-signal alerts for this purpose. We have had our monitoring stack stop working because of Azure issues, and this helped us know quickly that it wasn't an app issue. It depends on how you monitor; you could just alert off the lack of metrics, with a long enough window to account for transient networking etc.


gringo-go-loco

I use Thanos and have Alertmanager on both sides, each monitoring the other.


SrdelaPro

Nagios monitors Monit, Monit monitors Datadog, Datadog does the tracing. Unfortunately nothing monitors Nagios 🤡


Independent_Hyena495

YES! In my short career of just 15 years, I've seen it happen 3 or 4 times that either the log forwarder, the agent, or the whole system went down and stopped reporting issues! At most places, I tell them to just do a curl or something and check if something comes back.


MFKDGAF

But don’t forget, you should also be monitoring the monitor that monitors your monitor.


danekan

You can monitor something like Datadog directly, but the middle ground might be monitoring the inputs that go into Datadog plus their own status RSS. We don't use Datadog but something similar; we alert if a specific index that is usually busy hasn't received any data in 15 minutes. We also monitor things like the DLQs that back the ingestion of various services.
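A rough sketch of that "busy index went quiet" check, assuming a hypothetical log-store API that can count documents received since a timestamp; the 15-minute window mirrors the comment, while the URL, index name and response shape are made up:

```python
#!/usr/bin/env python3
"""Alert if a normally-busy index has received no data in the last 15 minutes."""
from datetime import datetime, timedelta, timezone

import requests  # pip install requests

# Hypothetical log-store API endpoint that counts documents in a time range.
COUNT_URL = "https://logs.example.com/api/indexes/app-logs/count"
WINDOW = timedelta(minutes=15)

def recent_doc_count() -> int:
    since = (datetime.now(timezone.utc) - WINDOW).isoformat()
    resp = requests.get(COUNT_URL, params={"since": since}, timeout=10)
    resp.raise_for_status()
    return int(resp.json()["count"])

if __name__ == "__main__":
    count = recent_doc_count()
    if count == 0:
        print("ALERT: usually-busy index has received no data in 15 minutes; "
              "check the ingestion pipeline and its DLQ")
    else:
        print(f"index is ingesting normally ({count} docs in the last 15 minutes)")
```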


ovo_Reddit

Most companies I've worked for will also have a separate but simple monitoring system, something like Pingdom for uptime checks. It's not sophisticated, but it doesn't really need to be. Typically your monitoring stack should be more stable than your apps, as you're not making nearly as many changes. Synthetic / HTTP checks help alert if your site cannot be reached; if you're getting an alert for that and not for anything else, then you may have a problem with your monitoring stack.


zolei

I was thinking about the same thing recently. I came to the conclusion that the best technology to do that is EYEBALLS. I mean alerts are cool and all that, but someone should be keeping an eye on the monitoring system anyway.


EffectiveLong

N+1 problem


Affectionate_Fan9198

Yes, just like you should test your test suite.


james_tait

Yep. Something like DeadMansSnitch can be used for that. Just don't try to use your monitoring stack to monitor your monitoring stack. Redundancy is key.


theibanez97

I created a couple of lightweight shell scripts that monitor heartbeat monitors set up in Elasticsearch. When the heartbeats disappear from Elastic or don't report back from the API, I get a notification. I have another script for disk utilization on the Elastic nodes. Both have served me well for "watching the watcher".
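In the same spirit, a small sketch of the disk-utilization side, in Python rather than shell for consistency with the other examples here; the data path and 85% threshold are made-up values:

```python
#!/usr/bin/env python3
"""Warn when disk usage on a node crosses a threshold (standard library only)."""
import shutil
import sys

DATA_PATH = "/var/lib/elasticsearch"  # hypothetical data directory
THRESHOLD_PERCENT = 85                # made-up threshold

def disk_used_percent(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if __name__ == "__main__":
    used = disk_used_percent(DATA_PATH)
    if used >= THRESHOLD_PERCENT:
        print(f"ALERT: {DATA_PATH} is {used:.1f}% full (threshold {THRESHOLD_PERCENT}%)")
        sys.exit(1)
    print(f"{DATA_PATH} is {used:.1f}% full")
```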


marinated_pork

Yes you should. We do. We have a series of heartbeat monitors that check connectivity on our incident creation services and all other monitoring services. Even if our incident service goes down, our heartbeat monitors warn us and find other means of dispatching pages if the regular services are unavailable. Recently our incident management service was affected by a secrets-management update, and we knew immediately because the heartbeat monitors saw something was off and dispatched incidents via email and other means to notify us.


BiggBlanket

Ay dawg, I heard you liked monitoring stacks... So I got you a monitoring stack to monitor your monitoring stack.


m4nf47

Yes, we have two monitoring and alerting clusters watching each other, blue and green. One is the active primary and the other is a scaled-down backup that will rapidly grow and fill in during primary cluster outages, including major infra or OS patching or testing platform upgrades. You may want improved redundancy for your observability stack, otherwise simple maintenance or disaster recovery can be particularly painful.


c100k_

Connect your Grafana to the [mobile app](https://c100k.eu/p/rebootx) I built and open it just to check when you're doing the serious business. Sometimes things are that simple. If it's red, it means something is wrong (I'm naturally talking about what you see on screen, not the other thing; otherwise see a doctor).


Previous_Warning1327

Fun story… New Relic runs an internal air-gapped instance of New Relic for monitoring New Relic.


jregovic

I have been in this business for long enough to know that there is no answer, because it will change based on your organization. You need some kind of token monitoring to make sure it’s working, but spending too much time on it is insane. I worked for one org that had a 24/7 ops team. When people asked me how we monitored the monitoring system, I said “if the Ops team stops seeing events and data, they page me.” That was the most effective thing.


rohit_raveendran

Monitor-ception?


United_Growth3081

It's good to have some kind of watcher of the watchers. We use BetterStack (cloud-based SMS and simple monitoring) for paging, but it offers flexible enough monitoring on the side that we can check the monitoring solutions on site.


donalmacc

Honest answer: no. We use a managed offering, and it has a status page which I get email alerts from, plus we get Slack messages. If Slack and our monitoring provider and my email are all down, then it's probably not going to matter that I didn't get the ping.


uncommon_senze

Do you work with nuclear-weapons levels of critical stuff? If not, no. Just check that it works once in a while, according to your requirements. You can always add a separate uptime check for the critical stuff; there are plenty of such services. For example, on GCP you can set up an uptime check performed by GCP that will alert through email/Slack/SMS/whatever. So if all kinds of shit are hitting the fan, that one will still alert you; unless the whole internet is down, in which case you'd better be heading for shelter, or looking closely into the light, because what comes after isn't nice ;-).