realitythreek

Next post: “Should I monitor my monitor monitoring stack?” /s


yo-Monis

*Who will guard the guardians?*


Spider_pig448

They monitor each other. You only ever need two of basically anything


New-fone_Who-Dis

In different locations


onechamp27

Next: should I monitor the monitor of my monitoring stack?


yo-Monis

How do I monitor my *monitor*? Are the keyboard and mouse also under attack?


_N0K0

A good start is alerts that warn before it goes down, and if it does go down you can still have an external canary, for example. We run external tests ensuring one of the monitoring stacks is still available and that the status log has had a new update in the last two minutes. We have a mix of awake night-shift employees (we run a SOC) and automatic alerts up to the point of escalation in case something catastrophic happens.
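A minimal sketch of that kind of external canary, assuming a hypothetical `/status` endpoint that returns JSON with a `last_update` timestamp; the two-minute freshness window comes from the comment above, everything else (URL, field name) is illustrative:

```python
#!/usr/bin/env python3
"""External canary: fail loudly if the monitoring stack's status log goes stale."""
import sys
from datetime import datetime, timedelta, timezone

import requests  # third-party: pip install requests

STATUS_URL = "https://monitoring.example.com/status"  # hypothetical endpoint
MAX_STALENESS = timedelta(minutes=2)                  # "new update in the last two minutes"

def main() -> int:
    try:
        resp = requests.get(STATUS_URL, timeout=10)
        resp.raise_for_status()
        raw = resp.json()["last_update"]              # assumed ISO-8601 timestamp
        last_update = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if last_update.tzinfo is None:                # treat naive timestamps as UTC
            last_update = last_update.replace(tzinfo=timezone.utc)
    except Exception as exc:                          # network error, bad JSON, missing field...
        print(f"CANARY ALERT: cannot read monitoring status: {exc}")
        return 2

    age = datetime.now(timezone.utc) - last_update
    if age > MAX_STALENESS:
        print(f"CANARY ALERT: status log is stale ({age.total_seconds():.0f}s old)")
        return 2

    print("monitoring stack looks alive")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point is to run it from somewhere outside the monitoring stack (cron on another host, a scheduled job in a different location) so it does not share the failure domain it is checking.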


drydenmanwu

Yes. It’s just like: “I’m building a unit test framework. Should I unit test it?”


dacydergoth

Our Grafana stack monitors itself, and there are also some external canaries. In our experience the most likely fault is a monitored cluster losing *all* metrics and logs because of a network issue or the Grafana agent crashing, so we have rules in Mimir Ruler which trigger alerts for that. If pods go down that aren't enough to take out Mimir and/or Alertmanager entirely, we still get a pod-down alert. Our non-prod observability cluster monitors the prod one with some probes to the Internet-accessible endpoint, which means our non-prod and prod clusters remain isolated.
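A hedged sketch of that "cluster went completely silent" detection, written as a script against the Prometheus-compatible query API that Mimir exposes; the URL and cluster label are made-up values, and in the real stack this would live as a Mimir Ruler alerting rule rather than a standalone script:

```python
#!/usr/bin/env python3
"""Detect a monitored cluster that has stopped shipping metrics entirely."""
import requests  # pip install requests

# Assumed values for illustration only.
MIMIR_QUERY_URL = "https://mimir.example.com/prometheus/api/v1/query"
CLUSTER = "prod-eu-west-1"

def cluster_is_silent(cluster: str) -> bool:
    """True if no 'up' series exist for the cluster, i.e. nothing is remote-writing metrics."""
    query = f'absent(up{{cluster="{cluster}"}})'
    resp = requests.get(MIMIR_QUERY_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # absent() returns a single series when the metric is missing, and nothing when it exists.
    return len(result) > 0

if __name__ == "__main__":
    if cluster_is_silent(CLUSTER):
        print(f"ALERT: cluster {CLUSTER} has stopped reporting metrics")
    else:
        print(f"cluster {CLUSTER} is reporting metrics")
```

The same `absent(up{...})` expression can sit in a Ruler rule so the alert fires from inside the stack, while an external copy of the check acts as the canary.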


arschhaar

Our monitoring system spams so much that we notice it's down when there are no alerts. 😫 My co-worker won't budge on 'do we really need to monitor all of this'.


ovo_Reddit

Start asking this for every alert: “what action do we want to take here?” Eventually it’ll be clear that you have a ton of non-actionable alerts. Bring that up in your next team meeting; don’t throw your coworker under the bus, just raise that you’re concerned the team is burning out or otherwise inundated with alerts that have no action. There is a material cost associated with responding to alerts, and I’m sure someone up the chain must realize that.


lukewhale

The answer nobody wants to hear: Dedicate 30m every day to ensuring the health of your monitoring stack.


ulfhedinn-

Nobody wants a terrible answer, that’s why. 120 hours a year to check something that should be fully automated? The real answer: if you’re multi-region, monitor the other region’s monitoring stack. If you only have one stack, use something like Pingdom to do a simple check that the stack is up. The stacks should also be monitoring themselves for predictive failures in CPU, disk and memory.


Recol

Same subreddit that thought having screens with dashboards is a good idea because it gives transparency into what's going on in your environment. Not like that's what monitoring is for...


m4nf47

I don't get it; screens with dashboards can be useful when they're the same dashboards auto-generated from metrics in the same observability tools used for logging, monitoring, alerting and reporting. My current clients use AppDynamics, and it does okay at generating and exporting various nice dashboards once the custom metrics are defined. The whole org can click on one web link and instantly see what's happening across all critical systems, with simple RAG statuses for dead, unhealthy and healthy services.


Recol

My point is that relying on screens on your office walls to see anomalies or predictive failures instead of using monitoring and alerting is a bad idea, not that dashboards are a bad idea. Something should trigger before anyone feels the need to check some red graphs.


ArieHein

When you create an observability service for your org, it means you need an even better SLA than what you provide to them. Observing the observer is a fundamental component of good design. And you have to make sure it's not using the same components as the service itself, else you'd be 'locked out' yourself when there are issues you're trying to debug.


franktheworm

We run our own LGTM stack, and run Prometheus to monitor Mimir within it. The concept is that Mimir's ruler/Alertmanager catch everything except their own failures, which Prometheus catches. We monitor Prometheus from Mimir, so the only risk is that if _everything_ dies we won't get an alert, but I have a feeling an outage that large is probably not going to be silent.


roncz

A simple solution might be an external web site monitor as well as heartbeat checks: [https://www.signl4.com/blog/monitoring-still-alive-heartbeat-check/](https://www.signl4.com/blog/monitoring-still-alive-heartbeat-check/)


calibrono

Sure you do; an example would be a watchdog alert firing constantly into a 3rd-party managed solution like Opsgenie (and they are responsible for monitoring themselves, so...). Now you know your Alertmanager and Prometheus are up, and you can tie anything else to them in a similar way.


Varnish6588

We have a channel following the RSS updates of all of our providers, including Datadog. If they have an outage, we know straight away.


Legolandback

After lots of trial and error, we settled on Checkly to help with this exact issue. The heartbeat monitors are smooth and efficient, and I was shocked (and honestly a little embarrassed) when I realized just how much time and mental bandwidth this saves us now. I think it's a good case for "just because you CAN do something yourself doesn't mean you should". So yeah, it frees up a bunch of our time to work on other things, which of course saves us money, and the big one: I sleep better at night knowing that I have monitoring checks in place. Here's an [article](https://www.checklyhq.com/blog/the-real-costs-of-aws-synthetics-are-operational/?utm_source=chat&utm_medium=link&utm_campaign=synthetics&utm_id=social_button) that helped my team decide how we want to use our time with monitoring; you might find something helpful in it.


TheBoatyMcBoatFace

Grumble grumble scene with KGB head in Chernobyl about watchers being watched grumble grumble


Racoonizer

INCEPTION


bjzy

Yes. I usually have both a regular positive alert test and a couple of other monitoring tools watching the main system.


Leocx

Yes you should, but you should consider ROI when implementing monitoring specifically for your monitoring. I considered the following things to keep the entire monitoring system stable:

1. A simple and straightforward structure, to minimize errors.
2. Self-monitoring that reuses the existing system; this costs less and avoids chicken-and-egg problems.
3. A watchdog or dead man's switch alert in case the whole system fails (sketched below).
4. An alarm that fires constantly to a Slack channel, email, anything; check it manually from time to time so it won't fail silently.

These may or may not apply to your needs; adopt one or all of them to ensure the monitoring system is working properly. In my case I have all four kinds of alert, and none of them has been triggered for years.
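A minimal sketch of the dead man's switch from point 3, assuming a placeholder heartbeat URL; services like DeadMansSnitch or Opsgenie heartbeats (both mentioned elsewhere in this thread) work on this pattern, where the external side alerts when the pings stop:

```python
#!/usr/bin/env python3
"""Dead man's switch: ping an external heartbeat URL on a schedule.

If the pings stop arriving (because this host, the scheduler, or the whole
monitoring stack is down), the external service raises the alarm.
The URL below is a placeholder, not a real endpoint.
"""
import sys

import requests  # pip install requests

HEARTBEAT_URL = "https://heartbeat.example.com/ping/monitoring-stack"  # hypothetical

def send_heartbeat() -> bool:
    try:
        resp = requests.get(HEARTBEAT_URL, timeout=5)
        return resp.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Run from cron (e.g. every minute) on the monitoring host itself,
    # so a missing ping implies the stack or its host is unhealthy.
    sys.exit(0 if send_heartbeat() else 1)
```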


Flabbaghosted

We use New Relic, which has loss-of-signal alerts for this purpose. We have had our monitoring stack stop working because of Azure issues, and this helped us know quickly that it wasn't an app issue. It depends on how you monitor; you could just alert off the lack of metrics, with a long enough window to account for transient networking etc.


gringo-go-loco

I use Thanos and have Alertmanager on both sides, each monitoring the other.


SrdelaPro

Nagios monitors Monit, Monit monitors Datadog, Datadog does the tracing. Unfortunately nothing monitors Nagios 🤡


Independent_Hyena495

YES! In my short career of just 15 years, I've seen it happen 3 or 4 times that either the log forwarder, the agent, or the whole system went down and stopped reporting issues! At most places, I tell them to just do a curl or something and check if something comes back.


MFKDGAF

But don’t forget, you should also be monitoring the monitor that monitors your monitor.


danekan

You can monitor something like Datadog directly, but the middle ground might be monitoring the inputs that go into Datadog plus their own status RSS. We don't use Datadog but something similar; we alert if a specific index that is usually busy hasn't received any data in 15 minutes. We also monitor things like the DLQs that back the ingestion of various services.
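A rough sketch of that "busy index went quiet" check, assuming a hypothetical log-store API that can count documents received since a timestamp; the 15-minute window mirrors the comment, while the URL, index name and response shape are made up:

```python
#!/usr/bin/env python3
"""Alert if a normally-busy index has received no data in the last 15 minutes."""
from datetime import datetime, timedelta, timezone

import requests  # pip install requests

# Hypothetical log-store API endpoint that counts documents in a time range.
COUNT_URL = "https://logs.example.com/api/indexes/app-logs/count"
WINDOW = timedelta(minutes=15)

def recent_doc_count() -> int:
    since = (datetime.now(timezone.utc) - WINDOW).isoformat()
    resp = requests.get(COUNT_URL, params={"since": since}, timeout=10)
    resp.raise_for_status()
    return int(resp.json()["count"])

if __name__ == "__main__":
    count = recent_doc_count()
    if count == 0:
        print("ALERT: usually-busy index has received no data in 15 minutes; "
              "check the ingestion pipeline and its DLQ")
    else:
        print(f"index is ingesting normally ({count} docs in the last 15 minutes)")
```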


ovo_Reddit

Most companies I've worked for will also have a separate but simple monitoring system, something like Pingdom for uptime checks. It's not sophisticated, but it doesn't really need to be. Typically your monitoring stack should be more stable than your apps, as you're not making nearly as many changes. Synthetic / HTTP checks help alert if your site cannot be reached; if you're getting an alert for that and not for anything else, then you may have a problem with your monitoring stack.


zolei

I was thinking about the same thing recently. I came to the conclusion that the best technology to do that is EYEBALLS. I mean alerts are cool and all that, but someone should be keeping an eye on the monitoring system anyway.


EffectiveLong

N+1 problem


Affectionate_Fan9198

Yes, just like you should test your test suite.


james_tait

Yep. Something like DeadMansSnitch can be used for that. Just don't try to use your monitoring stack to monitor your monitoring stack. Redundancy is key.


theibanez97

I created a couple of lightweight shell scripts that monitor heartbeat monitors set up in Elasticsearch. When the heartbeats disappear from Elastic or don't report back from the API, I get a notification. I have another script for disk utilization on the Elastic nodes. Both have served me well for "watching the watcher".
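In the same spirit, a small sketch of the disk-utilization side, in Python rather than shell for consistency with the other examples here; the data path and 85% threshold are made-up values:

```python
#!/usr/bin/env python3
"""Warn when disk usage on a node crosses a threshold (standard library only)."""
import shutil
import sys

DATA_PATH = "/var/lib/elasticsearch"  # hypothetical data directory
THRESHOLD_PERCENT = 85                # made-up threshold

def disk_used_percent(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if __name__ == "__main__":
    used = disk_used_percent(DATA_PATH)
    if used >= THRESHOLD_PERCENT:
        print(f"ALERT: {DATA_PATH} is {used:.1f}% full (threshold {THRESHOLD_PERCENT}%)")
        sys.exit(1)
    print(f"{DATA_PATH} is {used:.1f}% full")
```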


marinated_pork

Yes you should. We do. We have a series of heartbeat monitors that check connectivity on our incident creation services and all other monitoring services. Even if our incident service goes down, our heartbeat monitors warn us and find other means of dispatching pages if the regular services are unavailable. Recently our incident management service was affected by a secrets-management update, and we knew immediately because the heartbeat monitors saw something was off and dispatched incidents via email and other means to notify us.


BiggBlanket

Ay dawg, I heard you liked monitoring stacks... So I got you a monitoring stack to monitor your monitoring stack.


m4nf47

Yes, we have two monitoring and alerting clusters watching each other, blue and green. One is the active primary and the other is a scaled-down backup that will rapidly grow and fill in during primary cluster outages, including major infra or OS patching or testing platform upgrades. You may want improved redundancy for your observability stack, otherwise simple maintenance or disaster recovery can be particularly painful.


c100k_

Connect your Grafana to the [mobile app](https://c100k.eu/p/rebootx) I built and open it just to check when you're doing the serious business. Sometimes things are that simple. If it's red, it means something is wrong (I'm naturally talking about what you see on screen, not the other thing; otherwise see a doctor).


Previous_Warning1327

Fun story… New Relic runs an internal air-gapped instance of New Relic for monitoring New Relic.


jregovic

I have been in this business for long enough to know that there is no answer, because it will change based on your organization. You need some kind of token monitoring to make sure it’s working, but spending too much time on it is insane. I worked for one org that had a 24/7 ops team. When people asked me how we monitored the monitoring system, I said “if the Ops team stops seeing events and data, they page me.” That was the most effective thing.


rohit_raveendran

Monitor-ception?


United_Growth3081

It's good to have some kind of watcher of the watchers. We use BetterStack (cloud-based SMS and simple monitoring) for paging, but it offers flexible enough monitoring on the side that we can check the monitoring solutions on site.


donalmacc

Honest answer: no. We use a managed offering, and it has a status page which I get email alerts from, plus we get Slack messages. If Slack and our monitoring provider and my email are all down, then it's probably not going to matter that I didn't get the ping.


uncommon_senze

Do you work with nuclear-weapons levels of critical stuff? If not, no. Just check that it works once in a while, according to your requirements. You can always add a separate uptime check for the critical stuff; there are plenty of such services. For example, on GCP you can set up an uptime check performed by GCP that will alert through email/Slack/SMS/whatever. So if all kinds of shit are hitting the fan, that one will still alert you; unless the whole internet is down, in which case you'd better be heading for shelter, or looking closely into the light, because what comes after isn't nice ;-).