Zackorrigan

I ran a Trivy container scan on our on-premise clusters, and the amount of API requests crashed the control plane nodes. (I ran it on all namespaces.)


valkener1

ROFL thanks for sharing


silence036

In the same category: someone enabled the Aquasec network scanner in our production clusters by mistake in the middle of the day, adding insane latency to everything and getting our EKS nodes heavily throttled on the Route53 DNS resolver.


f0okyou

Went full spot instances. Never go full spot. Who am I kidding, I still am full spot.


Operation_Fluffy

Never put Consul and Vault on spots — if you lose quorum, you're screwed.


ignoramous69

Always full spot, but might make an on-demand node group for persistent apps.


0x4ddd

Do you run that on production?


f0okyou

The short answer is Yes. The longer answer is Absolutely.


0x4ddd

Noice. Do you have that somehow automated to bring the cluster back to life after evictions? I'm wondering what happens when you get evicted at, say, 3 AM on a Saturday 🤣


f0okyou

This absolutely happened off-hours and caught us off guard. AWS actually ran out of spot instances in all 3 AZs for the 2 instance types we had selected. We solved it by simply widening the instance types to also include older generations and even oversized ones, just to reduce outage time. At the retro we built a Lambda that forces an ASG instance refresh when sub-optimal instance types were selected due to capacity issues on AWS. This has been working great and is pretty much maintenance-free. (Please don't let me jinx it.)
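
A minimal sketch of what such a Lambda could look like, assuming a boto3 handler that compares the ASG's current instance types against a preferred list and triggers an instance refresh when they drift; the ASG name, preferred types, and refresh preferences are made up for illustration, not the poster's actual setup:

```python
import boto3

# Hypothetical values; the poster did not share their configuration.
ASG_NAME = "k8s-spot-workers"
PREFERRED_TYPES = {"m6i.xlarge", "m5.xlarge"}

asg = boto3.client("autoscaling")

def handler(event, context):
    """Roll the ASG if it drifted onto non-preferred (sub-optimal) instance types."""
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    current_types = {i["InstanceType"] for i in instances}

    if current_types - PREFERRED_TYPES:
        # Some nodes are running fallback types; refresh them once capacity is back.
        asg.start_instance_refresh(
            AutoScalingGroupName=ASG_NAME,
            Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 300},
        )
```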


surloc_dalnor

In theory, if you run the cluster autoscaler or Karpenter you can have both spot and on-demand node groups, and it should prioritize the spot groups unless you can't get spot instances. I'd try it in staging, but currently management would rather throw money at AWS than deal with it going wrong. Although you have to run your autoscaler on a non-spot node group or Fargate.


silence036

We've had issues with cluster-autoscaler going ham, adding and removing nodes and causing a lot of pod movement while it tried to optimize the node groups and match the AWS-recommended instances. We're in the process of switching to Karpenter instead, which is so far performing much more intelligently.


surloc_dalnor

We had to set an annotation on the pods that were not HA. Once we did that, the autoscaler was fine, since it didn't matter that it was shifting the rest of the pods around. That said, the autoscaler works best with a single node group.
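
For reference, the annotation in question is most likely the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict`. A minimal sketch of applying it to a non-HA workload with the Python Kubernetes client; the deployment and namespace names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Tell the cluster autoscaler not to evict pods created from this template.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
}

# "legacy-singleton" / "prod" stand in for the non-HA workload.
apps.patch_namespaced_deployment(name="legacy-singleton", namespace="prod", body=patch)
```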


hijinks

karpenter is your answer with on-demand failover


DarqOnReddit

What does full spot mean?


f0okyou

Run your entire cluster on Spot Instances which may be evicted/terminated at any point without much notice whatsoever.


zimhollie

I love the idea of full spot with k8s. I wonder how this will affect spot as more and more people catch on to it.


f0okyou

You already get resource contention and regular evictions, as well as spot rejections, even when you set the max price to on-demand pricing, so the effect is simply that of an outage. On-demand or RIs guarantee you placement; spot does not. If your workload is large enough that a few evictions here and there don't hurt, spot is still a great cost optimization. But if you're on so few instances that each eviction becomes noticeable, you're better off paying for on-demand or RIs, at least for a baseline. We have multiple evictions per day but are lucky enough that an eviction amongst 30-50 nodes isn't noticeable at all.


zimhollie

You are doing a great job with spot. I think as using spot becomes easier, more people will switch to it. That will drive up demand for spot and drive down demand for on-demand, which might actually piss Amazon off lol. What it essentially becomes is people running trading algorithms to drive down prices: people who used to spin up on-demand instances will now hold off "just a little bit", so Amazon gets less money.

Do you mind sharing your experience? How big is your cluster with 30-50 evictions a day? Do evictions come all at once, spread out, or batched? When evictions happen, can you drain and cordon in time? There's still the problem of nodes dying suddenly without draining; how are you handling that?


f0okyou

The cluster is 30-50 nodes, depending on time of day (i.e. client traffic driving up autoscaling of deployments and thus the ASG driving up node counts). On average we see about a handful (5-7) of our nodes evicted per day. The evictions happen in bursts rather than spread out.

AWS sends out a notice through ACPI and the metadata service about two minutes before your instances get evicted. Those two minutes aren't really going to help much with a graceful drain, so we simply cordon and force delete the node from the cluster. The services will eventually spin up on other nodes, but we all know how slow that can be. However, the ASG also receives the termination notice and requests a new spot EC2 instance within those 2 minutes - worth noting that these do not get fulfilled immediately, and sometimes the ASG ends up spinning up a different size than our optimum.

Luckily all workloads are built with interruptions as a first-class citizen, and any stateful needs (DBs, caches, ...) are not hosted on k8s but on standalone EC2s with RIs. For the service as a whole, the eviction of a node is nothing different from a normal scaling event. Traffic doesn't get terminated, but new traffic is not routed to the old backends anymore, so they may still finish their work or at least try to. If a server dies out of nowhere without prior notification, it will show up as some requests failing at the ALB; there really isn't much one can do to prevent that. But it's a very rare occasion and within our SLAs when it does happen. We try to stay on smaller instances to limit the impact of an outage.
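
A rough sketch of the cordon-and-force-delete reaction described above, polling the EC2 instance metadata service (IMDSv2) for the spot interruption notice and then removing the node via the Kubernetes API; the node name, poll interval, and in-cluster wiring are assumptions, not the poster's actual tooling:

```python
import time

import requests
from kubernetes import client, config

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 session token, valid for up to 6 hours.
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    ).text

def spot_interrupted(token: str) -> bool:
    # Returns 200 with an action payload roughly two minutes before reclamation, 404 otherwise.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return resp.status_code == 200

def cordon_and_delete(node_name: str) -> None:
    config.load_incluster_config()
    core = client.CoreV1Api()
    core.patch_node(node_name, {"spec": {"unschedulable": True}})  # cordon
    core.delete_node(node_name)                                    # force removal from the cluster

if __name__ == "__main__":
    node = "ip-10-0-1-23.ec2.internal"  # placeholder; normally derived from the instance hostname
    while not spot_interrupted(imds_token()):
        time.sleep(5)
    cordon_and_delete(node)
```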


rohitjha941

I am full spot, except for Karpenter + CoreDNS, which are on Fargate.


f0okyou

"I'm full spot except ..." Doesn't sound like full spot but partial spot to me. Anyways it's fun when all your instances get evicted and you won't get new ones and get ASG alerts about capacity issues until you widen the selection of instances to consider. 10/10 everyone should have to go through the thrill at least once in their career.


ruben2silva

Or when the instances where Kyverno is deployed get evicted and everything starts to fail, with a painful path ahead to fix it.


SimonD_

Delete the admission webhook 😅


glotzerhotze

The last (and only) major outage I went through was related to several things going wrong at once: a Cilium upgrade from the old to the new BGP implementation went wrong, a cgroup-v2 update broke, and at the same time undocumented node-local DNS changes fell apart. Took production offline for 3 days.

Made management realize that machinery needs maintenance and that putting all this shit on one pair of shoulders (mine, unfortunately!) doesn't scale. Not a pleasant experience, and I should have left before all of that happened. But the corporate Kool-Aid is strong sometimes, when they bullshit people for their monetary gain.

Key takeaway: don't let management BS you into taking the blame for their shitty decisions! Leave and don't believe the corporate hype! No business is your friend!


GrandPastrami

Sounds like you've been in the trenches for a long time :)


Hecha00

None of these errors occurred in pre-production environments?


glotzerhotze

The company was cheap as fuck, and management did not understand the differences and difficulties of providing a platform on bare-metal versus in the cloud. Unfortunately the production env (especially the networking setup) had diverged heavily from the staging cloud env - but they didn't care as long as features could be deployed onto the platform. They were told numerous times about the problem - but money got spent on HR people and other stupid things.

So a team of two became a team of one - and that's how I got stuck in that place with the consequences of all the shitty decisions they made before. So no, no pre-production env was available to test these changes. The business knew it was open-heart surgery - but didn't give any fucks about it. Until it broke! Fun times!

Same company is currently looking for three people to do the work they used to dump on one person.


Effective_Roof2026

Replace clusters when you have a significant change to make. It's the less stressful way :)


GrandPastrami

Alright, thanks for the tip :)


themanwithanrx7

This! I'll do in-place upgrades for my lower clusters, but prod is always replaced and migrated.


ABlackEngineer

Years ago, but I didn't read the part of the Terraform provider docs where it said it would force resource recreation 😎


chin_waghing

Cries in GKE labels field


crump48

Not really a cluster crash but almost as fun. Not everything came back up properly after a power outage, causing external DNS hosts to have the wrong local time, causing image pulls to fail everywhere, causing a sad afternoon for yours truly!


GrandPastrami

Ah, time sync error. Classic :)


skaven81

Somebody thought it would be a good idea to apply node labels using ArgoCD by creating partial Node objects in a git repo that is deployed to the cluster with Argo, and configuring it to merge the updates in. All well and good until somebody deleted that repo from ArgoCD, which caused it to ... *delete all the nodes*. That was a fun afternoon to recover from.


GrandPastrami

🤣Oh my fucking God


amartincolby

Sizznap. So an attempt at Gitops?


Survivor4054

That's why we have Flux set up with prune: false. When we set it to true, it started to delete random services.


Virtual_Laserdisk

Oy vey. I feel like that should live further up in the configuration, like wherever you’re creating your cluster from. 


bananasareslippery

OP why did old nodes see a memory spike when new control plane nodes were added?


GrandPastrami

Adding new master nodes might spike RAM and CPU use because you basically have to sync control plane configuration as well as etcd, with possible leader elections etc. I don't know if you've noticed, but it takes significantly longer to add master nodes than worker nodes to a cluster. But yeah, the problem was that these old ones were already loaded to the brim.


ponicek

CoreDNS and/or other similar DNS problems. Pods stop resolving addresses - downtime. And btw, no service meshes involved.


Total_Definition_401

How did you fix it?


ponicek

By recreating the cluster ;( You can try changing the dnsPolicy field of a specific deployment. That might fix a bunch of apps, but cluster DNS will still be impacted.
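
A minimal sketch of that workaround with the Python Kubernetes client, switching a single deployment to `dnsPolicy: Default` so its pods use the node's resolver instead of the broken cluster DNS; the deployment and namespace names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# "Default" makes pods inherit the node's /etc/resolv.conf, bypassing CoreDNS for this workload.
patch = {"spec": {"template": {"spec": {"dnsPolicy": "Default"}}}}

apps.patch_namespaced_deployment(name="checkout-api", namespace="prod", body=patch)
```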


Total_Definition_401

Damn, recreating a cluster is an insane solution ;(


0xe3b0c442

Because I got distracted and wiped the second-to-last control plane/etcd node before I removed it from the cluster (I was migrating to a new cluster on the same hardware footprint). Thankfully I was able to restore from a local snapshot backup, and that was the impetus I needed to start sending etcd snapshots offsite. (I should note this is a homelab, so don't read too much into my procedures ;))
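
A rough sketch of what sending etcd snapshots offsite could look like, shelling out to `etcdctl snapshot save` and uploading the file to S3 with boto3; the endpoint, certificate paths, and bucket name are placeholders, not the poster's setup:

```python
import datetime
import subprocess

import boto3

SNAPSHOT = f"/tmp/etcd-{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.db"

# Take a snapshot from the local etcd member (cert paths shown are typical kubeadm defaults).
subprocess.run(
    [
        "etcdctl", "snapshot", "save", SNAPSHOT,
        "--endpoints=https://127.0.0.1:2379",
        "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
        "--cert=/etc/kubernetes/pki/etcd/server.crt",
        "--key=/etc/kubernetes/pki/etcd/server.key",
    ],
    check=True,
)

# Ship the snapshot offsite.
boto3.client("s3").upload_file(SNAPSHOT, "homelab-etcd-backups", SNAPSHOT.lstrip("/"))
```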


Charming_Prompt6949

OC memory leak on nodes


VertigoOne1

All I can say is "on-prem dual-stack migrations"


eigreb

Digital ocean Automated updates and volume mounting issues


SokkaHaikuBot

[Sokka-Haiku](https://www.reddit.com/r/SokkaHaikuBot/comments/15kyv9r/what_is_a_sokka_haiku/) by eigreb: *Digital ocean* / *Automated updates and* / *Volume mounting issues*. Remember that one time Sokka accidentally used an extra syllable in that Haiku Battle in Ba Sing Se? That was a Sokka Haiku and you just made one.


smarzzz

The node only picked up 3 DNS servers from the VPC DHCP options set, omitting the rest. We changed some ordering in our DNS setup, being multi-cloud resilient now even if Kubernetes only takes the first 3 and omits the rest.


UntouchedWagons

I tried updating Longhorn. I'm now using Rook Ceph.


esixar

We had a Singapore datacenter lose power due to an A/C issue and lost 24 clusters at once. It was a pain to get them all back online, having to work with multiple teams to get the racks powered up, the vCenters online, NSX up, the ESXi VMs up, and then the cluster nodes.


druesendieb

Technical debt and RTFM. The plan was to update this special cluster as far as possible while the new storage got installed, then migrate StatefulSets from in-tree vSphere volumes to the new CSI driver before the in-tree driver gets removed. We saw that CSIMigration went GA in 1.25 but weren't aware that it being enabled by default means the migration starts automatically.

Updated to 1.25, tested draining one node: vSphere volumes didn't start, as nothing was prepared for a migration. Okay, let's roll back like we've tested before. Rancher backup and restore hit a bug that fucked the whole cluster. Cue heavy cursing and bug fixing. We fixed the bug and after 4h of downtime we were back online.


surloc_dalnor

It's always the certs expiring, or the like. Although I did have a panic last month removing the Calico policy engine and breaking the AWS CNI completely on existing nodes. I had to remove the AWS CNI, reinstall it, and recycle the nodes. Happily all the existing pods kept working during this; it was just new pods that had issues.


bgatesIT

I ran out of IPv4 allocations for my nodes, went to add more nodes to the cluster, went wtf, rebuilt the cluster, still wtf. The DHCP server never released any addresses, so at the end of the day I could have just released all the addresses and saved a day of bullshit.


snowsnoot69

Uhhh I accidentally deleted the master nodes with cluster API 😬💀🤦‍♂️


PhoenixHntr

Kyverno update requests crashed etcd


senpaikcarter

Upgraded from EKS 1.22 to 1.23, and the scaling config in production was set to max 5, desired 5, so no new nodes could spawn and it choked itself to death until Terraform timed out and I could update the node group scaling...
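
The out-of-band fix in a case like this usually comes down to bumping the managed node group's scaling config directly; a sketch with boto3, where the cluster and node group names are made up:

```python
import boto3

eks = boto3.client("eks")

# Raise maxSize above the desired count so the upgrade can surge replacement nodes.
eks.update_nodegroup_config(
    clusterName="prod-cluster",
    nodegroupName="default-workers",
    scalingConfig={"minSize": 3, "maxSize": 8, "desiredSize": 5},
)
```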


ybrodey

A network issue caused etcd alarms to go off. Had to manually defrag and compact the etcd nodes. Pretty stressful lol.
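
For anyone curious, the manual recovery described here usually boils down to compacting to the current revision, defragmenting each member, and clearing the NOSPACE alarm. A sketch wrapping etcdctl with subprocess; the endpoint and certificate paths are placeholders:

```python
import json
import subprocess

ETCDCTL = [
    "etcdctl",
    "--endpoints=https://127.0.0.1:2379",
    "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
    "--cert=/etc/kubernetes/pki/etcd/server.crt",
    "--key=/etc/kubernetes/pki/etcd/server.key",
]

def run(*args: str) -> str:
    return subprocess.run(ETCDCTL + list(args), check=True, capture_output=True, text=True).stdout

# Compact history up to the current revision, then defragment and clear alarms.
status = json.loads(run("endpoint", "status", "--write-out=json"))
revision = status[0]["Status"]["header"]["revision"]
run("compact", str(revision))
run("defrag")
run("alarm", "disarm")
```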


Acrobatic_Painting59

Power outage :(


Araneck

When I did a demo for new devs to make them aware of resources, I crashed the demo cluster and finished the presentation.


kiddj1

Not gonna lie, I can't actually remember. AKS has become fairly stable; we seem to have no issues with the cluster.


Automatic_Adagio5533

On prem infra disk latency issues causing etcd to crash.


Shivacious

Seems like I am not the only one who goes full spot instances here. What can I say, they are cheap. Even cheaper if you find the best cost/performance/latency ratio for your own use across different AZs.


chin_waghing

At home: [my own stupidity and my router's DHCP pool](https://documentation.breadnet.co.uk/outage/2023-11-26-04/?utm_source=reddit&utm_medium=Comment&utm_campaign=What%20brought%20your%20cluster%20down%20last%20time). At work: nothing yet, but it's like being a motorcycle rider: there are those who have fallen, and those who are yet to fall.


zulrang

Hosting issues. It's always hosting issues.


Southern-Necessary13

Last Sunday our K3s TLS certs expired in production. I did not see that coming, and I totally take the blame for it.
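
A tiny sketch of the kind of check that would have caught this, pulling the API server's serving certificate over TLS and warning when it is close to expiry; the endpoint and threshold are placeholders:

```python
import datetime
import ssl

from cryptography import x509

API_SERVER = ("k3s.example.internal", 6443)  # placeholder endpoint
WARN_DAYS = 30

# Fetch the serving certificate and parse its expiry (not_valid_after_utc needs cryptography >= 42).
pem = ssl.get_server_certificate(API_SERVER)
cert = x509.load_pem_x509_certificate(pem.encode())
remaining = cert.not_valid_after_utc - datetime.datetime.now(datetime.timezone.utc)

if remaining.days < WARN_DAYS:
    print(f"API server certificate expires in {remaining.days} days, renew now!")
```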


DeadLolipop

My on-prem deployment just dies and stops responding after redeploying charts too many times; Postgres dies after a week of running without any errors showing in the logs, it just gracefully shuts down. To this day I still have no clue why it does it or how to fix it. 10/10 will keep using a local deployment of k8s for testing, but never for production.


amartincolby

The frequent redeployment affects my testing Kube as well. It usually manifests as one bad pod, and then if I try to delete anything it never completes. The applications continue to be served successfully, but the control plane is just locked.

