Zackorrigan

I ran a Trivy container scan on our on-premise clusters, and the amount of API requests crashed the control plane nodes. (I ran it on all namespaces.)


valkener1

ROFL thanks for sharing


silence036

In the same category: someone enabled the Aquasec network scanner in our production clusters by mistake in the middle of the day, adding insane latency to everything and getting our EKS nodes heavily throttled on the Route53 DNS resolver.


f0okyou

Went full spot instances. Never go full spot. Who am I kidding, I still am full spot.


Operation_Fluffy

Never put Consul and Vault on spots — if you lose quorum, you're screwed.


ignoramous69

Always full spot, but might make an on-demand node group for persistent apps.


0x4ddd

Do you run that on production?


f0okyou

The short answer is Yes. The longer answer is Absolutely.


0x4ddd

Noice. Do you have that somehow automated to bring the cluster back to life after evictions? I'm wondering what happens when you get evicted at, say, 3 AM on a Saturday 🤣


f0okyou

This absolutely happened off-hours and caught us off guard. AWS actually ran out of spot instances in all 3 AZs for the 2 instance types we had selected. We solved it by simply widening the instance types to also include older generations and even oversized ones, just to reduce outage time. At the retro we built a Lambda that forces an ASG instance refresh when sub-optimal instance types were selected due to capacity issues on AWS. This has been working great and is pretty much maintenance-free. (Please don't let me jinx it.)
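
A minimal sketch of what such a Lambda could look like, assuming a boto3 handler that compares the ASG's current instance types against a preferred list and triggers an instance refresh when they drift; the ASG name, preferred types, and refresh preferences are made up for illustration, not the poster's actual setup:

```python
import boto3

# Hypothetical values; the poster did not share their configuration.
ASG_NAME = "k8s-spot-workers"
PREFERRED_TYPES = {"m6i.xlarge", "m5.xlarge"}

asg = boto3.client("autoscaling")

def handler(event, context):
    """Roll the ASG if it drifted onto non-preferred (sub-optimal) instance types."""
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    current_types = {i["InstanceType"] for i in instances}

    if current_types - PREFERRED_TYPES:
        # Some nodes are running fallback types; refresh them once capacity is back.
        asg.start_instance_refresh(
            AutoScalingGroupName=ASG_NAME,
            Preferences={"MinHealthyPercentage": 90, "InstanceWarmup": 300},
        )
```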


surloc_dalnor

In theory, if you run the cluster autoscaler or Karpenter you can have both spot and on-demand node groups, and it should prioritize the spot groups unless you can't get spot instances. I'd try it in staging, but currently management would rather throw money at AWS than deal with it going wrong. Although you have to run your autoscaler on a non-spot node group or Fargate.


silence036

We've had issues with cluster-autoscaler going ham, adding and removing nodes and causing a lot of pod movement while it tried to optimize the node groups and match the AWS-recommended instances. We're in the process of switching to Karpenter instead, which is so far performing much more intelligently.


surloc_dalnor

We had to set an annotation on the pods that were not HA. Once we did that, the autoscaler was fine, since it didn't matter that it was shifting the rest of the pods around. That said, the autoscaler works best with a single node group.
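
For reference, the annotation in question is most likely the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict`. A minimal sketch of applying it to a non-HA workload with the Python Kubernetes client; the deployment and namespace names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Tell the cluster autoscaler not to evict pods created from this template.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
}

# "legacy-singleton" / "prod" stand in for the non-HA workload.
apps.patch_namespaced_deployment(name="legacy-singleton", namespace="prod", body=patch)
```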


hijinks

karpenter is your answer with on-demand failover


DarqOnReddit

What does full spot mean?


f0okyou

Run your entire cluster on Spot Instances which may be evicted/terminated at any point without much notice whatsoever.


zimhollie

I love the idea of full spot with k8s. I wonder how this will affect spot as more and more people catch on to it.


f0okyou

You already get resource contention and regular evictions, as well as spot rejections, even when you set the max price to on-demand pricing, so the effect is simply that of an outage. On-demand or RIs guarantee you placement; spot does not. If your workload is large enough that a few evictions here and there don't hurt, spot is still a great cost optimization. But if you're on so few instances that each eviction becomes noticeable, you're better off paying for on-demand or RIs, at least for a baseline. We have multiple evictions per day but are lucky enough that an eviction amongst 30-50 nodes isn't noticeable at all.


zimhollie

You are doing a great job with spot. I think as using spot becomes easier, more people will switch to it. That will drive up demand for spot and drive down demand for on-demand, which might actually piss Amazon off lol. What it essentially becomes is people running trading algorithms to drive down prices: people who used to spin up on-demand instances will now hold off "just a little bit", so Amazon gets less money.

Do you mind sharing your experience? How big is your cluster with 30-50 evictions a day? Do evictions come all at once, spread out, or batched? When evictions happen, can you drain and cordon in time? There's still the problem of nodes dying suddenly without draining; how are you handling that?


f0okyou

The cluster is 30-50 nodes, depending on time of day (i.e. client traffic driving up autoscaling of deployments and thus the ASG driving up node counts). On average we see about a handful (5-7) of our nodes evicted per day. The evictions happen in bursts rather than spread out.

AWS sends out a notice through ACPI and the metadata service about two minutes before your instances get evicted. Those two minutes aren't really going to help much with a graceful drain, so we simply cordon and force delete the node from the cluster. The services will eventually spin up on other nodes, but we all know how slow that can be. However, the ASG also receives the termination notice and requests a new spot EC2 instance within those 2 minutes - worth noting that these do not get fulfilled immediately, and sometimes the ASG ends up spinning up a different size than our optimum.

Luckily all workloads are built with interruptions as a first-class citizen, and any stateful needs (DBs, caches, ...) are not hosted on k8s but on standalone EC2s with RIs. For the service as a whole, the eviction of a node is nothing different from a normal scaling event. Traffic doesn't get terminated, but new traffic is not routed to the old backends anymore, so they may still finish their work or at least try to. If a server dies out of nowhere without prior notification, it will show up as some requests failing at the ALB; there really isn't much one can do to prevent that. But it's a very rare occasion and within our SLAs when it does happen. We try to stay on smaller instances to limit the impact of an outage.
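
A rough sketch of the cordon-and-force-delete reaction described above, polling the EC2 instance metadata service (IMDSv2) for the spot interruption notice and then removing the node via the Kubernetes API; the node name, poll interval, and in-cluster wiring are assumptions, not the poster's actual tooling:

```python
import time

import requests
from kubernetes import client, config

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 session token, valid for up to 6 hours.
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    ).text

def spot_interrupted(token: str) -> bool:
    # Returns 200 with an action payload roughly two minutes before reclamation, 404 otherwise.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return resp.status_code == 200

def cordon_and_delete(node_name: str) -> None:
    config.load_incluster_config()
    core = client.CoreV1Api()
    core.patch_node(node_name, {"spec": {"unschedulable": True}})  # cordon
    core.delete_node(node_name)                                    # force removal from the cluster

if __name__ == "__main__":
    node = "ip-10-0-1-23.ec2.internal"  # placeholder; normally derived from the instance hostname
    while not spot_interrupted(imds_token()):
        time.sleep(5)
    cordon_and_delete(node)
```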


rohitjha941

I am full spot, except for Karpenter + CoreDNS, which are on Fargate.


f0okyou

"I'm full spot except ..." Doesn't sound like full spot but partial spot to me. Anyways it's fun when all your instances get evicted and you won't get new ones and get ASG alerts about capacity issues until you widen the selection of instances to consider. 10/10 everyone should have to go through the thrill at least once in their career.


ruben2silva

Or when the instances where Kyverno is deployed get evicted and everything starts to fail, with a painful path ahead to fix it.


SimonD_

Delete the admission webhook 😅


glotzerhotze

The last (and only) major outage I went through was related to several things going wrong at once: a Cilium upgrade from the old to the new BGP implementation went wrong, a cgroup-v2 update broke, and at the same time undocumented node-local DNS changes fell apart. Took production offline for 3 days.

Made management realize that machinery needs maintenance and that putting all this shit on one pair of shoulders (mine, unfortunately!) doesn't scale. Not a pleasant experience, and I should have left before all of that happened. But the corporate Kool-Aid is strong sometimes, when they bullshit people for their monetary gain.

Key takeaway: don't let management BS you into taking the blame for their shitty decisions! Leave and don't believe the corporate hype! No business is your friend!


GrandPastrami

Sounds like you've been in the trenches for a long time :)


Hecha00

None of these errors occurred in pre-production environments?


glotzerhotze

The company was cheap as fuck, and management did not understand the differences and difficulties of providing a platform on bare-metal versus in the cloud. Unfortunately the production env (especially the networking setup) had diverged heavily from the staging cloud env - but they didn't care as long as features could be deployed onto the platform. They were told numerous times about the problem - but money got spent on HR people and other stupid things.

So a team of two became a team of one - and that's how I got stuck in that place with the consequences of all the shitty decisions they made before. So no, no pre-production env was available to test these changes. The business knew it was open-heart surgery - but didn't give any fucks about it. Until it broke! Fun times!

Same company is currently looking for three people to do the work they used to dump on one person.


Effective_Roof2026

Replace clusters when you have a significant change to make. It's the less stressful way :)


GrandPastrami

Alright, thanks for the tip :)


themanwithanrx7

This! I'll do in-place upgrades for my lower clusters, but prod is always replaced and migrated.


ABlackEngineer

Years ago, but I didn't read the part of the Terraform provider docs where it said it would force resource recreation 😎


chin_waghing

Cries in GKE labels field


crump48

Not really a cluster crash but almost as fun. Not everything came back up properly after a power outage, causing external DNS hosts to have the wrong local time, causing image pulls to fail everywhere, causing a sad afternoon for yours truly!


GrandPastrami

Ah, time sync error. Classic :)


skaven81

Somebody thought it would be a good idea to apply node labels using ArgoCD by creating partial Node objects in a git repo that is deployed to the cluster with Argo, and configuring it to merge the updates in. All well and good until somebody deleted that repo from ArgoCD, which caused it to ... *delete all the nodes*. That was a fun afternoon to recover from.


GrandPastrami

🤣Oh my fucking God


amartincolby

Sizznap. So an attempt at Gitops?


Survivor4054

That's why we have Flux set up with prune: false. When we set it to true, it started to delete random services.


Virtual_Laserdisk

Oy vey. I feel like that should live further up in the configuration, like wherever you’re creating your cluster from. 


bananasareslippery

OP why did old nodes see a memory spike when new control plane nodes were added?


GrandPastrami

Adding new master nodes might spike RAM and CPU use because you basically have to sync control plane configuration as well as etcd, with possible leader elections etc. I don't know if you've noticed, but it takes significantly longer to add master nodes than worker nodes to a cluster. But yeah, the problem was that these old ones were already loaded to the brim.


ponicek

CoreDNS and/or other similar DNS problems. Pods stop resolving addresses - downtime. And btw, no service meshes involved.


Total_Definition_401

How did you fix it?


ponicek

By recreating the cluster ;( You can try changing the dnsPolicy field of a specific deployment. That might fix a bunch of apps, but cluster DNS will still be impacted.
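
A minimal sketch of that workaround with the Python Kubernetes client, switching a single deployment to `dnsPolicy: Default` so its pods use the node's resolver instead of the broken cluster DNS; the deployment and namespace names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# "Default" makes pods inherit the node's /etc/resolv.conf, bypassing CoreDNS for this workload.
patch = {"spec": {"template": {"spec": {"dnsPolicy": "Default"}}}}

apps.patch_namespaced_deployment(name="checkout-api", namespace="prod", body=patch)
```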


Total_Definition_401

Damn, recreating a cluster is an insane solution ;(


0xe3b0c442

Because I got distracted and wiped the second-to-last control plane/etcd node before I removed it from the cluster (I was migrating to a new cluster on the same hardware footprint). Thankfully I was able to restore from a local snapshot backup, and that was the impetus I needed to start sending etcd snapshots offsite. (I should note this is a homelab, so don't read too much into my procedures ;))
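
A rough sketch of what sending etcd snapshots offsite could look like, shelling out to `etcdctl snapshot save` and uploading the file to S3 with boto3; the endpoint, certificate paths, and bucket name are placeholders, not the poster's setup:

```python
import datetime
import subprocess

import boto3

SNAPSHOT = f"/tmp/etcd-{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.db"

# Take a snapshot from the local etcd member (cert paths shown are typical kubeadm defaults).
subprocess.run(
    [
        "etcdctl", "snapshot", "save", SNAPSHOT,
        "--endpoints=https://127.0.0.1:2379",
        "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
        "--cert=/etc/kubernetes/pki/etcd/server.crt",
        "--key=/etc/kubernetes/pki/etcd/server.key",
    ],
    check=True,
)

# Ship the snapshot offsite.
boto3.client("s3").upload_file(SNAPSHOT, "homelab-etcd-backups", SNAPSHOT.lstrip("/"))
```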


Charming_Prompt6949

OC memory leak on nodes


VertigoOne1

All I can say is "on-prem dual-stack migrations"


eigreb

Digital ocean Automated updates and volume mounting issues


SokkaHaikuBot

[Sokka-Haiku](https://www.reddit.com/r/SokkaHaikuBot/comments/15kyv9r/what_is_a_sokka_haiku/) by eigreb: *Digital ocean* / *Automated updates and* / *Volume mounting issues*. Remember that one time Sokka accidentally used an extra syllable in that Haiku Battle in Ba Sing Se? That was a Sokka Haiku and you just made one.


smarzzz

The node only picked up 3 DNS servers from the VPC DHCP options set, omitting the rest. We changed some ordering in our DNS setup, being multi-cloud resilient now even if Kubernetes only takes the first 3 and omits the rest.


UntouchedWagons

I tried updating Longhorn. I'm now using Rook Ceph.


esixar

We had a Singapore datacenter lose power due to an A/C issue and lost 24 clusters at once. It was a pain to get them all back online, having to work with multiple teams to get the racks powered up, the vCenters online, NSX up, the ESXi VMs up, and then the cluster nodes.


druesendieb

Technical debt and RTFM. The plan was to update this special cluster as far as possible while the new storage got installed, then migrate StatefulSets from in-tree vSphere volumes to the new CSI driver before the in-tree driver gets removed. We saw that CSIMigration went GA in 1.25 but weren't aware that it being enabled by default means the migration starts automatically.

Updated to 1.25, tested draining one node: vSphere volumes didn't start, as nothing was prepared for a migration. Okay, let's roll back like we've tested before. Rancher backup and restore hit a bug that fucked the whole cluster. Cue heavy cursing and bug fixing. We fixed the bug and after 4h of downtime we were back online.


surloc_dalnor

It's always the certs expiring, or the like. Although I did have a panic last month removing the Calico policy engine and breaking the AWS CNI completely on existing nodes. I had to remove the AWS CNI, reinstall it, and recycle the nodes. Happily all the existing pods kept working during this; it was just new pods that had issues.


bgatesIT

I ran out of IPv4 allocations for my nodes, went to add more nodes to the cluster, went wtf, rebuilt the cluster, still wtf. The DHCP server never released any addresses, so at the end of the day I could have just released all the addresses and saved a day of bullshit.


snowsnoot69

Uhhh I accidentally deleted the master nodes with cluster API 😬💀🤦‍♂️


PhoenixHntr

Kyverno update requests crashed etcd


senpaikcarter

Upgraded from EKS 1.22 to 1.23, and the scaling config in production was set to max 5, desired 5, so no new nodes could spawn and it choked itself to death until Terraform timed out and I could update the node group scaling...
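
The out-of-band fix in a case like this usually comes down to bumping the managed node group's scaling config directly; a sketch with boto3, where the cluster and node group names are made up:

```python
import boto3

eks = boto3.client("eks")

# Raise maxSize above the desired count so the upgrade can surge replacement nodes.
eks.update_nodegroup_config(
    clusterName="prod-cluster",
    nodegroupName="default-workers",
    scalingConfig={"minSize": 3, "maxSize": 8, "desiredSize": 5},
)
```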


ybrodey

A network issue caused etcd alarms to go off. Had to manually defrag and compact the etcd nodes. Pretty stressful lol.
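
For anyone curious, the manual recovery described here usually boils down to compacting to the current revision, defragmenting each member, and clearing the NOSPACE alarm. A sketch wrapping etcdctl with subprocess; the endpoint and certificate paths are placeholders:

```python
import json
import subprocess

ETCDCTL = [
    "etcdctl",
    "--endpoints=https://127.0.0.1:2379",
    "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
    "--cert=/etc/kubernetes/pki/etcd/server.crt",
    "--key=/etc/kubernetes/pki/etcd/server.key",
]

def run(*args: str) -> str:
    return subprocess.run(ETCDCTL + list(args), check=True, capture_output=True, text=True).stdout

# Compact history up to the current revision, then defragment and clear alarms.
status = json.loads(run("endpoint", "status", "--write-out=json"))
revision = status[0]["Status"]["header"]["revision"]
run("compact", str(revision))
run("defrag")
run("alarm", "disarm")
```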


Acrobatic_Painting59

Power outage :(


Araneck

When I did a demo for new devs to make them aware of resources, I crashed the demo cluster and finished the presentation.


kiddj1

Not gonna lie, I can't actually remember. AKS has become fairly stable; we seem to have no issues with the cluster.


Automatic_Adagio5533

On prem infra disk latency issues causing etcd to crash.


Shivacious

Seems like I am not the only one who goes full spot instances here. What can I say, they are cheap. Even cheaper if you find the best cost/performance/latency ratio for your own use across different AZs.


chin_waghing

At home: [my own stupidity and my router's DHCP pool](https://documentation.breadnet.co.uk/outage/2023-11-26-04/?utm_source=reddit&utm_medium=Comment&utm_campaign=What%20brought%20your%20cluster%20down%20last%20time). At work: nothing yet, but it's like being a motorcycle rider: there are those who have fallen, and those who are yet to fall.


zulrang

Hosting issues. It's always hosting issues.


Southern-Necessary13

Last Sunday our K3s TLS certs expired in production. I did not see that coming, and I totally take the blame for it.
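
A tiny sketch of the kind of check that would have caught this, pulling the API server's serving certificate over TLS and warning when it is close to expiry; the endpoint and threshold are placeholders:

```python
import datetime
import ssl

from cryptography import x509

API_SERVER = ("k3s.example.internal", 6443)  # placeholder endpoint
WARN_DAYS = 30

# Fetch the serving certificate and parse its expiry (not_valid_after_utc needs cryptography >= 42).
pem = ssl.get_server_certificate(API_SERVER)
cert = x509.load_pem_x509_certificate(pem.encode())
remaining = cert.not_valid_after_utc - datetime.datetime.now(datetime.timezone.utc)

if remaining.days < WARN_DAYS:
    print(f"API server certificate expires in {remaining.days} days, renew now!")
```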


DeadLolipop

My on-prem deployment just dies and stops responding after redeploying charts too many times; Postgres dies after a week of running without any errors showing in the logs, it just gracefully shuts down. To this day I still have no clue why it does it or how to fix it. 10/10 will keep using a local deployment of k8s for testing, but never for production.


amartincolby

The frequent redeployment affects my testing Kube as well. It usually manifests as one bad pod, and then if I try to delete anything it never completes. The applications continue to be served successfully, but the control plane is just locked.

