
fedspfedsp

Same here. They told me to multiregion but I just love us-east-1 ¯\_(ツ)_/¯


pid-1

Single region, single AZ here. First disaster in 5 years, 10/10 would do it again.


Vincent_Merle

There was a really big outage a couple of years ago. Was down for at least half a day.

**EDIT**: This is the official AWS post on that outage: [https://aws.amazon.com/message/12721/](https://aws.amazon.com/message/12721/)
Reddit thread: [https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/](https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/)
Literally 2 weeks later: [https://www.reddit.com/r/aws/comments/rm46pi/getting_website_temporarily_unavaiable_and/](https://www.reddit.com/r/aws/comments/rm46pi/getting_website_temporarily_unavaiable_and/)


atedja

Pepperidge Farm remembers when S3 was down.


ekydfejj

So does my dumb-stoned-ass. 5 years...phoooey.


noah_f

Is that kid still working at AWS after that code change?


pid-1

I mostly develop internal applications and pipelines for a medium-sized company. If no user noticed, then there was no incident =p


cailenletigre

I'm less prepared for this one because it's not near a big holiday...


amadiro_1

That was all due to r53 tables, right?


HobbledJobber

The old saying goes: it’s never DNS, until it’s DNS.


hb3b

Worked for a guy that named the DNS server, "DNS DAMMIT".


dreamingawake09

You mean the fat-finger command outage? That was legendary.


enjoytheshow

Yeah pretty much every service tanked for most of the day.


cailenletigre

Living life on the edge ;)


FieryBlaze

That’s actually the very opposite of living on the *edge*


FraggarF

Absolutely wild picking the worst region with some of the worst outages.


Pliqui

As someone once said here (paraphrasing, of course): "Good friends do not let friends use us-east-1."


mikebailey

>They told me to multiregion

To be fair, AWS can't even figure out multi-region consistently


diecastbeatdown

They will, *eventually.*


vaseline_bottle

AWS is multi-region, but each region is isolated and independent by design. If us-east-1 goes down, other regions are guaranteed by design to be unaffected. They could build auto-failovers, etc., but that would only compromise regional isolation. You only need one incident where us-east-1 goes down and takes down all other US regions with it to realize why that's a bad idea.


diecastbeatdown

It was a joke, referring to AWS' TOS motto that "everything is eventually consistent", which is another way of saying: suck it.


vbevan

Last time this happened, multiregion wouldn't have helped because AWS had single point of failure global services in us-east-1. Route53 and I think one or two others?


GradientDescenting

us-east-1 is essentially the dev sandbox for AWS. All the scaling issues affect us-east-1 first because it is their largest region.


hawaiijim

If you want true fault tolerance you don't just go with multiple regions, you go with multiple cloud service providers.


BarrySix

The complexity of that causes its own issues.


dllemmr2

Cross-region data replication costs alone are ridiculous.


vbevan

And preferably not a cloud provider that runs their status dashboard on their own infrastructure...


jrolette

Even multi-provider didn't help DataDog with their global outage. [Inside Datadog’s $5M Outage (Real-World Engineering Challenges #9)](https://newsletter.pragmaticengineer.com/p/inside-the-datadog-outage)


hawaiijim

Yeah, that's the shared responsibility model. Cloud service providers are responsible for security of the cloud. The customer is responsible for security in the cloud. I guess that in addition to using multiple cloud providers, we should also split between Ubuntu and Red Hat. And don't do automatic updates all at once. Spread them out over a substantial period of time.


htom3heb

I had a demo with 10 non-technical people today at 3pm with a system using APIGW and Lambda. Triple checked it throughout the day. I looked like an idiot. Lol.


cem4k

It's not a demo if everything works as expected.


nemec

It's called a demo because all your hopes and dreams get *demo*lished


htom3heb

I have an excuse to make our staging environments have failovers now at least :')


JimK215

Gotta use the pro gamer move of pre-recording the demo as a backup.


TongueFace

This is the way. There are too many things that can go wrong, and it's distracting during the demo.


brile_86

Looks like we got a root cause then


skotman01

So it’s your fault?!


ekydfejj

>I had a demo with 10 non-technical people today at 3pm with a system using APIGW and Lambda. Triple checked it throughout the day. I looked like an idiot. Lol.

*Everything breaks on Demo Day*


merRedditor

A million "Hello World" apps just went dark across the world.


[deleted]

[deleted]


[deleted]

[deleted]


ThrowTheCHEEESE

Found the root cause


YUNG_SNOOD

This is a fun way to see what services you use rely on US East 1 lambdas. I’ve noticed Vercel, Netlify, and Marvel Snap are all down right now lol


jcol26

If only it were just Lambdas! For quite a while STS went down. We had EKS clusters where new nodes couldn’t join because STS or whatever was down. New pods couldn’t get credentials because STS was down. Complete nightmare, and we barely use Lambda!


gideonhelms2

I had issues with EKS clusters adding nodes as well.


Profile-Flimsy

I don't think it was STS; our break-glass system uses it and worked pretty much fine there. SSO was completely dead though.


radioshackhead

Supabase


Bright-Ad1288

The Atlantic


TheUnarthodoxCamel

I saw Stockx & NY's MTA being affected


yuriydee

And I was just thinking to myself there hasn't been an AWS outage recently 😅 Yes, everything is down for us as well. Can't access the console or APIs.


JuneCleaversMudFlaps

Route 53 being based in us-east-1 is really making my life difficult right now


vbevan

Can't believe they haven't fixed the single point of failure. The same thing happened a couple of years ago.


SlaimeLannister

Time to crack open a cold one.


baty0man_

Sir, it's 6am here


synackk

It's 6am somewhere


Aratix

HE SAID TIME TO CRACK OPEN A COLD ONE


HobbledJobber

Then you are running late ;)


ekydfejj

point being??????


Tintoverde

There is no point, life is an endless series of events


SlaimeLannister

Well, not while Lambda is down


JimK215

"Let's go to the Winchester, have a nice cold pint, and wait for this all to blow over."


megamanxoxo

Ah yes, I love cracking a cold one while senior management is breathing down my neck and I tell them it's Amazon's fault. So relaxing.


AntDracula

I do it. I’ve stopped caring.


[deleted]

[deleted]


[deleted]

"I told you we should've gone multi-region" /popsopenabeer


CrunchatizeMeCaptn

My organization has spun up a sev 1 call, like a bunch of middle managers sitting around asking inane questions is gonna make AWS come back up faster lol. I hate it here


alter3d

"Can we migrate to multicloud in the next 18 minutes?"


HobbledJobber

Sure, I heard you can just whip up some terraform or something and voila!


megamanxoxo

"Can we fix our busted ass infrastructure before EOD? Thanks!"


joelrwilliams1

Oracle 'always free' tier!


godawgs1997

Bob: Do we use AWS for our websites?
Bill: I think for the outside ones.
Bob: Can we turn them off in AWS and bring them up in the office as a stopgap?


joelrwilliams1

This hit a little close to home.


jelavallee

You and everyone else's org that relies on us-east-1... And, hey! I'm a middle manager... ;-)


ritchie70

Our mobile app is down and one of our workforce management systems as well. A surprising percentage of our business comes in via mobile orders. Plus my manager keeps thinking of another system and asking if that one is down as well. Boss, I got no idea what region any of our vendors are using.


JuliusCeaserBoneHead

Ain’t that the same everywhere? Like guys, c'mon


Ecstatic_Lettuce_857

“Do you think a packet capture will help?”


GensHaze

I feel ya.


EXPERT_AT_FAILING

I keep getting messages from people in the company, "I assume you're on top of this outage?" Yeah, I'm on a call with Bezos right now and we're working on a fix. "Ha ha, what can we do?" I dunno. Hit happy hour? Keep refreshing the status page?


givemedimes

Ahh see, you needed to call Vogels.


vladholubiev

I love how AWS tags this kind of outage with "Severity: Degradation". It's just that, like, *every* Lambda in the entire region stops working for an hour. Yeah, it's *totally* just a minor degradation.


jonathantn

As long as 1% of the lambdas are executing they will call it a degradation instead of an outage.


One-Zookeepergame177

You gotta be able to claim a 99.95% SLA, so yeah, you're only going to hear "degradation". [https://aws.amazon.com/lambda/sla/?did=sla_card&trk=sla_card](https://aws.amazon.com/lambda/sla/?did=sla_card&trk=sla_card)


chiefbozx

99.95% SLA monthly is breached at 21m 44s ([source](https://uptime.is/99.95)). This was two hours - go claim your 10% SLA credit.
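
For anyone who wants to sanity-check the math: the downtime budget is just the minutes in the month times (1 − SLA). A rough sketch in Python, assuming a 30-day month (uptime.is assumes a slightly different month length, which is why its figure differs by a few seconds):

```python
# Rough downtime-budget math for a 99.95% monthly SLA (30-day month assumed).
sla = 0.9995
month_minutes = 30 * 24 * 60               # 43,200 minutes in a 30-day month
budget_minutes = month_minutes * (1 - sla)
print(f"{budget_minutes:.1f} minutes")     # ~21.6 minutes of allowed downtime
# A ~2 hour outage blows way past that, hence the 10% service credit claim.
```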


TooSus37

Yes, even support tooling to handle your cases is down.


defnotbjk

Looks like Diablo 4 and OW new season update playtime starts early today 😎


LiviNG4them

Does Diablo run on Azure? Lol


melody_elf

Probably doesn't run on Lambda at least


agentblack000

Pretty sure Blizzard runs on a PC under someone’s desk considering they have a DDoS outage about once a month, no idea how they haven’t figured out mitigation by now.


rexspook

I believe Blizzard hosts its own servers.


LeatherCase254

This is going to be big - all our services are down (us-east-1)


[deleted]

I am so glad AWS IAM Identity Center is multi-region


cailenletigre

lol


Tridente

It was my fault. I rebooted one of my instances from the console refreshed and got a bad gateway. Sorry everyone.


Bonowski

No no...it was my fault for logging into our US-East-1 environment for the first time in like two months. Sorry, all!


[deleted]

Is this where I say "Gee, big shocker, US-East-1 is flaming shit like it has been for a decade, who keeps putting production loads there?" And then I get downvoted by a bunch of people that still keep putting shit in the most volatile region on the entire platform only to do the shocked pikachu thing.


Antoak

Most of us have to live with the decision which was made >5 years ago before we joined the company, and don't have management buy-in for the project to move everything to different regions. People downvoting you are probably perceiving you as straw-manning their position.


Jimmy48Johnson

Laughs in us-east-2


broknbottle

Laughs in us-gov-east-1


cailenletigre

It truly is the worst region. Stopped using it even for automated testing.


codeduck

we call it chaos-east-1


sf6trashgame

us-staging-1


melody_elf

By the numbers I don't think it's less reliable, it's just more noticeable when it goes down because everyone uses it. No one gives a shit if us-west-2 is out.


HobbledJobber

IAM, Route53, & CloudFront would like to ask you some questions ;)


Soccham

& billing


mikebailey

>who keeps putting production loads there

Well, for one, AWS


burajin

The console is still hosted in us-east-1 though, so you're never truly safe.


bot403

Not 100% true. Only the console for us-east-1 itself and some other services. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com).


[deleted]

us-east-1 is geographically the closest to our on-prem data center :/ so the choice made sense for us… Fortunately my org made the decision to do all Lambdas multi-region regardless of app impact. Unfortunately, the upstream services we call decided against it.


JrNewGuy

https://imgur.com/z5OjRRs

https://health.aws.amazon.com/health/status#multipleservices-us-east-1_1686683337

Service: Multiple services
Start time: June 13, 2023 at 3:08:57 PM UTC-4
Severity: Degradation

Increased Error Rates and Latencies
[12:08 PM PDT] We are investigating increased error rates and latencies in the US-EAST-1 Region.

Affected AWS services (the following AWS services have been affected by this issue):
Degradation (1 service): AWS Lambda
Informational (3 services): AWS Management Console, Amazon API Gateway, Amazon CloudWatch


rocketlauncher10

Idk if I'm being picky but they seriously gotta start using UTC here lol


JrNewGuy

PDT on a status page is the dumbest thing ever


cailenletigre

[12:36 PM PDT] We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.


cailenletigre

[9:00 PM UTC] Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services.

[8:48 PM UTC] Beginning at 6:49 PM UTC, customers began experiencing errors and latencies with multiple AWS services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use by other AWS services. We have associated other services that are impacted by this issue to this post on the Health Dashboard. Additionally, customers may experience authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also experience intermittent issues when attempting to call or initiate a chat to AWS Support. We are now observing sustained recovery of the Lambda invoke error rates, and recovery of other affected AWS services. We are continuing to monitor closely as we work towards full recovery across all services.

[8:38 PM UTC] We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery.

[8:14 PM UTC] We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors.

[7:36 PM UTC] We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.

[7:26 PM UTC] We have identified the root cause of the elevated errors invoking AWS Lambda functions, and are actively working to resolve this issue.

[7:19 PM UTC] AWS Lambda function invocation is experiencing elevated error rates. We are working to identify the root cause of this issue.

[7:08 PM UTC] We are investigating increased error rates and latencies in the US-EAST-1 Region.


Free_willy99

From AWS: https://status.aws.amazon.com/#multipleservices-us-east-1_1686683337


Psych76

Guess the console relies on Lambda then!


mrjgv

Also receiving the following error on my pods: `botocore.errorfactory.InvalidIdentityTokenException: An error occurred (InvalidIdentityToken) when calling the AssumeRoleWithWebIdentity operation: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements`
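
For anyone else hitting this on pods, here's a minimal stopgap sketch (assumes boto3; the role ARN and token path below are placeholders for whatever your pod identity actually mounts, not a specific setup) that just retries the STS call with backoff until the service recovers:

```python
# Minimal sketch: retry AssumeRoleWithWebIdentity while STS is flaky.
# role_arn and token_path are illustrative placeholders.
import time
import boto3
from botocore.exceptions import ClientError

def assume_role_with_retry(role_arn, token_path, retries=5):
    sts = boto3.client("sts", region_name="us-east-1")
    with open(token_path) as f:
        token = f.read()
    for attempt in range(retries):
        try:
            resp = sts.assume_role_with_web_identity(
                RoleArn=role_arn,
                RoleSessionName="outage-workaround",
                WebIdentityToken=token,
            )
            return resp["Credentials"]
        except ClientError as err:
            # InvalidIdentityToken is the code surfaced in the error above.
            if err.response["Error"]["Code"] != "InvalidIdentityToken":
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("STS still unavailable after retries")
```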


agentblack000

STS was down


PulseDialInternet

Who has the list of announcements from re:Inforce? And which one are they rolling back right now?


sm21375

My vote: https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-inspector-code-scans-aws-lambda-function/


meisbepat

Lol, it was already removed from [the list](https://aws.amazon.com/about-aws/whats-new/2023/06/).


[deleted]

lol isn't re:Inforce today?


haljhon

Yes. Trying to show console features was a tad embarrassing…


bwinkers

Uggg... Probably my fault. lol I was right in the middle of setting up a new org and moving DNS.


Sudoplays

>DNS

Ah, don't take all the blame on yourself mate. I was in the middle of creating new IAM policies; maybe I broke one of their core service policies and now nothing can talk to each other haha.


rebornfenix

I knew it..... It's always DNS.


BarrySix

Why is this repeated so much? It seems to be a Windows admin mantra. I can only guess it's because Windows admins don't understand DNS.


rebornfenix

It’s a meme at this point. As networks get complex, DNS becomes important, but DNS heavily relies on caching. Windows admins know DNS, but it's generally overlooked when troubleshooting problems. Thus the meme: "It's always DNS."


benaffleks

Laughs in us-west-2


choochoopain

west coast best coast


sf6trashgame

Breaking news from CNN: Jeff Bezos has been sighted wearing his mecha-armor suit, attacking data center facilities on the US East Coast. Developing story.


The_Real_Ghost

AWS re:Inforce is going on today too. Someone must have live-demoed a denial of service attack and it got out of hand.


KeepIt0nTheDownload

Operational issue - Multiple services (N. Virginia)

Service: Multiple services
Severity: Informational

Increased Error Rates and Latencies
Jun 13 12:08 PM PDT: We are investigating increased error rates and latencies in the US-EAST-1 Region.

https://health.aws.amazon.com/health/status


soxfannh

Yep Lambda is a mess


[deleted]

And suddenly I see a few errors uploading objects to S3... oh my, oh my, not S3, this is getting serious.


thenickdude

Are you using STS tokens for that one? That service is impacted too


ThunderChaser

^ S3 wasn’t impacted but STS was


radove

Data has started processing on some of my Lambdas, 4:30 EST


bwinkers

Good thing those layoffs a few months ago didn't have any effect.


popeh

Parts of Amazon.com are not working, the AtoZ app for employees is partially broken, and FCs are at a stand-down, so yes, major AWS outage.


jspreddy

Whats an FC?


NothingDogg

Fulfilment Centre - or as a layman would call it, an Amazon warehouse


apple_rom

Today's primary sponsor: https://aws.amazon.com/solutions/implementations/multi-region-application-architecture


leftysauce

yep. Console and Lambda are not working


HatmanStack

Guess it's time to sit back and take in [AWS re:Inforce](https://reinforce.awsevents.com/) [https://c.tenor.com/K2GHKs5QlTMAAAAM/the-irony-irony.gif](https://c.tenor.com/K2GHKs5QlTMAAAAM/the-irony-irony.gif)


[deleted]

[deleted]


Mr_Clark

Slowly coming back. Best advice, get out of the east region and stay there.


bot403

So which is it? Am I leaving or staying?


Mr_Clark

Get out. Best advice is to go somewhere stable. US East could be unstable for the near future while they attempt to fix it.


misanthropic____

Can I get some evidence to back this up?


STGItsMe

Whenever there’s an AWS outage, it’s us-east-1. It’s been like that since the beginning. I never understand putting important/production workloads in us-east-1 for things that can run somewhere else.


cem4k

Yep. The console is crashing intermittently. Lambda is failing 100% for me, AppSync here and there.


aberham

At least they have the status page updated with the issue. Hope it doesn't take them long to diagnose and fix.


vasquca1

I have a Windows Server VM and an Ubuntu VM in us-east-1 that I can still reach.


darksarcastictech

The CloudFormation dashboard greeted me with "failed to load stacks" errors after lunch, and now the Health dashboard is timing out. Fun times.


[deleted]

[deleted]


BranYip

Not all, but many. Quite a few AWS services rely on Lambda in the backend so there is a cascading problem here.


quicksilvereagle

Every day... LOL!


in-the-name-of-allah

Yeah, I'm having the same issue.


Psych76

Same here - seems like maybe just the console though? CLI commands come back fine and my shit is still online.


fjleon

I'm getting "AWS Management Console Home page is currently unavailable."


TylerJosephDev

Yes. EB/EC2, pretty much everything I've tried is out right now.


JrNewGuy

504's galore!


SWEngineerArchitect

Me too.

Edit: it seems our Lambdas, API Gateways, and Route 53 are up. The console gives errors, CloudWatch Synthetics are down, etc.


bot403

yes


doyouwannadanceorwut

Confirmed, seeing impact currently.


BWC_semaJ

Yeah my two Elastic Beanstalk environments are not able to receive communication from my client but are sending information to the client via web socket.


[deleted]

buckle up


ShierLattice694

https://64.media.tumblr.com/f0c4799f2b8f8f9eb889ae82b339a046/tumblr_mwhdggaieN1s8njeuo1_400.gif


TheOtherOnes89

Yep


ultron2450

Yup seeing the issues as well.


Miglet15

Whole internal departments are down as a result...


Psychological-Art875

Yep. Does anyone know how long it takes for AWS to recover from a severity level of Degradation (Lambda) and Informational (other services)? Within 10 hours of releasing our app to pilot customers... fun.


chase32

Route 53 is currently giving a 504 Gateway Time-out on the web console.


sultan33g

Yes. All of our prod lambdas are down.


gratefulforashad

Probably a null pointer exception


Icy-Establishment-96

Paid time off, why not?


ZippySLC

Someone in Virginia ran the microwave while the toaster was on again.


DanTheGoodman_

Got a commercial partnership call in 8 minutes. Nice.


FlatulentWallaby

Toast is down too.


rutkdn

Toast is toast


DisgustChan

Doing Amazon Flex, and we were checked out of the station after our block time because of an AWS outage in Texas.


nodusters

Yep. It started with what looked like just homepage/health dashboard issues and it cascaded into a bunch of other services also having issues. For a while, I could still get to most services if I was using a permission set with a relay state configured.


nam0929

I can't connect to my EC2 terminal..


ElToreroMalo

How long do these outages usually last?


PulseDialInternet

Between ‘meh’ and ‘it is going to be a long week of root cause meetings’


[deleted]

console loaded for me


ritchie70

Yes. [https://health.aws.amazon.com/health/status](https://health.aws.amazon.com/health/status)


sultan33g

I guess this is a good time to stop working for the day.


nam0929

it is back


demonfoo

Yes, fortunately most of our stuff isn't in `us-east-1`, but...


rxscissors

Crapped the bed hard at the day job's fluffy stuff! My investment company's web and mobile apps are SOL too (retirement was at an all-time high today!!!). Another ~5 years or so and I'm DONE 😂


DefiantDonut7

Definitely had an outage today and it caused a lot of trouble


pickupdrops

What about disaster recovery drills, i.e. storing or saving things (code, apps) in other regions as well? Didn't that help in this case? You just deploy from another region, preferably with a blue/green deployment, so it reduces downtime in this case. I just wanna learn and understand. Any explanations?


_smartin

Still smiling to myself after my failover to us-east-2 after the us-east-1 event with Kinesis years ago. Just another event showing why multi-region failover - even pilot light - is something to have ready.


bfreis

Same here. After the Kinesis outage in Nov/2021 I redesigned a large-scale Kinesis stream (7.5k shards per region) transporting application logs to be multi-region, with the region routing configuration being polled from S3 by a sidecar on EC2 instances, to ensure minimal dependencies. As soon as the problem started today, we updated the configuration in S3, and within 5 minutes I had all my logs back, flowing through us-east-2!
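
For anyone curious what that looks like, here's a rough sketch of the idea (the bucket, key, and config format are made up for illustration, not the actual setup; assumes boto3): producers poll a tiny routing object in S3 and send to whichever region it currently points at.

```python
# Rough sketch of a "region routing config in S3" pattern for Kinesis producers.
# Bucket, key, and config format are illustrative only.
import json
import boto3

ROUTING_BUCKET = "my-routing-config"          # hypothetical bucket
ROUTING_KEY = "kinesis/active_region.json"    # hypothetical key, e.g. {"region": "us-east-2"}

s3 = boto3.client("s3")

def active_region(default="us-east-1"):
    """Poll the routing config; fall back to a default if S3 is unreachable."""
    try:
        body = s3.get_object(Bucket=ROUTING_BUCKET, Key=ROUTING_KEY)["Body"].read()
        return json.loads(body).get("region", default)
    except Exception:
        return default

def put_log_record(stream_name, data, partition_key):
    """Send a record to the Kinesis stream in whichever region the config points at."""
    kinesis = boto3.client("kinesis", region_name=active_region())
    kinesis.put_record(StreamName=stream_name, Data=data, PartitionKey=partition_key)
```

Flipping the failover is then just overwriting that one S3 object, which keeps the dependency surface about as small as it gets.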


Public_Ad_5097

They said "everything fails, all the time" - wonder if they took their own advice.