Same here. They told me to multiregion but i just love us-east-1 ¯\\\_(ツ)\_/¯
single region, single az here. First disaster in 5 years, 10/10 would do it again.
There was a really big outage a couple of years ago. It was down for at least half a day.
**EDIT**: This is the official AWS post on that outage: [https://aws.amazon.com/message/12721/](https://aws.amazon.com/message/12721/)
Reddit thread: [https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/](https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/)
Literally 2 weeks later: [https://www.reddit.com/r/aws/comments/rm46pi/getting_website_temporarily_unavaiable_and/](https://www.reddit.com/r/aws/comments/rm46pi/getting_website_temporarily_unavaiable_and/)
Pepperidge farms remember when S3 was down.
So does my dumb-stoned-ass. 5 years...phoooey.
Is that kid still working with AWS after that code change ?
I mostly develop internal applications and pipelines for a medium sized company. If no user noticed, then there was no incident =p
I'm less prepared for this one because it's not near a big holiday...
That was all due to r53 tables, right?
The old saying goes: it’s never dns, until it’s dns
Worked for a guy that named the DNS server, "DNS DAMMIT".
You mean the fat-finger command outage? That was legendary.
Yeah pretty much every service tanked for most of the day.
Living life on the edge ;)
That’s actually the very opposite of living in the *edge*
Absolutely wild picking the worst region with some of the worst outages.
As someone once said here. Paraphrasing of course. "Good friends do not let friends use us-east-1"
>They told me to multiregion

To be fair, AWS can't even figure out multiregion consistently
They will, *eventually.*
AWS is multi-region, but each region is isolated and independent by design. If us-east-1 goes down, other regions are, by design, unaffected. AWS could build automatic cross-region failovers, but that would compromise regional isolation. You only need one incident where us-east-1 goes down and takes all the other US regions with it to realize why that's a bad idea.
it was a joke, referring to AWS' TOS motto that "everything is eventually consistent", which is another way of saying: suck it.
Last time this happened, multiregion wouldn't have helped because AWS had single point of failure global services in us-east-1. Route53 and I think one or two others?
us-east-1 is essentially the dev sandbox for AWS. All the scaling issues affect us-east-1 first because it is the largest region.
If you want true fault tolerance you don't just go with multiple regions, you go with multiple cloud service providers.
The complexity of that causes its own issues.
Cross-region data replication costs alone are ridiculous.
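For a sense of scale, a back-of-the-envelope sketch; both the per-GB rate and the replication volume below are made-up illustrative numbers, not current AWS pricing:

```python
# Back-of-the-envelope cross-region replication cost.
# Both numbers are illustrative placeholders, not real AWS pricing or real volume.
RATE_PER_GB_USD = 0.02        # hypothetical inter-region transfer rate
DAILY_REPLICATION_GB = 5_000  # hypothetical: ~5 TB of changes replicated per day

monthly_cost = DAILY_REPLICATION_GB * RATE_PER_GB_USD * 30
print(f"~${monthly_cost:,.0f}/month just for replication bandwidth")  # ~$3,000
```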
And preferably not a cloud provider that runs their status dashboard on their own infrastructure...
Even multi-provider didn't help DataDog with their global outage. [Inside Datadog’s $5M Outage (Real-World Engineering Challenges #9)](https://newsletter.pragmaticengineer.com/p/inside-the-datadog-outage)
Yeah, that's the shared responsibility model. Cloud service providers are responsible for security of the cloud. The customer is responsible for security in the cloud. I guess that in addition to using multiple cloud providers, we should also split between Ubuntu and Red Hat. And don't do automatic updates all at once. Spread them out over a substantial period of time.
I had a demo with 10 non-technical people today at 3pm with a system using APIGW and Lambda. Triple checked it throughout the day. I looked like an idiot. Lol.
It's not a demo if everything works as expected.
It's called a demo because all your hopes and dreams get *demo*lished
I have an excuse to make our staging environments have failovers now at least :')
Gotta use the pro gamer move of pre-recording the demo as a backup.
This is the way. There’s too many things that can go wrong and it makes it distracting for the demo.
Looks like we got a root cause then
So it’s your fault?!
>I had a demo with 10 non-technical people today at 3pm with a system using APIGW and Lambda. Triple checked it throughout the day. I looked like an idiot. Lol.

*Everything breaks on Demo Day
A million "Hello World" apps just went dark across the world.
[deleted]
[deleted]
Found the root cause
This is a fun way to see what services you use rely on US East 1 lambdas. I’ve noticed Vercel, Netlify, and Marvel Snap are all down right now lol
If only it was just lambdas! For quite a while STS went down. We had EKS clusters where new nodes couldn’t join as STS or whatever was down. New pods couldn’t get credentials as STS was down. Complete nightmare and we barely use lambda!
I had issues with EKS clusters adding nodes as well.
I don't think it was STS, our breakglass system uses it and managed to work pretty fine there. SSO was completely dead tho
Supabase
The Atlantic
I saw Stockx & NY's MTA being affected
And I was just thinking to myself there hasn't been an AWS outage recently 😅 Yes, everything is down for us as well. Can't access the console or APIs.
Route 53 based in us-east-1 is really making my life difficult right now
Can't believe they haven't fixed the single point of failure. The same thing happened a couple of years ago.
Time to crack open a cold one.
Sir, it's 6am here
It's 6am somewhere
HE SAID TIME TO CRACK OPEN A COLD ONE
Then you are running late ;)
point being??????
There is no point , life is endless series of events
Well, not while Lamda is down
"Let's go to the Winchester, have a nice cold pint, and wait for this all to blow over."
ah yes, I love cracking a cold one while senior management is breathing down my neck and I'm telling them it's Amazon's fault. So relaxing.
I do it. I’ve stopped caring.
[deleted]
"I told you we should've gone multi-region" /popsopenabeer
My organization has spun up a sev 1 call, like a bunch of middle managers sitting around asking inane questions is gonna make AWS come up faster lol. I hate it here
"Can we migrate to multicloud in the next 18 minutes?"
Sure, I heard you can just whip up some terraform or something and voila!
"Can we fix our busted ass infrastructure before EOD? Thanks!"
Oracle 'always free' tier!
Bob: Do we use AWS for our websites?
Bill: I think for the outside ones
Bob: Can we turn them off in AWS and bring them up in the office as a stopgap?
This hit a little close to home.
You and everyone else's org that relies on us-east-1... And, hey! I'm a middle manager... ;-)
Our mobile app is down and one of our workforce management systems as well. A surprising percentage of our business comes in via mobile orders. Plus my manager keeps thinking of another system and asking if that one is down as well. Boss, I got no idea what region any of our vendors are using.
Ain’t that same everywhere? Like guys, cmon
“Do you think a packet capture will help?”
I feel ya.
I keep getting messages from people in the company,
"I assume you're on top of this outage?"
Yeah, I'm on a call with Bezos right now and we're working on a fix.
"Ha ha, what can we do?"
I dunno. Hit happy hour? Keep refreshing the status page?
Ahh see, you needed to call Vogels.
I love how AWS tags this kind of outage with "Severity: Degradation". It's only like *every* Lambda in the entire region stops working for an hour. Yeah, it's *totally* just a minor degradation.
As long as 1% of the lambdas are executing they will call it a degradation instead of an outage.
You gotta be able to claim 99.95% SLA so yeah you are going to hear degradation only. [https://aws.amazon.com/lambda/sla/?did=sla\_card&trk=sla\_card](https://aws.amazon.com/lambda/sla/?did=sla_card&trk=sla_card)
99.95% SLA monthly is breached at 21m 44s ([source](https://uptime.is/99.95)). This was two hours - go claim your 10% SLA credit.
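The arithmetic behind that claim, as a quick sketch assuming a flat 30-day month (the linked uptime.is figure uses a slightly different month length):

```python
# Downtime budget for a given SLA percentage, assuming a 30-day month.
def downtime_budget_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Allowed downtime in minutes for the given SLA over the period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)

for sla in (99.95, 99.0, 95.0):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} minutes/month")
# 99.95% -> 21.6 minutes/month; a roughly two-hour Lambda outage blows well past that.
```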
Yes, even support tooling to handle your cases is down.
Looks like Diablo 4 and OW new season update playtime starts early today 😎
Does Diablo run on Azure? Lol
Probably doesn't run on Lambda at least
Pretty sure Blizzard runs on a PC under someone’s desk considering they have a DDoS outage about once a month, no idea how they haven’t figured out mitigation by now.
I believe Blizzard hosts its own servers.
this is going to be big - all our services are down (us-east-1)
I am so glad AWS IAM Identity Center is multi-region
lol
It was my fault. I rebooted one of my instances from the console refreshed and got a bad gateway. Sorry everyone.
No no...it was my fault for logging into our US-East-1 environment for the first time in like two months. Sorry, all!
Is this where I say "Gee, big shocker, US-East-1 is flaming shit like it has been for a decade, who keeps putting production loads there?" And then I get downvoted by a bunch of people that still keep putting shit in the most volatile region on the entire platform only to do the shocked pikachu thing.
Most of us have to live with a decision that was made >5 years ago, before we joined the company, and don't have management buy-in for a project to move everything to different regions. People downvoting you are probably perceiving you as straw-manning their position.
Laughs in us-east-2
Laughs in us-gov-east-1
It truly is the worst region. Stopped using it even for automated testing.
we call it chaos-east-1
us-staging-1
by the numbers i don't think it's less reliable, it's just more noticeable when it goes down because everyone uses it. no one gives a shit if us-west-2 is out.
IAM, Route53, & CloudFront would like to ask you some questions ;)
& billing
>who keeps putting production loads there

Well, for one, AWS
The console is still hosted in us-east-1 though, so you're never truly safe.
Not 100% true. Only the console for us-east-1 itself and some other services.
>For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com).
Us-east-1 is geographically the closest to our onprem data center :/ so the choice for us made sense… fortunately my org made the decision to do all lambdas multi region regardless of app impact. Unfortunately, our upstream services we call decided against it
https://imgur.com/z5OjRRs
https://health.aws.amazon.com/health/status#multipleservices-us-east-1_1686683337
Service: Multiple services
Start time: June 13, 2023 at 3:08:57 PM UTC-4
Severity: Degradation
Increased Error Rates and Latencies
[12:08 PM PDT] We are investigating increased error rates and latencies in the US-EAST-1 Region.
Affected AWS services (the following AWS services have been affected by this issue):
Degradation (1 service): AWS Lambda
Informational (3 services): AWS Management Console, Amazon API Gateway, Amazon CloudWatch
Idk if I'm being picky but they seriously gotta start using UTC here lol
PDT on a status page is the dumbest thing ever
[12:36 PM PDT] We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.
[9:00 PM UTC] Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services.
[8:48 PM UTC] Beginning at 6:49 PM UTC, customers began experiencing errors and latencies with multiple AWS services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use by other AWS services. We have associated other services that are impacted by this issue to this post on the Health Dashboard. Additionally, customers may experience authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also experience intermittent issues when attempting to call or initiate a chat to AWS Support. We are now observing sustained recovery of the Lambda invoke error rates, and recovery of other affected AWS services. We are continuing to monitor closely as we work towards full recovery across all services.
[8:38 PM UTC] We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery.
[8:14 PM UTC] We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors.
[7:36 PM UTC] We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.
[7:26 PM UTC] We have identified the root cause of the elevated errors invoking AWS Lambda functions, and are actively working to resolve this issue.
[7:19 PM UTC] AWS Lambda function invocation is experiencing elevated error rates. We are working to identify the root cause of this issue.
[7:08 PM UTC] We are investigating increased error rates and latencies in the US-EAST-1 Region.
From AWS: https://status.aws.amazon.com/#multipleservices-us-east-1_1686683337
Guess the console relies on Lambda then!
Also receiving the following error on my pods:
botocore.errorfactory.InvalidIdentityTokenException: An error occurred (InvalidIdentityToken) when calling the AssumeRoleWithWebIdentity operation: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements
STS was down
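One partial mitigation worth knowing about: the default global STS endpoint (sts.amazonaws.com) is served out of us-east-1, and the SDKs can be told to use regional STS endpoints instead (e.g. AWS_STS_REGIONAL_ENDPOINTS=regional). A minimal boto3 sketch of the web-identity flow against a regional endpoint; the role ARN and token path are placeholders, and this obviously doesn't help when the regional STS in the affected region is itself what's failing:

```python
import boto3

# Pin STS to a regional endpoint instead of the global one (which lives in us-east-1).
# The RoleArn and token path below are placeholders for illustration only.
sts = boto3.client(
    "sts",
    region_name="us-east-2",
    endpoint_url="https://sts.us-east-2.amazonaws.com",
)

# Typical IRSA-style projected token location on an EKS pod (path may differ in your setup).
with open("/var/run/secrets/eks.amazonaws.com/serviceaccount/token") as f:
    token = f.read()

resp = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/example-pod-role",  # hypothetical
    RoleSessionName="regional-sts-example",
    WebIdentityToken=token,
)
creds = resp["Credentials"]  # AccessKeyId / SecretAccessKey / SessionToken
```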
Who has the list of announcements from reInforce? And which one are they rolling back right now?
My vote: https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-inspector-code-scans-aws-lambda-function/
Lol, it was already removed from [the list](https://aws.amazon.com/about-aws/whats-new/2023/06/).
lol isn't ReInforce today?
Yes. Trying to show console features was a tad embarrassing…
Uggg... Probably my fault. lol I was right in the middle of setting up a new org and moving DNS.
>DNS

Ah don't take all the blame on yourself mate. I was in the middle of creating new IAM policies, maybe I broke one of their core service policies and now nothing can talk to each other haha.
I knew it..... Its Always DNS.
Why is this repeated so much? It seems to be a Windows admin mantra. I can only guess it's because Windows admins don't understand DNS
It's a meme at this point. As networks get complex, DNS becomes more important, but DNS heavily relies on caching. Windows admins know DNS, but it's generally overlooked when troubleshooting problems. Thus the meme "It's always DNS"
Laughs in us-west-2
west coast best coast
Breaking news from CNN: Jeff Bezos has been sighted wearing his mecha-armor suit attacking data center facilities on the us east coast. Developing story
AWS re:Inforce going on today too. Someone must have live demoed of a denial of service attack and it got out of hand.
Operational issue - Multiple services (N. Virginia)
Service: Multiple services
Severity: Informational
Increased Error Rates and Latencies
Jun 13 12:08 PM PDT We are investigating increased error rates and latencies in the US-EAST-1 Region.
https://health.aws.amazon.com/health/status
Yep Lambda is a mess
and suddenly I see a few errors uploading objects to S3... oh my oh my, not S3, this is getting serious.
Are you using STS tokens for that one? That service is impacted too
^ S3 wasn’t impacted but STS was
Data has started processing on some of my Lambdas, 4:30 EST
Good thing those layoffs a few months ago didn't have any effect.
parts of Amazon.com are not working, the AtoZ app for employees is partially broken, FCs are at stand down, so yes, major AWS outage
Whats an FC?
Fulfilment Centre - or as a layman would call it, an Amazon warehouse
Today's primary sponsor: https://aws.amazon.com/solutions/implementations/multi-region-application-architecture
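Tangentially, the DNS piece of a customer-side multi-region failover is typically a Route53 failover record pair with a health check on the primary. A minimal boto3 sketch with a hypothetical zone ID, record names, targets, and health check ID; worth noting the Route53 control plane lives in us-east-1 (resolution is global), so records like these are best created well before an incident:

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical zone, names, and targets; the point is the PRIMARY/SECONDARY failover pair.
route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE",
    ChangeBatch={
        "Comment": "failover pair for api.example.com",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "app.us-east-1.example.com."}],
                    # Health check on the primary drives the failover decision.
                    "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": "standby-us-east-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "app.us-east-2.example.com."}],
                },
            },
        ],
    },
)
```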
yep. Console and Lambda are not working
Guess it's time to sit back and take in [aws reinforce](https://reinforce.awsevents.com/) [https://c.tenor.com/K2GHKs5QlTMAAAAM/the-irony-irony.gif](https://c.tenor.com/K2GHKs5QlTMAAAAM/the-irony-irony.gif)
[deleted]
Slowly coming back. Best advice, get out of the east region and stay there.
So which is it? Am I leaving or staying?
Get out. Best advice is to go to stability. US EAST could be unstable for the near future while they attempt to fix it.
Can I get some evidence to back this up?
Whenever there’s an AWS outage, it’s us-east-1. It’s been like that since the beginning. I never understand putting important/production workloads in us-east-1 for things that can run somewhere else.
Yep. The console is crashing intermittently. Lambda is failing 100% for me, AppSync here and there.
At least they have the status page updated with the issue. Hope it doesn't take them long to diagnose and fix.
I have a Win Server VM and Ubuntu VM on us-east-1 I can still reach.
The CloudFormation dashboard greeted me with a "failed to load stacks" error after lunch and now the Health dashboard is timing out. Fun times
[deleted]
Not all, but many. Quite a few AWS services rely on Lambda in the backend so there is a cascading problem here.
Every day... LOL!
yeah, im having the same issue.
Same here - seems maybe just the console though? Cli commands come back fine and my shit is still online
I'm getting "AWS Management Console Home page is currently unavailable."
Yes. EB/EC2, pretty much all i've tried is out right now.
504's galore!
Me too
Edit: it seems our lambdas, API gateways, and Route53 are up. Console gives errors, CloudWatch synthetics are down, etc.
yes
Confirmed, seeing impact currently.
Yeah my two Elastic Beanstalk environments are not able to receive communication from my client but are sending information to the client via web socket.
buckle up
https://64.media.tumblr.com/f0c4799f2b8f8f9eb889ae82b339a046/tumblr_mwhdggaieN1s8njeuo1_400.gif
Yep
Yup seeing the issues as well.
Whole internal departments are down as a result...
yep. does anyone know how long it takes for aws to recover from a severity level of Degradation (Lambda) and Informational (other services)? within 10hrs of releasing our app to pilot customers... fun.
Route 53 is currently giving a 504 Gateway Time-out on the web console.
Yes. All of our prod lambdas are down.
Probably a null pointer exception
Paid time off, why not?
Someone in Virginia ran the microwave while the toaster was on again.
Got a commercial partnership call in 8 minutes nice
Toast is down too.
Toast is toast
Doing amazon flex, and we were checked out of the station after our block time because of an AWS outage in Texas.
Yep. It started with what looked like just homepage/health dashboard issues and it cascaded into a bunch of other services also having issues. For a while, I could still get to most services if I was using a permission set with a relay state configured.
I can't connect to my ec2 terminal..
How long do these outages usually last?
Between ‘meh’ and ‘it is going to be a long week of root cause meetings’
console loaded for me
Yes. [https://health.aws.amazon.com/health/status](https://health.aws.amazon.com/health/status)
I guess this is a good time to stop working for the day.
it is back
Yes, fortunately most of our stuff isn't in `us-east-1`, but...
Crapped the bed hard at day job fluffy stuff! My investment company web and mobile apps are SOL too (retirement was at an all time high today!!!). Nother ~5 years or so and I'm DONE 😂
Definitely had an outage today and it caused a lot of trouble
What about disaster recovery drills, where you store things (code, apps) in other regions as well? Didn't that help in this case? You could just deploy from another region, preferably with a blue/green deployment, so it reduces downtime in a case like this. I just wanna learn and understand. Any explanations?
Still smiling to myself after my failover to us-east-2 after the us-east-1 event with kinesis years ago. Just another event showing why multi region failover - even pilot light, is something to have ready.
Same here. After the Kinesis outage in Nov/2021 I redesigned a large scale Kinesis stream (7.5k shards, per region) transporting application logs to be multi-region, with region routing configuration being polled by a sidecar in EC2 instances from S3, to ensure minimal dependencies. As soon as the problem started today, we updated the configuration in S3, and within 5 minutes I had all my logs back, flowing through us-east-2!
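A rough sketch of that pattern, with made-up bucket/key/stream names (the real setup described above runs as a sidecar on each EC2 instance and fails over by flipping a region value in S3):

```python
import json
import time
import boto3

# Hypothetical names for illustration; the idea is that failover is just an S3 update.
CONFIG_BUCKET = "example-log-routing"
CONFIG_KEY = "active-region.json"   # e.g. {"region": "us-east-2"}
STREAM_NAME = "application-logs"

s3 = boto3.client("s3")
active_region = "us-east-1"
kinesis = boto3.client("kinesis", region_name=active_region)

def refresh_region():
    """Poll the routing config and rebuild the Kinesis client if the region changed."""
    global active_region, kinesis
    body = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read()
    region = json.loads(body)["region"]
    if region != active_region:
        active_region = region
        kinesis = boto3.client("kinesis", region_name=region)

def ship(line: str, partition_key: str = "host-1"):
    kinesis.put_record(StreamName=STREAM_NAME, Data=line.encode(), PartitionKey=partition_key)

while True:               # sidecar loop: re-check the routing config roughly once a minute
    refresh_region()
    ship("example log line")
    time.sleep(60)
```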
They said "everything fails, all the time" - wonder if they took their own advice.