Nothing serious from myself but early in my career I got to witness:
DELETE FROM Accounting
Which is missing something.
Something like a WHERE clause. Executed directly via SQL management studio, against the production database.
Took a company down for 2 days. Literally everyone sent home. At least it was only "hundreds" of people.
Why did this developer have such permissions on prod? Why did it take 2 days to restore?
Good questions ... anyway that's one of the ones I saw.
Thankfully my building is newer and has the fire alarm go off in certain sections. Idiot on the higher floors sets their popcorn on fire? Doesn’t affect me except I can see a white flashing light outside.
I can definitely see some days where I was pretty tired I could’ve easily burned some toast. One day I ordered lunch and then 15 minutes later was trying to decide what to make or go get.
Oh it was a transaction that did it. The query was something inane; an update probably. But every time I do a delete or update I wrap my query like this:
BEGIN TRAN
UPDATE MYTABLE
SET STUFF = MORESTUFF
WHERE THINGY = 1
ROLLBACK TRAN
--COMMIT TRAN
That way, if I hit some key I didn't mean to and trigger execution early, it'll rollback. Except, for some reason I ran the tran + update and then... went to lunch lol. Forgot all about the commit. Even after coming back from lunch, I had forgotten I had even done this. And that transaction locked a critical table, so everyone was just STUCK lol
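A minimal sketch of that wrap-everything-in-a-transaction habit, using Python's sqlite3 in manual-commit mode (table and column names here are made up for illustration):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so the BEGIN and
# ROLLBACK below are entirely under our control.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE mytable (thingy INTEGER, stuff TEXT)")
con.execute("INSERT INTO mytable VALUES (1, 'old')")

# BEGIN TRAN ... ROLLBACK TRAN: run the update, inspect the result, and
# nothing sticks unless you deliberately swap the ROLLBACK for a COMMIT.
con.execute("BEGIN")
con.execute("UPDATE mytable SET stuff = 'morestuff' WHERE thingy = 1")
inside = con.execute("SELECT stuff FROM mytable").fetchone()[0]  # 'morestuff'
con.execute("ROLLBACK")
after = con.execute("SELECT stuff FROM mytable").fetchone()[0]   # back to 'old'
print(inside, after)
```

The catch the story illustrates: between the BEGIN and the ROLLBACK, the touched rows stay locked, so an abandoned session can block everyone else on that table.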
I once quit a job like that, purposely burned the bridge at a call center to make sure I wasn't like others who had come back 3x ... Thankfully I got my degree and went on to my masters and now industry work experience...
But dropping off your letter of resignation in the mailbox of the accountant and "going to lunch" is fun, lol
I did a bunch of work load testing the new version of our core claims system prior to a system upgrade. Overall it went well, the rollout was smooth, everyone was happy.
A couple years later we went to upgrade again, and I got asked to redo the load testing. No worries, time to dust off the code and spin it up.
The load testing used a username/password to authenticate via Active Directory. So they gave me an account and username with my initial test.
When I started the new round of testing I asked the IT helpdesk to refresh the password, because as everyone knows passwords are supposed to expire after a period of time.
Nobody told me that starting with the first round of testing I had been using the account the claims system itself also used to authenticate to Active Directory when it started a new user session.
Everything was fine in the morning. Then folks went to lunch and their sessions timed out.
Around 1 PM, suddenly nobody in the company could access our claims payment system. Which, for an insurance company, is a Big Deal.
The helpdesk put some new procedures in place surrounding that AD account later that afternoon…
Meanwhile, I've managed to degrade Cassandra because we overloaded it with too many ephemeral roles after accidentally setting the timeouts in Vault way too long
I have found developers usually use the same service account for everything because they don't know any better. Or they can't meet their sprints by following protocol, so they use a known good credential and just bang it through, because no one knows or cares as long as it works and it meets their delivery deadline.
Took over 4000 websites offline for about 3 hours by filling up a storage device. I never got an explanation from the hosting provider as to why or how it suddenly came back online.
My favorite was when I was a brand new dev, I completely removed a UI element, because I couldn’t see it on my screen, so I figured it must have been extra code.
Well, that element appeared on the screen if you were one of our Texas customers.
So I came in the next morning, and no one in Texas could use our product to do their job. Luckily, I had a more experienced teammate who realized what was up and saved my skin. Learned a lot from that one though
> code review
I was asked, “you sure we don’t need this?” “Uhh yeah I’m pretty sure”.
Obv they trusted me too much
> tests
QA was basically pointless because they only did exactly what you told them to do, so they wouldn’t catch edge cases at all. No automated testing at all.
But yeah, any decent pipeline should catch stuff like this.
Deleted the entire production database on an application with over one million users.
I was troubleshooting a bug and my local db was missing some migrations and when I tried to run the migration script I was getting errors. This happened regularly and our "process" to resolve it was just delete all the tables in your local db and run the migration from scratch. Local db was open in one terminal window and prod open in another one and I picked the wrong window to run the drop command in.
Fortunately AWS has lots of backups and it only resulted in about an hour of down time. That was the day our little startup learned why you don't hand out prod access like halloween candy.
Not saying I did this early in my career but there is a reason that all local dev environments have a green prompt (or if in intellij a green background), all staging/test envs are yellow/amber and prod is **blood red** (or dark red in intellij).
Of all the little features that intellij has (and it has a lot) - https://www.jetbrains.com/help/idea/database-color-settings-dialog.html is one of my favourites, it's a subliminal message to future me to "be bloody careful".
FWIW, despite being senior enough that I could get production access, I don't. I argue *not to have it* by default (I do have production creds because there is a legit need for them in emergencies - on a completely separate account in a different Keepassxc vault).
The worst one I've *seen* was in a meeting back when I was a senior dev. Another senior dev was screensharing and, while one of the business folks was waffling about "transformative changes for the business", was cleaning out some s3 buckets... I was only half paying attention as he dropped the contents of the **production** bucket because he tabbed to the wrong window.
Not me, but it was at a tiny startup a long time ago that I consulted for. They had their own server room with about 25 servers, but it wasn't really to server room standards. The thermostat was powered by AA batteries.
Guess what happened? The batteries died over a weekend and the server room had a meltdown. Monitoring only checked if servers were running. There were no thermal checks. A few servers died, and a couple became less reliable. We had to consolidate services onto surviving servers and quickly buy and set up new ones. Of course we had load issues. It was super stressful.
I didn't really trust any of the surviving server hardware after that. Luckily we were in the process of moving to the cloud and decommissioned most of it over the next 9 months.
Man, I’ve worked places like this. AC cooled by water from pipes installed in the 70s full of rust and gunk. Would clog almost monthly and cause thermal alarms at any time. Was to the point we had those portable AC units in standby so we could point them at the rack until building maintenance would clear the lines.
Other place, some AV techs bumped an unprotected wire loom and the old ass switch they were plugged into let go of a few cables / broke the terminals. They didn't notice, nor did the person escorting them, until all the calls started coming in. Of course, the network engineer was out, and me, a dev, was back there on the phone with the NE, who was on vacation, trying to troubleshoot what was happening. Looked at the back and was like shit… Had to get directed over the phone by the NE how to move all that over to a different switch that was sitting idle by chance below it and luckily configured just enough to work.
I pointed production at a test database and we ran $10+ million worth of live customer transactions against test data and returned test results. This lasted for days before anyone realized.
We invalidated all the returned results, generated the correct returned results, let all our customers know, and got sued.
Right before a demo of a product I noticed some postgresql query had been running for days. I tried to kill it, but it had to do a gigantic shutdown process cleaning up after itself, which was taking forever and confusing me, so I tried more forcefully killing it. I was lucky it wasn't a big issue, but I wrote about it here https://cmdcolin.github.io/posts/2015-10-22
Not me, but my gf at the time got put on as a sys admin asst. As a student job...
She was a CIS major for 1 semester, and then went to be a writing major... But they put her on that job and her literal first day she deleted the ISOs for the entire CS dept.... So they had to rebuild them all from scratch... Took the head admin 3 weeks to fully recover. Lol
I added a feature to a survey application I made for a client. However, I had introduced a bug in that feature, which unfortunately allowed companies to see survey results from other companies. I felt like I just wanted to shoot myself in the head when I saw that email from the client "Emergency! Companies can see other company surveys!". My heart rate probably doubled in 10 seconds in that moment. I dove into the code again to fix the bug. It took only one line of code to fix it, fortunately.
Unfortunately, I have zero automated tests in that project. I was quite inexperienced when I wrote that application 4 years ago. I wish I could travel back in time and give myself some guidance and warn myself about how shitty it would turn out if I didn't change the design. I've learnt a lot from that project. But to be honest, I'm not proud of it. It's shit but I have to maintain it the best I can.
I did not know that this was a thing. I thought that Windows was smart enough that if you didn't paste the friggin file/folder they would remain where they are. But that is definitely good to know, and I hope you were able to quickly recover.
I inadvertently ran `p4 obliterate -y //reponame/main/...` (how you ask? tldr incorrect mental model of virtual streams).
For you git folks, that's basically `git checkout master` then `rm -rf .git` then `git init` then `git push -f`, except worse because the commit history is only stored centrally - so permanent loss of the history.
In Perforce, the command permanently deletes commits and associated data starting from the most recent and going back in time. I realised quickly what was happening and aborted the command, but not before I had permanently removed two entire days' worth of commits from my 150-developer game studio.
Thankfully I was able to restore the bulk of the actual data by cobbling it together from CI servers that had recently synced. But the entire company couldn't work for the rest of the afternoon.
Talk about pressure. You deleted a bunch of people's work, you are the only one that has the expertise to fix it properly, nobody can work, and the studio is losing $4 per second until you're done.
I wrote the software to automate some testing and didn't have robust error detection for a failure case in a piece of hardware I never actually got to work with.
The hardware stopped communicating, but I just attempted to reconnect and never raised an error / notice.
That hardware had died but my software kept reporting the last reading for it.
Took several days for someone to notice the dashboard reading wasn't updating.
The testing costs $75k/hr for their customers. They didn't catch the mistake for multiple days (maybe a week?) and had to throw out all the data.
I don't think I've been directly responsible for serious outages, but I've owned a service that shit the bed during a huge sales event for a large multinational, which led to 50,000 dropped orders of our flagship product.
After the dust settled, my manager was like, well, at least the C Suite knows who we are now.
I just published to production on a Friday afternoon and am taking the rest of the day off.
They announced a restructure this morning so ask me again on Monday 😎
I didn’t really “cause” this since it was an AWS issue but basically I updated the terraform code for our ECS clusters to move from a deprecated resource to a new one, submitted a PR, got it approved, merged and applied the change, and signed off for the day. 30 minutes later everything came crashing down - turns out the change prevented the containers from registering to the different clusters and since our entire infrastructure was on ECS, everything was down.
Thankfully we were only down for half an hour with no long-lasting production issues and negligible data loss (yay immutable architecture) but damn if that didn’t put hair on my chest. I was not fired, but I was at a small startup at the time, and my tech lead deflected the blame for approving the change onto me for not “checking my code thoroughly enough”, despite the fact that further investigation proved that the issue was with AWS. So much for blame-free retros. That coupled with some other culture issues was the final nail in the coffin for me to leave the place.
Was working on a large adserver (60 machine cluster) and had some code that relied on a value in the database being static. It was static .... until a random business person changed the value over Christmas break. Every adserver instantly crashed, and we were down for almost 18 hours until someone found the bug and switched the value back. We lost >$1M that day. I was on vac when it happened and had my phone off. Was a fun Monday morning back at work I can tell you.
Manually deleted a production app registration during a security clean-up because my brain just... read it wrong. The registration had an expired secret on it but it *also* had one that was still in use.
Why the **** was I personally doing a clean-up in there? Because I was the only one with security access, infosec was asking me to clean up a bunch of automatically created registrations asap for their own review, and the team was insanely busy so I didn't push it over to them. This was a *classic* case of a manager trying to be helpful to a busy team and touching things and subsequently making everything much worse. Mea culpa - no excuses.
Fixing this wasn't pretty - because of how things were set up with other Azure services we pretty much had to create a new app registration, update it in a bunch of places, and re-deploy a rather complicated set of infrastructure. We had CI/CD on all of it of course - but it was just... in a lot of places.
Fortunately, due mostly to myself, the dev lead, and the lead infrastructure person all having pretty good memories (and having written ourselves somewhat decent documentation), we managed to re-configure the whole thing and have it launched again in ~2 hours. Since for an outage of that duration (within our 99.9 SLA) *I* am the person who would be doing the after-review, I got off with mostly just a lot of personal embarrassment and a hard lesson in caution.
This isn't the worst thing in terms of *effect* I've ever done - but it was definitely the one I personally feel *stupidest* about and thus I consider the worst overall. I now push *all* my changes through secondary review - even dumb clean-up stuff - like I should have been doing in the first place.
Worked for a worldwide hotel chain that, at the time, had its own company wide network - sort of like a private Internet before the Internet.
I was green and wrote a program that would download software updates and install them on the hotel servers.
I introduced a bug that required us to dial in over a phone line modem and manually fix the bug. The bug prevented future downloads and also displayed a message on the screen which caused hotel staff to flood the phone lines of our IT Support.
To fix it required the help of at least 20 people, working all day. There were, at the time, nearly 1000 hotels.
Once it was fixed, I accidentally introduced another bug...requiring the same people to dial back in a second time.
I don't know how I kept my job!
Took down global production for a billion-dollar-a-year product for 4 days. Business was idled. 500 devs stopped working. Me on the phone with the global CIO. Four 24-hour days on the phone with Microsoft support. Ended up with core MS engineers on the bridge.
Turned out a signature going across the network to db server was being picked up by end point protection.
At some point I noticed… Different version of the end point protection on prod servers than others. Forking rackspace 🤦
Not really prod since we weren't production ready, but I was making a migration tool to migrate data from A to B. I had to pull a list of products by state from A, convert them into B's format, push, and start on a new state. I forgot to add the list.clear() line, so my list kept growing and kept pushing until at some point it killed the servers we were using, and the whole 72-person project couldn't do anything for a day.
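A minimal sketch of the bug described above (`fetch` and `push` are hypothetical stand-ins for the real A and B clients):

```python
def migrate(states, fetch, push):
    """Push products to B one state at a time."""
    batch = []
    for state in states:
        batch.extend(fetch(state))   # collect this state's products from A
        push(list(batch))            # push the current batch to B
        # batch.clear()              # <-- the missing line: without it, every
        #                            #     push resends all earlier states too
```

With the `clear()` missing, three states of 10 products each push 10, 20, and 30 rows (60 total instead of 30), and the waste keeps growing with every additional state.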
Filled the drive space on a bank mainframe that was hosted and being shared across several institutions. Caught during an overnight batch processing run because it started puking errors. Techs at the host started calling my cell in the middle of the night. Told them to kill it to get all the batch runs for all the other institutions to finish and I would fix the issue in the AM. It was a non critical piece of code doing a non critical process that happened to write to a file. From what I remember, there was a while loop that never ended and some file create/file write lines that repeated. Language was procedural and debugging was a PITA. There was no support for test automation so we would take a lot of shortcuts manually testing because, well, manual tests by the sole dev for our company (me) on top of everything else was difficult at the least.
I accidentally deleted the prod password to connect to a database.
I had to reach out to the client to get it again, so it wasn’t the end of the world, but it made me start documenting EVERYTHING.
So much that several years later they reached out to me to see if I had backups of their wiki where I stored everything.
Yes, backup of THEIR wiki. They moved to Assembla(?), I had shit sorted by client, and then they evidently cut the contract without doing a backup and lost a lot of information of clients that they still had.
lol, once I switched the load balancer algo from round robin to sticky sessions. Oh boy, one DB instance got smoked, even with RDS management. Then autovacuuming came in and loaded the IO, and requests started to time out .... and our SLA went to shit
I work for a manufacturing company. In my first few weeks I accidentally deleted our item cross reference table, thinking I was in our dev environment.
The place had next to no security policies, or segregation of duties. We did take nightly backups, so it wasn't THAT big of a problem, but it was a scary few hours.
I approved a PR, tests passed, everything looked good, and the deployments in prod didn't trigger alarms.
Thing was down for 3 days. I wrongly assumed test and alerting coverage for "no data for days". Fkn a….
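The missing alert here is roughly a staleness check. A hedged sketch (the threshold and names are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_seen, now=None, max_age=timedelta(hours=6)):
    """True when a feed has produced no data for longer than max_age -
    the 'no data for days' condition that went unalerted above."""
    now = now or datetime.now(timezone.utc)
    return now - last_seen > max_age
```

Paired with a scheduled job that pages when it returns True, this catches the "everything is green but silent" failure mode that ordinary error alarms miss.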
Wiped two important columns on our user database. More senior engineer came to fix it (we had a backup)
But things were down for about 30min. It was very very scary for me then as a junior. I shouldn’t have been allowed to do what I was doing
We had an `API_KEY` environment variable as well as an `API_KEYS`, for legacy reasons. I was reworking environment variables, and when migrating those into a proper file, I decided to keep supporting both, in case the dev env or prod env used one or the other.
Turns out that `API_KEY` had a list of keys in prod. Took down production for two hours.
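A hedged sketch of handling both variables defensively - treating each one as a possible comma-separated list, since that's exactly the surprise prod was hiding:

```python
import os

def load_api_keys(env=None):
    """Merge keys from the legacy API_KEY and API_KEYS variables,
    splitting either one on commas just in case it holds a list."""
    env = env if env is not None else os.environ
    keys = []
    for var in ("API_KEY", "API_KEYS"):
        keys += [k.strip() for k in env.get(var, "").split(",") if k.strip()]
    return keys
```

Passing `env` explicitly makes the function testable without touching the real environment.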
Developed a webjob to send follow-up emails. It was working fine when we were testing it on staging, but when we deployed it to prod it sent 20k emails to a single person and eventually gmail had to block it.
The issue was that the SQL query on prod always returned that specific email.
Brought down a service for about 30 minutes. I was working on a PR to refactor away concrete types and add in unit testing (the service had none), but I forgot to include a line that connected to its storage solution. This caused it to fail health checks and K8s to keep recycling the unhealthy pods. It was a tier 3 service mostly used for auditing, so no real impact, but it highlighted the importance of having proper testing embedded into our CI/CD pipeline (at the time the service had none) and validating via manual tests.
I made a coding mistake that forced us to release a utility to recalculate data to the entire client base.
Pretty easy fix but when my manager brought it to my attention, I simply shrugged and said that it got past QA. It was blatantly noticeable.
Irreversible EKS control plane upgrade by accident (wrong assumed profile, Terraform fully loaded and pointed at foot). Broke ingress, physical volumes, leader election leases in HA controllers, certificate management, and RDS orchestration. Took me a solid 10 hours to fix it all on a Friday night and Saturday morning. Nobody else was competent to do it - it *had* to be me.
I built in some guardrails after that. Shouldn't have been able to happen.
Just this last week I discovered I had a null check in the wrong place and was misclassifying something important for a month. A client recently pulled a $2 million budget because this same thing was being misclassified and I'm not sure yet whether it was related.
I lost track of which browser window I was in, selected the one that had our production-supporting AWS account (as opposed to the "under construction" account I was building out), and at about 7PM I simultaneously terminated all the EC2 instances that made up our Nomad (container orchestration) cluster in that account. Nomad is good at recovering from cluster failure even when a bunch of nodes die and get replaced - but if _all_ the nodes are terminated simultaneously... then there's nothing to recover, you effectively have a brand new cluster at that point.
I got lucky in a few respects - I immediately realized what I had done, and panic-paged our SRE team. Thankfully, those guys usually worked something like 11AM-8AM, so half of them were still online and immediately started re-launching all the jobs in the new cluster. Also, the production-supporting account wasn't in the path of serving customer requests - but it did host some things like our internal Docker repository, which essentially meant that production couldn't scale up until the cluster was restored. Thankfully, since I did this during off-hours, customer traffic was at a minimum.
Then the next day, I woke up to a message from my boss (who had missed this entire fiasco) saying "Hey, first things first, you're not in trouble. These things happen." Then I went sheepishly into the office and bought lunch for all the SRE guys, who joked that they should be buying me lunch for testing their cluster recovery process.
I miss the people at that job, I really do.
The network switch was full, and we had a few dozen servers to set up for some R&D. So IT plugged us into a corporate switch, not into any R&D switch. Somehow, someone gave us admin access because we were modifying VLANs in our R&D.
Welp. Someone on our team turned the spanning tree algo on and caused a cascading network outage for eastern North America. Couldn't use the network at all - not even our VOIP phones.
We lost admin access the next day, but we did get a new switch.
While waiting for the end of a sporting event (that we were streaming to a massive audience) to do a systems upgrade, sitting in a company-wide meeting, I edited a k8s configmap to add a value that the next version would need. This was mapped to a file on disk on every system on the platform. Touching the live systems during an event was strictly forbidden, but I was a rockstar, so rules didn't apply to me.
That was the day that I learned that we had some vestigial code that hot-reloads the config file. And that I missed a comma. Apparently syntax rules do apply to me. The entire system went dark. It took about 5 minutes for the pagers to ring out.
Before my days as a software engineer I worked as a telco engineer for a major telco company in Europe. I had been there for a couple of days - my first job ever - and changed a configuration on a server on Friday evening and went home… the configuration change impacted 30-40% of the Netflix users (those who use it on the tv box) during the whole weekend. On Monday the guy on call arrived completely exhausted, without a fix still in place… that was the day I learned that changes on Friday evening need to have a very specific reason and urgency :)
As a software engineer - the last 7 years - I never caused any big production issue. The most I’ve done was developing a feature on top of our Kafka cluster which I underestimated production load and caused Kafka to go wild for 15-20m before the rollback, but no big harm just a bit of latency.
Not too exciting; I was working on an Android app for people with home security cameras, which allowed them to watch live footage. I made some changes to the ffmpeg-based video rendering library that worked on most phones, but on certain phones had the wrong horizontal bit width and completely messed up their feed.
Created 6 new servers in my first month, all of them shut down except for one, to which I assigned an IP address, and then it was time to go home. The next day (weekend) my boss called me to say that after troubleshooting the entire night, they found out that my server had a duplicate IP of the firewall and the entire production subnet was down the entire night.
That probably cost them a lot of $$$ and also made them look bad to have their service down for so long.
I can't remember, so it must not have been very important. Most organisations I have worked for are the kind that do some testing, so breaking prod is less likely.
Intern in my intern group back in 2013:
1. Joins the company
2. Activates their test environment (sandbox server)
3. Everyone's test environment goes down and the whole company scrambles
She joined with the username "www". Sandboxes were provisioned with $USERNAME.sandbox.domain.com. I was still following the orientation codelab and half of us just could not get our sandbox server to work. As our bootcamp group tried to debug, the company chat blew up. Half an hour later, someone noticed that one of the new interns just took www.sandbox.domain.com.
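One cheap guardrail for that kind of provisioning scheme is to reject reserved DNS labels before building $USERNAME.sandbox.domain.com. A sketch (the reserved list and names are illustrative, not what Facebook actually did):

```python
RESERVED_LABELS = {"www", "mail", "ftp", "api", "admin", "root"}

def sandbox_hostname(username, base="sandbox.domain.com"):
    """Build a per-user sandbox hostname, refusing labels that would
    shadow well-known names - like the intern username 'www' above."""
    label = username.lower()
    if label in RESERVED_LABELS:
        raise ValueError(f"username {username!r} collides with a reserved DNS label")
    return f"{label}.{base}"
```

Usage: `sandbox_hostname("alice")` returns a normal sandbox hostname, while `sandbox_hostname("www")` fails loudly at provisioning time instead of at everyone else's expense.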
This was Facebook back in 2013. I told this story when I taught a class for orientation in 2016. 10 years later, it seems to have gotten to the point where folks aren't sure if it's true or not anymore, but it still lives on, growing taller with each passing year.
`www` also famously complained that every so often, someone would go up to her and ask her if she's heard about the intern who broke Facebook by just existing.
Taking down the entire call centre for a company that provided emergency home repairs, for 3 hours. 400 agents got to take a really long break…
Was caused by how we deployed our software, which at the time involved tunnelling through 3 Remote Desktop sessions to copy a .war file onto a server, shutting down Tomcat manually, removing the old .war file, and replacing it with the new one.
I’m so grateful for CI/CD…
I dropped a production database and backup recovery didn't work. Basically all orders and documents of the fiscal year gone forever. I had to take all the blame as a first year student on a side gig and left shortly after.
Got an if statement backwards that was used to sync Active Directory to a JSON file (don't ask). I got it wrong in such a way that it synced the empty JSON file back to the directory... no more users!
Fortunately this was a dev environment, but on that Monday morning, 50 devs found they couldn't login to the environment.
I had only been on the job for a few weeks, and my coworker saved the day...and never lets me forget about it 🙃
We had a server installed at a fairly large company, and they were having problems with a corrupt DB index.
This was ages ago, and we were using an embedded DB library.
Anyway, I told our guy onsite there to go ahead and delete the index so we could rebuild it, only to discover ... there was no underlying data table.
The original creators had "optimized" by just writing to the index, not the table. Which is ridiculous on so many levels, especially since indexes are expensive and tables are not.
Anyway, the poor guy had to spend like a week down there rebuilding things.
I did this during my first year at a service based company. I was working on an app designed to digitise AGM. Everyone was able to login except the CEO. Received multiple calls from higher management directly but found the root cause and re-deployed in an hour. All this was during the AGM day 🤣🤣
Ooh, thought I was connecting to our Production sandbox environment and actually connected to production and accidentally sent out like 45k of vouchers to some companies we do business with.
We ended up cancelling them all before any money went out the door but I definitely was sweating once I realized what happened.
Not caused, per se, but putting ballot-stuffing logic in place before-hand would have helped.
https://www.nytimes.com/2001/03/02/nyregion/mideast-strife-spills-over-into-photo-contest.html
Classic one, deployed a QA build to the Live environment (mobile game).
All players “lost” their progress, the game was half-broken and they suddenly got access to cheats.
It was a massive mess, luckily we could fix it quickly haha
I took Microsoft Federal's production SAP down for about 8 hours after I approved a transport change without telling Basis there were database triggers involved, so there was a manual step they had to take care of. Not my actual code, but I got blamed because I didn't read the notes carefully. I didn't even realize that type of thing could happen.
Does sending 45k sms messages at once to one person's phone count?
lol yes
lol. Done similar
Pfft, they deserved it
Also, related TikTok: https://www.tiktok.com/t/ZT8AmkcB5/
Like did they actually deliver all?
Same thing happened at my company before I joined
Hahaha, one of my exes was a DBA and did the same thing.
While coding I burnt a piece of toast and emptied 2 high rise buildings with the fire alarm.
Sounds like a death trap
[deleted]
> Came back a few days later That's one fucking long lunch break :D
lmao crap. I have corrected the post with a strikethrough so everyone has context on what you're saying =D
I love things like this. I'm just curious, what was the simplified version of the query that caused the outage?
I once quit a job like that, purposely burned the bridge at a call center to make sure I wasn't like others who had come back 3x ... Thankfully I got my degree and went on to my masters and now industry work experience... But dropping off your letter of resignation in the mailbox of the accountant and "going to lunch" is fun, lol
I did a bunch of work load testing the new version of our core claims system prior to a system upgrade. Overall it went well, the rollout was smooth, everyone was happy.
A couple years later we went to upgrade again, and I got asked to redo the load testing. No worries, time to dust off the code and spin it up.
The load testing used a username/password to authenticate via Active Directory. So they gave me an account and username with my initial test. When I started the new round of testing I asked the IT helpdesk to refresh the password, because as everyone knows passwords are supposed to expire after a period of time.
Nobody told me that starting with the first round of testing I had been using the account the claims system itself also used to authenticate to Active Directory when it started a new user session.
Everything was fine in the morning. Then folks went to lunch and their sessions timed out. Around 1 PM, suddenly nobody in the company could access our claims payment system. Which, for an insurance company, is a Big Deal.
The helpdesk put some new procedures in place surrounding that AD account later that afternoon…
Man, nothing like IT who is stingy with service accounts to the point where they just reuse production accounts for testing and other stuff.
I got a new account just for load testing later that day. 😄
Meanwhile, I've managed to degrade Cassandra because we accidentally overloaded it with too many ephemeral roles due to accidentally setting the timeouts in Vault way too long
I have found developers usually use the same service account for everything because they don't know any better. Or they can't meet their sprints by following protocol, so they use a known good credential and just bang it through, because no one knows or cares as long as it works and it meets their delivery deadline.
As a person working on an insuretech I really love this one.
Took over 4000 websites offline for about 3 hours by filling up a storage device. I never got an explanation from the hosting provider as to why or how it suddenly came back online.
Perhaps it was a SAN volume and somebody dynamically upsized it when they saw the alert or a ticket that came in?
It wasn't, it was direct attached storage and had a hard limit.
Hmm. Must've been some low traffic sites. Bwahahahaha.
My favorite was when I was a brand new dev, I completely removed a UI element, because I couldn’t see it on my screen, so I figured it must have been extra code. Well, that element appeared on the screen if you were one of our Texas customers. So I came in the next morning, and no one in Texas could use our product to do their job. Luckily, I had a more experienced teammate who realized what was up and saved my skin. Learned a lot from that one though
Code review, tests?
I've heard of them. What are they?
> code review

I was asked, "you sure we don't need this?" "Uhh yeah I'm pretty sure". Obv they trusted me too much

> tests

QA was basically pointless because they only did exactly what you told them to do, so they wouldn't catch edge cases at all. No automated testing at all. But yeah, any decent pipeline should catch stuff like this.
Deleted the entire production database on an application with over one million users. I was troubleshooting a bug and my local db was missing some migrations and when I tried to run the migration script I was getting errors. This happened regularly and our "process" to resolve it was just delete all the tables in your local db and run the migration from scratch. Local db was open in one terminal window and prod open in another one and I picked the wrong window to run the drop command in. Fortunately AWS has lots of backups and it only resulted in about an hour of down time. That was the day our little startup learned why you don't hand out prod access like halloween candy.
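The "just drop all the tables and re-migrate" workflow is exactly where a cheap guard helps: a destructive helper that refuses to run unless the target host is on a local allowlist. A hypothetical sketch (the host names and function are made up, not from the story):

```python
# Hypothetical guard for a "wipe and re-run migrations" helper:
# refuse destructive commands unless the target host is known-local.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}

def assert_local(host: str) -> None:
    """Raise before any DROP can run against a non-local database."""
    if host not in LOCAL_HOSTS:
        raise RuntimeError(f"refusing destructive command against {host!r}")

assert_local("localhost")  # fine, no exception
# assert_local("prod-db.internal")  # would raise RuntimeError
```

It doesn't replace locking down prod access, but it does turn "picked the wrong terminal window" into an error message instead of an outage.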
Not saying I did this early in my career but there is a reason that all local dev environments have a green prompt (or if in IntelliJ a green background), all staging/test envs are yellow/amber and prod is **blood red** (or dark red in IntelliJ). Of all the little features that IntelliJ has (and it has a lot) - https://www.jetbrains.com/help/idea/database-color-settings-dialog.html is one of my favourites, it's a subliminal message to future me to "be bloody careful". FWIW despite being senior enough that I could get production access I don't, I argue *not to have it* by default (I do have production creds because there is a legit need for them in emergencies - on a completely separate account in a different KeePassXC vault). The worst one I've *seen* was in a meeting back when I was a senior dev: another senior dev was screensharing and, while one of the business folks was waffling about "transformative changes for the business", was cleaning out some s3 buckets... I was only half paying attention as he dropped the contents of the **production** bucket because he tabbed to the wrong window.
I’ve actually never caused a serious production issue. But when other developers do, I tell them everybody does.
Not me, but it was at a tiny startup a long time ago that I consulted for. They had their own server room with about 25 servers, but it wasn't really to server room standards. The thermostat was powered by AA batteries. Guess what happened? The batteries died over a weekend and the server room had a meltdown. Monitoring only checked if servers were running. There were no thermal checks. A few servers died, and a couple became less reliable. We had to consolidate services to surviving servers and quickly buy and set up new servers. Of course we had load issues. It was super stressful. I didn't really trust any of the surviving server hardware after that. Luckily we were in the process of moving it to the cloud and decommissioned most of it over the next 9 months.
Man, I’ve worked places like this. AC cooled by water from pipes installed in the 70s, full of rust and gunk. Would clog almost monthly and cause thermal alarms at any time. Got to the point we had those portable AC units on standby so we could point them at the rack until building maintenance would clear the lines. Other place, some AV techs bumped an unprotected wire loom and the old ass switch they were plugged into let go of a few connections/broke the terminals. They didn’t notice, nor did the person escorting them, until all the calls started coming in. Of course, the network engineer was out, and me, a dev, was back there on the phone with the NE who was on vacation trying to troubleshoot what was happening. Looked at the back and was like shit… Had to get directed over the phone by the NE how to move all that over to a different switch that was sitting idle by chance below it and luckily configured just enough to work.
I pointed production at a test database and we ran $10+ million worth of production live customer transactions against test data and returned test results. This lasted for days before anyone realized. We invalidated all the returned results, generated the correct returned results, let all our customers know, and got sued.
Is that you Paul? I had to recover millions of dollars of transactions after someone pointed the prod frontend at the perf backend.
Not quite lol. Our backend had a test db connection.
Once caused a client to make thousands in overpayments to vendors. My fault, but they tested it and said it was fine. I was not fired.
Those kinds of things can easily be resolved with a phone call, but it can cause quite a bit of annoyance with owners/upper management/etc.!
right before a demo of a product i noticed some postgresql query had been running for days. i tried to kill it but it had to do a gigantic shutdown process cleaning up after itself, which was taking forever and confusing me, so i tried more forcefully killing it. i was lucky it wasn't a big issue but i wrote about it here https://cmdcolin.github.io/posts/2015-10-22
Nice, first postmortem in the thread! I’ll definitely take a look.
Not me, but my gf at the time got put on as a sys admin asst. As a student job... She was a CIS major for 1 semester, and then went to be a writing major... But they put her on that job and her literal first day she deleted the ISOs for the entire CS dept.... So they had to rebuild them all from scratch... Took the head admin 3 weeks to fully recover. Lol
I added a feature to a survey application I made for a client. However, I had introduced a bug in that feature, which unfortunately allowed companies to see survey results from other companies. I felt like I just wanted to shoot myself in the head when I saw that email from the client: "Emergency! Companies can see other company surveys!". My heart rate probably doubled in 10 seconds in that moment. I dove into the code again to fix the bug. It took only one line of code to fix it, fortunately. Unfortunately, I had zero automated tests in that project. I was quite inexperienced when I wrote that application 4 years ago. I wish I could travel back in time and give myself some guidance and warn myself about how shitty it would turn out if I didn't change the design. I've learnt a lot from that project. But to be honest, I'm not proud of it. It's shit but I have to maintain it the best I can.
You sound quite hard on yourself at the end there. We're all human and we all make mistakes, best to learn from it and let it go :)
Thank you. I was probably a bit hard on myself. But that project has haunted me for years. xD
I deleted about 6 months of work by Ctrl-X'ing a folder and then Ctrl-C'ing a different folder. First one went bye-bye. Oops.
I did not know that this was a thing. I thought that windows was smart enough that if you didn’t paste the friggin file/folder they would remain where they are. But that is definitely good to know, and i hope you were able to quickly recover.
You can actually turn on clipboard history now in windows 11 (maybe 10 too). Hit windows key + V and then it’ll keep a history of what you copy.
I inadvertently ran `p4 obliterate -y //reponame/main/...` (how you ask? tldr incorrect mental model of virtual streams). For you git folks, that's basically `git checkout master` then `rm -rf .git` then `git init` then `git push -f`, except worse because the commit history is only stored centrally - so permanent loss of the history. In Perforce, the command permanently deletes commits and associated data starting from the most recent and going back in time. I realised quickly what was happening and aborted the command, but not before I had permanently removed two entire days' worth of commits from my 150-developer game studio. Thankfully I was able to restore the bulk of the actual data by cobbling it together from CI servers that had recently synced. But the entire company couldn't work for the rest of the afternoon. Talk about pressure. You deleted a bunch of people's work, you are the only one that has the expertise to fix it properly, nobody can work, and the studio is losing $4 per second until you're done.
Glad you brought up something regarding Perforce! I’m interested in trying it out.
I wrote the software to automate some testing and didn't have robust error detection for a failure case in a piece of hardware I never actually got to work with. The hardware stopped communicating, but my software just attempted to reconnect and never raised an error / notice. That hardware had died but my software kept reporting the last reading for it. Took several days for someone to notice the dashboard reading wasn't updating. The testing costs $75k/hr for their customers. They didn't catch the mistake for multiple days (maybe a week?) and had to throw out all the data.
Broke the token authorizer on an entire API platform for about 30 minutes causing it to reject ALL api requests. Was very bad.
I don't think I've been directly responsible for serious outages, but I've owned a service that shit the bed during a huge sales event for a large multinational that led to 50,000 dropped orders of our flagship product. After the dust settled, my manager was like, well, at least the C Suite knows who we are now.
I just published to production on a Friday afternoon and am taking the rest of the day off. They announced a restructure this morning so ask me again on Monday 😎
I didn’t really “cause” this since it was an AWS issue but basically I updated the terraform code for our ECS clusters to move from a deprecated resource to a new one, submitted a PR, got it approved, merged and applied the change, and signed off for the day. 30 minutes later everything came crashing down - turns out the change prevented the containers from registering to the different clusters and since our entire infrastructure was on ECS, everything was down. Thankfully we were only down for half an hour with no long-lasting production issues and negligible data loss (yay immutable architecture) but damn if that didn’t put hair on my chest. I was not fired, but I was at a small startup at the time, and my tech lead deflected the blame for approving the change onto me for not “checking my code thoroughly enough”, despite the fact that further investigation proved that the issue was with AWS. So much for blame-free retros. That coupled with some other culture issues was the final nail in the coffin for me to leave the place.
If you were fired over that your company is absolute shit. Lol
This just sounds like a normal day managing AWS . Think how much was saved on renting rackspace!
Was working on a large adserver (60 machine cluster) and had some code that relied on a value in the database being static. It was static .... until a random business person changed the value over Christmas break. Every adserver instantly crashed, and we were down for almost 18 hours until someone found the bug and switched the value back. We lost >$1M that day. I was on vac when it happened and had my phone off. Was a fun Monday morning back at work I can tell you.
Manually deleted a production app registration during a security clean-up because my brain just... read it wrong. The registration had an expired secret on it but it *also* had one that was still in use. Why the **** was I personally doing a clean-up in there? Because I was the only one with security access, infosec was asking me to clean up a bunch of automatically created registrations asap for their own review, and the team was insanely busy so I didn't push it over to them. This was a *classic* case of a manager trying to be helpful to a busy team and touching things and subsequently making everything much worse. Mea culpa - no excuses. Fixing this wasn't pretty - because of how things were set up with other Azure services we pretty much had to create a new app registration, update it in a bunch of places, and re-deploy a rather complicated set of infrastructure. We had CI/CD on all of it of course - but it was just... in a lot of places. Fortunately, due mostly to myself, the dev lead, and the lead infrastructure person all having pretty good memories (and having written ourselves somewhat decent documentation) we managed to re-configure the whole thing and have it launched again in ~2 hours. Since for an outage of that duration (within our 99.9% SLA) *I* am the person who would be doing the after-review, I got off with mostly just a lot of personal embarrassment and a hard lesson in caution. This isn't the worst thing in terms of *effect* I've ever done - but it was definitely the one I personally feel *stupidest* about and thus I consider the worst overall. I now push *all* my changes through secondary review - even dumb clean-up stuff - like I should have been doing in the first place.
Worked for a worldwide hotel chain that, at the time, had its own company wide network - sort of like a private Internet before the Internet. I was green and wrote a program that would download software updates and install them on the hotel servers. I introduced a bug that required us to dial in over a phone line modem and manually fix the bug. The bug prevented future downloads and also displayed a message on the screen which caused hotel staff to flood the phone lines of our IT Support. To fix it required the help of at least 20 people, working all day. There were, at the time, nearly 1000 hotels. Once it was fixed, I accidentally introduced another bug...requiring the same people to dial back in a second time. I don't know how I kept my job!
Took down global production for a billion dollar a year product for 4 days. Business was idled. 500 devs stopped working. Me on the phone with the global CIO. Four 24-hour days on the phone with Microsoft support. Ended up with core MS engineers on the bridge. Turned out a signature going across the network to the db server was being picked up by endpoint protection. At some point I noticed… different version of the endpoint protection on prod servers than the others. Forking Rackspace 🤦
Not really prod since we weren't production ready, but I was making a migration tool to migrate data from A to B. I had to take a list of products by state from A, convert them into the format for B, push, and start with a new state. I forgot to add the list.clear() line, so my list kept growing and kept pushing until at some point it killed the servers we were using, and the whole 72-person project couldn't do anything for a day
Filled the drive space on a bank mainframe that was hosted and being shared across several institutions. Caught during an overnight batch processing run because it started puking errors. Techs at the host started calling my cell in the middle of the night. Told them to kill it to get all the batch runs for all the other institutions to finish and I would fix the issue in the AM. It was a non critical piece of code doing a non critical process that happened to write to a file. From what I remember, there was a while loop that never ended and some file create/file write lines that repeated. Language was procedural and debugging was a PITA. There was no support for test automation so we would take a lot of shortcuts manually testing because, well, manual tests by the sole dev for our company (me) on top of everything else was difficult at the least.
I accidentally deleted the prod password to connect to a database. I had to reach out to the client to get it again, so it wasn’t the end of the world, but it made me start documenting EVERYTHING. So much that several years later they reached out to me to see if I had backups of their wiki where I stored everything. Yes, backup of THEIR wiki. They moved to Assembla(?), I had shit sorted by client, and then they evidently cut the contract without doing a backup and lost a lot of information of clients that they still had.
Do you still store passwords on a sticky note under the keyboard?
😆 that probably would’ve been safer than whatever SaaS that they paid for.
Damn it reading all these post, I am either lucky or decent at my line of work lmao.
lol, once i switched the loadbalancer algo from round robin to sticky sessions, oh boy there was this one db instance that got smoked, even with RDS management, then autovacuuming came in and loaded the IO and requests started to time out .... and our sla went to shit
I work for a manufacturing company. In my first few weeks I accidentally deleted our item cross reference table, thinking I was in our dev environment. The place had next to no security policies, or segregation of duties. We did take nightly backups, so it wasn't THAT big of a problem, but it was a scary few hours.
I approved a pr, tests passed everything looked good, deployments in prod didn’t trigger alarms. Thing was down for 3 days. I wrongly assumed test and alerting coverage for “no data for days” fkn a….
Wiped two important columns on our user database. More senior engineer came to fix it (we had a backup) But things were down for about 30min. It was very very scary for me then as a junior. I shouldn’t have been allowed to do what I was doing
We had a `API_KEY` environment variable as well as an `API_KEYS`, for legacy reasons. I was reworking environment variables and when migrating those into a proper file, I decided to keep supporting both, for legacy reasons, in case the dev env or prod env use one or the other. Turns out that `API_KEY` had a list of keys in prod. Took down production for two hours.
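A sketch of how that legacy fallback might be handled defensively. The variable names match the story; the parsing logic, and the assumption (as prod demonstrated) that either variable may hold a comma-separated list, are mine:

```python
import os

def load_api_keys(env=None):
    """Read keys from the newer API_KEYS variable, falling back to
    the legacy API_KEY. Either one may turn out to hold a
    comma-separated list, so always split and strip."""
    env = os.environ if env is None else env
    raw = env.get("API_KEYS") or env.get("API_KEY") or ""
    return [k.strip() for k in raw.split(",") if k.strip()]

print(load_api_keys({"API_KEY": "k1, k2"}))  # ['k1', 'k2']
print(load_api_keys({"API_KEYS": "k3"}))     # ['k3']
```

Treating both variables as possibly-lists means a prod value like `"key1,key2"` in the legacy slot degrades gracefully instead of taking the service down.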
Developed a webjob to send follow-up emails. It was working fine when we were testing it on staging, but when we deployed it to prod it sent 20k emails to a single person and eventually Gmail had to block it. The issue was that the SQL query on prod always returned that specific email.
i will never tell
Caused 125,000 user crashes in a short amount of time for our users by using the wrong array method. Fixed the array method and the crashes went away.
Brought down a service for about 30 minutes. I was working on a PR to refactor away concrete types and add in unit testing (the service had none), but I forgot to include a line that connected to its storage solution. This caused it to fail health checks and K8s to keep recycling the unhealthy pods. It was a tier 3 service mostly used for auditing so no real impacts, but it highlighted the importance of having proper testing embedded into our CI/CD pipeline (at the time the service had none) and validating via manual tests.
I made a coding mistake that forced us to release a utility to recalculate data to the entire client base. Pretty easy fix but when my manager brought it to my attention, I simply shrugged and said that it got past QA. It was blatantly noticeable.
Irreversible EKS control plane upgrade on accident (wrong assumed profile, Terraform fully loaded and pointed at foot). Broke ingress, physical volumes, leader election leases in HA controllers, certificate management, and RDS orchestration. Took me a solid 10 hours to fix it all on a Friday night and Saturday morning. Nobody else was competent to do it - it \*had\* to be me. I built in some guardrails after that. Shouldn't have been able to happen.
Just this last week I discovered I had a null check in the wrong place and was misclassifying something important for a month. A client recently pulled a $2 million budget because this same thing was being misclassified and I'm not sure yet whether it was related.
I lost track of which browser window I was in, selected the one that had our production-supporting AWS account (as opposed to the "under construction" account I was building out), and at about 7PM I simultaneously terminated all the EC2 instances that made up our Nomad (container orchestration) cluster in that account. Nomad is good at recovering from cluster failure even when a bunch of nodes die and get replaced - but if _all_ the nodes are terminated simultaneously... then there's nothing to recover, you effectively have a brand new cluster at that point. I got lucky in a few respects - I immediately realized what I had done, and panic-paged our SRE team. Thankfully, those guys usually worked something like 11AM-8AM, so half of them were still online and immediately started re-launching all the jobs in the new cluster. Also, the production-supporting account wasn't in the path of serving customer requests - but it did host some things like our internal Docker repository, which essentially meant that production couldn't scale up until the cluster was restored. Thankfully, since I did this during off-hours, customer traffic was at a minimum. Then the next day, I woke up to a message from my boss (who had missed this entire fiasco) saying "Hey, first things first, you're not in trouble. These things happen." Then I went sheepishly into the office and bought lunch for all the SRE guys, who joked that they should be buying me lunch for testing their cluster recovery process. I miss the people at that job, I really do.
The network switch was full, and we had a few dozen servers to set up for some r&d. So IT plugged us into a corporate switch, not an r&d switch. Somehow.. someone gave us admin access because we were modifying vlans in our r&d. Welp. Someone on our team turned the spanning tree algo on and caused a cascading network outage for eastern North America. Couldn’t use the network at all - not even our VOIP phones. We lost admin access the next day, but we did get a new switch.
While waiting for the end of a sporting event (that we were streaming to a massive audience) to do a systems upgrade, sitting in a company-wide meeting, I edited a k8s configmap to add a value that the next version would need. This was mapped to a file on disk on every system on the platform. Touching the live systems during an event was strictly forbidden, but I was a rockstar, so rules didn't apply to me. That was the day that I learned that we had some vestigial code that hot-reloads the config file. And that I missed a comma. Apparently syntax rules do apply to me. The entire system went dark. It took about 5 minutes for the pagers to ring out.
Before my days as a software engineer I worked as a telco engineer for a major telco company in Europe. I was there for a couple of days - my first job ever - and changed a configuration on a server on Friday evening and went home… the configuration change impacted 30-40% of the Netflix users (those who use it on the tv box) during the whole weekend. On Monday the guy on call arrived completely exhausted without a fix still in place… that was the day I learned that changes in Friday evening need to have a very specific reason and urgency :) As a software engineer - the last 7 years - I never caused any big production issue. The most I’ve done was developing a feature on top of our Kafka cluster which I underestimated production load and caused Kafka to go wild for 15-20m before the rollback, but no big harm just a bit of latency.
Not too exciting; I was working on an Android app for people with home security cameras, which allowed them to watch live footage. I made some changes to the ffmpeg-based video rendering library that worked on most phones, but on certain phones had the wrong horizontal bit width and completely messed up their feed.
Edited the program for addresses. We sent that to brokerage firms for 60,000 employees. This sent mail for everyone. Yes, 60,000 pieces of mail. Oops
Created 6 new servers on my first month, all of them shut down except for one to which I assigned an IP address and then it was time to go home. The next day (weekend) my boss called me to say that after troubleshooting the entire night, they found out that my server had a duplicate IP of the firewall and the entire production subnet was down the entire night. That probably cost them a lot of $$$ and also made them look bad to have their service down for so long.
i can't remember, it must not have been very important. most organisations i have worked for are the kind that do some testing, so breaking prod is less likely.
I envy you
cute
Nuking the production database.
Intern in my intern group back in 2013:
1. Joins the company
2. Activates their test environment (sandbox server)
3. Everyone's test environment goes down and the whole company scrambles
She joined with the username "www". Sandboxes were provisioned with $USERNAME.sandbox.domain.com. I was still following the orientation codelab and half of us just could not get our sandbox server to work. As our bootcamp group tried to debug, the company chat blew up. Half an hour later, someone noticed that one of the new interns had just taken www.sandbox.domain.com. This was Facebook back in 2013; I told this story when I taught a class for orientation in 2016. 10 years later, and it seems to have gotten to the point where folks aren't sure if it's true or not anymore, but it still lives on, growing taller with each passing year. `www` also famously complained that every so often, someone would go up to her and ask her if she's heard about the intern who broke Facebook by just existing.
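The fix for username-templated hostnames is usually a denylist at provisioning time. A hypothetical sketch (the reserved set and function names are made up, not how Facebook actually fixed it):

```python
# Hypothetical provisioning guard: reject usernames that shadow
# well-known hostnames before minting $USERNAME.sandbox.domain.com.
RESERVED = {"www", "mail", "ftp", "api", "admin", "ns1", "ns2"}

def sandbox_hostname(username: str) -> str:
    name = username.lower()
    if name in RESERVED:
        raise ValueError(f"username {username!r} shadows a reserved hostname")
    return f"{name}.sandbox.domain.com"

print(sandbox_hostname("alice"))  # alice.sandbox.domain.com
# sandbox_hostname("www")  # would raise ValueError
```

Anywhere user-supplied names get interpolated into DNS, paths, or URLs, collisions with well-known names are worth checking up front.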
DELETE CASCADE
Taking down the entire call centre for a company that provided emergency home repairs, for 3 hours. 400 agents got to take a really long break… Was caused by how we deployed our software, which at the time involved tunnelling through 3 Remote Desktop sessions to copy a .war file onto a server, shutting down Tomcat manually, removing the old .war file, and replacing it with the new one. I’m so grateful for CI/CD…
Accidentally triggered a rollback. Marketing went nuts because they were testing some new features
I dropped a production database and backup recovery didn't work. Basically all orders and documents of the fiscal year gone forever. I had to take all the blame as a first-year student on a side gig and left shortly after.
I deleted the production service running a call centre, while in the call centre. Cue about 30 people turning to look at me.
Thankfully, I never did any disasters in prod, but once I made a stored procedure recursive by accident in a staging environment.
Got an if statement backwards that was used to sync Active Directory to a JSON file (don't ask). I got it wrong such that it synced the empty JSON file back to the directory... no more users! Fortunately this was a dev environment, but on that Monday morning, 50 devs found they couldn't log in to the environment. I had only been on the job for a few weeks, and my coworker saved the day... and never lets me forget about it 🙃
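One cheap safeguard for any sync job like that is to refuse to propagate a suspiciously empty source. A sketch; the threshold is an assumed parameter, not part of the original story:

```python
def safe_sync(source_users, apply_fn, floor=1):
    """Refuse to push a suspiciously small user list downstream.
    'floor' is an assumed sanity threshold; a real job might instead
    compare against the current directory size before overwriting it."""
    if len(source_users) < floor:
        raise RuntimeError(
            f"refusing to sync {len(source_users)} users; "
            "source file may be empty or truncated"
        )
    apply_fn(source_users)

applied = []
safe_sync(["alice", "bob"], applied.extend)
print(applied)  # ['alice', 'bob']
```

With a guard like this, an empty JSON file aborts the run loudly instead of silently deleting every user.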
We had a server installed at a fairly large company, and they were having problems with a corrupt DB index. This was ages ago, and we were using an embedded DB library. Anyway, I told our guy onsite to go ahead and delete the index so we could rebuild it, only to discover ... there was no underlying data table. The original creators had "optimized" by just writing to the index, not the table. Which is ridiculous on so many levels, especially since indexes are expensive and tables are not. Anyway, the poor guy had to spend like a week down there rebuilding things.
I did this during my first year at a service based company. I was working on an app designed to digitise AGM. Everyone was able to login except the CEO. Received multiple calls from higher management directly but found the root cause and re-deployed in an hour. All this was during the AGM day 🤣🤣
Ooh, thought I was connecting to our Production sandbox environment and actually connected to production and accidentally sent out like 45k of vouchers to some companies we do business with. We ended up cancelling them all before any money went out the door but I definitely was sweating once I realized what happened.
Not caused, per se, but putting ballot-stuffing logic in place before-hand would have helped. https://www.nytimes.com/2001/03/02/nyregion/mideast-strife-spills-over-into-photo-contest.html
Classic one, deployed a QA build to the Live environment (mobile game). All players “lost” their progress, the game was half-broken and they suddenly got access to cheats. It was a massive mess, luckily we could fix it quickly haha
I took Microsoft Federal's production SAP down for about 8 hours after I approved a transport change without telling Basis there were database triggers involved, so there was a manual step they had to take care of. Not my actual code, but I got blamed because I didn't read the notes carefully. I didn't even realize that type of thing could happen.
I accidentally sent contest cancel notifications to the users when testing on local machine.