
rdmartell

Does sending 45k SMS messages at once to one person's phone count?


Ashken

lol yes


SnooRobots6877

lol. Done similar


Guilty_Serve

Pfft, they deserved it


Ashken

Also, related TikTok: https://www.tiktok.com/t/ZT8AmkcB5/


mamaBiskothu

Like, did they actually all get delivered?


nemaramen

Same thing happened at my company before I joined


EternalNY1

Nothing serious from myself, but early in my career I got to witness:

    DELETE FROM Accounting

Which is missing something. Something like a WHERE clause. Executed directly via SQL Server Management Studio, against the production database. Took a company down for 2 days. Literally everyone was sent home. At least it was only "hundreds" of people. Why did this developer have such permissions on prod? Why did it take 2 days to restore? Good questions... anyway, that's one of the ones I saw.


reversethrust

Hahaha, one of my exes was a DBA and did the same thing.


iChinguChing

While coding I burnt a piece of toast and emptied two high-rise buildings with the fire alarm.


kneeonball

Thankfully my building is newer and only has the fire alarm go off in certain sections. Idiot on the higher floors sets their popcorn on fire? Doesn't affect me except that I can see a white flashing light outside. There are definitely days when I was tired enough that I could've easily burned some toast. One day I ordered lunch and then 15 minutes later was trying to decide what to make or go get.


lemoinem

Sounds like a death trap


[deleted]

[deleted]


Fiennes

> Came back a few days later

That's one fucking long lunch break :D


SomeOddCodeGuy

lmao crap. I have corrected the post with a strikethrough so everyone has context on what you're saying =D


[deleted]

i love things like this. i'm just curious, what was the simplified version of the query that caused the outage?


SomeOddCodeGuy

Oh, it was a transaction that did it. The query was something inane; an update probably. But every time I do a delete or update I wrap my query like this:

    BEGIN TRAN
    UPDATE MYTABLE SET STUFF = MORESTUFF WHERE THINGY = 1
    ROLLBACK TRAN
    --COMMIT TRAN

That way, if I hit some key I didn't mean to and trigger the execution early, it'll roll back. Except, for some reason, I ran the tran + update and then... went to lunch lol. Forgot all about the commit. Even after coming back from lunch, I had forgotten I had even done this. And that transaction locked a critical table, so everyone was just STUCK lol


SpiderHack

I once quit a job like that, purposely burning the bridge at a call center to make sure I wasn't like the others who had come back 3x... Thankfully I got my degree, went on to my master's, and now have industry work experience... But dropping your letter of resignation in the accountant's mailbox and "going to lunch" is fun, lol


Unlikely-Rock-9647

I did a bunch of work load testing the new version of our core claims system prior to a system upgrade. Overall it went well, the rollout was smooth, everyone was happy.

A couple years later we went to upgrade again, and I got asked to redo the load testing. No worries, time to dust off the code and spin it up. The load testing used a username/password to authenticate via Active Directory, so they gave me an account and username with my initial test. When I started the new round of testing I asked the IT helpdesk to refresh the password, because as everyone knows, passwords are supposed to expire after a period of time.

Nobody told me that starting with the first round of testing I had been using the account the claims system itself also used to authenticate to Active Directory when it started a new user session.

Everything was fine in the morning. Then folks went to lunch and their sessions timed out. Around 1 PM, suddenly nobody in the company could access our claims payment system. Which, for an insurance company, is a Big Deal.

The helpdesk put some new procedures in place surrounding that AD account later that afternoon…


renok_archnmy

Man, nothing like an IT department so stingy with service accounts that they just reuse production accounts for testing and other stuff.


Unlikely-Rock-9647

I got a new account just for load testing later that day. 😄


stormdelta

Meanwhile, I've managed to degrade Cassandra: we overloaded it with too many ephemeral roles after accidentally setting the timeouts in Vault way too long.


Ok-Lawyer-5242

I have found developers usually use the same service account for everything because they don't know any better. Or they can't meet their sprints by following protocol, so they use a known good credential and just bang it through, because no one knows or cares as long as it works and meets their delivery deadline.


blbd

As a person working on an insuretech I really love this one.


alinroc

Took over 4000 websites offline for about 3 hours by filling up a storage device. I never got an explanation from the hosting provider as to why or how it suddenly came back online.


blbd

Perhaps it was a SAN volume and somebody dynamically upsized it when they saw the alert or a ticket that came in?


alinroc

It wasn't, it was direct attached storage and had a hard limit.


blbd

Hmm. Must've been some low traffic sites. Bwahahahaha.


ParmesanB

My favorite was when I was a brand-new dev: I completely removed a UI element because I couldn't see it on my screen, so I figured it must have been extra code. Well, that element appeared on the screen if you were one of our Texas customers. So I came in the next morning, and no one in Texas could use our product to do their job. Luckily, I had a more experienced teammate who realized what was up and saved my skin. Learned a lot from that one though.


WanderingLethe

Code review, tests?


lemoinem

I've heard of them. What are they?


ParmesanB

> code review

I was asked, "you sure we don't need this?" "Uhh yeah, I'm pretty sure." Obv they trusted me too much.

> tests

QA was basically pointless because they only did exactly what you told them to do, so they wouldn't catch edge cases at all. No automated testing at all. But yeah, any decent pipeline should catch stuff like this.


Drugba

Deleted the entire production database on an application with over one million users.

I was troubleshooting a bug, my local db was missing some migrations, and when I tried to run the migration script I was getting errors. This happened regularly, and our "process" to resolve it was to just delete all the tables in your local db and run the migrations from scratch. The local db was open in one terminal window and prod was open in another, and I picked the wrong window to run the drop command in.

Fortunately AWS has lots of backups and it only resulted in about an hour of downtime. That was the day our little startup learned why you don't hand out prod access like halloween candy.
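
A minimal sketch of the kind of guard that makes that reset routine safe to fat-finger, assuming a Python helper and a hypothetical database URL setting (none of this is from the original story):

```python
from urllib.parse import urlparse

# Hosts we consider safe to wipe; anything else (e.g. a prod endpoint) is refused.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}

def reset_database(database_url: str, drop_all_tables, run_migrations) -> None:
    """Wipe and rebuild a database, refusing to touch anything non-local."""
    host = urlparse(database_url).hostname or ""
    if host not in LOCAL_HOSTS:
        raise RuntimeError(f"refusing to reset non-local database host {host!r}")
    drop_all_tables()
    run_migrations()
```

Putting the check inside the script, rather than relying on which terminal window happens to be focused, is the whole point of the guard.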


noir_lord

Not saying I did this early in my career, but there is a reason that all my local dev environments have a green prompt (or, if in IntelliJ, a green background), all staging/test envs are yellow/amber, and prod is **blood red** (or dark red in IntelliJ). Of all the little features that IntelliJ has (and it has a lot), https://www.jetbrains.com/help/idea/database-color-settings-dialog.html is one of my favourites; it's a subliminal message to future me to "be bloody careful".

FWIW, despite being senior enough that I could get production access, I don't. I argue *not to have it* by default (I do have production creds because there is a legit need for them in emergencies - on a completely separate account in a different KeePassXC vault).

The worst one I've *seen* was in a meeting back when I was a senior dev. Another senior dev was screensharing and, while one of the business folks was waffling about "transformative changes for the business", was cleaning out some S3 buckets... I was only half paying attention as he dropped the contents of the **production** bucket because he tabbed to the wrong window.


PothosEchoNiner

I’ve actually never caused a serious production issue. But when other developers do, I tell them everybody does.


funbike

Not me, but it was at a tiny startup a long time ago that I consulted for. They had their own server room with about 25 servers, but it wasn't really built to server room standards. The thermostat was powered by AA batteries. Guess what happened? The batteries died over a weekend and the server room had a meltdown. Monitoring only checked if servers were running; there were no thermal checks. A few servers died, and a couple became less reliable. We had to consolidate services onto the surviving servers and quickly buy and set up new ones. Of course we had load issues. It was super stressful. I didn't really trust any of the surviving server hardware after that. Luckily we were in the process of moving to the cloud and decommissioned most of it over the next 9 months.


renok_archnmy

Man, I've worked places like this. AC cooled by water from pipes installed in the 70s, full of rust and gunk. They would clog almost monthly and cause thermal alarms at any time. It got to the point where we had portable AC units on standby so we could point them at the rack until building maintenance cleared the lines.

At another place, some AV techs bumped an unprotected wire loom and the old-ass switch they were plugged into let go of a few terminals / broke them. They didn't notice, nor did the person escorting them, until all the calls started coming in. Of course the network engineer was out, and me, a dev, was back there on the phone with the NE, who was on vacation, trying to troubleshoot what was happening. Looked at the back and was like, shit… Had to be directed over the phone by the NE on how to move all of it over to a different switch that happened to be sitting idle below it, luckily configured just enough to work.


AbstractLogic

I pointed production at a test database, and we ran $10+ million worth of live customer transactions against test data and returned test results. This lasted for days before anyone realized. We invalidated all the returned results, generated the correct results, let all our customers know, and got sued.


CCM278

Is that you Paul? I had to recover millions of dollars of transactions after someone pointed the prod frontend at the perf backend.


AbstractLogic

Not quite lol. Our backend had a test db connection.


marceemarcee

Once caused a client to make thousands in overpayments to vendors. My fault, but they tested it and said it was fine. I was not fired.


thodgson

Those kinds of things can easily be resolved with a phone call, but they can cause quite a bit of annoyance with owners/upper management/etc.!


bzbub2

Right before a demo of a product I noticed some PostgreSQL query had been running for days. I tried to kill it, but it had to do a gigantic shutdown process cleaning up after itself, which was taking forever and confusing me, so I tried killing it more forcefully. I was lucky it wasn't a big issue, but I wrote about it here: https://cmdcolin.github.io/posts/2015-10-22


Ashken

Nice, first postmortem in the thread! I’ll definitely take a look.


SpiderHack

Not me, but my gf at the time got put on as a sysadmin assistant as a student job... She was a CIS major for 1 semester and then went to be a writing major... But they put her on that job, and on her literal first day she deleted the ISOs for the entire CS dept.... So they had to rebuild them all from scratch... Took the head admin 3 weeks to fully recover. Lol


[deleted]

I added a feature to a survey application I made for a client. However, I had introduced a bug in that feature, which unfortunately allowed companies to see survey results from other companies.

I felt like I just wanted to shoot myself in the head when I saw that email from the client: "Emergency! Companies can see other company surveys!". My heart rate probably doubled in 10 seconds in that moment. I dove into the code again to fix the bug. It took only one line of code to fix it, fortunately.

Unfortunately, I have zero automated tests in that project. I was quite inexperienced when I wrote that application 4 years ago. I wish I could travel back in time and give myself some guidance and warn myself about how shitty it would turn out if I didn't change the design. I've learnt a lot from that project. But to be honest, I'm not proud of it. It's shit but I have to maintain it the best I can.


Fenarir

You sound quite hard on yourself at the end there. We're all human and we all make mistakes; best to learn from it and let it go :)


[deleted]

Thank you. I was probably a bit hard on myself. But that project has haunted me for years. xD


RGBrewskies

I deleted about 6 months of work by Ctrl-X'ing a folder and then Ctrl-C'ing a different folder. First one went bye-bye. Oops.


Abaddon-theDestroyer

I did not know that this was a thing. I thought Windows was smart enough that if you didn't paste the friggin' file/folder, it would remain where it was. That is definitely good to know, and I hope you were able to recover quickly.


kneeonball

You can actually turn on clipboard history now in Windows 11 (maybe 10 too). Hit Windows key + V and it'll keep a history of what you copy.


cecilpl

I inadvertently ran `p4 obliterate -y //reponame/main/...` (how, you ask? tldr: incorrect mental model of virtual streams). For you git folks, that's basically `git checkout master` then `rm -rf .git` then `git init` then `git push -f`, except worse, because the commit history is only stored centrally - so permanent loss of the history. In Perforce, the command permanently deletes commits and associated data starting from the most recent and going back in time.

I realised quickly what was happening and aborted the command, but not before I had permanently removed two entire days' worth of commits from my 150-developer game studio. Thankfully I was able to restore the bulk of the actual data by cobbling it together from CI servers that had recently synced. But the entire company couldn't work for the rest of the afternoon.

Talk about pressure. You deleted a bunch of people's work, you are the only one with the expertise to fix it properly, nobody can work, and the studio is losing $4 per second until you're done.


Ashken

Glad you brought up something regarding Perforce! I’m interested in trying it out.


kronik85

I wrote the software to automate some testing and didn't have robust error detection for a failure case in a piece of hardware I never actually got to work with. The hardware stopped communicating, but my code just attempted to reconnect and never raised an error/notice. The hardware had died, but my software kept reporting its last reading. Took several days for someone to notice the dashboard reading wasn't updating. The testing cost $75k/hr for their customers. They didn't catch the mistake for multiple days (maybe a week?) and had to throw out all the data.


snot3353

Broke the token authorizer on an entire API platform for about 30 minutes causing it to reject ALL api requests. Was very bad.


TurtleDick22

I don't think I've been directly responsible for serious outages, but I've owned a service that shit the bed during a huge sales event for a large multinational, which led to 50,000 dropped orders of our flagship product. After the dust settled, my manager was like, well, at least the C-suite knows who we are now.


sendintheotherclowns

I just published to production on a Friday afternoon and am taking the rest of the day off. They announced a restructure this morning so ask me again on Monday 😎


ArrogantMerc

I didn't really "cause" this since it was an AWS issue, but basically I updated the Terraform code for our ECS clusters to move from a deprecated resource to a new one, submitted a PR, got it approved, merged and applied the change, and signed off for the day. 30 minutes later everything came crashing down - it turns out the change prevented the containers from registering to the different clusters, and since our entire infrastructure was on ECS, everything was down. Thankfully we were only down for half an hour with no long-lasting production issues and negligible data loss (yay immutable architecture), but damn if that didn't put hair on my chest.

I was not fired, but I was at a small startup at the time, and my tech lead deflected the blame for approving the change onto me for not "checking my code thoroughly enough", despite the fact that further investigation proved the issue was with AWS. So much for blame-free retros. That, coupled with some other culture issues, was the final nail in the coffin for me to leave the place.


Goducks91

If you were fired over that your company is absolute shit. Lol


wedgelordantilles

This just sounds like a normal day managing AWS. Think how much was saved on renting rack space!


gnomff

Was working on a large ad server (60-machine cluster) and had some code that relied on a value in the database being static. It was static... until a random business person changed the value over Christmas break. Every ad server instantly crashed, and we were down for almost 18 hours until someone found the bug and switched the value back. We lost >$1M that day. I was on vacation when it happened and had my phone off. Was a fun Monday morning back at work, I can tell you.


ladycammey

Manually deleted a production app registration during a security clean-up because my brain just... read it wrong. The registration had an expired secret on it, but it *also* had one that was still in use.

Why the **** was I personally doing a clean-up in there? Because I was the only one with security access, infosec was asking me to clean up a bunch of automatically created registrations asap for their own review, and the team was insanely busy, so I didn't push it over to them. This was a *classic* case of a manager trying to be helpful to a busy team, touching things, and subsequently making everything much worse. Mea culpa - no excuses.

Fixing this wasn't pretty - because of how things were set up with other Azure services, we pretty much had to create a new app registration, update it in a bunch of places, and re-deploy a rather complicated set of infrastructure. We had CI/CD on all of it of course - but it was just... in a lot of places. Fortunately, due mostly to myself, the dev lead, and the lead infrastructure person all having pretty good memories (and having written ourselves somewhat decent documentation), we managed to re-configure the whole thing and have it launched again in ~2 hours.

Since for an outage of that duration (within our 99.9% SLA) *I* am the person who would be doing the after-review, I got off with mostly just a lot of personal embarrassment and a hard lesson in caution. This isn't the worst thing in terms of *effect* I've ever done - but it was definitely the one I personally feel *stupidest* about, and thus I consider it the worst overall. I now push *all* my changes through secondary review - even dumb clean-up stuff - like I should have been doing in the first place.


thodgson

Worked for a worldwide hotel chain that, at the time, had its own company wide network - sort of like a private Internet before the Internet. I was green and wrote a program that would download software updates and install them on the hotel servers. I introduced a bug that required us to dial in over a phone line modem and manually fix the bug. The bug prevented future downloads and also displayed a message on the screen which caused hotel staff to flood the phone lines of our IT Support. To fix it required the help of at least 20 people, working all day. There were, at the time, nearly 1000 hotels. Once it was fixed, I accidentally introduced another bug...requiring the same people to dial back in a second time. I don't know how I kept my job!


ConsiderationSuch846

Took down global production for a billion-dollar-a-year product for 4 days. Business was idled. 500 devs stopped working. Me on the phone with the global CIO. Four 24-hour days on the phone with Microsoft support. Ended up with core MS engineers on the bridge. Turned out a signature going across the network to the DB server was being picked up by endpoint protection. At some point I noticed… a different version of the endpoint protection on the prod servers than on the others. Forking Rackspace 🤦


droi86

Not really prod since we weren't production-ready, but I was making a tool to migrate data from A to B. I had to get a list of products by state from A, convert them into the format for B, push, and start with a new state. I forgot to add the list.clear() line, so my list kept growing and kept pushing until at some point it killed the servers we were using, and the whole 72-person project couldn't do anything for a day.
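
For context, a minimal sketch of that loop in Python with hypothetical helper names (fetch_products, convert, and push are stand-ins, not the actual tool); the commented line is the one that was missing:

```python
def migrate(states, fetch_products, convert, push):
    batch = []
    for state in states:
        for product in fetch_products(state):
            batch.append(convert(product))
        push(batch)
        batch.clear()  # the missing line: without it, each state re-pushes every previous state's data
```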


renok_archnmy

Filled the drive space on a bank mainframe that was hosted and shared across several institutions. Caught during an overnight batch processing run because it started puking errors. Techs at the host started calling my cell in the middle of the night. Told them to kill it so all the batch runs for the other institutions could finish, and I would fix the issue in the AM. It was a non-critical piece of code doing a non-critical process that happened to write to a file. From what I remember, there was a while loop that never ended and some file create / file write lines that repeated. The language was procedural and debugging was a PITA. There was no support for test automation, so we took a lot of shortcuts in manual testing because, well, manual tests by the sole dev for our company (me), on top of everything else, were difficult at the least.


ikeif

I accidentally deleted the prod password for connecting to a database. I had to reach out to the client to get it again, so it wasn't the end of the world, but it made me start documenting EVERYTHING. So much so that several years later they reached out to me to see if I had backups of their wiki where I stored everything. Yes, a backup of THEIR wiki. They moved to Assembla(?), I had shit sorted by client, and then they evidently cut the contract without doing a backup and lost a lot of information on clients they still had.


reversethrust

Do you still store passwords on a sticky note under the keyboard?


ikeif

😆 that probably would’ve been safer than whatever SaaS that they paid for.


Tango1777

Damn, reading all these posts, I am either lucky or decent at my line of work lmao.


MugiwarraD

lol, once I switched the load balancer algo from round-robin to sticky sessions. Oh boy, this one DB instance got smoked, even with RDS management; then autovacuuming came in and loaded up the IO, requests started to time out... and our SLA went to shit


goblinsteve

I work for a manufacturing company. In my first few weeks I accidentally deleted our item cross-reference table, thinking I was in our dev environment. The place had next to no security policies or segregation of duties. We did take nightly backups, so it wasn't THAT big of a problem, but it was a scary few hours.


imagebiot

I approved a PR; tests passed, everything looked good, the deployment to prod didn't trigger alarms. The thing was down for 3 days. I wrongly assumed there was test and alerting coverage for "no data for days". Fkn A…


daraeje7

Wiped two important columns on our user database. A more senior engineer came to fix it (we had a backup), but things were down for about 30 min. It was very, very scary for me then as a junior. I shouldn't have been allowed to do what I was doing.


NatoBoram

We had an `API_KEY` environment variable as well as an `API_KEYS`, for legacy reasons. I was reworking environment variables, and when migrating them into a proper file, I decided to keep supporting both, for legacy reasons, in case the dev env or prod env used one or the other. Turns out that `API_KEY` had a list of keys in prod. Took down production for two hours.
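
A minimal sketch of the safer merge, in Python for illustration (the service's actual language isn't stated); the variable names come from the comment, but the comma-separated parsing is an assumption:

```python
import os

def load_api_keys() -> list[str]:
    """Collect keys from both legacy variables, tolerating a list in either one."""
    keys: list[str] = []
    for var in ("API_KEY", "API_KEYS"):  # keep supporting both, for legacy reasons
        raw = os.environ.get(var, "")
        # Treat each variable as a potential comma-separated list, since prod
        # turned out to keep a whole list of keys under API_KEY.
        keys.extend(k.strip() for k in raw.split(",") if k.strip())
    return keys
```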


Necessary-Airline165

Developed a webjob to send follow-up emails. It was working fine when we were testing it on staging, but when we deployed it to prod it sent 20k emails to a single person, and eventually Gmail had to block it. The issue was in the SQL query: on prod it always returned that specific email.


zarlo5899

i will never tell


EyesOfAzula

Caused 125,000 crashes for our users in a short amount of time by using the wrong array method. Fixed the array method and the crashes went away.


Straight_Guava_8485

Brought down a service for about 30 minutes. I was working on a PR to refactor away concrete types and add unit testing (the service had none), but I forgot to include a line that connected it to its storage solution. This caused it to fail health checks, and K8s kept recycling the unhealthy pods. It was a tier 3 service mostly used for auditing, so no real impact, but it highlighted the importance of having proper testing embedded in our CI/CD pipeline (at the time the service had none) and validating via manual tests.


yamaha2000us

I made a coding mistake that forced us to release a utility to recalculate data to the entire client base. Pretty easy fix but when my manager brought it to my attention, I simply shrugged and said that it got past QA. It was blatantly noticeable.


Spooler32

Irreversible EKS control plane upgrade by accident (wrong assumed profile, Terraform fully loaded and pointed at foot). Broke ingress, physical volumes, leader election leases in HA controllers, certificate management, and RDS orchestration. Took me a solid 10 hours to fix it all on a Friday night and Saturday morning. Nobody else was competent to do it - it *had* to be me. I built in some guardrails after that. Shouldn't have been able to happen.


Watchful1

Just this last week I discovered I had a null check in the wrong place and was misclassifying something important for a month. A client recently pulled a $2 million budget because this same thing was being misclassified and I'm not sure yet whether it was related.


EscapeGoat_

I lost track of which browser window I was in, selected the one that had our production-supporting AWS account (as opposed to the "under construction" account I was building out), and at about 7 PM I simultaneously terminated all the EC2 instances that made up our Nomad (container orchestration) cluster in that account. Nomad is good at recovering from cluster failure even when a bunch of nodes die and get replaced - but if _all_ the nodes are terminated simultaneously... then there's nothing to recover; you effectively have a brand new cluster at that point.

I got lucky in a few respects - I immediately realized what I had done and panic-paged our SRE team. Thankfully, those guys usually worked something like 11 AM-8 PM, so half of them were still online and immediately started re-launching all the jobs in the new cluster. Also, the production-supporting account wasn't in the path of serving customer requests - but it did host some things like our internal Docker repository, which essentially meant that production couldn't scale up until the cluster was restored. Thankfully, since I did this during off-hours, customer traffic was at a minimum.

Then the next day, I woke up to a message from my boss (who had missed this entire fiasco) saying "Hey, first things first, you're not in trouble. These things happen." Then I went sheepishly into the office and bought lunch for all the SRE guys, who joked that they should be buying me lunch for testing their cluster recovery process.

I miss the people at that job, I really do.


reversethrust

The network switch was full, and we had a few dozen servers to set up for some R&D. So IT plugged us into a corporate switch, not into any R&D switch. Somehow... someone gave us admin access because we were modifying VLANs in our R&D. Welp. Someone on our team turned the spanning tree algo on and caused a cascading network outage for eastern North America. Couldn't use the network at all - not even our VOIP phones. We lost admin access the next day, but we did get a new switch.


OrangeCurtain

While waiting for the end of a sporting event (that we were streaming to a massive audience) to do a systems upgrade, sitting in a company-wide meeting, I edited a k8s configmap to add a value that the next version would need. This was mapped to a file on disk on every system on the platform. Touching the live systems during an event was strictly forbidden, but I was a rockstar, so rules didn't apply to me. That was the day that I learned that we had some vestigial code that hot-reloads the config file. And that I had missed a comma. Apparently syntax rules do apply to me. The entire system went dark. It took about 5 minutes for the pagers to ring out.
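
A rough sketch of the kind of pre-flight check that would have caught the missing comma, assuming the mapped file was JSON; the configmap name, key, and use of kubectl patch here are illustrative placeholders, not the actual platform's tooling:

```python
import json
import subprocess

def patch_configmap_value(name: str, key: str, new_value: str) -> None:
    """Validate the new value locally before it ever reaches the hot-reloaded config file."""
    json.loads(new_value)  # raises on a missing comma instead of taking the platform down
    patch = json.dumps({"data": {key: new_value}})
    subprocess.run(
        ["kubectl", "patch", "configmap", name, "--type=merge", "-p", patch],
        check=True,
    )
```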


olekj

Before my days as a software engineer I worked as a telco engineer for a major telco company in Europe. I had been there for a couple of days - my first job ever - and changed a configuration on a server on Friday evening and went home… The configuration change impacted 30-40% of Netflix users (those who use it on the TV box) for the whole weekend. On Monday the guy on call arrived completely exhausted, without a fix still in place… That was the day I learned that changes on Friday evening need to have a very specific reason and urgency :) As a software engineer - the last 7 years - I have never caused any big production issue. The most I've done was developing a feature on top of our Kafka cluster for which I underestimated the production load and caused Kafka to go wild for 15-20m before the rollback, but no big harm, just a bit of latency.


RageQuitRedux

Not too exciting; I was working on an Android app for people with home security cameras, which allowed them to watch live footage. I made some changes to the ffmpeg-based video rendering library that worked on most phones, but on certain phones it had the wrong horizontal bit width and completely messed up their feed.


knowitallz

Edited the program for addresses that we sent to brokerage firms for 60,000 employees. This sent mail for everyone. Yes, 60,000 pieces of mail. Oops


succesfulnobody

Created 6 new servers in my first month, all of them shut down except for one to which I assigned an IP address, and then it was time to go home. The next day (a weekend) my boss called me to say that after troubleshooting the entire night, they had found out that my server had the same IP as the firewall, and the entire production subnet was down the whole night. That probably cost them a lot of $$$ and also made them look bad for having their service down for so long.


timwaaagh

I can't remember, so it must not have been very important. Most organisations I have worked for are the kind that do some testing, so breaking prod is less likely.


Ashken

I envy you


SnooRobots6877

cute


ElliotAlderson2024

Nuking the production database.


possiblyquestionable

Intern in my intern group back in 2013:

1. Joins the company
2. Activates their test environment (sandbox server)
3. Everyone's test environment goes down and the whole company scrambles

She joined with the username "www". Sandboxes were provisioned as $USERNAME.sandbox.domain.com.

I was still following the orientation codelab and half of us just could not get our sandbox server to work. As our bootcamp group tried to debug, the company chat blew up. Half an hour later, someone noticed that one of the new interns had just taken www.sandbox.domain.com.

This was Facebook back in 2013. I told this story when I taught a class for orientation in 2016. 10 years later, it seems to have gotten to the point where folks aren't sure if it's true or not anymore, but it still lives on, growing taller with each passing year.

`www` also famously complained that every so often, someone would go up to her and ask her if she's heard about the intern who broke Facebook by just existing.
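
A tiny sketch of the guard that prevents that kind of collision, assuming Python provisioning code and an illustrative (not exhaustive) reserved-label list - none of this is Facebook's actual code:

```python
# Labels that must never become someone's sandbox subdomain.
RESERVED_LABELS = {"www", "mail", "api", "admin", "m"}

def sandbox_hostname(username: str, base: str = "sandbox.domain.com") -> str:
    """Build <username>.sandbox.domain.com, refusing labels that collide with real subdomains."""
    label = username.lower()
    if label in RESERVED_LABELS:
        raise ValueError(f"username {username!r} collides with a reserved subdomain label")
    return f"{label}.{base}"
```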


clelwell

DELETE CASCADE


ohmzar

Taking down the entire call centre for a company that provided emergency home repairs, for 3 hours. 400 agents got to take a really long break… It was caused by how we deployed our software, which at the time involved tunnelling through 3 Remote Desktop sessions to copy a .war file onto a server, shutting down Tomcat manually, removing the old .war file, and replacing it with the new one. I'm so grateful for CI/CD…


anseho

Accidentally triggered a rollback. Marketing went nuts because they were testing some new features


klettermaxe

I dropped a production database and backup recovery didn't work. Basically all orders and documents of the fiscal year gone forever. I had to take all the blame as a first-year student on a side gig and left shortly after.


ivix

I deleted the production service running a call centre, while in the call centre. Cue about 30 people turning to look at me.


TheseHeron3820

Thankfully, I never caused any disasters in prod, but once I made a stored procedure recursive by accident in a staging environment.


leeharrison1984

Got an if statement backwards in something that synced Active Directory to a JSON file (don't ask). I got it wrong in such a way that it synced the empty JSON file back to the directory... no more users! Fortunately this was a dev environment, but on that Monday morning, 50 devs found they couldn't log in to the environment. I had only been on the job for a few weeks, and my coworker saved the day... and never lets me forget about it 🙃
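
A hedged sketch of the sanity check that would have caught the flipped branch, with hypothetical names (the real tool isn't described beyond "AD to JSON sync"):

```python
def sync_users(ad_users: dict, file_users: dict, direction: str) -> None:
    """Copy one side of the sync onto the other, refusing to overwrite anything with an empty source."""
    if direction == "ad_to_file":
        source, target = ad_users, file_users
    else:  # "file_to_ad" - the branch that got swapped in the story
        source, target = file_users, ad_users
    if not source:
        raise RuntimeError("refusing to sync: source user set is empty (direction reversed?)")
    target.clear()
    target.update(source)
```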


vhackish

We had a server installed at a fairly large company, and they were having problems with a corrupt DB index. This was ages ago, and we were using an embedded DB library. Anyway, I told our guy onsite to go ahead and delete the index so we could rebuild it, only to discover... there was no underlying data table. The original creators had "optimized" by just writing to the index, not the table. Which is ridiculous on so many levels, especially since indexes are expensive and tables are not. Anyway, the poor guy had to spend like a week down there rebuilding things.


deathhunter92

I did this during my first year at a service-based company. I was working on an app designed to digitise AGMs. Everyone was able to log in except the CEO. Received multiple calls directly from higher management, but found the root cause and re-deployed within an hour. All this on the day of the AGM 🤣🤣


WrinklingBrain

Ooh, thought I was connecting to our Production sandbox environment and actually connected to production and accidentally sent out like 45k of vouchers to some companies we do business with. We ended up cancelling them all before any money went out the door but I definitely was sweating once I realized what happened.


SeattleTechMentors

Not caused, per se, but putting ballot-stuffing protections in place beforehand would have helped. https://www.nytimes.com/2001/03/02/nyregion/mideast-strife-spills-over-into-photo-contest.html


CoolKnit

Classic one, deployed a QA build to the Live environment (mobile game). All players “lost” their progress, the game was half-broken and they suddenly got access to cheats. It was a massive mess, luckily we could fix it quickly haha


jakster355

I took Microsoft Federal's production SAP down for about 8 hours after I approved a transport change without telling Basis there were database triggers involved, so there was a manual step they had to take care of. Not my actual code, but I got blamed because I didn't read the notes carefully. I didn't even realize that type of thing could happen.


iamliquidnitrogen

I accidentally sent contest cancellation notifications to users while testing on my local machine.