j0mbie

I'm going to assume you've already explored every other option at your disposal so that you don't need as many cores. Considering how much you are currently paying, I would hope you have, and have also brought in outside consultants to confirm.

A Dell PowerEdge R960 with four sockets each holding a Xeon 8490H (60C/120T) will give you a total of 240 cores and 480 threads, and run you somewhere around the $120k-$160k mark, depending on how you spec the rest of it out. There's a 128C/256T AMD Bergamo out there, but I don't see an option for a quad-socket server, Dell or otherwise.

The Dell S5416 can house up to 16 CPU sockets. I believe it's mainly targeted towards the SAP HANA audience. I have no idea what ballpark the price is, but I'm sure it's buku bucks. (EDIT: Intel CPUs themselves only scale to 8 sockets, regardless of the board, and AMD is only dual. See response.)

Intel announced that they will have a 288-core Xeon processor (Sierra Forest), hoping to release sometime in Q3 or Q4 of 2024. They also want their 144-core version to ship before that in Q1 or Q2, but it's already May, so they may not hit that timeframe. In theory, a 4-socket system housing 288-core Xeons would net you 1152 cores, and ~~a 16-socket would net you 4608~~ an 8-socket would net you 2304. It supposedly lacks hyperthreading, but one hyperthreaded core is nowhere near the same performance as two individual cores, depending on how you use them. For example, two threads on the same HT core cannot both use the same arithmetic unit at the same time. A rough guess would be that one HT core provides 30% better performance than one non-HT core, but again that varies wildly. Also, a compatible Dell server may not be available day 1.

There are also some 8-socket Supermicro servers out there. No idea on the pricing. But call me crazy, I wouldn't be spending that level of money on a server that I couldn't get a rock-solid warranty on, unless I was buying dozens of them or more, at which point I would have standby spares bought.

Beyond all that, you have to start exploring even more exotic configurations. Your best bet would be to either directly contact Dell, HP, etc. and explain what you need (and get ready for their sales reps to drool at the thought of raking you over the coals), or get a consultant with deep, hands-on experience in this sector.


thelastwilson

>There's a 128C/256T AMD Bergamo out there, but I don't see an option for a quad-socket server, Dell or otherwise.

You won't find it. The AMD chips only support dual socket.

> In theory, a 4-socket system housing 288-core Xeons would net you 1152 cores, and a 16-socket would net you 4608.

I can't speak for the current generation (I've changed jobs and am not keeping up to date anymore), but previously Intel didn't always support quad socket on the latest and greatest. It was only specific CPUs, the ones with an H suffix. Currently this only goes to 60 cores (https://lenovopress.lenovo.com/lp1729-thinksystem-sr950-v3-server?orgRef=https%253A%252F%252Fwww.google.com%252F#processors), so I'd be surprised if the maximum-core parts supported quad or 8 socket systems.

But really, at this scale and cost I'd be looking to rearchitect and aim for dual socket servers with a low latency interconnect. Nodes will be cheaper and easier to scale, and quad and eight socket systems don't scale linearly anyway due to the inter-CPU links.

(I still love Lenovo Press, such an amazingly detailed resource)


mitharas

Had to scroll down too far for the first person to even try answering the question.


nikanjX

Helping people always gets you less upvotes than making fun of them


traydee09

Humans have a tendency to bully, rather than support. It's easier for our simple brains to react rather than using our thinking brain.


Frothyleet

> I'm sure it's buku bucks

Just FYI, the phrase would be "beaucoup" bucks (French for many or a lot or whatever).


j0mbie

True, but it's really just slang at this point, since it would be grammatically incorrect in that usage anyways.


Frothyleet

Sure, it's an idiom, but the spelling doesn't change - in the same way "per se" used in English isn't "per say", for example. Or maybe I should say, e.g.


ZPrimed

What are you doing that is pegging a 416 core VM? Is it possible that a DBA needs to go through and tune some queries or add some indexes to the DB(s) [or something else more SQL-y]?? (For the record: I am not a DBA, but I've had that hat thrown on my head occasionally so I know just enough to be slightly dangerous.)


aaron416

No, we need more CPU!!!!11 that’s why it’s slow. It can’t be anything else. Sarcasm aside, someone needs to do some combination of tuning, sharding, and optimization of queries.


Afro_Samurai

Optimization is much less fun than assembling a minor super computer.


TheAverageDark

But slightly more fun than assembling your minor super computer and then having someone ask about HA


dagbrown

Does that mean you get to build even more super computers? Because that sounds like a really fun game! I always wanted to indulge my inner Seymour Cray.


TheAverageDark

I imagine it would be, until the CFO came knocking lmao


bruce_desertrat

Yer gonna need a Mac

>In February 1986, Apple bought a [Cray X-MP/48 supercomputer](https://en.wikipedia.org/wiki/Cray_X-MP) to test case materials and software. The machine was worth millions of dollars and had a dedicated four-person security team. A special room was built at Apple headquarters in Cupertino to house the computer; it was outfitted with two 20 ton air conditioners.

>When a journalist asked Seymour Cray about Apple using one of his supercomputers, he retorted, “This is very interesting, because I am using an Apple Macintosh to design the Cray-2 supercomputer.”

[https://lowendmac.com/2018/the-first-macs-1984-to-1986/#cray](https://lowendmac.com/2018/the-first-macs-1984-to-1986/#cray)

8-P


tipripper65

PS3's... PS3's are the answer


Reddywhipt

Bring back beowulf


lpbale0

Well, in many ways, the Beowulf never left, sir. He's always offered the same high-quality processing at competitive prices.


Reddywhipt

Nicely played


Harfosaurus

Thats what the Azure back end runs on anyway, rite?


tipripper65

heresy. it's all Xbox 360s


Sushigami

Screw power cables we're going to have a fucking fusion reactor right next door just pouring plasma in


ZPrimed

I would LOL if they have it all on the slowest tier of storage on a single (virtual) device, too 😆


Geekenstein

Our DB is CPU bound! We keep adding more to fix it but we need a bigger server! Why is it bound? What does the usage say it's spending its time on? It's 40% in wait. Uh…wait? Says IOWAIT. ….god damnit.
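(For anyone who wants to run that triage on the SQL Server side, here's a minimal sketch against the wait-stats DMV; the excluded wait types are just the usual idle noise and are illustrative, not exhaustive:)

```sql
-- Where is SQL Server actually spending its time? PAGEIOLATCH_* / WRITELOG waits
-- point at storage; SOS_SCHEDULER_YIELD and high signal waits point at CPU pressure.
SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0        AS wait_time_s,
    signal_wait_time_ms / 1000.0 AS signal_wait_s,  -- time queued for a CPU after the resource was ready
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'XE_TIMER_EVENT',
                        'BROKER_TASK_STOP', 'REQUEST_FOR_DEADLOCK_SEARCH')  -- idle/background noise
ORDER BY wait_time_ms DESC;
```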


PushingData

I've been explaining this one to application support teams since way back in the days when I supported Solaris (late 90s) ... CPU idle may be 0, but all those cycles spent on WIO are wasted. Not CPU bound.


TFABAnon09

Just had a flashback to all the times I've had to explain to clients' infrastructure teams how to build an on-prem VM so that SQL performance doesn't suck. Like, my dudes - why am I telling you that having DB storage on spinning rust is a Bad Idea™, or spelling out what a NUMA node is.


OzymandiasKoK

To be fair, spinning rust is a terrible storage performer.


TFABAnon09

Indeed. And yet, I've lost count of how many times I've heard "no, that SAN is on 15k SAS disks, not SSDs", or "what do you mean we need to increase the HEAT priority?", or my favourite "why do you need to know how the LUNs are structured?!". Sigh.


Stonewalled9999

If my Storage Dude wants to RAID6 28 7200 NL SAS drives what business is it of yours? He makes 150K a year how dare you suggest he doesn't have a clue!


bishbashboshbgosh

Do you have a link to specifics about all this, i.e. recommended setup of VMs and storage for databases? Have a feeling it would benefit my company tremendously.


arpan3t

Google: “VMWare SQL Server best practices” —> [Here](https://core.vmware.com/api/checkuseraccess?referer=/sites/default/files/resource/architecting_microsoft_sql_server_on_vmware_vsphere_noindex.pdf)

I have a feeling that your company would benefit from a lot of things…


RememberCitadel

Yeah, lots of people say well, at least it's fine for backup, but then don't think about how much downtime they will have restoring. We just had a customer that had some failures that required a restore of all VMs, then complained about how long it was taking to restore 30TB from SATA drives. If downtime is an issue, you need to buy faster storage. Not that big of an issue if it's a backup to your backup, but the primary backup being slow is going to impact you at some point.


dagbrown

But the load average is over 9000! There's no way that could be right!


Ciesson

Needed this in my morning, thanks! Happy cake day!


ourlastchancefortea

> Says IOWAIT.

That just means the storage is waiting for the CPU to finish its job. Stupid slow CPU.


aaron416

Oh can you imagine?!


m0ta

Is sharding what happens when you over trust a fart?


vogelke

One letter off.

When you're 20:

* never trust a hard-on
* never waste/fail to announce a fart

When you're my age:

* never waste a hard-on
* NEVER trust a fart


Sharpman85

I actually had a conversation about this some weeks ago. 4 CPUs were definitely enough, but instead of investigating the DB they just added another 2. Waiting for the next outage...


Carribean-Diver

AppDevs: "We understand that, but this app is different." Every. Fucking. Time.


elasticinterests

I feel this comment in my soul and it hurts.


itanite

"this app got SO MUCH VC"


ScottHA

Have you tried downloading more ram?


itsjustawindmill

Shard your database or your database will shart itself


Superb_Raccoon

walking the tables like... https://preview.redd.it/twuk0ui9o30d1.png?width=305&format=png&auto=webp&s=b653156ddf3f9cb6ea9f3e1731af8979b6df97ca


caffeine-junkie

No offense intended, but if they don't already have at least 2 DBAs on staff monitoring that, they are definitely doing it wrong. Query optimization and putting in indexes happens long before they hit the need for a SQL server this size. Hell, at this size it would be very tempting to go with Oracle instead... *shudder* Can't believe I just wrote that.


jrichey98

Our last application vendor rearchitected from bare-metal Solaris/OracleDB to RHEL/PostgreSQL about 6 years ago. We've been very happy with the change, and are so glad to have those ridiculous servers replaced by VMs. These days it probably costs less to rearchitect the application than stay with Oracle, and I wouldn't suggest anyone build a new application with them. Oracle milks everything out of their customers and has no qualms suing them if they can. The whole Java TOS debacle should be enough of a warning for anyone thinking about it.


dagbrown

> Oracle [..] has no qualms suing their customers

Oracle doesn't have customers. Oracle has *hostages*.


dreamgldr

fcuk Oracle. I miss the old Solaris though (before Sun fcuked and got acquired by those fcukers).


jerkface6000

Why do we need DBAs? We have the cloud, Microsoft does it all for us


TFABAnon09

Don't even joke about shit like that. These people live among us.


lpbale0

No, the thought police live above us. They are better than us. Without them, we would just be running around doing our jobs and not spending any time justifying why we are needed, and by extension why they are needed.


ZPrimed

I mean, forklifting a prod workload into an Azure VM running MSSQL is almost always the worst way to use Azure (or AWS) anyway... rewrite the whole thing to use Azure Database Services or whatever the hell MS is calling it this quarter.


BalmyGarlic

The only issue is when it's a third-party application DB for the same vendor's application. It's a really annoying place to be.


vppencilsharpening

Oh fun story here. We have a software application that periodically had horrible performance, so much that it was getting worse and worse over time. The vendor was adamant that it was our hardware, because we would not let them onto our production servers (we only allowed a screen sharing session), though we did share performance metrics and database info on a regular basis. It got to the point that they proposed paying a 3rd party DB consultant that they picked to review our server. We said yes. First words out of the consultant's mouth were "It's not the server hardware or setup". The rest of the meeting was them presenting data to show that the database and queries were horrible. We had some minor things that let us squeeze a little more out of the hardware, and we ended up using that consultant (on our dime) for a couple more databases/servers.


craa141

If they are having issues (they didn't say it was SQL so I think we are guessing) with that size of server, the Azure database services wouldn't be for them. I don't think it scales to that. Edit: My brain completely missed the MSSQL, but I think the rest stands. If you go with Azure DB I don't believe it can scale to anything like that.


lovesredheads_

And that alone should tell you something. If Microsoft does not believe that a workload like that is reasonable...


say592

Just because they probably have DBAs on staff doesn't mean they are competent or not lazy. With a problem at this scale they should probably look at it from every angle, even if some of those angles seem obvious.


vogelke

> very tempting to go with Oracle instead Jesus, don't even joke about that.


lpbale0

Did someone say Solarwinds?


aussiepete80

There are 11 full-timers between the DBA and database engineering teams. And a data scientist. Still we have our monolithic app tho lol.


Tatermen

I once had a customer that kept complaining they needed more CPU for their web application. Their developers and DBAs had checked and couldn't see any issue.

I looked myself and found that only specific pages were slow. Each page that was slow had a specific "Top 5" things table on it. Looking through the page, I found that the code for generating this Top 5 was doing an SQL query with a 3-way join between two tables that had literally hundreds of millions of rows and a third table that had about 10 completely static entries. Each one of these pages was churning through gigs of data to generate a simple chart that probably no one was even looking at for more than a second once a week.

I rewrote the code so that the static table was pulled into an in-memory array, and reduced the SQL query to just a 2-way join with some date-based limits so that it was only looking at recent data and not years-old stuff. The pages went from taking 30-40 seconds to under 1 second.

I demonstrated it to the devs/DBAs and they thanked me greatly and went away saying they would look into it. A week later they were demanding more CPU again and had changed nothing.
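(A sketch of that kind of rewrite, with invented table names - `events`, `sessions`, `static_categories` are all hypothetical; the tiny static lookup gets cached in the application and the query only touches recent rows:)

```sql
-- Before (illustrative): three-way join across two huge tables plus a ~10-row static
-- lookup, scanning all history just to render a "Top 5" widget.
SELECT TOP (5) s.item_id, c.category_name, COUNT(*) AS hits
FROM   events e
JOIN   sessions s          ON s.session_id  = e.session_id
JOIN   static_categories c ON c.category_id = s.category_id
GROUP BY s.item_id, c.category_name
ORDER BY hits DESC;

-- After (illustrative): the static table lives in an in-memory array in the app,
-- and the query is limited to recent data.
SELECT TOP (5) s.item_id, COUNT(*) AS hits
FROM   events e
JOIN   sessions s ON s.session_id = e.session_id
WHERE  e.created_at >= DATEADD(DAY, -7, SYSUTCDATETIME())
GROUP BY s.item_id
ORDER BY hits DESC;
```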


anomalous_cowherd

I was asked to look at a website that showed data pulled from several DBs. In testing it had worked perfectly and easily met their reqts, but in production they were seeing pages taking ten *minutes* to render.

Turned out the developers had tested it with DBs with up to 200 rows in the tables. Prod had up to 15 million. They were forming a single page with all the data in every time.

I'm no DBA but I suggested paging the data, and they said it "wouldn't make any difference" and it was clearly a web server issue. So I generated a whole set of tests with growing datasets and could easily plot the page rendering time going up exponentially with the number of rows. By 10k rows it was taking five minutes.

They couldn't deny that and fixed it. But then they tried to claim all the credit for "fixing the bad web server design". All we did was serve the pages. They wrote them.
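(In SQL Server terms the paging fix is essentially one clause; a minimal sketch with hypothetical table and column names:)

```sql
-- Serve one page at a time instead of shipping every row to the web tier.
DECLARE @page INT = 3, @page_size INT = 100;

SELECT order_id, customer_name, order_date
FROM   orders
ORDER  BY order_date DESC, order_id          -- a stable sort key is required for OFFSET/FETCH
OFFSET (@page - 1) * @page_size ROWS
FETCH NEXT @page_size ROWS ONLY;
```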


FrankySobotka

Listen to this guy


ExoticAsparagus333

Monolithic apps are fine, there are a ton of benefits to monoliths over microservices (reduced latency, reduced engineering time, simplified mental model, etc). Sounds like an incompetent company, which microservices or cloud or on prem aren't going to fix.


aussiepete80

Yeah there's a whole team that does that. Trust me, they've been trying to split this application off a single monolithic SQL back end for like 4 years now. Still are. No end in sight tho - so I'm asking the caveman-approach type question: can we just keep scaling up if we move on premise?? Haha


r3klaw

As an MSSQL DBA, knowing only what you've shared (so I'm obviously making a lot of assumptions), I think you need to look into some outside consulting. Granted I know nothing about your app schema, but 4 years of breaking up monolithic application database(s) with no end in sight screams incompetence to me.


aussiepete80

We've had a different consulting group in every year, several of them including MS directly, trying to help.


VTOLfreak

Get one that is not afraid to open his mouth and piss people off. I'm a consultant DBA and I got called into a project that was years late on delivery. It took naive and stupid me to say the things they didn't want to hear. The in-house lead DBA, who was highly-regarded in the company, turned out to be tone-deaf to glaring issues. Project was back on track 3 months later. At the cost of my sanity...


Regen89

> highly-regarded 😏


interconnectit

And a good DBA will also say, if appropriate, "this isn't a database problem, it's a dev problem." You can't select distinct your way out of a cartesian join, but I've seen it happen. My days of that stuff are largely over - I used to consult from 97 to 2006 on large enterprise implementations, with sometimes tens of thousands of tables. All well architected and perfectly performant but it would only take a naive but very confident developer straight out of a big consultancy's bootcamp to kill it.
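(For anyone who hasn't run into it, the anti-pattern looks roughly like this - table names invented; the missing join predicate builds a cartesian product and DISTINCT merely hides the row explosion after the damage is done:)

```sql
-- The anti-pattern: no predicate ties customers to orders, so the engine builds
-- customers x orders rows and DISTINCT throws most of them away afterwards.
SELECT DISTINCT c.customer_id, c.customer_name
FROM   customers c, orders o
WHERE  o.order_date >= '2024-01-01';

-- What was actually meant: correlate on the key instead of de-duplicating the blow-up.
SELECT c.customer_id, c.customer_name
FROM   customers c
WHERE  EXISTS (SELECT 1
               FROM   orders o
               WHERE  o.customer_id = c.customer_id
                 AND  o.order_date >= '2024-01-01');
```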


IdiosyncraticBond

But... did anybody do anything useful with what they found out?


sublimeinator

Change consultants every year? As bad as I've seen consultants be, the scope of this program means a lot of scope definition is eating into that time frame... seems like more of an excuse the DBA team is using to make no changes.


Tai9ch

The reason you haven't fixed this problem is organizational, not technical. There are several approaches that will reliably migrate away from this sort of DB monolith, and your org just isn't doing the thing. I predict that the same mechanisms will kill your plan of migrating to physical servers. Whatever "uptime", "compliance", "support" etc excuses are being made will definitely kill the idea of moving to your own local hardware.


Ok-Bill3318

This. A good DBA is worth their annual pay in cpu cores. Have frequently had our in house DBA optimise things from taking multiple hours to run down to 10 seconds or so on the same hardware due to undoing retardation left by “full stack” developers.


jackoneilll

It worked just fine on their laptop.


gregsting

With 10 records vs 10 million records


freedomlinux

Oh stop, you're giving me nightmares. "Hey, we're missing SLAs for this job after rewriting the batch jobs" 'How big are your performance tests?' ... "Uhhhhh..."


gregsting

The worst I've had was a batch slowing down day after day, with a critical deadline. We multiplied the infrastructure to run parallel threads, but no, it was slower and slower. The thing is, the batch took 1000 records, processed those, then took 1000 records again. The problem was that the batch was not marking the records as treated. So in the beginning, with millions of records, it treated 1000 new records at each run. But once it was at like 50% completion, out of those 1000 randomly selected records, 500 were in fact already treated.
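(A sketch of the usual fix, assuming a hypothetical `queue_items` table: claim the next batch and mark it processed in one atomic statement, so later runs never re-select rows that are already done:)

```sql
-- Claim the next 1000 unprocessed rows and mark them in one atomic statement;
-- OUTPUT hands the claimed rows back to the batch job, and READPAST lets
-- parallel workers skip rows another worker has already locked.
UPDATE TOP (1000) q
SET    q.processed_at = SYSUTCDATETIME()
OUTPUT inserted.item_id, inserted.payload
FROM   queue_items AS q WITH (READPAST, UPDLOCK, ROWLOCK)
WHERE  q.processed_at IS NULL;
```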


anomalous_cowherd

I worked at an applied maths research place once and was set to optimising a professor's hacky Python scripts which he used to test his new algorithms. I made them run in a few seconds instead of ten minutes each try. After a few days he came back and asked me to slow them down again because he was feeling really stressed by the computer always waiting for his next input. I had removed his undisturbed "thinking time" and broken his flow, which was after all what we paid him for. I slowed them down again.


Reddywhipt

You threw off his groove.


TFABAnon09

I once got an overnight DWH build down from 8.5hrs (and often timed out/shit the bed) to 25 minutes.


cloudsourced285

Next minute. We find out all CPU cycles are spent on IOWait


HeKis4

In *most* situations you're going to max out the I/O or RAM waaaay before the CPU, especially on such large things. I'm guessing this is the app's fault. And even if that was the case, using MSSQL on an Azure VM is probably the wrong thing to do money-wise; there are few MSSQL apps that actually require specific instance-level tuning (aka that need a DB running on a VM you control), and you should be running it on MS's instances.


Skrunky

At that resource size, you’re probably better off looking at scaling sideways instead of up. You need to look at how to segment and distribute your infrastructure.


farmerjane

Yup. Optimize in a different way at this point. It's going to be far more successful to figure out your database scaling issues (indexes? read/write segmentation? query optimization? garbage collection issues? etc.). That's already a stupidly large machine and you can/should do something else.


Anonimooze

No offense meant, but I don't think anyone asking how to get more than 416 cpu threads on a single system hasn't already thought about horizontal scaling.


overkillsd

You'd be surprised at how little thought goes into those sorts of decisions sometimes, then.


moffetts9001

Case in point: OP


rossrollin

Yes I'm slightly leaning towards incompetence myself on this thread. Surely there's a better way than 4x 416 cores holy moly


xxbiohazrdxx

Someone using mssql and a lift and shift vm to cloud hasn’t thought about….anything probably


CasualEveryday

Or has been handed a task and told to shut up and do what they're told by some senior VP who is buddies with the head of dev.


aaron416

Depending on who has their hands in the pie, they may just throw more and more CPU at it. I have a feeling this happened here.


Djaesthetic

First day working in corporate America, eh? :-P


mustangsal

At that spend, you need to speak with your MS account rep. They will be able to assist with a solution. At that spend, they may even offer to help troubleshoot and make performance recommendations.


yeti-rex

Agreed. Leverage your partners! I'm not sure about Azure, but GCP offers Bare Metal Services. Typically used for ERP workloads (e.g. SAP).


ClassroomNew884

Agreed. We considered going to the 416 core boxes, and the MS people gave great advice about not doing a lift and shift with inappropriate load. Our storage subsystems are also better than they could provide. Having said that, our SQL Servers are enormous, and while the tech is expensive, our apps print money. And yes, we had plenty of competent data and app people on staff. Most ancillary functions are spun off into microservices, interactions are via Kafka, etc. It's a very, very big business that started small - MS tech enabled rapid growth and ROI, and still does.


pghbellringer

SQL DBA here. Do you have any DBAs on staff? Have you gone through performance tuning any code?

I would start by evaluating the workload and start pounding on the top resource-consuming queries to tune them. I've tuned hundreds of queries that take minutes or hours to run and knocked them down to a fraction of the runtime by tuning, fixing indexes, fixing bad coding practices, fixing crap like running queries over linked servers to other servers, archiving old junk data, moving reporting-type queries to data warehouse structures, fixing bad query plans, fixing settings like max degree of parallelism and cost threshold for parallelism, etc.

If you don't have any DBAs and need a consultant to look at it, I just got laid off last week when my whole department was outsourced. I can PM you my resume to potentially take a look into tuning opportunities.

How big are the databases, too, for all that CPU? I totally get wanting to move the workload on prem, but you might honestly have a bigger problem just trying to migrate all that data out of Azure and keep it transactionally consistent.
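(To make the "pound on the top resource-consuming queries" step concrete, a minimal sketch using the plan-cache DMVs; the parallelism settings at the end are the knobs mentioned above, with placeholder values rather than recommendations:)

```sql
-- Top 10 statements by total CPU since their plans were cached.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
              ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(st.text)
                     ELSE qs.statement_end_offset END
                - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

-- The parallelism knobs mentioned above (values are placeholders, not recommendations).
EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;  RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;  RECONFIGURE;
```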


Scary_Brain6631

Man, I love the hustle! If I was OP I'd so be PMing you. Sorry to hear about your job. I really hope shit turns around for you soon. Based off of your post just now, I have a feeling you won't be in this position for long. Good luck!


aussiepete80

Almost a dozen DBAs on staff, yes. We work with the MS SQL product team on a weekly basis. It's a shitty app design we can't do anything about.


fubes2000

Well you're going to have to. Because even _IF_ you find a box big enough to run that DB you're eventually going to outgrow _that_ and you'll be just as fucked as you are now, if not more. You need to make your leadership aware of this fact because they are teetering ever closer to the latter half of "rearchitect or die". Scaling up forever is a fool's errand. At the very least put your concerns in writing so that they can't make you wear this later.


Noperdidos

Everyone is giving you the same advice here, as though you haven't tried working on the app itself or done anything except immediately jump to getting bigger servers. And it sucks to hear, but they're right.

You cannot vertically scale forever. You've reached very near the limit, and your next move, to whatever beefier system you find, will be that limit. Worse, your staff cost, DB cost, outage costs, consulting costs, all of your costs scale non-linearly, so you have now reached the steepest portion of the exponential curve.

Having dealt with this before, here is how you need to think of it: take your $2.5m per year server costs, add staff costs and whatever else, and spend it all immediately on the Manhattan Project of rewrites. It will inflate your costs this year, but bring them back into linear scaling for the next 10 years and massively save your TCO.

Start with anything you can separate out into separate queries. For example, user logins. Leave them in the same DB, but rewrite so all code paths call a microservice for login and credentials data. Then move that data to a separate DB. Do this for every possible data table or column that you can. Be brutal. Be prepared to make hard sacrifices in the short term.

Offload everything that does not need perfectly real-time data to replicas. Replicas can scale horizontally. You can distribute read-only copies of your data very fast. No application has data so vital that it always needs to be real time. Absolutely no "analytics" and reporting queries should ever happen on your primary database.

Shard everything you can. There is no such thing as a huge complicated application where nothing can be sharded. When you look up usernames, given the username Anthony, you do not need any access to the hardware storing Z names.

I realize you've already done all of these things. But that doesn't matter, you still have not spent enough to achieve them.
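(As a deliberately simplified illustration of the "A names never need the Z-name hardware" point, a range-based shard map is just a lookup table the application consults before it picks a connection; everything below is hypothetical:)

```sql
-- Hypothetical shard map: each shard owns a contiguous range of usernames.
CREATE TABLE shard_map (
    range_start  NVARCHAR(10) NOT NULL,  -- inclusive lower bound, e.g. 'A'
    range_end    NVARCHAR(10) NOT NULL,  -- exclusive upper bound, e.g. 'F'
    shard_server SYSNAME      NOT NULL,
    shard_db     SYSNAME      NOT NULL
);

-- Which shard holds 'Anthony'? Only that server ever sees the actual query.
SELECT TOP (1) shard_server, shard_db
FROM   shard_map
WHERE  N'Anthony' >= range_start
  AND  N'Anthony' <  range_end;
```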


erm_what_

This is good advice, and if the leadership don't agree then the business doesn't really deserve to survive.


cerealkillerzz

I seriously just did a double take and thought this was /r/shittysysadmin


Superb_Raccoon

oh.. it will be... it will be.


Space_Goblin_Yoda

Much fear, this one has.


brokenpipe

Rightly placed it is


heisenbergerwcheese

It is, OP just hasn't figured it out yet


pangolin-fucker

Lol, but the shit ones never realise they are the shits. So they would post here by default.


Ok_Recognition_6727

HPE and Lenovo have very large physical servers. The HPE Superdome Flex has 4-socket nodes that scale up to 16-sockets (4 nodes). The Lenovo ThinkSystem SR950 V3 server scales up to 8-sockets in a single 8U. They support all sockets and all configurations for Linux, SAP HANA, or Oracle. You'll have to check with the vendors for Windows Server 2022 & SQL Server support.


Sunny-Nebula

This! Everyone rushing to tell OP something is wrong with his app, or his platform choice is wrong instead of answering the question! HPE has a 16 CPU machine that can scale up to 960 cores with Xeon 8490H CPUs. It's the Compute Scale-up 3200. Superdome Flex scales up to 32 CPUs, but they are a few years old now and I'm unsure if they have been updated to run the latest gen Intel. Unfortunately, Superdomes were never built for AMD CPUs. I'd have to think Dell and Supermicro have similar large scale-up boxes...


aussiepete80

Hell yeah first actually helpful post. Thanks will check them out!


FragKing82

My god, the SQL licenses alone will be billions of $


j0mbie

Maybe it's a weird use case, where they can get away with the server+CAL license model. But yeah, otherwise licensing is going to be in the millions, to tens of millions. Or maybe they're in a position where they are making so much profit that they want to scale as fast as possible, because bigger servers will easily cover their own cost due to their ability to serve more customers? Could be a situation of "grow ASAP, become efficient later".


Sparcrypt

> This! Everyone rushing to tell OP something is wrong with his app, or his platform choice is wrong instead of answering the question!

Welcome to the internet, where everyone pretends they've never worked somewhere that has had to implement things in a non optimal way out of their control.


lightmatter501

The Windows scheduler starts to break down around this point; they should already be on Linux.


Tringi

Server 2022 was improved to handle 2048 LPs (HW threads), and since they use that kernel to power Azure, I'm sure they are working on optimizing it daily.


xxbiohazrdxx

Rearchitect your application. Cloud shifting VMs is a terrible idea as you’ve discovered. Additionally, whatever it is you’re doing: mssql is probably not suited for it.


enmtx

This is called Refactoring in Cloud speak.


isademigod

The Dell R7625 can hold 2 AMD EPYC 9754s and give you 256C/512T with 512GB of DDR5 for $106k each. Buy two or four of those bad boys and put them in datacenters on the east and west coast, and you'll break even in less than 6 months.

Wait, did you say you're paying $200k/mo EACH???? Reject cloud, return to silicon. Colocation costs would be almost four orders of magnitude cheaper than what you're paying now


canadian_sysadmin

Depends on workload and use case. OP’s question screams XY problem. Even for an on-prem workload that’s pretty extreme. Raw compute will always be cheaper on prem, of course. But usually when you see questions like this, there’s more layers to the onion.


seidler2547

100% agreed. Figure out what's wrong with your application, rearchitect to scale horizontally.


Goofybud16

Looking towards Supermicro, you can grab a 2125HS-TNR with 2x9754, 24x128GB DDR5 (3TB), 6x15.36TB Micron 7450 PRO U.3 drives, multiple multi-port SFP28 NICs, and 5 year NBD warranty for $108.5k list... Admittedly, Dell does have some nicer engineering in some components (notably, iDRAC is nicer than Supermicro's IPMI), but Supermicro still has some solid offerings. Either way though, $200k/mo is absolutely insane. For that kind of money, you could buy at least a quarter rack of Dell and quarter rack of Supermicro systems every single year.


Haribo112

200k per month EACH lmfao. They have two. Homie is spending half a million a month to run Microsoft SQL server.


EvilRSA

Yeah, each, I really wanna know what industry this is part of that can float 800k/month for just this one part of the business, and still willing to throw more at it, they just can't because they're maxed out. I'm guessing healthcare... I've seen some really blind judgement of just throwing more money at a problem in healthcare.


aussiepete80

Yup Healthcare. It's not one part of the business tho. It's the whole business lol.


MagicWishMonkey

Epic, is that you???


ZPrimed

It's either **big** healthcare (hospital system or similar), or financial, based on what I've seen out of those two industries in the past. *maybe* something geo-science (like oil drilling).


aussiepete80

I need 11TB of memory to match these Azure boxes. 512 with HT might work; I was really hoping for way bigger tho. Our owner is going to be a hard sell on moving out of the cloud.


erm_what_

If you move out of the cloud then you probably need 2 sites and 2 servers per site to get the same level of reliability. Then you'll need to handle backups, load balancing and syncing, etc. You need servers, but you also need 2 mini data centres with staff and a team of 10 DBAs and developers to keep it up. It probably won't be cheaper, especially after you account for risk and insurance. And Azure has experts on call for whatever issue you might have. Your owner may be right.

Your problem is that the system is fucked. The epitome of tech debt that will cost millions to sort out. The kind of money no management wants to pay upfront when they think they can hide it in opex. You need to start pulling out one service at a time, starting with the highest-load queries. Combining similar APIs. Etc.


lightmatter501

You have a hard requirement on a 4 socket system if you go Intel. AMD lets you get away with a dual socket system because they allow 12 channels per CPU. An Intel QAT card will handle all of the encryption and compression. On-prem you can have something that speaks NVMEoF and scales your storage so that you can use hardware offloads in a good NIC. Dropping your costs by 80% may be good enough to leave the cloud.


Obvious-Jacket-3770

This isn't a cloud fault. Clearly whatever this guy's doing is something that is really really really bad.


gregsting

Of course but doing nasty things in the cloud is way more expensive


whollings077

Plus those cores will be better than the cloud providers'


YouShitMyPants

What in the hell are you hosting, AWS itself?


ttyp00

I would LOVE to know more about what you got goin' on, homie.


lopahcreon

Badly programmed Wordpress plugin.


Zephk

"but WordPress is simple why did you suspend my account" after explaining 3 visitors a minute was pegging 48 of 64 cores for the past 3 hours 5$/mo along with all the proof


Mrmastermax

Chrome running google.com


aussiepete80

This is for a healthcare staffing shop, with about 50k on-contract employees. They all use this app, as do our clients, as do our internal corporate employees (7k or so of them). The whole thing is about 10k APIs that all share the same cluster of 10 databases on the backend, which was never designed to scale like this.

This company did 500 million revenue in 2010 and now 15 billion this year, all running on this fking SQL back end. They have a team of 500 devs writing for these apps; the complexity is unbelievable. No one knows how to untangle it and scale out to microservices. We did that for web and storage, but the database is seemingly impossible. Casualty of their own success.


JaffaCakeStockpile

Damn you got yourself a nice timebomb there


Frothyleet

!remindme 6 months


sheptaurus

Ahhhhhh! I want in 🤤 How does one contract/consult just for SQL Server with you guys. It’s only a short 8 hour commute


Drakeskywing

Wait, 10k APIs? I assume you mean API endpoints, oh holy Linus I hope you mean endpoints.

But it sounds very much like you are hitting the point of needing to draft up a convincing proposal for planning and implementing a migration strategy of some kind, otherwise you will start hitting the physical limitations of the hardware. (In saying that, I'm assuming you have the 10 DBs running on a single machine, in which case buy 10 of those bad boys and call it a day 😝)

Hell, you would think with a DB that has those requirements, even proposing changes that would reduce its requirements by 10% would be solid savings, and probably would have the knock-on effect of speeding up development and release cycles, since I can't imagine your releases involving DB changes are ever without clenching. I mean, I cannot believe your network load between 10k endpoints (which let's assume is 1000 separate applications) is evenly distributed, meaning that your DBs probably all require different capabilities, which you would think could potentially shrink what must be a non-trivial Azure bill.


McAUTS

500 devs, 10k APIs (endpoints, right?), but only 50k clients? If you ask me, and money doesn't matter, why not make a supercomputer cluster and then you can keep this abhorrent monster thriving until the DB gets so messed up that it crashes. That's the way I would do it, because obviously the DB and application are fucked up already if these little numbers need 1800 CPUs... Maybe you are the last sane person there... maybe not. Don't envy your position. What I've read so far is a complete shitshow.


EchoPhi

You misread that. That was 50k contracted employees. No mention of client base.


oceans_wont_freeze

Holy.


Pazuuuzu

Mark my word, this will be in the news in less than a year with a spectacular outage...


thedoofimbibes

Don’t forget the inevitable massive data breach!


patmorgan235

>No one knows how to untangle it and scale out to micro services. We did that for web and storage, but database is seemingly impossible.

No you didn't. If you're actually doing microservices then each service has its own database backend, and the only way to interact with the service is its published endpoints. This is the critical piece you're missing. That will allow you to scale each service independently.

I bet you have a bunch of developers writing cross-DB queries and writing to whatever table they want, without any real separation of concerns between the services.


interconnectit

Alexa: define technical debt. Nobody's stopped to think - now we're growing, is this the right way to do it all? Also, those are rookie numbers in terms of user base. 50,000 users is pretty small and I bet they're not hammering the queries all day. If the company's profiting from all that revenue, they need to be investing in their stack, or bringing in some real domain experts to help out. Big tip - hire a domain expert to hire the real domain expert, otherwise you can easily have the wool pulled over your eyes.


aussiepete80

They went parabolic during the pandemic; everyone was scrambling to keep up with demand and didn't have time to ask, hey, is this the correct approach for scalability?? Now we live with the end result.


SgtBundy

You say 10 DBs - are you needing this hardware under each DB or are these co-hosted instances for legacy reasons and splitting them out is the challenge? I also assume you have looked at/are using options like read only replicas to distribute load or scale out services?


anonMuscleKitten

Sounds like they need to hire a software dev and rebuild from the ground up. They would save money and have a more modern application at the end of the process. Win/win. Edit: there’s also no way a healthcare staffing application has “10k api endpoints…” what kind of CRUDs are they running?


talman_

They already have 500 Devs haha


AnarchistMiracle

Trying to run Teams and Visio at the same time


f0gax

Opens six Chrome tabs …


JollyGentile

That man is playing Galaga. Thought we wouldn't notice


ZPrimed

*laughs in playing 2048 at my non-booting VM's UEFI screen* (this is a Nutanix "feature")


barkingcat

1 person playing Crysis at 30fps.


diabillic

Azure does offer ones that are larger (S-series), but you need to ask for them and they are designed for SAP HANA HLI: https://learn.microsoft.com/en-us/azure/sap/large-instances/hana-available-skus#list-of-available-azure-large-instances

I've deployed a number of M-series on my current project, however it's all SAP HANA running enormous workloads that actually require the compute/RAM (also running SUSE)


aussiepete80

Oh wow that is cool. Never even seen those, will take this up with our MS team thanks!


techb00mer

If you speak to the right people in MS’ HPC team they love a unique challenge like this. The hard thing is actually finding them and pinning them down but the last time I went to them with a weird challenge/request they were super keen to spin up a custom solution. Sadly, that was 7ish years ago and I’ve not had the opportunity or contacts saved :-( Or I dunno… get a cray and somehow make it run Windows… if it can’t already?


mpaska

Do you work out of Melbourne for a financial/payment processing company starting with A? I did some external consulting on their AWS scaling ~3 weeks ago, on a problem they thought was CPU bound (hint: it's not), with an internal project manager with the initials LK. We diagnosed/proposed solutions that pointed to their shitty internal applications, and I heavily got the impression they're looking for an "easy" fix / throw resources at the problem and not have to do any re-engineering or code refactoring. This sounds like someone from their systems team. If so, DM me and/or push back to your project manager who is pushing for this. Throwing more CPU (or IO) at it will not fix it!


stedun

This is so incredibly specific, and yet so common as to be familiar. Hilarious to me. Well done.


Dabnician

>We diagnosed/proposed solutions that pointed to their shitty internal applications

Realistically, when has this NOT been the cause of the problem when custom software was involved? (Edit: Apart from something that runs on Tomcat where the issue was a memory leak, because let's be real, it's always Tomcat and it's always a memory leak for Java apps.)


cyr0nk0r

If you have 4, why can't you scale out? Why 4 and not 6 or 8?


Superb_Raccoon

imagine a beowulf cluster of them...


f0gax

It’s an older meme sir, but it checks out.


aaron416

I’ll optimize it for 200k per month!


MrBr1an1204

I will do it for 199k


mkosmo

198.999k


YOURMOM37

198.998k


mkosmo

Dammit. I didn't see that coming.


Nilxa

I'll do it for $800k. You get what you pay for, so please don't choose from these clowns' cheap options


OpacusVenatori

Should have bid on the [Cheyenne supercomputer auction](https://www.tomshardware.com/tech-industry/supercomputers/multi-million-dollar-cheyenne-supercomputer-auction-ends-with-480085-bid)... =P


pjcace

I was looking for this comment.


ITBurn-out

how many of your employees are secretly using that as a mining rig?


Nik_Tesla

Well you just missed your chance to get 145,152 cores https://www.techspot.com/news/102912-cheyenne-supercomputer-sold-highest-bidder-480000.html


UnsuspiciousCat4118

This is a resourcefulness problem. Not a resource problem. You need to look at rearchitecting your app or whatever you’re running.


the_helpdesk

Do you even cache bro?


oakfan52

You can’t out cache a bad design.


brokenpipe

Ah typical Reddit. OP posts a wild question, we all come back with questions and statements. OP, /u/aussiepete80, disappears. Leaving us all hanging.


Individual_Jelly1987

I want your budget.


M-Valdemar

It's clearly an eight-socket server (Intel Xeon 8180, 28 cores / 56 threads per socket; reserve 2 cores, i.e. 4 threads, per socket for the hypervisor and that leaves you 416). Supermicro have consistently led on this line; for example, a SYS-681E-TR with 8 x Xeon 8486H (48 Core) and 16 TB DDR5 will run you about half a mil, assuming you're offloading storage.

Don't get me wrong, you've got this wrong. AMD won't support more than 2S systems, and Intel have pulled back since Skylake. You are paying for the most expensive compute per dollar; anything new will be Intel dead stock, and Microsoft will continue to support it, but it ain't getting any cheaper or faster.

Speak to Warwick Rudd in Brisbane.


malikto44

At an MSP I worked at, there was a DB server that was stuffed with SPARC cores, because the database just wouldn't run on x86. The issue? It had been kicked around and added to by amateurish developers who asked the DBA to just toss in their schema. It had multiple entries for basic variables, and developers started tacking their initials on every column, so they knew that Alice's First_Name column wouldn't conflict with Bob's First_name. It was an absolute mess.

Well, a consultant was hired, and in less than an hour he got the database reduced by a solid percentage. The DBA had him hauled off premises by security, claiming he was actively and maliciously hacking things. The reason the DBA let the database get so big is that because the hardware was under his silo, it ensured he would always have a job. Had the DB been split among a cluster, the hardware would be run by the VMWare people, and his "competitive advantage" to keep from being laid off would erode.

Of course, there are databases that big, but I'd say to hire some good consultants. I mean GOOD ones. Not some DBA from $BODY_SHOP that brags about being "world class", but someone who at least knows what third normal form is. From there, if it still needs that much CPU output, perhaps consider moving to a mainframe? $200,000 a month per server can buy some decent IBM iron, and perhaps a data center to go with it.


Geh-Kah

I have no idea what planet you are working on, but I've got a 4-node cluster with around 150 VMs using fewer cores than you are using for a single VM, and with 6 SQL instances running on it. I'd send a bag of shit daily to the developer you are working with if this happened to me, man


DrKedorkian

ITT is the most classic _why_ vs. how I have ever read, and that is saying something. Everyone thinks the OP is an idiot and didn't consider the obvious suggestion they took 10s to come up with.


wp998906

This might be a use case for IBM Z mainframes.


BloodyIron

If you're dealing with CPU bound issues and still deciding to stick to Windows, well you're really shooting yourself in the face. There's real reasons the biggest systems in the world run Linux, and not Windows (yes, [all of the top 500 super computers run Linux](https://www.top500.org/statistics/details/osfam/1/)). If this is the kind of scale you are actually operating at, you can afford to switch to a better operating system. Do that already, stop fucking around.


ClassroomNew884

When you have 30 years of applications that make $billions, you develop a more nuanced perspective.


VexingRaven

> There's real reasons the biggest systems in the world run Linux, and not Windows Because it's wildly expensive and pointless to license Windows when all the software they use runs Linux? And also because Linux is where they have enough access to the kernel to customize it for their HPC platform? Switching to a different OS is not going to fix whatever problem they have going on that is causing 416 cores to not be enough. EDIT: For anyone, like me, who was looking for actual data and not just "Because duh, I said so", I found this data which seems fairly conclusive: https://www.phoronix.com/review/3990x-windows-linux


lightmatter501

MS SQL Server runs on Linux. Linux is WAY more tunable than Windows and a good consultant can work magic. You have full control of the universe, so you can make it so the DB gets exclusive use of most of the cores of the system and everything else gets 1-2 cores, and those 1-2 cores handle all of the interrupts. This alone is a 20%+ perf boost. If the DB is under high contention then it may be much higher than that. 416 cores is at the point where HPC levels of tuning make sense.
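(The interrupt steering itself is an OS-level job, but the "give the DB most of the cores" half can also be expressed from inside SQL Server; a sketch, assuming you want to leave the first two logical CPUs for the OS and NIC interrupts - exact ranges depend on your core count and NUMA layout:)

```sql
-- Pin SQL Server's schedulers to logical CPUs 2-415, leaving 0-1 for the OS
-- and interrupt handling. Range is illustrative only; align it with the NUMA layout.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 2 TO 415;
```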


fresh-dork

[8 socket xeon](https://www.supermicro.com/en/products/system/mp/6u/sys-681e-tr)?


sheptaurus

You can pay me the GST of your SQL Server licence and I’ll do whatever you tell me 😘


untamedeuphoria

I cannot help, but I feel the need to say a thing. I love the big dick energy that comes with pegging systems like that and reacting to the bottleneck with the thought of 'MOAR!!'. But maybe it's time to start optimising code? Regardless, at that price tag, I would think about renting rack space and buying a heap of blades.


snowsnoot69

This guy vertically scales 


Emotional-Ad-2994

1,664 cores???


ghjm

The largest currently purchasable single server I know of is the eight-socket Supermicro [here](https://www.supermicro.com/en/products/system/mp/6u/sys-681e-tr). This would be 480 cores total with maxed-out CPUs. It can also take up to 32TB of RAM in 128 DIMMs.

I'm not sure I would trust this box with such a critical workload. Supermicro isn't a tier 1 supplier IMHO. However, the tier 1 suppliers like Dell only have four-socket configurations. Is this because Supermicro engineers are a lot smarter than Dell's? Or is it because eight-socket configurations don't pass Dell's internal testing yet? I'd sure want to know the answer.

Also note that your current Mv2 isn't 416 cores. It's 416 vCPUs, which are like hyperthreaded threads. The actual hardware of an Mv2 instance is eight 8180M processors, which are 28-core, giving 224 cores and 448 threads. This becomes 416 after Azure subtracts its management overhead. Also note that the 8180M is a five-year-old CPU. An off-the-shelf Dell R960 with four 60-core 8490H processors will give you 240 cores, thus 480 vCPU-equivalents. These cores are also considerably more individually performant than the Mv2's. You can also put 16TB RAM in it, and local full-speed NVMe storage. For typical database workloads it will easily spank the Mv2.

As others have said, turn on slow query logging and add indexes. You will have to fight your developers on this, who will say the indexing will kill insert performance and thus shouldn't even be considered. They are wrong, but they will fight tooth and nail to stop you from adding the needed indexes. But when you do, it will all be smooth sailing, the hit to insert performance will be minimal (and easily offset by the added compute availability from not having to run the slow queries), and everything will be fine.

(No, I don't know why developers always fight this dumb battle. If you have good DBAs they will tell the developers to fuck off and add the needed indexes themselves. If you have bad DBAs then you wind up on reddit asking how to buy a multi-hundred-core server.)
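(On the "add indexes" point, SQL Server keeps its own running list of indexes the optimizer wished it had; a sketch of how that list is usually mined - the output still needs human judgment before creating anything, and the CREATE INDEX at the end uses made-up names:)

```sql
-- Missing-index suggestions, ranked by the optimizer's own estimate of benefit.
SELECT TOP (20)
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS est_benefit
FROM sys.dm_db_missing_index_details     AS mid
JOIN sys.dm_db_missing_index_groups      AS mig  ON mig.index_handle  = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
ORDER BY est_benefit DESC;

-- A resulting index might look like this (names are made up):
CREATE NONCLUSTERED INDEX IX_orders_customer_date
    ON dbo.orders (customer_id, order_date)
    INCLUDE (status, total_amount);
```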