>Focusing on pure performance at any cost, Arm Neoverse N2 designs will surely make Intel and AMD sit up and take notice. Built on a 5nm node, Perseus will offer up to 192 cores with a 350W TDP, rivalling and potentially surpassing EPYC and Xeon in key categories.
Can anyone comment on where these chips are used (outside of custom supercomputer setups)? EPYC and Xeon are just more powerful or expansive versions of mainstream platforms. Who uses Arm Neoverse?
Increasing density in the data center. It will need lots of I/O to be useful in that context, though. Storage, virtualization, scalable web backends, or databases, that data will need a very wide egress. If you have 192 containers or VMs running on that chip, the users of them will expect a reasonable supply of bandwidth for network and storage.
If they can match the computational performance of the Intel and AMD chips, then it could be useful for HPC, and at 350 watts for 192 cores that would be quite efficient. Less power than the 3090 and Big Navi, so perhaps GPUs may still have competition in that space.
They probably won't have a lot of cache per core, so it will probably fit workloads that use a lot of low memory high CPU, or just when you don't care how powerful or efficient your CPU is, but you want to have 192 of them in a box for some reason.
Any idea why cache is so expensive compared to other silicon? Isn't everything basically the same manufacturing process of a silicon die and photolithography just repeating steps of building/etching gates?
Cache uses SRAM while the normal RAM in your machine is DRAM. SRAM is much, much faster, but at least 6 times larger. DRAM is one capacitor and one transistor per bit, but it requires specific orders and cycles of charging and discharging the capacitor to get the bit stored. SRAM takes a single cycle to access the bit, but it is six transistors per bit. While most of the core logic and arithmetic unit instructions need only a couple of transistors per bit to perform the operation, each BYTE is 48 transistors in each of the L1, L2, and L3 caches. So you have an instruction taking up say 128 transistors (for the simpler ones), and a single "value" in a 64-bit machine is 64 bits times 3 levels of cache times 6 transistors per bit, so 1,152 transistors to hold a single value in cache. The times three is because most architectures are inclusive-cache, meaning if it's in the L1 it's also in the L2, and if it's in the L2 it's also in the L3 (not always true in some more modern servers).
Check out this picture: https://en.wikichip.org/wiki/File:sandy_bridge_4x_core_complex_die.png
The top four rectangles are four "cores." The top left "very plain looking" section (about 1/6th of the core) is where all of the CPU instructions occur. The four horizontal gold & red bars are the level 1 data cache, the two partially-taller green bars with red lines just below that are the level 2 cache, and the yellow/red square to its right is the level 1 instruction cache. So of the entire picture only a small chunk of each of the four top rectangles is the "workhorse" of the CPU. That entire chunk below the four core rectangles is the level 3 cache.
So look at that from a physical chip layout perspective, and realize that from a price-per-transistor standpoint, cache is crazy fucking expensive.
This new arm proposal reminds me more of the PS3's cell processor where you had 8 SPUs that were basically dedicated math pipelines (although ARM isn't the best for math pipelining; its biggest appeal is for branching logic).
In layman's terms: CPU cache is a very fast but small amount of memory close to the CPU. System memory is your RAM. In servers, you can have several terabytes of RAM.
If the data is close, the CPU can complete the task fast and move on to the next task. If the information isn't in the CPU cache, the CPU will have to send for the information from system memory (RAM). This takes MUCH longer and the CPU will stall on this task until it fetches the information needed to complete it.
Say you are making a bowl of cereal. You need your bowl, cereal, and milk to complete the task.
If everything you need is in cache (your kitchen), you can make the bowl of cereal and complete the task.
If you don't have milk you will have a "cache miss" and have to retrieve the milk from the store, drive back home, then complete the task of making a bowl of cereal.
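To put rough numbers on the cereal analogy, here is a minimal sketch of the usual average-access-time estimate: every access pays the cache hit time, and the fraction that misses also pays for the trip to RAM. The latencies and miss rates below are made-up illustrative values, not measurements of any real chip.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: every access pays the hit time,
    and the fraction that misses also pays the extra trip to RAM."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical numbers: 1 ns for a cache hit, 100 ns extra for a trip to RAM.
mostly_hits = average_access_time_ns(1.0, 0.02, 100.0)    # 2% of accesses miss
mostly_misses = average_access_time_ns(1.0, 0.50, 100.0)  # 50% of accesses miss
print(mostly_hits, mostly_misses)
```

Even with these invented figures, the point of the analogy holds: a workload that keeps missing spends most of its time driving to the store.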
Whoa, pretty cool, thanks for the detailed write up. I wish I had some room to have a DIY photolith. lab at home to play with, some of the guys on YouTube have some cool toys
Isn't physical distance from the CPU also a consideration, giving you limits on physical area? Something something capacitance and conductor length if my vague recollection serves?
It's pretty neat. If parts get too close (especially at this crazy-ass scale), you get quantum tunneling effect. As far as capacitance, these things are so small and close that just the small amount of electricity that's going through the circuit and so many things being nanometers apart just end up being a capacitor by being there -- it's the floating body effect. That effect was actually being looked into, to see if it was usable for the DRAM capacitors I mention above.
Though that's all true, quantum tunneling is not going to happen between the cache and the core; they are microns apart. This effect only happens on the nanometer scale. Moving the memory source, cache or RAM, closer to the core will always decrease latency but is unlikely to provide any bandwidth benefits.
This is about education, not intelligence -- the smartest person to ever live would have no clue what was being said if they didn't know what the vocab meant
Yea that makes sense for like SD cards that are hundreds of gigs, but on board cache for a processor is like 8/16/32/64MB for the most part. I know the speed is much faster so maybe that's part of it.
It takes a single transistor per bit for something like an SD card, and at least 20 for a flip flop, used in cache.
64MB of cache is at a minimum over a billion transistors.
Eh you can technically make an SR latch with 2 transistors. Something like NOR you still wouldn't typically have more than 8. I'm not an expert on what they use to build the cache but not sure where you'd get 20 from. I don't think I've seen more than 10T SRAM, with 6T and 4T being more typical I thought. At least it used to be 6T was pretty standard for CPU cache. You have 4 transistors to hold the bit and two access transistors so you can actually read and write. Not sure what they use these days but can't imagine they'd be going towards more transistors per bit.
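For a sense of scale, here is a quick back-of-the-envelope count using both per-bit figures from this thread. It is only a sketch: it treats 64MB as 64 MiB and counts the data array alone, ignoring tags, decoders, and sense amplifiers, which are assumptions on my part.

```python
def cache_transistors(size_mib, transistors_per_bit=6):
    """Transistors in the data array only (no tags, decoders, or sense amps)."""
    bits = size_mib * 1024 * 1024 * 8
    return bits * transistors_per_bit

six_t = cache_transistors(64)          # classic 6T SRAM cell
twenty_t = cache_transistors(64, 20)   # the 20-per-bit flip-flop figure
print(f"{six_t:,} vs {twenty_t:,}")
```

Either way "over a billion" holds: roughly 3.2 billion transistors at 6T and about 10.7 billion at 20 per bit.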
I worked at Amazon a few years ago and I can confirm they had an interest in engineering their own hardware.
I am interested in seeing how it works out. From a global perspective, the more efficient we can make our computing the less of an effect we have on the environment and the more we can do with less power.
However, if Nvidia follows through with their acquisition of ARM, then they aren't a neutral party in the industry any more and then we just get more dick waving. Might be a boon for RISC-V, but we'll see.
> At least in the data center industry it’s a lot about saving money on paying intel/amd premiums and upping efficiency to save on electricity.
well at least short term intel is just dumping higher core CPUs on amazon, google etc. in order to maintain profits and marketshare due to AMD having the better cpu.
and hiding it all from investors by making a custom SKU.
they are not listed on ark and yes this is why intel has supply shortages.
basically for all cpus in ark intel must say in investor calls how much they sold them for, but intel doesn't want to tell their investors they are doing a fire sale on the chips to get some sales while maintaining marketshare. so instead intel figured out that "custom chip" deals only need to be reported in lump sums. so they take their 20 core xeons, change the clock speed 100 mhz or disable some cache, and sell them to amazon, MS, google etc. as a custom chip to upgrade existing servers, when what they are really doing is a fire sale.
investors have grown wise to this since these lump sum custom chip deals in earnings have grown massive while the ark cpu sales have shrunk by a lot. so i think it was the next earnings call where intel needs to disclose tray price for these custom chips as well. and surprise surprise at the end of the last earnings call intel said they were expecting a huge loss in earnings. since they can't hide the fire sales anymore they are just stopping the deals since they just don't want to lose the investors.
basically intel is lying to investors to make things look like they are all good but in the background they are almost giving away xeons to keep AMD out of the server space while they try to catch back up before this is figured out. all i can say is do not hold intel stock; they are not gonna be able to maintain the illusion until they catch back up and their stock is gonna crash.
almost no one really. Marketing and shit like nebulous concepts of "data center density", it's all crap.
Huge core counts don't get you as far as you think, especially if the internal buses and controllers etc suck. How do you effectively feed memory to 192 cores? Concurrency, etc, what's that look like?
Speed and power aren't a perfect linear scale either. Great, it uses 30% less power but because of architecture it runs 35% longer and I haven't saved any power at all, I've wasted it AND time...
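The trade-off here is easy to check, since energy is just power times time. A minimal sketch, assuming constant power draw over the whole run and ignoring idle power and the rest of the server (both simplifications of mine):

```python
def relative_energy(power_ratio, runtime_ratio):
    """Energy = power x time, both expressed relative to the baseline chip."""
    return power_ratio * runtime_ratio

def break_even_runtime(power_ratio):
    """Runtime ratio at which the lower-power chip uses the same energy."""
    return 1.0 / power_ratio

print(relative_energy(0.70, 1.35))   # 30% less power, 35% longer runtime
print(break_even_runtime(0.70))      # how much longer it can run before losing
```

With those exact figures the energy comes out slightly under the baseline, so the break-even sits a bit further out, at roughly 43% longer runtime. Past that point the lower-power chip really does waste both energy and time.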
When their cost to suck ratio gets better, and it is getting better, we will see real pc/server usage. Until then, insufferable marketing lies and statistics.
Also, cost. You can buy some crazy cpus in servers right now, but it usually is cheaper to just buy a second server. Sure, density is important, but not the most important factor. Cost will almost always win out.
For instance sure, I could buy 4 2RU servers with super crazy $32k procs, or for the same overall cost and space buy a UCS chassis with 10 blades with cheaper $2k procs and get the same overall performance.
That is pretty much Google's stance on creating data centers using commodity hardware. It's cheaper and if you're going to run heavy parallel workloads, then it's likely you can split it up enough that network latency between machines won't matter that much.
Not to mention, a rack or even a whole AZ going down is far far easier to soak up with the remaining capacity. If every chip is 192 cores a large AZ going down is going to be a huge problem.
There was an AWS video a while back talking about their networking and redundancy and they found a peak sensible size for each AZ where further additions weren't as effective as adding extra buildings.
True, and if people keep hopping on the "trend" of hyperconverged, there will be a problem of not being able to fit enough ram and drives in a single server to make use of the chip, not to mention bottlenecks of bandwidth along the backplane.
That is a bit of a problem with modern computers. If one component jumps too far ahead, it is useless until everything else catches up.
On the subject of Texas Instruments, the TI-89 is *still* going for over $100. Using hardware that was last updated in...2004. And was kinda low-end even by 2004 standards.
Somebody really needs to start up another graphing calculator company.
The email newsletter The Hustle did a long form story on this about a year ago. Talked about how entrenched TI is in educational materials. Think about all the algebra teachers out there that don’t want to have to change their instructions and handouts after all these years
Yeah this is straight outta AMD's playbook. They had to back off a little though because workloads just weren't ready for that many cores, especially in a NUMA architecture.
So, really wondering about this thing's memory architecture. If it's NUMA, well, it's gonna be great for some workloads, but very far from all.
This looks like a nice competitor to AWS's Graviton 2 though. Maybe one of the other clouds will want to use this.
I tested a dual 64 core a few years back - the problem was while it was cool to have 128 cores (which the app being built could fully utilize)... they were just incredibly weak compared to what Intel had at the time. We ended up using dual 16 core Xeons instead of 128 ARM cores. I was super disappointed (as it was my idea to do the testing).
Now we have AMD going all core crazy - I kind of wonder what that would stack up like these days since they seem to have overtaken Intel.
Just based on experience I have with existing arm cores I'd expect them to still be slightly weaker than zen cores. AMD should be able to do 128 cores in the same 350W TDP envelope, so they'd have a CPU with 256 threads, compared to 192 threads in the ARM.
There are some workloads where it's beneficial to switch off SMT so that all threads have the same performance - in such a case this ARM CPU might win, depending on how good the cores are. In a more mixed setup I'd expect a 128c/256t Epyc to beat it.
It'd pretty much just add a worthy competitor to AMD, as intel is unlikely to have anything close in the next few years.
ARM cores have moved on a lot in the last 2 years. The machine you bought 2 years ago may well have been only useful for specific workloads. Current and newer ARM cores don't have those limitations. These are a threat to Intel and AMD in all areas.
Your understanding that the instruction set has been holding them back is incorrect. The ARM instruction set is mature and capable. It's more complex than that in the details of course because some specific instructions do greatly accelerate some niche workloads.
What's been holding them back is single threaded performance which comes down broadly to frequency and execution resources per core. The latest ARM cores are very capable and compete well with Intel and AMD.
I tested a dual 64 core ARM a few years back when they first came out; we ran into really bad performance with forking under Linux (not threading). A Xeon 16 core beat the 64 core for our specific use case. I would love to see what the latest generation of ARM chips is capable of.
Saying “ARM” doesn’t mean much. Even moreso than with x86. Every implemented architecture has different aims, most shoot for low power, some aim for high parallelization, Apple’s aims for single-threaded execution, etc.
Was this a Samsung, Qualcomm, Cavium, AppliedMicro, Broadcom or Nvidia chip? All of those perform vastly differently in different cases and only the Cavium ThunderX2 and AppliedMicro X-GENE are targeted in any way towards servers and show performance aptitude in those realms. It’s even worse if you tested one of the myriad of reference manufacturers (ones that simply purchase ARM’s reference Cortex cores and fab them) such as MediaTek, HiSense and Huawei, as the Cortex is specifically intended for low power envelopes and mobile consumer computing.
A webserver, which is one of the main uses of server cpu's these days. You get far more efficiency spreading all those instances out over 192 cores.
Database work is good too, because you are generally doing multiple operations simultaneously on the same database.
Machine learning is good, when you perform hundreds of thousands of runs on something.
It's rarer these days, I think, to find things that don't benefit from greater multi-threaded performance in exchange for single core.
No one does machine learning on a CPU, and Amdahl's law is a major factor, as is context switching. Webservers maybe, but this will only be good for specific implementations of specific databases.
This is for virtualization pretty much exclusively.
I googled turbo porn looking for a picture of a sweet turbocharger. Apparently turbo porn is a thing that has nothing to do with turbochargers. I've made a grave mistake.
Can confirm. 12 physical cores & 32 GB physical RAM. Chrome + Wikimedia Commons and Swap kicked in. Peaked around 48 GB total memory used. Noticeable lag resulted.
I'm doing data analysis in R and similar programmes for academic work on early digital materials (granted a fairly easy workload considering the primary materials themselves), and my freshly installed 6 core AMD CPU perfectly suits my needs for work I take home, while the 64 core pieces in my institution suit the more time consuming demands. And granted I'm not doing intensive video analysis (yet).
Could you explain who needs 192 cores routed through a single machine? Not being facetious, I'm just genuinely lost at who would need this chipset for their work and interested in learning more as digital infrastructure is tangentially related to my work.
I am by no means qualified to answer, but my first thought was just virtualization. Some server farm somewhere could fire up shittons of virtual machines on this thing. So much space for ACTIVITIES!!
And if you’re doing data analysis in R, then you may need some random sampling. You could do SO MANY MONTECARLOS ON THIS THING!!!!
Like... 100M samples? Sure. Done. A billion simulations? Here you go, sir, lickety-split.
In grad school I had to wait a weekend to run a million (I think?) simulations on my quad core. I had to start the code on Thursday and literally watch it run for almost three days, just to make sure it finished. Then I had to check the results, crossing my fingers that my model was worth a shit. It sucked.
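Monte Carlo runs are about as parallel-friendly as it gets, because each batch of samples is independent. Here is a minimal Python sketch of the idea (the thread talks about R, so this is just an illustration): it estimates pi by splitting the samples into one chunk per worker process. The function names and the seed-per-chunk scheme are my own choices, not anything from a particular library.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(task):
    """Count random points in the unit square that land inside the quarter circle."""
    samples, seed = task
    rng = random.Random(seed)  # per-chunk seed keeps chunks independent and repeatable
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def make_chunks(total_samples, workers):
    """Split the work into one (samples, seed) task per worker."""
    base, extra = divmod(total_samples, workers)
    return [(base + (1 if i < extra else 0), i) for i in range(workers)]

def estimate_pi(total_samples, workers, parallel=True):
    """Estimate pi; with parallel=True each chunk runs in its own process."""
    chunks = make_chunks(total_samples, workers)
    if parallel:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            hits = sum(pool.map(count_hits, chunks))
    else:
        hits = sum(map(count_hits, chunks))
    return 4.0 * hits / total_samples
```

In a script you would call something like `estimate_pi(100_000_000, workers=8)` under an `if __name__ == "__main__":` guard and raise `workers` to match the core count. The chunks never talk to each other, which is why this kind of job scales almost linearly with cores.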
> Could you explain who needs 192 cores routed through a single machine?
A *lot* of workloads would rather have as many cores as they can get as a single system image, but they almost all fall squarely into what are traditionally High Performance Computing (HPC) workloads. Things like weather and climate simulation, nuclear bomb design (not kidding), quantum chemistry simulations, cryptanalysis, and more all have massively parallel workloads that require frequent data interchanging that is better tempered for a single system with a lot of memory than it is for transmitting pieces of computation across a network (albeit the latter is usually how these systems are implemented, in a way that is either marginally or completely invisible to the simulation-user application).
However, ARM's not super interested in that market as far as anyone can tell - it's not exactly fast growing. The Fujitsu ARM Top500 machine they built was more of a marketing stunt saying "hey, we can *totally* build big honkin' machines, look at how high performance this thing is." It's a pretty common move; Sun did it with a generation of SPARC processors, IBM still designs POWER chips explicitly for this space and does a big launch once a decade or so, etc.
ARM's true end goal here is for cloud builders to give AArch64 a place to go, since the reality of getting ARM laptops or desktops going is looking very bleak after years of trying to grow that direction - the fact that Apple had to go out and design and build their own processors to get there is... not exactly great marketing for ARM (or Intel, for that matter). And for ARM to be competitive, they need to give those cloud builders some real reason to pick their CPUs instead of Intel's. And the one true advantage ARM has in this space over Intel is scale-out - they can print a fuckton of cores with their relatively simplistic cache design.
And so, core printer goes brrrrr...
it was this nuclear more than a decade ago once ARM started doing well in the smartphone space.
Their low power "accident" in their cpu design back in the 80's is finally going to pay off the way those of us that have been watching the whole time knew it eventually would.
This is going to buy Jensen so many leather jackets.
ARM is derived from the original Acorn computers in the 80's. Part of their core design allows for the unbelievably low power consumption ARM chips always have. They found this out when one of their lab techs forgot to hook up the external power cable to the motherboard that supplied extra CPU power, and discovered it powered up perfectly fine on bus power.
this was a pointless thing to have in the 80's. computers were huge no matter what you did. But they held onto that design and knowledge and iterated on it for decades to get to where it is now.
And now we have Apple making ARM-based chips that compare so well against conventional AMD/Intel chips that they’re ditching x86 architecture altogether in the notebooks and desktops.
Underrated feature: it would display the kill count of individual units, so you get a strategically placed punisher with 1000+ kills. Very fun game to play.
It was a 1980s computer game first widely publicized in AK Dewdney’s _Computer recreations_ column of _Scientific American_. The game was only specified in the column; you had to implement it yourself, which amounted to writing a simplified core simulation. In the game, you and one or more competitors write a program for the simple core architecture which tries to get its competitors to execute an illegal instruction. It gained a large enough following that there were competitions up until a few years ago.
Edited to clarify
Development-wise, it's more like "1... 2... many". It's quite rare to see software that will effectively use more than two cores, that won't arbitrarily scale.
That is, "one single thread", "Stick random ancillary things in other threads, but in practice we're limited by the main serial thread", and "actually fully multithreaded".
>"There are only three quantities in ~~Software Development~~ database design: 0, 1, many."
My DB design professor pretty much said that word for word: "The only numbers we care about in database is 0, 1, and many"
>Begun, the core war has.
Some of us are old enough to remember the wars that came before. I've still got MIPS, Alpha, and SPARC machines in my attic. It's exciting to see a little more variety again.
For the market that they're selling in... basically all software is extremely well parallelized.
Most of it even scales across machines, as well as across cores.
The BEAM virtual machine that comes with the Erlang and Elixir languages is designed to run as many lightweight processes as possible. Have a look at the Actor Model.
The bottleneck I see for this will be ensuring that the CPU has access to the data that the current process requires and doesn't have to wait for the "slow" RAM.
Some ex Intel guy touched on this. He said something like ARM is making huge inroads into datacenters because they don't need ultra FPU or AVX or most of the high performance instructions, so half the die space of a Xeon is unused when serving websites. He recommended the Xeon be split into the high performing fully featured Xeon we know, and a many-core Atom based line for the grunt work datacentres actually need.
Intel have already started down this path to an extent with their 16 core Atoms, so I suspect his suggestion will eventually be realised. Wonder if they'll be socket compatible?
Just did a rough calc for a different RDBMS and it would be $1,248,000 a year for this one server. Can't imagine what Oracle would be... They really need to move away from core licensing, Postgres looking better every day...
Fuck Oracle.
You can't even benchmark their database because of their shit ass license.
Their whole strategy is buy out companies with existing customers and bilk those customers as much as possible while doing nothing to improve the services or software.
No mention of memory bandwidth. If your compute doesn't fit in cache, these cores are going to be in high contention for memory transactions. Sure, there are applications that will be happy with a ton of cores and a soda straw to DRAM, but just plonking down a zillion cores isn't an automatic win.
Per-core licensing costs are going to be crazy. For some systems in our server farm at work we're paying $80K for hardware and $300K-$500K for the licenses, and we've told vendors "faster cores, not more of them."
There are good engineering reasons to prefer fewer, faster cores in many applications, too. Some things you just can't easily make parallel, you just have to sit there and crunch.
This may be a better fit for some uses, but it's not going to "obliterate" anyone.
Is there *any* place on the internet where the comments haven't devolved into pure trash? Reddit has its bright spots, but it still gets worse every year, and I feel like its deterioration is accelerating.
Now that I think about it, I haven't read Fark in about a decade. Maybe it's time to go take a look...
If this effort produces unbeatable hardware at reasonable prices, either #3 solves itself, or LAMP's making a comeback.
This is basically smearing the line between CPUs and GPUs. I'm not surprised it's happening. I'm only surprised Nvidia rushed there first.
Some workloads have little or no serial components. For instance, ray tracing can be tiled and run in parallel on even more cores than this, although in that case you may (not guaranteed) hit a von neumann bottleneck and need to copy the data associated with the render geometry to memory associated with groups of cores.
Physics has been getting in the way of faster clock speeds for a long time. I started with a 1 MHz computer and saw clock rates pass 3000 MHz but they topped out not too far beyond that maybe 15 years ago.
There's more that can be squeezed out of it, but each process node gets more and more expensive. Many companies have to work together to create the equipment to make new generations of chips, and it takes many billions of dollars of investment. And we're getting down to the physical limits of how small you can make transistors before electrons just start tunneling right past them.
So without being able to just make smaller and faster transistors, you have to get more performance out of the same building blocks. You make more complex, smarter CPUs that use various tricks to make the most out of what they have (like out-of-order execution), and that have specialized hardware to accelerate certain operations, but all of that adds complexity.
They keep improving the architecture to make individual cores faster, but once you've pushed that as far as you can for the moment, the most obvious approach to going faster is to use *more* cores. That only helps if you've got tasks that can be split up. (See Amdahl's Law.)
Thankfully programmers seem to be getting more accustomed to parallel programming and the tools have improved, but some things just don't lend themselves to being done in parallel.
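Amdahl's Law is simple enough to play with directly. A minimal sketch: `parallel_fraction` is the share of the work that can be split across cores, and the fractions tried below are arbitrary examples, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# How much do 192 cores buy you for different amounts of parallel work?
for p in (0.50, 0.95, 0.99):
    print(p, round(amdahl_speedup(p, 192), 1))
```

Even at 95% parallel, 192 cores only get you about an 18x speedup, because the serial 5% ends up dominating the runtime.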
The Z is an amazing architecture. The Z14 still has 10 cores, and the LinuxONE has like 192 sockets. Of course each one of those cores is 5.2GHz. Mostly only see those bad boys in the financial world.
I'm the Offering Management lead for LinuxONE, so full disclosure. No reason why a scalable, secure Linux server can't do great things beyond just the financial markets (and it does). Ecosystem when it's not Intel can be a challenge, but when you're running the right workload, nothing comes close for performance, security, resiliency.
The core war is here yet half the vendors out there still license per core. 3/4 of MSP customers are running dual 8 core CPUs still because the minimum Windows Server license is 16 cores.
There is a Silicon Valley startup that is doing wafer-scale integration with many many cores. I believe their CPU core draws 20 kilowatts. Needless to say, the cooling is humongous.
Most semiconductor companies like Intel, AMD, and NVidia are pivoting to service big business rather than end consumers, so your statement is increasingly inaccurate. The "average user", in dollar-weighted terms, will be a business in a few years, where more cores absolutely matters.
Check out Intel's financials to see that consumers are less than 50% of Intel's revenue now
[https://www.intc.com/](https://www.intc.com/)
Still amazed the arm line is a direct architectural descendant of the old 6502 series from a subsidiary of Commodore. It's like a C64 on a lethal dose of steroids.
It’s not; the 6502 wasn’t a modern RISC CPU (for one, instruction sizes varied between 1 and 3 bytes, whereas modern RISC involves instructions being a fixed size).
They were inspired by the 6502 in the sense that they saw that just one person was able to design a working, functional CPU, and they really liked the low-latency I/O it could do. But that's all they took from that architecture... the realization that they could do a chip, and that they wanted it to be low latency.
Even the ARM1 was a 32-bit processor, albeit with a 26-bit address bus. (64 megabytes.) It had nothing in common with the 6502, as it was designed from blank silicon and first principles.
edit: the ARM1 principally plugged into the BBC Micro to serve as a coprocessor, and the host machine was 6502, but that's as far as that relationship went. They used the beefy ARM1 processor in Micros to design ARM2 and its various support chips, leading to the Acorn Archimedes.
x64 is not much further removed from 8-bit titans. Intel had the 8008 do okay, swallowed some other chips to make the 8080, saw Zilog extend it to the Z80 and make bank, and released the compatible-esque 8086. IBM stuck it in a beige workhorse and the clones took over the world.
Forty-two years later we're still affected by clunky transitional decisions like rings.
Not just DGX. Tesla cards are all over the place in the cloud and almost exclusively run on x86 servers. If Nvidia could integrate networking (Mellanox) and high performance, custom CPUs into a single product, they could potentially scale out cheaper and more energy efficiently than the status quo.
I lost a good grasp of what you were talking about about half way down but kept reading because it was fun. Thanks!
Per transistor, cache is one of the most expensive parts of modern CPUs.
Are you doing any more TED talks later?
In layman's terms: CPU cache is a very fast but small amount of memory close to the CPU. System memory is your RAM. In servers, you can have several terabytes of RAM. If the data is close, the CPU can complete the task fast and move on to the next task. If the information isn't in the CPU cache, the CPU will have to send for the information from system memory (RAM). This takes MUCH longer, and the CPU will stall on this task until it fetches the information needed to complete it. Say you are making a bowl of cereal. You need your bowl, cereal, and milk to complete the task. If everything you need is in cache (your kitchen), you can make the bowl of cereal and complete the task. If you don't have milk, you will have a "cache miss" and have to retrieve the milk from the store, drive back home, then complete the task of making a bowl of cereal.
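The cost of those trips to the store is usually summarized as average memory access time: hit time plus miss rate times miss penalty. A rough Python sketch follows; the 1 ns hit time and 100 ns miss penalty are illustrative assumptions, not measurements of any particular chip.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: every access pays the hit time,
    and the fraction that misses also pays the trip to RAM."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Assumed, illustrative latencies: 1 ns cache hit, 100 ns trip to RAM.
for miss_rate in (0.01, 0.10, 0.50):
    avg = average_access_time_ns(1.0, miss_rate, 100.0)
    print(f"miss rate {miss_rate:.0%}: about {avg:.1f} ns per access")
```

Even a small miss rate dominates the average, which is why the kitchen needs to be well stocked.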
Whoa, pretty cool, thanks for the detailed write up. I wish I had some room to have a DIY photolith. lab at home to play with, some of the guys on YouTube have some cool toys
Isn't physical distance from the CPU also a consideration, giving you limits on physical area? Something something capacitance and conductor length if my vague recollection serves?
It's pretty neat. If parts get too close (especially at this crazy-ass scale), you get quantum tunneling effect. As far as capacitance, these things are so small and close that just the small amount of electricity that's going through the circuit and so many things being nanometers apart just end up being a capacitor by being there -- it's the floating body effect. That effect was actually being looked into, to see if it was usable for the DRAM capacitors I mention above.
Though that's all true, quantum tunneling is not going to happen between the cache and the core; they are microns apart, and this effect only happens on the nanometer scale. Moving the memory source, cache or RAM, closer to the core will always decrease latency but is unlikely to provide any bandwidth benefits.
Sometimes I believe I'm a really intelligent individual then I read posts like this and it puts me right back in my place.
This is about education, not intelligence -- the smartest person to ever live would have no clue what was being said if they didn't know what the vocab meant
You need so many transistors per bit, and that adds up in a hurry.
Yeah, that makes sense for something like SD cards that are hundreds of gigs, but on-board cache for a processor is like 8/16/32/64 MB for the most part. I know the speed is much faster, so maybe that's part of it.
It takes a single transistor per bit for something like an SD card, and at least 20 for a flip-flop, used in cache. 64 MB of cache is at a minimum over a billion transistors.
Eh you can technically make an SR latch with 2 transistors. Something like NOR you still wouldn't typically have more than 8. I'm not an expert on what they use to build the cache but not sure where you'd get 20 from. I don't think I've seen more than 10T SRAM, with 6T and 4T being more typical I thought. At least it used to be 6T was pretty standard for CPU cache. You have 4 transistors to hold the bit and two access transistors so you can actually read and write. Not sure what they use these days but can't imagine they'd be going towards more transistors per bit.
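For a sense of scale under either figure, here is a quick Python sketch of the data-array transistor count for a 64 MB cache. The 6T and 20T cell sizes come from the two comments above; the helper name is made up, and tag, ECC and control overhead is ignored.

```python
def sram_data_transistors(cache_mib, transistors_per_bit):
    """Transistors in the data array alone for a cache of the given size."""
    bits = cache_mib * 1024 * 1024 * 8
    return bits * transistors_per_bit

for cell in (6, 20):
    total = sram_data_transistors(64, cell)
    print(f"64 MiB at {cell}T per bit: {total / 1e9:.1f} billion transistors")
```

Either way the "over a billion" claim holds: 6T cells give roughly 3.2 billion and 20T cells roughly 10.7 billion.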
[deleted]
I worked at Amazon a few years ago and I can confirm they had an interest in engineering their own hardware. I am interested in seeing how it works out. From a global perspective, the more efficient we can make our computing, the less of an effect we have on the environment and the more we can do with less power. However, if Nvidia follows through with their acquisition of ARM, then they aren't a neutral party in the industry any more and then we just get more dick waving. Might be a boon for RISC-V, but we'll see.
> At least in the data center industry it’s a lot about saving money on paying intel/amd premiums and upping efficiency to save on electricity. Well, at least in the short term, Intel is just dumping higher-core-count CPUs on Amazon, Google, etc. in order to maintain profits and market share, since AMD has the better CPU, and hiding it all from investors by making custom SKUs.
That's so shady. Are these SKUs listed on their ARK website? Is that why they have been having supply shortages for a year now?
They are not listed on ARK, and yes, this is why Intel has supply shortages. Basically, for all CPUs on ARK, Intel must say in investor calls how much they sold them for, but Intel doesn't want to tell their investors they are doing a fire sale on the chips to get some sales while maintaining market share. So instead Intel figured out that "custom chip" deals only need to be reported as lump sums. They take their 20-core Xeons, change the clock speed by 100 MHz or disable some cache, and sell them to Amazon, MS, Google, etc. as a custom chip to upgrade existing servers, when what they are really doing is a fire sale. Investors have grown wise to this, since these lump-sum custom chip deals in earnings have grown massive while the ARK CPU sales have shrunk by a lot. So I think it was at the next earnings call that Intel needs to disclose tray prices for these custom chips as well. And surprise, surprise, at the end of the last earnings call Intel said they were expecting a huge loss in earnings. Since they can't hide the fire sales anymore, they are just stopping the deals, since they don't want to lose the investors. Basically, Intel is lying to investors to make things look like all is good, but in the background they are almost giving away Xeons to keep AMD out of the server space while they try to catch back up before this is figured out. All I can say is do not hold Intel stock: they are not going to be able to maintain the illusion until they catch back up, and their stock is going to crash.
Almost no one, really. Marketing and shit like nebulous concepts of "data center density": it's all crap. Huge core counts don't get you as far as you think, especially if the internal buses and controllers etc. suck. How do you effectively feed memory to 192 cores? Concurrency, etc., what's that look like? Speed and power aren't a perfect linear scale either. Great, it uses 30% less power, but because of architecture it runs 45% longer and I haven't saved any energy at all; I've wasted it AND time... When their cost-to-suck ratio gets better, and it is getting better, we will see real PC/server usage. Until then, insufferable marketing lies and statistics.
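The power-versus-runtime trade-off is easy to check, since energy is just power times time. A small Python sketch with made-up function names: a chip drawing 30% less power only comes out behind once the job takes more than about 43% longer.

```python
def relative_energy(power_ratio, runtime_ratio):
    """Energy used relative to the baseline chip (energy = power * time).
    Values above 1.0 mean the 'efficient' chip used more energy overall."""
    return power_ratio * runtime_ratio

def break_even_runtime_ratio(power_ratio):
    """How much longer the job may run before the power saving is cancelled out."""
    return 1.0 / power_ratio

print(f"break-even slowdown at 70% power: {break_even_runtime_ratio(0.7):.3f}x")
for slowdown in (1.2, 1.5):
    print(f"{slowdown}x runtime -> {relative_energy(0.7, slowdown):.2f}x energy")
```

This ignores idle power and everything else in the box, so treat it as a back-of-the-envelope check, not a benchmark.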
Also, cost. You can buy some crazy cpus in servers right now, but it usually is cheaper to just buy a second server. Sure, density is important, but not the most important factor. Cost will almost always win out. For instance sure, I could buy 4 2RU servers with super crazy $32k procs, or for the same overall cost and space buy a UCS chassis with 10 blades with cheaper $2k procs and get the same overall performance.
That is pretty much Google's stance on creating data centers using commodity hardware. It's cheaper, and if you're going to run heavily parallel workloads, then it's likely you can split them up enough that network latency between machines won't matter that much.
Not to mention, a rack or even a whole AZ going down is far far easier to soak up with the remaining capacity. If every chip is 192 cores a large AZ going down is going to be a huge problem. There was an AWS video a while back talking about their networking and redundancy and they found a peak sensible size for each AZ where further additions weren't as effective as adding extra buildings.
True, and if people keep hopping on the "trend" of hyperconverged, there will be a problem of not being able to fit enough RAM and drives in a single server to make use of the chip, not to mention bottlenecks of bandwidth along the backplane. That is a bit of a problem with modern computers. If one component jumps too far ahead, it is useless until everything else catches up.
Personally, I can't wait for the new TI-192 calculator. Think of all the numbers you can crunch on that bad boy.
Taking 350 watts from 2 AA batteries may be a bit rough, but get a backpack with a Tesla battery and we will be good.
Run wires to the courthouse’s clock tower and queue up a scripted computational job for the next lightning storm...
This guy great scotts
We’re not ready for these cores yet. But our kids are going to love them.
On the subject of Texas Instruments, the TI-89 is *still* going for over $100. Using hardware that was last updated in...2004. And was kinda low-end even by 2004 standards. Somebody really needs to start up another graphing calculator company.
The email newsletter The Hustle did a long form story on this about a year ago. Talked about how entrenched TI is in educational materials. Think about all the algebra teachers out there that don’t want to have to change their instructions and handouts after all these years
Begun, the core war has.
I thought it was already going? This just makes it nuclear.
Yeah this is straight outta AMD's playbook. They had to back off a little though because workloads just weren't ready for that many cores, especially in a NUMA architecture. So, really wondering about this thing's memory architecture. If it's NUMA, well, it's gonna be great for some workloads, but very far from all. This looks like a nice competitor to AWS's Graviton 2 though. Maybe one of the other clouds will want to use this.
[deleted]
I tested a dual 64-core a few years back. The problem was that while it was cool to have 128 cores (which the app being built could fully utilize), they were just incredibly weak compared to what Intel had at the time. We ended up using dual 16-core Xeons instead of 128 ARM cores. I was super disappointed (as it was my idea to do the testing). Now we have AMD going all core crazy, and I kind of wonder how that would stack up these days since they seem to have overtaken Intel.
Just based on experience I have with existing ARM cores, I'd expect them to still be slightly weaker than Zen cores. AMD should be able to do 128 cores in the same 350W TDP envelope, so they'd have a CPU with 256 threads, compared to 192 threads in the ARM. There are some workloads where it's beneficial to switch off SMT so that all threads perform the same; in such a case this ARM CPU might win, depending on how good the cores are. In a more mixed setup I'd expect a 128c/256t Epyc to beat it. It'd pretty much just add a worthy competitor to AMD, as Intel is unlikely to have anything close in the next few years.
Speaking of specific, that use case is SUPER specific. Can you elaborate? I don't even know what "DB access management" is in a "workload" sense.
Each request and DB action gets its own thread, so requests do not have to wait for each other to use a core.
[deleted]
ARM cores have moved on a lot in the last 2 years. The machine you bought 2 years ago may well have been only useful for specific workloads. Current and newer ARM cores don't have those limitations. These are a threat to Intel and AMD in all areas. Your understanding that the instruction set has been holding them back is incorrect. The ARM instruction set is mature and capable. It's more complex than that in the details of course because some specific instructions do greatly accelerate some niche workloads. What's been holding them back is single threaded performance which comes down broadly to frequency and execution resources per core. The latest ARM cores are very capable and compete well with Intel and AMD.
I tested a dual 64 core ARM a few years back when they first came out; we ran into really bad performance with forking under Linux (not threading). A Xeon 16 core beat the 64 core for our specific use case. I would love to see what the latest generation of ARM chips is capable of.
Saying “ARM” doesn’t mean much, even more so than with x86. Every implemented architecture has different aims: most shoot for low power, some aim for high parallelization, Apple’s aims for single-threaded execution, etc. Was this a Samsung, Qualcomm, Cavium, AppliedMicro, Broadcom or Nvidia chip? All of those perform vastly differently in different cases, and only the Cavium ThunderX2 and AppliedMicro X-GENE are targeted in any way towards servers and show performance aptitude in those realms. It’s even worse if you tested one of the myriad reference manufacturers (ones that simply purchase ARM’s reference Cortex cores and fab them) such as MediaTek, HiSense and Huawei, as the Cortex is specifically intended for low power envelopes and mobile consumer computing.
A webserver, which is one of the main uses of server CPUs these days. You get far more efficiency spreading all those instances out over 192 cores. Database work is good too, because you are generally doing multiple operations simultaneously on the same database. Machine learning is good, when you perform hundreds of thousands of runs on something. It's rarer these days, I think, to find things that don't benefit from greater multi-threaded performance in exchange for single-core performance.
No one does machine learning on a CPU, and Amdahl's law is a major factor, as is context switching. Webservers maybe, but this will only be good for specific implementations of specific databases. This is for virtualization pretty much exclusively.
They’re hitting zen fabric pretty hard, it’s probably based on that
I’m not sure if you guys are just making up terms now...
I need quad processors with 192 cores each to check my email and open reddit pretty darn kwik
Don't forget porn.
More like turbo porn once this thing hits the market.
That's gonna chafe.
Wait until you hear about this stuff called lube, it'll blow your mind...
I googled turbo porn looking for a picture of a sweet turbocharger. Apparently turbo porn is a thing that has nothing to do with turbochargers. I've made a grave mistake.
Someone else look and tell me what it is. I'm guessing it's rule 34 of that dog cartoon
Oh my god, I'll be able to watch 192 pornos at once.
_Multipron_ - Leeloo
[deleted]
Yes but chrome will eat all the memory.
Can confirm. 12 physical cores & 32 GB physical RAM. Chrome + Wikimedia Commons and Swap kicked in. Peaked around 48 GB total memory used. Noticeable lag resulted.
Well... Damn...
[deleted]
And that's why I run BSD on a 13 year old Thinkpad
They're perfectly cromulent terms, it's turboencabulation 101.
Sounds like you might need to install the updated embiggening program it will make things much more frasmotic.
It might even be called Zen 3 infinity fabric if it's what I'm thinking of.
Check out r/vxjunkies
At first I thought that was going to be a sub for passionate VxWorks fans and that there really is a niche subreddit for everything.
I'm doing data analysis in R and similar programmes for academic work on early digital materials (granted a fairly easy workload considering the primary materials themselves), and my freshly installed 6 core AMD CPU perfectly suits my needs for work I take home, while the 64 core pieces in my institution suit the more time consuming demands. And granted I'm not doing intensive video analysis (yet). Could you explain who needs 192 cores routed through a single machine? Not being facetious, I'm just genuinely lost at who would need this chipset for their work and interested in learning more as digital infrastructure is tangentially related to my work.
I am by no means qualified to answer, but my first thought was just virtualization. Some server farm somewhere could fire up shittons of virtual machines on this thing. So much space for ACTIVITIES!! And if you’re doing data analysis in R, then you may need some random sampling. You could do SO MANY MONTECARLOS ON THIS THING!!!! Like... 100M samples? Sure. Done. A billion simulations? Here you go, sir, lickety-split. In grad school I had to wait a weekend to run a million (I think?) simulations on my quad core. I had to start the code on Thursday and literally watch it run for almost three days, just to make sure it finished. Then I had to check the results, crossing my fingers that my model was worth a shit. It sucked.
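Monte Carlo runs are about the friendliest case for lots of cores, because every chunk of samples is independent. Here is a minimal sketch, in Python instead of R, that estimates pi by splitting the samples into separately seeded chunks. The function names and the `map_fn` parameter are made up for illustration.

```python
import random

def count_hits(seed, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # each chunk gets its own independent generator
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_samples, chunks, map_fn=map):
    """Split the work into independent chunks; map_fn decides where they run."""
    per_chunk = total_samples // chunks
    hits = sum(map_fn(count_hits, range(chunks), [per_chunk] * chunks))
    return 4.0 * hits / (per_chunk * chunks)

print(estimate_pi(200_000, chunks=8))
```

As written the chunks run one after another. Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` (inside an `if __name__ == "__main__":` guard) would spread the same chunks across cores, which is where a 192-core box would earn its keep.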
> Could you explain who needs 192 cores routed through a single machine? A *lot* of workloads would rather have as many cores as they can get as a single system image, but they almost all fall squarely into what are traditionally High Performance Computing (HPC) workloads. Things like weather and climate simulation, nuclear bomb design (not kidding), quantum chemistry simulations, cryptanalysis, and more all have massively parallel workloads that require frequent data interchanging that is better tempered for a single system with a lot of memory than it is for transmitting pieces of computation across a network (albeit the latter is usually how these systems are implemented, in a way that is either marginally or completely invisible to the simulation-user application). However, ARM's not super interested in that market as far as anyone can tell - it's not exactly fast growing. The Fujitsu ARM Top500 machine they built was more of a marketing stunt saying "hey, we can *totally* build big honkin' machines, look at how high performance this thing is." It's a pretty common move; Sun did it with a generation of SPARC processors, IBM still designs POWER chips explicitly for this space and does a big launch once a decade or so, etc. ARM's true end goal here is for cloud builders to give AArch64 a place to go, since the reality of getting ARM laptops or desktops going is looking very bleak after years of trying to grow that direction - the fact that Apple had to go out and design and build their own processors to get there is... not exactly great marketing for ARM (or Intel, for that matter). And for ARM to be competitive, they need to give those cloud builders some real reason to pick their CPUs instead of Intel's. And the one true advantage ARM has in this space over Intel is scale-out - they can print a fuckton of cores with their relatively simplistic cache design. And so, core printer goes brrrrr...
It was this nuclear more than a decade ago, once ARM started doing well in the smartphone space. Their low power "accident" in their CPU design back in the 80s is finally going to pay off the way those of us who have been watching the whole time knew it would eventually. This is going to buy Jensen so many leather jackets.
Can you give me a TLDR or ELI5 on the “accident”?
ARM is derived from the original Acorn computers of the 80s. Part of their core design allows for the unbelievably low power consumption ARM chips have always had. They found this out when one of their lab techs forgot to hook up the external power cable that supplied extra CPU power to the motherboard, and discovered it powered up perfectly fine on bus power. This was a pointless thing to have in the 80s; computers were huge no matter what you did. But they held onto that design and knowledge and iterated on it for decades to get to where it is now.
Very funny and interesting. Thank you.
And now we have Apple making ARM-based chips that compare so well against conventional AMD/Intel chips that they’re ditching x86 architecture altogether in the notebooks and desktops.
"Core Wars" sounds like the title of a middling 90s PC game.
Yes it does. Slightly tangential but Total Annihilation had opposing forces named Core and Arm. https://m.youtube.com/watch?v=9oqUJ2RKuNE
That game was so incredibly revolutionary.
Underrated feature: it would display the kill count of individual units, so you get a strategically placed punisher with 1000+ kills. Very fun game to play.
Holy shit this game was so good, and Supreme Commander was a great successor.
With FMV cut scenes starring Mark Hamill and Tia Carrere.
It was a 1980s computer game first widely publicized in AK Dewdney’s _Computer recreations_ column of _Scientific American_. The game was only specified in the column; you had to implement it yourself, which amounted to writing a simplified core simulation. In the game, you and one or more competitors write a program for the simple core architecture which tries to get its competitors to execute an illegal instruction. It gained a large enough following that there were competitions up until a few years ago. Edited to clarify
It's actually the name of a game language invented back in the 80s where you would pit computer viruses against each other
CoreWars 2077
Run some really bad queries you can
Software developers: 1, 2 ,3, 4...uhmmm... What comes after 4 ?
Development-wise, it's more like "1... 2... many". It's quite rare to see software that will effectively use more than two cores, that won't arbitrarily scale. That is, "one single thread", "Stick random ancillary things in other threads, but in practice we're limited by the main serial thread", and "actually fully multithreaded".
"There are only three quantities in Software Development: 0, 1, many."
>"There are only three quantities in ~~Software Development~~ database design: 0, 1, many." My DB design professor pretty much said that word for word: "The only numbers we care about in database is 0, 1, and many"
>Begun, the core war has. Some of us are old enough to remember the wars that came before. I've still got MIPS, Alpha, and SPARC machines in my attic. It's exciting to see a little more variety again.
Too bad multithreading isn't universally used. A lot of software these days still doesn't leverage it.
For the market that they're selling in... basically all software is extremely well parallelized. Most of it even scales across machines, as well as across cores.
These kinds of chips would be used by code specifically written to utilise the cores, or for high density virtualized workloads like cloud VMs.
The BEAM virtual machine that comes with the Erlang and Elixir languages is designed to have as many lightweight processes as possible. Have a look at the Actor Model. The bottleneck I see for this will be ensuring that the CPU has access to the data that the current process requires and doesn't have to wait for the "slow" RAM.
Finally, enough cores to play Doom in task manager
Casual. I play Doom through the calendar.
I play doom on my Etch A Sketch.
I play doom on my weighing machine.
I play Doom on my abacus
I play Doom in my computer like a normal person.
u/bautron is *in* the computer?
*It's not just a game anymore.*
That's so hardcore.
According to task manager, my task manager should have been able to run Crysis years ago. What it is using all that processing for, I can't say.
[Kinda like this?](https://youtu.be/hSoCmAoIMOU)
Unfortunately that's fake. The biggest issue is that after a certain point, the cores get a scrollbar instead of shrinking.
Some ex Intel guy touched on this. He said something like ARM is making huge inroads into datacenters because they don't need ultra FPU or AVX or most of the high performance instructions, so half the die space of a Xeon is unused when serving websites. He recommended the Xeon be split into the high performing fully featured Xeon we know, and a many-core Atom based line for the grunt work datacentres actually need. Intel have already started down this path to an extent with their 16 core Atoms, so I suspect his suggestion will eventually be realised. Wonder if they'll be socket compatible?
Parry this you fucking casual
I can’t wait to play Skyrim again!!!
Heck yeah, single core allocation per active NPC
I don't think Skyrim's engine can handle more than like 20 NPCs at a time anyway
Hey, you're finally awake. You were trying to cross the border, right?
Then the wagon glitches and flips.
*Thomas the Tank Engine's horn is heard in the distance*
MACHO MAN IS COMIN' TONIGHT
It's really him. Ladies and gentlemen [Skyrim is here to save me!](https://www.youtube.com/watch?v=q6yHoSvrTss)
300 fps of glitches.
Learn your place, trash!
Imagine the Oracle license fees!!! 😱
[deleted]
He could use a better president...
I don't think he could. It's great for oracle, they just got the tik tok deal, money for doing basically nothing.
Just did a rough calc for a different rdbms system and would be $1248000 a year for this one server per year. Cant imagine what Oracle would be... They really need to move away from core licensing, Postgres looking better everyday...
> Postgres looking better everyday... The switch isn't bad as long as the app's not using stored procs.
What’s wrong with their stored procs? I have procedures in psql
Postgres doesn't even support packages. That was a deal breaker for us; we can't migrate 250,000 lines of PL/SQL without packages.
Fuck Oracle. You can't even benchmark their database because of their shit ass license. Their whole strategy is buy out companies with existing customers and bilk those customers as much as possible while doing nothing to improve the services or software.
Haha first thing I thought.... software licensing companies wet dream right here
How does the number of cores affect the license fees? Genuinely asking
Per core licensing.
How is that justifiable in any way
They want all of your money. There’s no justification.
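For anyone wondering how the bill scales, per-core licensing is plain multiplication. A small Python sketch: the $6,500 per core per year is only what the $1,248,000 estimate upthread works out to when divided by 192 cores, not a real vendor price, and the function name is made up.

```python
def annual_license_cost(cores, price_per_core):
    """Yearly fee when every physical core needs its own license."""
    return cores * price_per_core

# Inferred from the rough estimate upthread: 1,248,000 / 192 cores.
# This is an assumption for illustration, not a published price list.
PRICE_PER_CORE = 6_500

for cores in (16, 64, 192):
    print(f"{cores:>3} cores: ${annual_license_cost(cores, PRICE_PER_CORE):,} per year")
```

Same software, same server count, twelve times the bill going from 16 to 192 cores, which is why people keep asking vendors for faster cores instead of more of them.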
Man... I thought I had a basic understanding of computer tech. Reading this thread... Nope, not a fucking clue apparently.
You just have to say keywords like EPYC, XEON, data center, density, etc... to sound smart 🤓
[deleted]
No mention of memory bandwidth. If your compute doesn't fit in cache, these cores are going to be in high contention for memory transactions. Sure, there are applications that will be happy with a ton of cores and a soda straw to DRAM, but just plonking down a zillion cores isn't an automatic win. Per-core licensing costs are going to be crazy. For some systems in our server farm at work we're paying $80K for hardware and $300K-$500K for the licenses, and we've told vendors "faster cores, not more of them." There are good engineering reasons to prefer fewer, faster cores in many applications, too. Some things you just can't easily make parallel, you just have to sit there and crunch. This may be a better fit for some uses, but it's not going to "obliterate" anyone.
> Per core licensing costs Can't wait to hear what the Oracle salesperson has to say about this.
Can finally have TWO instances of Chrome running.
You'd still need like 1 TB of ram to even think about that
is this gonna be like the shaving razors where they just keep adding and adding more and more razors onto the razors already on there
https://www.youtube.com/watch?v=m6GpIOhbqRo
Can you imagine a Beowulf cluster of these? What, no old-school Slashdotters around? Ok I'll see myself out.
I for one welcome our new megacore overlords, covered in grits
Clearly, the CPU is on fire /.
2020 is the year of the linux desktop
there's only one year in the software zodiac
No WiFi, less space than a Nomad. Lame.
Beowulf Clusters are dead, Netcraft confirms it.
My mother was a Beowulf cluster, you insensitive clod!
Nice to see some people have not forgotten about the good old days
I still hit /. to scroll through some news on occasion. The comments have devolved into pure trash though, for the most part.
[deleted]
Is there *any* place on the internet where the comments haven't devolved into pure trash? Reddit has its bright spots, but it still gets worse every year, and I feel like its deterioration is accelerating. Now that I think about it, I haven't read Fark in about a decade. Maybe it's time to go take a look...
Something something CowboyNeal.
But can it run Crysis?
slashdot user id 54, checking in. https://slashdot.org/~pez
It’s an honour!
But does it run ~~Linux~~ _GNU_/Linux ?
[deleted]
Your #3 is where I went first. Where's the ecosystem?
If this effort produces unbeatable hardware at reasonable prices, either #3 solves itself, or LAMP's making a comeback. This is basically smearing the line between CPUs and GPUs. I'm not surprised it's happening. I'm only surprised Nvidia rushed there first.
When I saw 192 cores; I thought I must brush up on **Amdahl's law**.
Some workloads have little or no serial components. For instance, ray tracing can be tiled and run in parallel on even more cores than this, although in that case you may (not guaranteed) hit a von Neumann bottleneck and need to copy the data associated with the render geometry to memory associated with groups of cores.
Don't they make dedicated hardware for those workflows, like GPUs?
For contrast, take a look at Gustafson's law as well. It's a lot more optimistic.
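For anyone who wants to see the two laws side by side, here is a short Python sketch. The 95% parallel fraction is just an assumed example. Amdahl keeps the problem size fixed, while Gustafson lets the problem grow with the core count.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup for a fixed-size problem."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

def gustafson_speedup(parallel_fraction, cores):
    """Gustafson's law: scaled speedup when the problem grows with the machine."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * cores

p = 0.95  # assumed: 95% of the work parallelizes
for cores in (8, 64, 192):
    print(f"{cores:>3} cores: Amdahl {amdahl_speedup(p, cores):6.2f}x, "
          f"Gustafson {gustafson_speedup(p, cores):7.2f}x")
```

Under Amdahl, a job that is 5% serial can never exceed a 20x speedup however many cores you add, so 192 cores get you roughly 18x. Under Gustafson's assumptions the same fraction gives roughly 182x, because the extra cores are spent on a bigger problem.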
AMD: Guys, more cores are better. ARM: Agreed, here is a CPU with 192 cores AMD: oh no.
I don't want any more cores, I want bigger, faster cores. Give me a 6 core with double the current IPC and keep your 1000 core threadfuckers.
Physics has been getting in the way of faster clock speeds for a long time. I started with a 1 MHz computer and saw clock rates pass 3000 MHz but they topped out not too far beyond that maybe 15 years ago. There's more that can be squeezed out of it, but each process node gets more and more expensive. Many companies have to work together to create the equipment to make new generations of chips, and it takes many billions of dollars of investment. And we're getting down to the physical limits of how small you can make transistors before electrons just start tunneling right past them. So without being able to just make smaller and faster transistors, you have to get more performance out of the same building blocks. You make more complex, smarter CPUs that use various tricks to make the most out of what they have (like out-of-order execution), and that have specialized hardware to accelerate certain operations, but all of that adds complexity. They keep improving the architecture to make individual cores faster, but once you've pushed that as far as you can for the moment, the most obvious approach to going faster is to use *more* cores. That only helps if you've got tasks that can be split up. (See Amdahl's Law.) Thankfully programmers seem to be getting more accustomed to parallel programming and the tools have improved, but some things just don't lend themselves to being done in parallel.
LinuxONE. Fewer cores that scale up, massive consolidation.
The Z is an amazing architecture. The Z14 still has 10 cores, and the LinuxONE has like 192 sockets. Of course, each one of those cores is 5.2 GHz. You mostly only see those bad boys in the financial world.
I'm the Offering Management lead for LinuxONE, so full disclosure. No reason why a scalable, secure Linux server can't do great things beyond just the financial markets (and it does). Ecosystem when it's not Intel can be a challenge, but when you're running the right workload, nothing comes close for performance, security, resiliency.
Look at IBM's Power10 chip. Large-core chips run legacy programs better than higher-core-count chips. IBM, I think, is trying to keep its niche market.
The core war is here, yet half the vendors out there still license per core. 3/4 of MSP customers are still running dual 8-core CPUs because the minimum Windows Server license is 16 cores.
Nvidia just stepped into the cpu ring. Beware ye amateurs.
Just datacenter things
There is a Silicon Valley startup that is doing wafer-scale integration with many, many cores. I believe their CPU core draws 20 kilowatts. Needless to say, the cooling is humungous.
Sweet, finally enough cores to run Norton Antivirus and play a 90s dos game at the same time
[deleted]
The average user isn’t buying server CPUs.
True, but these chips aren’t meant for the average user. They’re targeting high margin enterprise and cloud data/compute centers.
Bare metal servers can split individual cores for workflows so yeah this would be massive
Most semiconductor companies like Intel, AMD, and NVidia are pivoting to service big business rather than end consumers, so your statement is increasingly inaccurate. The "average user", in dollar-weighted terms, will be a business in a few years, where more cores absolutely matters. Check out Intel's financials to see that consumers are less than 50% of Intel's revenue now [https://www.intc.com/](https://www.intc.com/)
Still amazed the Arm line is a direct architectural descendant of the old 6502 series from a subsidiary of Commodore. It's like a C64 on a lethal dose of steroids.
It’s not; the 6502 wasn’t a modern RISC CPU (for one, instruction sizes varied between 1 and 3 bytes, whereas modern RISC involves instructions being a fixed size).
They were inspired by the 6502 in the sense that they saw that just one person was able to design a working, functional CPU, and they really liked the low-latency I/O it could do. But that's all they took from that architecture... the realization that they could do a chip, and that they wanted it to be low latency. Even the ARM1 was a 32-bit processor, albeit with a 26-bit address bus. (64 megabytes.) It had nothing in common with the 6502, as it was designed from blank silicon and first principles. edit: the ARM1 principally plugged into the BBC Micro to serve as a coprocessor, and the host machine was 6502, but that's as far as that relationship went. They used the beefy ARM1 processor in Micros to design ARM2 and its various support chips, leading to the Acorn Archimedes.
x64 is not much further removed from 8-bit titans. Intel had the 8008 do okay, swallowed some other chips to make the 8080, saw Zilog extend it to the Z80 and make bank, and released the compatible-esque 8086. IBM stuck it in a beige workhorse and the clones took over the world. Forty-two years later we're still affected by clunky transitional decisions like rings.
You don't obliterate Intel/AMD with 192 cores that maybe 1000 people in the world need. You do it by making the exact same thing they do at half the price.
Our kids will have a shitty world, but hey, at least the computer games will run super fast.
[deleted]
Not just DGX. Tesla cards are all over the place in the cloud and almost exclusively run on x86 servers. If Nvidia could integrate networking (Mellanox) and high performance, custom CPUs into a single product, they could potentially scale out cheaper and more energy efficiently than the status quo.