comp.arch
Anton Ertl to BGB
Re: Intel's Software Defined Super Cores
19 Sep 25 09:50:32
   From: anton@mips.complang.tuwien.ac.at   
      
   BGB  writes:   
   >On 9/17/2025 4:33 PM, John Levine wrote:   
   >> According to BGB  :   
   >>> Still sometimes it seems like it is only a matter of time until Intel or   
   >>> AMD releases a new CPU that just sort of jettisons x86 entirely at the   
   >>> hardware level, but then pretends to still be an x86 chip by running   
   >>> *everything* in a firmware level emulator via dynamic translation.   
   >>   
   >> That sounds a whole lot like what Transmeta did 25 years ago:   
   >>   
   >> https://en.wikipedia.org/wiki/Transmeta_Crusoe   
   >>   
   >> They failed but perhaps things are different now.  Their   
   >> native architecture was VLIW which might have been part   
   >> of the problem.   
   >>   
   >   
   >Might be different now:   
   >25 years ago, Moore's law was still going strong, and the general   
   >concern was more about maximizing scalar performance rather than energy   
   >efficiency or core count (and, in those days, processors were generally   
   >single-core).   
      
   IA-64 CPUs were shipped until July 29, 2021, and Poulson (released   
   2012) has 8 cores.  If IA-64 (and dynamically translating AMD64 to it)
   were a good idea nowadays, it would not have been canceled.
      
   How should the number of cores change anything?  If you cannot make   
   single-threaded IA-32 or AMD64 programs run at competitive speeds on
   IA-64 hardware, how would that inefficiency be eliminated in   
   multi-threaded programs?   
      
   >Now we have a different situation:   
   >   Moore's law is dying off;   
      
   Even if that is the case, how should that change anything about the   
   relative merits of the two approaches?   
      
   >   Scalar CPU performance has hit a plateau;   
      
   True, but again, what's the relevance for the discussion at hand?   
      
   >   And, for many uses, performance is "good enough";   
      
   In that case, better buy a cheaper AMD64 CPU rather than a   
   particularly fast CPU with a different architecture X and then run a   
   dynamic AMD64->X translator on it.   
      
   >   A lot more software can make use of multi-threading;   
      
   Possible, but how would it change things?   
      
   >Likewise, x86 tends to need a lot of the "big CPU" stuff to perform   
   >well, whereas something like a RISC style ISA can get better performance   
   >on a comparably smaller and cheaper core, and with a somewhat better   
   >"performance per watt" metric.   
      
   Evidence?   
      
   Yes, you can run CPUs with Intel P-cores and AMD's non-compact cores   
   with higher power limits than what the Apple and Qualcomm chips   
   approximately consume (I have not seen proper power consumption   
   numbers for these since Anandtech stopped publishing), but you can   
   also run Intel CPUs and AMD CPUs at low power limits, with much better   
   "performance per watt".  It's just that many buyers of these CPUs care   
   about performance, not performance per watt.   
      
   And if you run AMD64 software on your binary translator on CPUs with   
   e.g., ARM A64 architecture, the performance per watt is worse than   
   when running it on an AMD64 CPU.   
      
   >So, one possibility could be, rather than a small number of big/fast   
   >cores (either VLIW or OoO), possibly a larger number of smaller cores.   
   >   
   >The cores could maybe be LIW or in-order RISC.   
      
   The approach of a large number of small, slow cores has been tried,   
   e.g., in the TILE64, but has not been successful with that core size.   
   Other examples are Sun's UltraSparc T1000 and follow-ons, which were
   somewhat more successful, but eventually led to the cancellation of   
   SPARC.   
      
   Finally, Intel now offers E-core-only chips for clients (e.g., N100)   
   and servers (Sierra Forest), but they have not stopped releasing   
   P-core-only server CPUs.  For the desktop, the CPU with the largest
   number of E-cores (16) also has 8 P-cores, so Intel obviously
   believes that not all desktop applications are embarrassingly   
   parallel.   
      
   Intel used to have Xeon Phi CPUs with a higher number of narrower   
   cores, but eventually replaced them with Xeon processors that have   
   fewer, but more powerful cores.   
      
   AMD offers compact-core-only server CPUs with more cores and less   
   cache per core, but otherwise the same microarchitecture, only with a   
   much lower clock ceiling.  (There is a difference in microarchitecture   
   wrt executing AVX-512 instructions on Zen5, but that's minor).  AMD
   also offers server CPUs with non-compact cores; interestingly, if we
   compare CPUs with the same number of cores, the launch price (at the
   same date) is not that far apart:
      
                          GHz
   Model      cores  base  boost  L3 cache  TDP   launch price  current price
   EPYC 9755  128    2.7   4.1    512MB     500W  USD 12984     EUR 5979
   EPYC 9745  128    2.3   3.7    256MB     400W  USD 12141     EUR 4192
      
   Current pricing from   
   ;   
   however, the third-cheapest dealer for the 9745 asks for EUR 6129, and   
   the cheapest price up to 2025-09-10 has been EUR 6149, so the current   
   price difference may be short-lived.  The cheapest price for the 9755
   was EUR 4461 on 2025-08-25, and at that time the 9755 was cheaper than the
   9745 (at least as far as the prices seen by the website above are   
   concerned).   
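
   The comparison in the table above can be redone as a few lines of
   arithmetic.  This is only a sketch: the numbers are copied from the
   table, while the dictionary layout and the helper names per_core and
   ratio are made up for illustration.

```python
# Figures copied from the table above; names and layout are illustrative only.
cpus = {
    "EPYC 9755": {"cores": 128, "cache_mb": 512, "launch_usd": 12984, "current_eur": 5979},
    "EPYC 9745": {"cores": 128, "cache_mb": 256, "launch_usd": 12141, "current_eur": 4192},
}

def per_core(model, field):
    """Return the given price field divided by the core count."""
    cpu = cpus[model]
    return cpu[field] / cpu["cores"]

def ratio(field):
    """Price of the compact-core 9745 relative to the 9755."""
    return cpus["EPYC 9745"][field] / cpus["EPYC 9755"][field]

if __name__ == "__main__":
    for model in cpus:
        print(model, round(per_core(model, "launch_usd"), 2), "USD/core at launch")
    print("launch price ratio 9745/9755: ", round(ratio("launch_usd"), 3))
    print("current price ratio 9745/9755:", round(ratio("current_eur"), 3))
```

   With these figures the launch prices differ by about 6.5%, while the
   current street prices differ by about 30%.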
      
   I have thought about why the idea of more smaller cores has not been   
   more successful, at least for the kinds of loads where you have a   
   large number of independent and individually not particularly   
   demanding threads, as in web shops.  My explanation is that you need   
   1) memory bandwidth and 2) interconnection with the rest of the   
   system.   
      
   The interconnection with the rest of the system probably does   
   not get much cheaper for the smaller cores, and probably becomes more   
   expensive with more cores (e.g., Intel switched from a ring to a grid   
   when they increased the cores in their server chips).   
      
   The bandwidth requirements to main memory for given cache sizes per   
   core reduce linearly with the performance of the cores; if the larger   
   number of smaller cores really leads to increased aggregate   
   performance, additional main memory bandwidth is needed, or you can   
   compensate for that with larger caches.   
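
   As a rough illustration of the bandwidth point, assume (as stated
   above) that main-memory traffic per core scales linearly with
   per-core performance at a fixed cache size.  The function name and
   all numbers below are hypothetical, not measurements.

```python
def aggregate_bandwidth(cores, perf_per_core, traffic_per_perf):
    """Total main-memory traffic, assuming per-core traffic is
    proportional to per-core performance (fixed cache size per core)."""
    return cores * perf_per_core * traffic_per_perf

if __name__ == "__main__":
    # Hypothetical: 2 GB/s of memory traffic per unit of core performance.
    big = aggregate_bandwidth(cores=8, perf_per_core=1.0, traffic_per_perf=2.0)
    small = aggregate_bandwidth(cores=32, perf_per_core=0.4, traffic_per_perf=2.0)
    print("8 big cores:    ", big, "GB/s")
    print("32 small cores: ", small, "GB/s")
```

   In this made-up case the 32 small cores deliver 1.6 times the
   aggregate performance of the 8 big ones, and they need 1.6 times the
   main-memory bandwidth to do so.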
      
   But to eliminate some variables, let's just consider the case where we   
   want to get the same performance with the same main memory bandwidth   
   from using more smaller cores than we use now.  Will the resulting CPU   
   require less area?  The cache sizes per core are not reduced, and   
   their area is not reduced much.  The core itself will get smaller, and   
   its performance will also get smaller (although by less than the   
   core).  But if you sum up the total per-core area (core, caches, and   
   interconnect), at some point the per-core area reduces by less than   
   the per-core performance, so for a given amount of total performance,   
   the area goes up.   
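
   A toy model makes this argument concrete.  It is a sketch under
   assumptions that are not from any measurement: core performance is
   taken to grow with the square root of core area (the exponent 0.5 is
   an assumed rule of thumb), and cache plus interconnect are lumped
   into a fixed per-core area.  All names and numbers are hypothetical.

```python
def area_per_perf(core_area, fixed_area, exponent=0.5):
    """Total per-core area divided by per-core performance.

    core_area:  area of the core itself (arbitrary units)
    fixed_area: cache + interconnect area per core, assumed not to shrink
    exponent:   assumed scaling of performance with core area
    """
    perf = core_area ** exponent
    return (core_area + fixed_area) / perf

if __name__ == "__main__":
    fixed = 1.0  # hypothetical cache + interconnect area per core
    for core in (4.0, 2.0, 1.0, 0.5, 0.25):
        print(f"core area {core:5.2f}: area per unit performance "
              f"{area_per_perf(core, fixed):.3f}")
```

   In this model the area needed per unit of performance first falls as
   the core shrinks, reaches a minimum when the core is as large as the
   fixed part, and rises again below that, which is the "at some point"
   in the argument above.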
      
   There is one counterargument to these considerations: The largest   
   configuration of Turin dense has less cache for more cores than the   
   largest configuration of Turin.  I expect that's the reason why they   
   offer both; if you have less memory-intensive loads, Turin dense with   
      
   [continued in next message]   
      