
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 129,740 of 131,241   
   Kent Dickey to gneuner2@comcast.net   
   Re: Intel's Software Defined Super Cores   
   22 Sep 25 03:21:17   
   
   From: kegs@provalid.com   
      
   In article ,   
   George Neuner   wrote:   
   >On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard   
   > wrote:   
   >   
   >>On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:   
   >>   
   >>> Although it's called "inverse hyperthreading", this technique could be   
   >>> combined with SMT - put the chunks into different threads on the same   
   >>> core, rather than on different cores, and then one wouldn't need to add   
   >>> extra connections between cores to make it work.   
   >>   
   >>On further reflection, this may be equivalent to re-inventing out-of-order   
   >>execution.   
   >>   
   >>John Savard   
   >   
   >Sounds more like dynamic micro-threading.   
   >   
   >Over the years I've seen a handful of papers about compile time   
   >micro-threading: that is the compiler itself identifies separable   
   >dependency chains in serial code and rewrites them into deliberate   
   >threaded code to be executed simultaneously.   
   >   
   >It is not easy to do under the best of circumstances and I've never   
   >seen anything about doing it dynamically at run time.   
   >   
   >To make a thread worth rehosting to another core, it would need to be   
   >(at least) many 10s of instructions in length.  To figure this out   
   >dynamically at run time, it seems like you'd need the decode window to   
   >be 1000s of instructions and a LOT of "figure-it-out" circuitry.   
   >   
   >   
   >MMV, but to me it doesn't seem worth the effort.   
      
   I began reading the patent, and it's not clear to me this approach is   
   going to be much of an improvement.  A great deal of analysis magic has   
   to happen to find code to spread across the cores.  To summarize, it's   
   basically taking code that looks like:   
      
   	for(i = 0; i < N; i++) {   
   		// Do some work   
   	}   
      
   	for(i = 0; i < M; i++) {   
   		// Do some different work   
   	}   
      
   and having two cores run the loops at the same time, with some special   
   check hardware to make sure they really are independent (I gave up before   
   really figuring out what they're going to do; patents are not fun to read).   
   I think they actually want to divide up each loop into sections, and do   
   them in parallel.  If someone wanted to explain in better detail what   
   they are doing, I'd like to read that short summary in non-patentese.   
      
   A trivial alternative approach to shrinking core size while not losing   
   single thread speed is to basically make all cores Narrow (meaning   
   support something like 4 instructions wide), and when code needs more,   
   stall the neighboring core and steal its functional units to form a new   
   8-wide core.  This approaches SMT hardware sharing from a different   
   direction, and so code without much instruction parallelism will run   
   better on two smaller cores than on a big core with two threads, but if   
   a single thread can use 8-wide instruction execution, it can steal it from   
   the neighboring core for a while.   
      
   If that's too much trouble, then for x86, all cores have just AVX-256 width,   
   and take two clocks to do each AVX-512 operation (which is still better than   
   just AVX-256).  But hardware can join the neighboring cores together to be   
   AVX-512, with each AVX-512 op then taking just one clock (and this joining can   
   be limited to the AVX units, so the other core can run other instructions unimpeded).   
      
   Kent   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca