... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.arch
Apparently more than just beeps & boops
131,241 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 130,132 of 131,241
BGB to Terje Mathisen
Re: branch splitting
05 Nov 25 01:00:32
   From: cr88192@gmail.com   
      
   On 11/4/2025 3:44 PM, Terje Mathisen wrote:   
   > MitchAlsup wrote:   
   >>   
   >> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>   
   >>> Bottom line: History-based branch prediction has won, any kind of   
   >>> delayed branches (including split-branch designs) turn out to be   
   >>> a bad idea.   
   >>   
   >> Or "Never bet against branch prediction".   
   >   
   > I have probably mentioned this before, once or twice, but I'm actually   
   > quite proud of the meeting I had with Intel Santa Clara in the spring of   
   > 1995:   
   >   
   > I had (accidentally) written the first public mention of the FDIV bug   
   > (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of   
   > MathWorks/MatLab fame led the effort to develop a minimum cost sw   
   > workaround for the bug. (My code became part of all/most x86 compiler   
   > runtimes for the next few years.)   
   >   
   > Due to this Intel invited me to receive an early engineering prototype   
   > of the PentiumPro, together with an NDA-covered briefing about its   
   > architecture.   
   >   
   > Before the start of that briefing I suggested that I should start off on   
   > the blackboard by showing what I had been able to figure out on my own,   
   > then I proceeded to pretty much exactly cover every single feature on   
   > the cpu, with one glaring exception:   
   >   
   > Based on the useful but not great branch predictor on the Pentium I told   
   > them that I expected the P6 to employ eager execution, i.e execute both   
   > ways of one or two layers of branches, discarding the non-taken paths as   
   > the branch direction info became available.   
   >   
   > That's the point when they got to brag about how having a much, much   
   > better branch predictor was better both from a performance and a power   
   > viewpoint, since out of order execution could predict much deeper than   
   > any eager execution would have the resources for.   
   >   
   > As you said: "Never bet against branch prediction".   
   >   
      
   Branch prediction is fun.   
      
      
   When I looked around online before, a lot of stuff about branch   
   prediction was talking about fairly large and convoluted schemes for the   
   branch predictors.   
      
   But, then always at the end of it using 2-bit saturating counters:   
      weakly taken, weakly not-taken, strongly taken, strongly not taken.   
      
   But, in my fiddling, there was seemingly a simple but moderately   
   effective strategy:   
      Keep a local history of taken/not-taken;   
      XOR this with the low-order-bits of PC for the table index;   
      Use a 5/6-bit finite-state-machine or similar.   
        Can model repeating patterns up to ~ 4 bits.   
      
   Where, the idea was that the state-machine in updated with the current   
   state and branch direction, giving the next state and next predicted   
   branch direction (for this state).   
      
      
   Could model slightly more complex patterns than the 2-bit saturating   
   counters, but it is sort of a partial mystery why (for mainstream   
   processors) more complex lookup schemes and 2 bit state, was preferable   
   to a simpler lookup scheme and 5-bit state.   
      
   Well, apart from the relative "dark arts" needed to cram 4-bit patterns   
   into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).   
      
      
      
   Then again, had before noted that the LLMs are seemingly also not really   
   able to figure out how to make a 5 bit FSM to model a full set of 4 bit   
   patterns.   
      
      
   Then again, I wouldn't expect it to be all that difficult of a problem   
   for someone that is "actually smart"; so presumably chip designers could   
   have done similar.   
      
   Well, unless maybe the argument is that 5 or 6 bits of storage would   
   cost more than 2 bits, but then presumably needing to have significantly   
   larger tables (to compensate for the relative predictive weakness of   
   2-bit state) would have costed more than the cost of smaller tables of 6   
   bit state ?...   
      
   Say, for example, 2b:   
     00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken   
     00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken   
     01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken   
     01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken   
     10_0 => 10_0  //strongly not taken, dir=0   
     10_1 => 00_0  //strongly not taken, dir=1 (goes weak)   
     11_0 => 01_1  //strongly taken, dir=0   
     11_1 => 11_1  //strongly taken, dir=1 (goes weak)   
      
   Can expand it to 3-bits, for 2-bit patterns   
      As above, and 4-more alternating states   
      And slightly different transition logic.   
   Say (abbreviated):   
      000   weak, not taken   
      001   weak, taken   
      010   strong, not taken   
      011   strong, taken   
      100   weak, alternating, not-taken   
      101   weak, alternating, taken   
      110   strong, alternating, not-taken   
      111   strong, alternating, taken   
   The alternating states just flip-flopping between taken and not taken.   
      The weak states can more between any of the 4.   
      The strong states used if the pattern is reinforced.   
      
   Going up to 3 bit patterns is more of the same (add another bit,   
   doubling the number of states). Seemingly something goes nasty when   
   getting to 4 bit patterns though (and can't fit both weak and strong   
   states for longer patterns, so the 4b patterns effectively only exist as   
   weak states which partly overlap with the weak states for the 3-bit   
   patterns).   
      
   But, yeah, not going to type out state tables for these ones.   
      
      
   Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit   
   state might be impossible. Although there would be sufficient   
   state-space for the looping 5-bit patterns, there may not be sufficient   
   state-space to distinguish whether to move from a mismatched 4-bit   
   pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any   
   mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.   
   One needs to be able to express decay both to shorter patterns and to   
   longer patterns, and I suspect at this point, the pattern breaks down   
   (but can't easily confirm; it is either this or the pattern extends   
   indefinitely, I don't know...).   
      
      
   Could almost have this sort of thing as a "brain teaser" puzzle or   
   something...   
      
   Then again, maybe other people would not find any particular difficulty   
   in these sorts of tasks.   
      
      
   > Terje   
   >   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]