... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"
comp.arch
Apparently more than just beeps & boops
131,241 messages
[ << oldest | < older | list | newer > | newest >> ]
Message 130,134 of 131,241
BGB to BGB
Re: branch splitting (1/2)
05 Nov 25 01:38:30
   From: cr88192@gmail.com   
      
   On 11/5/2025 1:00 AM, BGB wrote:   
   > On 11/4/2025 3:44 PM, Terje Mathisen wrote:   
   >> MitchAlsup wrote:   
   >>>   
   >>> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>>   
   >>>> Bottom line: History-based branch prediction has won, any kind of   
   >>>> delayed branches (including split-branch designs) turn out to be   
   >>>> a bad idea.   
   >>>   
   >>> Or "Never bet against branch prediction".   
   >>   
   >> I have probably mentioned this before, once or twice, but I'm actually   
   >> quite proud of the meeting I had with Intel Santa Clara in the spring   
   >> of 1995:   
   >>   
   >> I had (accidentally) written the first public mention of the FDIV bug   
   >> (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of   
   >> MathWorks/MatLab fame led the effort to develop a minimum cost sw   
   >> workaround for the bug. (My code became part of all/most x86 compiler   
   >> runtimes for the next few years.)   
   >>   
   >> Due to this Intel invited me to receive an early engineering prototype   
   >> of the PentiumPro, together with an NDA-covered briefing about its   
   >> architecture.   
   >>   
   >> Before the start of that briefing I suggested that I should start off   
   >> on the blackboard by showing what I had been able to figure out on my   
   >> own, then I proceeded to pretty much exactly cover every single   
   >> feature on the cpu, with one glaring exception:   
   >>   
   >> Based on the useful but not great branch predictor on the Pentium I   
   >> told them that I expected the P6 to employ eager execution, i.e   
   >> execute both ways of one or two layers of branches, discarding the   
   >> non-taken paths as the branch direction info became available.   
   >>   
   >> That's the point when they got to brag about how having a much, much   
   >> better branch predictor was better both from a performance and a power   
   >> viewpoint, since out of order execution could predict much deeper than   
   >> any eager execution would have the resources for.   
   >>   
   >> As you said: "Never bet against branch prediction".   
   >>   
   >   
   > Branch prediction is fun.   
   >   
   >   
   > When I looked around online before, a lot of stuff about branch   
   > prediction was talking about fairly large and convoluted schemes for the   
   > branch predictors.   
   >   
   > But, then always at the end of it using 2-bit saturating counters:   
   >    weakly taken, weakly not-taken, strongly taken, strongly not taken.   
   >   
   > But, in my fiddling, there was seemingly a simple but moderately   
   > effective strategy:   
   >    Keep a local history of taken/not-taken;   
   >    XOR this with the low-order-bits of PC for the table index;   
   >    Use a 5/6-bit finite-state-machine or similar.   
   >      Can model repeating patterns up to ~ 4 bits.   
   >   
   > Where, the idea was that the state-machine in updated with the current   
   > state and branch direction, giving the next state and next predicted   
   > branch direction (for this state).   
   >   
   >   
   > Could model slightly more complex patterns than the 2-bit saturating   
   > counters, but it is sort of a partial mystery why (for mainstream   
   > processors) more complex lookup schemes and 2 bit state, was preferable   
   > to a simpler lookup scheme and 5-bit state.   
   >   
   > Well, apart from the relative "dark arts" needed to cram 4-bit patterns   
   > into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).   
   >   
   >   
   >   
   > Then again, had before noted that the LLMs are seemingly also not really   
   > able to figure out how to make a 5 bit FSM to model a full set of 4 bit   
   > patterns.   
   >   
      
      
   Errm...   
      
   I just decided to test it, and it appears Grok was able to figure it out   
   (more or less).   
      
   This is concerning, either the AIs are getting smart enough to deal with   
   semi-difficult problems; or in fact it is not difficult and I was just   
   dumb for thinking there is any difficulty in working out the state   
   tables for the longer patterns.   
      
   I tried before with DeepSeek R1 and similar, which had failed.   
      
      
   >   
   > Then again, I wouldn't expect it to be all that difficult of a problem   
   > for someone that is "actually smart"; so presumably chip designers could   
   > have done similar.   
   >   
   > Well, unless maybe the argument is that 5 or 6 bits of storage would   
   > cost more than 2 bits, but then presumably needing to have significantly   
   > larger tables (to compensate for the relative predictive weakness of 2-   
   > bit state) would have costed more than the cost of smaller tables of 6   
   > bit state ?...   
   >   
   > Say, for example, 2b:   
   >   00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken   
   >   00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken   
   >   01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken   
   >   01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken   
   >   10_0 => 10_0  //strongly not taken, dir=0   
   >   10_1 => 00_0  //strongly not taken, dir=1 (goes weak)   
   >   11_0 => 01_1  //strongly taken, dir=0   
   >   11_1 => 11_1  //strongly taken, dir=1 (goes weak)   
   >   
   > Can expand it to 3-bits, for 2-bit patterns   
   >    As above, and 4-more alternating states   
   >    And slightly different transition logic.   
   > Say (abbreviated):   
   >    000   weak, not taken   
   >    001   weak, taken   
   >    010   strong, not taken   
   >    011   strong, taken   
   >    100   weak, alternating, not-taken   
   >    101   weak, alternating, taken   
   >    110   strong, alternating, not-taken   
   >    111   strong, alternating, taken   
   > The alternating states just flip-flopping between taken and not taken.   
   >    The weak states can more between any of the 4.   
   >    The strong states used if the pattern is reinforced.   
   >   
   > Going up to 3 bit patterns is more of the same (add another bit,   
   > doubling the number of states). Seemingly something goes nasty when   
   > getting to 4 bit patterns though (and can't fit both weak and strong   
   > states for longer patterns, so the 4b patterns effectively only exist as   
   > weak states which partly overlap with the weak states for the 3-bit   
   > patterns).   
   >   
   > But, yeah, not going to type out state tables for these ones.   
   >   
   >   
   > Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit   
   > state might be impossible. Although there would be sufficient state-   
   > space for the looping 5-bit patterns, there may not be sufficient state-   
   > space to distinguish whether to move from a mismatched 4-bit pattern to   
   > a 3 or 5 bit pattern. Whereas, at least with 4-bit, any mismatch of the   
   > 4-bit pattern can always decay to a 3-bit pattern, etc. One needs to be   
   > able to express decay both to shorter patterns and to longer patterns,   
   > and I suspect at this point, the pattern breaks down (but can't easily   
   > confirm; it is either this or the pattern extends indefinitely, I don't   
   > know...).   
   >   
   >   
   > Could almost have this sort of thing as a "brain teaser" puzzle or   
   > something...   
   >   
   > Then again, maybe other people would not find any particular difficulty   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
[ << oldest | < older | list | newer > | newest >> ]