Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 130,132 of 131,241    |
|    BGB to Terje Mathisen    |
|    Re: branch splitting    |
|    05 Nov 25 01:00:32    |
      From: cr88192@gmail.com              On 11/4/2025 3:44 PM, Terje Mathisen wrote:       > MitchAlsup wrote:       >>       >> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:       >>       >>> Bottom line: History-based branch prediction has won, any kind of       >>> delayed branches (including split-branch designs) turn out to be       >>> a bad idea.       >>       >> Or "Never bet against branch prediction".       >       > I have probably mentioned this before, once or twice, but I'm actually       > quite proud of the meeting I had with Intel Santa Clara in the spring of       > 1995:       >       > I had (accidentally) written the first public mention of the FDIV bug       > (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of       > MathWorks/MatLab fame led the effort to develop a minimum cost sw       > workaround for the bug. (My code became part of all/most x86 compiler       > runtimes for the next few years.)       >       > Due to this Intel invited me to receive an early engineering prototype       > of the PentiumPro, together with an NDA-covered briefing about its       > architecture.       >       > Before the start of that briefing I suggested that I should start off on       > the blackboard by showing what I had been able to figure out on my own,       > then I proceeded to pretty much exactly cover every single feature on       > the cpu, with one glaring exception:       >       > Based on the useful but not great branch predictor on the Pentium I told       > them that I expected the P6 to employ eager execution, i.e execute both       > ways of one or two layers of branches, discarding the non-taken paths as       > the branch direction info became available.       >       > That's the point when they got to brag about how having a much, much       > better branch predictor was better both from a performance and a power       > viewpoint, since out of order execution could predict much deeper than       > any eager execution would have the resources for.       >       > As you said: "Never bet against branch prediction".       >              Branch prediction is fun.                     When I looked around online before, a lot of stuff about branch       prediction was talking about fairly large and convoluted schemes for the       branch predictors.              But, then always at the end of it using 2-bit saturating counters:        weakly taken, weakly not-taken, strongly taken, strongly not taken.              But, in my fiddling, there was seemingly a simple but moderately       effective strategy:        Keep a local history of taken/not-taken;        XOR this with the low-order-bits of PC for the table index;        Use a 5/6-bit finite-state-machine or similar.        Can model repeating patterns up to ~ 4 bits.              Where, the idea was that the state-machine in updated with the current       state and branch direction, giving the next state and next predicted       branch direction (for this state).                     Could model slightly more complex patterns than the 2-bit saturating       counters, but it is sort of a partial mystery why (for mainstream       processors) more complex lookup schemes and 2 bit state, was preferable       to a simpler lookup scheme and 5-bit state.              Well, apart from the relative "dark arts" needed to cram 4-bit patterns       into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).                            Then again, had before noted that the LLMs are seemingly also not really       able to figure out how to make a 5 bit FSM to model a full set of 4 bit       patterns.                     Then again, I wouldn't expect it to be all that difficult of a problem       for someone that is "actually smart"; so presumably chip designers could       have done similar.              Well, unless maybe the argument is that 5 or 6 bits of storage would       cost more than 2 bits, but then presumably needing to have significantly       larger tables (to compensate for the relative predictive weakness of       2-bit state) would have costed more than the cost of smaller tables of 6       bit state ?...              Say, for example, 2b:        00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken        00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken        01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken        01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken        10_0 => 10_0 //strongly not taken, dir=0        10_1 => 00_0 //strongly not taken, dir=1 (goes weak)        11_0 => 01_1 //strongly taken, dir=0        11_1 => 11_1 //strongly taken, dir=1 (goes weak)              Can expand it to 3-bits, for 2-bit patterns        As above, and 4-more alternating states        And slightly different transition logic.       Say (abbreviated):        000 weak, not taken        001 weak, taken        010 strong, not taken        011 strong, taken        100 weak, alternating, not-taken        101 weak, alternating, taken        110 strong, alternating, not-taken        111 strong, alternating, taken       The alternating states just flip-flopping between taken and not taken.        The weak states can more between any of the 4.        The strong states used if the pattern is reinforced.              Going up to 3 bit patterns is more of the same (add another bit,       doubling the number of states). Seemingly something goes nasty when       getting to 4 bit patterns though (and can't fit both weak and strong       states for longer patterns, so the 4b patterns effectively only exist as       weak states which partly overlap with the weak states for the 3-bit       patterns).              But, yeah, not going to type out state tables for these ones.                     Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit       state might be impossible. Although there would be sufficient       state-space for the looping 5-bit patterns, there may not be sufficient       state-space to distinguish whether to move from a mismatched 4-bit       pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any       mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.       One needs to be able to express decay both to shorter patterns and to       longer patterns, and I suspect at this point, the pattern breaks down       (but can't easily confirm; it is either this or the pattern extends       indefinitely, I don't know...).                     Could almost have this sort of thing as a "brain teaser" puzzle or       something...              Then again, maybe other people would not find any particular difficulty       in these sorts of tasks.                     > Terje       >              --- SoupGate-Win32 v1.05        * Origin: you cannot sedate... all the things you hate (1:229/2)    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca