From: terje.mathisen@nospicedham.tmsw.no   
      
   Anton Ertl wrote:   
   > James Harris writes:   
   >>> One very important thing is whether the computations one benchmarks   
   >>> are independent of each other (then you measure throughput), or   
   >>> dependent on each other (then you measure latency). And of course,   
   >>> for code sequences involving branches, the predictability of the   
   >>> branches is an important issue.   
   >>   
   >> Yes, rather than running   
   >>   
   >> start timer   
   >> loop   
   >> code under test   
   >> endloop   
   >> stop timer and store results   
   >>   
   >> it may (IMO) be better to run   
   >>   
   >> loop   
   >> start timer   
   >> code under test   
   >> stop timer and store results   
   >> endloop   
   >   
   > Long-running loops are very predictable, that's not what I meant. The   
   > timer code may cause more measurement variations than the loop.   
   >   
   > What I meant is that, if "code under test" contains branches, as in   
   > some of the sequences in this thread, one may want to arrange the   
   > input data such that the branch prediction accuracy in the benchmark   
   > is representative of the prediction accuracy in actual usage (and   
   > determining what that is is another problem).   
      
   This is the real crux of the matter!   
      
   As all these measurements have shown, the "speed of light" latency of   
   all the various branchless variants is identical; the measured jitter   
   between them depends on microarchitectural quirks of the various CPU   
   models, including some machines where CMOV takes a cycle more than on   
   others.   
      
   The JZ versions, however, depend almost completely on the hit rate of   
   the branch predictor: if the branch is both regularly executed   
   (otherwise the timing doesn't matter, right?) and well predicted   
   (90%+), then it is effectively impossible to beat it with branchless   
   code.   
      
   There are, however, many other possible reasons for going branchless:   
   
   - Vector/SIMD code typically has to be branchless.   
   - Constant timing is important, maybe for security reasons   
     (Meltdown/Spectre/etc.).   
   - The input data is known to have high entropy, i.e. a low branch   
     predictor hit rate, e.g. decompressing h264 CABAC data, where each   
     bit in the input stream should have very close to perfect entropy.   
      
   Terje   
      
   --   
   "almost all programming can be viewed as an exercise in caching"   
      