... darkrealms ...

Forums before death by AOL, social media and spammers... "We can't have nice things"

comp.arch

Apparently more than just beeps & boops

131,241 messages

[ << oldest | < older | list | newer > | newest >> ]

Message 129,459 of 131,241

Anton Ertl to Terje Mathisen

Re: SHA3 - 32 registers required for spe

21 Aug 25 13:23:11

   From: anton@mips.complang.tuwien.ac.at   
      
   Terje Mathisen  writes:   
   >I looked at the (newish) SHA3 backup algorithm for the current SHA2, it   
   >is based on a 5x5x64 bit internal core, with mixing operations that have   
   >been based on having each of those 5x5=25 stripes in a separate 64-bit   
   >(unsigned) variable, and then do a lot of logic ops across them, and   
   >using rotations for vertical motion.   
   >   
   >With a core containing 25 named variables, you need an easy way to   
   >access any of them, so even using SIMD might be a stretch?   
      
   Maybe the SIMD shuffle operations can help.   
      
   Even if not, AMD64 has 15 GPRs (+RSP), and 16 XMM registers, with   
   instructions for moving between GPRs and XMM, so that should fit.   
      
   x86-64-v4 has 32 XMM registers.   
      
   What other instruction set with 64-bit GPRs and less than 32 GPRs do   
   you have in mind?   
      
   If you want to limit yourself to the GPRs of AMD64, keeping a part of   
   the variables in memory is not that bad for latency on recent   
   high-performance microarchitectures, thanks to 0-cycle store-to-load   
   forwarding; it still costs slots in the load/store units and in the   
   renamer (I think, not sure how AMD's macro-ops consume renamer   
   resources), so avoiding loads and stores through register allocation   
   is still a good idea.   
      
   >I see a github SHA3 project that simply defines an array of uint64_t,   
   >and all the ops index into this, but unless the compiler can extract   
   >them all into registers, this is going to generate a lot of extra $L1   
   >traffic.   
      
   I would not trust the compiler to be good at converting these accesses   
   to registers.  If you are unlucky, it auto-vectorizes the memory   
   accesses and runs into a slow path of the CPU while doing so   
   <2025Jul13.162453@mips.complang.tuwien.ac.at>.   
      
   Daniel J. Bernstein has analysed the performane of software   
   implementations of SHA3 candidates (the one that eventually won is a   
   variant of Keccak) in  and   
   measured around 12 cycles/byte for keccak512 on AMD64 CPUs of   
   2007-2010 vintage, with a theoretical minimum of 7.824 cycles/byte   
   (apparently based on resource limits); he suspects that the extra 4   
   cycles per byte are due to loads and stores, so apparently he did not   
   keep all values in registers.   
      
   - anton   
   --   
   'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'   
     Mitch Alsup,    
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)

[ << oldest | < older | list | newer > | newest >> ]