   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   

   Message 3,879 of 4,675   
   Anton Ertl to Aleksey Demakov   
   Re: compact and fast bzero with AVX inst   
   04 May 19 06:39:48   
   
   From: anton@nospicedham.mips.complang.tuwien.ac.at   
      
   Aleksey Demakov writes:   
   >I tested it only on a mac Haswell laptop where it compares quite favorably   
   >against memset(). I will appreciate any feedback for other archs.   
      
   If you provide a benchmark, I can test on several different   
   microarchitectures.  Ideally, make it as easy to build and run as   
   possible; I time with perf stat, so no timing code is necessary.   
   If your benchmark also measures the library memset and/or bzero,   
   I can give numbers for those as well.   
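
A minimal sketch of what such a benchmark harness could look like, in C. It is not Aleksey's code: `zero_fn`, `libc_zero` and `bench_zero` are hypothetical names, and the 64-byte base alignment is an assumption chosen so that the `offset` argument fully controls the misalignment. There is no timing code, since the timing comes from perf stat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Signature shared by every routine under test (hypothetical name). */
typedef void (*zero_fn)(void *dst, size_t n);

/* Baseline: the C library's memset. */
static void libc_zero(void *dst, size_t n) { memset(dst, 0, n); }

/* Zero `size` bytes `reps` times at `offset` bytes past a 64-byte boundary.
   Returns the number of nonzero bytes left in the block (0 means correct),
   or (size_t)-1 if the allocation fails. */
static size_t bench_zero(zero_fn fn, size_t size, size_t offset,
                         unsigned long reps)
{
    assert(fn != NULL);
    unsigned char *raw = malloc(size + offset + 64);
    if (raw == NULL)
        return (size_t)-1;

    /* Round up to a 64-byte boundary, then misalign by `offset`. */
    unsigned char *buf =
        (unsigned char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63) + offset;

    memset(buf, 0xFF, size);      /* make a wrong result detectable */

    zero_fn volatile call = fn;   /* volatile: keep every call in the loop */
    for (unsigned long i = 0; i < reps; i++)
        call(buf, size);

    size_t bad = 0;
    for (size_t i = 0; i < size; i++)
        bad += (buf[i] != 0);

    free(raw);
    return bad;
}
```

A small main that reads size, offset and repetition count from the command line and passes either `libc_zero` or the routine under test to `bench_zero` would turn this into a program that can be run directly under perf stat.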
      
   Concerning your implementation, I have two suggestions:   
      
   1) I would align %rdi for the inner loop to 32 bytes.  This should   
   increase the throughput for big blocks in the unaligned case.   
      
   2) Unrolling: I would unroll by only a factor of 1 or 2.  Current   
   machines can only do one (aligned) 256-bit store per cycle, and can do   
   the rest of the loop overhead in that cycle, so more unrolling does   
   not buy more throughput; IIRC Ice Lake will be able to do two stores   
   per cycle, if it ever appears.  Less unrolling will reduce the code   
   size, and may also reduce branch mispredictions.   
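
The two suggestions can be sketched together in C with AVX intrinsics. This is an illustration only, not the assembly under discussion (which is not shown here): `bzero_avx_aligned` is a hypothetical name, the fallback to memset below 32 bytes is an assumption, and `__attribute__((target("avx")))` is a GCC/Clang extension that needs an x86 CPU with AVX at run time.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero n bytes at dst: unaligned head store, 32-byte-aligned inner loop
   with one 256-bit store per iteration (unroll factor 1), unaligned tail. */
__attribute__((target("avx")))   /* GCC/Clang extension, assumed available */
static void bzero_avx_aligned(void *dst, size_t n)
{
    assert(dst != NULL);
    unsigned char *p = dst;

    if (n < 32) {                 /* small blocks: outside this sketch */
        memset(p, 0, n);
        return;
    }

    const __m256i zero = _mm256_setzero_si256();
    unsigned char *end = p + n;

    /* Head: one unaligned store covers everything up to the next
       32-byte boundary (and possibly a little beyond it). */
    _mm256_storeu_si256((__m256i *)p, zero);
    p = (unsigned char *)(((uintptr_t)p + 32) & ~(uintptr_t)31);

    /* Inner loop: p is now 32-byte aligned, one aligned store per pass. */
    while ((size_t)(end - p) >= 32) {
        _mm256_store_si256((__m256i *)p, zero);
        p += 32;
    }

    /* Tail: one unaligned store ending exactly at `end`; it may overlap
       bytes that are already zero, which is harmless. */
    _mm256_storeu_si256((__m256i *)(end - 32), zero);
}
```

The overlapping head and tail stores avoid a byte-wise cleanup loop, at the cost of writing some bytes twice. Unrolling by 2 would mean two aligned stores per pass and a loop condition of at least 64 remaining bytes.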
      
   - anton   
   --   
   M. Anton Ertl                    Some things have to be seen to be believed   
   anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen   
   http://www.complang.tuwien.ac.at/anton/home.html   
      
