comp.lang.asm.x86

Ahh, the lost art of x86 assembly

4,675 messages

Message 3,879 of 4,675

Anton Ertl to Aleksey Demakov

Re: compact and fast bzero with AVX inst

04 May 19 06:39:48

   From: anton@nospicedham.mips.complang.tuwien.ac.at   

   Aleksey Demakov writes:   
   >I tested it only on a mac Haswell laptop where it compares quite favorably   
   >against memset(). I will appreciate any feedback for other archs.   

   If you provide a benchmark, I can test it on several different   
   microarchitectures.  Ideally, make it as easy to build and run as   
   possible; I time with perf stat, so no timing code is necessary.   
   If your benchmark also measures the library memset and/or bzero,   
   I can give numbers for those as well.   
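
As a rough illustration of a benchmark with no timing code of its own, here is a minimal sketch of a driver for the library memset. The function name run_memset_zero and its parameters (size, offset, reps) are assumptions made for this example and are not from the thread. The call goes through a volatile function pointer so the compiler cannot drop the repeated fills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver: zero `size` bytes `reps` times with the library
 * memset, starting `offset` bytes past a 64-byte boundary.  There is no
 * timing code here; an external tool is expected to time the process.
 * Returns the number of nonzero bytes left in the block (0 when reps > 0). */
size_t run_memset_zero(size_t size, size_t offset, unsigned long reps)
{
    assert(offset < 64);

    unsigned char *raw = malloc(size + 128);
    if (raw == NULL)
        abort();

    /* Round up to a 64-byte boundary, then apply the requested misalignment. */
    unsigned char *buf =
        (unsigned char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63) + offset;

    /* Start from a nonzero pattern so the final check means something. */
    memset(buf, 0xFF, size);

    /* Volatile pointer: the compiler must perform every call. */
    void *(*volatile fill)(void *, int, size_t) = memset;
    for (unsigned long r = 0; r < reps; r++)
        fill(buf, 0, size);

    size_t nonzero = 0;
    for (size_t i = 0; i < size; i++)
        nonzero += (buf[i] != 0);

    free(raw);
    return nonzero;
}
```

A small main that parses size, offset and reps from argv and calls this function would be enough to run the resulting program under perf stat. Swapping another routine in for memset at the function pointer gives the comparison numbers.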

   Concerning your implementation, I have two suggestions:   

   1) I would align %rdi for the inner loop to 32 bytes.  This should   
   increase the throughput for big blocks in the unaligned case.   

   2) Unrolling: I would unroll by only a factor of 1 or 2.  Current   
   machines can only do one (aligned) 256-bit store per cycle, and can do   
   the rest of the loop overhead in that same cycle; IIRC Ice Lake will be   
   able to do two stores per cycle, if it ever appears.  Less unrolling   
   will reduce the code size, and may also reduce branch mispredictions.   
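
The two suggestions above can be sketched in C with AVX intrinsics. This is only an illustration under assumptions, not the implementation being discussed: the name bzero_avx_sketch is made up, and handling the head and tail with overlapping unaligned stores is one possible choice among several. The inner loop stores only to 32-byte-aligned addresses and is unrolled by 2.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: zero n bytes at dst.  The target attribute lets
 * GCC/Clang compile this function for AVX without a global -mavx flag. */
__attribute__((target("avx")))
void bzero_avx_sketch(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);

    unsigned char *p = dst;
    unsigned char *end = p + n;
    const __m256i zero = _mm256_setzero_si256();

    /* Small blocks: plain byte loop, no vector stores. */
    if (n < 32) {
        while (p < end)
            *p++ = 0;
        return;
    }

    /* Head: one unaligned store covers everything up to the next
     * 32-byte boundary (and possibly a few bytes beyond it). */
    _mm256_storeu_si256((__m256i *)p, zero);
    p = (unsigned char *)(((uintptr_t)p + 32) & ~(uintptr_t)31);

    /* Inner loop: aligned 256-bit stores, unrolled by a factor of 2. */
    while ((size_t)(end - p) >= 64) {
        _mm256_store_si256((__m256i *)p, zero);
        _mm256_store_si256((__m256i *)(p + 32), zero);
        p += 64;
    }
    if ((size_t)(end - p) >= 32) {
        _mm256_store_si256((__m256i *)p, zero);
        p += 32;
    }

    /* Tail: one unaligned store ending exactly at `end`; it may overlap
     * bytes already zeroed, which is harmless. */
    _mm256_storeu_si256((__m256i *)(end - 32), zero);
}
```

With overlapping head and tail stores, the loop body never needs a remainder branch per byte, which keeps the code small. Whether this beats a more heavily unrolled loop on a given microarchitecture is something only measurement can tell.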

   - anton   
   --   
   M. Anton Ertl                    Some things have to be seen to be believed   
   anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen   
   http://www.complang.tuwien.ac.at/anton/home.html   

