

   comp.lang.asm.x86      Ahh, the lost art of x86 assembly      4,675 messages   


   Message 4,455 of 4,675   
   Bonita Montero to All   
   CLZERO   
   16 May 22 13:58:56   
   
   From: Bonita.Montero@nospicedham.gmail.com   
      
   x86 on AMD CPUs since Zen 1 has an instruction called CLZERO.
   According to WikiChip it exists to recover from certain memory errors,
   but this is pure nonsense. A posting on the LKML reveals the
   correct purpose: it zeroes memory quickly without polluting
   the cache, i.e. clzero is non-temporal.
   I thought it would be nice to have a comparison between a looped
   clzero and a plain memset, which is usually very well optimized
   by today's compilers. So I wrote a little benchmark
   in C++20 to compare both:
      
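   #include <iostream>
   #include <vector>
   #include <chrono>
   #include <memory>
   #include <type_traits>
   #include <cstring>
   #include <cstdint>
   #if defined(_MSC_VER)
   	#include <intrin.h>
   #elif defined(__GNUC__) || defined(__clang__)
   	#include <x86intrin.h>
   #endif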
      
   using namespace std;   
   using namespace chrono;   
      
   template<bool MemSet>
   size_t clZeroRange( void *p, size_t n );   
      
   int main()   
   {   
   	constexpr size_t   
   		N = 0x4000000,   
   		ROUNDS = 1'000;   
   	vector<char> vc( N, 0 );
   	auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
   	{
   		auto start = high_resolution_clock::now();
   		size_t n = 0;
   		for( size_t r = ROUNDS; r--; )
   			n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
   		double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
   		cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>(
   			high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
   	};   
   	bench( false_type() );   
   	bench( true_type() );   
   }   
      
   template<bool MemSet>
   size_t clZeroRange( void *p, size_t n )   
   {   
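   	// round p up to the next 64-byte (cache line) boundary
   	char *pAlign = (char *)(((size_t)p + 63) & (ptrdiff_t)-64);
   	// drop the skipped bytes, then truncate to whole cache lines
   	n -= pAlign - (char *)p;
   	n &= (ptrdiff_t)-64;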
   	if constexpr( !MemSet )   
   		for( char *end = pAlign + n; pAlign != end; pAlign += 64 )   
   			_mm_clzero( pAlign );   
   	else   
   		memset( pAlign, 0, n ); // same range as the clzero loop
   	return n;   
   }   
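
   The alignment arithmetic in clZeroRange() can be checked in isolation.
   The following is only a sketch: alignRange() and AlignedRange are
   hypothetical names, not part of the benchmark, and it works on plain
   integers instead of pointers. Unlike the original, it also guards
   against a buffer shorter than the alignment padding, where the
   original subtraction would wrap around.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring the arithmetic in clZeroRange().
struct AlignedRange
{
	std::uintptr_t begin; // first 64-byte-aligned address >= addr
	std::size_t    size;  // whole cache lines that still fit, in bytes
};

constexpr AlignedRange alignRange( std::uintptr_t addr, std::size_t n )
{
	constexpr std::uintptr_t LINE = 64;
	// round up to the next cache line boundary
	std::uintptr_t begin = (addr + LINE - 1) & ~(LINE - 1);
	std::size_t skipped = begin - addr;
	// guard (not in the original): nothing left after alignment
	if( skipped >= n )
		return { begin, 0 };
	// truncate the remainder to a multiple of 64 bytes
	return { begin, (n - skipped) & ~(std::size_t)(LINE - 1) };
}

// 0x1001 rounds up to 0x1040; 200 - 63 = 137 bytes remain, i.e. two lines
static_assert( alignRange( 0x1001, 200 ).begin == 0x1040 );
static_assert( alignRange( 0x1001, 200 ).size == 128 );
```

   For the 64 MiB vector in the benchmark the allocation is most likely
   already aligned, so nothing is skipped and n stays at N.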
      
   Interestingly, I get the same performance for both variants with
   MSVC++ 2022. With g++ / glibc, memset() reaches only about one
   third of the throughput of the clzero() solution. I think
   the memset() of glibc is just not optimized as well. The memset()
   of Visual C++ uses non-temporal SSE stores, which explains the good
   performance.
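
   For illustration, zeroing with non-temporal SSE stores generally looks
   like the sketch below. This is not the actual Visual C++ memset()
   code; zeroNonTemporal() is a hypothetical helper, and it assumes
   that p is 16-byte aligned and n is a multiple of 16.

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_setzero_si128, _mm_stream_si128

// Hypothetical helper: zero n bytes at p with non-temporal 16-byte stores.
// Assumes p is 16-byte aligned and n is a multiple of 16.
inline void zeroNonTemporal( void *p, std::size_t n )
{
	__m128i zero = _mm_setzero_si128();
	char *cur = static_cast<char *>( p ), *end = cur + n;
	for( ; cur != end; cur += 16 )
		_mm_stream_si128( reinterpret_cast<__m128i *>( cur ), zero );
	// non-temporal stores are weakly ordered; fence before the data is shared
	_mm_sfence();
}
```

   Dropping such a loop into the benchmark as a third variant would show
   whether the gap to glibc's memset() comes from the store type alone.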
      
   Would someone here be so kind as to post their results?
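
   On a CPU without CLZERO the instruction raises an invalid-opcode
   fault, so a check before running the benchmark may be useful. AMD
   documents the feature flag as CPUID leaf 0x80000008, EBX bit 0. The
   sketch below uses that bit; hasClzero() is a hypothetical helper name.
   With g++ and clang the benchmark itself probably also needs -mclzero
   so that _mm_clzero() is available.

```cpp
#if defined(_MSC_VER)
	#include <intrin.h>
#else
	#include <cpuid.h>
#endif

// Hypothetical helper: true if CPUID reports CLZERO support
// (extended leaf 0x80000008, EBX bit 0).
inline bool hasClzero()
{
#if defined(_MSC_VER)
	int regs[4];
	// highest supported extended leaf is returned in EAX
	__cpuid( regs, (int)0x80000000 );
	if( (unsigned)regs[0] < 0x80000008u )
		return false;
	__cpuid( regs, (int)0x80000008 );
	return (regs[1] & 1) != 0;
#else
	unsigned eax, ebx, ecx, edx;
	// __get_cpuid returns 0 if the leaf is not supported
	if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
		return false;
	return (ebx & 1u) != 0;
#endif
}
```

   Calling hasClzero() at the start of main() and skipping the clzero
   variant when it returns false would keep the memset() numbers
   comparable across machines.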
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca