From: Bonita.Montero@nospicedham.gmail.com   
      
   x86 on AMD CPUs since Zen 1 has an instruction called CLZERO.
   According to WikiChip it is meant for recovering from certain memory
   errors, but this is pure nonsense. A posting on the LKML reveals the
   actual purpose: zeroing memory quickly without polluting the cache,
   i.e. CLZERO is non-temporal.
   I thought it would be nice to have a comparison between a looped
   CLZERO and a plain memset(), which itself is usually optimized very
   well by today's compilers and runtime libraries. So I wrote a little
   benchmark in C++20 to compare both:
      
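   #include <iostream>
   #include <vector>
   #include <chrono>
   #include <cstring>
   #include <cstdint>
   #include <memory>
   #include <type_traits>
   #if defined(_MSC_VER)
    #include <intrin.h>
   #elif defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h> // _mm_clzero needs -mclzero (or -march=znver1)
   #endif

   using namespace std;
   using namespace chrono;

   // MemSet == false: zero with a CLZERO loop; MemSet == true: zero with memset()
   template<bool MemSet>
   size_t clZeroRange( void *p, size_t n );

   int main()
   {
       constexpr size_t
           N = 0x4000000, // 64 MiB buffer
           ROUNDS = 1'000;
       vector<char> vc( N, 0 );
       auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
       {
           auto start = high_resolution_clock::now();
           size_t n = 0;
           for( size_t r = ROUNDS; r--; )
               n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
           // total bytes zeroed, in GiB
           double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
           // prints the throughput in GiB/s
           cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>(
               high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
       };
       bench( false_type() ); // CLZERO loop
       bench( true_type() );  // memset()
   }

   template<bool MemSet>
   size_t clZeroRange( void *p, size_t n )
   {
       // round p up to the next 64-byte cache-line boundary
       char *pAlign = (char *)(((size_t)p + 63) & (ptrdiff_t)-64);
       n -= pAlign - (char *)p;
       // round the length down to whole cache lines
       n &= (ptrdiff_t)-64;
       if constexpr( !MemSet )
           for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
               _mm_clzero( pAlign );
       else
           memset( p, 0, n );
       return n;
   }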
      
   Interestingly, I get the same performance for both variants with
   MSVC++ 2022. With g++ / glibc, memset() reaches only about one third
   of the throughput of the clzero() solution. I think the memset() of
   glibc is just not optimized as well for this case. The memset() of
   Visual C++ uses non-temporal SSE stores, which explains the good
   performance.
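
   To illustrate what "non-temporal SSE stores" means here, this is a
   rough sketch of zeroing a buffer with _mm_stream_si128. It is not the
   actual Visual C++ memset(); the function name zeroNonTemporal and the
   HAVE_SSE2_STREAM macro are made up for this example, and on targets
   without SSE2 it simply falls back to memset().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
 #include <emmintrin.h> // _mm_stream_si128, _mm_setzero_si128, _mm_sfence
 #define HAVE_SSE2_STREAM 1
#endif

// Hypothetical helper, not taken from any runtime library:
// zeroes n bytes at p; the 16-byte-aligned middle part is written
// with non-temporal stores when SSE2 is available.
void zeroNonTemporal( void *p, std::size_t n )
{
    char *cur = static_cast<char *>( p );
    char *end = cur + n;
#if defined(HAVE_SSE2_STREAM)
    // bytes needed to reach the next 16-byte boundary
    std::size_t head = (16 - (reinterpret_cast<std::uintptr_t>( cur ) & 15)) & 15;
    if( head > n )
        head = n;
    std::memset( cur, 0, head );
    cur += head;
    __m128i const zero = _mm_setzero_si128();
    // streaming stores write around the cache instead of filling it
    for( ; static_cast<std::size_t>( end - cur ) >= 16; cur += 16 )
        _mm_stream_si128( reinterpret_cast<__m128i *>( cur ), zero );
    // make the weakly ordered streaming stores visible before returning
    _mm_sfence();
#endif
    // unaligned tail (or the whole range when SSE2 is not available)
    std::memset( cur, 0, static_cast<std::size_t>( end - cur ) );
}
```

   Calling zeroNonTemporal( buf, len ) zeroes exactly len bytes starting
   at buf; the head and tail that do not fill a whole 16-byte block go
   through the ordinary memset().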
      
   Would someone here be so kind as to post their values?
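
   A note for anyone who wants to try it: CLZERO only exists on Zen and
   later, so the benchmark will die with an illegal-instruction fault on
   other CPUs. As far as I know the feature is reported in CPUID leaf
   0x80000008, EBX bit 0. Here is a minimal sketch of a check; the
   function name hasClzero is made up for this example, and on non-x86
   targets it just returns false.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
 #include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
 #include <cpuid.h>
#endif

// Hypothetical helper: true if CPUID reports CLZERO support
// (leaf 0x80000008, EBX bit 0), false otherwise.
bool hasClzero()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    // highest supported extended leaf is returned in EAX
    __cpuid( regs, (int)0x80000000u );
    if( (unsigned)regs[0] < 0x80000008u )
        return false;
    __cpuid( regs, (int)0x80000008u );
    return (regs[1] & 1) != 0; // regs[1] is EBX
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax, ebx, ecx, edx;
    // __get_cpuid returns 0 if the requested leaf is not supported
    if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return (ebx & 1u) != 0;
#else
    return false; // not an x86 target
#endif
}
```

   In main() one could test hasClzero() first and skip the
   bench( false_type() ) call when it returns false.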
      