
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 129,253 of 131,241   
   Terje Mathisen to Michael S   
   Re: VAX   
   05 Aug 25 17:31:34   
   
   From: terje.mathisen@tmsw.no   
      
   Michael S wrote:   
   > On Tue, 5 Aug 2025 00:14:43 +0300   
   > Michael S  wrote:   
   >   
   >> On Mon, 4 Aug 2025 22:49:23 +0200   
   >> Terje Mathisen  wrote:   
   >>   
   >>> Anton Ertl wrote:   
   >>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   >>>>> Michael S  writes:   
   >>>>>> Actually, in our world the latest C standard (C23) has them, but   
   >>>>>> the spelling is different: _BitInt(32) and unsigned _BitInt(32).   
   >>>>>> I'm not sure if any major compiler already has them implemented.   
   >>>>>> Bing copilot says that clang does, but I don't tend to believe   
   >>>>>> everything Bing copilot says.
   >>>>>   
   >>>>> I asked godbolt, and tried the following program:   
   >>>>>   
   >>>>> typedef ump unsigned _BitInt(65535);   
   >>>>   
   >>>> The actual compiling version is:   
   >>>>   
   >>>> typedef unsigned _BitInt(65535) ump;   
   >>>>   
   >>>> ump sum3(ump a, ump b, ump c)   
   >>>> {   
   >>>>      return a+b+c;   
   >>>> }   
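Conceptually, a compiler has to lower such a wide addition to limb-wise adds with carry propagation. A portable C sketch of the idea follows; the limb count and function name are illustrative, not what clang actually emits:

```c
#include <stdint.h>

#define NLIMBS 4  /* stand-in for the 1024 64-bit limbs of a 65535-bit value */

/* dst = a + b + c over NLIMBS little-endian 64-bit limbs.  Each limb's
   addition consumes the carry produced by the previous limb, which is
   the serial dependency discussed later in the thread. */
static void sum3_limbs(uint64_t *dst, const uint64_t *a,
                       const uint64_t *b, const uint64_t *c)
{
    unsigned carry = 0;                 /* carry into the current limb, 0..2 */
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s = a[i];
        unsigned cy = __builtin_add_overflow(s, b[i], &s);
        cy += __builtin_add_overflow(s, c[i], &s);
        cy += __builtin_add_overflow(s, (uint64_t)carry, &s);
        dst[i] = s;
        carry = cy;                     /* at most 2 can reach the next limb */
    }
}
```

The final carry out of the top limb is discarded, i.e. the sum is taken modulo 2^(64*NLIMBS), matching what a fixed-width `_BitInt` would do.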
   >>>   
   >>> I would naively expect the ump type to be defined as an array of   
   >>> unsigned (byte/short/int/long), possibly with a header defining how   
   >>> large the allocation is and how many bits are currently defined.   
   >>>   
   >>> The actual code to add three of them could be something like   
   >>>   
   >>>     xor rax,rax   
   >>> next:   
   >>>     add rax,[rsi+rcx*8]   
   >>>     adc rdx,0   
   >>>     add rax,[r8+rcx*8]   
   >>>     adc rdx,0   
   >>>     add rax,[r9+rcx*8]   
   >>>     adc rdx,0   
   >>>     mov [rdi+rcx*8],rax   
   >>>     mov rax,rdx   
   >>>     inc rcx   
   >>>     cmp rcx,r10   
   >>>     jb next
   >>>   
   >>> The main problem here is of course that every add operation depends   
   >>> on the previous, so max speed would be 4-5 clock cycles/iteration.   
   >>>   
   >>> Terje   
   >>>   
   >>   
   >> I would guess that even a pair of x86-style loops would likely be
   >> faster than that on most x86-64 processors made in the last 15 years,
   >> despite doing 1.5x more memory accesses.
   >> ; rcx = dst   
   >> ; rdx = a - dst   
   >> ; r8 = b - dst   
   >>   mov $1024, %esi   
   >>   clc   
   >> .loop1:   
   >>   mov (%rcx,%r8), %rax   
   >>   adc (%rcx,%rdx), %rax   
   >>   mov %rax, (%rcx)   
   >>   lea 8(%rcx), %rcx   
   >>   dec %esi   
   >>   jnz .loop1   
   >>   
   >>   sub $8192, %rcx  ; 1024 limbs * 8 bytes
   >>   mov ..., %rdx ; %rdx = c-dst   
   >>   mov $1024, %esi   
   >>   clc   
   >> .loop2:   
   >>   mov (%rcx,%rdx), %rax   
   >>   adc %rax, (%rcx)   
   >>   lea 8(%rcx), %rcx   
   >>   dec %esi   
   >>   jnz .loop2   
   >>   ...   
   >>   
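In portable C, the two-pass scheme above amounts to something like the following sketch (limb count and names illustrative). Each pass propagates a single 0/1 carry, like the ADC chain in each of the two loops:

```c
#include <stdint.h>

#define NLIMBS 4  /* stand-in for the 1024 limbs of the 65536-bit buffer */

/* Pass 1: dst = a + b.  Pass 2: dst += c.  Within a pass the two
   overflow checks can never both fire, so the carry stays 0 or 1. */
static void sum3_two_pass(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, const uint64_t *c)
{
    unsigned carry = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s;
        unsigned cy = __builtin_add_overflow(a[i], b[i], &s);
        cy += __builtin_add_overflow(s, (uint64_t)carry, &s);
        dst[i] = s;
        carry = cy;
    }
    carry = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s;
        unsigned cy = __builtin_add_overflow(dst[i], c[i], &s);
        cy += __builtin_add_overflow(s, (uint64_t)carry, &s);
        dst[i] = s;
        carry = cy;
    }
}
```

As in the assembly, the carry out of the top limb is simply dropped.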
   >   
   >   
   > For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and   
   > Intel Lion Cove, I'd do the following modification to your inner loop   
   > (back in Intel syntax):   
   >   
   >     xor ebx,ebx
   > next:
   >     xor edx, edx
   >     mov rax,[rsi+rcx*8]
   >     add rax,[r8+rcx*8]
   >     adc edx,edx
   >     add rax,[r9+rcx*8]
   >     adc edx,0
   >     add rax,rbx
   >     jc  increment_edx
   >     ; eliminate the data dependency between loop iterations,
   >     ; replacing it by a very predictable control dependency
   > edx_ready:
   >     mov ebx, edx
   >     mov [rdi+rcx*8],rax
   >     inc rcx
   >     cmp rcx,r10
   >     jb next
   >     ...
   >     ret
   >
   >
   > ; this code is placed after the return;
   > ; it is executed extremely rarely, for random inputs approximately never
   > increment_edx:
   >    inc edx
   >    jmp edx_ready
   >   
   >   
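Reading the loop as intended (rbx carries the previous limb's carry count into the next iteration), the effect, though not the microarchitectural benefit, can be modelled in portable C roughly as follows. The limb count and names are illustrative:

```c
#include <stdint.h>

#define NLIMBS 4  /* stand-in for 1024 limbs */

/* Model of the carry-counting loop: per limb, a+b+c plus the incoming
   carry count is summed, and the number of 64-bit overflows (0..2) is
   collected in a small counter that becomes the next limb's carry-in.
   In the assembly, the rare third overflow is handled by a branch to an
   out-of-line increment instead of extending the ADC chain. */
static void sum3_carry_count(uint64_t *dst, const uint64_t *a,
                             const uint64_t *b, const uint64_t *c)
{
    uint64_t carry_in = 0;              /* previous limb's carry count */
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s = a[i];
        unsigned cnt = __builtin_add_overflow(s, b[i], &s);  /* adc edx,edx */
        cnt += __builtin_add_overflow(s, c[i], &s);          /* adc edx,0 */
        cnt += __builtin_add_overflow(s, carry_in, &s);      /* rare branch */
        dst[i] = s;
        carry_in = cnt;
    }
}
```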
   > Less wide cores will likely benefit from reduction of the number of   
   > executed instructions (and more importantly the number of decoded and   
   > renamed instructions) through unrolling by 2, 3 or 4.   
   >   
   >   
   Interesting code; I'm not totally sure that I understand how the
      
   'ADC EDX,EDX'   
      
   really works, i.e. shifting the previous contents up while saving the current
   carry.   
      
   Anyway, the three main ADD RAX,... operations still define the minimum   
   possible latency, right?   
      
   Terje   
      
   --   
   -    
   "almost all programming can be viewed as an exercise in caching"   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
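On the 'ADC EDX,EDX' question above: ADC computes DEST = DEST + SRC + CF, so with both operands the same register it is a left shift by one with the carry flag entering as the new low bit; with EDX pre-zeroed it simply captures CF. A one-line C model:

```c
/* ADC dst,dst: shift dst left one bit and bring the carry flag (cf, 0 or 1)
   in as the new low bit.  With dst == 0 it just captures cf. */
static unsigned adc_self(unsigned dst, unsigned cf)
{
    return dst + dst + cf;
}
```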



(c) 1994,  bbs@darkrealms.ca