

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 129,259 of 131,241   
   Michael S to Terje Mathisen   
   Re: VAX   
   05 Aug 25 19:49:33   
   
   From: already5chosen@yahoo.com   
      
   On Tue, 5 Aug 2025 17:31:34 +0200   
   Terje Mathisen  wrote:   
      
   > Michael S wrote:   
   > > On Tue, 5 Aug 2025 00:14:43 +0300   
   > > Michael S  wrote:   
   > >   
   > >> On Mon, 4 Aug 2025 22:49:23 +0200   
   > >> Terje Mathisen  wrote:   
   > >>   
   > >>> Anton Ertl wrote:   
   > >>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:   
   > >>>>> Michael S  writes:   
   > >>>>>> Actually, in our world the latest C standard (C23) has them,   
   > >>>>>> but the spelling is different: _BitInt(32) and unsigned   
   > >>>>>> _BitInt(32). I'm not sure if any major compiler already has   
   > >>>>>> them implemented. Bing copilot says that clang does, but I   
   > >>>>>> don't tend to believe everything Bing copilot says.
   > >>>>>   
   > >>>>> I asked godbolt, and tried the following program:   
   > >>>>>   
   > >>>>> typedef ump unsigned _BitInt(65535);   
   > >>>>   
   > >>>> The actual compiling version is:   
   > >>>>   
   > >>>> typedef unsigned _BitInt(65535) ump;   
   > >>>>   
   > >>>> ump sum3(ump a, ump b, ump c)   
   > >>>> {   
   > >>>>      return a+b+c;   
   > >>>> }   
   > >>>   
   > >>> I would naively expect the ump type to be defined as an array of   
   > >>> unsigned (byte/short/int/long), possibly with a header defining   
   > >>> how large the allocation is and how many bits are currently   
   > >>> defined.   
   > >>>   
   > >>> The actual code to add three of them could be something like   
   > >>>   
   > >>>     xor rax,rax
   > >>>     xor rcx,rcx
   > >>> next:
   > >>>     xor edx,edx
   > >>>     add rax,[rsi+rcx*8]   
   > >>>     adc rdx,0   
   > >>>     add rax,[r8+rcx*8]   
   > >>>     adc rdx,0   
   > >>>     add rax,[r9+rcx*8]   
   > >>>     adc rdx,0   
   > >>>     mov [rdi+rcx*8],rax   
   > >>>     mov rax,rdx   
   > >>>     inc rcx   
   > >>>     cmp rcx,r10   
   > >>>      jb next   
   > >>>   
   > >>> The main problem here is of course that every add operation   
   > >>> depends on the previous, so max speed would be 4-5 clock   
   > >>> cycles/iteration.   
   > >>>   
   > >>> Terje   
   > >>>   
   > >>   
   > >> I would guess that even a pair of x86-style loops would likely be
   > >> faster than that on most x86-64 processors made in the last 15
   > >> years, despite doing 1.5x more memory accesses.
   > >> ; rcx = dst   
   > >> ; rdx = a - dst   
   > >> ; r8 = b - dst   
   > >>   mov $1024, %esi   
   > >>   clc   
   > >> .loop1:   
   > >>   mov (%rcx,%r8), %rax   
   > >>   adc (%rcx,%rdx), %rax   
   > >>   mov %rax, (%rcx)   
   > >>   lea 8(%rcx), %rcx   
   > >>   dec %esi   
   > >>   jnz .loop1   
   > >>   
   > >>   sub $8192, %rcx ; step back over 1024 limbs * 8 bytes
   > >>   mov ..., %rdx ; %rdx = c-dst   
   > >>   mov $1024, %esi   
   > >>   clc   
   > >> .loop2:   
   > >>   mov (%rcx,%rdx), %rax   
   > >>   adc %rax, (%rcx)   
   > >>   lea 8(%rcx), %rcx   
   > >>   dec %esi   
   > >>   jnz .loop2   
   > >>   ...   
   > >>   
   > >   
   > >   
   > > For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and   
   > > Intel Lion Cove, I'd do the following modification to your inner   
   > > loop (back in Intel syntax):   
   > >   
   > >     xor ebx,ebx   
   > > next:   
   > >     xor edx, edx   
   > >     mov rax,[rsi+rcx*8]   
   > >     add rax,[r8+rcx*8]   
   > >     adc edx,edx   
   > >     add rax,[r9+rcx*8]   
   > >     adc edx,0   
   > >     add rbx,rax   
   > >     jc  incremen_edx   
   > >     ; eliminate the data dependency between loop iterations,
   > >     ; replacing it with a very predictable control dependency
   > > edx_ready:   
   > >     mov [rdi+rcx*8],rbx
   > >     mov ebx, edx
   > >     inc rcx   
   > >     cmp rcx,r10   
   > >     jb next   
   > >     ...   
   > >     ret   
   > >   
   > >   
   > > ; this code is placed after the return;
   > > ; it is executed extremely rarely - for random inputs, approximately never
   > > incremen_edx:
   > >    inc edx   
   > >    jmp edx_ready   
   > >   
   > >   
   > > Less wide cores will likely benefit from a reduction in the
   > > number of executed instructions (and, more importantly, the number
   > > of decoded and renamed instructions) through unrolling by 2, 3
   > > or 4.
   > >   
   > >   
   > Interesting code; I'm not totally sure that I understand how the
   >
   > 'ADC EDX,EDX'
   >
   > really works, i.e. shifting the previous contents up while saving
   > the current carry.
   >   
      
   In this case 'adc edx,edx' is just a slightly shorter encoding
   of 'adc edx,0'. The EDX register is zeroized a few lines above.
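   The same carry capture can be written in portable C. This is only a
   sketch of the idea (the helper name is mine, and
   __builtin_add_overflow is the GCC/Clang builtin, not the instruction
   itself):

```c
#include <stdint.h>

/* With EDX zeroed beforehand, 'adc edx,edx' computes
 * edx = edx + edx + CF = CF, exactly like 'adc edx,0':
 * both just capture the carry flag into EDX. */
static unsigned add_capture_carry(uint64_t a, uint64_t b, uint64_t *sum)
{
    /* returns 1 iff a + b wraps around 2^64, i.e. the carry flag */
    return (unsigned)__builtin_add_overflow(a, b, sum);
}
```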
      
   > Anyway, the three main ADD RAX,... operations still define the   
   > minimum possible latency, right?   
   >   
      
   I don't think so.
   It seems to me that there is only one chain of data dependencies
   between iterations of the loop - a trivial dependency through RCX.
   Some modern processors are already capable of eliminating this sort
   of dependency in the renamer - probably not yet when it is coded as
   'inc', but when it is coded as 'add' or 'lea'.
      
   The dependency through RDX/RBX does not form a chain. The next value
   of [rdi+rcx*8] does depend on the value of rbx from the previous
   iteration, but the next value of rbx depends only on [rsi+rcx*8],
   [r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous value
   of rbx, except for a control dependency that hopefully would be
   speculated around.
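   In portable C the decoupled loop looks roughly like this. It is only
   a sketch of the idea (the limb count and names are mine, and
   __builtin_add_overflow is the GCC/Clang builtin, not what I actually
   run); the point is that 'cout' plays the role of EDX and 'cin' the
   role of RBX:

```c
#include <stdint.h>
#include <stddef.h>

/* dst[] = a[] + b[] + c[] over n 64-bit limbs.
 * Within one iteration, 'sum' and the carry count 'cout' depend only
 * on the three loads; the carry-in from the previous limb is folded in
 * afterwards, so there is no adc-to-adc chain between iterations. */
static void sum3_limbs(uint64_t *dst, const uint64_t *a,
                       const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t cin = 0;                  /* carry-in, 0..3 (role of RBX) */
    for (size_t i = 0; i < n; i++) {
        uint64_t sum;
        uint64_t cout = 0;             /* carry count (role of EDX) */
        cout += __builtin_add_overflow(a[i], b[i], &sum);
        cout += __builtin_add_overflow(sum, c[i], &sum);
        /* the rare extra carry - the 'jc incremen_edx' path */
        cout += __builtin_add_overflow(sum, cin, &sum);
        dst[i] = sum;
        cin = cout;
    }
}
```

   With random inputs the last add almost never carries, which is what
   lets the corresponding branch in the asm version be predicted
   not-taken essentially always.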
      
   I haven't measured it yet, and haven't finished coding it either.
   But even when the code is finished, the widest processors I have
   right now are an Intel Raptor Cove (the P-core of an i7-14700) and
   an AMD Zen3. I am afraid that neither is wide enough to show the
   full effect of decoupling the iterations.
      
      
   > Terje   
   >   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca