comp.arch
Message 129,585 of 131,241
MitchAlsup to All
Re: What is more important
04 Sep 25 21:00:36
   From: user5857@newsgrouper.org.invalid   
      
   Stephen Fuld  posted:   
      
   > On 9/4/2025 8:23 AM, MitchAlsup wrote:   
   > >   
   > > MitchAlsup  posted:   
   > >   
   > >>   
   > >> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   > >>   
   > >>> BGB  writes:   
   > >>>> But, it seems to have a few obvious weak points for RISC-V:   
   > >>>>    Crappy with arrays;   
   > >>>>    Crappy with code with lots of large immediate values;   
   > >>>>    Crappy with code which mostly works using lots of global variables;   
   > >>>>      Say, for example, a lot of Apogee / 3D Realms code;   
   > >>>>      They sure do like using lots of global variables.   
   > >>>>      id Software also likes globals, but not as much.   
   > >>>>    ...   
   > >>>   
   > >>> Let's see:   
   > >>>   
   > >>> #include <stddef.h>   
   > >>>   
   > >>> long arrays(long *v, size_t n)   
   > >>> {   
   > >>>    long i, r;   
   > >>>    for (i=0, r=0; i<n; i++)   
   > >>>      r+=v[i];   
   > >>>    return r;   
   > >>> }   
   > >>   
   > >> arrays:   
   > >>       MOV  R3,#0   
   > >>       MOV  R4,#0   
   > >>       VEC  R5,{}   
   > >>       LDD  R6,[R1,R3<<3]   
   > >>       ADD  R4,R4,R6   
   > >>       LOOP LT,R3,#1,R2   
   > >>       MOV  R1,R4   
   > >>       RET   
   > >>   
   > >>>   
   > >>> long a, b, c, d;   
   > >>>   
   > >>> void globals(void)   
   > >>> {   
   > >>>    a = 0x1234567890abcdefL;   
   > >>>    b = 0xcdef1234567890abL;   
   > >>>    c = 0x567890abcdef1234L;   
   > >>>    d = 0x5678901234abcdefL;   
   > >>> }   
   > >>   
   > >> globals:   
   > >>      STD #0x1234567890abcdef,[ip,a-.]   
   > >>      STD #0xcdef1234567890ab,[ip,b-.]   
   > >>      STD #0x567890abcdef1234,[ip,c-.]   
   > >>      STD #0x5678901234abcdef,[ip,d-.]   
   > >>      RET   
   > >>>   
   > >> -----------------   
   > >>>   
   > >>> So, the overall sizes (including data size for globals() on RV64GC) are:   
   > >>>      Bytes                         Instructions   
   > >>> arrays globals    Architecture  arrays    globals   
   > >>> 28     66 (34+32) RV64GC            12          9   
   > >>> 27     69         AMD64             11          9   
   > >>> 44     84         ARM A64           11         22   
   > >>  32     68         My 66000           8          5   
   > >   
   > > In light of the above, what do people think is more important, small   
   > > code size or fewer instructions ??   
   >   
   > In general yes, but as you pointed out in another post, if you are   
   > talking about a GBOoO machine, it isn't the absolute number of   
   > instructions (because of parallel execution), but the number of cycles   
   > to execute a particular routine.  Of course, this is harder to tell at a   
   > glance from a code listing.   
      
   I can't seem to find the code for the snipped examples anywhere.   
      
   For arrays:   
   The inner loops are 4 instructions (3 for My 66000), and the loop carries   
   2 data dependences through the integer ADDs, so all 4 instructions can be   
   pitched at 1 cycle. Assume the loop executes 10×: 10 loop-latencies is   
   10 cycles, plus LD-latency plus ADD-latency {using LD-latency = 4}:   
      
      setup   
    | MOV #0 |   
    | MOV #0 |   
   loop[0]   | LD AGEN|rot | Cache | LD align |rot | D ADD  |   
             | LP ADD | BLT !  |   
   loop[1]            | LD AGEN|rot | Cache | LD align |rot | D ADD  |   
                      | LP ADD | BLT !  |   
   loop[2]                     | LD AGEN|rot | Cache | LD align |rot | D ADD  |   
                               | LP ADD | BLT ×    |    repair       |   
                                                                       exit   
                                                                     | MOV    |   
                                                                     | RET    |   
             |         looping          |       recovery             |   
      
   // where rot is time it takes to route AGEN to the SRAM arrays and back,   
   // and showing the exit of the loop by mispredicting the last branch back   
   // to the top of the loop, 2-cycle repairing state, and returning from   
   // subroutine.   
      
   Any µarchitecture that can start 1 LD per cycle, start 2 integer ADDs   
   per cycle, and 1 branch per cycle, has enough resources to perform   
   arrays as drawn above.   
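
   The arithmetic behind the drawing can be written down as a tiny C sketch.
   This is only a back-of-envelope model, not a simulator: the function name
   arrays_cycles is made up for illustration, the 1-cycle ADD latency used
   in the note below is an assumption, and the setup, the 2-cycle repair and
   the exit MOV/RET are deliberately left out.

```c
/* Back-of-envelope model of the schedule drawn above (hypothetical helper,
   not taken from any real tool): one loop iteration starts per cycle, and
   the last LD plus its dependent ADD still have to drain after the final
   iteration has started. Setup and branch repair are NOT counted. */
unsigned long arrays_cycles(unsigned long iterations,
                            unsigned long ld_latency,
                            unsigned long add_latency)
{
    return iterations + ld_latency + add_latency;
}
```

   With 10 iterations, LD-latency = 4 and an assumed 1-cycle ADD, the model
   gives 10 + 4 + 1 = 15 cycles for the looping part.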
      
   For globals:   
      
   RV64GC does 4 LDs and 4 STs, each ST being data dependent on 1 LD.   
   It is conceivable that a 10-wide machine might do 4 LDs in a cycle,   
   and figure out that the 4 values are in the same cache line, so the   
   latency of calculation is LD-latency + ST AGEN. Let's say LD-latency   
   is 4-cycles, so the calculation latency is 5-cycles. RET can probably   
   be performed simultaneously with the first LD AGEN.   
      
   My 66000 does 4 parallel ST # all of which can start on the same cycle,   
   as can RET, for a latency of 1-cycle.   
      
   On the other hand, a My 66000 implementation may only be 6-wide, and   
   the 4 STs may take 2 execution cycles, but the RET still takes place in   
   cycle 1.   
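
   The same comparison as a rough C sketch. The helper names are hypothetical,
   and the store_ports parameter is my own simplification (the number of STs
   that can start per cycle); it is not a documented property of any
   implementation. The 1 cycle added for ST AGEN follows the estimate above.

```c
/* Ceiling division for a positive divisor. */
static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* RV64GC estimate: the 4 LDs go in parallel, then each ST waits on its LD,
   so the latency is LD-latency plus one cycle of ST AGEN. */
unsigned rv64gc_globals_cycles(unsigned ld_latency)
{
    return ld_latency + 1;
}

/* My 66000 estimate: the immediates come from the instruction stream, so
   only the number of STs that can start per cycle matters
   (store_ports is an assumed parameter, see the lead-in). */
unsigned my66000_globals_cycles(unsigned stores, unsigned store_ports)
{
    return ceil_div(stores, store_ports);
}
```

   With LD-latency = 4 the RV64GC estimate is 5 cycles; 4 STs take 1 cycle if
   all 4 can start together and 2 cycles if only 2 can start per cycle.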
      
   > > At some scale, smaller code size is beneficial, but once the implementation   
   > > has a GBOoO µarchitecture, I would think that fewer instructions is better   
   > > than smaller code--so long as the code size is less than 150% of the smaller   
   > > AND so long as the ISA does not resort to sequential decode (i.e., VAX).   
   > >   
   > > What say ye !   
   >   
   > And, of course your "150%" is arbitrary,   
      
   yes, of course, completely arbitrary--but this is the typical RISC-CISC   
   instruction count ratio. Now, on the other hand, My 66000 runs closer to   
   115% size and 70% RISC-V count {although the examples above are 66% and   
   55%}.   
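
   For anyone checking the percentages against the table, here is a minimal C
   sketch. The struct and function names are made up for illustration; the
   numbers are the RV64GC and My 66000 rows of the table above, and the
   division truncates rather than rounds.

```c
/* One row of the size/instruction-count table quoted above. */
struct isa_row {
    const char *isa;
    unsigned arrays_bytes, globals_bytes;
    unsigned arrays_insts, globals_insts;
};

static const struct isa_row rv64gc  = { "RV64GC",   28, 66, 12, 9 };
static const struct isa_row my66000 = { "My 66000", 32, 68,  8, 5 };

/* part as a percentage of base, truncated toward zero. */
unsigned percent_of(unsigned part, unsigned base)
{
    return part * 100u / base;
}
```

   The instruction counts give 8/12 = 66% for arrays and 5/9 = 55% for
   globals; the byte counts give 32/28 = 114% and 68/66 = 103%.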
      
   >                                          but I agree that small   
   > differences in code size are not important, except in some small   
   > embedded applications.   
   >   
   > And I guess I would add, as a third, much lower priority, power usage.   
      
   I would suggest power has become a second order desire (up from third)   
   {maybe even a primary desire at some scales}.   
      
   But note: Nothing delivers a fixed bit-pattern as an operand at lower   
   power than plucking the bits from the instruction stream; saving a   
   good deal of the power consumed by forwarding (the multiple comparators   
   and the find youngest logic plus the buffers to drive the result-to-   
   operand multiplexers).   
      
   And certainly: plucking the bit-pattern from the instruction stream is   
   vastly lower power than LDing the bit-pattern from memory ! close to   
   4× lower.   
      