From: user5857@newsgrouper.org.invalid   
      
   Stephen Fuld posted:   
      
   > On 9/4/2025 8:23 AM, MitchAlsup wrote:   
   > >   
   > > MitchAlsup posted:   
   > >   
   > >>   
   > >> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   > >>   
   > >>> BGB writes:   
   > >>>> But, it seems to have a few obvious weak points for RISC-V:   
   > >>>> Crappy with arrays;   
   > >>>> Crappy with code with lots of large immediate values;   
   > >>>> Crappy with code which mostly works using lots of global variables;   
   > >>>> Say, for example, a lot of Apogee / 3D Realms code;   
   > >>>> They sure do like using lots of global variables.   
   > >>>> id Software also likes globals, but not as much.   
   > >>>> ...   
   > >>>   
   > >>> Let's see:   
   > >>>   
   > >>> #include <stddef.h>
   > >>>   
   > >>> long arrays(long *v, size_t n)   
   > >>> {   
   > >>> long i, r;   
   > >>> for (i=0, r=0; i<n; i++)
   > >>> r+=v[i];
   > >>> return r;   
   > >>> }   
   > >>   
   > >> arrays:   
   > >> MOV R3,#0   
   > >> MOV R4,#0   
   > >> VEC R5,{}   
   > >> LDD R6,[R1,R3<<3]   
   > >> ADD R4,R4,R6   
   > >> LOOP LT,R3,#1,R2   
   > >> MOV R1,R4   
   > >> RET   
   > >>   
   > >>>   
   > >>> long a, b, c, d;   
   > >>>   
   > >>> void globals(void)   
   > >>> {   
   > >>> a = 0x1234567890abcdefL;   
   > >>> b = 0xcdef1234567890abL;   
   > >>> c = 0x567890abcdef1234L;   
   > >>> d = 0x5678901234abcdefL;   
   > >>> }   
   > >>   
   > >> globals:   
   > >> STD #0x1234567890abcdef,[ip,a-.]   
   > >> STD #0xcdef1234567890ab,[ip,b-.]   
   > >> STD #0x567890abcdef1234,[ip,c-.]   
   > >> STD #0x5678901234abcdef,[ip,d-.]   
   > >> RET   
   > >>>   
   > >> -----------------   
   > >>>   
   > >>> So, the overall sizes (including data size for globals() on RV64GC) are:   
   > >>>      Bytes                          Instructions
   > >>> arrays  globals     Architecture   arrays  globals
   > >>>   28    66 (34+32)  RV64GC           12       9
   > >>>   27    69          AMD64            11       9
   > >>>   44    84          ARM A64          11      22
   > >>   32    68          My 66000          8       5
   > >   
   > > In light of the above, what do people think is more important, small   
   > > code size or fewer instructions ??   
   >   
   > In general yes, but as you pointed out in another post, if you are   
   > talking about a GBOoO machine, it isn't the absolute number of   
   > instructions (because of parallel execution), but the number of cycles   
   > to execute a particular routine. Of course, this is harder to tell at a   
   > glance from a code listing.   
      
   I can't seem to find the code from the snipped examples anywhere.
      
   For arrays:   
   The inner loops are 4 instructions (3 for My 66000), and each loop carries
   2 data dependences, both through 1-cycle integer ADDs, so all 4 instructions
   can be pitched at 1 cycle per iteration. Let us assume the loop is executed
   10×, so the 10 iterations take 10 cycles plus LD-latency plus ADD-latency::
   {using LD-latency = 4}
      
   setup
           | MOV #0 |
           | MOV #0 |
   loop[0] | LD AGEN|rot | Cache | LD align |rot | D ADD |
           | LP ADD | BLT ! |
   loop[1] | LD AGEN|rot | Cache | LD align |rot | D ADD |
           | LP ADD | BLT ! |
   loop[2] | LD AGEN|rot | Cache | LD align |rot | D ADD |
           | LP ADD | BLT × | repair |
   exit
           | MOV |
           | RET |
           | looping | recovery |
      
   // where rot is time it takes to route AGEN to the SRAM arrays and back,   
   // and showing the exit of the loop by mispredicting the last branch back   
   // to the top of the loop, 2-cycle repairing state, and returning from   
   // subroutine.   
      
   Any µarchitecture that can start 1 LD per cycle, start 2 integer ADDs   
   per cycle, and 1 branch per cycle, has enough resources to perform   
   arrays as drawn above.   
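
   The cycle estimate for arrays can be sketched as a small helper. This is
   only a first-order model, not anything from the thread: the function name
   and parameters are hypothetical, ADD latency is assumed to be 1 cycle, and
   the setup MOVs, the branch repair, and the final MOV/RET are left out.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the thread): first-order cycle estimate for
 * the arrays() loop on a machine wide enough to start one iteration per pitch.
 *
 *   trips        number of loop iterations
 *   pitch        cycles between the starts of successive iterations
 *   ld_latency   load-to-use latency in cycles
 *   add_latency  integer ADD latency in cycles (assumed to be 1 in the text)
 *
 * Setup MOVs, the final branch repair, and MOV/RET are deliberately ignored.
 */
static unsigned long arrays_cycles(unsigned long trips, unsigned pitch,
                                   unsigned ld_latency, unsigned add_latency)
{
    assert(pitch > 0);  /* an iteration cannot start in zero cycles */
    return trips * pitch + ld_latency + add_latency;
}
```

   With the numbers used above, arrays_cycles(10, 1, 4, 1) evaluates to 15:
   10 cycles of looping, plus 4 for the LD, plus 1 for the ADD.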
      
   For globals:   
      
   RV64GC does 4 LDs and 4 STs, each ST being data dependent on 1 LD.   
   It is conceivable that a 10-wide machine might do 4 LDs in a cycle,   
   and figure out that the 4 values are in the same cache line, so the   
   latency of calculation is LD-latency + ST AGEN. Let's say LD-latency   
   is 4-cycles, so the calculation latency is 5-cycles. RET can probably
   be performed simultaneously with the first LD AGEN.
      
   My 66000 does 4 parallel ST # all of which can start on the same cycle,   
   as can RET, for a latency of 1-cycle.   
      
   On the other hand:: a My 66000 implementation may be only 6-wide, in
   which case the 4 STs take 2 execution cycles, but the RET still takes
   place in cycle 1.
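
   The globals comparison reduces to two small formulas, sketched here with
   hypothetical helper names. Two assumptions are mine rather than stated
   above: ST AGEN is taken as 1 cycle (implied by 4 + 1 = 5), and the narrower
   machine is modeled as starting 2 stores per cycle.

```c
#include <assert.h>

/* Hypothetical helpers (not from the thread) for the globals() comparison. */

/*
 * Load-then-store form: the LDs issue together and each ST waits on its LD,
 * so the latency is the load latency plus one ST AGEN of st_agen cycles.
 */
static unsigned ld_st_latency(unsigned ld_latency, unsigned st_agen)
{
    return ld_latency + st_agen;
}

/*
 * Store-immediate form: no ST waits on a LD, so the cost is the number of
 * cycles needed to start all the stores: ceil(stores / per_cycle).
 */
static unsigned st_imm_cycles(unsigned stores, unsigned per_cycle)
{
    assert(per_cycle > 0);  /* at least one store must start per cycle */
    return (stores + per_cycle - 1) / per_cycle;
}
```

   Here ld_st_latency(4, 1) gives 5 for the LD-then-ST sequence, while
   st_imm_cycles(4, 4) gives 1 and st_imm_cycles(4, 2) gives 2 for the
   store-immediate sequence.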
      
   > > At some scale, smaller code size is beneficial, but once the implementation   
   > > has a GBOoO µarchitecture, I would think that fewer instructions is better   
   > > than smaller code--so long as the code size is less than 150% of the
   > > smaller AND so long as the ISA does not resort to sequential decode
   > > (i.e., VAX).
   > >   
   > > What say ye !   
   >   
   > And, of course your "150%" is arbitrary,   
      
   Yes, of course, completely arbitrary--but this is the typical RISC-CISC
   instruction count ratio. Now, on the other hand, My 66000 runs closer to
   115% of the RISC-V code size and 70% of the RISC-V instruction count
   {although the examples above come in at 66% and 55% of the count}.
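
   For anyone checking the percentages against the table quoted earlier, a
   minimal sketch follows. The helper and the constant names are hypothetical;
   the numbers are copied from the RV64GC and My 66000 rows, and the division
   truncates rather than rounds.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the thread): value as a whole-number
 * percentage of a baseline, truncated toward zero.
 */
static unsigned percent_of(unsigned value, unsigned baseline)
{
    assert(baseline > 0);  /* avoid dividing by zero */
    return value * 100u / baseline;
}

/* Figures copied from the RV64GC and My 66000 rows of the quoted table. */
enum {
    RV_ARRAYS_BYTES = 28, RV_GLOBALS_BYTES = 66,
    MY_ARRAYS_BYTES = 32, MY_GLOBALS_BYTES = 68,
    RV_ARRAYS_INSNS = 12, RV_GLOBALS_INSNS = 9,
    MY_ARRAYS_INSNS = 8,  MY_GLOBALS_INSNS = 5
};
```

   percent_of(MY_ARRAYS_INSNS, RV_ARRAYS_INSNS) is 66 and
   percent_of(MY_GLOBALS_INSNS, RV_GLOBALS_INSNS) is 55, matching the counts
   quoted above; the corresponding code sizes come out at 114 and 103, both
   well under the 150% bound.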
      
   > but I agree that small   
   > differences in code size are not important, except in some small   
   > embedded applications.   
   >   
   > And I guess I would add, as a third, much lower priority, power usage.   
      
   I would suggest power has become a second order desire (up from third)   
   {maybe even a primary desire at some scales}.   
      
   But note: Nothing delivers a fixed bit-pattern as an operand at lower   
   power than plucking the bits from the instruction stream; saving a   
   good deal of the power consumed by forwarding (the multiple comparators   
   and the find youngest logic plus the buffers to drive the result-to-   
   operand multiplexers).   
      
   And certainly: plucking the bit-pattern from the instruction stream is   
   vastly lower power than LDing the bit-pattern from memory ! close to   
   4× lower.   
      