From: cr88192@gmail.com   
      
   On 9/4/2025 12:25 PM, Stephen Fuld wrote:   
   > On 9/4/2025 8:23 AM, MitchAlsup wrote:   
   >>   
   >> MitchAlsup posted:   
   >>   
   >>>   
   >>> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>>   
   >>>> BGB writes:   
   >>>>> But, it seems to have a few obvious weak points for RISC-V:   
   >>>>> Crappy with arrays;   
   >>>>> Crappy with code with lots of large immediate values;   
   >>>>> Crappy with code which mostly works using lots of global variables;   
   >>>>> Say, for example, a lot of Apogee / 3D Realms code;   
   >>>>> They sure do like using lots of global variables.   
   >>>>> id Software also likes globals, but not as much.   
   >>>>> ...   
   >>>>   
   >>>> Let's see:   
   >>>>   
   >>>> #include <stddef.h>
   >>>>   
   >>>> long arrays(long *v, size_t n)   
   >>>> {   
   >>>> long i, r;   
   >>>> for (i=0, r=0; i<n; i++)
   >>>> r+=v[i];
   >>>> return r;   
   >>>> }   
   >>>   
   >>> arrays:   
   >>> MOV R3,#0   
   >>> MOV R4,#0   
   >>> VEC R5,{}   
   >>> LDD R6,[R1,R3<<3]   
   >>> ADD R4,R4,R6   
   >>> LOOP LT,R3,#1,R2   
   >>> MOV R1,R4   
   >>> RET   
   >>>   
   >>>>   
   >>>> long a, b, c, d;   
   >>>>   
   >>>> void globals(void)   
   >>>> {   
   >>>> a = 0x1234567890abcdefL;   
   >>>> b = 0xcdef1234567890abL;   
   >>>> c = 0x567890abcdef1234L;   
   >>>> d = 0x5678901234abcdefL;   
   >>>> }   
   >>>   
   >>> globals:   
   >>> STD #0x1234567890abcdef,[ip,a-.]   
   >>> STD #0xcdef1234567890ab,[ip,b-.]   
   >>> STD #0x567890abcdef1234,[ip,c-.]   
   >>> STD #0x5678901234abcdef,[ip,d-.]   
   >>> RET   
   >>>>   
   >>> -----------------   
   >>>>   
   >>>> So, the overall sizes (including data size for globals() on RV64GC)
   >>>> are:
   >>>>          Bytes                      Instructions
   >>>> arrays  globals      Architecture  arrays  globals
   >>>>   28    66 (34+32)   RV64GC          12      9
   >>>>   27    69           AMD64           11      9
   >>>>   44    84           ARM A64         11     22
   >>>   32    68           My 66000         8      5
   >>   
   >> In light of the above, what do people think is more important, small   
   >> code size or fewer instructions ??   
   >>   
   >> At some scale, smaller code size is beneficial, but once the   
   >> implementation   
   >> has a GBOoO µarchitecture, I would think that fewer instructions is   
   >> better   
   >> than smaller code--so long as the code size is less than 150% of the   
   >> smaller   
   >> AND so long as the ISA does not resort to sequential decode (i.e., VAX).   
   >>   
   >> What say ye !   
   >   
   > In general yes, but as you pointed out in another post, if you are   
   > talking about a GBOoO machine, it isn't the absolute number of   
   > instructions (because of parallel execution), but the number of cycles   
   > to execute a particular routine. Of course, this is harder to tell at a   
   > glance from a code listing.   
   >   
   > And, of course your "150%" is arbitrary, but I agree that small   
   > differences in code size are not important, except in some small   
   > embedded applications.   
   >   
      
   Yeah.   
      
   The main use case where code size is a big priority is when trying to
   fit code into a small fixed-size ROM. If loading into RAM, and the RAM
   is non-tiny, then generally the exact binary size is much less
   important; as long as the code isn't needlessly huge/bloated, it
   doesn't matter too much.
      
   For traditional software, data/bss, stack, and heap memory will often
   be the dominant factors in overall RAM usage.
      
   For a lot of command-line tools, there will often be a lot of code for
   relatively little RAM use; but then the priority is less about minimal
   code size (though small code size will often matter more than
   performance for such tools) and more about the overhead of creating
   and destroying process instances.
      
   ...   
      
      
   > And I guess I would add, as a third, much lower priority, power usage.   
   >   
      
   It depends:
   For small embedded devices, power usage often dominates;
   Usually, this is most affected by executing as few instructions as
   possible while also using the least complicated hardware logic to
   perform those instructions.
      
      
      
   For a lot of DSP tasks, power use is a priority while also doing lots
   of math operations, in which case one often wants FPUs and similar
   with the minimal sufficient precision (so, for example, rocking it
   with lots of Binary16 math, and FPUs which natively operate on
   Binary16); or a lot of 8- and 16-bit integer math.
      
   While FP8 is interesting, sadly direct FP8 math is often too low
   precision for many tasks.
      
      
   I guess the issue then becomes one of the cheapest-possible Binary16   
   capable FPU (both in terms of logic resources and energy use).   
      
   Ironically, one option here is to use log-scaled values (scaled to
   mimic Binary16) and just sort of pass them off as Binary16. If one
   switches entirely to log-scaled math, then it can at least be
   self-consistent. However, if mixed/matched with "real" Binary16,
   typically only the top few digits will match up.
   
   Where, as noted, log-scaled math works well at low precision, but
   scales poorly (and even Binary16 is pushing it).
      
   Though, it is unclear how this plays out in ASIC space.
      
      
      
   For integer math, it might make sense to use a lot of zero-extended
   16-bit math, since using sign-extended math would likely waste more
   energy flipping all the high-order bits for sign extension.
      
   Well, or other possibly even more wacky options, like zigzag-folded
   gray-coded byte values.
   
   It would be kinda wacky/nonstandard, but if ALU operations fold the
   sign into the LSB and use gray-coding for the value, then arithmetic
   could be performed while minimizing the number of bit flips, and thus
   potentially using less total energy for registers and memory operations.
      
   Though, potentially, the CPU could be made to look as if it were
   operating on normal twos complement math; since, if the arithmetic
   results are the same, it might be essentially invisible to the
   software that numbers are being stored in a nonstandard way.
      
   Or, say, mapping from twos complement to folded bytes (with every byte   
   being folded):   
    00->00, 01->02, 02->06, 03->04, ...   
    FF->01, FE->03, FD->07, FC->05, ...   
   So, say, a value flipping sign would typically only need to flip a
   small fraction of its bits (and the encode/decode process would mostly
   consist of bitwise XORs).
      
      
   Though, it might still make sense to keep things "normal" in the ALU
   and CPU registers, but then apply such a transform at the level of the
   memory caches (and in external RAM). A lot may depend on the energy
   cost of performing this transformation though (and, it does implicitly
   assume
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   