comp.arch
Message 129,571 of 131,241
BGB to EricP
Re: Concertina III May Be Returning
04 Sep 25 00:20:03
   From: cr88192@gmail.com   
      
   On 9/3/2025 9:42 PM, EricP wrote:   
   > MitchAlsup wrote:   
   >>   
   >> However, I also found that STs need an immediate and a displacement, so,   
   >> Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with   
   >> potential displacement (from D12ds above) and the immediate has the   
   >> size of the ST. This provides for::   
   >>          std    #4607182418800017408,[r3,r2<<3,96]   
   >   
   > Compare and Branch can also use two immediates as it   
   > has reg-reg or reg-imm compares plus displacement.   
   > And has high enough frequency to be worth considering.   
   >   
      
   Can be done, yes.   
      High enough frequency/etc, is where the possible debate lies.   
      
      
   Checking stats, it can affect roughly 1.9% of the instructions.
   Or, around 11% of branches; most of the rest being unconditional or   
   comparing against 0 (which can use the Zero Register). Only a relative   
   minority being compares against non-zero constants.   
      
      
   One could argue:   
   This is high enough to care, but is it cheap enough?...   
      
      
   I had experimented with this before, where if a Jumbo-Op64 prefix was   
   used, and if flagged to synthesize an immediate for a conditional branch   
   or store, it would do a Branch-With-Immediate or Store-With-Immediate.   
      
   This applies to all of the ISAs, though with separate configuration flags for the
   RV+Jx and XG1/2/3 ISAs (mostly separately enabling or disabling the   
   added routing). Both depend on mostly the same plumbing internally though.   
      
      
   Though, checking:   
   My CPU core was already paying for it, as it seems I had left it
   enabled. Whatever the case, the added cost wasn't high enough for me to
   bother disabling it again.
      
   Testing (in Vivado, turning it off again):   
      Cost delta: ~ 300 LUTs.   
   OK, so logic cost delta is borderline negligible it seems.   
      
      
   In the case of its support in RV+Jx, it basically has a 17-bit
   sign-extended immediate with a 12-bit displacement.
      
   Note that using a Jumbo_Imm prefix won't work, as this would instead
   extend the displacement to 33 bits. So, needs to be a Jumbo_Op64 prefix.   
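
As a rough illustration of the field widths mentioned above (a sketch only, not the actual decoder logic): the check below tests whether a compare constant and a branch displacement would fit a 17-bit sign-extended immediate and a 12-bit displacement. The function names are hypothetical, and it is an assumption here that the displacement is also signed and is given in whatever units the encoding uses.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value fits in a two's-complement field of the given width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def sign_extend(raw: int, bits: int) -> int:
    """Interpret the low `bits` bits of raw as a signed value."""
    raw &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (raw ^ sign) - sign


def bcc_imm_encodable(imm: int, disp: int) -> bool:
    """Hypothetical check: 17-bit signed immediate, 12-bit signed displacement."""
    return fits_signed(imm, 17) and fits_signed(disp, 12)
```

Under these assumptions, a compare against 1000 with a displacement of -100 would be encodable, while a compare against 100000 would not, since 17 signed bits only cover -65536 to 65535.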
      
      
      
      
   Enabling it again in the Doom build for RV64GC+Jx, has a slight negative   
   effect on code-density.   
      
   Though, this stands to reason:   
      JOp64+Bcc needs 8 bytes;   
      C.LI+Bcc needs 6 bytes.   
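
The byte counts above can be extended into a small size model. This is a sketch under stated assumptions: C.LI takes a 6-bit signed immediate and ADDI a 12-bit signed immediate (both per the RISC-V spec), larger constants are assumed to need a 4-byte LUI plus a 4-byte ADDI/ADDIW, and other options (LUI alone, C.LUI, reusing a constant already in a register) are ignored. The function names are hypothetical.

```python
def fits_signed(value, bits):
    """True if value fits in a two's-complement field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def bytes_plain_rv64gc(imm):
    """Bytes to load a constant into a temporary and then do a 4-byte Bcc."""
    if fits_signed(imm, 6):
        return 2 + 4          # C.LI + Bcc
    if fits_signed(imm, 12):
        return 4 + 4          # ADDI rd, x0, imm + Bcc
    if fits_signed(imm, 32):
        return 4 + 4 + 4      # LUI + ADDI/ADDIW + Bcc (assumed sequence)
    return None               # larger constants are not modeled here


def bytes_jop64_bcc(imm):
    """Bytes for JOp64 + Bcc, usable only if the constant fits 17 signed bits."""
    return 8 if fits_signed(imm, 17) else None


for imm in (10, 100, 5000):
    print(imm, bytes_plain_rv64gc(imm), bytes_jop64_bcc(imm))
```

In this model the JOp64 form loses 2 bytes for constants that fit C.LI, ties for 12-bit constants, and only wins for constants between 12 and 17 bits. That would be consistent with a slight code-density loss if most compare constants are small.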
      
   One can try to ram it into a 32-bit encoding, but then what it actually   
   achieves in terms of hit rate is low enough to make it negligible.   
      
      
   Effects on Doom:   
      No obvious change in framerates;   
      Average trace length gets 1.7% shorter.   
      
      
   Theoretically, also might make it 1.7% faster, but 1.7% is below the   
   threshold of what is easily seen in average Doom framerate.   
      
   Trying "-timedemo demo1" with Doom:   
      XG2                     : 1710 gametics / 1588 realtics   
      XG3                     : 1710 gametics / 1605 realtics   
      RV64GC+Jx (BccImm=false): 1710 gametics / 1882 realtics   
      RV64GC+Jx (BccImm=true ): 1710 gametics / 1897 realtics   
      RV64GC    (plain)       : 1710 gametics / 2387 realtics   
      
   So, plain RV64GC being ~ 50% slower than XG2 or XG3 in this test...   
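
For reference, the relative slowdowns can be worked out directly from the realtics figures above. This is just arithmetic on the posted numbers; the dictionary and function names are made up for the example.

```python
# Realtics from the "-timedemo demo1" runs above (1710 gametics each).
realtics = {
    "XG2": 1588,
    "XG3": 1605,
    "RV64GC+Jx (BccImm=false)": 1882,
    "RV64GC+Jx (BccImm=true)": 1897,
    "RV64GC (plain)": 2387,
}


def slowdown_pct(name, baseline="XG2"):
    """Percent extra run time of `name` relative to `baseline`."""
    return (realtics[name] / realtics[baseline] - 1.0) * 100.0


for name in realtics:
    print(f"{name:26s} {slowdown_pct(name):5.1f}% slower than XG2")
```

This gives plain RV64GC as about 50.3% slower than XG2 and about 48.7% slower than XG3, with RV64GC+Jx landing at roughly 18.5% (BccImm=false) and 19.5% (BccImm=true) slower than XG2.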
      
   Granted, I had been working on and off on making RV64G support "less
   terrible" (it used to be slower). Note it isn't just about BGBCC being
   slow with plain RV64; GCC output is also kinda slow.
      
      
   And, Bcc+Imm seemingly makes it very slightly slower somehow (despite   
   making the average trace length slightly shorter...).   
      
   Then remembers another downside:   
      Bcc+Imm doesn't work with the branch predictor;   
      So, the branches cost more cycles.   
      
   So, rough estimate:   
      ~ 1.7 to 1.9% faster assuming it were supported by branch predictor;   
      ~ or, 0.8% slower without branch predictor support.   
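
The 0.8% figure follows from the timedemo numbers, and the same numbers give a rough idea of how much the unpredicted branches cost. The sketch below is a first-order estimate that treats the effects as additive; the variable and function names are hypothetical.

```python
# Realtics for RV64GC+Jx from the timedemo runs above.
realtics_off = 1882   # BccImm=false
realtics_on = 1897    # BccImm=true

# Measured change from enabling Bcc+Imm (positive means slower).
measured_change = realtics_on / realtics_off - 1.0


def implied_penalty(expected_gain):
    """Run-time fraction lost to unpredicted Bcc+Imm branches, assuming the
    expected gain and the penalty simply add (a first-order approximation)."""
    return measured_change + expected_gain


print(f"measured change: {measured_change * 100:+.1f}%")
for gain in (0.017, 0.019):
    print(f"expected gain {gain * 100:.1f}% -> "
          f"implied penalty {implied_penalty(gain) * 100:.1f}%")
```

Under that additive assumption, the missing branch-predictor support would be costing somewhere around 2.5% to 2.7% of run time in this test.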
      
      
   So, it may not be an issue of whether or not to support it, but rather   
   whether or not to add support for dealing with it to the branch predictor.   
      
      
   > But it also doesn't need two immediates.   
   > A 16-bit integer or float and a 16-bit offset packed into a   
   > single 32-bit immediate would suffice for most purposes.   
   >   
      
   This is one possible way:
   Produce a single immediate and then split it post-decode (such as in
   the RF stage).
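
A minimal sketch of the pack-then-split idea, assuming a 32-bit immediate with the 16-bit compare value in the high half and the 16-bit branch offset in the low half. The field order and the function names are assumptions made for illustration; neither is specified in the thread.

```python
def pack_imm32(value16: int, disp16: int) -> int:
    """Pack a 16-bit compare value (high half) and a 16-bit offset (low half)."""
    return ((value16 & 0xFFFF) << 16) | (disp16 & 0xFFFF)


def sext16(raw: int) -> int:
    """Sign-extend the low 16 bits of raw."""
    raw &= 0xFFFF
    return (raw ^ 0x8000) - 0x8000


def split_imm32(imm32: int):
    """Split the packed immediate back into (compare value, offset),
    as might be done after decode."""
    return sext16(imm32 >> 16), sext16(imm32)


imm = pack_imm32(-5, 300)
print(hex(imm), split_imm32(imm))
```

The decoder then only needs to produce one 32-bit immediate, with the split into two sign-extended halves happening in a later stage.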
      
      
   Can note in my case, I just sort of awkwardly routed a second immediate   
   output out of the Lane 1 decoder, which could then optionally replace   
   the Lane 3 immediate, with a special case where Lane 1 could then pull   
   the immediate from Lane 3 (where, in this case, Lane 3 is often used   
   when we need spare register ports or another immediate field; probably   
   more often than for actual instructions, *1).   
      
   I added both Branch-with-Immediate and Store-with-Immediate at
   the same time, as both effectively needed the same mechanism.
      
      
   *1: Where Lane 3 generally only sees instruction traffic in the case of:
      ALU | ALU | ALU   
   Or:   
      ALU | ALU | Load   
   Or similar.   
      
      
   >   
      