Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 129,571 of 131,241    |
|    BGB to EricP    |
|    Re: Concedtina III May Be Returning    |
|    04 Sep 25 00:20:03    |
From: cr88192@gmail.com

On 9/3/2025 9:42 PM, EricP wrote:
> MitchAlsup wrote:
>>
>> However, I also found that STs need an immediate and a displacement, so,
>> Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
>> potential displacement (from D12ds above) and the immediate has the
>> size of the ST. This provides for::
>> std #4607182418800017408,[r3,r2<<3,96]
>
> Compare and Branch can also use two immediates as it
> has reg-reg or reg-imm compares plus displacement.
> And has high enough frequency to be worth considering.
>

Can be done, yes.
  Whether the frequency is high enough is where the possible debate lies.


Checking stats, it can affect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.


One could argue:
This is high enough to care, but is it cheap enough?...


I had experimented with this before, where if a Jumbo-Op64 prefix was
used, and if flagged to synthesize an immediate for a conditional branch
or store, it would do a Branch-With-Immediate or Store-With-Immediate.

This applies to all the ISAs, though with separate configuration flags
for the RV+Jx and XG1/2/3 ISAs (mostly separately enabling or disabling
the added routing). Both depend on mostly the same plumbing internally
though.


Though, checking:
My CPU core was already paying for it, as it seems I had left it enabled.
Whatever the case, the added cost wasn't high enough for me to bother
disabling it again.

Testing (in Vivado, turning it off again):
  Cost delta: ~ 300 LUTs.
OK, so the logic cost delta is borderline negligible, it seems.

In the case of its support in RV+Jx, it basically has a 17-bit
sign-extended immediate with a 12-bit displacement.

Note that using a Jumbo_Imm prefix won't work, as this would instead
extend the displacement to 33 bits. So, it needs to be a Jumbo_Op64 prefix.


Enabling it again in the Doom build for RV64GC+Jx has a slight negative
effect on code density.

Though, this stands to reason:
  JOp64+Bcc needs 8 bytes;
  C.LI+Bcc needs 6 bytes.

One can try to ram it into a 32-bit encoding, but then what it actually
achieves in terms of hit rate is low enough to make it negligible.


Effects on Doom:
  No obvious change in framerates;
  Average trace length gets 1.7% shorter.


Theoretically, this also might make it 1.7% faster, but 1.7% is below the
threshold of what is easily seen in average Doom framerate.

Trying "-timedemo demo1" with Doom:
  XG2                     : 1710 gametics / 1588 realtics
  XG3                     : 1710 gametics / 1605 realtics
  RV64GC+Jx (BccImm=false): 1710 gametics / 1882 realtics
  RV64GC+Jx (BccImm=true ): 1710 gametics / 1897 realtics
  RV64GC (plain)          : 1710 gametics / 2387 realtics

So, plain RV64GC being ~ 50% slower than XG2 or XG3 in this test...

Granted, I had been working on and off on making RV64G support "less
terrible" (it used to be slower). Note it isn't just about BGBCC being
slow with plain RV64, GCC output also being kinda slow.


And, Bcc+Imm seemingly makes it very slightly slower somehow (despite
making the average trace length slightly shorter...).

Then I remember another downside:
  Bcc+Imm doesn't work with the branch predictor;
  So, the branches cost more cycles.

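For reference, the relative slowdowns implied by the timedemo numbers can be checked with a few lines of Python. This is only a sketch of the arithmetic: the realtics values are the ones quoted in the post, while the helper name slowdown_percent and the choice of XG2 as the baseline are assumptions made here for illustration.

```python
# Realtics reported by "-timedemo demo1" in the post (1710 gametics each).
# Fewer realtics means the demo ran faster.
REALTICS = {
    "XG2": 1588,
    "XG3": 1605,
    "RV64GC+Jx (BccImm=false)": 1882,
    "RV64GC+Jx (BccImm=true)": 1897,
    "RV64GC (plain)": 2387,
}


def slowdown_percent(realtics, baseline):
    """Percent more realtics than the baseline (positive means slower).

    Hypothetical helper, not something from the original post.
    """
    return (realtics / baseline - 1.0) * 100.0


if __name__ == "__main__":
    base = REALTICS["XG2"]
    for name, tics in REALTICS.items():
        print(f"{name:26s} {slowdown_percent(tics, base):+6.1f}% vs XG2")

    # Effect of enabling Bcc+Imm on the RV64GC+Jx build alone.
    delta = slowdown_percent(
        REALTICS["RV64GC+Jx (BccImm=true)"],
        REALTICS["RV64GC+Jx (BccImm=false)"],
    )
    print(f"BccImm=true vs BccImm=false: {delta:+.2f}%")
```

Run this way, plain RV64GC comes out at roughly 50% more realtics than XG2, and enabling Bcc+Imm costs a bit under 1% on the RV64GC+Jx build, which seems to be the measured slowdown being discussed.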
So, rough estimate:
  ~ 1.7 to 1.9% faster, assuming it were supported by the branch predictor;
  ~ or, 0.8% slower without branch predictor support.


So, it may not be an issue of whether or not to support it, but rather
whether or not to add support for dealing with it to the branch predictor.


> But it also doesn't need two immediates.
> A 16-bit integer or float and a 16-bit offset packed into a
> single 32-bit immediate would suffice for most purposes.
>

This is one possible way:
Produce a single immediate and then split it post decode (such as in the
RF stage).


Can note in my case, I just sort of awkwardly routed a second immediate
output out of the Lane 1 decoder, which could then optionally replace
the Lane 3 immediate, with a special case where Lane 1 could then pull
the immediate from Lane 3 (where, in this case, Lane 3 is often used
when we need spare register ports or another immediate field; probably
more often than for actual instructions, *1).

I added both Branch-with-Immediate and Store-with-Immediate at the same
time, as both effectively needed the same mechanism.


*1: Where, Lane1 generally only sees traffic in the case of:
  ALU | ALU | ALU
Or:
  ALU | ALU | Load
Or similar.


>

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
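As an illustration of the "produce a single immediate and then split it post decode" option discussed in the message, here is a minimal Python sketch. The field layout (compare value in bits 31:16, branch offset in bits 15:0) and the names sign_extend, pack_cmp_branch and split_cmp_branch are assumptions made for this example only; they are not taken from any of the ISAs mentioned.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    value &= mask
    return (value ^ sign) - sign


def pack_cmp_branch(imm16, disp16):
    """Pack a 16-bit compare immediate and a 16-bit offset into one 32-bit field.

    Hypothetical layout: immediate in bits 31:16, offset in bits 15:0.
    """
    return ((imm16 & 0xFFFF) << 16) | (disp16 & 0xFFFF)


def split_cmp_branch(imm32):
    """Split the packed field back into (immediate, offset), both sign-extended.

    This models the split a later pipeline stage would do after decode.
    """
    return sign_extend(imm32 >> 16, 16), sign_extend(imm32, 16)


if __name__ == "__main__":
    packed = pack_cmp_branch(-5, 1024)
    print(hex(packed), split_cmp_branch(packed))
```

The same sign_extend helper with bits=17 would model the 17-bit sign-extended immediate mentioned for the RV+Jx case; the 12-bit displacement there is a separate field rather than part of a packed word.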
(c) 1994, bbs@darkrealms.ca