From: robfi680@gmail.com   
      
   On 2026-01-06 5:42 p.m., BGB wrote:   
   > On 12/31/2025 2:23 AM, Robert Finch wrote:   
   >>    
   >>   
   >>> One would argue that maybe prefixes are themselves wonky, but   
   >>> otherwise one needs:   
   >>> Instructions that can directly encode the presence of large immediate   
   >>> values, etc;   
   >>> Or, the use of suffix-encodings (which is IMHO worse than prefix   
   >>> encodings; at least prefix encodings make intuitive sense if one   
   >>> views the instruction stream as linear, whereas suffixes add   
   >>> weirdness and are effectively retro-causal, and for any fetch to be   
   >>> safe at the end of a cache line one would need to prove the non-   
   >>> existence of a suffix; so better to not go there).   
   >>>   
   >> I agree with this. Prefixes seem more natural, with large numbers   
   >> expanding to the left; suffixes seem like a big-endian approach. But I   
   >> use suffixes for large constants. I think that with most VLIs, constant   
   >> data follows the instruction. I find constant data easier to work with   
   >> that way, and it can be processed in the same clock cycle as the   
   >> decode, so it does not add to the dynamic instruction count. Just pass   
   >> the current instruction slot plus a following area of the cache-line to   
   >> the decoder.   
   >>   
   >   
   > ID stage is likely too late.   
   >   
   > For PC advance, ideally this needs to be known by the IF stage so that   
   > we can know how to advance PC for the next clock-cycle (for the PF stage).   
   >   
   > Say:   
   > PF IF ID RF E1 E2 E3 WB   
   > PF IF ID RF E1 E2 E3 WB   
   > PF IF ID RF E1 E2 E3 WB   
      
   The PC advance works fine without knowing whether a suffix is present   
   or not. The suffix is treated like a NOP instruction, so no decode is   
   required at the fetch stage, and the PC is allowed to land on a suffix.   
   It just always advances by N (four) instructions unless there is a branch.   
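   To make that fetch policy concrete, here is a minimal C sketch (the   
   names and the N=4 group size are illustrative, not the actual core's RTL):   

```c
#include <stdint.h>

#define SLOT_BYTES   4u  /* one 32-bit instruction slot        */
#define FETCH_SLOTS  4u  /* N: instruction slots advanced/cycle */

/* No pre-decode for suffixes: PC always steps by N slots unless a
 * branch redirects it. Landing on a suffix is harmless, since the
 * suffix word executes as a NOP. */
static uint64_t next_pc(uint64_t pc, int branch_taken, uint64_t target)
{
    return branch_taken ? target : pc + FETCH_SLOTS * SLOT_BYTES;
}
```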
   >   
   > So, each IF stage produces an updated PC that needs to reach PF within   
   > the same clock-cycle (so the SRAMs can fetch data for the correct cache-   
   > line, which happens on a clock edge).   
   >   
   > This may also need to mux PCs from things like the branch-predictor and   
   > branch-initiation logic, which then override the normal PC+Step handling   
   > generated from the IF->PF path (also typically at low latency).   
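   A rough C sketch of such a next-PC mux (the priority order and the   
   names here are my assumptions, not a description of the actual core):   

```c
#include <stdint.h>

typedef struct {
    int      valid;   /* does this source want to redirect the PC? */
    uint64_t target;
} pc_src_t;

/* Branch initiation (a resolved branch/mispredict) wins over the
 * predictor, which wins over the sequential IF->PF PC+Step path. */
static uint64_t pc_mux(uint64_t pc_step,
                       pc_src_t predictor,
                       pc_src_t branch_init)
{
    if (branch_init.valid)
        return branch_init.target;
    if (predictor.valid)
        return predictor.target;
    return pc_step;
}
```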
   >   
   >   
   > In this case, the end of the IF stage also handles some amount of   
   > repacking;   
   > Possible:   
   > Right-justifying the fetched instructions;   
   > 16 -> 32 bit repacking (for RV-C)   
   > Current:   
   > Renormalization of XG1/XG2/XG3 into the same internal scheme;   
   > Repacking 48-bit RISC-V ops into internal 64-bit forms;   
   > ...   
   >   
   > As a partial result of this repacking, the instruction words effectively   
   > gain a few extra bits (the "internal normalized format" no longer   
   > fitting entirely into a 32-bit word; where one could almost see it as a   
   > sort of "extended instruction" that includes both ISAs in a single   
   > slightly-larger virtual instruction word).   
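   For the RV-C repacking case, at least the length test itself is cheap;   
   a C sketch of the standard RISC-V rule (the repack details themselves   
   are core-specific and not shown):   

```c
#include <stdint.h>

/* Per the RISC-V spec: a 16-bit parcel whose low two bits are not
 * 0b11 is a compressed (16-bit) encoding; low bits 0b11 mark a
 * 32-bit-or-longer encoding. */
static int rvc_is_compressed(uint16_t parcel)
{
    return (parcel & 0x3u) != 0x3u;
}
```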
   >   
   >   
   > One could go further and try to re-normalize the full instruction   
   > layout, but as noted XG3 and RV would still differ enough as to make   
   > this annoying (mostly the different encoding spaces and immed formats).   
   >   
   > * zzzzzzz-ooooo-mmmmm-zzz-nnnnn-yy-yyy11   
   > * zzzz-oooooo-mmmmmm-zzzz-nnnnnn-yy-yyPw   
   >   
   >   
   > With a possible normalized format (36-bit):   
   > * zzzzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyyyPw   
   > * zzzzzzz-0ooooo-0mmmmm-yzzz-0nnnnn-1yyyy10 (RV Repack)   
   > * 000zzzz-oooooo-mmmmmm-zzzz-nnnnnn-0yyyyPw (XG3 Repack)   
   >   
   > Couldn't fully unify the encoding space within a single clock cycle   
   > though (within a reasonable cost budget).   
   >   
   >   
   > At present, the decoder handling is to essentially unify the 32-bit   
   > format for XG1/XG2/XG3 as XG2 with a few tag bits to disambiguate which   
   > ISA decoding rules should apply for the 32-bit instruction word in   
   > question. The other option would have been to normalize as XG3, but XG3   
   > loses some minor functionality from XG1 and XG2.   
   >   
   >   
   > I also decided against allowing RV and XG3 jumbo prefixes to be mixed.   
   > Though, it is possible exceptions could be made.   
   >   
   > The J52I prefix wouldn't have been needed if XG3 prefixes could have   
   > been used with RV ops, but XG3 prefixes can't be used in RV-C mode,   
   > which is part of why I ended up resorting to the J52I prefix hack. But   
   > this still doesn't fully address the issues that exist with   
   > hot-patching in this mode.   
   >   
   >   
   > Though, looking at options, the "cheapest but fastest" option at present   
   > likely being:   
   > Core that only does XG3, possibly dropping the RV encodings and re-   
   > adding WEX in its place (though, in such an XG3-Only mode, the 10/11   
   > modes would otherwise be identical in terms of encoding).   
   >   
   > Or, basically, XG3 being used in a way more like how XG2 was used.   
   >   
   > But, I don't really want to create yet more modes at the moment. Using   
   > XG3 as superscalar isn't too much more expensive, and is arguably more   
   > flexible: the compiler doesn't need to be aware of pipeline-scheduling   
   > specifics, but can still exploit them when shuffling instructions   
   > around for efficiency. A mismatch then merely results in a small   
   > reduction in efficiency rather than a potential inability of the code   
   > to run (though for XG2 there was the feature that the CPU could fall   
   > back to scalar or potential superscalar operation in cases where the   
   > compiler's bundling was incompatible with what the CPU allowed).   
   >   
   > So, it is possible that in-order superscalar may be better as a general   
   > purpose option even if not strictly the cheapest option.   
   >   
   >   
   > A case could maybe be made arguing for dropping back down to 32 GPRs   
   > (with no FPRs) for more cheapness, but as-is, trying to do 128-bit SIMD   
   > stuff in RV64 mode also tends to quickly run into issues with register   
   > pressure.   
   >   
   > Well, and I was just recently having to partly rework the mechanism for:   
   > v = (__vec4f) { x, y, z, w };   
   > to not try to load all the registers at the same time, as this was   
   > occasionally running out of free dynamic registers with the normal RV   
   > ABI (and 12 callee-save FPRs don't go quite so far when allocating   
   > pairs of them), which effectively causes the compiler to break.   
   >   
   >   
   > It is almost tempting to consider switching RV64 over to the XG3 ABI   
   > when using SIMD, well, and/or not use SIMD with RV64 because it kinda   
   > sucks worse than XG3.   
   >   
   > But... Comparably, for the TKRA-GL front-end (using syscalls for the   
   > back-end), using runtime calls and similar for vector operations does   
   > still put a big dent in the framerate for GLQuake (so, some sort of SIMD   
   > in RV mode may still be needed even if "kinda inferior").   
   >   
   >   
   >> Handling suffixes at the end of a cache-line is not too bad if the   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   