From: cr88192@gmail.com   
      
   On 1/6/2026 5:49 PM, MitchAlsup wrote:   
   >   
   > Robert Finch posted:   
   >   
   >>    
   >>   
   >>> One would argue that maybe prefixes are themselves wonky, but otherwise   
   >>> one needs:   
   >>> Instructions that can directly encode the presence of large immediate   
   >>> values, etc;   
   >   
   > This is the direction of My 66000.   
   >   
   > The instruction stream is a linear stream of words.   
   > The first word of each instruction encodes its total length.   
   > What follows the instruction itself are merely constants used as   
   > operands in the instruction itself. All constants are 1 or 2   
   > words in length.   
   >   
   > I would not call this means "prefixed" or "suffixed". Generally,   
   > prefixes and suffixes consume bits of the prefix/suffix so that   
   > the constant (in my case) is not equal to container size. This   
   > leads to wonky operand/displacement sizes not equal 2^(3+k).   
   >   
      
   OK.   
      
   As can be noted:
     XG2/3: Prefix scheme, 1/2/3 x 32-bit.
       The 96-bit cases are determined by two prefixes,
       so 2 words must be examined to know the total length.
     RV64+Jx: Total length is known from the first instruction word:
       Base op: 32 bits;
       J21I: 64 bits;
       J52I: 96 bits.
       There was a J22+J22+LUI special case,
       but I now consider it deprecated;
       J52I+ADDI is now preferable.
      
   As for Imm/Disp sizes:   
    XG1: 9/33/57   
    XG2 and XG3: 10/33/64   
    RV+JX: 12/33/64   
      
   For XG1, the 57-bit size was rarely used and only optionally supported,
   mostly existing because of the otherwise-gaping gulf: no way to encode
   immediate values between 34 and 62 bits.
      
      
   >>> Or, the use of suffix-encodings (which is IMHO worse than prefix   
   >>> encodings; at least prefix encodings make intuitive sense if one views   
   >>> the instruction stream as linear, whereas suffixes add weirdness and are   
   >>> effectively retro-causal, and for any fetch to be safe at the end of a   
   >>> cache line one would need to prove the non-existence of a suffix; so   
   >>> better to not go there).   
   >>>   
   >> I agree with this. Prefixes seem more natural, large numbers expanding   
   >> to the left, suffixes seem like a big-endian approach. But I use   
   >> suffixes for large constants. I think with most VLI constant data   
   >> follows the instruction.   
   >   
   > But not "self identified".   
   >   
      
   Yeah, if you can't tell from the first word whether more instruction
   words follow, this is a drawback.

   Likewise, it is a problem if determining the length requires looking
   at some special combination of register specifiers and/or a lot of
   other bits.
      
      
   >> I find constant data easier to work with that   
   >> way and they can be processed in the same clock cycle as a decode so   
   >> they do not add to the dynamic instruction count. Just pass the current   
   >> instruction slot plus a following area of the cache-line to the decoder.   
   >>   
   >> Handling suffixes at the end of a cache-line is not too bad if the cache   
   >> already handles instructions spanning a cache line. Assume the maximum   
   >> number of suffixes is present and ensure the cache-line is wide enough.   
   >> Or limit the number of suffixes so they fit into the half cache-line   
   >> used for spanning.   
   >>   
   >> It is easier to handle interrupts with suffixes. The suffix can just be   
   >> treated as a NOP. Adjusting the position of the hardware interrupt to   
   >> the start of an instruction then does not have to worry about accounting   
   >> for a prefix / suffix.   
   >   
   > I would have thought that the previous instruction (last one retired) would   
   > provide the starting point of the subsequent instruction. This way you don't   
   > have to worry about counting prefixes or suffixes.   
   >   
      
   Yeah.   
      
   My thinking is, typical advance:
     IF figures out how much to advance;
     the next instruction gets PC+Step.

   Then, on an interrupt:
     Figure out which position in the pipeline the interrupt starts from;
     start there, flushing the rest of the pipeline.
     For a faulting instruction, this is typically the EX1 or EX2 stage:
       EX1 if it is a TRAP or SYSCALL;
       EX2 if it is a TLB miss or similar;
       unless EX2 is not a valid spot (flush or bubble),
       in which case look for a spot that is not a flush or bubble
       (this case usually happens for branch-related TLB misses).
      
   Usually EX3 or WB is too old, as it would mean re-running previous   
   instructions.   
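   As a rough sketch of the above selection logic (the stage set, flags,
   and names here are illustrative assumptions, not the actual core's
   datapath):

```c
#include <assert.h>

/* Illustrative pipeline-slot model; EX3/WB are excluded, as the text
   notes they are too old to restart from. */
enum stage { EX1, EX2, NSTAGES };

typedef struct {
    int valid;        /* 0 = bubble or flushed slot */
    int is_trap;      /* TRAP or SYSCALL */
    int is_tlb_miss;  /* TLB miss or similar */
} slot_t;

/* Pick which pipeline stage the interrupt starts from; everything
   behind that stage would then be flushed. */
int pick_interrupt_stage(const slot_t pipe[NSTAGES])
{
    if (pipe[EX1].valid && pipe[EX1].is_trap)
        return EX1;                 /* TRAP/SYSCALL: enters at EX1 */
    if (pipe[EX2].valid && pipe[EX2].is_tlb_miss)
        return EX2;                 /* TLB miss: enters at EX2 */
    /* EX2 not a valid spot (flush/bubble): look for one that is,
       e.g. for branch-related TLB misses. */
    for (int s = EX2; s >= EX1; s--)
        if (pipe[s].valid)
            return s;
    return -1;                      /* nothing valid in flight */
}
```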
      
   Getting the exact stage timing correct for interrupts is a little
   fiddly, but prefix/suffix handling isn't usually a concern for
   interrupts; though, if PC somehow ended up pointing inside another
   instruction, I would consider this a fault.
      
   For branch calculations in XG3 and RV, the displacement is usually
   relative to the BasePC before the prefix in the case of prefixed
   encodings. This differs from XG1 and XG2, which defined branches
   relative to the PC of the following instruction.
      
   This difference was partly due to implementation reasons and partly
   for consistency with RISC-V (when using a shared encoding space, it
   makes sense for all branches to define PC displacements in a
   consistent way).
      
      
   Though, there is the difference that XG3's branches use a 32-bit scale   
   rather than a 16-bit scale. Well, and unlike RV's displacements, they   
   are not horrible confetti (*1).   
      
   *1: One can try to write a new RV decoder, then place bets on whether
   one gets the JAL and Bcc encodings correct on the first try. IME, one
   almost invariably screws these up in some way on the first attempt.
   JAL's displacement encoding is "the gift that keeps on giving" in
   this sense.
      
   Like, they were like:
     ADDI / Load:
       Yay, contiguous bits;
     Store:
       Well, swap the registers around and put the disp where Rd went.
     Bcc:
       Well, take the Store disp and just shuffle around a few more bits;
     JAL:
       Well, now there are some more bits, and Rd is back, ...
       Why not keep some of the bits from Bcc,
       but stick everything else in random places?...
       Well, I guess some share the relative positions of LUI, but, ...
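   To make the shuffle concrete, here is a sketch in C of the Bcc and
   JAL displacement extraction, using the bit layouts from the RISC-V
   spec (the function names are mine):

```c
#include <stdint.h>

/* B-type (Bcc): imm[12]=inst[31], imm[11]=inst[7],
   imm[10:5]=inst[30:25], imm[4:1]=inst[11:8], imm[0]=0. */
int32_t rv_b_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)  << 12)
                 | (((inst >> 7)  & 0x1u)  << 11)
                 | (((inst >> 25) & 0x3Fu) << 5)
                 | (((inst >> 8)  & 0xFu)  << 1);
    return (int32_t)(imm << 19) >> 19;   /* sign-extend from bit 12 */
}

/* J-type (JAL): imm[20]=inst[31], imm[19:12]=inst[19:12],
   imm[11]=inst[20], imm[10:1]=inst[30:21], imm[0]=0. */
int32_t rv_j_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)   << 20)
                 | (((inst >> 12) & 0xFFu)  << 12)
                 | (((inst >> 20) & 0x1u)   << 11)
                 | (((inst >> 21) & 0x3FFu) << 1);
    return (int32_t)(imm << 11) >> 11;   /* sign-extend from bit 20 */
}
```

   Four distinct source fields per immediate, in neither source order
   nor value order; easy to get one field's position wrong and still
   pass simple tests.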
      
   Not perfect in XG3 either, but still:
     { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }
   is nowhere near the same level of nasty...
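   As a C transcription of that concatenation (the field layout is taken
   solely from the expression above, XG3 being my own ISA, so treat this
   as a sketch; the 33-bit result is widened to a signed 64-bit value):

```c
#include <stdint.h>

/* { opw[5] ? 11'h7FF : 11'h000, opw[11:6], opw[31:16] }:
   two contiguous fields plus a replicated sign bit. */
int64_t xg3_branch_disp(uint32_t opw)
{
    uint64_t lo  = (opw >> 16) & 0xFFFFu;  /* opw[31:16] -> disp[15:0]  */
    uint64_t mid = (opw >> 6)  & 0x3Fu;    /* opw[11:6]  -> disp[21:16] */
    uint64_t v   = (mid << 16) | lo;
    if ((opw >> 5) & 1u)                   /* opw[5] fills disp[32:22]  */
        v |= ~0ull << 22;                  /* (sign extension)          */
    return (int64_t)v;
}
```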
      
      
   Well, never mind that the actual decoder has the reverse issue: in
   the VL core and JX2VM, it was internally repacked back into XG2 form,
   which means a little bit of hair going on here. Also, I was
   originally going to relocate it in the encoding space, but ended up
   moving it back to its original location, as (mostly due to sharing
   the same decoder) having BRA/BSR in two different locations would
   have effectively burned more encoding space than just leaving it
   where it had been in XG1/XG2 (even if having BRA/BSR in the F0 block is
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   