Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 129,563 of 131,241    |
|    BGB to John Savard    |
|    Re: Concertina III May Be Returning (1/2)    |
|    02 Sep 25 13:07:07    |
From: cr88192@gmail.com

On 9/2/2025 4:15 AM, John Savard wrote:
> On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
>
>> How about, say, 16/32/48/64/96:
>> xxxx-xxxx-xxxx-xxx0                      //16 bit
>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
>>
>> Already elaborate enough...
>
> Thank you for your interesting suggestions.
>
> I'm envisaging Concertina III as closely based on Concertina II, with only
> minimal changes.
>
> Like Concertina II, it is to meet the overriding condition that
> instructions do not have to be decoded sequentially. This means that
> whenever an instruction, or group of instructions, spans more than 32
> bits, the 32 bit areas of the instruction, other than the first, must
> begin with a combination of bits that says "don't decode me".
>
> The first 32 bits of an instruction get decoded directly, and then trigger
> and control the decoding of the rest of the instruction.
>
> This has the consequence that any immediate value that is 32 bits or more
> in length has to be split up into smaller pieces; this is what I really
> don't like about giving up the block structure.

Note that tagging like that described does still allow some amount of
parallel decoding, since we still have combinatorial logic. Granted,
scalability is an issue.

As can be noted, my use of jumbo prefixes for large immediate values
does have the property of allowing 32-bit decoders to be reused for 64-bit
and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings
don't change the instruction being decoded, but merely extend it.
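The length-tagging scheme quoted at the top can be sketched as a small classifier; this is a reading of the bit patterns as written in the quote, not a real decoder for any of these ISAs:

```c
#include <stdint.h>

/* Sketch of the quoted 16/32/48+ tagging scheme:
   bit 0 clear            -> 16-bit op   (xxxx-xxxx-xxxx-xxx0)
   low 6 bits all set     -> 48/64/96-bit prefix form (length resolved
                             by later prefix bits, not modeled here)
   otherwise (bit 0 set)  -> 32-bit op   (...-xxyy-yyy1) */
static int fetch_len_bits(uint16_t first_hword) {
    if ((first_hword & 0x0001) == 0)
        return 16;
    if ((first_hword & 0x003F) == 0x003F)
        return 48;
    return 32;
}
```

Since the tag lives entirely in the first 16-bit parcel, a fetch unit can evaluate this in parallel across all parcels of a fetch block with pure combinatorial logic, which is the point being made about tagging still permitting parallel decode.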
Some internal plumbing is needed to stitch the immediate values together
though, typically:
  We have OpA, OpB, OpC;
  DecC gets OpC, and JBits from OpB;
  DecB gets OpB, and JBits from OpA;
  DecA gets OpA, and 0 for JBits.

In my CPU core, I had a few times considered changing how decoding
worked, to either reverse or right-align the instruction block to reduce
the amount of MUX'ing needed in the decoder. If going for
right-alignment, then DecC would always go to Lane1, DecB to Lane2, and
DecA to Lane3.

Can note that for immediate handling, the Lane1 decoder produces the low
33 bits of the result. If a decoder has a jumbo prefix and is itself
given a jumbo prefix, it assumes a 96-bit encoding and produces the
value for the high 32 bits.

At least in my designs, I only account for 33 bits of immediate per
lane. Instead, when a full 64-bit immediate is encoded, its value is
assembled in the ID2/RF stage.

Though, admittedly, my CPU core design did fall back to sequential
execution for 16-bit ops, but this was partly for cost reasons.

For BJX2/XG1 originally, it was because the instructions couldn't use
WEX tagging, but after adding superscalar support it was because I would
either need multiple parallel 16-bit decoders, or to change how 16-bit
ops were handled (likely using a 16->32 repacker).

So, say:
  IF stage:
    Retrieve instructions from the cache line;
    Determine fetch length:
      XG1/XG2 used explicit tagging;
      XG3 and RV use superscalar checks.
    Run repackers.
      Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
  Decode stage:
    Decode N parallel 32-bit ops;
    Prefixes route to the corresponding instructions;
    Any lane holding solely a prefix goes NOP.
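The immediate stitching at the ID2/RF stage can be sketched roughly as below; the exact field split is an assumption (Lane1 contributing up to 33 bits, the high decoder contributing the upper 32, with the high bits taking priority over the low value's sign extension):

```c
#include <stdint.h>

/* Hypothetical sketch of 64-bit immediate assembly: for a 96-bit
   (jumbo+jumbo+op) encoding, Lane1 supplies the low bits of the
   immediate and the other decoder supplies the high 32 bits, joined
   late in the ID2/RF stage rather than inside any single lane decoder. */
static uint64_t assemble_imm64(uint64_t lane1_low33, uint32_t high32) {
    /* The high 32 bits replace bits 63:32 (including Lane1's
       33rd/sign-extension bit, which only matters when no high
       half is present). */
    return (lane1_low33 & 0xFFFFFFFFull) | ((uint64_t)high32 << 32);
}
```

Keeping each lane decoder limited to 33 bits keeps the per-lane MUXing small; only the late assembly step ever sees the full 64-bit value.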
For a repacker, it would help if there were fairly direct mappings
between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
appear to fit such a pattern. Personally, there isn't much good to say
about RVC's encoding scheme, as it is very much ad-hoc dog chew.

The usual claim is more that it is "compressed" in that you can first
generate a 32-bit op internally and "squish" it down into a 16-bit form
if it fits. This isn't terribly novel as I see it. Repacking RVC has
similar problems to decoding it directly, namely that for a fair number
of instructions, nearly each instruction has its own idiosyncratic
encoding scheme (you can't simply shuffle some of the bits around,
fill others with 0s, and arrive back at a valid RVI instruction).

Contrast, say, XG3, which is mostly XG2 with the bits shuffled around,
though there were some special cases made in the decoding rules. Though,
admittedly, I did do more than the bare minimum here (to fit it into the
same encoding space as RV), mostly as I ended up going for a "dog chew
reduction" route rather than merely the bare minimum bit-shuffling
needed to make it fit.

For better or worse, this effectively made XG3 its own ISA as far as
BGBCC is concerned. Even if in theory I could have used repacking, the
original XG1/XG2 emitter logic is a total mess. It was written
originally for fixed-length 16-bit ops, so it encodes and outputs
instructions 16 bits at a time (using big "switch()" blocks; the
RISC-V and XG3 emitters also went this path, and as far as BGBCC is
concerned, it is treating XG3 as part of RISC-V land).

Both the CPU core and also JX2VM handle it by repacking to XG2 though.
For the XG3VM (a userland-only emulator for now), it instead decodes XG3
directly, with decoders for XG3, RVI, and RVC.
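For reference, one of the more regular RVC cases can be repacked mechanically; the sketch below expands C.ADDI into RVI ADDI per the published RISC-V spec. The complaint above is that many other RVC forms scramble their immediate bits differently, so each one needs its own variant of this routine:

```c
#include <stdint.h>

/* 16->32 repack of RVC's C.ADDI (CI format) into RVI ADDI rd, rd, imm.
   C.ADDI fields: imm[5] at bit 12, rd/rs1 at bits 11:7,
                  imm[4:0] at bits 6:2, funct3=000, quadrant 01.
   RVI ADDI:      imm[11:0] at 31:20, rs1 at 19:15, funct3=000,
                  rd at 11:7, opcode 0010011 (OP-IMM). */
static uint32_t repack_c_addi(uint16_t c) {
    uint32_t rd  = (c >> 7) & 0x1F;
    int32_t  imm = (((c >> 12) & 1) << 5)    /* imm[5] */
                 | ((c >> 2) & 0x1F);        /* imm[4:0] */
    imm = (imm << 26) >> 26;                 /* sign-extend 6 bits */
    return ((uint32_t)(imm & 0xFFF) << 20)
         | (rd << 15)                        /* rs1 = rd */
         | (rd << 7)
         | 0x13;                             /* OP-IMM, funct3=000 */
}
```

For example, `c.addi a0, -1` (0x157D) repacks to `addi a0, a0, -1` (0xFFF50513). When every compressed op followed a pattern this regular, a repacker would be cheap; RVC's problem is that it mostly doesn't.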
Had noted the relative irony that despite XG3 having a longer
instruction listing (than RVI), it still ends up with a slightly shorter
decoder.

Some of this has to do with one big annoyance of RISC-V's encoding
scheme: its inconsistent and dog-chewed handling of immediate and
displacement values.

Though, for mixed output, there are still a handful of cases where RVI
encodings can beat XG3 encodings, mostly involving cases where the RVI
encodings have a slightly larger displacement.

In compiler stats, this seems to mostly affect:
  LB, LBU, LW, LWU
  SB, SW
  ADDI, ADDIW, LUI
The former:
  unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store;
  ADDI: 12b > 10b;
  LUI: because loading a 32-bit value of the form XXXXX000 does happen
  sometimes, it seems.

Instruction counts are low enough that a "pure XG3" would likely result
in Doom being around 1K larger (the cases where RVI ops are used would
need a 64-bit jumbo encoding in XG3).

Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
in that this effectively gives it an Imm11s encoding; and ADD is one of
the main instructions that tends to be big-immediate-heavy (in early
design it was a close race between ADD ImmU/ImmN vs ADD/SUB ImmU, but
the current scheme has a tiny bit more range, albeit SUB-ImmU could have
possibly avoided the need for an ImmN case).

So, say:

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
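The displacement-range tradeoff above can be put in numbers; the field widths are as described in the post, but the exact signed/unsigned and scaling choices here are assumptions for illustration:

```c
#include <stdint.h>

/* Maximum positive byte reach of a load/store displacement field:
   'bits' wide, signed or unsigned, scaled by the access size. */
static int64_t reach_max(int bits, int is_signed, int scale) {
    int64_t maxv = is_signed ? ((1LL << (bits - 1)) - 1)
                             : ((1LL << bits) - 1);
    return maxv * scale;
}
/* RVI LB:  12-bit signed, unscaled      -> reach 2047 bytes.
   10-bit unscaled byte load (XG3-style) -> reach 1023 bytes,
   which is why unscaled 12-bit wins for 8/16-bit load/store.
   For 32-bit loads the scaled form catches up: 10 bits * 4 -> 4092.
   Similarly, ADD's separate Imm10u (0..1023) and Imm10n forms together
   cover roughly what a single 11-bit signed field (Imm11s) would. */
```

This is the whole story behind the compiler stats: the only RVI wins are exactly the cases where scaling does not help (byte/halfword accesses and plain ADDI), plus the LUI pattern.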
(c) 1994, bbs@darkrealms.ca