Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 129,560 of 131,241    |
|    BGB to BGB    |
|    Re: Concertina III May Be Returning (1/2)    |
|    02 Sep 25 13:10:23    |
From: cr88192@gmail.com

On 9/2/2025 1:07 PM, BGB wrote:
> On 9/2/2025 4:15 AM, John Savard wrote:
>> On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
>>
>>> How about, say, 16/32/48/64/96:
>>> xxxx-xxxx-xxxx-xxx0                     // 16 bit
>>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 // 32 bit
>>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 // 64/48/96 bit prefix
>>>
>>> Already elaborate enough...
>>
>> Thank you for your interesting suggestions.
>>
>> I'm envisaging Concertina III as closely based on Concertina II, with
>> only minimal changes.
>>
>> Like Concertina II, it is to meet the overriding condition that
>> instructions do not have to be decoded sequentially. This means that
>> whenever an instruction, or group of instructions, spans more than 32
>> bits, the 32-bit areas of the instruction, other than the first, must
>> begin with a combination of bits that says "don't decode me".
>>
>> The first 32 bits of an instruction get decoded directly, and then
>> trigger and control the decoding of the rest of the instruction.
>>
>> This has the consequence that any immediate value that is 32 bits or
>> more in length has to be split up into smaller pieces; this is what I
>> really don't like about giving up the block structure.
>>
>
> Note that tagging like that described does still allow some amount of
> parallel decoding, since we still have combinatorial logic. Granted,
> scalability is an issue.
>
> As can be noted, my use of jumbo prefixes for large immediate values
> has the property of allowing 32-bit decoders to be reused for 64-bit
> and 96-bit instructions. In most cases, the 64-bit and 96-bit
> encodings don't change the instruction being decoded, but merely
> extend it.
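The 16/32/48/64/96 tagging scheme quoted above can be classified from the low bits of an encoding unit alone, which is what makes non-sequential decoding possible. A minimal sketch of that check, assuming the tag bits sit in the low 6 bits of each fetched word (the function and enum names are hypothetical, not from the post):

```c
#include <stdint.h>

/* Length classification for the quoted tagging scheme (sketch only):
   bit 0 clear          -> 16-bit op
   low 6 bits == 111111 -> prefix word of a 48/64/96-bit encoding
   otherwise            -> plain 32-bit op
   Note the yyyyy=11111 case of the 32-bit pattern is assumed to be
   reserved for the prefix, so the checks below do not collide. */
enum fetch_kind { FETCH_16, FETCH_32, FETCH_PREFIX };

static enum fetch_kind classify_word(uint32_t w)
{
    if ((w & 0x01) == 0)
        return FETCH_16;
    if ((w & 0x3F) == 0x3F)
        return FETCH_PREFIX;
    return FETCH_32;
}
```

Since the check touches only 6 bits, each lane can run it independently, which fits the "no sequential decode" constraint discussed above.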
>
> Some internal plumbing is needed to stitch the immediate values
> together, though. Typically:
>   We have OpA, OpB, OpC;
>   DecC gets OpC, and JBits from OpB;
>   DecB gets OpB, and JBits from OpA;
>   DecA gets OpA, and 0 for JBits.
>
> In my CPU core, I had a few times considered changing how decoding
> worked, to either reverse or right-align the instruction block to
> reduce the amount of MUX'ing needed in the decoder. If going for
> right-alignment, then DecC would always go to Lane1, DecB to Lane2,
> and DecA to Lane3.
>
> Can note that for immediate handling, the Lane1 decoder produces the
> low 33 bits of the result. If a decoder holds a jumbo prefix and is
> itself given a jumbo prefix, it assumes a 96-bit encoding and
> produces the value for the high 32 bits.
>
> At least in my designs, I only account for 33 bits of immediate per
> lane; when a full 64-bit immediate is encoded, its value is assembled
> in the ID2/RF stage.
>
>
> Though, admittedly, my CPU core design did fall back to sequential
> execution for 16-bit ops, partly for cost reasons.
>
> For BJX2/XG1 originally, this was because the instructions couldn't
> use WEX tagging, but after adding superscalar it was because I would
> either need multiple parallel 16-bit decoders, or need to change how
> 16-bit ops were handled (likely using a 16->32 repacker).
>
> So, say:
> IF stage:
>   Retrieve instruction from cache line;
>   Determine fetch length:
>     XG1/XG2 used explicit tagging;
>     XG3 and RV use superscalar checks.
>   Run repackers.
>     Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
> Decode Stage:
>   Decode N parallel 32-bit ops;
>   Prefixes route to the corresponding instructions;
>   Any lane holding solely a prefix goes NOP.
>
>
> For a repacker, it would help if there were fairly direct mappings
> between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
> appear to fit such a pattern. Personally, there isn't much good to
> say about RVC's encoding scheme, as it is very much ad-hoc dog chew.
>
> The usual claim is more that it is "compressed" in that you can first
> generate a 32-bit op internally and "squish" it down into a 16-bit
> form if it fits. This isn't terribly novel as I see it. Repacking RVC
> has similar problems to decoding it directly: for a fair number of
> instructions, nearly each one has its own idiosyncratic encoding
> scheme, and you can't simply shuffle some of the bits around and fill
> others with 0s to arrive back at a valid RVI instruction.
>
>
> Contrast, say, XG3, which is mostly XG2 with the bits shuffled
> around, though there were some special cases made in the decoding
> rules. Admittedly, I did do more than the bare minimum here (to fit
> it into the same encoding space as RV), mostly as I ended up going
> for a "Dog Chew Reduction" route rather than merely the bare minimum
> bit-shuffling needed to make it fit.
>
> For better or worse, this effectively made XG3 its own ISA as far as
> BGBCC is concerned. Even if in theory I could have used repacking,
> the original XG1/XG2 emitter logic is a total mess.
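The "generate a 32-bit op and squish it" relationship described above can be seen even in the simplest RVC instruction. Below is a sketch of a 16->32 repacker case for just C.ADDI, using the field positions from the RISC-V spec; the point of the complaint is that a full repacker needs a separate hand-built case like this for nearly every RVC encoding:

```c
#include <stdint.h>

/* Expand C.ADDI (quadrant 01, funct3=000) into the equivalent RVI
   ADDI rd, rd, imm. C.ADDI packs imm[5] at bit 12, rd at bits 11:7,
   and imm[4:0] at bits 6:2 of the 16-bit word. */
static uint32_t expand_c_addi(uint16_t c)
{
    uint32_t rd  = (c >> 7) & 0x1F;
    int32_t  imm = (int32_t)((((c >> 12) & 1u) << 5) | ((c >> 2) & 0x1F));
    if (imm & 0x20)
        imm -= 0x40;   /* sign-extend from 6 bits */

    /* ADDI layout: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 */
    return ((uint32_t)(imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0x13u;
}
```

For example, 0x0505 (c.addi a0, 1) expands to 0x00150513 (addi a0, a0, 1). Note how the immediate and register bits must be individually relocated rather than shifted as a block, which is the "ad-hoc" cost being complained about.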
> It was written originally for fixed-length 16-bit ops, so it encodes
> and outputs instructions 16 bits at a time (using big "switch()"
> blocks). The RISC-V and XG3 emitters also went this path; as far as
> BGBCC is concerned, it is treating XG3 as part of RISC-V land.
>
>
> Both the CPU core and JX2VM handle it by repacking to XG2, though.
> The XG3VM (userland-only emulator for now) instead decodes XG3
> directly, with decoders for XG3, RVI, and RVC.
>
> Had noted the relative irony that, despite XG3 having a longer
> instruction listing than RVI, it still ends up with a slightly
> shorter decoder.
>
> Some of this has to do with one big annoyance of RISC-V's encoding
> scheme: its inconsistent and dog-chewed handling of immediate and
> displacement values.
>
>
> Though, for mixed output, there are still a handful of cases where
> RVI encodings can beat XG3 encodings, mostly cases where the RVI
> encodings have a slightly larger displacement.
>
> In compiler stats, this seems to mostly affect:
>   LB, LBU, LW, LWU
>   SB, SW
>   ADDI, ADDIW, LUI
> For the loads/stores: unscaled 12-bit beats scaled 10-bit for 8-bit
> and 16-bit load/store.
> ADDI: 12 bits beats 10 bits.
> LUI: because loading a 32-bit value of the form XXXXX000 does happen
> sometimes, it seems.
>
> Instruction counts are low enough that a "pure XG3" would likely
> result in Doom being around 1K larger (the cases where RVI ops are
> used would need a 64-bit jumbo encoding in XG3).
>
> Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
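The "dog-chewed" immediate handling the message complains about is easy to illustrate with the RVI S-type (store) format: the 12-bit displacement is split across two fields so that the register fields can stay in fixed positions. A small extraction sketch, with the field positions taken from the RISC-V spec:

```c
#include <stdint.h>

/* Extract the S-type immediate from an RVI instruction word:
   imm[11:5] sits in bits 31:25 and imm[4:0] in bits 11:7, whereas
   the I-type immediate is contiguous in bits 31:20. Every major
   format (I/S/B/U/J) scatters its immediate differently. */
static int32_t rvi_s_imm(uint32_t insn)
{
    int32_t imm = (int32_t)(((insn >> 25) & 0x7F) << 5) |
                  (int32_t)((insn >> 7) & 0x1F);
    if (imm & 0x800)
        imm -= 0x1000;   /* sign-extend from 12 bits */
    return imm;
}
```

For example, 0x00112423 (sw ra, 8(sp)) yields a displacement of 8. A decoder (or an XG3-style repacker) needs a distinct reassembly network like this per format, which is the source of the decoder-size annoyance noted above.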
(c) 1994, bbs@darkrealms.ca