Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 129,563 of 131,241    |
|    BGB to John Savard    |
|    Re: Concertina III May Be Returning (1/2)    |
|    02 Sep 25 13:07:07    |
From: cr88192@gmail.com

On 9/2/2025 4:15 AM, John Savard wrote:
> On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
>
>> How about, say, 16/32/48/64/96:
>> xxxx-xxxx-xxxx-xxx0                      //16 bit
>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
>>
>> Already elaborate enough...
>
> Thank you for your interesting suggestions.
>
> I'm envisaging Concertina III as closely based on Concertina II, with only
> minimal changes.
>
> Like Concertina II, it is to meet the overriding condition that
> instructions do not have to be decoded sequentially. This means that
> whenever an instruction, or group of instructions, spans more than 32
> bits, the 32 bit areas of the instruction, other than the first, must
> begin with a combination of bits that says "don't decode me".
>
> The first 32 bits of an instruction get decoded directly, and then trigger
> and control the decoding of the rest of the instruction.
>
> This has the consequence that any immediate value that is 32 bits or more
> in length has to be split up into smaller pieces; this is what I really
> don't like about giving up the block structure.

Note that tagging like that described does still allow some amount of
parallel decoding, since we still have combinatorial logic. Granted,
scalability is an issue.

As can be noted, my use of jumbo prefixes for large immediate values
does have the property of allowing 32-bit decoders to be reused for 64-bit
and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings
don't change the instruction being decoded, but merely extend it.
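The length-tagging scheme quoted at the top can be sketched as a small classifier; this is a reading of the bit patterns as written in the quote, not a real decoder for any of these ISAs:

```c
#include <stdint.h>

/* Sketch of the quoted 16/32/48+ tagging scheme:
   bit 0 clear            -> 16-bit op   (xxxx-xxxx-xxxx-xxx0)
   low 6 bits all set     -> 48/64/96-bit prefix form (length resolved
                             by later prefix bits, not modeled here)
   otherwise (bit 0 set)  -> 32-bit op   (...-xxyy-yyy1) */
static int fetch_len_bits(uint16_t first_hword) {
    if ((first_hword & 0x0001) == 0)
        return 16;
    if ((first_hword & 0x003F) == 0x003F)
        return 48;
    return 32;
}
```

Since the tag lives entirely in the first 16-bit parcel, a fetch unit can evaluate this in parallel across all parcels of a fetch block with pure combinatorial logic, which is the point being made about tagging still permitting parallel decode.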
Some internal plumbing is needed to stitch the immediate values together
though, typically:
  We have OpA, OpB, OpC;
  DecC gets OpC, and JBits from OpB;
  DecB gets OpB, and JBits from OpA;
  DecA gets OpA, and 0 for JBits.

In my CPU core, I had a few times considered changing how decoding
worked, to either reverse or right-align the instruction block to reduce
the amount of MUX'ing needed in the decoder. If going for
right-alignment, then DecC would always go to Lane1, DecB to Lane2, and
DecA to Lane3.

Can note that for immediate handling, the Lane1 decoder produces the low
33 bits of the result. If a decoder has a jumbo prefix and is itself
given a jumbo prefix, it assumes a 96-bit encoding and produces the
value for the high 32 bits.

At least in my designs, I only account for 33 bits of immediate per
lane. Instead, when a full 64-bit immediate is encoded, its value is
assembled in the ID2/RF stage.

Though, admittedly, my CPU core design did fall back to sequential
execution for 16-bit ops, but this was partly for cost reasons.

For BJX2/XG1 originally, it was because the instructions couldn't use
WEX tagging, but after adding superscalar support it was because I would
either need multiple parallel 16-bit decoders, or to change how 16-bit
ops were handled (likely using a 16->32 repacker).

So, say:
  IF stage:
    Retrieve instructions from the cache line;
    Determine fetch length:
      XG1/XG2 used explicit tagging;
      XG3 and RV use superscalar checks.
    Run repackers.
      Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
  Decode stage:
    Decode N parallel 32-bit ops;
    Prefixes route to the corresponding instructions;
    Any lane holding solely a prefix goes NOP.
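The immediate stitching at the ID2/RF stage can be sketched roughly as below; the exact field split is an assumption (Lane1 contributing up to 33 bits, the high decoder contributing the upper 32, with the high bits taking priority over the low value's sign extension):

```c
#include <stdint.h>

/* Hypothetical sketch of 64-bit immediate assembly: for a 96-bit
   (jumbo+jumbo+op) encoding, Lane1 supplies the low bits of the
   immediate and the other decoder supplies the high 32 bits, joined
   late in the ID2/RF stage rather than inside any single lane decoder. */
static uint64_t assemble_imm64(uint64_t lane1_low33, uint32_t high32) {
    /* The high 32 bits replace bits 63:32 (including Lane1's
       33rd/sign-extension bit, which only matters when no high
       half is present). */
    return (lane1_low33 & 0xFFFFFFFFull) | ((uint64_t)high32 << 32);
}
```

Keeping each lane decoder limited to 33 bits keeps the per-lane MUXing small; only the late assembly step ever sees the full 64-bit value.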
For a repacker, it would help if there were fairly direct mappings
between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
appear to fit such a pattern. Personally, there isn't much good to say
about RVC's encoding scheme, as it is very much ad-hoc dog chew.

The usual claim is more that it is "compressed" in that you can first
generate a 32-bit op internally and "squish" it down into a 16-bit form
if it fits. This isn't terribly novel as I see it. Repacking RVC has
similar problems to decoding it directly, namely that for a fair number
of instructions, nearly each instruction has its own idiosyncratic
encoding scheme (you can't simply shuffle some of the bits around,
fill others with 0s, and arrive back at a valid RVI instruction).

Contrast, say, XG3, which is mostly XG2 with the bits shuffled around,
though there were some special cases made in the decoding rules. Though,
admittedly, I did do more than the bare minimum here (to fit it into the
same encoding space as RV), mostly as I ended up going for a "dog chew
reduction" route rather than merely the bare minimum bit-shuffling
needed to make it fit.

For better or worse, this effectively made XG3 its own ISA as far as
BGBCC is concerned. Even if in theory I could have used repacking, the
original XG1/XG2 emitter logic is a total mess. It was written
originally for fixed-length 16-bit ops, so it encodes and outputs
instructions 16 bits at a time (using big "switch()" blocks; the
RISC-V and XG3 emitters also went this path, and as far as BGBCC is
concerned, it is treating XG3 as part of RISC-V land).

Both the CPU core and also JX2VM handle it by repacking to XG2 though.
For the XG3VM (a userland-only emulator for now), it instead decodes XG3
directly, with decoders for XG3, RVI, and RVC.
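For reference, one of the more regular RVC cases can be repacked mechanically; the sketch below expands C.ADDI into RVI ADDI per the published RISC-V spec. The complaint above is that many other RVC forms scramble their immediate bits differently, so each one needs its own variant of this routine:

```c
#include <stdint.h>

/* 16->32 repack of RVC's C.ADDI (CI format) into RVI ADDI rd, rd, imm.
   C.ADDI fields: imm[5] at bit 12, rd/rs1 at bits 11:7,
                  imm[4:0] at bits 6:2, funct3=000, quadrant 01.
   RVI ADDI:      imm[11:0] at 31:20, rs1 at 19:15, funct3=000,
                  rd at 11:7, opcode 0010011 (OP-IMM). */
static uint32_t repack_c_addi(uint16_t c) {
    uint32_t rd  = (c >> 7) & 0x1F;
    int32_t  imm = (((c >> 12) & 1) << 5)    /* imm[5] */
                 | ((c >> 2) & 0x1F);        /* imm[4:0] */
    imm = (imm << 26) >> 26;                 /* sign-extend 6 bits */
    return ((uint32_t)(imm & 0xFFF) << 20)
         | (rd << 15)                        /* rs1 = rd */
         | (rd << 7)
         | 0x13;                             /* OP-IMM, funct3=000 */
}
```

For example, `c.addi a0, -1` (0x157D) repacks to `addi a0, a0, -1` (0xFFF50513). When every compressed op followed a pattern this regular, a repacker would be cheap; RVC's problem is that it mostly doesn't.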
Had noted the relative irony that despite XG3 having a longer
instruction listing (than RVI), it still ends up with a slightly shorter
decoder.

Some of this has to do with one big annoyance of RISC-V's encoding
scheme: its inconsistent and dog-chewed handling of immediate and
displacement values.

Though, for mixed output, there are still a handful of cases where RVI
encodings can beat XG3 encodings, mostly involving cases where the RVI
encodings have a slightly larger displacement.

In compiler stats, this seems to mostly affect:
  LB, LBU, LW, LWU
  SB, SW
  ADDI, ADDIW, LUI
The former:
  unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store;
  ADDI: 12b > 10b;
  LUI: because loading a 32-bit value of the form XXXXX000 does happen
  sometimes, it seems.

Instruction counts are low enough that a "pure XG3" would likely result
in Doom being around 1K larger (the cases where RVI ops are used would
need a 64-bit jumbo encoding in XG3).

Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
in that this effectively gives it an Imm11s encoding; and ADD is one of
the main instructions that tends to be big-immediate-heavy (in early
design it was a close race between ADD ImmU/ImmN vs ADD/SUB ImmU, but
the current scheme has a tiny bit more range, albeit SUB-ImmU could have
possibly avoided the need for an ImmN case).

So, say:

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
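The displacement-range tradeoff above can be put in numbers; the field widths are as described in the post, but the exact signed/unsigned and scaling choices here are assumptions for illustration:

```c
#include <stdint.h>

/* Maximum positive byte reach of a load/store displacement field:
   'bits' wide, signed or unsigned, scaled by the access size. */
static int64_t reach_max(int bits, int is_signed, int scale) {
    int64_t maxv = is_signed ? ((1LL << (bits - 1)) - 1)
                             : ((1LL << bits) - 1);
    return maxv * scale;
}
/* RVI LB:  12-bit signed, unscaled      -> reach 2047 bytes.
   10-bit unscaled byte load (XG3-style) -> reach 1023 bytes,
   which is why unscaled 12-bit wins for 8/16-bit load/store.
   For 32-bit loads the scaled form catches up: 10 bits * 4 -> 4092.
   Similarly, ADD's separate Imm10u (0..1023) and Imm10n forms together
   cover roughly what a single 11-bit signed field (Imm11s) would. */
```

This is the whole story behind the compiler stats: the only RVI wins are exactly the cases where scaling does not help (byte/halfword accesses and plain ADDI), plus the LUI pattern.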
(c) 1994, bbs@darkrealms.ca