Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 129,560 of 131,241    |
|    BGB to BGB    |
|    Re: Concertina III May Be Returning (1/2)    |
|    02 Sep 25 13:10:23    |
From: cr88192@gmail.com

On 9/2/2025 1:07 PM, BGB wrote:
> On 9/2/2025 4:15 AM, John Savard wrote:
>> On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
>>
>>> How about, say, 16/32/48/64/96:
>>> xxxx-xxxx-xxxx-xxx0                     // 16 bit
>>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 // 32 bit
>>> xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 // 64/48/96 bit prefix
>>>
>>> Already elaborate enough...
>>
>> Thank you for your interesting suggestions.
>>
>> I'm envisaging Concertina III as closely based on Concertina II, with
>> only minimal changes.
>>
>> Like Concertina II, it is to meet the overriding condition that
>> instructions do not have to be decoded sequentially. This means that
>> whenever an instruction, or group of instructions, spans more than 32
>> bits, the 32-bit areas of the instruction, other than the first, must
>> begin with a combination of bits that says "don't decode me".
>>
>> The first 32 bits of an instruction get decoded directly, and then
>> trigger and control the decoding of the rest of the instruction.
>>
>> This has the consequence that any immediate value that is 32 bits or
>> more in length has to be split up into smaller pieces; this is what I
>> really don't like about giving up the block structure.
>>
>
> Note that tagging like that described does still allow some amount of
> parallel decoding, since we still have combinatorial logic. Granted,
> scalability is an issue.
>
> As can be noted, my use of jumbo prefixes for large immediate values
> has the property of allowing 32-bit decoders to be reused for 64-bit
> and 96-bit instructions. In most cases, the 64-bit and 96-bit
> encodings don't change the instruction being decoded, but merely
> extend it.
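The 16/32/48/64/96 tagging scheme quoted above can be classified from the low bits of an encoding unit alone, which is what makes non-sequential decoding possible. A minimal sketch of that check, assuming the tag bits sit in the low 6 bits of each fetched word (the function and enum names are hypothetical, not from the post):

```c
#include <stdint.h>

/* Length classification for the quoted tagging scheme (sketch only):
   bit 0 clear          -> 16-bit op
   low 6 bits == 111111 -> prefix word of a 48/64/96-bit encoding
   otherwise            -> plain 32-bit op
   Note the yyyyy=11111 case of the 32-bit pattern is assumed to be
   reserved for the prefix, so the checks below do not collide. */
enum fetch_kind { FETCH_16, FETCH_32, FETCH_PREFIX };

static enum fetch_kind classify_word(uint32_t w)
{
    if ((w & 0x01) == 0)
        return FETCH_16;
    if ((w & 0x3F) == 0x3F)
        return FETCH_PREFIX;
    return FETCH_32;
}
```

Since the check touches only 6 bits, each lane can run it independently, which fits the "no sequential decode" constraint discussed above.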
>
> Some internal plumbing is needed to stitch the immediate values
> together, though. Typically:
>   We have OpA, OpB, OpC;
>   DecC gets OpC, and JBits from OpB;
>   DecB gets OpB, and JBits from OpA;
>   DecA gets OpA, and 0 for JBits.
>
> In my CPU core, I had a few times considered changing how decoding
> worked, to either reverse or right-align the instruction block to
> reduce the amount of MUX'ing needed in the decoder. If going for
> right-alignment, then DecC would always go to Lane1, DecB to Lane2,
> and DecA to Lane3.
>
> Can note that for immediate handling, the Lane1 decoder produces the
> low 33 bits of the result. If a decoder holds a jumbo prefix and is
> itself given a jumbo prefix, it assumes a 96-bit encoding and
> produces the value for the high 32 bits.
>
> At least in my designs, I only account for 33 bits of immediate per
> lane; when a full 64-bit immediate is encoded, its value is assembled
> in the ID2/RF stage.
>
>
> Though, admittedly, my CPU core design did fall back to sequential
> execution for 16-bit ops, partly for cost reasons.
>
> For BJX2/XG1 originally, this was because the instructions couldn't
> use WEX tagging, but after adding superscalar it was because I would
> either need multiple parallel 16-bit decoders, or need to change how
> 16-bit ops were handled (likely using a 16->32 repacker).
>
> So, say:
> IF stage:
>   Retrieve instruction from cache line;
>   Determine fetch length:
>     XG1/XG2 used explicit tagging;
>     XG3 and RV use superscalar checks.
>   Run repackers.
>     Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
> Decode Stage:
>   Decode N parallel 32-bit ops;
>   Prefixes route to the corresponding instructions;
>   Any lane holding solely a prefix goes NOP.
>
>
> For a repacker, it would help if there were fairly direct mappings
> between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
> appear to fit such a pattern. Personally, there isn't much good to
> say about RVC's encoding scheme, as it is very much ad-hoc dog chew.
>
> The usual claim is more that it is "compressed" in that you can first
> generate a 32-bit op internally and "squish" it down into a 16-bit
> form if it fits. This isn't terribly novel as I see it. Repacking RVC
> has similar problems to decoding it directly: for a fair number of
> instructions, nearly each one has its own idiosyncratic encoding
> scheme, and you can't simply shuffle some of the bits around and fill
> others with 0s to arrive back at a valid RVI instruction.
>
>
> Contrast, say, XG3, which is mostly XG2 with the bits shuffled
> around, though there were some special cases made in the decoding
> rules. Admittedly, I did do more than the bare minimum here (to fit
> it into the same encoding space as RV), mostly as I ended up going
> for a "Dog Chew Reduction" route rather than merely the bare minimum
> bit-shuffling needed to make it fit.
>
> For better or worse, this effectively made XG3 its own ISA as far as
> BGBCC is concerned. Even if in theory I could have used repacking,
> the original XG1/XG2 emitter logic is a total mess.
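The "generate a 32-bit op and squish it" relationship described above can be seen even in the simplest RVC instruction. Below is a sketch of a 16->32 repacker case for just C.ADDI, using the field positions from the RISC-V spec; the point of the complaint is that a full repacker needs a separate hand-built case like this for nearly every RVC encoding:

```c
#include <stdint.h>

/* Expand C.ADDI (quadrant 01, funct3=000) into the equivalent RVI
   ADDI rd, rd, imm. C.ADDI packs imm[5] at bit 12, rd at bits 11:7,
   and imm[4:0] at bits 6:2 of the 16-bit word. */
static uint32_t expand_c_addi(uint16_t c)
{
    uint32_t rd  = (c >> 7) & 0x1F;
    int32_t  imm = (int32_t)((((c >> 12) & 1u) << 5) | ((c >> 2) & 0x1F));
    if (imm & 0x20)
        imm -= 0x40;   /* sign-extend from 6 bits */

    /* ADDI layout: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 */
    return ((uint32_t)(imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0x13u;
}
```

For example, 0x0505 (c.addi a0, 1) expands to 0x00150513 (addi a0, a0, 1). Note how the immediate and register bits must be individually relocated rather than shifted as a block, which is the "ad-hoc" cost being complained about.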
> It was written originally for fixed-length 16-bit ops, so it encodes
> and outputs instructions 16 bits at a time (using big "switch()"
> blocks). The RISC-V and XG3 emitters also went this path; as far as
> BGBCC is concerned, it is treating XG3 as part of RISC-V land.
>
>
> Both the CPU core and JX2VM handle it by repacking to XG2, though.
> The XG3VM (userland-only emulator for now) instead decodes XG3
> directly, with decoders for XG3, RVI, and RVC.
>
> Had noted the relative irony that, despite XG3 having a longer
> instruction listing than RVI, it still ends up with a slightly
> shorter decoder.
>
> Some of this has to do with one big annoyance of RISC-V's encoding
> scheme: its inconsistent and dog-chewed handling of immediate and
> displacement values.
>
>
> Though, for mixed output, there are still a handful of cases where
> RVI encodings can beat XG3 encodings, mostly cases where the RVI
> encodings have a slightly larger displacement.
>
> In compiler stats, this seems to mostly affect:
>   LB, LBU, LW, LWU
>   SB, SW
>   ADDI, ADDIW, LUI
> For the loads/stores: unscaled 12-bit beats scaled 10-bit for 8-bit
> and 16-bit load/store.
> ADDI: 12 bits beats 10 bits.
> LUI: because loading a 32-bit value of the form XXXXX000 does happen
> sometimes, it seems.
>
> Instruction counts are low enough that a "pure XG3" would likely
> result in Doom being around 1K larger (the cases where RVI ops are
> used would need a 64-bit jumbo encoding in XG3).
>
> Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
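The "dog-chewed" immediate handling the message complains about is easy to illustrate with the RVI S-type (store) format: the 12-bit displacement is split across two fields so that the register fields can stay in fixed positions. A small extraction sketch, with the field positions taken from the RISC-V spec:

```c
#include <stdint.h>

/* Extract the S-type immediate from an RVI instruction word:
   imm[11:5] sits in bits 31:25 and imm[4:0] in bits 11:7, whereas
   the I-type immediate is contiguous in bits 31:20. Every major
   format (I/S/B/U/J) scatters its immediate differently. */
static int32_t rvi_s_imm(uint32_t insn)
{
    int32_t imm = (int32_t)(((insn >> 25) & 0x7F) << 5) |
                  (int32_t)((insn >> 7) & 0x1F);
    if (imm & 0x800)
        imm -= 0x1000;   /* sign-extend from 12 bits */
    return imm;
}
```

For example, 0x00112423 (sw ra, 8(sp)) yields a displacement of 8. A decoder (or an XG3-style repacker) needs a distinct reassembly network like this per format, which is the source of the decoder-size annoyance noted above.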
(c) 1994, bbs@darkrealms.ca