Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 129,568 of 131,241    |
|    BGB to John Savard    |
|    Re: Concedtina III May Be Returning    |
|    03 Sep 25 17:29:30    |
From: cr88192@gmail.com

On 9/2/2025 6:55 PM, John Savard wrote:
> On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:
>
>> Lest one thinks this results in serial decoding, consider that the
>> pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
>> afford to put this pattern decoder on every word in the inst-buffer
>
> Yes, given sufficiently simple decoding, one could allow backtracking when
> the second word of an instruction is decoded as if it was the first.
>
> Of course, though, it wastes electricity and produces heat, but a
> negligible amount, I agree.
>
> I'm designing my ISA, though, to make it simple to implement... in one
> specific sense. It's horribly large and complicated, but at least it
> doesn't demand that implementors understand any fancy tricks.
>

Usual strategy is, say, for each cache line:
Detect which 16-bit words represent 16 or 32 bit instructions, and which
are prefixes.

Logic isn't too unreasonable, except that it needs to deal with multiple
ISAs in my case; and so the current ISA mode needs to be visible to the
L1 I$, have time to settle, and may need to flush cache lines when the
ISA changes. The relevant tagging data exists alongside the cacheline
proper, and is resolved when each line arrives into the L1 (following an
I$ Miss). This has looser latency than determining instruction length
and similar during the IF stage proper (but, it is only reasonable to
store a few bits per word).

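As a rough illustration of the per-line tagging described above (a
software sketch, not the actual Verilog, and covering only the RV64GC
mode): on a line fill, a few flag bits can be computed for every 16-bit
word as if that word started an instruction. The function name, the
16-byte line used in the example, and the flag names in the returned
dicts are assumptions made for this sketch. The only rules used are the
standard RISC-V ones: low bits (1:0)==11 mark a 32-bit instruction, and
(6:0)==1111111 marks a longer encoding.

```python
def tag_rv64gc_line(line):
    """Compute per-halfword flags for one cache line (RV64GC mode only).

    Every 16-bit word is tagged as if it were the start of an
    instruction, since the line fill does not know where instructions
    actually begin. Fetch later picks the tags at the real PC.
    """
    if len(line) % 2 != 0:
        raise ValueError("cache line must hold whole 16-bit words")
    tags = []
    for off in range(0, len(line), 2):
        hw = int.from_bytes(line[off:off + 2], "little")
        tags.append({
            "is32": (hw & 0x03) == 0x03,  # (1:0)==11
            "isjx": (hw & 0x7F) == 0x7F,  # (6:0)==1111111
        })
    return tags


# Hypothetical 16-byte line: C.NOP (0x0001), then ADDI x0,x0,0
# (0x00000013), then zero padding.
example_line = bytes.fromhex("0100" "13000000") + bytes(10)
example_tags = tag_rv64gc_line(example_line)
```

Here the second 16-bit word gets is32 set (it is the low half of the
ADDI), while the first and third do not. The other ISA modes would need
their own patterns, and the tags would have to be recomputed (or the
line flushed) when the ISA mode changes.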
So, length determination:
  XG1:
    Is32  = (15:13)==111 || (15:12)==0111 || (15:12)==1001
    IsJX  = (15: 9)==1111_111
    IsWEX = ((15:12)==1111 && (10)==1) ||
            ((15:12)==1110 && (11:8)==1z1z)   //PrWEX
  XG2:
    Is32  = 1
    IsJX  = (12: 9)==1111
    IsWEX = ((12)==1 && (10)==1) ||
            ((12)==0 && (11:8)==1z1z)         //PrWEX
  XG3:
    Is32  = 1
    IsJX  =
      (5:0)== 11z10 ||    //XG3 prefix
      (6:0)==1111111      //RISC-V (longer instruction)
    IsWEX = 0
  RV64GC:
    Is32  = (1:0)==11
    IsJX  = (6:0)==1111111
    IsWEX = 0

For odd-numbered 16-bit words, only the XG1 and RV64GC cases are
relevant. Actual logic overlaps the logic for the ISA modes to some extent.

For IF, it is a case of MUX'ing the bits for the low order part of PC,
then feeding them through a "casez()" to arrive at the target length.

For XG3 and RV64GC, the IsWEX flag would instead be provided external to
the length-determination module by the superscalar logic. As noted, this
only works for 32-bit aligned instructions (always 0 if misaligned).

Here, one sub-module checks for register dependencies, and the other
checks for which instructions are allowed in which context. These are
used to determine a virtual WEX bit.

Typically, then, the IsJX and IsWEX bits are OR'ed together to get an
IsWJX bit, which is what is used during IF.

Implicitly, this adds another constraint based on PC(3:2) for
superscalar operation:
  00: 1-3 wide bundle
  01: 1-3 wide bundle
  10: 1/2 wide bundle
  11: scalar only
Though, this restriction is N/A for jumbo prefixes.

Superscalar can't infer across cache line boundaries as it doesn't
necessarily know what exists in the following cache line.
This situation would be less bad with 32B cache lines, but then you
would need twice as much logic for the superscalar checks. Similar
problem if trying to deal with misaligned instructions.

Also, trying to deal with RVC here would make it "kinda evil". Currently
the register-check logic only deals with RVI/RVI or XG3/XG3 pairs. Had
experimented with RVI/XG3 pairs, but this added a fair bit of additional
cost (and wasn't worth it, cheaper to assume "sparse mixing" of the
encodings).

More analysis would be needed to try to formalize the cost curve, but it
seems to be fairly steep in any case, so the number of possible paths
(between potential pairs of source and destination register ports) needs
to be kept as small as possible. Which in this case, was best served by
limiting things to fixed-length aligned-only and keeping RVI and XG3
instructions separate (in the case of a mixed pair, it always assumes
that register aliasing may exist).


Well, for similar reasons, "opcode fusion" as a general solution to ISA
level inefficiencies (in the CPU) would have a "stupidly bad" cost curve
(would likely make normal superscalar look "almost free" in comparison).

And, there are "less stupidly bad" possibilities, like trying to "hot
patch" the instruction-sequences at load-time.

Say, for example, we have an Indexed-Load instruction in the CPU, and a
"PNOP" (Special NOP designated to have a 0-cycle latency, vs the
implied 1-cycle cost for a normal NOP).

Then, say, program loader hot patches an SLLI+ADD+LD sequence into
PNOP+PNOP+LD_Ix.

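A minimal sketch of what such a loader pass might look like, working on
symbolic instructions rather than real encodings. Everything here is an
assumption for illustration: the Insn tuple, the LD_IX operand order,
and a fixed index scale of 8 (shift by 3) for the indexed load. To stay
safe without liveness analysis, it only matches the case where the
SLLI, ADD, and LD all write the same register, so the temporary is
overwritten by the load anyway.

```python
from collections import namedtuple

# Symbolic instruction: op, destination, two sources, immediate.
Insn = namedtuple("Insn", "op rd rs1 rs2 imm")

PNOP = Insn("PNOP", 0, 0, None, 0)
LD_IX_SHIFT = 3  # assumed: LD_IX scales the index by 8


def hot_patch(code):
    """Rewrite SLLI t,ri,3 ; ADD t,rb,t ; LD t,0(t) into
    PNOP ; PNOP ; LD_IX t,(rb,ri). Code layout is preserved."""
    out = list(code)
    i = 0
    while i + 3 <= len(out):
        sl, ad, ld = out[i], out[i + 1], out[i + 2]
        t = sl.rd
        ok = (sl.op == "SLLI" and sl.imm == LD_IX_SHIFT and t != 0
              and ad.op == "ADD" and ad.rd == t
              and t in (ad.rs1, ad.rs2)
              and ld.op == "LD" and ld.imm == 0
              and ld.rs1 == t and ld.rd == t)
        if ok:
            base = ad.rs2 if ad.rs1 == t else ad.rs1
            # The base must be the pre-shift value of some other register.
            ok = base != t
        if ok:
            out[i:i + 3] = [PNOP, PNOP,
                            Insn("LD_IX", t, base, sl.rs1, LD_IX_SHIFT)]
            i += 3
        else:
            i += 1
    return out


# Hypothetical input: x5 = x11 << 3; x5 = x10 + x5; x5 = mem[x5]
seq = [Insn("SLLI", 5, 11, None, 3),
       Insn("ADD", 5, 10, 5, 0),
       Insn("LD", 5, 5, None, 0)]
patched = hot_patch(seq)
```

Note that this does not check whether something branches into the
middle of the sequence; a real loader would need to rule that out (or
only patch sequences known not to contain a branch target), since a
jump to the former ADD would now see different behavior.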
But, does still mean the CPU needs to have the instruction, and there is
still a potential non-zero cost to the PNOPs (unless as a hack they
actively behave like they were a WEX'ed NOP; in which case it might be
illegal to have more than 2 PNOPs in a row on a 3-wide machine, ...).

...


Though, even with everything, superscalar might still be a better
general option than my older explicit WEX tagging system (say, by
allowing 2 and 3 wide implementations to share the same binaries without
a potentially steep performance penalty of needing to fall back to
scalar operation in the case of a pipeline width mismatch).


> John Savard

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca