comp.arch
Apparently more than just beeps & boops
131,241 messages
Message 129,568 of 131,241
BGB to John Savard
Re: Concedtina III May Be Returning
03 Sep 25 17:29:30
   From: cr88192@gmail.com   
      
   On 9/2/2025 6:55 PM, John Savard wrote:   
   > On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:   
   >   
   >> Lest one thinks this results in serial decoding, consider that the   
   >> pattern decoder is 40 gates (just larger than 3-flip-flops) so one can   
   >> afford to put this pattern decoder on every word in the inst-buffer
   >   
   > Yes, given sufficiently simple decoding, one could allow backtracking when   
   > the second word of an instruction is decoded as if it was the first.   
   >   
   > Of course, though, it wastes electricity and produces heat, but a   
   > negligible amount, I agree.   
   >   
   > I'm designing my ISA, though, to make it simple to implement... in one   
   > specific sense. It's horribly large and complicated, but at least it   
   > doesn't demand that implementors understand any fancy tricks.
   >   
      
   Usual strategy is, say, for each cache line:   
   Detect which 16-bit words represent 16 or 32 bit instructions, and which   
   are prefixes.   
      
      
   Logic isn't too unreasonable, except that it needs to deal with multiple   
   ISAs in my case; and so the current ISA mode needs to be visible to the   
   L1 I$, have time to settle, and may need to flush cache lines when the   
   ISA changes. The relevant tagging data exists alongside the cacheline   
   proper, and is resolved when each line arrives into the L1 (following an   
   I$ Miss). This has looser latency constraints than determining length and
   similar during the IF stage proper (but, it is only reasonable to store a
   few bits per word).
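
   As a rough illustration of the fill-time tagging (a Python sketch, not
   the actual Verilog): TinyICache, TaggedLine and toy_classify are made-up
   names, the 1-bit tag is a toy stand-in for the real per-word bits, and
   re-tagging on a mode mismatch is only one assumed way of handling an ISA
   switch.

```python
def toy_classify(word, mode):
    """Toy 1-bit tag: does this 16-bit word start a 32-bit op?"""
    if mode == "RV64GC":
        return (word & 0b11) == 0b11   # low two bits 11 => 32-bit encoding
    return True                        # toy fallback: treat all as 32-bit


class TaggedLine:
    def __init__(self, words, mode, classify):
        self.words = list(words)
        self.mode = mode               # ISA mode the tags were computed under
        self.tags = [classify(w, mode) for w in self.words]


class TinyICache:
    def __init__(self, backing, classify=toy_classify):
        self.backing = backing         # dict: line address -> 16-bit words
        self.classify = classify
        self.lines = {}
        self.fills = 0                 # counts line fills (misses + retags)

    def fetch(self, line_addr, mode):
        line = self.lines.get(line_addr)
        if line is None or line.mode != mode:
            # Miss, or tags are stale for this ISA mode: (re)fill and retag.
            line = TaggedLine(self.backing[line_addr], mode, self.classify)
            self.lines[line_addr] = line
            self.fills += 1
        return line
```

   The point is only that the tags are computed once per fill, off the IF
   critical path, and that they are meaningless once the ISA mode changes.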
      
   So, length determination:   
      XG1:   
        Is32 = (15:13)==111 || (15:12)==0111 || (15:12)==1001   
        IsJX = (15: 9)==1111_111   
        IsWEX = ((15:12)==1111 && (10)==1) ||   
          ((15:12)==1110 && (11:8)==1z1z)  //PrWEX   
      XG2:   
        Is32 = 1   
        IsJX = (12: 9)==1111   
        IsWEX = ((12)==1 && (10)==1) ||   
          ((12)==0 && (11:8)==1z1z)  //PrWEX   
      XG3:   
        Is32 = 1   
        IsJX =   
          (5:0)==  11z10 ||  //XG3 prefix   
          (6:0)==1111111     //RISC-V (longer instruction)   
        IsWEX = 0   
      RV64GC:   
        Is32 = (1:0)==11   
        IsJX = (6:0)==1111111   
        IsWEX = 0   
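
   For anyone wanting to poke at these rules in software, here is a direct
   transcription into Python for the XG1, XG2 and RV64GC cases (XG3 left
   out to keep it short). The helper names bits(), match() and classify()
   are invented for this sketch; match() mimics a casez-style compare where
   'z' is a don't-care.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] (inclusive), Verilog style."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def match(value, pattern):
    """casez-style compare: 'z' is don't-care, '_' is a visual separator."""
    pattern = pattern.replace("_", "")
    for i, ch in enumerate(reversed(pattern)):
        if ch != "z" and ((value >> i) & 1) != int(ch):
            return False
    return True


def classify(word, mode):
    """Return (is32, is_jx, is_wex) for one 16-bit word."""
    if mode == "XG1":
        is32 = (match(bits(word, 15, 13), "111") or
                match(bits(word, 15, 12), "0111") or
                match(bits(word, 15, 12), "1001"))
        is_jx = match(bits(word, 15, 9), "1111_111")
        is_wex = ((match(bits(word, 15, 12), "1111") and
                   bits(word, 10, 10) == 1) or
                  (match(bits(word, 15, 12), "1110") and
                   match(bits(word, 11, 8), "1z1z")))      # PrWEX
    elif mode == "XG2":
        is32 = True
        is_jx = match(bits(word, 12, 9), "1111")
        is_wex = ((bits(word, 12, 12) == 1 and bits(word, 10, 10) == 1) or
                  (bits(word, 12, 12) == 0 and
                   match(bits(word, 11, 8), "1z1z")))      # PrWEX
    elif mode == "RV64GC":
        is32 = match(bits(word, 1, 0), "11")
        is_jx = match(bits(word, 6, 0), "1111111")
        is_wex = False
    else:
        raise ValueError("mode not covered by this sketch: " + mode)
    return is32, is_jx, is_wex
```

   E.g. classify(0x0013, "RV64GC") gives (True, False, False), since the
   low two bits are 11 but the low seven bits are not all ones.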
      
   For odd-numbered 16-bit words, only the XG1 and RV64GC cases are   
   relevant. Actual logic overlaps the logic for the ISA modes to some extent.   
      
   For IF, it is a case of MUX'ing the bits for the low order part of PC,   
   then feeding them through a "casez()" to arrive at the target length.   
      
   For XG3 and RV64GC, the IsWEX flag would instead be provided external to   
   the length-determination module by the superscalar logic. As noted, this   
   only works for 32-bit aligned instructions (always 0 if misaligned).   
      
   Here, one sub-module checks for register dependencies, and the other   
   checks for which instructions are allowed in which context. These are   
   used to determine a virtual WEX bit.   
      
   Typically, then, the IsJX and IsWEX bits are OR'ed together to get an   
   IsWJX bit, which is what is used during IF.   
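
   A software model of how IF might walk these bits, as a sketch only. It
   assumes that a set IsWJX on a 32-bit word means "the next 32-bit word
   belongs to the same fetch bundle", that 16-bit ops never chain, and
   that a bundle is capped at three 32-bit words; bundle_length() and the
   tag layout are invented for illustration.

```python
MAX_BUNDLE_WORDS16 = 6  # assumed cap: three 32-bit instruction words


def bundle_length(tags, start):
    """Length, in 16-bit words, of the fetch bundle starting at `start`.

    tags[i] = (is32, is_wjx) for the 16-bit word at index i of the line;
    only entries at instruction starts are consulted.
    """
    pos = start
    while pos < len(tags):
        is32, is_wjx = tags[pos]
        pos += 2 if is32 else 1
        chained = is32 and is_wjx      # assumed: 16-bit ops never chain
        if not chained or pos - start >= MAX_BUNDLE_WORDS16:
            break
    return pos - start
```

   The loop also stops at the end of the line, so in this model a bundle
   never extends past the words whose tags are known.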
      
   Implicitly, this adds another constraint based on PC(3:2) for   
   superscalar operation:   
      00: 1-3 wide bundle   
      01: 1-3 wide bundle   
      10: 1/2 wide bundle   
      11: scalar only   
   Though, this restriction is N/A for jumbo prefixes.   
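
   The table above is just "however many 32-bit words remain in the line,
   capped at 3". A small Python sketch, assuming a 16-byte line (four
   32-bit slots, as the table implies) and a 3-wide machine;
   max_bundle_width() is a made-up name:

```python
MAX_WIDTH = 3        # assumed pipeline width
SLOTS_PER_LINE = 4   # 16-byte line / 4-byte instruction words


def max_bundle_width(pc):
    """Widest superscalar bundle allowed at this PC, per PC(3:2)."""
    slot = (pc >> 2) & 0b11
    return min(MAX_WIDTH, SLOTS_PER_LINE - slot)
```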
      
   Superscalar can't infer across cache line boundaries as it doesn't   
   necessarily know what exists in the following cache line. This situation   
   would be less bad with 32B cache lines, but then you would need twice as   
   much logic for the superscalar checks. Similar problem if trying to deal   
   with misaligned instructions.   
      
   Also, trying to deal with RVC here would make it "kinda evil". Currently   
   the register-check logic only deals with RVI/RVI or XG3/XG3 pairs. Had
   experimented with RVI/XG3 pairs, but this added a fair bit of additional   
   cost (and wasn't worth it, cheaper to assume "sparse mixing" of the   
   encodings).   
      
   More analysis would be needed to try to formalize the cost curve, but it   
   seems to be fairly steep in any case, so the number of possible paths   
   (between potential pairs of source and destination register ports) needs   
   to be kept as small as possible. Which in this case, was best served by   
   limiting things to fixed-length aligned-only and keeping RVI and XG3   
   instructions separate (in the case of a mixed pair, it always assumes   
   that register aliasing may exist).   
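
   To make the pair check concrete, a deliberately conservative Python
   sketch for the RVI/RVI case. It treats every instruction as if it were
   R-type (rd in 11:7, rs1 in 19:15, rs2 in 24:20), so immediates sitting
   in those fields can only cause false positives; it only checks against
   the first instruction's destination; and a mixed pair is always
   reported as aliasing. r_type(), reg_fields() and may_alias() are
   invented names, and an XG3/XG3 check would need its own field
   extraction.

```python
def r_type(rd, rs1, rs2, opcode=0x33):
    """Build a bare RISC-V R-type word (funct3/funct7 = 0, i.e. ADD)."""
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | opcode


def reg_fields(insn):
    """Return (rd, rs1, rs2) field values, assuming R-type layout."""
    return (insn >> 7) & 31, (insn >> 15) & 31, (insn >> 20) & 31


def may_alias(first, second, same_encoding=True):
    """True if `second` might depend on (or overwrite) first's result."""
    if not same_encoding:
        return True                    # mixed pair: always assume aliasing
    rd1, _, _ = reg_fields(first)
    if rd1 == 0:
        return False                   # x0 is hardwired to zero
    return rd1 in reg_fields(second)   # matches second's rd, rs1 or rs2
```

   Each extra encoding combination multiplies the number of field
   comparisons like these, which is where the cost goes.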
      
      
   Well, for similar reasons, "opcode fusion" as a general solution to ISA   
   level inefficiencies (in the CPU) would have a "stupidly bad" cost curve   
   (would likely make normal superscalar look "almost free" in comparison).   
      
   And, there are "less stupidly bad" possibilities, like trying to "hot   
   patch" the instruction-sequences at load-time.   
      
   Say, for example, we have an Indexed-Load instruction in the CPU, and a
   "PNOP" (Special NOP designated to have a 0-cycle latency, vs the
   implied 1-cycle cost for a normal NOP).
      
   Then, say, the program loader hot patches an SLLI+ADD+LD sequence into
   PNOP+PNOP+LD_Ix.   
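
   A toy version of such a loader pass, in Python over symbolic tuples
   rather than real encodings. Assumptions, all mine: LD_Ix rd, base, idx
   loads 8 bytes from base + (idx << 3); the rewrite is only done when the
   LD overwrites the temporary (so the skipped SLLI/ADD results are dead);
   and no branch lands in the middle of the sequence, which a real loader
   would have to prove. hot_patch() is a made-up name.

```python
PNOP = ("PNOP",)


def hot_patch(code):
    """Rewrite SLLI t,idx,3 / ADD t,base,t / LD t,0(t) as PNOP,PNOP,LD_Ix.

    Tuples: ("SLLI", rd, rs, shamt), ("ADD", rd, rs1, rs2),
            ("LD", rd, base, offset).
    """
    out = list(code)
    i = 0
    while i + 2 < len(out):
        a, b, c = out[i], out[i + 1], out[i + 2]
        if (a[0], b[0], c[0]) == ("SLLI", "ADD", "LD"):
            _, t, idx, shamt = a
            _, add_rd, s1, s2 = b
            _, ld_rd, ld_base, ld_off = c
            # The ADD must combine the shifted temp with some other base reg.
            base = s1 if s2 == t else (s2 if s1 == t else None)
            if (shamt == 3 and add_rd == t and base is not None
                    and base != t and ld_base == t and ld_off == 0
                    and ld_rd == t):
                out[i:i + 3] = [PNOP, PNOP, ("LD_Ix", ld_rd, base, idx)]
                i += 3
                continue
        i += 1
    return out
```

   Requiring the LD to overwrite the temporary is stricter than needed,
   but it avoids any liveness analysis in the loader.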
      
   But, does still mean the CPU needs to have the instruction, and there is   
   still a potential non-zero cost to the PNOPs (unless as a hack they   
   actively behave like they were a WEX'ed NOP; in which case it might be   
   illegal to have more than 2 PNOPs in a row on a 3-wide machine, ...).   
      
   ...   
      
      
   Though, even with everything, superscalar might still be a better   
   general option than my older explicit WEX tagging system (say, by   
   allowing 2 and 3 wide implementations to share the same binaries without   
   a potentially steep performance penalty of needing to fall back to   
   scalar operation in the case of a pipeline width mismatch).   
      
      
   > John Savard   
      