Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.arch    |    Apparently more than just beeps & boops    |    131,241 messages    |
|    Message 130,576 of 131,241    |
|    BGB to Anton Ertl    |
|    Re: Variable-length instructions (2/2)    |
|    19 Dec 25 15:07:38    |
[continued from previous message]

Or, at least, 96-bit ops are not frequent enough to make superscalar
worthwhile (but may still be "moderate frequency" on the grand scale,
mostly for Imm64 ops and similar).

> OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache
> either, but instead the most recent instances have 3 decoders each of
> which can decode 3 instructions per cycle (i.e., they attempt to
> decode at many more positions and then select 3 per cycle out of
> those); so apparently even byte-oriented variable-length encoding can
> be decoded quickly enough.
>

It is possible with x86 / x86-64, just ugly and kinda expensive.

Likely:
  Note lengths for a Mod/RM at each position (1..5);
  Note lengths for an opcode at each position (0..3);
  Note lengths for a prefix at each position (0..3).

Then say:
  Prefix Length (Lp) at RIP (Potentially 0..5, usually 0/1);
  Opcode Length (Lo) at RIP+Lp (Usually 1 or 2);
  Mod/RM Length (Lrm) at RIP+Lp+Lo (1..6);
  Add 4 or 8 if Opcode has an Immediate.

One trick here could be to precompute a lot of this when fetching cache
lines, though a full instruction length could not be determined at fetch
time if the instruction crosses a cache line, unless we have also fetched
the next cache line. Full instruction length could be determined in
advance (at fetch time) if the fetch always reads both cache lines and
then determines the lengths for one of them before writing it to the
cache (possibly, when the next line is fetched this way, its contents are
not written to the cache, as its lengths can't be fully determined yet).


At this stage, I sorta have an idea how one could implement an x86 core,
but am not particularly inclined to do so.

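For illustration, the Lp + Lo + Lrm + immediate sum described above can be sketched in C roughly as follows. This is a software toy, not anyone's actual decoder: the names (is_prefix, modrm_len, insn_len, predecode_line) are made up for this example, the opcode table covers only a handful of 64-bit-mode opcodes, and anything outside it is reported as "unknown" (0). Immediate sizes of 1 and 2 bytes are also handled, since the few opcodes chosen here need them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Legacy prefix, or REX (0x40..0x4F, 64-bit mode assumed). */
static int is_prefix(uint8_t b) {
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E:  /* segment overrides */
    case 0x64: case 0x65:                        /* FS / GS */
    case 0x66: case 0x67:                        /* operand / address size */
    case 0xF0: case 0xF2: case 0xF3:             /* LOCK / REPNE / REP */
        return 1;
    }
    return (b & 0xF0) == 0x40;                   /* REX */
}

/* Lrm: ModRM + optional SIB + displacement (1..6), 0 if not determinable. */
static int modrm_len(const uint8_t *p, size_t n, size_t i) {
    if (i >= n) return 0;
    uint8_t mod = p[i] >> 6, rm = p[i] & 7;
    if (mod == 3) return 1;                      /* register form */
    int len = 1;
    if (rm == 4) {                               /* SIB byte follows */
        if (i + 1 >= n) return 0;
        len += 1;
        if (mod == 0 && (p[i + 1] & 7) == 5) len += 4;  /* base=101: disp32 */
    } else if (mod == 0 && rm == 5) {
        len += 4;                                /* RIP-relative disp32 */
    }
    if (mod == 1) len += 1;                      /* disp8 */
    if (mod == 2) len += 4;                      /* disp32 */
    return len;
}

/* Total length = Lp + Lo + Lrm + imm, or 0 if unknown / not determinable.
   The immediate itself is not required to lie inside the n-byte window. */
static int insn_len(const uint8_t *p, size_t n, size_t rip) {
    assert(p != NULL && rip <= n);
    size_t i = rip;
    int rex_w = 0, opsz16 = 0;
    while (i < n && is_prefix(p[i])) {           /* Lp */
        if ((p[i] & 0xF8) == 0x48) rex_w = 1;    /* REX.W */
        if (p[i] == 0x66) opsz16 = 1;
        i++;
    }
    if (i >= n) return 0;
    int lp = (int)(i - rip), lo = 1, has_modrm = 0, imm = 0;
    uint8_t op = p[i];
    if (op == 0x0F) {                            /* two-byte opcode */
        if (i + 1 >= n) return 0;
        lo = 2;
        if (p[i + 1] == 0xAF) has_modrm = 1;     /* IMUL r, r/m */
        else return 0;
    } else if (op == 0x01 || op == 0x89 || op == 0x8B) {
        has_modrm = 1;                           /* ADD / MOV with ModRM */
    } else if (op == 0x83) {
        has_modrm = 1; imm = 1;                  /* group 1, imm8 */
    } else if (op == 0x81) {
        has_modrm = 1; imm = (opsz16 && !rex_w) ? 2 : 4;
    } else if ((op & 0xF8) == 0xB8) {            /* MOV r, imm */
        imm = rex_w ? 8 : (opsz16 ? 2 : 4);
    } else if (op == 0xE8) {
        imm = 4;                                 /* CALL rel32 */
    } else if (op != 0x90 && op != 0xC3) {       /* NOP, RET are known */
        return 0;
    }
    int lrm = 0;
    if (has_modrm) {
        lrm = modrm_len(p, n, i + (size_t)lo);
        if (lrm == 0) return 0;
    }
    return lp + lo + lrm + imm;
}

/* Walk one fetched line, recording lengths at instruction starts.
   Returns where walking stopped: < n means the instruction at that
   offset could not be sized from this line alone (or is unknown). */
static size_t predecode_line(const uint8_t *p, size_t n, size_t start,
                             uint8_t *len_at) {
    size_t rip = start;
    while (rip < n) {
        int len = insn_len(p, n, rip);
        if (len == 0) break;
        len_at[rip] = (uint8_t)len;
        rip += (size_t)len;
    }
    return rip;
}
```

As a usage example under the same assumptions: for the bytes 48 89 E5 (mov rbp, rsp), insn_len sees Lp=1, Lo=1, Lrm=1 and no immediate, so it returns 3. predecode_line is a serial stand-in for the fetch-time precompute idea: it stops early when the bytes that determine a length (opcode, ModRM, SIB) fall past the end of the line, which is the cross-line case mentioned above. Hardware would instead compute the per-position lengths in parallel and select among them.
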
Even if one decodes x86 efficiently, there are a few other drawbacks:
  2-register encodings for everything;
  excessive numbers of memory accesses (particularly to the stack);
  ...

Would still be hard pressed to make the performance good absent ugly
tricks like resorting to OoO.


> - anton

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
(c) 1994, bbs@darkrealms.ca