comp.arch

Message 130,576 of 131,241

BGB to Anton Ertl

Re: Variable-length instructions (2/2)

19 Dec 25 15:07:38

   [continued from previous message]   
      
   Or, at least, 96-bit ops are not frequent enough to make superscalar   
   worthwhile (but may still be "moderate frequency" on the grand scale,   
   mostly for Imm64 ops and similar).   
      
      
      
   > OTOH, on AMD64/IA-32 Intel's recent E-Cores do not use an uop cache   
   > either, but instead the most recent instances have 3 decoders each of   
   > which can decode 3 instructions per cycle (i.e., they attempt to   
   > decode at many more positions and then select 3 per cycle out of   
   > those); so apparently even byte-oriented variable-length encoding can   
   > be decoded quickly enough.   
   >   
      
   It is possible with x86 / x86-64, just ugly and kinda expensive.   
      
   Likely:   
      Note lengths for a Mod/RM at each position (1..6);   
      Note lengths for an opcode at each position (0..3);   
      Note lengths for a prefix at each position (0..3).   
      
   Then say:   
      Prefix Length (Lp) at RIP (Potentially 0..5, usually 0/1);   
      Opcode Length (Lo) at RIP+Lp (Usually 1 or 2);   
      Mod/Rm Length (Lrm) at RIP+Lp+Lo (1..6);   
      Add 4 or 8 if Opcode has an Immediate.   
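      
   As a rough sketch of the scheme above (not how any real decoder is
   built): the per-position notes are computed for every byte offset
   independently, and the length at a given start offset is then just a
   few lookups and adds. Everything here is an assumption for
   illustration: the names `PREFIXES`, `ONE_BYTE_OPS`, `note_lengths` and
   `instruction_length` are made up, only a handful of one-byte opcodes
   are covered, 64-bit mode with 32/64-bit addressing is assumed, and
   length-changing prefixes (0x66/0x67) are left out entirely.

```python
# Toy subset only: REX, segment, LOCK and REP prefixes (none change length here).
PREFIXES = set(range(0x40, 0x50)) | {0xF0, 0xF2, 0xF3,
                                     0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}

# opcode byte -> (has_modrm, immediate_bytes); hand-picked illustrative subset
ONE_BYTE_OPS = {
    0x01: (True, 0),   # ADD r/m, r
    0x89: (True, 0),   # MOV r/m, r
    0x8B: (True, 0),   # MOV r, r/m
    0x83: (True, 1),   # group 1 r/m, imm8
    0x81: (True, 4),   # group 1 r/m, imm32
    0x90: (False, 0),  # NOP
    0xC3: (False, 0),  # RET
}

def modrm_length(buf, i):
    """Bytes used by ModRM + optional SIB + displacement if a ModRM sat at i."""
    mod, rm = buf[i] >> 6, buf[i] & 7
    if mod == 3:                       # register-direct: ModRM byte only
        return 1
    length = 1
    base_is_5 = False
    if rm == 4:                        # SIB byte follows
        length += 1
        if i + 1 < len(buf):
            base_is_5 = (buf[i + 1] & 7) == 5
    if mod == 0:
        if rm == 5 or base_is_5:       # disp32 without a base register
            length += 4
    elif mod == 1:
        length += 1                    # disp8
    else:
        length += 4                    # disp32
    return length

def note_lengths(buf):
    """Speculative notes for every position (parallel in hardware, loops here)."""
    n = len(buf)
    prefix_run = [0] * (n + 1)
    for i in range(n - 1, -1, -1):     # run of prefix bytes starting at i
        if buf[i] in PREFIXES:
            prefix_run[i] = prefix_run[i + 1] + 1
    opcode = [ONE_BYTE_OPS.get(b) for b in buf]
    modrm = [modrm_length(buf, i) for i in range(n)]
    return prefix_run, opcode, modrm

def instruction_length(notes, pos):
    """Lp + Lo + Lrm + immediate, or None if outside the toy subset."""
    prefix_run, opcode, modrm = notes
    lp = prefix_run[pos]
    op_at = pos + lp
    if op_at >= len(opcode) or opcode[op_at] is None:
        return None
    has_modrm, imm = opcode[op_at]
    lo = 1                             # only one-byte opcodes in this sketch
    lrm = 0
    if has_modrm:
        if op_at + lo >= len(modrm):
            return None
        lrm = modrm[op_at + lo]
    return lp + lo + lrm + imm

def walk(buf):
    """Chain the lengths from offset 0 to find instruction boundaries."""
    notes = note_lengths(buf)
    pos, lengths = 0, []
    while pos < len(buf):
        n = instruction_length(notes, pos)
        if n is None:
            break
        lengths.append(n)
        pos += n
    return lengths

# mov rbp,rsp ; sub rsp,0x20 ; mov rax,[rsp+8] ; ret
code = bytes.fromhex("4889e5 4883ec20 488b442408 c3")
lengths = walk(code)   # [3, 4, 5, 1]
```

   Note that the 0x44 ModRM byte in the third instruction also gets
   noted as a possible REX prefix; the notes are speculative, and only
   the ones reached from a real instruction start end up being used.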
      
   One trick here could be to precompute a lot of this when fetching cache   
   lines, though a full instruction length could not be determined at fetch   
   time if the instruction crosses a cache line, unless we have also fetched   
   the next cache line. Full instruction length could be determined in   
   advance (at fetch time) if the fetch always reads both cache lines and   
   then determines the lengths for the first of them before writing it to   
   the cache (if the next line is fetched as well, its contents are possibly   
   not written to the cache, as its lengths can't be fully determined yet).   
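      
   A minimal sketch of the fetch-time idea, using a made-up toy encoding
   rather than x86 (assumption: a first byte >= 0x80 means a second
   "mode" byte follows whose low 2 bits give the number of extra bytes).
   The names `toy_length` and `predecode_line` and the 8-byte line size
   are hypothetical. It only shows the one property that matters here: a
   length near the end of a line stays unknown until the next line's
   bytes are available.

```python
LINE_BYTES = 8  # hypothetical line size, kept small for illustration

def toy_length(window):
    """Length of a toy instruction starting at window[0], or None if the
    bytes needed to decide it are not in the window."""
    if not window:
        return None
    if window[0] < 0x80:          # short form: a single byte
        return 1
    if len(window) < 2:           # mode byte lives in the next line
        return None
    return 2 + (window[1] & 3)    # first byte + mode byte + 0..3 extra bytes

def predecode_line(line, next_line=None):
    """Per-position lengths for `line`, peeking into `next_line` if given.
    Only the lengths for `line` are produced; `next_line` is just a peek."""
    window = line + (next_line or b"")
    return [toy_length(window[i:]) for i in range(len(line))]

line = bytes([0x01, 0x80, 0x02, 0x00, 0x00, 0x00, 0x00, 0x80])
nxt = bytes([0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])

alone = predecode_line(line)        # last entry is None: length unknown
paired = predecode_line(line, nxt)  # last entry is 5 once the next line is seen
```

   With both lines fetched, every position in the first line gets a
   length, so that line can be written to the cache with its lengths
   attached; the second line would need its own successor before it
   could be treated the same way.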
      
      
   At this stage, I sorta have an idea how one could implement an x86 core,   
   but I am not particularly inclined to do so.   
      
   Even if one decodes x86 efficiently, there are a few other drawbacks:   
      2-register encodings (destination doubles as a source) for everything;   
      excessive numbers of memory accesses (particularly to the stack);   
      ...   
      
   Would still be hard pressed to make the performance good absent ugly   
   tricks like resorting to OoO.   
      
      
   > - anton   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)
