

   comp.arch      Apparently more than just beeps & boops      131,241 messages   


   Message 130,562 of 131,241   
   BGB to All   
   Re: trap and emulate, Lessons from the A   
   18 Dec 25 14:41:31   
   
   From: cr88192@gmail.com   
      
   On 12/17/2025 6:43 PM, Lawrence D’Oliveiro wrote:   
   > On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:   
   >   
   >> Lawrence D’Oliveiro posted:
   >>   
   >>> On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:   
   >>>   
   >>>> Misaligned access is common enough here that, if it were not   
   >>>> supported natively, this would likely tank performance...   
   >>>   
   >>> Still there are/were some architectures that refused to support it.   
   >>   
   >> There are smart fools everywhere.   
   >   
   > No doubt another lesson learned from instruction traces: misaligned   
   > accesses occurred so rarely, it made sense to simplify the hardware by   
   > leaving them out.   
   >   
      
   They occur rarely, or not at all, if avoided.   
      
   They seem to occur (on average) for 1-5% of the loads/stores if the code   
   makes use of them in cases where doing so would be beneficial.   
      
   This matters because the naive/portable approaches can often be
   unacceptably slow.
      
   Granted, dealing with misaligned access does add cost and complexity to   
   the L1 D$ (particularly due to the case of dealing with misaligned   
   values crossing a cache-line boundary).   
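
   As a rough illustration of the naive/portable vs. native trade-off
   above, here is a minimal C sketch. The function names are made up for
   this example and are not from any particular codebase. The first
   version assembles a little-endian value one byte at a time; the second
   uses memcpy, which compilers commonly reduce to a single load on
   targets that support misaligned access natively.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: portable byte-at-a-time little-endian load.
   Works at any alignment on any target, but costs four loads plus
   shifts/ORs unless the compiler manages to merge them. */
static uint32_t load_u32_bytes(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    assert(b != NULL);
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: memcpy-based load. Well-defined at any
   alignment; the result is in host byte order. */
static uint32_t load_u32_memcpy(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

   Either helper can be pointed at an odd address (say, buf + 1) without
   invoking undefined behavior; which one is faster depends on whether
   the hardware handles the misaligned case itself.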
      
      
   > The same conclusion was drawn about integer multiplication and   
   > division in those early days, wasn’t it.   
      
      
   Some timings from my case:   
      32 bit multiply (3 cycle latency):   
        0.70% of clock cycles;   
      MOD and DIV (36 and 68 cycles):   
        0.36%   
      
   Kinda need multiply, but divide is a bit more optional, and doesn't need   
   to be all that fast.   
      
   However, if DIV and MOD were implemented using traps, they are common
   enough that there would still be reason to care (it would have an
   obvious performance impact).
      
   In the absence of hardware DIV/MOD, the better option is mostly to   
   handle it with runtime calls.   
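
   For reference, a runtime-call divide could look something like the
   following. This is only a sketch of the classic restoring
   shift-subtract method, not the routine actually used here; the name
   soft_udivmod32 is hypothetical, and real runtimes often use faster
   approaches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical runtime helper: unsigned 32-bit divide/modulo using
   restoring shift-subtract, one quotient bit per iteration.
   Returns the quotient; stores the remainder via 'rem' if non-NULL. */
static uint32_t soft_udivmod32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64-bit so that (r << 1) cannot overflow */
    int i;

    assert(d != 0);   /* divide-by-zero policy is left to the caller */
    for (i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    if (rem)
        *rem = (uint32_t)r;
    return q;
}
```

   The compiler would emit a direct call to a helper like this in place
   of the DIV/MOD instruction, which avoids the trap entry/exit and
   instruction-decode overhead of the emulation path.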
      
      
      
   Otherwise, went and added some special case logic to allow for   
   transparent hot-patching via TLB trickery. Was actually simpler/cheaper   
   than it seemed at first.   
      
   Does have some restrictions though, namely in that it will only work   
   with read-only pages.   
      
   Ended up adding a special case with the Dirty flag:   
      D+NR+NW: X only, no hit for D$   
      D+NX+NW: R only, no hit for I$   
   But:   
      D+NX: Normal Read/Write Memory   
      D (alone): Normal R/W/X Memory
      
   The D / Dirty flag is used for PTEs, but not changed by HW. In effect
   its main use would be more for write barriers (setting up and handling a   
   trap the first time the page is written to).   
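
   The flag combinations listed above can be written out as a small C
   model. This is a sketch only: the bit positions are invented for the
   example (they are not the real PTE layout), and combinations not
   listed above are simply treated as "normal" here, which is an
   assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
enum {
    PTE_NR = 1u << 0,   /* no-read    */
    PTE_NW = 1u << 1,   /* no-write   */
    PTE_NX = 1u << 2,   /* no-execute */
    PTE_D  = 1u << 3    /* dirty      */
};

enum { ACC_R = 1, ACC_W = 2, ACC_X = 4 };

#define PTE_MASK (PTE_NR | PTE_NW | PTE_NX | PTE_D)

/* Permitted access kinds, taken directly from the NR/NW/NX flags. */
static unsigned pte_access(uint32_t pte)
{
    unsigned acc = 0;
    if (!(pte & PTE_NR)) acc |= ACC_R;
    if (!(pte & PTE_NW)) acc |= ACC_W;
    if (!(pte & PTE_NX)) acc |= ACC_X;
    return acc;
}

/* D+NR+NW: execute-only entry, must not produce a hit for the D$. */
static int tlb_hits_dcache(uint32_t pte)
{
    return (pte & PTE_MASK) != (PTE_D | PTE_NR | PTE_NW);
}

/* D+NX+NW: read-only entry, must not produce a hit for the I$. */
static int tlb_hits_icache(uint32_t pte)
{
    return (pte & PTE_MASK) != (PTE_D | PTE_NX | PTE_NW);
}
```

   With two entries for the same read-only page, one of each special
   kind, instruction fetch and data reads can be steered to different
   backing pages, which is what makes the patching transparent.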
      
      
      
   Also the issue that one can't encode a branch to an arbitrary address in
   32 bits. In effect, hot-patching in this way would need somewhere to
   branch to that could be put within the window of what is reachable.
      
   For plain RISC-V, there is another issue:   
   There is no way to do longer distance branches that won't stomp a register.
      
   Jumbo prefixes and XG3 at least allow other options:   
      XG3's branches have a 32MB range, so more likely to be able to
        reach something as most binaries are not that large (but, N/E in
        RV64GC mode);
      Jumbo Prefixes: Can encode a +/- 4GB branch in 64 bits (but then
        needs to patch at least 2 instructions).
      
   Another issue being that any such logic needs to be able to operate with   
   zero free registers, so at least in this sense isn't much better off   
   than an interrupt handler. But, main difference being that any hot   
   patching doesn't need to decode an instruction and can be a special   
   sequence representing the instructions that originally generated the   
   trap (rather than a general purpose handler).   
      
   In the relevant ABIs, could assume that memory below SP is always safe   
   to use though (to save/restore any working registers).   
      
      
      
   In principle, one could put the hot-patching area before the loaded
   binary, but generally this would only be usable (in RV64GC or similar)   
   if ".text" is somewhat less than 1MB.   
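
   The 1MB figure follows from the standard JAL encoding: a 21-bit
   signed offset with bit 0 implied zero, so roughly +/- 1MB from the
   patched instruction. A minimal C sketch of the reach check and the
   J-type encoding follows; the function names are made up for this
   example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: can a single JAL at 'pc' reach 'target'?
   Valid offsets are even and lie in -1 MiB .. +1 MiB - 2. */
static int jal_reachable(uint64_t pc, uint64_t target)
{
    int64_t off = (int64_t)(target - pc);
    return off >= -(1 << 20) && off < (1 << 20) && (off & 1) == 0;
}

/* Hypothetical helper: encode "JAL rd, target" placed at 'pc',
   using the standard RISC-V J-type immediate layout:
   imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode 0x6F. */
static uint32_t encode_jal(unsigned rd, uint64_t pc, uint64_t target)
{
    uint32_t imm;
    assert(rd < 32 && jal_reachable(pc, target));
    imm = (uint32_t)(target - pc);
    return (((imm >> 20) & 1u)     << 31)
         | (((imm >> 1)  & 0x3FFu) << 21)
         | (((imm >> 11) & 1u)     << 20)
         | (((imm >> 12) & 0xFFu)  << 12)
         | ((uint32_t)rd << 7)
         | 0x6Fu;
}
```

   A patcher would call jal_reachable() first and fall back to the
   ordinary trap path whenever the trampoline area is out of range.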
      
   Likely would make sense to handle it as, say:   
      SD X1, -8(SP)   
      SD X5, -16(SP)   
      LUI X5, AddrHi   
      JALR X1, DispLo(X5)   
      LD X5, -16(SP)   
      LD X1, -8(SP)   
      JAL  X0, RetAddr   
      
   With another area (somewhere in the low 2GB) handling the actual traces   
   (trying to keep the area just before ".text" mostly limited to   
   trampoline handlers, except for extra short sequences).   
      
   Or, AUIPC if the handler is placed +/- 2GB from this table.   
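
   For either LUI or AUIPC, the AddrHi / DispLo pair has to account for
   the low 12 bits being sign-extended by the following instruction. A
   minimal C sketch of the usual split (equivalent to the assembler's
   %hi/%lo handling) is shown below; the type and function names are
   hypothetical. The input is an absolute address for LUI, or a
   PC-relative offset for AUIPC.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t hi20;   /* 20-bit field for LUI/AUIPC (0 .. 0xFFFFF) */
    int32_t lo12;   /* signed 12-bit displacement (-2048 .. 2047) */
} hilo_t;

/* Hypothetical helper: split a 32-bit value into hi20/lo12.
   Adding 0x800 before the shift compensates for the sign
   extension of the 12-bit displacement. */
static hilo_t split_hi_lo(int32_t value)
{
    hilo_t r;
    r.lo12 = (int32_t)(((uint32_t)value & 0xFFFu) ^ 0x800u) - 0x800;
    r.hi20 = (int32_t)((((uint32_t)value + 0x800u) >> 12) & 0xFFFFFu);
    return r;
}

/* What (hi20 << 12) + lo12 reconstructs, with 32-bit wraparound. */
static int32_t join_hi_lo(hilo_t p)
{
    return (int32_t)(((uint32_t)p.hi20 << 12) + (uint32_t)p.lo12);
}
```

   Note that on RV64 the LUI result is sign-extended from 32 bits, so
   this scheme only covers absolute addresses in roughly the low 2GB
   (values at or above 0x7FFFF800 do not reconstruct correctly there).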
      
      
   Or, loading a 64-bit address from memory and then possibly running the   
   handler code in XG3 mode (would have access to 128-bit arithmetic and   
   some other things lacking in RV mode).   
      
   For the load-from-memory case, this might look like, say:
      SD X1, -8(SP)   
      SD X5, -16(SP)   
      AUIPC X5, AddrHi   
      LD X5, DispLo(X5)  //load address of entry point, PC-rel (*1)   
      JALR X1, Disp2(X5)   
      LD X5, -16(SP)   
      LD X1, -8(SP)   
      JAL  X0, RetAddr   
      
   *1: Also no way in RV64GC to directly include a PC-rel load in a single   
   instruction, so need an AUIPC to do so. In this case, need to jump   
   through a full 64-bit pointer to be able to perform the mode switch   
   (AUIPC+JALR would merely branch within RV64GC mode).   
      
   Though, in some cases could make sense to keep the handlers in RV64G   
   mode, in which case no mode-change is needed.   
      
   ...   
      
      
   The initial setup for these cases would likely be the same as that for   
   ("normal") trap and emulate, just with the option of replacing some   
   instructions with alternative handlers that are not quite as inefficient...   
      


