

   alt.os.development      Operating system development chatter      4,255 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 3,809 of 4,255   
   BGB to Dan Cross   
   Re: x86-S   
   23 May 23 14:40:11   
   
   From: cr88192@gmail.com   
      
   On 5/23/2023 1:26 PM, Dan Cross wrote:   
   > In article , BGB   wrote:   
   >> On 5/23/2023 11:22 AM, Dan Cross wrote:   
   >>> In article , BGB   wrote:   
   >>>> On 5/22/2023 3:10 PM, Dan Cross wrote:   
   >>>> [snip]   
   >>>>> L2PT's like the EPT and NPT are wins here; even in the nested   
   >>>>> VM case, where we have to resort to shadow paging techniques, we   
   >>>>> can handle L2 page faults in the top-level hypervisor.   
   >>>>>   
   >>>>   
   >>>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need   
   >>>> to exist...   
   >>>   
   >>> Yes, at great expense.   
   >>>   
   >>   
   >> Doesn't seem all that expensive.   
   >>   
   >>   
   >> In terms of LUTs, a soft TLB uses far less than a page walker.   
   >   
   > You're thinking in terms of hardware, not software or   
   > performance.   
   >   
      
   Software cost is usually considered "virtually free" in comparison to
   hardware. So what if the virtual memory subsystem needs a few more kB
   of code due to a more complex hardware interface? ...
      
   I would care more about performance here if my benchmarks showed that
   it "actually mattered" in terms of macro-scale performance.
      
      
   Spending a few orders of magnitude more clock-cycles on a TLB miss   
   doesn't matter if the TLB miss rate is low enough that it disappears in   
   the noise.   
      
   It is like making a big fuss over the clock-cycle budget of having a   
   1kHz clock-timer IRQ...   
      
   Yes, those clock IRQs eat clock-cycles, but mostly, there isn't too much   
   reason to care.   
      
      
   Well, except if one wants to do a 32kHz clock-timer like on the MSP430;
   currently this is an area where the MSP430 wins (trying to do a 32kHz
   timer IRQ basically eats the CPU...).
      
      
      
   >> And, the TLB doesn't need to have a mechanism to send memory requests   
   >> and handle memory responses, ...   
   >>   
   >> It uses some Block-RAM's for the TLB, but those aren't too expensive.   
   >>   
   >>   
   >> In terms of performance, it is generally around 1.5 kilocycle per TLB   
   >> miss (*1), but as-is these typically happen roughly 50 or 100 times per   
   >> second or so.   
   >>   
   >> On a 50 MHz core, only about 0.2% of the CPU time is going into handling   
   >> TLB misses.   
   >   
   > That's not the issue.   
   >   
   > The hypervisor has to invoke the guest's   
   > TLB miss handler, which will have to fault _again_ once it tries   
   > to write to the TLB to insert an entry; this can lead to several   
   > round-trips, bouncing between the host and guest several times.   
   > With nested VMs, this gets significantly worse.   
   >   
      
   So?...
   
   If it is maybe only happening 50 to 100 times a second or so, it
   doesn't matter. At thousands or more per second it would, but in the
   general case it does not, provided the CPU has a reasonably sized TLB.
      
   If it did start to be an issue (with programs with a fairly large   
   working set), one can make the TLB bigger (and/or go to 64K pages, but   
   this has its own drawbacks).   
      
      
   It maybe matters more if the OS also swaps page tables for multitasking   
   and if each page-table swap involves a TLB flush, but I am not doing it   
   that way (one could use ASIDs; in my case I am just using a huge   
   monolithic virtual address space).   
      
      
   Granted, the use of a monolithic address space makes for a serious
   annoyance when trying to run RISC-V ELF objects on top of this, as GCC
   apparently doesn't support either PIE or Shared Objects, ...
      
   At least, my PEL4 binaries were designed to be able to deal with a   
   monolithic virtual address space (and also use in a NO-MMU environment).   
      
   But, in this case, there is also the fallback that I have a 96-bit   
   address-mode extension with a 65C816-like addressing scheme, which can   
   mimic having a number of 48-bit spaces within an otherwise monolithic   
   address space.   
      
   Though, it does currently have the limitation of effectively dropping
   the TLB to 2-way associative when active.
      
      
   >> [snip]   
   >> One could also have the guest OS use page-tables FWIW.   
   >   
   > How does the hypervisor know the format of the guest's page   
   > tables, in general?   
   >   
      
   They have designated registers and the tree formats are documented as   
   part of the ISA/ABI specs...   
      
      
   One could define it such that, if page tables are used, the table is in
   one of the defined formats, and the page is present, then the
   hypervisor is allowed to translate the page itself and skip the TLB
   Miss ISR (falling back to the ISR if the page-table is flagged as an
   unknown format).
      
   Though, generally, things like ACL Miss ISRs would still need to be
   forwarded to the guest, but these are much less common (it is generally
   sufficient to use a 4- or 8-entry cache for ACL checks).
      
      
   As-is, the defined formats are:   
      xxx: N-level Page-Table   
        3 levels for 48b address and 16K pages.   
        4 levels for 48b address and 4K pages.   
        Bit pattern encodes tree depth and format.   
      013: AVL Tree (Full Address)   
      113: B-Tree (Full Address)   
      213: Hybrid B-Tree (last-level page table)   
      313: Hybrid B-Tree (last 2-levels page table)   
      
      
   The B-Tree cases are mostly intended for the 96-bit modes, since:
      48-bit mode works fine with a conventional page-table;
      As noted, 8-level page tables suck...
      
   At present, most of the page-table formats assume 64-bit entries with a   
   48-bit physical address.   
      Low order bits are control flags;   
      Upper 16 bits are typically the ACLID.   
        The ACLID indirectly encoding "who can do what with this page".   
      
   There was an older VUGID system, but that system requires more bits to
   encode (user/group/other, rwxrwxrwx). So, it has been partially
   deprecated in favor of using ACL checks for everything.
      
      
   > 	- Dan C.   
   >   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca