From: cr88192@gmail.com   
      
   On 5/23/2023 11:22 AM, Dan Cross wrote:   
   > In article , BGB wrote:   
   >> On 5/22/2023 3:10 PM, Dan Cross wrote:   
   >> [snip]   
   >>> L2PT's like the EPT and NPT are wins here; even in the nested   
   >>> VM case, where we have to resort to shadow paging techniques, we   
   >>> can handle L2 page faults in the top-level hypervisor.   
   >>>   
   >>   
   >> But, if one uses SW TLB, then NPT (as a concept) has no reason to need   
   >> to exist...   
   >   
   > Yes, at great expense.   
   >   
      
   Doesn't seem all that expensive.   
      
      
In terms of LUTs, a soft TLB uses far fewer than a hardware page walker.

And the TLB doesn't need a mechanism to send memory requests and handle
memory responses, ...
      
It uses some Block RAMs for the TLB, but those aren't too expensive.
      
      
In terms of performance, it is generally around 1.5 kilocycles per TLB
miss (*1), but as-is these typically happen roughly 50 or 100 times per
second or so.
      
   On a 50 MHz core, only about 0.2% of the CPU time is going into handling   
   TLB misses.   
      
      
   Note that a page-fault (saving a memory page to an SD card and loading a   
   different page) is around 1 megacycle.   
      
      
   *1: Much of this goes into the cost of saving and restoring all the   
   GPRs, where my ISA has 64x 64-bit GPRs. The per-interrupt cost could be   
   reduced significantly via register banking, but then one pays a lot more   
   for registers which are only ever used during interrupt handling.   
      
      
   >>> There's a reason soft-TLBs have basically disappeared. :-)   
   >>   
   >> Probably depends some on how the software-managed TLB is implemented.   
   >   
   > Not really; the design issues and the impact are both   
   > well-known. Think through how a nested guest (note, not a   
   > nested page table, but a recursive instance of a hypervisor)   
   > would be handled.   
   >   
      
   The emulators for my ISA use SW TLB, and I don't imagine a hypervisor   
   would be that much different, except that they would likely use TLB ->   
   TLB remapping, rather than abstracting the whole memory subsystem.   
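A hypothetical sketch of what TLB -> TLB remapping could look like: the hypervisor traps the guest's LDTLB, translates the guest-"physical" address through its own mapping, and loads the composed entry into the real TLB. All names and the fixed-offset translation here are illustrative assumptions, not from any real implementation:

```c
#include <stdint.h>

/* Illustrative TLB entry layout (not any real ISA's format). */
typedef struct { uint64_t vaddr, paddr, flags; } tlbe_t;

/* Hypervisor's guest-physical -> host-physical lookup (stub:
 * assumes guest RAM is mapped at a fixed offset in host memory). */
uint64_t hv_translate_gpa(uint64_t gpa) {
    return gpa + 0x40000000ull;
}

/* Trap handler for the guest's LDTLB: compose a shadow entry by
 * remapping the physical half of the guest's TLBE. */
tlbe_t hv_on_guest_ldtlb(tlbe_t guest_e) {
    tlbe_t shadow = guest_e;
    shadow.paddr = hv_translate_gpa(guest_e.paddr);
    /* flags would also be intersected with hypervisor permissions */
    return shadow;  /* the real LDTLB is then issued with this entry */
}
```

The point being that only the physical half of the entry needs remapping; the guest's virtual half passes through unchanged.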
      
   One could also have the guest OS use page-tables FWIW.   
      
      
I had originally intended to use a firmware-managed TLB with the OS using
page-tables, but I switched to a plain software TLB mostly because I
ran out of space in the 32K Boot ROM (mostly due to things like
boot-time CPU sanity testing, *).
      
   *: Idea being that during boot, the CPU tests many of the core ISA   
   features to verify they are working as intended (say, to detect things   
   like if a change to the Verilog broke the ALU or similar, ...).   
      
      
Besides the sanity testing, the Boot ROM also contains a FAT filesystem
interface and PE/COFF / PEL4 loader (well, and also technically an ELF
loader, but I am mostly using PEL4).
      
      
Where PEL4 is:
 PE/COFF but without the MZ stub;
 Compresses most of the image using LZ4;
 Decompressing LZ4 being faster than reading in more data.
      
The LZ4 compression seems to work better for binary code than my own RP2
compression (which works better for general data, but not as well for
machine code). Both formats are byte-oriented LZ variants (but they
differ in terms of how LZ matches are encoded and similar).
      
   Have observed that LZ4 decompression tends to be slightly faster on   
   conventional machines (like x86-64), but on my ISA, RP2 is a little faster.   
      
   Note that Deflate can give slightly better compression, but is around an   
   order of magnitude slower.   
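For illustration of why LZ4 decode is cheap, here is a minimal raw-block decoder: a byte-oriented loop alternating literal copies and backward match copies, with no bit-level entropy coding (this is the standard LZ4 block format, not the PEL4 loader's actual code; error checking omitted):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal LZ4 raw-block decoder (no frame header, no bounds checks). */
size_t lz4_decode(const uint8_t *src, size_t slen, uint8_t *dst) {
    const uint8_t *ip = src, *iend = src + slen;
    uint8_t *op = dst;
    while (ip < iend) {
        uint8_t tok = *ip++;
        size_t lit = tok >> 4, mlen = (size_t)(tok & 15) + 4;
        if (lit == 15) {                 /* extended literal length */
            uint8_t b;
            do { b = *ip++; lit += b; } while (b == 255);
        }
        while (lit--) *op++ = *ip++;     /* copy literals */
        if (ip >= iend) break;           /* last sequence: literals only */
        size_t off = ip[0] | ((size_t)ip[1] << 8); ip += 2;
        if ((tok & 15) == 15) {          /* extended match length */
            uint8_t b;
            do { b = *ip++; mlen += b; } while (b == 255);
        }
        const uint8_t *mp = op - off;    /* byte copy handles overlap */
        while (mlen--) *op++ = *mp++;
    }
    return (size_t)(op - dst);
}
```

For example, the block {0x32, 'a', 'b', 'c', 0x03, 0x00} decodes to "abcabcabc": 3 literals, then a length-6 match at offset 3 (the overlapping copy is what makes run expansion work).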
      
      
   Generally, in PEL4, the file headers are left in an uncompressed state,   
   but all of the section data and similar is LZ compressed.   
Where the header magic is:
    PE\0\0: Uncompressed   
    PEL0: Also uncompressed (similar to PE\0\0)   
    PEL3: RP2 Compression (Not generally used)   
    PEL4: LZ4 Compression   
    PEL6: LZ4LLB (Modified LZ4, Length-Limited Encoding)   
      
   If the header is 'MZ', it checks for an offset to the start of the PE   
   header, but then assumes normal (uncompressed) PE/COFF.   
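The magic dispatch above amounts to something like the following (the function and enum names are illustrative; the actual loader code differs):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the PEL4 header-magic classification described above. */
enum pel_kind { K_PE, K_PEL0, K_PEL3, K_PEL4, K_PEL6, K_MZ, K_BAD };

enum pel_kind pel_classify(const uint8_t *hdr) {
    if (!memcmp(hdr, "PE\0\0", 4)) return K_PE;    /* uncompressed       */
    if (!memcmp(hdr, "PEL0",   4)) return K_PEL0;  /* also uncompressed  */
    if (!memcmp(hdr, "PEL3",   4)) return K_PEL3;  /* RP2 compression    */
    if (!memcmp(hdr, "PEL4",   4)) return K_PEL4;  /* LZ4 compression    */
    if (!memcmp(hdr, "PEL6",   4)) return K_PEL6;  /* LZ4LLB             */
    if (!memcmp(hdr, "MZ",     2)) return K_MZ;    /* follow the offset
                                                      to the PE header   */
    return K_BAD;
}
```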
      
      
   Also PEL4 uses a different checksum algorithm from normal PE/COFF, as   
   the original checksum algorithm sucked and could not detect some of the   
   main types of corruption that result from LZ screw-ups.   
      
   The "linear sum with carry-folding" was instead replaced with a "linear   
   sum and sum-of-linear-sums with carry-folding XORed together". It is   
   significantly faster than something like Adler32 (or CRC32), while still   
   providing many of the same benefits (namely, better error detection than   
   the original checksums).   
      
   Checksum is verified after the whole image is loaded/decompressed into RAM.   
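A sketch of the checksum idea as described (Fletcher-style: a running linear sum plus a running sum-of-sums, each folded to 16 bits by adding the carries back in, then XORed together). The word size and fold points here are guesses; the actual PEL4 algorithm may differ in details:

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 32-bit accumulator to 16 bits by adding the carries back in. */
static uint32_t fold16(uint32_t x) {
    x = (x & 0xFFFF) + (x >> 16);
    x = (x & 0xFFFF) + (x >> 16);   /* second pass for residual carry */
    return x;
}

/* Linear sum and sum-of-linear-sums with carry folding, XORed. */
uint16_t pel_cksum_sketch(const uint16_t *w, size_t n) {
    uint32_t s1 = 0, s2 = 0;
    for (size_t i = 0; i < n; i++) {
        s1 += w[i];     /* linear sum         */
        s2 += s1;       /* sum of linear sums */
        s1 = fold16(s1);
        s2 = fold16(s2);
    }
    return (uint16_t)(s1 ^ s2);
}
```

Unlike a plain linear sum, the sum-of-sums term is position-sensitive, so it catches reordered or shifted data (the typical result of an LZ decode gone wrong), while still needing only adds and shifts per word.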
      
      
For my ABI, the "Global Pointer" entry in the Data directory was
repurposed to handle a floating "data section", which may be loaded
at a different address from ".text" and friends (so multiple program or
DLL instances can share the same copy of ".text" and similar). The
base-relocation table is internally split along this boundary. There is
a GBR register which points to the start of ".data", which in turn
points to a table that the program or DLLs can use to reload their own
corresponding data section into GBR; for "simple case" images, this is
simply a self-pointer.
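In C-like terms, the reload step could look something like this (a sketch under the description above; the function name and exact layout are illustrative, and the real ABI does this in a few instructions rather than a call):

```c
/* GBR points at the current image's ".data"; its first word points at a
 * per-process table of data-section bases, indexed per loaded image. */
void *reload_data_base(void **gbr, int image_index) {
    void **table = (void **)gbr[0];  /* first word of .data -> table  */
    return table[image_index];       /* this image's own ".data" base */
}
```

For the "simple case" image, the table slot just points back at the same data section, so the reload is a no-op.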
      
Some sections, like the resource section, were effectively replaced (the
resource section now uses a format resembling the Quake "WAD2" format,
just with a different header and with the offsets given as RVAs). Things
like "resource lumps" can then be identified by a 16-character name
(typically stored uncompressed, apart from any compression due to the
PEL4 compression itself; bitmap images are typically stored in the
DIB/BMP format, audio as RIFF/WAVE, ...).
      
      
   Otherwise, the format is mostly similar to normal PE/COFF.   
      
      
   >> In my case, TLB miss triggers an interrupt, and there is an "LDTLB"   
   >> instruction which basically means "Take the TLBE from these two   
   >> registers and shove it into the TLB at the appropriate place".   
   >   
   > That's pretty much the way they all work, yes.   
   >   
      
I think there were some that exposed the TLB as MMIO or similar, where
the ISR handler would then be expected to write the new TLBE into an
MMIO array.
      
The SH-4 ISA also had something like this (in addition to the LDTLB
instruction), but I didn't keep this feature, and from what I could
tell, existing OSes (such as the Linux kernel) didn't appear to use
it...
      
   They also used a fully-associative TLB, which is absurdly expensive, so   
   I dropped to a 4-way set-associative TLB (while also making the TLB a   
   bit larger).   
      
      
They had used a 64-entry fully-associative array; I ended up switching
to 256 sets x 4 ways, for a total of 1024 TLBEs.

So, in this case, the main TLB ends up as roughly half the size of an L1
cache (in terms of Block RAM), but uses fewer LUTs than an L1 cache.
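The set-associative lookup itself is cheap: index a set with low VPN bits, compare 4 ways. A rough C model of that structure (the 16 KB page size here is an assumption; adjust the shift for other sizes):

```c
#include <stdint.h>
#include <stddef.h>

#define TLB_SETS   256
#define TLB_WAYS   4
#define PAGE_SHIFT 14   /* assumed 16 KB pages */

typedef struct { uint64_t vpn; uint64_t ppn; int valid; } tlb_ent;
static tlb_ent tlb[TLB_SETS][TLB_WAYS];

/* Returns the matching entry, or NULL (-> raise the TLB-miss
 * interrupt, whose ISR builds a TLBE and loads it via LDTLB). */
tlb_ent *tlb_lookup(uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    unsigned set = (unsigned)(vpn & (TLB_SETS - 1));
    for (int w = 0; w < TLB_WAYS; w++)
        if (tlb[set][w].valid && tlb[set][w].vpn == vpn)
            return &tlb[set][w];
    return NULL;
}
```

In hardware the 4 way-comparisons happen in parallel, so the cost over a direct-mapped TLB is mostly the comparators and way-mux rather than extra latency.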
      
      
      
   [continued in next message]   
      