alt.os.development
Operating system development chatter
BGB to Scott Lurndal
Re: This newsgroup. (1/2)
13 Dec 23 13:28:41
   From: cr88192@gmail.com   
      
   On 12/13/2023 9:01 AM, Scott Lurndal wrote:   
   > BGB  writes:   
   >> On 12/12/2023 5:56 PM, Scott Lurndal wrote:   
   >>> James Harris  writes:   
   >>>> On 23/03/2023 19:49, Dan Cross wrote:   
   >>>>> In article ,   
   >>>>> Scott Lurndal  wrote:   
   >>>>>> cross@spitfire.i.gajendra.net (Dan Cross) writes:   
   >>>>   
   >>>> ...   
   >>>>   
   >>>>>>>     It was never clear to me   
   >>>>>>> how a hypervisor could, in general, know the format of the guest   
   >>>>>>> page tables.  I know the Disco folks had to make some changes to   
   >>>>>>> Irix to get it to work.   
   >>>>>>   
   >>>>>> When I was working on IRIX, I was not fond of either the software
   >>>>>> managed TLB, coloring or the Kseg stuff; the MIPS project I worked
   >>>>>> on was called Teak and was a distributed version of Irix (eventually
   >>>>>> cancelled) for networks of R10k boxes.
   >>>>>   
   >>>>> I get it from a hardware perspective: fewer transistors with a   
   >>>>> software-managed TLB, but man...so many drawbacks.   
   >>>>   
   >>>> Handling a software-managed TLB may be more work, in a sense, but it   
   >>>> gives an OS developer more control, more feedback, more freedom, and   
   >>>> perhaps better opportunities for performance gains - as long as the TLB   
   >>>> is large enough.   
   >>>>   
   >>>> Having the hardware carry out a walk of page tables (the only option if   
   >>>> the TLB is updated by hardware) has long seemed to me like a bad
   >>>> idea, and it doesn't scale very well as addresses get wider.   
   >>>   
   >>> Having worked extensively with both models (SW: MIPS, HW: pretty much   
   >>> every other single mass-produced microprocessor), there is, hands down,   
   >>> no benefit to software table walks.   Zero.  Zilch.  Don't even bother.   
   >>>   
   >>> Hardware translation table walks scale rather well, and in modern   
   >>> incarnations (e.g. ARMv8) are very flexible, supporting multiple   
   >>> fundamental unit of translation sizes (e.g. 4k, 16k, 64k) and   
   >>> higher level "huge pages".   Add in the second level of walks   
   >>> required for hardware VM guest page table walkers[*] and the software   
   >>> walker becomes fragile and slow.      The hardware walkers have   
   >>> things like content addressable memory and intermediate translation
   >>> walk caches that software cannot do as effectively or efficiently.   
   >>>   
   >>> [*] 22 distinct memory accesses to translate a guest VA using 4k pages for   
   >>> both the guest and nested page tables.   
   >>   
   >>   
   >> I will partly disagree:   
   >> Software TLB makes the hardware cheaper to implement (it raises a TLB
   >> Miss exception and its job is done);   
   >   
   > No, it really doesn't make hardware cheaper to implement.  Not   
   > since the late 80's, where the MMU chip was optional.   
   >   
      
   Depends probably on scale here...   
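
   For reference, the software-TLB scheme being argued about can be
   sketched roughly as below. This is only a toy model in C, not any real
   core's design: the direct-mapped 64-entry TLB, the flat single-level
   page table, and all of the names (tlb_lookup, tlb_miss_handler,
   translate) are assumptions made up for illustration. The point is just
   that the "hardware" part only has to detect the miss, and everything
   else lives in the handler.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12    /* 4K pages */
#define TLB_ENTRIES 64    /* hypothetical direct-mapped TLB size */
#define PT_ENTRIES  1024  /* hypothetical flat, single-level page table */

typedef struct {
    uint32_t vpn;    /* virtual page number (tag) */
    uint32_t pfn;    /* physical frame number */
    int      valid;
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static uint32_t  page_table[PT_ENTRIES];  /* holds pfn + 1; 0 means unmapped */

/* "Hardware" side: a lookup that either hits or reports a miss. */
static int tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (!e->valid || e->vpn != vpn)
        return 0;  /* miss: real hardware would raise the exception here */
    *paddr = (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    return 1;
}

/* "Software" side: the miss handler walks the table and refills the TLB.
 * Returns 0 if the page is unmapped (which would become a page fault). */
static int tlb_miss_handler(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (vpn >= PT_ENTRIES || page_table[vpn] == 0)
        return 0;
    tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->vpn   = vpn;
    e->pfn   = page_table[vpn] - 1;
    e->valid = 1;
    return 1;
}

/* Models "access, trap on miss, refill, retry the access". */
static int translate(uint32_t vaddr, uint32_t *paddr)
{
    if (tlb_lookup(vaddr, paddr))
        return 1;
    if (!tlb_miss_handler(vaddr))
        return 0;
    int hit = tlb_lookup(vaddr, paddr);
    assert(hit);  /* after a successful refill the retry must hit */
    return hit;
}
```

   A hardware walker would move the contents of tlb_miss_handler into
   logic, which is the cost being debated here.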
      
      
   >> Also one can make the interrupt mechanism cheap (for hardware) as well   
   >> by treating it as a glorified branch-with-link (with the interrupt   
   >> handler needing to deal with everything else itself).   
   >   
   > That doesn't matter.   
   >   
   > Unless you are developing for a consumer grade FPGA, the
   > hardware walker is far superior in every way.   
   >   
      
   It makes a difference on a lot of the Spartan and Artix chips.   
   Particularly on the lower end, like XC7S25.   
      
   But, a lot of the boards for the XC7S25 don't even bother with things   
   like RAM modules, so one is hard-pressed to go too far outside of   
   microcontroller territory with these as a result.   
      
   These are FPGAs with around 15K LUTs.

   IOW: your RAM+ROM space is measured in kB, so it may make sense, say, to
   go with 48K of ROM and 16K of RAM, using the low 64K of address space as
   the main memory map. One finds out (after buying it) that the 512K
   QSPI RAM module only came on the version of the board with an XC7A35T
   instead.
      
      
   There are also the cheaper iCE40 FPGAs, but doing much beyond an
   8/16-bit microcontroller on an iCE40 is pushing it (the bigger end of
   this range tends to be around 5K LUTs). But, say, one can implement
   something like a 6502 on this (and a few people have managed to get
   RV32I cores and similar to fit, though usually cutting some corners).
      
      
   On the bigger and more expensive FPGA boards, RAM sizes are typically   
   measured in MB.   
      
      
   Though, one of the major FPGAs I was targeting was the comparably   
   massive XC7A100T, with around 63K LUTs.   
      
      
   The comparison between LUTs and transistors is a bit ambiguous, but a   
   "reasonable estimate" seems to be 1 LUT ~= 10 transistors.   
      
      
   So, say, 63K LUT ~= 630K transistors.   
   Though, there are a lot of "cheat" features here, like DSP48's and Block   
   RAM.   
      
      
   Going much bigger than this, Xilinx wants money to use Vivado, and the
   licenses cost more than the dev-boards... (Say, I managed to get an
   XC7K325T board for around $100 off of AliExpress, but can't use it
   because this isn't one of the parts covered by the free Vivado edition).
      
      
   The bare FPGAs would be a fair bit cheaper than the dev-boards, but the
   BGA packaging (and need for multi-layer PCBs) is a bit more of an ask
   for hobbyist-level development (granted, KiCad and services like PCBWay
   and similar exist, but this is still a bit of an ask).
      
   Some of the smaller FPGA boards are available in form factors that can   
   fit into the larger DIP sockets (and others can be plugged on top of   
   wire-wrap perfboard or similar if needed via a pin-header interface).   
      
      
   >   
   >> Claiming that there is no possible advantage to software TLB is like   
   >> arguing that there is no possible advantage to CPU's which only support   
   >> aligned memory access (and use slow emulation traps to deal with   
   >> unaligned access).   
   >   
   > There is no advantage to software alignment handling.   
   >   
      
   Besides making the L1 cache cheaper, for similar reasons.   
      
   I did spend this cost though, because it can make a big difference for   
   things like "memcpy()" and LZ decompression and similar.   
      
      
   Dealing with misaligned access does nearly double the size of the L1   
   cache though, for a direct-mapped design:   
   One may need to access two cache lines at a time, and check two   
   different addresses for hit/miss, rather than a single cache line and a   
   single address check.   
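
   As a rough illustration of why this costs extra: an access only needs
   the second line (and the second tag check) when it runs past the end of
   the first one. The sketch below is C rather than HDL, and the 16-byte
   line size and the name lines_touched are assumptions picked for the
   example, not a description of any particular cache.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 16u  /* hypothetical L1 line size */

/* Number of cache lines touched by an access of `size` bytes at `addr`.
 * An aligned-only cache can assume this is always 1; a cache that allows
 * misaligned access has to be built to handle the 2 case. */
static unsigned lines_touched(uint32_t addr, unsigned size)
{
    assert(size >= 1 && size <= LINE_BYTES);
    uint64_t first = (uint64_t)addr / LINE_BYTES;
    uint64_t last  = ((uint64_t)addr + size - 1) / LINE_BYTES;
    return (unsigned)(last - first) + 1;
}
```

   So an 8-byte load at 0x1008 stays inside one line, while the same load
   at 0x100C spills into the next line and needs both lines fetched and
   both addresses checked for hit/miss.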
      
      
   But, on the positive side, it can make LZ decompression roughly 8x
   faster on a 64-bit CPU (because loading/storing values 1 byte at a
   time would be slow).
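
   A minimal sketch of the kind of inner loop this is about, assuming a
   generic LZ77-style match copy (the name lz_copy_match and the
   distance/length interface are made up here, not taken from any specific
   decoder). The 8-byte path goes through memcpy so the C stays portable;
   on a target with cheap misaligned access a compiler will typically turn
   each of those into a single 64-bit load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy an LZ match: `len` bytes taken from `dist` bytes behind `dst`.
 * The source and destination are at arbitrary byte offsets, so the wide
 * path depends on misaligned 64-bit access being fast. */
static void lz_copy_match(uint8_t *dst, size_t dist, size_t len)
{
    assert(dist >= 1);
    const uint8_t *src = dst - dist;

    /* Wide path: only safe when an 8-byte chunk cannot overlap itself. */
    if (dist >= 8) {
        while (len >= 8) {
            uint64_t w;
            memcpy(&w, src, 8);   /* possibly misaligned 64-bit load */
            memcpy(dst, &w, 8);   /* possibly misaligned 64-bit store */
            src += 8;
            dst += 8;
            len -= 8;
        }
    }

    /* Byte path: the tail, and short-distance (overlapping) matches. */
    while (len--)
        *dst++ = *src++;
}
```

   On an aligned-only machine, the wide path either traps on most
   iterations or has to be replaced with the byte loop, which is where the
   slowdown comes from.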
      
   Though, this situation would have been even worse on a word-oriented   
   design (such as the DEC Alpha).   
      
      
   >>   
   >> If there were no advantage, people wouldn't do it that way...   
   >   
   > No new design in the last two decades has done that.   
   >   
      
   SiFive is selling chips that take around 500 cycles whenever one makes a   
   misaligned access...   
      
      
   Meanwhile, what one can pull off on an Artix-7 or similar is mostly
   limited to what would have been "state of the art" 30+ years ago.
      
      
   [continued in next message]   
      