From: cr88192@gmail.com   
      
   On 9/17/2025 4:33 PM, John Levine wrote:   
   > According to BGB :   
   >> Still sometimes it seems like it is only a matter of time until Intel or   
   >> AMD releases a new CPU that just sort of jettisons x86 entirely at the   
   >> hardware level, but then pretends to still be an x86 chip by running   
   >> *everything* in a firmware level emulator via dynamic translation.   
   >   
   > That sounds a whole lot like what Transmeta did 25 years ago:   
   >   
   > https://en.wikipedia.org/wiki/Transmeta_Crusoe   
   >   
   > They failed but perhaps things are different now. Their   
   > native architecture was VLIW which might have been part   
   > of the problem.   
   >   
      
   Might be different now:
   25 years ago, Moore's law was still going strong, and the general
   concern was more about maximizing scalar performance than about energy
   efficiency or core count (and, in those days, processors were generally
   single-core).
      
      
   Now we have a different situation:   
    Moore's law is dying off;   
    Scalar CPU performance has hit a plateau;   
    And, for many uses, performance is "good enough";   
    A lot more software can make use of multi-threading;   
    ...   
      
      
   Likewise, x86 tends to need a lot of the "big CPU" machinery to perform
   well, whereas something like a RISC-style ISA can get better performance
   from a comparably smaller and cheaper core, and with a somewhat better
   "performance per watt" metric.
      
      
   So, one possibility could be, rather than a small number of big/fast   
   cores (either VLIW or OoO), possibly a larger number of smaller cores.   
      
   The cores could maybe be LIW or in-order RISC.   
      
      
      
      
   One possibility could be that each virtual processor doesn't run on a
   single core, say:
    Each logical core exists more as a VM running a virtual x86 processor;
    The dynamic translation doesn't JIT-translate to a single linear program.
      
   Say:
    The translator breaks code into traces;
    Each trace runs under a model akin to CSP mixed with the Pi-Calculus;
    Address translation is explicit in the ISA, with specialized ISA-level
    memory-ordering and control-flow primitives.
      
   For example, there could be special ISA-level mechanisms for submitting
   a job to a local job-queue and pulling a job from the queue.
   Memory accesses could use a special "perform a memory access or
   branch-subroutine" instruction ("MEMorBSR"), where a MEMorBSR
   operation tries to access memory and either continues to the next
   instruction (success) or branches to a subroutine (access failed).

   The failure cases could include (but are not limited to): TLB miss;
   access fault; memory ordering fault; ...
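   As a rough sketch of the idea (in C, with all names, failure codes, and
   the toy one-page memory model being purely illustrative assumptions),
   MEMorBSR-style semantics might look like:

```c
#include <stdint.h>

/* Hypothetical failure codes for a MEMorBSR-style access. */
typedef enum {
    MEM_OK = 0,
    MEM_TLB_MISS,
    MEM_ACCESS_FAULT,
    MEM_ORDER_FAULT
} MemStatus;

/* Toy memory model: a single mapped page at 0x1000, so the sketch is
 * self-contained; a real core would consult the TLB, permissions, and
 * sequence-number checks here. */
static uint64_t fake_page[512];

static MemStatus try_load(uint64_t vaddr, uint64_t *out) {
    if (vaddr < 0x1000 || vaddr >= 0x2000)
        return MEM_TLB_MISS;
    *out = fake_page[(vaddr - 0x1000) / 8];
    return MEM_OK;
}

/* MEMorBSR.Load: continue on success, otherwise "branch to subroutine"
 * (modeled here as calling a handler). */
static uint64_t mem_or_bsr_load(uint64_t vaddr,
                                uint64_t (*handler)(uint64_t, MemStatus)) {
    uint64_t val;
    MemStatus st = try_load(vaddr, &val);
    if (st == MEM_OK)
        return val;              /* success: fall through */
    return handler(vaddr, st);   /* failure: BSR to handler */
}

/* Example failure handler: a runtime miss handler would refill the TLB
 * or raise a fault; this one just returns a sentinel. */
static uint64_t miss_handler(uint64_t vaddr, MemStatus st) {
    (void)vaddr; (void)st;
    return UINT64_MAX;
}
```

   The point is that the slow paths (TLB refill, ordering repair) become
   ordinary subroutine calls rather than full exceptions.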
      
   The "memory ordering fault" case could work like this: when traces are
   submitted to the queue, those that access memory are assigned sequence
   numbers based on their Load and Store operations. When memory is
   accessed, the memory blocks in the cache are marked with the sequence
   numbers of the operations that read or modified them. On access, the
   hardware can detect when memory accesses carry out-of-order sequence
   numbers, and then fall back to special-case handling to restore the
   intended order (reverting any "uncommitted" writes, and putting the
   offending blocks back into the queue to be re-run after the preceding
   blocks have finished).
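   A minimal sketch of how the sequence-number check might work, assuming
   per-cache-line read/write sequence metadata (all names and the exact
   policy are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

/* Each cache line remembers the newest trace sequence number that read
 * or wrote it. A violation means a logically newer trace already touched
 * the line, so it may have seen stale data. */
typedef struct {
    uint32_t last_read_seq;   /* newest trace that read this line */
    uint32_t last_write_seq;  /* newest trace that wrote this line */
} LineMeta;

/* Returns true if trace 'seq' may store; false means an ordering fault
 * (the offending newer trace must be rolled back and re-queued). */
static bool check_store(LineMeta *m, uint32_t seq) {
    if (m->last_read_seq > seq || m->last_write_seq > seq)
        return false;          /* a "future" trace already saw this line */
    m->last_write_seq = seq;
    return true;
}

/* Returns true if trace 'seq' may load; false means it would read a
 * value written by a logically newer trace. */
static bool check_load(LineMeta *m, uint32_t seq) {
    if (m->last_write_seq > seq)
        return false;
    if (seq > m->last_read_seq)
        m->last_read_seq = seq;
    return true;
}
```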
      
   Possibly, the caches wouldn't directly commit stores to memory, but   
   instead could keep track of a group of cache lines as an "in-flight"   
   transaction. In this case, it could be possible for a "logically older"   
   block to see the memory as it was before a more recent transaction, but   
   an out-of-order write could be detected via sequence numbers (if seen,   
   it would mean a "future" block had run but had essentially read stale data).   
      
   Once a block is fully committed (after all preceding blocks have
   finished), its contents can be written back out to main RAM.
   Until then, the contents could be held in an area of RAM local to the
   group of cores running the logical core.
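   The in-order commit rule could be sketched roughly like this (a toy
   model with made-up names; real hardware would also track each
   transaction's set of in-flight cache lines):

```c
#include <stdint.h>
#include <stdbool.h>

/* 'committed_upto' is the highest sequence number fully committed so
 * far; a trace's writes stay "in flight" in the cache until every
 * earlier trace has finished. */
typedef struct {
    uint32_t committed_upto;
} CommitState;

/* A trace may drain its writes to main RAM only if it is next in line. */
static bool may_commit(const CommitState *cs, uint32_t seq) {
    return seq == cs->committed_upto + 1;
}

/* Called when trace 'seq' finishes and is allowed to commit. */
static void commit(CommitState *cs, uint32_t seq) {
    if (may_commit(cs, seq))
        cs->committed_upto = seq;  /* writes may now reach main RAM */
}
```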
      
   Possibly, such a core might actually operate in multiple address spaces:
    Virtual Memory:
     Accessed via the transaction-oriented MEMorBSR mechanism;
     There would likely be an explicit TLB here;
     So, TLB miss handling could essentially be a runtime call.
    Local Memory:
     Physical address, small non-externally-visible SRAM;
     Divided into Core-Local and Group-Shared areas.
    Physical Memory:
     External DRAM or similar;
     Resembles more traditional RAM access (via Load/Store ops);
     Could be used for VM tasks and page-table walks.
      
      
   Would likely require significant hardware level support for things like   
   job-queues and synchronization mechanisms.   
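   For illustration, the job-queue primitives might behave like a small
   hardware ring buffer local to the core group; a C model of such a
   queue (names, fields, and capacity are arbitrary assumptions) could be:

```c
#include <stdint.h>
#include <stdbool.h>

#define JOBQ_CAP 64

typedef struct {
    uint32_t trace_entry;  /* address of the translated trace */
    uint32_t seq;          /* sequence number for memory ordering */
} Job;

typedef struct {
    Job      slot[JOBQ_CAP];
    uint32_t head, tail;   /* head == tail means empty */
} JobQueue;

/* "Submit a job": returns false if the queue is full. */
static bool jobq_push(JobQueue *q, Job j) {
    uint32_t next = (q->tail + 1) % JOBQ_CAP;
    if (next == q->head) return false;     /* full */
    q->slot[q->tail] = j;
    q->tail = next;
    return true;
}

/* "Pull a job": returns false if the queue is empty. */
static bool jobq_pop(JobQueue *q, Job *out) {
    if (q->head == q->tail) return false;  /* empty */
    *out = q->slot[q->head];
    q->head = (q->head + 1) % JOBQ_CAP;
    return true;
}
```

   In hardware, push/pull would presumably be single instructions hitting
   a group-local SRAM structure rather than library calls.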
      
   One possibility could be that some devices exist local to a group of
   cores, which then have a synchronous "first come, first served" access
   pattern (possibly similar to how my existing core design manages MMIO).
      
   Possibly it could work by passing fixed-size messages over a bus, with   
   each request/response pair to a device being synchronous.   
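   A sketch of what a fixed-size, synchronous request/response message
   might look like (the field layout, 32-byte size, and the echo device
   are all made up for illustration):

```c
#include <stdint.h>

/* Hypothetical fixed-size bus message; each device request/response
 * pair is synchronous. */
typedef struct {
    uint16_t dev_id;      /* target device on the local bus */
    uint16_t opcode;      /* device-specific command */
    uint32_t seq;         /* request/response matching tag */
    uint8_t  payload[24];
} BusMsg;                 /* 32 bytes total, fixed size */

/* Synchronous request: the caller blocks until the device responds.
 * 'device' stands in for whatever services the request in hardware. */
static BusMsg bus_request(BusMsg req, BusMsg (*device)(BusMsg)) {
    BusMsg resp = device(req);
    resp.seq = req.seq;   /* response carries the request's tag */
    return resp;
}

/* Example device: acknowledges by echoing with opcode+1. */
static BusMsg echo_device(BusMsg req) {
    BusMsg resp = req;
    resp.opcode = req.opcode + 1;
    return resp;
}
```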
      
      
   Possibly the JIT could try to infer possible memory aliasing between
   traces, and enforce sequential ordering if aliasing is likely. This is
   because performing the operations in the correct order the first time
   is likely to be cheaper than detecting an ordering violation and
   rolling back a transaction.
      
   Meanwhile, proving that traces can't alias is likely to be a much
   harder problem than inferring a probable absence of aliasing. If no
   ordering violations occur during execution, it can be safely assumed
   that no memory aliasing happened.
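   One possible shape for such a probabilistic alias check, sketched in C
   (the (base, offset-range) access summaries and the "assume aliasing
   when unknown" policy are assumptions, not a specific known design):

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-trace summary of a memory access region, as the JIT might infer
 * it statically. */
typedef struct {
    uint32_t base;     /* e.g. base register id or resolved region id */
    uint32_t lo, hi;   /* byte offset range touched relative to base */
    bool     unknown;  /* true if the address could not be analyzed */
} AccessSummary;

/* "Likely aliasing" heuristic: when true, the JIT would chain the two
 * traces sequentially rather than risk a rollback. This only infers a
 * probable answer; it does not prove anything. */
static bool may_alias(const AccessSummary *a, const AccessSummary *b) {
    if (a->unknown || b->unknown)
        return true;    /* can't analyze: conservatively assume aliasing */
    if (a->base != b->base)
        return false;   /* different regions: probably disjoint */
    return a->lo <= b->hi && b->lo <= a->hi;  /* ranges overlap */
}
```

   Note the "different base implies disjoint" step is exactly the kind of
   probable-but-unproven inference the text describes; the sequence-number
   check remains the safety net if it guesses wrong.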
      
   Maintaining transactions would complicate the cache design though,
   since now a cache line can't be written back or evicted until its
   associated write sequence is fully committed.
      
   There might also need to be separate queue slots for "tasks currently
   being worked on" vs "tasks to be done after the current jobs are done".
   Say, for example, if a job needs to be rolled back and re-run, it would
   still need to come before jobs that are further in the future relative
   to itself.
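   One way to model the "re-run before future jobs" requirement is to keep
   the current tier sorted by sequence number, so a rolled-back job
   re-enters ahead of logically newer ones (a toy sketch, with arbitrary
   names and capacity):

```c
#include <stdint.h>
#include <stdbool.h>

#define TIER_CAP 32

/* One scheduling tier, kept sorted by sequence number (oldest first). */
typedef struct {
    uint32_t seq[TIER_CAP];
    int      n;
} Tier;

/* Insert a job (or re-insert a rolled-back one) in sequence order. */
static bool tier_insert(Tier *t, uint32_t seq) {
    if (t->n >= TIER_CAP) return false;
    int i = t->n++;
    while (i > 0 && t->seq[i - 1] > seq) {
        t->seq[i] = t->seq[i - 1];
        i--;
    }
    t->seq[i] = seq;
    return true;
}

/* Pull the oldest job; a rolled-back job re-inserted above will come
 * out before any higher-numbered job. */
static bool tier_pop(Tier *t, uint32_t *out) {
    if (t->n == 0) return false;
    *out = t->seq[0];
    t->n--;
    for (int i = 0; i < t->n; i++)
        t->seq[i] = t->seq[i + 1];
    return true;
}
```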
      
   Unlike memory, register ordering is easier to infer statically, at least   
   in the absence of dynamic branching.   
      
   Might need to enforce ordering in cases where:
    A dynamic branch occurs and the path can't be followed statically;
    A following trace would depend on a register modified in a preceding
    trace;
    ...
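   The register-dependence case could be approximated statically with
   per-trace read/write register masks (a simplified sketch; a real JIT
   might rename registers to avoid serializing on WAW/WAR overlaps):

```c
#include <stdint.h>
#include <stdbool.h>

/* Register sets modeled as 64-bit masks, one bit per architectural
 * register. */
typedef struct {
    uint64_t reads;    /* registers the trace reads before writing them */
    uint64_t writes;   /* registers the trace writes */
} TraceRegs;

/* True if 'later' must wait for 'earlier'. This simple scheme
 * serializes on any overlap with the earlier trace's writes (covering
 * RAW directly, and WAW conservatively). */
static bool reg_dependent(const TraceRegs *earlier, const TraceRegs *later) {
    return (earlier->writes & (later->reads | later->writes)) != 0;
}
```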
      
      
      
   As for how viable any of this is, I don't know...   
      
   The VM could be a lot simpler if one assumes a single-threaded VM.
      
      
   Also unclear is whether an ISA could be designed in a way that keeps
   overheads low enough (it would be a waste if the multi-threaded VM
   ended up slower than a single-threaded VM would have been). But this
   would require a lot of exotic mechanisms, so dunno...
      
   ...   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   