home bbs files messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.arch      Apparently more than just beeps & boops      131,241 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 130,668 of 131,241   
   Chris M. Thomasson to BGB   
   Re: Variable-length instructions (1/2)   
   29 Dec 25 16:44:35   
   
   From: chris.m.thomasson.1@gmail.com   
      
   On 12/28/2025 4:41 PM, BGB wrote:   
   > On 12/28/2025 5:53 PM, MitchAlsup wrote:   
   >>   
   >> "Chris M. Thomasson"  posted:   
   >>   
   >>> On 12/28/2025 2:04 PM, MitchAlsup wrote:   
   >>>>   
   >>>> "Chris M. Thomasson"  posted:   
   >>>>   
   >>>>> On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:   
   >>>>>> On 12/21/2025 1:21 PM, MitchAlsup wrote:   
   >>>>>>>   
   >>>>>>> "Chris M. Thomasson"  posted:   
   >>>>>>>   
   >>>>>>>> On 12/21/2025 10:12 AM, MitchAlsup wrote:   
   >>>>>>>>>   
   >>>>>>>>> John Savard  posted:   
   >>>>>>>>>   
   >>>>>>>>>> On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:   
   >>>>>>>>>>   
   >>>>>>>>>>> For argument setup (calling side) one needs MOV   
   >>>>>>>>>>> {R1..R5},{Rm,Rn,Rj,Rk,Rl}   
   >>>>>>>>>>> For returning values (calling side)   needs MOV {Rm,Rn,Rj},   
   >>>>>>>>>>> {R1..R3}   
   >>>>>>>>>>>   
   >>>>>>>>>>> For loop iterations                   needs MOV   
   >>>>>>>>>>> {Rm,Rn,Rj},
   >>>>>>>>>>> {Ra,Rb,Rc}   
   >>>>>>>>>>>   
   >>>>>>>>>>> I just can't see how to make these run reasonably fast within   
   >>>>>>>>>>> the   
   >>>>>>>>>>> constraints of the GBOoO Data Path.   
   >>>>>>>>>>   
   >>>>>>>>>> Since you actually worked at AMD, presumably you know why I'm   
   >>>>>>>>>> mistaken   
   >>>>>>>>>> here...   
   >>>>>>>>>>   
   >>>>>>>>>> when I read this, I thought that there was a standard   
   >>>>>>>>>> technique for   
   >>>>>>>>>> doing   
   >>>>>>>>>> stuff like that in a GBOoO machine.   
   >>>>>>>>>   
   >>>>>>>>> There is::: it is called "load 'em up, pass 'em through". That   
   >>>>>>>>> is no   
   >>>>>>>>> different than any other calculation, except that no mangling   
   >>>>>>>>> of the   
   >>>>>>>>> bits is going on.   
   >>>>>>>>>   
   >>>>>>>>>> Just break down all the
   >>>>>>>>>> fancy   
   >>>>>>>>>> instructions into RISC-style pseudo-ops. But apparently, since   
   >>>>>>>>>> you   
   >>>>>>>>>> would   
   >>>>>>>>>> know all about that, there must be a reason why it doesn't   
   >>>>>>>>>> apply in   
   >>>>>>>>>> these   
   >>>>>>>>>> cases.   
   >>>>>>>>>   
   >>>>>>>>> x86 has short/small MOV instructions; not so with RISCs.
   >>>>>>>>   
   >>>>>>>> Does your EMS use a so called LOCK MOV? For some damn reason I   
   >>>>>>>> remember   
   >>>>>>>> something like that. The LOCK "prefix" for say XADD, CMPXCHG8B,   
   >>>>>>>> etc.
   >>>>>>>   
   >>>>>>> The 2-operand+displacement LD/STs have a lock bit in the
   >>>>>>> instruction--that is, it is not a Prefix. MOV in My 66000 is
   >>>>>>> reg-reg or reg-constant.
   >>>>>>>   
   >>>>>>> Oh, and its ESM not EMS. Exotic Synchronization Method.   
   >>>>>>>   
   >>>>>>> In order to get ATOMIC-ADD-to-Memory; I will need an Instruction-   
   >>>>>>> Modifier   
   >>>>>>> {A.K.A. a prefix}.   
   >>>>>>   
   >>>>>> Thanks for the clarification.   
   >>>>>   
   >>>>> On x86/x64 LOCK XADD is a loopless wait free operation.   
   >>>>>   
   >>>>> I need to clarify. Okay, on the x86 a LOCK XADD will make for a
   >>>>> loopless impl. If we are on another system and that LOCK XADD is
   >>>>> some sort of LL/SC "style" loop, well, that causes damage to my
   >>>>> loopless claim... ;^o
   >>>>>   
   >>>>> So, can your system get wait free semantics for RMW atomics?   
   >>>>   
   >>>> A::   
   >>>>   
   >>>>        ATOMIC-to-Memory-size  [address]   
   >>>>        ADD                    Rd,--,#1   
   >>>>   
   >>>> Will attempt an ATOMIC add to L1 cache. If line is writeable, ADD is
   >>>> performed and line updated. Otherwise, the Add-to-memory #1 is shipped   
   >>>> out over the memory hierarchy. When the operation runs into a cache   
   >>>> containing [address] in the writeable-state the add is performed and   
   >>>> the previous value returned. If [address] is not writeable the cache   
   >>>> line is invalidated and the search continues outward. {This protocol
   >>>> depends on writeable implying {exclusive or modified} which is   
   >>>> typical.}   
   >>>>   
   >>>> When [address] reaches the Memory-Controller it is scheduled in arrival
   >>>> order, other caches system wide will receive CI, and modified lines   
   >>>> will be pushed back to DRAM-Controller. When CI is "performed" MC/   
   >>>> DRC will perform add #1 to [address] and previous value is returned   
   >>>> as its result.   
   >>>>   
   >>>> {{That is, the ADD is performed where the data is found in the
   >>>> memory hierarchy, and the previous value is returned as result;   
   >>>> with all cache-effects and coherence considered.}}   
   >>>>   
   >>>> A HW guy would not call this wait free--since the CPU is waiting   
   >>>> until all the nuances get sorted out, but SW will consider this   
   >>>> wait free since SW does not see the waiting time unless it uses   
   >>>> a high precision timer to measure delay.   
   >>>   
   >>> Good point. Humm. Well, I just don't want to see the disassembly of   
   >>> atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)   
   >>   
   >> If you do it LL/SC-style you HAVE to bring data to "this" particular   
   >> CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under   
   >> contention. So you DON'T DO IT LIKE THAT.
   >>   
   >> Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not   
   >> Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}   
   >   
   > IMHO:   
   > No-Cache + CAS is probably a better bet than LL/SC;   
      
   Fwiw, there is a "weak" CAS in the C++ std. I think it's there to handle
   cases where an LL/SC can spuriously fail, aka it can fail even though it
   should have succeeded... A strong CAS means that if it fails, the value
   observed is different from the comparand. This is how a LOCK
   CMPXCHG/CMPXCHG8B/CMPXCHG16B acts on x86/x64.
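The distinction shows up directly in `std::atomic`: `compare_exchange_strong` only fails when the observed value really differs, while `compare_exchange_weak` may fail spuriously and so belongs in a retry loop. A minimal sketch (helper names `try_set` and `increment_weak` are mine, just for illustration):

```cpp
#include <atomic>

// Strong CAS: a failure carries real information (the observed value
// differed from `expected`), so it can drive a state machine directly.
bool try_set(std::atomic<int>& v, int expected, int desired) {
    return v.compare_exchange_strong(expected, desired);
}

// Weak CAS may fail spuriously (e.g. when mapped onto LL/SC hardware),
// so it is normally wrapped in a retry loop. On failure, `cur` is
// reloaded with the value actually observed.
void increment_weak(std::atomic<int>& v) {
    int cur = v.load();
    while (!v.compare_exchange_weak(cur, cur + 1)) {
        // cur now holds the freshly observed value; retry.
    }
}
```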
      
   No, having just CAS is not ideal... Akin to what the PellesC guys did to   
   implement a LOCK XADD using a LOCK CMPXCHG loop! I noticed it and   
   brought it up:   
      
   https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217   
      
   CAS always implies a loop, unless it CANNOT fail spuriously. In that
   case, aka LOCK CMPXCHG, it can be used in a state machine. We know a
   failure means what it means. Not, oh shit it failed but we don't exactly
   know why... a la LL/SC...
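To make the contrast concrete, here is a sketch of fetch-and-add done both ways: the direct `fetch_add`, which on x86/x64 typically compiles to a single LOCK XADD (loopless), versus a CAS-loop emulation akin to what the PellesC code described above did. Function names are mine, for illustration only:

```cpp
#include <atomic>

// Loopless path: on x86/x64 this typically compiles to one LOCK XADD.
// Returns the previous value.
int fetch_add_direct(std::atomic<int>& a) {
    return a.fetch_add(1);
}

// CAS-loop emulation: retries until the strong CAS succeeds, so it is
// lock-free but can loop under contention -- the claim to a loopless,
// wait-free fetch-and-add is lost.
int fetch_add_cas_loop(std::atomic<int>& a) {
    int old = a.load();
    while (!a.compare_exchange_strong(old, old + 1)) {
        // old now holds the freshly observed value; retry.
    }
    return old;
}
```

Both return the previous value, but only the first is guaranteed not to loop on hardware with a true atomic RMW.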
      
      
      
   > LL/SC: Depends on the existence of explicit memory-coherency features.
   > No-Cache + CAS: Can be made to work independent of the underlying memory   
   > model.   
   >   
   > Granted, No-Cache is its own feature:   
   > Need some way to indicate to the L1 cache that special handling is   
   > needed for this memory access and cache line (that it should not use a   
   > previously cached value and should be flushed immediately once the   
   > operation completes).   
   >   
   >   
   > But, No-Cache behavior is much easier to fake on a TSO capable memory   
   > subsystem, than it is to accurately fake LL/SC on top of weak-model   
   > write-back caches.   
   >   
   > If the memory system implements TSO or similar, then one can simply   
   > ignore the No-Cache behavior and achieve the same effect.   
   >   
   > ...   
   >   
   >   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca