From: chris.m.thomasson.1@gmail.com   
      
   On 12/28/2025 2:17 PM, Chris M. Thomasson wrote:   
   > On 12/28/2025 2:04 PM, MitchAlsup wrote:   
   >>   
   >> "Chris M. Thomasson" posted:   
   >>   
   >>> On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:   
   >>>> On 12/21/2025 1:21 PM, MitchAlsup wrote:   
   >>>>>   
   >>>>> "Chris M. Thomasson" posted:   
   >>>>>   
   >>>>>> On 12/21/2025 10:12 AM, MitchAlsup wrote:   
   >>>>>>>   
   >>>>>>> John Savard posted:   
   >>>>>>>   
   >>>>>>>> On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:   
   >>>>>>>>   
   >>>>>>>>> For argument setup (calling side) one needs MOV
   >>>>>>>>> {R1..R5},{Rm,Rn,Rj,Rk,Rl}
   >>>>>>>>> For returning values (calling side)  needs MOV {Rm,Rn,Rj},
   >>>>>>>>> {R1..R3}
   >>>>>>>>>
   >>>>>>>>> For loop iterations                  needs MOV {Rm,Rn,Rj},
   >>>>>>>>> {Ra,Rb,Rc}
   >>>>>>>>>   
   >>>>>>>>> I just can't see how to make these run reasonably fast within the   
   >>>>>>>>> constraints of the GBOoO Data Path.   
   >>>>>>>>   
   >>>>>>>> Since you actually worked at AMD, presumably you know why I'm   
   >>>>>>>> mistaken   
   >>>>>>>> here...   
   >>>>>>>>   
   >>>>>>>> when I read this, I thought that there was a standard technique for   
   >>>>>>>> doing   
   >>>>>>>> stuff like that in a GBOoO machine.   
   >>>>>>>   
   >>>>>>> There is::: it is called "load 'em up, pass 'em through". That is no   
   >>>>>>> different than any other calculation, except that no mangling of the   
   >>>>>>> bits is going on.   
   >>>>>>>   
   >>>>>>>>          Just break down all the fancy
   >>>>>>>> instructions into RISC-style pseudo-ops. But apparently, since you   
   >>>>>>>> would   
   >>>>>>>> know all about that, there must be a reason why it doesn't apply in   
   >>>>>>>> these   
   >>>>>>>> cases.   
   >>>>>>>   
   >>>>>>> x86 has short/small MOV instructions; not so with RISCs.
   >>>>>>   
   >>>>>> Does your EMS use a so-called LOCK MOV? For some damn reason I
   >>>>>> remember something like that. The LOCK "prefix" for say XADD,
   >>>>>> CMPXCHG8B, etc...
   >>>>>   
   >>>>> The 2-operand+displacement LD/STs have a lock bit in the
   >>>>> instruction--that is, it is not a prefix. MOV in My 66000 is
   >>>>> reg-reg or reg-constant.
   >>>>>   
   >>>>> Oh, and it's ESM, not EMS. Exotic Synchronization Method.
   >>>>>   
   >>>>> In order to get ATOMIC-ADD-to-Memory, I will need an
   >>>>> Instruction-Modifier {A.K.A. a prefix}.
   >>>>   
   >>>> Thanks for the clarification.   
   >>>   
   >>> On x86/x64 LOCK XADD is a loopless wait free operation.   
   >>>   
   >>> I need to clarify. Okay, on x86 a LOCK XADD makes for a loopless
   >>> impl. If we're on another system and that LOCK XADD is some sort of
   >>> LL/SC "style" loop, well, that damages my loopless claim... ;^o
   >>>   
   >>> So, can your system get wait free semantics for RMW atomics?   
   >>   
   >> A::   
   >>   
   >>      ATOMIC-to-Memory-size [address]
   >>      ADD                   Rd,--,#1
   >>   
   >> Will attempt an ATOMIC add in the L1 cache. If the line is writeable,
   >> the ADD is performed and the line updated. Otherwise, the Add-to-memory
   >> of #1 is shipped out over the memory hierarchy. When the operation runs
   >> into a cache containing [address] in the writeable state, the add is
   >> performed and the previous value returned. If [address] is not
   >> writeable, the cache line is invalidated and the search continues
   >> outward. {This protocol depends on writeable implying {exclusive or
   >> modified}, which is typical.}
   >>   
   >> When [address] reaches the Memory-Controller it is scheduled in
   >> arrival order, other caches system-wide will receive CI, and modified
   >> lines will be pushed back to the DRAM-Controller. When the CI is
   >> "performed", the MC/DRC will perform the add of #1 to [address] and
   >> the previous value is returned as the result.
   >>   
   >> {{That is, the ADD is performed wherever the data is found in the
   >> memory hierarchy, and the previous value is returned as the result,
   >> with all cache effects and coherence considered.}}
   >>   
   >> A HW guy would not call this wait free, since the CPU is waiting
   >> until all the nuances get sorted out; but SW will consider it wait
   >> free, since SW does not see the waiting time unless it uses a
   >> high-precision timer to measure the delay.
   >   
   > Good point. Humm. Well, I just don't want to see the disassembly of   
   > atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)   
      
   Fwiw, I noticed that a certain compiler was implementing LOCK XADD with   
   a LOCK CMPXCHG loop and got a little pissed. Had to tell them about it:   
      
   Read it all when you get some free time to burn:
      
   https://forum.pellesc.de/index.php?topic=7167.msg27217#msg27217   
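   To make the loopless-vs-loop distinction concrete: in C11,
   atomic_fetch_add can compile straight to a single LOCK XADD on x86-64,
   while emulating it with compare-exchange (what that compiler emitted as
   a LOCK CMPXCHG loop) is only lock-free, not wait-free. A minimal sketch
   of the two shapes -- function names are mine, just for illustration:

   ```c
   #include <stdatomic.h>
   #include <stdio.h>

   /* Direct form: on x86-64 this can lower to a single LOCK XADD,
    * so every call completes in a bounded number of steps (wait-free). */
   static int fetch_add_direct(atomic_int *p, int v)
   {
       return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
   }

   /* CAS-loop emulation: the LOCK CMPXCHG shape. Only lock-free --
    * under contention a thread can, in principle, retry indefinitely. */
   static int fetch_add_cas_loop(atomic_int *p, int v)
   {
       int old = atomic_load_explicit(p, memory_order_relaxed);
       while (!atomic_compare_exchange_weak_explicit(
                  p, &old, old + v,
                  memory_order_seq_cst, memory_order_relaxed))
           ; /* a failed CAS reloads 'old'; loop until the store sticks */
       return old;
   }

   int main(void)
   {
       atomic_int counter = 0;
       int a = fetch_add_direct(&counter, 1);   /* returns 0 */
       int b = fetch_add_cas_loop(&counter, 1); /* returns 1 */
       printf("%d %d %d\n", a, b, atomic_load(&counter)); /* 0 1 2 */
       return 0;
   }
   ```

   Both return the previous value, so the results are indistinguishable
   to callers; only the progress guarantee differs, which is exactly why
   the disassembly is the only way to catch the substitution.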
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   