From: cr88192@gmail.com   
      
   On 12/29/2025 1:55 PM, MitchAlsup wrote:   
   >   
   > BGB posted:   
   >   
   >> On 12/28/2025 5:53 PM, MitchAlsup wrote:   
   >>>   
   >>> "Chris M. Thomasson" posted:   
   >>>   
   >>>> On 12/28/2025 2:04 PM, MitchAlsup wrote:   
   >>>>>   
   >>>>> "Chris M. Thomasson" posted:   
   >>>>>   
   >>>>>> On 12/22/2025 1:49 PM, Chris M. Thomasson wrote:   
   >>>>>>> On 12/21/2025 1:21 PM, MitchAlsup wrote:   
   >>>>>>>>   
   >>>>>>>> "Chris M. Thomasson" posted:   
   >>>>>>>>   
   >>>>>>>>> On 12/21/2025 10:12 AM, MitchAlsup wrote:   
   >>>>>>>>>>   
   >>>>>>>>>> John Savard posted:   
   >>>>>>>>>>   
   >>>>>>>>>>> On Sat, 20 Dec 2025 20:15:51 +0000, MitchAlsup wrote:   
   >>>>>>>>>>>   
   >>>>>>>>>>>> For argument setup (calling side) one needs MOV {R1..R5},{Rm,Rn,Rj,Rk,Rl}
   >>>>>>>>>>>> For returning values (calling side)   needs MOV {Rm,Rn,Rj},{R1..R3}
   >>>>>>>>>>>>
   >>>>>>>>>>>> For loop iterations                   needs MOV {Rm,Rn,Rj},{Ra,Rb,Rc}
   >>>>>>>>>>>>   
   >>>>>>>>>>>> I just can't see how to make these run reasonably fast within the   
   >>>>>>>>>>>> constraints of the GBOoO Data Path.   
   >>>>>>>>>>>   
   >>>>>>>>>>> Since you actually worked at AMD, presumably you know why I'm
   >>>>>>>>>>> mistaken here...
   >>>>>>>>>>>   
   >>>>>>>>>>> when I read this, I thought that there was a standard technique for   
   >>>>>>>>>>> doing   
   >>>>>>>>>>> stuff like that in a GBOoO machine.   
   >>>>>>>>>>   
   >>>>>>>>>> There is::: it is called "load 'em up, pass 'em through". That is no   
   >>>>>>>>>> different than any other calculation, except that no mangling of the   
   >>>>>>>>>> bits is going on.   
   >>>>>>>>>>   
   >>>>>>>>>>> Just break down all the fancy
   >>>>>>>>>>> instructions into RISC-style pseudo-ops. But apparently, since you   
   >>>>>>>>>>> would   
   >>>>>>>>>>> know all about that, there must be a reason why it doesn't apply in   
   >>>>>>>>>>> these   
   >>>>>>>>>>> cases.   
   >>>>>>>>>>   
   >>>>>>>>>> x86 has short/small MOV instructions; not so with RISCs.
   >>>>>>>>>   
   >>>>>>>>> Does your EMS use a so-called LOCK MOV? For some damn reason I
   >>>>>>>>> remember something like that. The LOCK "prefix" for, say, XADD,
   >>>>>>>>> CMPXCHG8B, etc.
   >>>>>>>>   
   >>>>>>>> The 2-operand+displacement LD/STs have a lock bit in the instruction--
   >>>>>>>> that is, it is not a prefix. MOV in My 66000 is reg-reg or reg-constant.
   >>>>>>>>   
   >>>>>>>> Oh, and it's ESM, not EMS: Exotic Synchronization Method.
   >>>>>>>>   
   >>>>>>>> In order to get ATOMIC-ADD-to-Memory, I will need an
   >>>>>>>> Instruction-Modifier {A.K.A. a prefix}.
   >>>>>>>   
   >>>>>>> Thanks for the clarification.   
   >>>>>>   
   >>>>>> On x86/x64 LOCK XADD is a loopless wait free operation.   
   >>>>>>   
   >>>>>> I need to clarify. Okay, on the x86 a LOCK XADD will make for a loopless
   >>>>>> impl. If we are on another system and that LOCK XADD is some sort of LL/SC
   >>>>>> "style" loop, well, that causes damage to my loopless claim... ;^o
   >>>>>>   
   >>>>>> So, can your system get wait free semantics for RMW atomics?   
   >>>>>   
   >>>>> A::   
   >>>>>   
   >>>>> ATOMIC-to-Memory-size [address]   
   >>>>> ADD Rd,--,#1   
   >>>>>   
   >>>>> Will attempt an ATOMIC add to L1 cache. If line is writeable, ADD is
   >>>>> performed and line updated. Otherwise, the Add-to-memory #1 is shipped   
   >>>>> out over the memory hierarchy. When the operation runs into a cache   
   >>>>> containing [address] in the writeable-state the add is performed and   
   >>>>> the previous value returned. If [address] is not writeable the cache   
   >>>>> line is invalidated and the search continues outward. {This protocol
   >>>>> depends on writeable implying {exclusive or modified} which is typical.}   
   >>>>>   
   >>>>> When [address] reaches the Memory-Controller it is scheduled in arrival
   >>>>> order, other caches system wide will receive CI, and modified lines   
   >>>>> will be pushed back to DRAM-Controller. When CI is "performed" MC/   
   >>>>> DRC will perform add #1 to [address] and previous value is returned   
   >>>>> as its result.   
   >>>>>   
   >>>>> {{That is the ADD is performed where the data is found in the   
   >>>>> memory hierarchy, and the previous value is returned as result;   
   >>>>> with all cache-effects and coherence considered.}}   
   >>>>>   
   >>>>> A HW guy would not call this wait free--since the CPU is waiting   
   >>>>> until all the nuances get sorted out, but SW will consider this   
   >>>>> wait free since SW does not see the waiting time unless it uses   
   >>>>> a high precision timer to measure delay.   
   >>>>   
   >>>> Good point. Humm. Well, I just don't want to see the disassembly of   
   >>>> atomic fetch-and-add (aka LOCK XADD) go into a LL/SC loop. ;^)   
   >>>   
   >>> If you do it LL/SC-style you HAVE to bring data to "this" particular   
   >>> CPU, and that (all by itself) causes n^2 to n^3 "buss" traffic under   
   >>> contention. So you DON'T DO IT LIKE THAT.
   >>>   
   >>> Atomic-to-Memory HAS to be done outside of THIS-CPU or it is not   
   >>> Atomic-to-Memory. {{Thus it deserves its own instruction or prefix}}   
   >>   
   >> IMHO:   
   >> No-Cache + CAS is probably a better bet than LL/SC;   
   >> LL/SC: Depends on the existence of explicit memory-coherency features.
   >> No-Cache + CAS: Can be made to work independent of the underlying memory   
   >> model.   
   >>   
   >> Granted, No-Cache is its own feature:   
   >> Need some way to indicate to the L1 cache that special handling is   
   >> needed for this memory access and cache line (that it should not use a   
   >> previously cached value and should be flushed immediately once the   
   >> operation completes).   
   >>   
   >>   
   >> But, No-Cache behavior is much easier to fake on a TSO capable memory   
   >> subsystem, than it is to accurately fake LL/SC on top of weak-model   
   >> write-back caches.   
   >   
   > My 66000 does not have a TSO memory system, but when one of these   
   > things shows up, it goes sequential consistency, and when it is done   
   > it flips back to causal consistency.   
   >   
   > TSO is cycle-wasteful.   
      
   But, yeah, I was not arguing for using TSO here, rather noting that if one
   has it, then No-Cache can be ignored for CAS.
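FWIW, the distinction can be sketched in C11 atomics (function names here are my own, purely illustrative): on x86, atomic_fetch_add typically compiles down to a single LOCK XADD, so it is loopless; emulating it with CAS gives a lock-free but not wait-free retry loop, which is roughly the shape an LL/SC mapping takes.

```c
#include <stdatomic.h>

/* Loopless on x86: typically compiles to a single LOCK XADD. */
long fetch_add_direct(atomic_long *p, long v)
{
    return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}

/* CAS emulation: lock-free, but not wait-free -- under contention
 * the loop may retry indefinitely (same shape as an LL/SC loop). */
long fetch_add_cas(atomic_long *p, long v)
{
    long old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + v,
               memory_order_seq_cst, memory_order_relaxed))
        ;  /* a failed CAS reloads 'old' with the current value */
    return old;
}
```

Both return the previous value; the difference is only in the progress guarantee.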
      
      
   But, then again, a weak model is cheaper to implement and generally
   faster, although the explicit synchronization is annoying, and such a
   model is incompatible with "lock free data structures" (which tend to
   implicitly assume that memory accesses occur in the same order as
   written and that any memory stores are immediately visible across threads).
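To make that implicit assumption concrete, here is a minimal C11 sketch (publish/consume are my own names): with plain stores, a weak-model machine may make the flag visible before the payload; release/acquire ordering restores the guarantee the lock-free code was silently counting on.

```c
#include <stdatomic.h>

static int        data;   /* payload, written before the flag */
static atomic_int ready;  /* publication flag, initially 0 */

/* Producer: the release store keeps the 'data' write from being
 * reordered after the flag becomes visible to other threads. */
void publish(int v)
{
    data = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: the acquire load guarantees that once the flag is
 * seen as set, the matching 'data' write is visible too. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until published */
    return data;
}
```

On a TSO machine the plain-store version happens to work anyway; on a weak model it does not, which is exactly the incompatibility mentioned above.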
      
      
   But, then again, one is left with one of several options:
   Ask that people use a mutex whenever accessing any resource that may be
   modified by another thread and where such modifications are functionally
   important;
   Or, alternatively, use a message-passing scheme, where message passing
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   