comp.arch
David Brown to MitchAlsup
Re: Memory ordering (Re: Multi-precision
08 Dec 25 10:07:25
   From: david.brown@hesbynett.no   
      
   On 06/12/2025 18:44, MitchAlsup wrote:   
   >   
   > David Brown  posted:   
   >   
   >> On 05/12/2025 21:54, MitchAlsup wrote:   
   >>>   
   >>> David Brown  posted:   
   >>>   
   >>>> On 05/12/2025 18:57, MitchAlsup wrote:   
   >>>>>   
   >>>>> anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:   
   >>>>>   
   >>>>>> David Brown  writes:   
   >>>>>>> "volatile" /does/ provide guarantees - it just doesn't provide enough   
   >>>>>>> guarantees for multi-threaded coding on multi-core systems.  Basically,   
   >>>>>>> it only works at the C abstract machine level - it does nothing that   
   >>>>>>> affects the hardware.  So volatile writes are ordered at the C level,   
   >>>>>>> but that says nothing about how they might progress through storage   
   >>>>>>> queues, caches, inter-processor communication buses, or whatever.   
   >>>>>>   
   >>>>>> You describe in many words and not really to the point what can be   
   >>>>>> explained concisely as: "volatile says nothing about memory ordering   
   >>>>>> on hardware with weaker memory ordering than sequential consistency".   
   >>>>>> If hardware guaranteed sequential consistency, volatile would provide   
   >>>>>> guarantees that are as good on multi-core machines as on single-core   
   >>>>>> machines.   
   >>>>>>   
   >>>>>> However, for concurrent manipulations of data structures, one wants   
   >>>>>> atomic operations beyond load and store (even on single-core systems),   
   >>>>>   
   >>>>> Such as ????   
   >>>>   
   >>>> Atomic increment, compare-and-swap, locks, loads and stores of sizes   
   >>>> bigger than the maximum load/store size of the processor.   
   >>>   
   >>> My 66000 ISA can::   
   >>>   
   >>> LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.   
   >>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.   
   >>>   
   >>   
   >> The functions below rely on more than that - to make them work, as far as   
   >> I can see, you need the first "esmLOCKload" to lock the bus and also   
   >> lock the core from any kind of interrupt or other pre-emption, lasting   
   >> until the esmLOCKstore instruction.  Or am I missing something here?   
   >   
   > In the above, I was stating that the maximum width of LD/ST can be a lot   
   > bigger than the size of a single register, not that the above instructions   
   > make writing ATOMIC events easier.   
   >   
      
   That's what I assumed.   
      
   Certainly there are situations where it can be helpful to have longer   
   atomic reads and writes.  I am not so sure about allowing 8 KB atomic   
   accesses, especially in a system with multiple cores - that sounds like   
   letting user programs DoS everything else on the system.   
      
   > These is no bus!   
      
   I think there's a typo or some missing words there?   
      
   >   
   > The esmLOCKload causes the  address to be 'monitored'   
   > for interference, and to announce participation in the ATOMIC event.   
   >   
   > The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,   
   > AND sets up a default control point (This instruction itself) so that   
   > if interference is detected at esmLOCKstore control is transferred to   
   > that control point.   
   >   
   > So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set   
   > for free.   
      
   If I understand you correctly here, you basically have a "load-reserve /   
   store-conditional" sequence as commonly found in RISC architectures, but   
   you have the associated loop built into the hardware?  I can see that   
   potentially improving efficiency, but I also find it very difficult to   
   read or write C code that has hidden loops.  And I worry about how it   
   would all work if another thread on the same core or a different core   
   was running similar code in the middle of these sequences.  It also   
   reduces the flexibility - in some use-cases, you want to have software   
   limits on the number of attempts of a lr/sc loop to detect serious   
   synchronisation problems.   
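
   To illustrate the software limit on attempts, here is a minimal sketch in
   portable C11, not My 66000 code.  It uses atomic_compare_exchange_weak,
   which compilers typically lower to an lr/sc pair on architectures that
   have one.  The names bounded_fetch_add, max_attempts and attempts_used
   are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: atomically add `delta` to *counter, but give up
 * after `max_attempts` failed compare-exchange attempts instead of
 * spinning forever.  Returns true on success; *attempts_used reports
 * how many attempts were made either way. */
static bool bounded_fetch_add(_Atomic unsigned *counter, unsigned delta,
                              unsigned max_attempts, unsigned *attempts_used)
{
    assert(max_attempts > 0);

    /* Initial snapshot; a failed compare-exchange refreshes it for us. */
    unsigned expected = atomic_load_explicit(counter, memory_order_relaxed);

    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        /* The weak form may fail spuriously, much like a
         * store-conditional that loses its reservation. */
        if (atomic_compare_exchange_weak_explicit(
                counter, &expected, expected + delta,
                memory_order_acq_rel, memory_order_relaxed)) {
            *attempts_used = attempt;
            return true;
        }
        /* On failure `expected` now holds the current value; retry. */
    }

    /* Too much contention: the retry loop is visible in the source, so
     * the caller can log this, back off, or treat it as a fault. */
    *attempts_used = max_attempts;
    return false;
}
```

   A caller would pick max_attempts to suit the application and treat a
   false return as a sign of a serious synchronisation problem.  With the
   retry built into the hardware, as I understand esmLOCKload to work, that
   policy has nowhere obvious to live in the C source.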
      
   >   
   > There is a branch-on-interference instruction that   
   > a) does what it says,   
   > b) sets up an alternate atomic control point.   
   >   
   >> It is not easy to have atomic or lock mechanisms on multi-core systems   
   >> that are convenient to use, efficient even in the worst cases, and don't   
   >> require additional hardware.   
   >   
   > I am using the "Miss Buffer" as the point of monitoring for interference.   
   > a) it already has to monitor "other hits" from outside accesses to deal   
   >     with the coherence mechanism.   
   > b) that esm additions to Miss Buffer are on the order of 2%   
   >   
   > c) there are other means to strengthen guarantees of forward progress.   
   >>   
   >>   
   >>> Compare Double, Swap Double::   
   >>>   
   >>> BOOLEAN DCAS( type oldp, type oldq,   
   >>>                 type *p,   type *q,   
   >>>                 type newp, type newq )   
   >>> {   
   >>>        type t = esmLOCKload( *p );   
   >>>        type r = esmLOCKload( *q );   
   >>>        if( t == oldp && r == oldq )   
   >>>        {   
   >>>                           *p = newp;   
   >>>             esmLOCKstore( *q,  newq );   
   >>>             return TRUE;   
   >>>        }   
   >>>        return FALSE;   
   >>> }   
   >>>   
   >>> Move Element from one place to another:   
   >>>   
   >>> BOOLEAN MoveElement( Element *fr, Element *to )   
   >>> {   
   >>>        Element *fn = esmLOCKload( fr->next );   
   >>>        Element *fp = esmLOCKload( fr->prev );   
   >>>        Element *tn = esmLOCKload( to->next );   
   >>>        esmLOCKprefetch( fn );   
   >>>        esmLOCKprefetch( fp );   
   >>>        esmLOCKprefetch( tn );   
   >>>        if( !esmINTERFERENCE() )   
   >>>        {   
   >>>                      fp->next = fn;   
   >>>                      fn->prev = fp;   
   >>>                      to->next = fr;   
   >>>                      tn->prev = fr;   
   >>>                      fr->prev = to;   
   >>>        esmLOCKstore( fr->next,  tn );   
   >>>                      return TRUE;   
   >>>        }   
   >>>        return FALSE;   
   >>> }   
   >>>   
   >>> So, I guess, you are not talking about what My 66000 cannot do, but   
   >>> only what other ISAs cannot do.   
   >>   
   >> Of course.  It is interesting to speculate about possible features of an   
   >> architecture like yours, but it is not likely to be available to anyone   
   >> else in practice (unless perhaps it can be implemented as an extension   
   >> for RISC-V).   
   >>   
   >>>>                                                              Even with a   
   >>>> single core system you can have pre-emptive multi-threading, or at least   
   >>>> interrupt routines that may need to cooperate with other tasks on data.   
   >>>>   
   >>>>>   
   >>>>>> and I don't think that C with just volatile gives you such guarantees.   
   >>>>>>   
   >>>>>> - anton   
   >>>>   
   >>   
      