From: user5857@newsgrouper.org.invalid   
      
   David Brown posted:   
      
   > On 10/12/2025 21:10, MitchAlsup wrote:   
   > >   
   > > David Brown posted:   
   > >   
   >   
   > >> OK. I can see the advantages of that - though there are disadvantages   
   > >> too (such as being unable to control a limit on the number of retries,   
   > >> or add SW tracking of retry counts for metrics).   
   > >   
   > > esm attempts to allow SW to program with features previously available   
   > > only at the µCode level. µCode allows for many µinstructions to execute   
   > > before/between any real instructions.   
   > >   
   > >> My main concern was   
   > >> the disconnect between how the code was written and what it actually does.   
   > >   
   >   
   > Perhaps it would be better to think of these sequences in assembler   
   > rather than C - you want tighter control than C normally allows, and you   
   > don't want optimisers re-arranging things too much.   
      
   Heck, there are assemblers that rearrange code like this too much--   
   until they can be taught not to.   
      
   > > There is a 26 page specification the programmer needs to read and
   > > understand.
   > > This includes things we have not talked about--such as::
   > > a) terminating an event without writing anything   
   > > b) proactively minimizing future interference   
   > > c) modifications to cache coherence model   
   > > at the architectural level.   
   >   
   > Fair enough. This is not a minor or simple feature!   
      
   No, it is a design that allows for ISA to remain static while all sorts of   
   synchronization stuff gets written, tested, and tuned.   
      
   > >   
   > > The architectural specification allows for various scales of µArchitecture   
   > > to independently choose how to implement esm and provide the architectural   
   > > features at SW level. For example the kinds of esm activities for a 1-wide   
   > > In-Order µController are vastly different than those suitable for a server
   > > scale rack of processor ensembles. What we want is one SW model that covers   
   > > the whole gamut.   
   > >   
   > >>> 4th:: one cannot test esm with a random code generator, since the
   > >>> probability that the random code generator creates a legal esm event
   > >>> is exceedingly low.
   > >>   
   > >>   
   > >> Testing and debugging any kind of locking or atomic access solution is   
   > >> always very difficult. You can rarely try out conflicts or potential   
   > >> race conditions in the lab - they only ever turn up at customer demos!   
   > >   
   > > Right at Christmas time !! {Ask me how I know}.   
   >   
   > We can gather round the fire, and Grampa can settle in his rocking chair   
   > to tell us war stories from the olden days :-)   
   >   
   > A good story is always nice, so go for it!   
      
   Year:: 1997, time 7 days before Christmas:: situation, Customer is   
   having (and has had) strange bugs that happen about once a week.   
   Customer is unhappy, we have had a senior engineer on site for
   4 months without forward progress. We were told "You don't come home   
   until the problem is fixed".   
      
   System:: 2 (or more) of our cache coherent motherboards, connected   
   with a proven cache coherent bus.
      
   On the flight from Austin to Manchester, England, I decided that what
   we have is a physics experiment. So, when we arrived, I had their SW
   guy code up a routine that, as soon as it got a time slice, would
   signal that it no longer needed time. Meanwhile, we hooked up the logic
   analyzer to our motherboards and to their bus. When SW was ready (about
   30 minutes) we tried the case--Instantly, the interval between
   occurrences of the bug went from about a week to milliseconds. We
   spent the afternoon taking
   logic analyzer traces, and went to dinner.   
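
A rough sketch of that kind of "give the time slice straight back" routine, written here in Python purely for illustration. This is not the customer's code, and the original bug was in bus hardware rather than in software. The names `yield_now`, `hammer` and `run_stress` are made up for this sketch, and the shared counter is only a stand-in for "some state that can be caught mid-update". The idea is the same: yielding constantly forces far more context switches, so a window that is normally hit once in a long while gets hit very often.

```python
import os
import threading
import time


def yield_now():
    """Give up the rest of the current time slice, where the platform allows it."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()      # POSIX: ask the scheduler to run someone else
    else:
        time.sleep(0)         # fallback: still a hint to reschedule


def hammer(shared, iterations):
    """Deliberately unsynchronized read-modify-write with a yield in the middle."""
    for _ in range(iterations):
        value = shared["count"]        # read
        yield_now()                    # hand the slice back inside the window
        shared["count"] = value + 1    # write back a possibly stale value


def run_stress(threads=4, iterations=2000):
    """Run several hammer threads; return (expected, observed) counts."""
    shared = {"count": 0}
    workers = [
        threading.Thread(target=hammer, args=(shared, iterations))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return threads * iterations, shared["count"]


if __name__ == "__main__":
    expected, observed = run_stress()
    print("expected", expected, "observed", observed)
```

With more than one thread the observed count may come out lower than the expected one, and how often that happens depends on the scheduler. With a single thread the two always match.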
      
   The next day, we went through the traces with a fine-tooth comb and
   found a smoking gun--so we ran more experiments and this same smoking
   gun was found in each trace. After a couple of hours, we found that
   their proven coherent bus was allowing 1 single cycle where our bus
   could be seen in an inconsistent state, and it was only a dozen
   cycles downstream that the crash occurred.
      
   It turns out that their bus was only coherent when the attached bus   
   was slower than 4 cycles to do "random coherent message", whereas   
   our bus was timed at 2 cycles for this response.
      
   So, we took apart their FPGA, which ran the bus, found out how to
   delay one signal, and reprogrammed it--ONLY to run into another message
   that was off by 1 or 2 cycles. This one took a whole day to find and   
   program around.   
      
   We both made it home for Christmas, and in some part saved the company...   
      
   > (We once had a system where there was a bug that not only triggered only   
   > at the customer's site, but did so only on the 30th of September. It   
   > took years before we made the connection to the date and found the bug.)   
   >   
   >   
      