

   sci.electronics.design      Electronic circuit design      143,102 messages   


   Message 141,550 of 143,102   
   Don Y to John R Walliker   
   Re: bit flips?   
   07 Dec 25 12:29:29   
   
   From: blockedofcourse@foo.invalid   
      
   On 12/7/2025 5:11 AM, John R Walliker wrote:   
   > It probably is being ecc checked.  However, an energetic neutron   
   > or a cascade of charged particles might cause multiple bit errors   
   > which would not necessarily get detected or corrected.  The fix   
      
   Note that we don't know if it is in memory, processor or other   
   sensing/control electronics.  All we can infer is that changing   
   the way the processor addresses this "flight aspect" makes a   
   measurable difference.   
      
   > could be as simple as doing memory scrubs far more frequently   
   > than usual to reduce the probability of multi-bit errors   
   > accumulating to the point where they are not detected.   
      
   This is one of the problems with error correcting technologies;   
   if you don't *access* the memory, you have no way of knowing if   
   it has been corrupted (since the last time you accessed/checked it)!   
      
   The same holds true of RAID arrays; without a patrol read, disk errors   
   can *accumulate* between accesses -- you miss the first one and more   
   crop up before you ever get around to noticing ANY of them!   
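      
   [A back-of-envelope sketch of that accumulation argument, assuming   
   upsets hit a given word independently and at a constant rate (i.e.,   
   a Poisson model).  The function name, the rate and the intervals are   
   all made up for illustration; real rates depend on the device and   
   the environment.]   

```python
import math


def p_multi_bit(upset_rate_per_word: float, interval: float) -> float:
    """Probability that one word collects 2 or more upsets between checks.

    ASSUMPTION: upsets are independent and arrive at a constant rate, so the
    count per word in one interval is Poisson with mean rate * interval.
    """
    mean = upset_rate_per_word * interval
    # P(N >= 2) = 1 - P(0) - P(1) = (1 - e^-m) - m * e^-m.
    # expm1 keeps the first term accurate when the mean is tiny.
    return -math.expm1(-mean) - mean * math.exp(-mean)


# Hypothetical numbers: same upset rate, patrol every 10 vs. every 5 time units.
slow_patrol = p_multi_bit(1e-3, 10.0)
fast_patrol = p_multi_bit(1e-3, 5.0)
```

   For small rate*interval the result is close to (rate*interval)^2 / 2,   
   so halving the time between patrols cuts the multi-bit exposure   
   roughly fourfold.  It never reaches zero.   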
      
   Unfortunately, scrubbing memory increases power consumption, reduces   
   performance AND can itself introduce access-related disturb events.   
   (Remember, errors are not the sole privilege of memory /or one   
   particular type of memory/ -- SRAM fails, too, along with data   
   buffers, etc.)   
      
   On top of all that, you have to actively USE the data from the   
   protected subsystem to see how "assertive" it is being.  And, you   
   have to know what that information *might* be indicating about the   
   number of undetectable (not just uncorrectable) errors you might   
   be encountering.  This, of course, is a probabilistic function   
   that depends on the actual devices used, access patterns, etc.   
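      
   [To make the "undetectable, not just uncorrectable" distinction   
   concrete, here is a toy single-error-correct/double-error-detect   
   decoder built on an (8,4) extended Hamming code.  This is a sketch   
   only: the code actually used in the system under discussion is not   
   known, real memories use much wider words, and the function names   
   are mine.]   

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (check bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p makes parity even over every position whose index
        # has bit p set (code[p] is still 0 while we sum).
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


def decode(code):
    """Return (status, nibble); nibble is None when the word is flagged bad."""
    word = list(code)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i          # XOR of set positions is 0 for a valid word
    odd_parity = sum(word) % 2     # 1 means an odd number of bits flipped
    if odd_parity:
        # The decoder ASSUMES exactly one flip and "repairs" the bit the
        # syndrome points at (syndrome 0 means the overall parity bit).
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        return "uncorrectable", None   # even number of flips, at least two
    else:
        status = "clean"
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble


good = encode(0b1010)
one_flip = decode(flip(good, 6))           # repaired correctly
two_flips = decode(flip(good, 3, 6))       # flagged, not repaired
three_flips = decode(flip(good, 3, 5, 7))  # reported as repaired, data wrong
```

   The three-flip case is the nasty one: the decoder reports a   
   successful correction and hands back the wrong nibble.  Nothing   
   downstream is told that anything went wrong.   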
      
   For devices that can't simply be powered down and serviced,   
   you want to take action to move the "dubious" resources out of   
   your critical path.  E.g., memory systems now dynamically   
   remap physical *pages* in much the same way discs remap sectors.   
   "Old thinking" puts new designs at risk.   
      
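   [A minimal sketch of the "move dubious resources out of the   
   critical path" idea: count corrected errors per physical page and   
   remap a page to a spare once it crosses a threshold.  The class,   
   the threshold and the page numbers are hypothetical; a real system   
   would also have to copy the page contents and coordinate with the   
   memory controller/OS, which is omitted here.]   

```python
from collections import Counter


class PageRetirer:
    """Toy page-retirement policy driven by corrected-error reports."""

    def __init__(self, spare_pages, threshold=3):
        self.spares = list(spare_pages)   # unused physical pages held in reserve
        self.threshold = threshold        # corrected errors tolerated per page
        self.corrected = Counter()        # physical page -> corrected-error count
        self.remap = {}                   # logical page -> replacement physical page

    def translate(self, page):
        """Physical page currently backing a logical page."""
        return self.remap.get(page, page)

    def report_corrected_error(self, page):
        """Record one corrected error; return True if the page was remapped."""
        physical = self.translate(page)
        self.corrected[physical] += 1
        if self.corrected[physical] < self.threshold or not self.spares:
            return False                  # still trusted, or nothing left to swap in
        # Data migration to the spare is omitted in this sketch.
        self.remap[page] = self.spares.pop(0)
        return True


retirer = PageRetirer(spare_pages=[100, 101], threshold=3)
results = [retirer.report_corrected_error(7) for _ in range(3)]
```

   Note that the policy only ever sees errors that were *reported*,   
   so a page nobody reads or scrubs never accumulates a count.   
      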
   [Expecting ECC to be a panacea is just naive -- imagine you got   
   an error on EVERY access to a resource.  How confident would   
   you be that these are all being successfully corrected??  :> ]   
      



(c) 1994,  bbs@darkrealms.ca