|    sci.electronics.design    |    Electronic circuit design    |    143,102 messages    |
|    Message 141,550 of 143,102    |
|    Don Y to John R Walliker    |
|    Re: bit flips?    |
|    07 Dec 25 12:29:29    |
From: blockedofcourse@foo.invalid

On 12/7/2025 5:11 AM, John R Walliker wrote:
> It probably is being ecc checked. However, an energetic neutron
> or a cascade of charged particles might cause multiple bit errors
> which would not necessarily get detected or corrected. The fix

Note that we don't know if it is in memory, processor or other
sensing/control electronics. All we can infer is that changing
the way the processor addresses this "flight aspect" makes a
measurable difference.

> could be as simple as doing memory scrubs far more frequently
> than usual to reduce the probability of multi-bit errors
> accumulating to the point where they are not detected.

This is one of the problems with error correcting technologies;
if you don't *access* the memory, you have no way of knowing if
it has been corrupted (since the last time you accessed/checked it)!

The same holds true of RAID arrays; without a patrol read, disk errors
can *accumulate* between accesses -- you miss the first one and more
crop up before you ever get around to noticing ANY of them!

Unfortunately, scrubbing memory increases power consumption, slows
down performance AND can introduce other disturb events related to
access, in general. (Remember, errors are not the sole privilege of
memory /or one particular type of memory/ -- SRAM fails, too, along
with data buffers, etc.)

On top of all that, you have to actively USE the data from the
protected subsystem to see how "assertive" it is being. And you
have to know what that information *might* be indicating about the
number of undetectable (not just uncorrectable) errors you might
be encountering. This, of course, is a probabilistic function
that relies on the actual devices used, access patterns, etc.

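To put rough numbers on the scrub-interval trade-off discussed above, here is a minimal sketch. It is a toy model, not a statement about any real part: it assumes upsets arrive independently at a constant rate per ECC word (a Poisson process), that each upset flips one bit, and that a scrub repairs every single-bit error. It ignores the case John raises, where one particle flips several bits at once, and it ignores the cost of the scrub itself. The word count and upset rate are made-up placeholders.

```python
import math


def p_multi_upset(rate_per_word_hour, scrub_interval_hours):
    """Probability that one ECC word collects >= 2 upsets between scrubs.

    Poisson model: P(N >= 2) = 1 - exp(-m) * (1 + m), with m = rate * interval.
    """
    m = rate_per_word_hour * scrub_interval_hours
    if m < 1e-3:
        # Leading series terms; avoids cancellation when m is tiny.
        return m * m / 2.0 - m ** 3 / 3.0
    return 1.0 - math.exp(-m) * (1.0 + m)


def multi_upset_words_per_hour(n_words, rate_per_word_hour, scrub_interval_hours):
    """Expected number of words per hour that reach >= 2 upsets before a scrub."""
    p = p_multi_upset(rate_per_word_hour, scrub_interval_hours)
    return n_words * p / scrub_interval_hours


if __name__ == "__main__":
    N_WORDS = 2 ** 27   # hypothetical: 1 GiB organised as 64-bit words
    RATE = 1e-12        # hypothetical upsets per word per hour
    for hours in (24.0, 1.0, 1.0 / 60.0):
        r = multi_upset_words_per_hour(N_WORDS, RATE, hours)
        print(f"scrub every {hours:8.4f} h -> {r:.3e} multi-upset words/hour")
```

Under these assumptions the per-interval probability goes roughly as the square of the interval, so the hourly rate of accumulated multi-bit events scales about linearly with it: scrub twice as often and that rate roughly halves. What the model cannot tell you is how many errors arrive already multi-bit, which is the part no scrub schedule fixes.
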
For devices that can't simply be powered down and serviced,
you want to take action to move the "dubious" resources out of
your critical path. E.g., memory systems now dynamically
remap physical *pages* in much the same way disks remap sectors.
"Old thinking" puts new designs at risk.

[Expecting ECC to be a panacea is just naive -- imagine you got
an error on EVERY access to a resource. How confident would
you be that these are all being successfully corrected?? :> ]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
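The page-remapping idea in the post above can be sketched in a few lines: count corrected errors per physical page and pull a page out of service once it looks "dubious". This is only an illustration. The class name, the threshold and the 4 KiB page size are all hypothetical, and a real system would also have to migrate the page's contents and tell the allocator, which is not shown.

```python
from collections import Counter


class PageRetirer:
    """Toy bookkeeping: retire a page after too many corrected errors."""

    def __init__(self, threshold, page_size=4096):
        self.threshold = threshold      # hypothetical policy knob
        self.page_size = page_size      # assumed 4 KiB pages
        self.counts = Counter()         # corrected-error count per page number
        self.retired = set()            # page numbers taken out of service

    def record_corrected_error(self, phys_addr):
        """Log one corrected error; return True if this retires the page."""
        page = phys_addr // self.page_size
        if page in self.retired:
            return False
        self.counts[page] += 1
        if self.counts[page] >= self.threshold:
            self.retired.add(page)
            return True
        return False

    def is_retired(self, phys_addr):
        """True if the page containing phys_addr has been retired."""
        return phys_addr // self.page_size in self.retired


if __name__ == "__main__":
    retirer = PageRetirer(threshold=3)
    for addr in (0x1000, 0x1008, 0x1FF0):   # three hits in the same page
        if retirer.record_corrected_error(addr):
            print(f"retiring page containing {addr:#x}")
```

Note that this only ever sees errors the ECC logic reported as corrected, which is exactly the caveat in the post: the counter says nothing about errors that went undetected.
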
(c) 1994, bbs@darkrealms.ca