1 July 2010 - 1:48
Flaky RAM bit flips

A few days ago, a Ksplice blog entry titled “Attack of the Cosmic Rays!” described the process of tracking down a program error caused by a flipped bit in RAM. Basically, a code page (an ELF text section page) of the “expr” binary was sitting in buffer cache in RAM and was affected by a flipped bit. This changed bit caused the program to segfault consistently, and the problem was solved by manually clearing the OS buffer cache (causing the expr binary to be reloaded from disk).

My own experience
This reminded me of a similar incident from when I was an undergraduate taking my first non-introductory systems class. We had to write a multi-threaded web server, and I had been banging on the project pretty regularly for at least a week. I had gotten to the point where I was confident my code was stable and ready to be turned in when the program suddenly started segfaulting frequently and with very little provocation. Opening the core file in gdb showed that the segfaults happened when an obviously invalid pointer was dereferenced, which made no sense. I was reasonably certain that there was no way I could have missed a frequent, obvious segfault, so I was a bit perplexed. Then, while I was trying to debug the issue, other applications started to crash, like emacs, gaim and xterm. At that point I was actually relieved, since it was probably a hardware issue. I rebooted into memtest86 and thousands of memory errors were detected within minutes. I took out each DIMM in turn and retested, eventually narrowed the problem down to a single module, and continued using my system with half the RAM until the memory was RMA’d.

NULL with one bit flipped!
So suddenly what I was seeing in the debugger made sense. All of the invalid pointers were powers of 2: unusual addresses like 0x00000008. After I ran memtest, it clicked: on x86 (and many other platforms) a C NULL pointer’s runtime representation is the address with all bits 0. If one of those bits gets flipped, the pointer no longer compares equal to NULL, so the standard NULL checks let it through and it will probably be dereferenced like a valid pointer shortly thereafter.
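To make the pattern concrete, here is a minimal sketch that models a pointer as a plain integer. It is an illustration and not the code from my web server; the helper names and the 32-bit width are assumptions made for the example.

```python
# Model a pointer as an integer; NULL is the all-zero bit pattern.
NULL = 0
POINTER_BITS = 32  # assumption: a 32-bit x86 machine


def flip_bit(value, bit):
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)


def passes_null_check(ptr):
    """Mimic the C guard `if (ptr != NULL)` that precedes a dereference."""
    return ptr != NULL


def looks_like_flipped_null(ptr):
    """True if ptr is exactly one bit away from NULL, i.e. a power of two."""
    return ptr != 0 and (ptr & (ptr - 1)) == 0


if __name__ == "__main__":
    corrupted = flip_bit(NULL, 3)
    print(hex(corrupted))                      # 0x8
    print(passes_null_check(corrupted))        # True: the guard lets it through
    print(looks_like_flipped_null(corrupted))  # True: suspicious power of two
```

A check like `looks_like_flipped_null` is only a debugging heuristic. A small power-of-two pointer in a core file can also come from an ordinary software bug, so it is a hint to run a memory tester, not proof of bad RAM.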

ECC and more advanced things
I’m not an expert on DRAM hardware, but it’s possible that the bit error rate is not decreasing as fast as RAM capacity is increasing.* As more and more data is kept exclusively in RAM (like giant memcached clusters), bit errors could become increasingly problematic. ECC has generally been considered essential in servers (a paper from Google and University of Toronto in SIGMETRICS ’09 on this is interesting). And I make sure any machines with large storage have ECC RAM (especially ones with software RAID, since so much data manipulation happens in host memory). But now high-end machines are getting much more advanced memory error correction capabilities. For example, IBM’s xSeries servers have features they call “Active Memory”, including memory mirroring (like RAID-1 for RAM), Chipkill (a very strong ECC implementation capable of handling multiple bit errors even from the same memory module), and memory scrubbing. The last of these is very neat: it is basically the RAM equivalent of RAID scrubbing, which is an important process for maintaining disk arrays. Scrubbing detects latent errors before a failure; otherwise they will only be detected during a rebuild, when it may be impossible to recover.
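As a rough illustration of how ECC and scrubbing fit together, here is a toy sketch using a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is an assumption-laden model and not how IBM's hardware works: real ECC DIMMs typically protect 64-bit words with 8 extra check bits, and Chipkill is stronger still. The function names and the list-of-codewords "memory" are made up for the example.

```python
def encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions with bit 2 set
    return code[1:]


def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def correct(code):
    """Return (repaired codeword, error position); position 0 means no error seen."""
    s = syndrome(code)
    fixed = list(code)
    if s:
        fixed[s - 1] ^= 1  # for a single flip, the syndrome is the flipped position
    return fixed, s


def decode(code):
    """Correct up to one flipped bit, then extract the 4 data bits."""
    fixed, _ = correct(code)
    d = [fixed[2], fixed[4], fixed[5], fixed[6]]
    return sum(bit << i for i, bit in enumerate(d))


def scrub(memory):
    """Walk every stored codeword, repair single-bit errors in place, count repairs."""
    repaired = 0
    for addr, code in enumerate(memory):
        fixed, pos = correct(code)
        if pos:
            memory[addr] = fixed
            repaired += 1
    return repaired


if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    memory[5][2] ^= 1                    # a latent single-bit error
    print(scrub(memory))                 # 1
    print(decode(memory[5]))             # 5
```

The point of running `scrub` periodically is that a single-error-correcting code can only repair one flipped bit per word. If a latent error sits unread long enough for a second bit in the same word to flip, correction no longer recovers the original data, which is the same reasoning behind scrubbing disk arrays.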

* This problem can affect non-volatile storage, too (the error rate on large media is decreasing slightly slower than storage density is growing; that means single parity redundancy is increasingly likely to encounter an error during rebuild as disk sizes increase. See “Triple-Parity RAID and Beyond” for a bit on that issue as well as the more important throughput to capacity ratio consideration).
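The rebuild risk in this footnote can be estimated with a back-of-the-envelope model. The sketch below assumes bit errors are independent and uses an unrecoverable read error rate of 1 in 10^14 bits, a figure commonly quoted on drive spec sheets; both are simplifying assumptions, and the disk sizes and array width are hypothetical.

```python
import math


def rebuild_failure_probability(disk_bytes, surviving_disks, bit_error_rate):
    """Chance of at least one unreadable bit while reading every surviving disk in full.

    Assumes independent bit errors: P = 1 - (1 - rate) ** bits_read.
    """
    bits_read = disk_bytes * 8 * surviving_disks
    # expm1/log1p keep the result accurate when the error rate is tiny.
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))


if __name__ == "__main__":
    rate = 1e-14  # assumed unrecoverable read error rate per bit
    # Hypothetical 5-disk single-parity array: a rebuild reads the 4 surviving disks.
    for size in (500e9, 2e12):
        p = rebuild_failure_probability(size, 4, rate)
        print(f"{size / 1e12:.1f} TB disks: {p:.0%} chance of an error during rebuild")
```

With the error rate held fixed, quadrupling the disk size quadruples the number of bits a rebuild must read, so the probability of hitting an unreadable sector climbs quickly. A second parity disk allows the rebuild to survive such an error.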
