Running Linux with Broken Memory
There are several common causes for failure of memory modules. Before solving the problems in software, let's turn to these causes. To understand them, please take a look at Figure 1, which shows schematically what the structure of a (faultless) memory is. Note how the memory is laid out in a (roughly equal) number of rows and columns. Each crossing marks the location of a memory cell, usually storing a single bit. As a result, a memory cell's address is the combination of its row address and column address. Memory modules are fed with these two halves of the address separately (sequenced), so that their address bus is only half the size you would expect. This halving of addresses is handled a by the hardware on your motherboard.
The first place where things can go wrong is during the manufacture of these memory modules. This is a very sensitive process, mainly because the dimensions of the chip layouts are getting smaller as the world grows hungry for more memory. Chip manufacture is also quite wasteful of environmental resources because a lot of highly purified chemicals are needed to construct the tiny, sand-derivative product that you eventually buy. Moreover, the sensitivity of the chip manufacturing process makes many chips fall out due to error.
One method of partially solving this problem is to build redundancy into the chips, say including 33 rows of memory cells for every 32 rows; when one row fails, route the signals around it to the extra row. Indeed, this technique is applied by some manufacturers. Nevertheless, unsustained rumors claim there still is a 60% drop-out rate. Needless to say, this factor makes memory expensive.
Errors caused during production can take the shape of a speck of dust that sat on the chip while enlightening or developing one of its layers, as in Figure 2. The grayed-out area is the part that may malfunction as a result of that speck of dust.
Another source of error is static discharge. This is the same effect that makes a woolen or nylon sweater spark a shock when taken off. The reason for the spark is a high voltage between two nearby surfaces. When they come close enough, the spark suddenly jumps over, ionizing the air it passes and, thus, suddenly making it conduct quite well, resulting in a high current for a very short period. Actually, this is the same effect as lightning, only not as powerful, of course. The small structures in a chip are quite sensitive to these aggressive discharges. Static discharge usually damages the “buffers”, the connections between rows/columns and the address selection logic, as shown in Figure 3. This effectively means that a whole row, or a whole column, or even a few of them, become unusable, shown again by the grayed-out area. There are the memory fault patterns that I have seen the most.
A last source of error, and up to now I have no certain proof of ever having encountered it, is the gradual decay that all chips are subjected to. This is the process by which the sharply distinct regions of different materials on the chip blur and mix together. This is a natural effect, and if a chip is kept sufficiently cool, it usually takes decades to happen. In the case of memory, no problem, it usually gets outdated for its speed and size long before this point.
The important point of all these errors is that they stick—I am not aware of any technical cause that could lead to random errors. Mind you, it may happen that errors only occur in particular patterns of the surrounding bits, but even then, the error once settled, always occurs in that situation.
Furthermore, errors don't happen for no reason. In all of the above situations, a sudden burst of energy causes the errors, or they are caused by a very slow process. So, we need not expect errors to jump up all of the time, not even on a defective memory module.
Because errors in memories do not evolve, it is usually possible to circumvent the errors by making smart use of dynamic memory allocation techniques.
There are basically two ways of detecting errors in memories. One approach is the use of a special program that scans all of the memory in the machine and reports the failing addresses. One such program that excels in finding errors is memtest86, targeted at i386-based machines. Naturally, trusting the results from such a memory checker relies on the stability of memory errors, which usually turns out to be an okay assumption; I've been running my machine for months without a change in the detected error patterns.
The alternative is based on hardware detection. In the 30-pin SIMM days, it was quite common to include a ninth bit with “parity” information on the memory module. The idea is that the ninth bit, which is actually stored separately on the memory module, contains the bit to make the parity of the whole chip even (or odd, whatever the design of the motherboard wants). When writing, this bit is generated, and when reading back, it is checked. If this read-time check fails, a parity error is thrown.
A modern alternative to parity bits is ECC, short for Error Correction (and detection) Code. These are usually based on some CRC hash from the bits stored, and they can be used to detect up to three faulty bits out of 32, as well as to correct up to two faulty bits. (I am actually uncertain about these precise values, but it's the principle that counts.) So, with ECC RAM, it is possible to detect errors, as well as correct them if they are not too savage.
Modern memory modules as we use them in PC workstations usually come with neither parity nor ECC. On the other hand, it is standard to find ECC in high-end (server) systems as sold by VA Linux and Sun. These usually work with ECC memory, or at least with parity memory. I believe that most common PCs are capable of working with ECC memory, they just don't get that memory plugged in because ECC RAM is more expensive. Yes, this is the low-quality world of PCs blinking an eye at you.
With parity RAM or ECC RAM it would be possible to detect errors on the fly, without the need for a special program like memtest86. Currently however, the BadRAM patch we present here is kept as simple as possible (to increase chances of acceptance in mainstream kernels) and, therefore, does not deal with the added possibilities of ECC RAM.
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
|Non-Linux FOSS: Seashore||May 10, 2013|
- RSS Feeds
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- Dynamic DNS—an Object Lesson in Problem Solving
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Download the Free Red Hat White Paper "Using an Open Source Framework to Catch the Bad Guy"
- Tech Tip: Really Simple HTTP Server with Python
- Keeping track of IP address
1 hour 2 min ago
- Roll your own dynamic dns
6 hours 16 min ago
- Please correct the URL for Salt Stack's web site
9 hours 27 min ago
- Android is Linux -- why no better inter-operation
11 hours 43 min ago
- Connecting Android device to desktop Linux via USB
12 hours 11 min ago
- Find new cell phone and tablet pc
13 hours 9 min ago
14 hours 38 min ago
- Automatically updating Guest Additions
15 hours 47 min ago
- I like your topic on android
16 hours 33 min ago
- This is the easiest tutorial
23 hours 9 min ago