Cluster Hardware Torture Tests
The systems placed on the table were evaluated based on several criteria: quality of construction, physical design, accessibility, quality of power supply and cooling design. To start, the systems varied greatly in quality of construction. We found bent-over, jammed ribbon cables, blocked airflow, flexible cases and cheap, multiscrew access points that were unbelievably bad for a professional product. We found poor design decisions, including a power switch offset in the back of a system that was nearly inaccessible once the system was racked. On the positive side, we came across a few well-engineered systems.
Our evaluation included quality of airflow and cooling, rackability, size/weight and system layout. We also noted features such as drive bays at the front. Airflow is a big problem with hot x86 CPUs, especially in restricted spaces such as a 1U-rack system. Some systems had blocked airflow or little to no circulation. Heat can cause instability in systems and reduce operational lifetimes, so good airflow is critical.
Rigidity of the case, absence of sharp edges, how the system fits together and cabling also belong in this category. These might seem small, uninteresting factors until you get cut by a system case or have a large percentage of “dead on arrivals” because the systems were mishandled by the shipper and the cases were too weak to take the abuse. We have to use these systems for a number of years, so even a simple, glaring problem becomes a recurring pain and potentially an expensive one to maintain.
Tool-less access should be standard on all clustered systems. When you have thousands of systems, you are always servicing some of them. To keep the cost of that service low, parts should be quickly and easily replaceable. Unscrewing and screwing six to eight tiny machine screws slows down access to the hardware. Layouts in which one part does not have to come out to reach another, and which give easy access to drives, are pluses. Some features we did not ask for, like keyboard and monitor connections on the front of the case, are fine but not really necessary.
We tested the quality of the power supply using a Dranetz-BMI Power Quality Analyzer (see Sidebar). Power factor correction often is noted in the literature for a system, but we have seen measurements that differ radically from the published number. For example, one power supply with a published power factor of 0.96 actually measured 0.49. This can have terrible consequences when multiplied by 512 systems. We tested the systems at idle and under heavy load. The range of quality was dramatic and an important factor in choosing a manageable system.
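To make the cost of a poor power factor concrete, here is a rough arithmetic sketch. The per-system wattage and the supply voltage are assumed values chosen only for illustration, and the function names are hypothetical; only the two power factors and the count of 512 systems come from the text above.

```python
# Sketch: how the current drawn by a cluster scales with power factor.
# WATTS_PER_SYSTEM and VOLTS are assumptions, not measurements from the article.
WATTS_PER_SYSTEM = 200.0   # assumed real power per system
VOLTS = 120.0              # assumed supply voltage
SYSTEMS = 512              # cluster size mentioned in the text


def line_current_amps(real_power_w, power_factor, volts):
    """Current needed to deliver a given real power: I = P / (V * PF)."""
    return real_power_w / (volts * power_factor)


def cluster_current_amps(power_factor):
    """Total current for the whole cluster at the given power factor."""
    return SYSTEMS * line_current_amps(WATTS_PER_SYSTEM, power_factor, VOLTS)


if __name__ == "__main__":
    published = cluster_current_amps(0.96)
    measured = cluster_current_amps(0.49)
    print(f"published power factor 0.96: {published:.0f} A")
    print(f"measured power factor 0.49:  {measured:.0f} A")
    print(f"ratio: {measured / published:.2f}x")
```

Whatever wattage and voltage are assumed, the ratio depends only on the two power factors: 0.96 / 0.49 is roughly 1.96, so the wiring has to carry nearly twice the current for the same useful power.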
The physical inspection, features, cooling and power-supply quality tests weeded out a number of systems early in the process. Eliminating these right away reduced the number of systems that needed extensive testing, thereby reducing the amount of time spent on testing overall. System engineering, design and quality of parts ranged broadly.
Measuring Power Supply Quality
Power supplies often come with inaccurate quality claims. We have experienced a number of problems due to poor-quality power supplies, so we test every system's power supply. For an accurate measurement, a Power Quality Analyzer is used to measure systems at idle and under heavy load.
Prior to employing our test methods, SCS built a cluster with a poor power supply and experienced a range of problems. One of the most expensive problems was current being mismanaged by the power supply. Three-phase power distribution systems often are designed with the assumption of nicely balanced loads across the three phases, which results in the neutral current approaching zero. The resulting designs usually use the same gauge wiring on the neutral as on the supply.
Unfortunately, low-quality power supplies generate large third harmonic currents, which are additive in the neutral line. The potential result is neutral current in excess of the rated capacity of the wiring, to say nothing of the transformers that were not rated for such loading. Because the neutral cannot be fused by code, it was possible to exceed the neutral wiring capacity without tripping a breaker on the supply lines. This required a derating of all parts of the infrastructure to remain within spec. Derating is expensive and time consuming, and the cluster cannot be used while it is being done.
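The claim that third harmonic currents add in the neutral can be checked numerically. The sketch below is an idealized model, not a measurement: it assumes three perfectly balanced phases 120 degrees apart on a 60 Hz supply, each carrying the same fundamental and the same third harmonic, and the function names and amplitudes are hypothetical.

```python
import math

FREQ_HZ = 60.0  # assumed line frequency


def neutral_current(t, fundamental_amps, third_harmonic_amps):
    """Instantaneous neutral current: the sum of the three phase currents."""
    total = 0.0
    for k in range(3):
        # Each phase lags the previous one by 120 degrees.
        theta = 2.0 * math.pi * FREQ_HZ * t - k * 2.0 * math.pi / 3.0
        total += fundamental_amps * math.sin(theta)
        # Tripling the angle turns the 120-degree shift into 360 degrees,
        # so the third harmonics of all three phases line up.
        total += third_harmonic_amps * math.sin(3.0 * theta)
    return total


def peak_neutral_current(fundamental_amps, third_harmonic_amps, samples=3600):
    """Largest absolute neutral current sampled over one line cycle."""
    period = 1.0 / FREQ_HZ
    return max(
        abs(neutral_current(i * period / samples,
                            fundamental_amps, third_harmonic_amps))
        for i in range(samples)
    )


if __name__ == "__main__":
    print("no harmonics:  ", peak_neutral_current(10.0, 0.0))
    print("4 A 3rd harm.: ", peak_neutral_current(10.0, 4.0))
```

In this model the balanced fundamentals cancel in the neutral, while the third harmonic appears there at three times its per-phase amplitude, which is why a neutral sized like a phase conductor can end up overloaded.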
Thanks to Gary Buhrmaster for help on this Sidebar.
Run-in (often called burn-in) is the process manufacturers use to stress test systems to find faulty hardware before they put them in the field. A number of open-source run-in programs are available. One common program is the Cerberus Test Control System (sourceforge.net/projects/va-ctcs). It is a series of tests and configurable wrapper scripts originally designed for VA Linux Systems' manufacturing. Cerberus is ideal for run-in tests, but we also developed specific tests based on our knowledge of system faults. Our scripts crashed systems more often than the more general tool did. Testing with programs developed from system work experience can be more effective than using Cerberus alone, so consider creating a repository of testing tools.
Read the instructions carefully, and understand that run-in programs can damage a system; you assume the risk by running Cerberus. Also, there are a number of software knobs to turn, so consider what you are doing before you launch the program. But if you are going to build a cluster, you need to test system stability, and run-in scripts are designed to test exactly that quality.
At the time that we were testing systems, two members of our group wrote their own run-in scripts, based on some of the problems we had seen in our production systems. Whereas benchmarks try to measure system performance and often have sophisticated methods, run-in scripts are simple processes. A system is put under load and either passes or fails. A failure crashes the system or reports an error; a pass often does not report information. We also ran production code, which uncovered more problems. Production code should be run whenever possible. For instance, one of the systems that passed the initial design inspection tests with flying colors failed under heavy load.
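The pass-or-fail shape of a run-in script can be sketched in a few lines. This is not one of the scripts described above, and it is far gentler than a real stress test; the function names, workload sizes and duration are all hypothetical. It repeats a deterministic CPU workload and a memory pattern check, and reports failure the first time a result differs from what healthy hardware should produce.

```python
import hashlib
import time


def cpu_pass(rounds=20000):
    """Hash a fixed seed repeatedly; the digest must be the same on every run."""
    digest = b"run-in"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest


def memory_pass(size_bytes=1 << 20, pattern=0xA5):
    """Fill a buffer with a byte pattern and verify every byte reads back."""
    buf = bytearray([pattern]) * size_bytes
    return buf.count(pattern) == size_bytes


def run_in(duration_s=1.0):
    """Loop the checks until the deadline; return (passed, completed_loops)."""
    reference = cpu_pass()
    deadline = time.monotonic() + duration_s
    loops = 0
    while time.monotonic() < deadline:
        if cpu_pass() != reference:
            return False, loops
        if not memory_pass():
            return False, loops
        loops += 1
    return True, loops


if __name__ == "__main__":
    passed, loops = run_in(duration_s=5.0)
    print("PASS" if passed else "FAIL", f"after {loops} loops")
```

A system that crashes outright never prints anything, which matches the description above: a failure is a crash or a reported error, and a pass is mostly silence. A real run-in would load every CPU, far more memory, the disks and the network, and would run for much longer.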