One of our best servers at work was inherited from a previous engagement for x64 testing: it's a dual Opteron 250 with 8 gigabytes of RAM. Even after a year of service, those are still decent specs. And it has a nice upgrade path, too: the Tyan Thunder K8W motherboard it's based on supports up to 16 gigabytes of memory, and the latest dual core Opterons.
Anyway, we have it set up for Virtual Server 2005 R2 duties, running Windows Server 2003 x64. However, there was some anomalous behavior:
We've used this server for over a year and never experienced anything problematic with it. The weirdness only started with the server's new role.
The first thing we did was update the BIOS to the latest version, and make sure we had all the latest x64 chipset and platform drivers installed. This is always a good first troubleshooting step-- it's the hardware equivalent of taking two aspirin and calling in the morning. This resolved the "some nodes of this machine do not have local memory" error. However, the machine still spontaneously rebooted overnight, even with the latest BIOS and drivers.
At this point I began to suspect a hardware problem. Troubleshooting hardware stability can be difficult, but you can do it quite effectively with the right software: Memtest86+ and Prime95.
We started with Memtest86+ because we already suspected the memory. Memtest86+ isn't the only memory testing diagnostic out there, but it's probably the best known. Microsoft also offers their Windows Memory Diagnostic utility, which works in much the same way. Memtest86+ is available in several forms from the Memtest86+ web site. We chose the ISO image, which we burned to CD. Boot from the Memtest86+ CD, and it'll kick off the test run.
It took about 30-45 minutes to test 4 gigabytes of memory. The progress bar at the top right gives you an indication of how long the test has to run; there are 8 total tests in the standard test run. Beware, because it'll start repeating at test #1 after the first pass!
Prime95 is my single favorite PC stability testing tool. If your PC can't pass an overnight Prime95 run, it absolutely, positively has a hardware problem.* Although Prime95 is primarily a CPU test, it can be a pretty good memory test, too. After downloading it, go to the Options menu and select Torture Test.
If you have a Dual (or Quad) CPU machine, you must run multiple instances of Prime95 to load each CPU. The easiest way to do this is to copy the Prime95 folder and run multiple executables, each one from a unique folder. You may want to set CPU affinity on the executables with Task Manager, but the scheduler will take care of loading all the CPUs just fine by itself.
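If you do want to pin each instance to its own CPU, affinity is just a bitmask with one bit per CPU. Here's a minimal sketch in Python that builds one launch command per CPU. The `C:\prime95-N` folder names are hypothetical, and it assumes your version of Windows has a `start` command that accepts the `/AFFINITY` switch with a hex mask; if it doesn't, set affinity in Task Manager instead.

```python
def affinity_masks(cpu_count):
    """Return one single-bit mask per CPU: CPU 0 -> 0x1, CPU 1 -> 0x2, and so on."""
    return [1 << cpu for cpu in range(cpu_count)]


def launch_commands(cpu_count, folder_pattern=r"C:\prime95-{n}"):
    """Build one 'start' command line per CPU.

    folder_pattern is a hypothetical layout: one copy of the Prime95
    folder per instance, as described above.
    """
    commands = []
    for n, mask in enumerate(affinity_masks(cpu_count)):
        folder = folder_pattern.format(n=n)
        # /D sets the working folder; /AFFINITY takes the mask in hex.
        commands.append(f'start "" /D "{folder}" /AFFINITY {mask:X} prime95.exe')
    return commands


if __name__ == "__main__":
    # Print the commands for a dual CPU machine; paste them into a batch file.
    for line in launch_commands(2):
        print(line)
```

This only prints the commands rather than running them, so you can review them before anything gets launched.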
A bit of warning, though: when Prime95 says "lots of RAM tested", they mean it. We tried running two instances of "Blend" with only 4 gigabytes of memory installed on the server and we nearly crushed the pagefile; both instances allocated nearly 6 gigabytes!
In my experience, Prime95 will error out almost immediately if your CPUs or memory are unstable. This is great for troubleshooting because you know quickly if there's a problem or not. If you can run Prime95 "small FFTs" for an hour, it's highly likely that the CPU isn't your problem. And if you can run the same test overnight, CPU problems can be definitively ruled out.
In the case of our wayward server, Memtest86+ showed us rare, intermittent memory problems. But Prime95 consistently failed almost immediately when running the "blend" test. When we switched Prime95 to "small FFTs", it ran two instances for an hour just fine. Clearly a memory issue! Using a combination of Memtest86+ and Prime95, we found that our server was totally stable with 4 gigabytes of memory installed; the minute we put in all 8 gigabytes, we couldn't pass one or both tests.
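The reasoning we followed boils down to a small decision table. Here's a rough sketch of it in Python; the function name and the result strings are made up for illustration, and it's a rule of thumb from this one troubleshooting session, not a rigorous diagnostic.

```python
def diagnose(small_ffts_pass, blend_pass, memtest_pass):
    """Guess where a stability problem lives from three test results.

    small_ffts_pass: Prime95 "small FFTs" ran without errors
    blend_pass:      Prime95 "blend" ran without errors
    memtest_pass:    Memtest86+ completed a pass without errors
    """
    if not small_ffts_pass:
        # Small FFTs mostly exercise the CPU, so suspect the CPU side
        # (in practice usually heat or power, not a defective chip).
        return "cpu, heat, or power"
    if not blend_pass or not memtest_pass:
        # CPU test is clean but the memory-heavy tests fail.
        return "memory"
    return "no problem found"


if __name__ == "__main__":
    # Our server: small FFTs passed, blend failed, Memtest86+ showed errors.
    print(diagnose(small_ffts_pass=True, blend_pass=False, memtest_pass=False))
```

For our server this lands on "memory", which matches what we saw once we pulled half the RAM.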
Since 8 gigabytes of memory is essential for a VM server, removing memory wasn't an option. On a hunch, I switched the memory speed from 200 MHz to 166 MHz in the BIOS. Now both Prime95 blend and Memtest86+ pass without incident.
Although software is notoriously unreliable, we can't always blame the software. Sometimes you really do have a hardware problem.
* CPUs are almost never defective; it's usually a heat or power supply related failure.
Funny coda to this story: this server was shipped to us as-is directly from the CPU manufacturer. Although this is clearly a motherboard (Tyan) problem, it's still funny!
Jeff Atwood on August 13, 2006 4:11 AM
In 4 1/2 years of building computers in a store many years ago, I've only once seen a bad Intel CPU; the most common causes of hardware crashes were motherboards, followed by RAM. However, during that same time, I did see quite a few dead (never malfunctioning) AMD chips.
David M. Kean on August 13, 2006 7:27 AM
So the link to gamepc is just to showcase the specs? They seem to have decent machines. Have you had any dealings with them?
Stephen Patten on August 13, 2006 9:33 AM
I once got a phone call that a server I was responsible for had stopped responding. I VPN'ed to it, and it was processing transactions just fine. I immediately suspected that it was a router problem because we'd seen the same problem before with the routers. (We used Tibco Rendezvous for messaging, and it uses broadcast UDP packets. Our routers understood this and intelligently forwarded packets appropriately.) Infrastructure checked the router in question and it was fine. I then logged into my test server, which was on the same segment and it thought that the production server was dead, too.
I then logged back into the production server, and checked the event log. One of the CPUs had died. Windows continued chugging along merrily, although it would no longer send out UDP packets. Rebooting fixed the network issue, and we replaced the dead CPU over the weekend. It still amazes me that a CPU dying would cause such an odd error.
Nice work Inspector Gadget!
Haacked on August 13, 2006 12:36 PM
Very interesting. Now if only there was a way to test hard disks as thoroughly (although we all know how long that would take).
Josh Lewis on August 14, 2006 5:44 AM
Josh:
SpinRite for hard drives!
http://www.grc.com/spinrite.htm
Those are the hard-to-find problems; hardware is the very last place you look when something goes wrong in an app.
Any good tools to test network cards?
Eber Irigoyen on August 14, 2006 9:15 AM
I forgot to mention that we used CPU-Z to identify the mainboard and the SPD/memory timings:
Eber, as for testing network cards, I use pcattcp:
http://www.codinghorror.com/blog/archives/000339.html
Jeff Atwood on August 14, 2006 9:52 AM
I'd forgotten all about Prime95. Thanks for the reminder!
As for SpinRite, it's all junk science. I think it was Steve Gibson's laughable assertions about raw sockets that first tipped people off about his motivations, but in any event, his claims about SpinRite make about as much sense as ouija boards and autointoxication remedies. Marketing rhetoric couched in pseudoscience and buzzwords, with lots of testimonials but no hard evidence.
It's like acupuncture for your computer. At best it's nothing but a placebo effect, at worst it could do serious damage by virtue of re-exposing bad sectors.
Aaron G on August 14, 2006 10:14 AM
I'll concede that Steve Gibson stirs up hysteria at times, but the basic premise of SpinRite is sound:
If you lose data on your hard drive to a bad sector, and you want to recover that data, SpinRite does the trick. Is that worth $90? It depends on your valuation of the lost data.
The second assertion of being able to predict hard drive failure is plausible as well. In NAND flash memory, you can certainly tell when a block is beginning to fail, as the number of error corrections begins to increase disproportionate to the rest of the device. It seems reasonable to assume the same predictive conclusions can be made for a hard drive.
jdkludge on August 15, 2006 7:29 AM
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.