January 7, 2007
The server this blog is hosted on is having some very bizarre hardware problems. I can only blame myself, since I helped build the server. I'll document the problems later in the comments to this post.
I'm moving the VM to another, temporary host in the meantime, so there may be a bit more downtime today.
Posted by Jeff Atwood
First of all: I enjoy your blog!
Any idea yet of what might be the problem? Recent updates? Perhaps a virus? Can you try a System State Restore or do a reinstall like Grant Johnson said...
The fact that Microsoft has "Virtual Server 2005 R2 SP1 – Beta 2" now available means that the software has issues. There has been one thing that I have learned about software over my many years in this industry, and that is, NEVER USE THE FIRST VERSION FROM MICROSOFT OR ANYONE WHO IS A PARTNER OF MICROSOFT.
Personally, Jeff, my vote is on it being a software issue rather than hardware. I agree with Grant's comment -- after this long, it's pretty unlikely that hardware would be an issue AND not be a fatal/blatant problem.
Then again, if you've had the error filling up your event log for a while and only recently did it increase in frequency, maybe it is a hardware problem, after all...
Who knows... Good luck to you. Until then, maybe you can rename the blog to "Server Horror" instead ;)
1.)Never use Windows for things that have to be reliable. Use Linux, preferably Red Hat or SuSE.
2.)Try a system burn-in for 5 hours or more (check out www.ultimatebootcd.com for a CD full of burn-in tools)
3.)You should change "the word" from "orange" to something else, once in a while.
The man provides a great service by writing this blog.
It would be great if everyone would just provide him help with his problem.
Telling him he should use Linux does absolutely nothing for his current problem. Obviously he is a very smart person and knows what he is doing by choosing Windows.
....and just for your information
I think Windows Server might be a little stable since the 2nd most visited site in the world uses it.
@Gabriel J. Smolnycki
Here's some responses to your flame-bait:
"1.)Never use Windows for things that have to be reliable. Use Linux, preferably Red Hat or SuSE."
Uggh - http://www.codinghorror.com/blog/archives/000021.html Jeff's a Microsoft programmer, he does what he knows. Also, I'm sure companies like Dell, Marketwatch, and many other sites would disagree with this statement.
"2.)Try a system burn-in for 5 hours or more (check out www.ultimatebootcd.com for a CD full of burn-in tools)"
"3.)You should change "the word" from "orange" to something else, once in a while."
Read Jeff's post on Captcha and how effective the key word has been for him - http://www.codinghorror.com/blog/archives/000712.html
Flame on Linux lovers! Down with Microsoft! Down with Bill Gates! Devil spawn they are! Cursed be they make a pact with the devil! Arrgh, sorry matey, me pirate lingo got the best of me.
I have never liked Windows or Microsoft or Bill Gates. Too flakey, too flashey, too cumbersome, too rich, too successful. Real men program in machine code, we don't need no fancy gooey eye-dee-ease.
I will say this for Bill: he did bring computers to the masses. My big question is why has no one been able to challenge him? Maybe he really does have a brain.
I have used Linux occasionally, but for someone who has other things to do than maintain their computer system, a Windows machine makes it possible to use a computer and have a life. Cursed, I am!
I guess I should have given some help too while ragging on Microsoft:
The short of it is, your hardware could be fine, but the OS can react differently depending on load and memory usage which is extremely difficult to reproduce when you are trying to debug it. As you mentioned, the machine is running multiple VMs and that means a load on the memory. Prevent the VMs and the machine will appear stable, reinstate all the VMs and it will appear stable. Then when the machine is under heavy duress it will start to fail again.
I think using a product like VMWare GSX Server might handle the load better.
OK, everything should be back to normal now.
codinghorror.com runs on a shared hosting server with the following specs:
- Virtual Server 2005 R2
- Windows Server 2003 Enterprise x64 Host OS
- Athlon X2 4200+ (dual core, 2.2 GHz, 1 MB cache)
- Dual WD Raptor 10,000 RPM 150GB hard drives, in RAID mirror
- nVidia nForce 430 motherboard chipset w/ onboard 6150 video
- 4GB RAM
The actual VM itself has these specs:
- Windows Server 2003 Standard
- 512 MB RAM
- ActiveState Perl
- Movable Type 2.66
- MySQL 5.0
The problem we're having is with the host server OS. It started out perfectly stable, but over time it's generating more and more faults. No bluescreens, just odd errors in ntdll.dll and services.exe. More detail in a bit.
Hey, I think I figured out the problem. It seems your server has the MS_BUG:
"- Windows Server 2003 Standard"
If you switch over to Linux, this problem should go away.
Remember, there's almost nothing installed on the host. Just the OS, Virtual Server 2005 R2, and the platform drivers.
Here are more details from the Host's Event Log.
"Fault bucket 00040944."
"Faulting application services.exe, version 5.2.3790.1830, faulting module ntdll.dll, version 5.2.3790.1830, fault address 0x000000000001baef."
This one is the killer. It's always the exact same error in the exact same sequence. It started happening with more and more frequency. I cleared the app log late Saturday night, and I got these exact same errors all Sunday (1/7)
2:46 am, 3:46 am, 4:16am, 4:27 am, 4:46 am, 5:16 am, 5:56 am, 7:46 am, 8:56 am, 9:36 am, 10:47 am
After that, the Host OS was down for the count until we physically restarted it at around 9:15 on Monday (1/8).
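To get a feel for how the faults were spaced, the times listed above can be turned into gaps in minutes. This is only a rough sketch in Python: the list is typed in by hand from this comment, and the function name `minutes_between` is made up for illustration.

```python
from datetime import datetime

# Fault times copied by hand from the comment above (Sunday 1/7, all a.m.).
FAULT_TIMES = ["2:46", "3:46", "4:16", "4:27", "4:46", "5:16",
               "5:56", "7:46", "8:56", "9:36", "10:47"]


def minutes_between(times):
    """Return the gap, in whole minutes, between each consecutive pair of times."""
    parsed = [datetime.strptime(t, "%H:%M") for t in times]
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(parsed, parsed[1:])]


gaps = minutes_between(FAULT_TIMES)
print("gaps in minutes:", gaps)
print("shortest:", min(gaps), "longest:", max(gaps))
```

The gaps are irregular, which fits the observation that the error became more frequent rather than following a fixed schedule.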
We had observed increasing problems with this server, always the same error, but never this frequent, and rarely crashing the Host. It does lead to the server "greyscreening", where the Host OS is unresponsive to remote logins of any kind, but the VM keeps running just fine. It's nearly identical to what is described here:
Over time, we pared down the Host to only the one VM, and removed every bit of hardware we could from it to isolate the problem. This server was stable through 2+ hours of Prime95 torture test, both large and small, on Friday (1/5).
I really have no idea what's happening. I want to keep an eye on the server now that it's running ZERO virtual machines.
Yep, I couldn't imagine doing VMs the Microsoft way. *shudder*.
I found this German-language post which references the exact same error, in the exact same config (Virtual Server 2005 R2, Windows Server 2003 x64 host, Windows Server 2003 32-bit client)
here's the translation:
I am running Virtual Server 2005 R2 on Windows Server 2003 Standard x64 on a Dell PowerEdge 2850 (dual Xeon). The day before yesterday I installed everything and set up a few virtual machines (Windows Server 2003 Standard R2). Sometime yesterday I could no longer reach the Administration Website. I could still get onto the machine via Remote Desktop, but had no chance of opening the Event Viewer or the Services dialog. On the command line, both iisreset and "net start" simply [failed] after being called. I then switched the computer off and started it up again. Since then it has [done the same thing] again. In the event log I then found the following:
Event Type: Error
Event Source: Application Error
Event Category: (100)
Event ID: 1000
Time: 12:05:52 PM
Description: Faulting application services.exe, version 5.2.3790.1830, faulting module ntdll.dll, version 5.2.3790.1830, fault address 0x000000000001baef.
Does anyone have an idea what could be wrong with my server?
Regards, Markus
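Since both reports hinge on the same "Faulting application" line, a small Python sketch like the one below could pull the fields out of such a description so two logs can be compared. The regular expression and the `parse_fault` helper are assumptions written against the one line quoted in this thread, not a general Event Log parser.

```python
import re

# Pattern written against the single description quoted in this thread;
# other Windows versions may word the message differently.
FAULT_RE = re.compile(
    r"Faulting application (?P<app>[^,\s]+), version (?P<app_version>[\d.]+), "
    r"faulting module (?P<module>[^,\s]+), version (?P<module_version>[\d.]+), "
    r"fault address (?P<address>0x[0-9a-fA-F]+)"
)


def parse_fault(description):
    """Return the fault fields as a dict, or None if the text does not match."""
    match = FAULT_RE.search(description)
    return match.groupdict() if match else None


line = ("Faulting application services.exe, version 5.2.3790.1830, "
        "faulting module ntdll.dll, version 5.2.3790.1830, "
        "fault address 0x000000000001baef.")
fault = parse_fault(line)
print(fault["app"], fault["module"], hex(int(fault["address"], 16)))
```

Comparing the parsed module, version, and address is a quick way to check whether two event log entries really describe the same fault.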
Just out of curiosity, I looked at the code on the home page and it looks suspiciously like pascal, are you a Delphi guy?
I ran into a lot of memory problems with a 430/6150 AMD dual core I built over xmas. memtestx86 diagnosed the faults reliably during the first hour of running, and clocking the memory down from its DDR2/667 spec to DDR2/533 fixed my issues.
Today I typed in the URL for this blog instead of just googling the name of it. And I was greeted by the splash page and I thought something was broken because of all the code in the background. Then I looked for a second and realized I was looking at a splash page. lolz.
Best splash page ever.
Just an aside, I know it's not even the same set up (mine's a laptop), but an area to check is also the hard drive. Mine just went out and it was causing the OS (Windows XP Home: Most of the underlying OS is in Win Server 2003/XP Pro...) to blue screen with what looked like memory errors. Basically, your swap file IS memory. If your drive can't access it, the OS can't use and will throw errors.
One cool thing though, my desktop PC is pretty tough. I hooked the 2.5" drive up to it and after a short time it went slower and slower and slower to boot. Then the REALLY COOL stuff happened. The adapter I used to connect the drive produced a bright greenish blue flash! In my hand! I quickly shut down the PC and investigated what had happened. The power lead to the drive reacted like a fuse! Hence my discovery the drive was trash. I reconnected my other drives and after a little playing with DVD drive jumpers and BIOS settings I was able to boot back up and run normally on the desktop. The laptop however is going to receive an upgrade now: 100GB drive instead of the 28GB.
Just a little food for thought... Now back to our regularly scheduled blog.
Something really scary would be a stable kernel produced by Microsoft. How could they do that w/o a single unshaven geek? [shiver!]
I vote for replacing the hard drive. What do you expect from a hard disk made by people working for $8.00 per day???
Like there are no bugs in Linux, man, they release patches ten times a day. Like someone above said, Windows makes it possible to use computers and have a life at the same time.
I would vote on this being either an update (don't use autoupdate!) or faulty memory modules.
Random weird stuff always happens with faulty modules.
We rebuilt the server from scratch using Windows Server 2003 *R2* x64, and the latest nForce 430 platform drivers.
So far so good. Nothing unusual in the event logs.
My gut instinct is that we had a driver / software problem; all my tests confirmed that the hardware was perfectly stable.
When Event Viewer returns:
"ntdll.dll, version 5.2.3790.1830, fault address 0x0004cd7d"
do you know this error code???
Actually, the problem came back. Exactly the same problem as before (see first few posts), everything is identical. It's like a bad zombie movie.
I filed an incident with Microsoft PSS and we're troubleshooting it now. We did full memory dumps and perfmon logs, etc. Based on the new data, so far they think it was a problem with nvraidservice.exe trying to write to the registry and getting "access denied" for some reason.
We've disabled nvraidservice.exe from running in the machine startup via msconfig. Let's see if that helps at all.
Disabling nvraidservice.exe didn't help, nor did removing the few things that Microsoft PSS wanted removed from startup and drivers and so forth. The same error listed at the top of the comments came back with a vengeance. Exceptions and forced restarts every hour on the hour.
I also determined that the server passes Microsoft's Memory Diagnostic program, too:
What *did* finally help (i.e., zero exceptions or restarts in the last 18 hours) is upgrading from Virtual Server 2005 R2 to Virtual Server 2005 R2 Service Pack 1 (beta 2).
Fingers crossed, I think this is it.
1) Sometimes Windows gets twitchy with bad or intermittent memory. Start with Memtest86. I usually let it run for 24 hours to be sure.
2) If that does not show up anything, try a fresh install to isolate it as a hardware issue. Most often it is not. It is usually a corrupted or version mismatched DLL.
Usually hardware faults are quite terminal, not just little errors where most things keep working. They also usually do not show up like this long after burn-in.
I appreciate you putting your problems up here. There are days that I think I am the ONLY one on the face of the planet who can't figure out why something doesn't work. It's nice to know the gurus have bad computer days also.
I used to tell my clients that the reason that this stuff is so quirky is because "nobody has ever done this before". It still seems to be true every time something goes wrong for me, but knowing there are others bailing in a similar boat is heartening.
I know what the problem is... the blog has been corrupted by the recent influx of slashdotters and the disease has spread all the way to the host OS. ;)