As programmers, it is our responsibility to ensure that when something goes horribly wrong with our software, the user has a reasonable escape plan. It's an issue of fundamental safety in software error handling that I liken to those ubiquitous airline safety cards.
Which one accurately depicts the way your software treats the user in the event of an emergency?
If I've learned anything in the last thirty years, it's that I write shitty software -- with bugs. I not only need to protect my users from my errors, I need to protect myself from my errors, too. That's why the first thing I do on any new project is set up an error handling framework. Errors are inevitable, but ignorance shouldn't be. If you know about the problems, you can fix them and respond to them.
Note that when I say "errors", I don't mean mundane, workaday problems like empty form values, no results, or file not found. Those kinds of errors are covered quite well in 37 Signals' Defensive Design for the Web: How to Improve Error Messages, Help, Forms, and Other Crisis Points.
It's a great book; a quick read with lots of visual do's and don'ts side by side. Despite the giant exclamation point icon on the cover, however, it's mostly about fundamental web usability, not error handling per se.
I'm talking about catastrophic errors -- real disasters. Cases where a previously unknown bug in your code causes the application to crash and burn in spectacular fashion. It happens in all applications, whether they're websites or traditional executables.
The situation is pretty dire at this point, but some disaster recovery is possible, if you plan ahead.
If users have to tell you when your app crashes, and why, you have utterly failed your users. I cannot emphasize this enough.
It's bad enough that the user has to use our crashy software; are we really going to add insult to injury by pressing them into service as QA staff, too? If you're relying on users to tell you about problems with your software, you'll only see a tiny fraction of the overall errors. Most users won't bother telling you about problems. They'll just quietly stop using your application.
Whatever error handling solution you choose, it should automatically log everything necessary to troubleshoot the crash -- and ideally send a complete set of diagnostic information back to your server. This is fundamental. If you don't have something like this in place yet, do so immediately.
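As a rough illustration of "automatically log everything", here is a minimal Python sketch of a last-chance handler. The `crash_reports` folder, the report fields, and the function names are assumptions made for this example, not part of any particular framework, and the upload step is left as a comment.

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location; a real app would pick a per-user data folder.
CRASH_DIR = Path("crash_reports")


def build_report(exc_type, exc_value, exc_tb):
    """Collect what a developer needs to troubleshoot the crash."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "error": exc_type.__name__,
        "message": str(exc_value),
        "stack": traceback.format_exception(exc_type, exc_value, exc_tb),
        "python": platform.python_version(),
        "os": platform.platform(),
    }


def crash_hook(exc_type, exc_value, exc_tb):
    """Last-chance handler: save the report, then apologize to the user."""
    report = build_report(exc_type, exc_value, exc_tb)
    CRASH_DIR.mkdir(exist_ok=True)
    path = CRASH_DIR / f"crash-{report['time'].replace(':', '-')}.json"
    path.write_text(json.dumps(report, indent=2))
    # Sending the report to a server would go here, subject to user consent.
    print("Sorry, this one is our fault. Details were saved to", path, file=sys.stderr)


# Every uncaught exception on the main thread now goes through crash_hook.
sys.excepthook = crash_hook
```

The hook only sees exceptions nobody caught, so it does not replace ordinary error handling; it is the safety net underneath it.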
It's true that we can't do much to recover from these kinds of crashes, but relying on the underlying operating system or webserver to deliver the generic bad news to the user is rude and thoughtless. Override the default crash screen and provide something customized, something relevant to your application and your users. Here are a few ideas:
In my experience, nothing motivates a team better than a detailed public record of all crashes. There should of course be a searchable, sortable database of errors somewhere, but active notifications are also a good idea. Crashes are incredibly annoying to your users. It's only fair that the team behind the software share a little of that pain for each crash. You could broadcast an error email, text message, or instant message to everyone on the team. Or maybe have every crash automatically open a bug ticket in your bug tracking software. Tired of dealing with all those error emails and/or bug tickets? Fix the software so you don't have to!
Once you have a comprehensive record of every crash, you can sort that data by frequency and spend your coding effort resolving the most common problems. Microsoft, based on data from their Windows Error Reporting Service, found that fixing 20 percent of the top reported bugs solved 80 percent of customer issues, and fixing 1 percent of the top reported bugs solved 50 percent of customer issues. That's huge! Let the Pareto principle work for you, not against you.
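Sorting crashes by frequency can be sketched in a few lines of Python. The crash signatures below are made-up sample data, and `rank_crashes` is a hypothetical helper, not something Windows Error Reporting exposes.

```python
from collections import Counter


def rank_crashes(signatures):
    """Return (signature, count, cumulative_share) rows, most frequent first."""
    counts = Counter(signatures)
    total = sum(counts.values())
    rows, running = [], 0
    for signature, count in counts.most_common():
        running += count
        rows.append((signature, count, running / total))
    return rows


# Made-up signatures, e.g. exception type plus the top stack frame.
crashes = ["NullRef@Save"] * 6 + ["Timeout@Sync"] * 3 + ["Overflow@Parse"]
for signature, count, share in rank_crashes(crashes):
    print(f"{signature:16} {count:3}  {share:.0%}")
```

The cumulative share column shows how far down the list you need to go before most reported crashes are covered.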
As software professionals, we should protect our users -- and ourselves -- from our mistakes. Crash responsibly!
"That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn't pull the rug out from under you when you dereference a null pointer."
Well, let me see, what could you do to solve that problem...
Hmmm....
Every time you dereference a pointer you should check if it's null. Every Time You Dereference A Pointer You Should Check If It's Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT'S NULL.
If it's null, and that's a major problem (i.e. you don't know why it's null), up pops the screen saying "Null pointer dereferencing at line squiddlybeep, I'm a lazy sonofab***h, and didn't consider this possibility. Press 'Oi, Loser!' to inform me of my screwup." Because the user can't feed NULL pointers into your program. That's all you, baby.
And while we're at it, Initialize Your Damn Pointers, because random garbage often offends.
"I think Visual Studio has the best variety of crash symptoms - my favourite is when it just vanishes usually when you get to the critical point in a debug session."
To be fair, VS encounters more than its fair share of the 80% of problems that affect 20% of users.
For example, the critical point in a debug session is generally the point where you've decided to do some crazy crap that nobody's ever done before.
Sorry Jeff, but I think I must disagree with you. You say: "If users have to tell you when your app crashes, and why, you have utterly failed your users."
Yeah, it's true. But how can I know when and why my software crashes? Do you mean that I must use automatic error reporting? What about privacy? I cannot do it without asking the users, so what do I do? Think of MS: not all software crashes are caused by the software itself, nor are they all caused by the OS. So if a crash is due to the OS, and it sends the report without asking the user, MS is a Big Brother that sends personal data without the user's agreement. If it asks for permission, then MS has failed its users. So how can you resolve this?
Blackstorm on May 18, 2008 2:51 AM
I find it interestingly sad that global error handling in .NET Winforms is rather a chore.
steve on May 18, 2008 2:57 AM
Great post, Jeff.
You should always be able to see what the application is doing, or at the very least see that it is doing something. I hate it when the application stalls; thank god Windows can show the "not responding" message.
Peter Palludan on May 18, 2008 3:10 AM
Good point. I particularly enjoy putting complete core dumps in my automatically generated error reports so I can go through financial reports from companies all over the world.
Privacy? Letting users opt-in to sending me information on what they're doing with my precious intellectual property? Yeah, like that'll ever happen.
Bob on May 18, 2008 3:15 AM
Blackstorm writes:
But what about privacy? I cannot do it without asking the users, so what do I do?
Good points. The solution is a basic opt-in or partial opt-out choice at install. During install, or upon first launch, ask the user if it is okay to automatically send anonymous error reports in the event of a catastrophic error. Then give them the options of "always" and "ask before sending each time". (Worded a little better, of course.) As long as it is rare for this dialog to appear, a "never send" option shouldn't be needed. You could also present the opt-in/opt-out screen the first time a major error occurs. But the risk here is that you are asking at a point in time when the user is frustrated, and perhaps very upset with the product.
When the error occurs, if they opted in, you just send it. If they partially opted out, you present a dialog asking if it is OK to send the report. Include a button that displays the full detail of the report. IMHO, that is the best way to deal with this.
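The two-option policy described above could look roughly like this in Python. The `Consent` enum and the `ask_user` callback are hypothetical names chosen for the sketch; the dialog itself is left to the application.

```python
from enum import Enum


class Consent(Enum):
    ALWAYS_SEND = "always"
    ASK_EACH_TIME = "ask"


def should_send(consent, ask_user):
    """Decide whether a crash report may leave the machine.

    ask_user is a callback that shows the full report to the user and
    returns True only if they approve sending it.
    """
    if consent is Consent.ALWAYS_SEND:
        return True
    return bool(ask_user())
```

Keeping the decision in one small function makes it easy to audit that nothing is sent without the choice the user made at install time.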
I agree with Jeff's points. We as consumers would never accept problems in other products that we as programmers sometimes expect users to accept. I unfortunately have worked on far too many teams where error handling is given little if any consideration. To a user, an application that crashes without providing any feedback or apologies is beyond annoying.
Mark on May 18, 2008 3:20 AM
For C++ and native app developers, it is worth mentioning the excellent blackbox utility by Jim Crafton:
http://www.codeproject.com/KB/applications/blackbox.aspx
That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn't pull the rug out from under you when you dereference a null pointer. Not so much in languages like C++. Nevertheless, I am implementing an error reporting system for my software. On every platform it runs on.
In my opinion it isn't the handling of fatal errors that's difficult. Doing it cross-platform is tough, however.
For an open-source project I'm working on (http://wz2100.net) I've created a fatal error handler, aka exception handler, that works quite well cross-platform.
On Windows it uses the unhandled exception handler framework exposed by the Windows API to catch fatal errors; on Unix systems it uses POSIX signal handling for that purpose. For the production of stack traces it uses gdb when available on Unix, or glibc's backtrace facility as a fallback. On Windows it uses the Windows API to retrieve the function addresses from the stack and then uses a demangling library (libbfd) to retrieve function names. Some other info describing the user's system is also dumped.
You can find this error handler here. It might give you some clues for how to implement something similar yourself, or you can use this implementation if the license is fine with you (GPLv2+, and I'm planning on releasing some parts of it as LGPL):
http://trac.wz2100.net/browser/trunk/lib/exceptionhandler (Subversion URL: http://svn.gna.org/svn/warzone/trunk/lib/exceptionhandler)
In addition to the privacy concerns, this solution (logging all information necessary to debug the error) does not scale.
The errors we're talking about happen because "a previously unknown bug in your code causes the application to crash and burn in spectacular fashion". So there's no way we can trust the application to write a neat, comprehensive error log when that happens: we can't trust it to be sane at all. The only way to meet this goal is to log *everything, in huge detail* -- enough to debug the application in the event of a catastrophic error.
That sort of logging information, in comprehensive, verbose detail, takes *lots* of space. We're talking hundreds or thousands of log entries for even simple operations, that might take a minute or even a few seconds. Multiply that by months and years of countless operations like that and the log files are *huge*, even if they're rotated out regularly.
Then, if the advice of this article is to be followed, every program on your operating system -- hundreds of them -- should be keeping such logs all the time they're in operation. It simply can't scale.
So a compromise must be made: programs run with little or no logging, unless the person responsible for the disk space makes a decision to start chewing it up with verbose log output from a particular program.
If you log *every* program verbosely, all the time, that's just as much a failure as the failure this article talks about. But if you don't, then you can't get the debugging information without the user being inconvenienced further.
Provided it's genuinely opt-in, even a stack trace can often be enough to help identify the place where some work is needed. It's better than nothing.
Also, if you license the patented high-grade intellectual property from Microsoft you're allowed to make a hash of the stack trace and use that to determine if the crash has already been reported (which can even save on bandwidth) and even immediately offer work-arounds and pointers to updates. But Microsoft invented and owns that idea, so don't do it.
Still, I am glad to learn one thing: I used to believe that users didn't read installer dialog boxes, but now it turns out that as long as you ask the question once when the application is installed you can be certain that 100% of your users will remember their choice and never be bothered about your gathering of allegedly anonymous information.
For all the Microsoft bashing (really just bashing patents), I'd be just as happy to sign up for access to all the data they collect and be done with it. Sure, it really is opt-in, but at least it's there, implemented, and gets the job done. The problem is developers who don't roll their own solution (a valid option) and also ignore the existing solution. They have the data available but never use it.
Bob on May 18, 2008 6:34 AM
It would be nice if every time your app crashed it was your fault.
In my field, people sometimes try to manually alter query strings and URLs. It's my fault if the application gave them a broken URL, but it's their fault if they fat-fingered something. We still must provide a nice error page, mind you.
There need to be different scenarios for different types of applications. ASP.Net web applications are particularly easy, so there's no need to ask what the user was doing. But older desktop apps, and apps that don't or can't provide a meaningful stack trace, may need to ask what the user was doing when the crash happened and/or instrument their code in a less transparent way. I really hate instrumenting an entire application with logging, and more often than not it's completely useless information.
Another issue you must consider is who makes up the installed user base: corporate intranet, small office app, customised commercial app, shrinkwrap, www?
Ha! I still get a kick out of those 'EOF and BOF are true' errors that still happen from time to time with really old crappy 'classic' asp websites that fail to check if a recordset contains any data.
EOF and BOF are true on May 18, 2008 8:12 AM
That's the great thing about Open Source software, it's ok if it crashes because hey, I'm not getting paid for it. :)
I'm just kidding.
Bobby on May 18, 2008 8:15 AM
Jeff,
I have been reading your blog for about six months now and have listened to all of the StackOverflow podcasts.
I must say that the ELMAH tip has been by far the most valuable piece of information I have picked up from you.
Thank you for doing your thing,
Yonah
Yonah on May 18, 2008 8:23 AM
That might be easy in the nice and friendly world of managed languages, where every error is an exception, and the OS doesn't pull the rug out from under you when you dereference a null pointer. Not so much in languages like C++. Nevertheless, I am implementing an error reporting system for my software. On every platform it runs on.
(And it's not that easy to do!)
Owen S on May 18, 2008 10:25 AM
Rails has Exception Notifier, which emails you a bug report, request headers, and full stack trace. This has changed my applications dramatically. By the time a client informs me of a problem I usually have already patched and updated the application.
http://svn.rubyonrails.org/rails/plugins/exception_notification/README
I've moved away from emails to RSS subscriptions. My favorite for ASP.NET apps is definitely ELMAH
http://code.google.com/p/elmah/
Terrible name, but great implementation!
Jeff Atwood on May 18, 2008 10:39 AM
"...the first thing I do on any new project is set up an error handling framework."
I guess it depends on what you mean by "new project" but this set off red flags for me. Ideally this is a once per company task. Well, ideally, there would be something simple enough in the .Net framework. Last I checked the everything-to-everybody Exception Application Block just had too much going on.
MattH on May 18, 2008 10:55 AM
How do you handle privacy concerns when automagically sending error reports? How do you handle apps like ZoneAlarm which not only block such communications, but also pop up a blaring siren accusing your app of being bad?
Etcetera.
Kevin
Kevin on May 18, 2008 11:01 AM
Kevin, I suppose you do what MSFT did in the screenshot above -- you ask permission from the user and explain what it's for.
You can actually sign up with Microsoft to get automated error reports back from them, too:
http://msdn.microsoft.com/en-us/isv/bb190483.aspx
All you need is a $400 cert.
Jeff Atwood on May 18, 2008 11:05 AM
Our application crashes to pretty bad error screens, but is designed so that in almost all cases (sometimes heap corruption kills the .NET framework !?!) the application will be able to save its data in case of a catastrophic error.
Joshua on May 18, 2008 11:13 AM
thanks for sharing about ELMAH
never knew there was a framework entirely for exceptions before
do you keep a tools list also? like your recommended reading list but instead for the tools you use? would really appreciate that
love your blog ^^"
chakrit on May 18, 2008 11:21 AM
Jeff, I'm wondering if you've seen DamnIT (http://damnit.jupiterit.com). It automatically warns you if your page errors in IE or FF. It also groups most common errors.
Justin Meyer on May 18, 2008 12:16 PM
crash early
When you hit an unrecoverable error, the only suggestion I have found useful is to "crash early". The more you delay and try to "fix" the thing, the more havoc you cause.
~TH
TH on May 18, 2008 12:18 PM
Personally I prefer to blame my users whenever possible, of course it helps that our employees are forced to use my application! ;)
Telos on May 18, 2008 12:19 PM
Opera has a pretty good disaster recovery tool. It lets the browser save the currently open tabs and windows, and if Opera crashes, it asks whether you want to load those tabs again the next time you start it.
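This kind of session recovery can be sketched in Python with a small state file and a "clean exit" flag. This is only an illustration of the idea, not how Opera implements it; the file layout and function names are assumptions.

```python
import json
from pathlib import Path


def save_session(path, tabs):
    """Persist the open tabs; call this whenever the set of tabs changes."""
    Path(path).write_text(json.dumps({"tabs": tabs, "clean_exit": False}))


def mark_clean_exit(path):
    """Call on normal shutdown so the next launch does not offer a restore."""
    state = json.loads(Path(path).read_text())
    state["clean_exit"] = True
    Path(path).write_text(json.dumps(state))


def tabs_to_restore(path):
    """On startup, return the tabs left over from a crashed session, if any."""
    file = Path(path)
    if not file.exists():
        return []
    state = json.loads(file.read_text())
    return [] if state["clean_exit"] else state["tabs"]
```

If the process dies before `mark_clean_exit` runs, the flag stays false and the next launch knows there is something to offer back to the user.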
Hoffmann on May 18, 2008 12:22 PM
hoffmann both internet explorer and firefox also do this
pete on May 18, 2008 12:31 PM
Same goes for Firefox. And with Firefox beta 5, it is used more than I like. A lot more!
JW on May 18, 2008 12:35 PM
Just tried out ELMAH. It seems to work great, although I haven't finished experimenting fully yet. I don't know if it may catch things I don't consider an error. For example, part of my site requires a user to be in certain Active Directory groups on our network; failing that check is technically an error, but I don't need to know about it. It currently redirects to an access denied page.
pete on May 18, 2008 12:35 PM
@Owen: how exactly does the OS pull the rug on you just because you're using C++? Set up some SEH and you handle the critical errors yourself, which gives you the freedom to save any and all data you want and display whatever crash screen to your users. That shouldn't be much of a problem in C++ unless I'm missing something.
Coming from asm, though, some advice: TEST the values you use for pointers. If you blindly accept whatever value you get back from system calls, you're in for a world of pain.
Regards
Fake
I think Visual Studio has the best variety of crash symptoms - my favourite is when it just vanishes, usually when you get to the critical point in a debug session. Others I experience daily include the indefinite hang; the indefinite hang followed by losing all your work (with the 'oops something went wrong' dialog), which usually occurs when switching from Debug to Release or vice versa; the hang followed by a crash followed by corrupted csproj files; OOM in IntelliSense; the list just goes on and on...!
Paul on May 18, 2008 1:50 PM
@Jherico, while pointer de-referencing is a little off-topic, it's a little arrogant to say that well, you should only need to ever check on creation and your code should know where it's at.
Inside a class and its private methods, you shouldn't need to continually check. However, it's the public entry point to any object that is a source of errors.
No matter how smart you are, if you are working in a team, your code will be reused in some way you didn't anticipate (by another developer, perhaps). I typically look at it from the argument checking point of view: if the code below assumes the pointer (or reference, in my case) isn't null, and I get a null pointer, the code's going to blow up anyway. I'm making the code more explicit and making sure that somebody doesn't think it's my code that's bad. Null isn't an error, it's an object state. In a way, Nullable<T> in .NET is proof that null is a first-class citizen and needs to be explicitly accounted for.
in ...
    void foo( MyStack stack )
    {
        // Uncomment to fail early with a descriptive message:
        //if( stack == null ) { throw new ArgumentNullException("stack", "stack passed can not be null"); }
        stack.Push(100);
    }
what's a better message
"Object reference not set to an instance of an object"
or
"System.ArgumentNullException: stack passed can not be null"
It really goes back to the concept of Fail Early and Fail Often. This way, in testing and coding, you catch more errors because you're writing the contract into the class explicitly.
PS I'll also make a case for the Null Object Pattern here, because it's incredibly useful when applied correctly and cuts down on a ridiculous number of null checks.
Nullable on May 19, 2008 2:31 AM
@Omar Abid: do you see any spam here?
Yes, the captcha is always "orange". That's by design. It's enough to stop spammers.
Nicolas on May 19, 2008 3:19 AM
@Owen, Steve:
To catch unhandled SEH errors in Windows, call SetUnhandledExceptionFilter. To catch unhandled errors in .NET, including Windows Forms, add a handler for the AppDomain.UnhandledException event. Both setups will catch all unhandled exceptions on any thread in the process.
The .NET handler works in all versions of the full .NET Framework, and in version 2.0 and later of the .NET Compact Framework. Generally, use:
AppDomain.CurrentDomain.UnhandledException +=
new UnhandledExceptionEventHandler( methodName );
The current philosophy on software performance and stability is pretty much like this: If you have fewer than two cores or less than two gigs of ram, you shouldn't even be complaining about poor performance or crashes. Go get yourself a new computer.
The rest of us just get left behind.
WurdBendur on May 19, 2008 3:58 AM
@Tom: "Well, let me see, what could you do to solve that problem...
Hmmm....
Every time you dereference a pointer you should check if it's null. Every Time You Dereference A Pointer You Should Check If It's Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT'S NULL."
While that might be good advice, it doesn't really help with the question about how to implement the error handling for a NULL pointer dereference, not to mention that not all bad pointers are NULL pointers. Your advice is pretty much like saying, "handle your errors by not having them in the first place". Which is great, but it doesn't help handle the defect that does slip through.
mikeb on May 19, 2008 4:10 AM
Crashes are not everything I hate in MS products. There are many small annoyances that are really small but very annoying.
Consumer on May 19, 2008 4:57 AM
Meh, I just write code that doesn't crash.
Bill on May 19, 2008 5:48 AM
Jheriko: "The first comment is gold considering that C++ has exceptions."
Which doesn't help you catch null pointer dereferences since they don't throw C++ exceptions. In fact, they produce undefined behavior. On the DeathStation 9000, the OS catches them and initiates automatic shredding of the user's hard drives.
Dealing with null pointer dereferences requires platform-specific code: SEH on Windows, catching SIGSEGV on POSIX, and who knows what else on Weird Embedded System #302.
Evan on May 19, 2008 6:36 AM
"Every time you dereference a pointer you should check if it's null. Every Time You Dereference A Pointer You Should Check If It's Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT'S NULL."
There are so many problems with this statement.
So, suppose the pointer is null, and it's not supposed to be, what do you do? The correct strategy is to dereference the pointer and let the system throw an access violation exception, which you then handle appropriately. The page fault (and other system-wide) exceptions are there for a reason, use them.
Second, checking for a null pointer does not catch bad pointers that are not null. What's the point of expending all the energy on something that can't catch a simple uninitialized pointer?
Third, as someone has pointed out, this strategy is simply dumb and inefficient. Use a walled-garden approach instead - define unsafe and safe interfaces. The unsafe interfaces can handle bad data, including bad pointers, the safe interfaces handle only good data.
Max on May 19, 2008 7:45 AM
Thanks for that info Mike, I didn't know it.
To catch a crash on Unix, just set a signal handler for signals like SIGSEGV, SIGILL, SIGFPE, SIGABRT, SIGBUS. Lots of Linux systems also happen to have gdb (the debugger) installed, so you can also try running gdb -batch -x script, where script is a file containing the gdb commands you want to run to get a backtrace or whatever, then quit. ("set pagination off" and "set width 0" are also helpful there.)
Reed on May 19, 2008 7:52 AM
Oops,
'gdb -batch -x SCRIPT' where SCRIPT is a file containing the commands.
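For comparison, Python's standard library ships a ready-made version of the signal-handler approach: the `faulthandler` module installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL and dumps the Python tracebacks when one fires. The sketch below is not the gdb technique described above, and the helper name and log path are assumptions.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python tracebacks to log_path when a fatal signal arrives."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The file must stay open for as long as the handler is enabled.
    return log_file
```

Call it once at startup and keep the returned file object alive; the traceback is written from inside the signal handler, so the process still dies, but it leaves evidence behind.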
Reed on May 19, 2008 7:53 AM
"Let users know that it's our fault, not theirs."
This is a really good point, but soooooo many software shops are too arrogant to heed it. For instance, with Visual Studio there's a weird bug in the debugger where it will only report back 'Expression cannot be evaluated at this time' whenever you try to evaluate any variable in the immediate mode window. I searched around for a fix for this VS bug and all I found was this page: http://msdn.microsoft.com/en-us/library/f221hs8y(VS.80).aspx
The only advice was to change the syntax of the expression. Oh boy, so how was it my fault that the debugger couldn't evaluate a Records.Length() statement? This was a HUGE headache on my last project!!
Great post, Jeff! It really motivates me to create an error handling package that my entire team can use. I've been very, very lax in this in the past.
Thanks for improving my work, again.
Stew on May 19, 2008 8:18 AM
Oops, I meant Record.Count (mixed up with Java), but it was definitely a problem though.
o.s. on May 19, 2008 8:20 AM
I agree with O.S.: there are so many software shops that either don't care or are too arrogant to admit it. It is a great idea. The last paragraph of your post was the most important, I think. Prioritizing the problems is how we'll be able to nip them in the bud before they come back to roost.
Nate Nead on May 19, 2008 8:26 AM
An important consideration is how do you handle include files that don't exist? Does your application crash responsibly?
If you develop for the web, you probably notice that you upload files often. No matter what method you use for uploading, there will be brief periods where files are not available. Does your website disappear, or explode when somebody visits a page while the include files don't exist?
It's something most of us don't think about, as we don't visit pages while we are uploading changes. But if you make many edits/day, realize that you might be alienating users who happen to visit while you are making changes.
Jeff Davis on May 19, 2008 8:53 AM
Actually, a call to SetUnhandledExceptionFilter will catch unhandled .NET exceptions... after all, what do you think .NET exceptions are implemented with?
Also, in your exception handler, put a call to "MiniDumpWriteDump" (it's in dbghelp.dll - google it) and you can write a minidump to send back to the developers. Load this up in WinDbg (or Visual Studio) and you'll get a stack trace, registers, parameters and more. There should be more than enough info there to debug the crash and make sure it never happens again.
This is getting so off-topic w.r.t pointers.
Sure @Max, but equally, you can't guarantee that all code inside the walled garden always "Does the right thing"?
How do you define safe interfaces? Is that a new construct?
Safest == private
Marginally Safer == internal
Unsafe == protected
Unsafe == public
* note that I don't say any are safe, but a private method should be the safest
@mikeb - no, not all null pointers are bad, but you still can't de-reference a null pointer.
- null values passed to methods can be completely legitimate.
- what happens when null is passed is completely implementation specific.
However, if null is a valid state, a Null Object pattern can get rid of a lot of redundant code checks.
http://en.wikipedia.org/wiki/Null_Object_pattern
http://www.cs.oberlin.edu/~jwalker/nullObjPattern/
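A minimal Python sketch of the Null Object pattern mentioned above. The `Logger`, `NullLogger`, and `process` names are made up for illustration; the point is that the "nothing here" case gets a real object with the same interface, so the code that uses it needs no null checks.

```python
class Logger:
    def log(self, message):
        print(message)


class NullLogger:
    """Stands in for 'no logger': same interface, does nothing."""

    def log(self, message):
        pass


def process(items, logger=None):
    # Substitute the null object once, at the public entry point.
    if logger is None:
        logger = NullLogger()
    total = 0
    for item in items:
        logger.log(f"adding {item}")  # No null check needed here.
        total += item
    return total
```

This only fits when "do nothing" is a legitimate behavior for the missing collaborator; if null signals a bug, failing early is still the better response.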
hmm... a lot of talk about .net exceptions.
The first comment is gold considering that C++ has exceptions.
As for:
"Every time you dereference a pointer you should check if it's null. Every Time You Dereference A Pointer You Should Check If It's Null. EVERY TIME YOU DEREFERENCE A POINTER YOU SHOULD CHECK IF IT'S NULL."
Erm... NO NO NO. This attitude may make your code work, but it's not very intelligent, efficient, or good for debugging.
Every time you dereference a pointer it should have a valid value. Constantly checking to see if it's null is mindless... you shouldn't be doing things to break pointers once they are allocated, and there are other invalid values for a pointer than just NULL. Check for null on allocation only... in this situation NULL almost always means "I couldn't allocate the memory", and, as this is a specific error code being returned, it is more than acceptable to check for it.
It's safe, I'll agree, but it's terrible practice, especially in a language like C++ where you should know, or be able to work out, exactly what a pointer is pointing to based on where it is in code. Besides that, 9 times out of 10 it's a bad number of loops or faulty pointer arithmetic which causes the exception to be thrown, and that will not make the pointer NULL.
But yeah, exceptions, use them! Correct exception handling code will handle almost everything. (I'm sure there is something out there I don't know about yet!)
Jheriko on May 19, 2008 9:43 AM
it describes a lot like my project
but JS doesn't see any issue there
Hi Jeff,
maybe you should apply this to your blog when users type long replies and then lose the whole thing if there is an error and we have to hit 'back'.
;-)
Dennis on May 19, 2008 10:42 AM
This comes from a web app angle, but I'm looking to implement this into some of my winforms apps soon...
I got hooked on log4net a long time ago. (http://logging.apache.org/log4net) It provides amazing flexibility in a small package. I essentially have all my code with logging details setup and then a general catch all log for anything that bubbles to the top.
From there I wrote a web app (http://www.codeplex.com/hacksaw) that allows me to view these log files. That way the end user can simply notify me if they are having issues that I haven't discovered yet and the log can give me the details. The logger is initially set to dump just the exception messages, but if I need to go in and get details on a particular method, I make one simple change to the config file (no recompile necessary) and I can get a full dump of the variables and such for each method.
Besides, how many users actually remember to take a screen shot of the exception message being displayed to them, let alone know what to tell you other than "my program isn't working" 8^D
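The "change the config file, not the code" idea is not specific to log4net. A comparable sketch with Python's standard `logging` and `configparser` modules follows; the `[logging]` section, the `level` key, and the `app` logger name are assumptions for this example.

```python
import configparser
import logging


def configure_logging(config_text):
    """Set the app logger's verbosity from an INI-style config string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Default to errors only; switch to DEBUG in the config when investigating.
    level_name = parser.get("logging", "level", fallback="ERROR")
    logger = logging.getLogger("app")
    logger.setLevel(level_name.upper())
    return logger
```

With the level read at startup, switching from exception-only output to a full trace is an edit to one line of configuration.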
Sean Patterson on May 19, 2008 10:47 AM
MiniDumpWriteDump shouldn't really be called from the crashing process, much less from the crashing thread (think about "stack" and guard pages). Also, in the managed world, to be able to use sos on the dump, the dump is not really going to be "mini" any more.
In addition to the AppDomain.UnhandledException event, the System.Windows.Forms.Application.ThreadException should be subscribed to, to hook unhandled exceptions from the message loop.
There is no good Windows Error Reporting story for .NET. Also, WER/OCA is useful only for signed binaries. Everything else just goes down the big drain in Redmond. (Verisign signed, BTW. I can't believe where all in the world these guys keep their hands wide open.)
Crash dump and logfile privacy is a huge issue. Think "compliance". SOX or FDA, or whatnot. Also, all your clients may not have a persistent, cheap internet connection ready for you to send out e-mail or tons of data.
There are many good logging frameworks, beyond or in connection with EnterpriseLibrary, mostly adding value by great aggregation and analysis tools e.g. SmartInspect.
Lastly, not too many people know Microsoft's kernel-based, high-performance "Event Tracing for Windows" (ETW) for drivers and applications, although it's a hot runner-up for "Greatest Thing Since Sliced Bread" because it decouples the trace source from the trace sink and is claimed to process a log entry in "only 1500-2000 cycles, depending on settings". So it is long finished when EntLib hasn't even figured out which LogSink factory to invoke.
There is a managed ETW wrapper in .NET 3.5, but only for Vista, and it feels like it didn't get a lot of love from its developers.
Hmm. I got off topic.
We have, for our Windows apps, a dialog that reads "We apologize for the inconvenience." and lets the user save some error information in their local application data folder for support personnel (all exception info, dump, screenshot). It's the least we can do.
But in the end, users will dismiss the dialog as fast as anything else they deem usual software noise.
Henry Boehlert on May 19, 2008 11:35 AM
One can find more info about Windows Error Reporting here: https://winqual.microsoft.com/default.aspx
And I think that you can get a discounted cert for $100.
Vix on May 19, 2008 12:53 PM
I didn't say anything about dereferencing a NULL pointer being OK - what I said is that "not all bad pointers are NULL pointers", meaning you can have a pointer that has non-NULL garbage in it. So even if you check for NULL pointers before each and every dereference you would still want the error handling infrastructure that Jeff is advocating.
However, my main point is that even if your policy is to check for bad pointers (NULL or otherwise) at every opportunity, that policy still does not help the user or your support process for the situation where you make a mistake in implementing that policy. Bugs happen - having an error handler in place will help to identify and mitigate the problems when your program crashes.
mikeb on May 19, 2008 12:54 PM
Good post. I recommend sending error reports in XML!
This, however, is not a good recommendation:
"Leverage the 80/20 rule"
Joel Spolsky once said about the 80/20 feature rule that everyone needs a *different* 20% subset of features, which is why "light" editions rarely work, unless they're really not "light" at all, like Photoshop Elements.
In the case of error reporting, 80% of your users might well encounter the same 20% of bugs but many will *also* encounter some other (perhaps minor) bugs -- just different ones for each user.
The 80/20 rule now becomes problematic if you take the Microsoft stance to *never* fix bugs unless they're widely reported, even when the fix is obviously trivial. That means a lot of your users will keep encountering a zoo of annoying little bugs, and the general impression becomes that your software is shoddy, even though it may not have major issues. Doesn't bother our monopolist, but should bother anyone who is not a monopolist. Those bugs in the lower 50% of frequency still contribute to the subjective impression of how polished your software is.
Also, Microsoft's attitude had the effect on me that I stopped reporting issues to Microsoft Connect which I had once done quite frequently -- back when they actually fixed them. Now I know that anything I report won't get fixed unless many others report the same thing, so why should *I* bother? Of course from Microsoft's perspective this may look like their software is miraculously bug-free now, assuming others no longer bother reporting issues either...
Chris Nahr on May 19, 2008 1:06 PM
@bignose:
"The type of errors we're talking about are because "a previously unknown bug in your code causes the application to crash and burn in spectacular fashion". So there's no way we can trust the application to write a neat, comprehensive error log when that happens: we can't trust it to be sane at all."
Here's what I don't get: you have an exception handler in place. It gets triggered. You check the exception and it's certainly not something you can recover from. What is the problem in the following scenario? You a) check that your exception handler code hasn't been modified (a quick checksum will do, just to make sure there were no mem overwrites), b) try to create an output for logging and if that works, c) write out the details of the exception, then you d) check the validity of the memory of the stack and if valid write that out, e) check the validity of your data pointers and if valid f) start validating data and if valid g) try to save it down. Then h) quit/restart.
Where does the assumption that you cannot trust anything at all come from? The fact that your exception handler executes means that there is at least one thing you can trust - so start working from that.
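The staged approach argued for here can be sketched as a list of steps, each guarded on its own so that one failing step does not stop the rest. This is only an illustration of the control flow, written in Python. The memory and checksum validation described above has no direct equivalent in this sketch, and `run_crash_steps` and the step names are hypothetical.

```python
def run_crash_steps(steps):
    """Run (name, callable) steps in order inside a crash handler.

    Each step is wrapped individually: a failure is recorded and the
    handler moves on, so logging can fail without preventing the final
    quit/restart step from running.
    """
    outcome = []
    for name, step in steps:
        try:
            step()
            outcome.append((name, "ok"))
        except Exception as exc:
            outcome.append((name, f"failed: {type(exc).__name__}"))
    return outcome
```

A usage example, assuming a save step that fails because the disk is full:

```python
log = []

def save_data():
    raise OSError("disk full")

result = run_crash_steps([
    ("write_log", lambda: log.append("exception details")),
    ("save_data", save_data),
    ("quit", lambda: log.append("quit")),
])
# result records which steps succeeded; "quit" still ran after the failure
```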
Regards
Fake
Given this (as well as the referenced "what's worse than crashing" entry), what are your thoughts on this post from an MS blogger?
http://blogs.msdn.com/eric_brechner/archive/2008/05/01/crash-dummies-resilience.aspx
Racecar Bob on May 19, 2008 1:10 PM
TH, I talked about "Fail Fast" here:
http://www.codinghorror.com/blog/archives/000924.html
Jeff Atwood on May 19, 2008 1:38 PM
@mikeb - my apologies. I hadn't grokked the extent of your comment. As you say, checking pointers is just one small part of a good error handling strategy.
StillNull on May 20, 2008 2:45 AM
@Liron Levi: Please exclude personal data from your logging and encrypt the files reliably before they go over the wire into your inbox and delete them as soon as possible. Thank you for handling data responsibly.
Henry Boehlert on May 20, 2008 4:47 AM
From my experience, I'd have to say that rule #4 worked the best for us (not that the other rules are meaningless...they definitely aren't). I hate it when a bug, small or not, reaches the customers. However, it seems like nine times out of ten that fixing that one bug resolves all of the customer's problems.
Also, that first picture is awesome! I love Fight Club!!!
Mike on May 20, 2008 5:26 AM
"Which doesn't help you catch null pointer dereferences since they don't throw C++ exceptions."
Good point. This is what I get for being a smarty pants.
Still... nobody should be struggling with null pointer dereferences anyway... :)
I am a programmer, and most of the time I make sure that I log the errors and actions performed by my program to a text file. After reading this article, I remembered a few applications I use that actually urge you to send the generated dump file to a given e-mail address. A few times I did send it. I was not sure whether the company did anything with the dump file, but I sent it anyway. It's a mixed response, actually. In some applications it did help, but in others the error still existed; maybe they fixed it or not, I am not sure. One such application, which I don't want to name, used to generate this error report and ask the user to fill in the details and submit it, but the whole process was so slow. Come to think of it, Dr. Watson's error log kind of makes sense to read; it is not easy to interpret, but at least it makes some sense. Anyway, I am thinking of having my application automatically send an email to a particular address in case it crashes.
Anand.V.V.N on May 21, 2008 2:20 AM
There is a nice phrase about this in The Hitchhiker's Guide to the Galaxy - Mostly Harmless:
(It was, of course, as a result of the Great Ventilation and Telephone Riots of SrDt 3454, that all mechanical or electrical or quantum-mechanical or hydraulic or even wind, steam or piston-driven devices, are now required to have a certain legend emblazoned on them somewhere. It doesn't matter how small the object is, the designers of the object have got to find a way of squeezing the legend in somewhere, because it is their attention which is being drawn to it rather than necessarily that of the user's.
The legend is this:
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.")
Awesome images! Which airlines/aircraft did these come from?
Daniel Serodio on May 21, 2008 6:45 AM
@Daniel Serodio: The airline image on the right comes from the movie Fight Club.
Mike on May 21, 2008 9:00 AM
In the case of SEH you can't trust your application to log the crash, because the client memory space can be destroyed, the heap can be destroyed, I/O libraries can be in a bad state, etc.
And, in general, we should not send any data without the user's approval.
vtolkov on May 21, 2008 1:17 PM
@Chris
"The 80/20 rule now becomes problematic if you take the Microsoft stance to *never* fix bugs unless they're widely reported, even when the fix is obviously trivial. That means a lot of your users will keep encountering a zoo of annoying little bugs, and the general impression becomes that your software is shoddy, even though it may not have major issues. Doesn't bother our monopolist, but should bother anyone who is not a monopolist. Those bugs in the lower 50% of frequency still contribute to the subjective impression of how polished your software is.
Also, Microsoft's attitude had the effect on me that I stopped reporting issues to Microsoft Connect which I had once done quite frequently -- back when they actually fixed them. Now I know that anything I report won't get fixed unless many others report the same thing, so why should *I* bother? Of course from Microsoft's perspective this may look like their software is miraculously bug-free now, assuming others no longer bother reporting issues either..."
Others in my company and I have all had the exact opposite experience. I submitted a couple of bugs for Visual Studio that literally no one else had ever reported. One was fixed for VS2008, and another was marked no-repro. However, it was reopened soon after and slated for the next version. Throughout the whole process Microsoft was very attentive and helpful in researching the issues.
I don't think I did anything special. I submitted every piece of information I could possibly gather, all stuff that I would ask for if I were fixing a bug. State information, repro steps, machine information, etc. Sometimes as developers we might forget that sort of thing when submitting problems with someone else's software, and just fall back to "it's broken".
The same is true of others in the company. They've all gotten bugs looked at and fixed. We've even gotten tweaks, not bugs mind you but tweaks, considered and approved during testing of a Release Candidate. We've also had one thing get escalated up to Scott Guthrie. And he actually said that nobody had ever had that issue, but he was still willing to work with us to get it addressed.
Maybe lots of people do have issues getting their concerns addressed with Microsoft, but I've never had anything but success and good experiences.
kettch on May 22, 2008 2:17 AM
Too bad Microsoft never learned this lesson:
http://dotmad.blogspot.com/2008/05/throw-exceptions-responsibly.html
What about this kind of bug:
http://folklore.org/StoryView.py?project=Macintosh&story=Disk_Swappers_Elbow.txt
Part of "crashing responsibly" is not actively punishing your users. The size and complexity of PC applications means that it isn't directly comparable, but the way the iPhone handles problems is brilliant.
In short: as a user you're never really informed of a fatal error. This kind of sounds bad (especially as a developer) but it works well because you rarely lose any data and re-launching the application is almost instant.
More here: http://www.zx81.org.uk/computing/opinion/error-mishandling.html
hahahahahahahahahahah what's with that picture. Nobody does this
Maybe I'm not abusing it enough, but I think I've only seen VS 2005 crash once.
I don't use FF3b5 anywhere near as much as Opera 9.5b2 on my Mac, but Opera is a _lot_ more crash-prone for me. However, I still can't live without searching from the address bar (CMD+T, "g search terms"; I've also set up i for google image search, yt for youtube, etc), or paste and go. Paste and go is probably Opera's killer feature at this point.
On privacy: if you ask the user if they want to send a crash report they will probably drag the window as far off the screen as possible and keep working. However, I make a point of sending crash reports! Except if it's my own software.
John Ferguson on February 6, 2010 10:25 PM
I'm developing a medical call center application (client/server).
I have the following rules when writing the software:
1. I place detailed error logging code everywhere in the code where the code should not execute (but actually gets executed due to a bug). I use try/catch clauses a lot (most of the time only for logging purposes). I use other severity levels as well, but the error logging is important because:
2. Whenever an error log is created - my log4net appender will automatically prepare a zipped version of the log file and send it to a special email account. I can do this because I have a responsibility to make sure that the software runs 24x7 with minimum downtime. My customers even appreciate this because many times I know about problems before they do and am able to prepare quick fixes.
This simple scheme allows me to get near real-time status for all my installations. When a problem does occur - I have a detailed log file with enough information (stack traces are invaluable in this respect) to fix the problem.
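The zip-and-mail step in rule 2 can be sketched with Python's standard library. This is not the log4net appender described above, just an illustration of the same flow: compress the log text in memory, then attach it to a message. The function names, the file names `app.log` and `app.log.zip`, and the addresses are all hypothetical, and actually sending the message (for example over SMTP) is left out.

```python
import io
import zipfile
from email.message import EmailMessage


def zip_log(log_text: str, name: str = "app.log") -> bytes:
    """Return a zip archive (as bytes) containing the log under `name`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, log_text)
    return buf.getvalue()


def build_report(log_text: str, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an email carrying the zipped log."""
    msg = EmailMessage()
    msg["Subject"] = "Error log"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Zipped error log attached.")
    msg.add_attachment(
        zip_log(log_text),
        maintype="application",
        subtype="zip",
        filename="app.log.zip",
    )
    return msg
```

Keeping the build step separate from the send step makes it easy to inspect exactly what would leave the machine, which matters given the privacy concerns raised elsewhere in this thread.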
In my opinion logging is vastly underrated in our industry. After 14 years of active development I came to the conclusion that no amount of logging is too much. Now I know the arguments against logging (slows the software, takes too much disk space, useless information etc).
This is rubbish, pure rubbish. Today, most of my time is spent on actually solving a bug and only a fraction is spent in an attempt to decipher what caused it in the first place. This would not be possible without the level of details I get from my logs.
Now I'm not saying this scheme is perfect. I have tons of ideas for improving it further, but I have 80% of the problem solved this way.
Hope this helps
Liron
@Henry Boehlert: my logs don't contain personal information. Encrypting the logs may be part of a more comprehensive system I'm thinking of (as I've alluded in the previous post).
One of the problems of the current log system I'm using (log4net+special customizations) is that it is not suited for off-line tool analysis. For example - I want to use the logs as a means for collecting usability information like window usage statistics, number of steps needed to accomplish a specific task etc. Many times I want to log the state of complete objects. These things are difficult to do with log4net.
What I'm thinking of is a complete DB based log (something simple like SQLite) with support for different types of logging events (e.g., special event records for user actions, another for performance counters etc). This will come with a generic log viewer that is able to display all events on a time line, complete with serialized logged objects, special logging event types etc.
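A database-backed log along these lines could look roughly like the sketch below, which uses Python's built-in `sqlite3` module. The schema (a single `events` table with a `kind` column and a JSON `payload`), the function names, and the event kinds are assumptions made for illustration, not a description of any existing system.

```python
import json
import sqlite3
import time


def open_log(path=":memory:"):
    """Open (or create) the event log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " ts REAL NOT NULL,"        # seconds since the epoch
        " kind TEXT NOT NULL,"      # e.g. 'error', 'user_action', 'perf'
        " payload TEXT NOT NULL)"   # serialized object state as JSON
    )
    return conn


def log_event(conn, kind, payload, ts=None):
    """Append one event; `payload` is any JSON-serializable object."""
    conn.execute(
        "INSERT INTO events (ts, kind, payload) VALUES (?, ?, ?)",
        (time.time() if ts is None else ts, kind, json.dumps(payload)),
    )
    conn.commit()


def timeline(conn, kind=None):
    """Return events in time order, optionally filtered by kind."""
    sql = "SELECT ts, kind, payload FROM events"
    params = ()
    if kind is not None:
        sql += " WHERE kind = ?"
        params = (kind,)
    sql += " ORDER BY ts, id"
    return [(ts, k, json.loads(p)) for ts, k, p in conn.execute(sql, params)]
```

Because every event type shares one table, a generic viewer only needs the `timeline` query to put user actions, performance counters, and errors on a single time line, and offline tools can run their own SQL for usage statistics.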
Only problem is I have more urgent priorities at the moment :-(
Liron
Liron Levi on February 6, 2010 10:25 PM
I've actually seen the Windows error dialog come up for a .NET application with a global exception handler already installed. I can only surmise that it was out of memory or was in some weird hardware state that actually fried the .NET framework. The thing is, the application would start, but every time it actually tried to access any data or web services it would crash.
Not surprisingly, all it took was a reboot to solve it. I'm not really sure if it would even have been possible to do anything proactive, given that the crash was obviously happening deep within the bowels of the framework and therefore totally out of my control.
I agree with the approach, more or less, but sometimes technology just fails. You're likening the situation to an airplane crash, but it's a false analogy because you'll know about that kind of "crash" well in advance. Sometimes with software, by the time you know that there -is- a problem, it's already too late to do any damage control. Not always, but sometimes.
Aaron G on February 6, 2010 10:25 PM
The comments to this entry are closed.