If you're waiting around for users to tell you about problems with your website or application, you're only seeing a tiny fraction of all the problems that are actually occurring. The proverbial tip of the iceberg.
Also, if this is the case, I'm sorry to be the one to have to tell you this, but you kind of suck at your job -- which is to know more about your application's health than your users do. When a user informs me about a bona fide error they've experienced with my software, I am deeply embarrassed. And more than a little ashamed. I have failed to see and address the issue before they got around to telling me. I have neglected to crash responsibly.
The first thing any responsibly run software project should build is an exception and error reporting facility. Ned Batchelder likens this to putting an oxygen mask on yourself before you put one on your child:
When a problem occurs in your application, always check first that the error was handled appropriately. If it wasn't, always fix the handling code first. There are a few reasons for insisting on this order of work:
- With the original error in place, you have a perfect test case for the bug in your error handling code. Once you fix the original problem, how will you test the error handling? Remember, one of the reasons there was a bug there in the first place is that it is hard to test it.
- Once the original problem is fixed, the urgency for fixing the error handling code is gone. You can say you'll get to it, but what's the rush? You'll be like the guy with the leaky roof. When it's raining, he can't fix it because it's raining out, and when it isn't raining, there's no leak!
You need to have a central place that all your errors are aggregated, a place that all the developers on your team know intimately and visit every day. On Stack Overflow, we use a custom fork of ELMAH.
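To make the idea of a central aggregation point concrete, here is a minimal sketch in Python. It is not how ELMAH works; the `ErrorLog` class, its in-memory storage, and the fingerprint format are all assumptions made for illustration. The only real API it leans on is the standard library's `sys.excepthook` and `traceback` module.

```python
import sys
import traceback
from collections import defaultdict
from datetime import datetime, timezone


class ErrorLog:
    """Hypothetical in-memory stand-in for a central error store.

    A real deployment would persist entries somewhere the whole team can see.
    """

    def __init__(self):
        # fingerprint -> list of occurrences of that error
        self.entries = defaultdict(list)

    def record(self, exc_type, exc_value, exc_tb):
        """Store one exception and return the fingerprint it was grouped under."""
        frames = traceback.extract_tb(exc_tb)
        # Group by exception type plus innermost frame, so repeats of the
        # same crash pile up under one key instead of flooding the log.
        if frames:
            last = frames[-1]
            where = f"{last.name}:{last.lineno}"
        else:
            where = "<no traceback>"
        fingerprint = f"{exc_type.__name__}@{where}"
        self.entries[fingerprint].append({
            "when": datetime.now(timezone.utc),
            "message": str(exc_value),
            "trace": "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            ),
        })
        return fingerprint

    def install(self):
        """Send otherwise-unhandled exceptions through the log first."""
        previous = sys.excepthook

        def hook(exc_type, exc_value, exc_tb):
            self.record(exc_type, exc_value, exc_tb)
            previous(exc_type, exc_value, exc_tb)

        sys.excepthook = hook
```

Calling `ErrorLog().install()` once at startup would capture anything that escapes to the top level; handled exceptions can be passed to `record(*sys.exc_info())` from inside an `except` block.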
We monitor these exception logs daily; sometimes hourly. Our exception logs are a de facto to-do list for our team. And for good reason. Microsoft has collected similar sorts of failure logs for years, both for themselves and other software vendors, under the banner of their Windows Error Reporting service. The resulting data is compelling:
When an end user experiences a crash, they are shown a dialog box which asks them if they want to send an error report. If they choose to send the report, WER collects information on both the application and the module involved in the crash, and sends it over a secure server to Microsoft. The mapped vendor of a bucket can then access the data for their products, analyze it to locate the source of the problem, and provide solutions both through the end user error dialog boxes and by providing updated files on Windows Update.
Broad-based trend analysis of error reporting data shows that 80% of customer issues can be solved by fixing 20% of the top-reported bugs. Even addressing 1% of the top bugs would address 50% of the customer issues. The same analysis results are generally true on a company-by-company basis too.
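The arithmetic behind that kind of claim is easy to check against your own logs. The sketch below, in Python, computes what share of all reports the top fraction of bug buckets accounts for. The function name and the report counts are made up for illustration; they are not WER data.

```python
def coverage_of_top_buckets(counts, fraction):
    """Share of all reports covered by the top `fraction` of buckets,
    where buckets are ranked by how many reports each received."""
    ranked = sorted(counts, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Made-up report counts for ten bug buckets (illustrative only).
reports = [500, 300, 50, 40, 30, 25, 20, 15, 12, 8]

share = coverage_of_top_buckets(reports, 0.20)
print(f"Top 20% of buckets cover {share:.0%} of reports")
```

With this invented distribution, the top two buckets out of ten account for 800 of the 1,000 reports. Real logs will not be this tidy, but the same calculation shows how skewed they are.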
Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed? While there are many other valid reasons to practice TDD, as a pure bug fixing mechanism it's always seemed far too much like premature optimization for my tastes. I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
You can certainly do both. But given a limited pool of developer time, I'd prefer to allocate it toward fixing problems real users are having with my software based on cold, hard data. That's what I call Exception-Driven Development. Ship your software, get as many users in front of it as possible, and intently study the error logs they generate. Use those exception logs to hone in on and focus on the problem areas of your code. Rearchitect and refactor your code so the top 3 errors can't happen any more. Iterate rapidly, deploy, and repeat the process. This data-driven feedback loop is so powerful you'll have (at least from the users' perspective) a rock stable app in a handful of iterations.
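Finding "the top 3 errors" is mostly a counting exercise. Here is a rough Python sketch, assuming each log entry has already been reduced to an exception type and a location. The field names and the sample entries are hypothetical, not the format of any particular logging tool.

```python
from collections import Counter


def top_errors(log_entries, n=3):
    """Return the n most frequent (exception type, location) pairs
    together with how many times each occurred."""
    counts = Counter((e["type"], e["where"]) for e in log_entries)
    return counts.most_common(n)


# Hypothetical entries, as if exported from an exception log.
entries = (
    [{"type": "NullReferenceException", "where": "Cart.Checkout"}] * 41
    + [{"type": "TimeoutException", "where": "Search.Query"}] * 17
    + [{"type": "FormatException", "where": "Profile.Save"}] * 9
    + [{"type": "KeyNotFoundException", "where": "Tags.Lookup"}] * 2
)

for (exc_type, where), count in top_errors(entries):
    print(f"{count:4d}  {exc_type} in {where}")
```

Rerunning the same report after each deploy is one way to see whether a fix actually removed an entry from the top of the list.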
Exception logs are possibly the most powerful form of feedback your customers can give you. It's feedback based on shipping software that you don't have to ask or cajole users to give you. Nor do you have to interpret your users' weird, semi-coherent ramblings about what the problems are. The actual problems, with stack traces and dumps, are collected for you, automatically and silently. Exception logs are the ultimate in customer feedback.
Am I advocating shipping buggy code? Incomplete code? Bad code? Of course not. I'm saying that the sooner you can get your code out of your editor and in front of real users, the more data you'll have to improve your software. Exception logs are a big part of that; so is usage data. And you should talk to your users, too. If you can bear to.
Your software will ship with bugs anyway. Everyone's software does. Real software crashes. Real software loses data. Real software is hard to learn, and hard to use. The question isn't how many bugs you will ship with, but how fast can you fix those bugs? If your team has been practicing exception-driven development all along, the answer is -- why, we can improve our software in no time at all! Just watch us make it better!
And that is sweet, sweet music to every user's ears.
This post feels like the old cliche that I found a hammer and now everything looks like a nail.
First, I echo the comments here about the apparent misunderstanding or apparent misuse of TDD. TDD is *not* premature bug fixing.
Second, not all bugs can be caught. There are entire categories of bugs that don't cause an application to fail, but still hinder or even preclude a user from successfully executing a task.
There is ALWAYS pressure to get your application out. And besides the economic pressures for getting into production, there are soft reasons. As creative beings, we get a huge amount of satisfaction from seeing our creations running and being used (like proud parents or starving artists). Also, some feel like they are serving the greater good - providing solutions for mundane and/or repetitive tasks or for highly critical processes or whatever.
There is a spectrum of tools/approaches to dealing with errors in our software. We need to appreciate them all - from system inception to system maintenance - and embrace a balanced approach using them each in an effective and timely manner.
Bill Berger on April 17, 2009 2:12 AM
There are different types of errors.
Would I wait for a user to find out that a button doesn't do anything? Maybe.
Would I wait for the user to find out that when they delete an order they are actually deleting another customer's orders, dropping 3 tables, and removing the primary key? Hardly.
Steve on April 17, 2009 2:51 AM
I think Iain Holder and Matt Lentzner are absolutely right. In no way does EDD replace TDD, but I do buy Jeff's argument that EDD is uniquely valuable. So use both. TDD requires an up-front time investment, but saves user frustration, improves overall design, and minimizes maintenance and refactoring. EDD requires (practically) no up-front time and provides valuable feedback on fixing bugs that do slip through to production.
So I do buy what I think Jeff intended to say. You can write tests until you're blue in the face, but you'll never guarantee that 100% of the hours are valuably spent and you can't guarantee you're going to stop 100% of the bugs. In a time-constrained, rapid development environment, this will feel to management like wasting time. Thus EDD provides the next layer of defense: you've tested all you can with the hours you have, now it's time to push the bird out of the nest.
The caveat is that this doesn't work for all development cycles. If you're not going to release the next patch for three months, relying on EDD is going to murder you. But in web development, there's no good excuse to wait that long. Be rapid or GTFO.
CynicalTyler on April 17, 2009 3:38 AM
While there are many other valid reasons to practice TDD, as a pure bug fixing mechanism it's always seemed far too much like premature optimization for my tastes
I wish people would stop trying to use Knuth's saying in completely meaningless ways - testing or fixing bugs is not optimization, premature or otherwise.
obviously you haven't seen my code, my software has no bugs :)
Eber Irigoyen on April 17, 2009 4:26 AM
I actually email myself every error that happens on my website www.postjobfree.com
That way I know almost immediately if there is any problem.
First! I'm so proud.
Michael on April 17, 2009 5:57 AM
Isn't WER just another way of waiting around for users to tell you about problems with your website or application?
Isn't shipping software as fast as possible just another way of fail[ing] to see and address the issue before they [get] around to telling me?
Daniel Straight on April 17, 2009 5:58 AM
WOW, great post...
I didn't know about ELMAH and built my own system... doh!
(which actually has some more features than ELMAH like:
- creating tickets automagically
- allow users to add information and get notified about a follow up)
, but is lacking the RSS feeds. I am off to add them :)
@Daniel Straight:
There are ALWAYS (and I stress always) behaviours (I am not saying bugs here) users will find that you didn't expect or anticipate. You will only find them when they are used by users that aren't involved in the development/design/analysis of the product.
Isn't WER just another way of waiting around for users to tell you about problems with your website or application?
The loop is closed much faster and it's completely automatic. All the user has to do is .. use the software. No emailing, no calling, no writing.
Isn't shipping software as fast as possible just another way of fail[ing] to see and address the issue before they [get] around to telling me?
I get automatic Firefox (+plugins) updates all the time for issues I haven't run into yet, personally, as a user. This method of shipping software as fast as possible should be seamless and automatic, and it is .. in that case!
(although I did run into a major bug with FF 3.0.0 that drove me nuts. That one, they didn't anticipate well.)
Jeff Atwood on April 17, 2009 6:13 AM
We collect WER dumps for our applications, and our experience is that they're extremely helpful. Something like two thirds of the time they contain enough information to identify and correct the original fault.
And, to back up your assertion, many (perhaps most) of these reports do not tie up with anything we've had formally reported. The application has crashed, but the user hasn't submitted a fault report to us (other than via WER).
I just wish there was a "Would you like to include your contact information with this report in case the vendor needs to ask for additional information?" checkbox.
John on April 17, 2009 6:14 AM
Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed?
But that isn't what TDD is for. It is not a pre-emptive user defect fixer. TDD as a mechanism provides two major things:
1. Better designed/implemented software.
2. A set of tests to help you refactor without introducing errors.
Stephen Walther covers this and more here:
http://stephenwalther.com/blog/archive/2009/04/11/tdd-tests-are-not-unit-tests.aspx
What is the parallel to this when working with embedded consumer electronic devices with no internet connection? Currently the only method available to us is to spend a lot of money to have people handle service calls about errors they experience and then if service finds them of a high enough importance they let us know and we get some new firmware upgrades out there.
I'd love to know ahead of time what problems our users are experiencing and deploy fixes (especially in markets where upgrades for some consumer electronic devices can be done over-the-air), but there doesn't seem to be a good facility for this. So, for now we are stuck with a QA department to, hopefully, find all of the bugs. Also, people are generally more forgiving of a small webpage error than they would be if their TV crashed.
TJ on April 17, 2009 6:19 AM
Also, if this is the case, I'm sorry to be the one to have to tell you this, but you kind of suck at your job
Well, in most jobs, when you have a real time limit and a lot of pressure, you sometimes have to cut corners on good practice. You cannot always do what you should do. I do not think that I or we suck because we do not have the time to fix stuff that hasn't drawn any complaints. Maybe at Stack Overflow you get to set your own priorities; good for you. But in fact, in a lot of enterprises, developers do not decide what they should do.
Think about it before telling people that we suck at our jobs.
CalmDown on April 17, 2009 6:19 AM
Wow, ELMAH looks awesome. I'd like a Java port please thanks!
MattB on April 17, 2009 6:24 AM
We too at Lokad are using ELMAH. It's still a bit surprising how successful this little piece of code is compared to the whole Health Monitoring thing that is supposed to be built-in since .NET 2.0.
Joannes Vermorel on April 17, 2009 6:31 AM
I think checking for user exceptions is a useful practice, but I'm dubious if it can be the main driver for development. Maybe you're only really talking about web applications, where you can monitor exceptions server side, and roll out fixes instantly.
Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed?
Your attitude to TDD still confuses me. Are you saying that you implement TDD, but ignore test failures unless you're sure a user will see a problem, or are you a fan of TDD, but don't use it on your own projects?
Finally, the phase of testing that's missing here is in-house testing. Does this mean you do not use any quality control for web applications, using your users as testers?
Steve W on April 17, 2009 6:42 AM
I quite agree with the post, but I also think knowing more about your application than the user is somewhat of an abstraction.
I mean, unless you write very close-to-the-metal applications, your application will have to get along nicely with a lot of components.
And if you have the luck of writing something quite successful, it will have to rely on / live with / work with virtually countless pieces of software, written by who-knows, maintained by who-cares and known by no-one.
ELMAH looks very interesting. Is there an equivalent for those of us in the Java world?
Alex on April 17, 2009 6:43 AM
I work in an office where I'm the only developer on very large projects, so sometimes I push myself to get some results quickly and I ignore any exception handling and tell myself it's ok because I'll do it later. I'm involved in a project right now, an enormous project, with no system to really track exceptions and I'm having to go back and re-tool my code. I think I'm going to give ELMAH a chance.
Thanks for this post and from now on, that will be my first step when developing a new project.
Shady on April 17, 2009 6:45 AM
That's home in on. Why would you write hone? It doesn't even make sense...
Mr.'; Drop Database -- on April 17, 2009 6:51 AM
As usual I like your post. I did find one section that slightly tweaked me Although I remain a fan of test driven development...practice rather than theory
I think you misunderstand TDD. TDD != unit test everything, it is a development process that is test driven. I am not saying that you are wrong in not practicing TDD, but I would think you would want to read up on it before making proclamations on it. The best I could find this morning was http://www.developerfusion.com/article/9375/tdd-in-practice-dealing-with-hardtotest-areas/2/. I know there are better.
Mike Polen on April 17, 2009 6:53 AM
And now, the obligatory WE DO THAT TOO Mac post.
Although Apple does collect crash reports, they don't allow devs to see them. However, the log's location is documented and tons of different logging solutions exist to retrieve them automatically and send 'em back even here on the Mac front -- from Smart Crash Reports to PLCrashReporter to homegrown solutions :)
millenomi on April 17, 2009 6:59 AM
I started trapping errors on a whole host of websites at my last company years ago and it was an enlightening experience. Started with old ASP sites on shared servers and the then-new .NET sites. We found bugs and security issues that no user or client was reporting to us. Like Microsoft's research, the vast majority of the issues were from a very small set of bugs (all written by ex-employees, of course). Two interesting side effects:
1. I took a lot of satisfaction the day the RSS feeds of our errors went from dozens or hundreds in a day to one a week or so.
2. Patching the same issue across dozens of old web sites with similar code helped to focus us on being better about code sharing and deployment.
There are a lot of usability bugs that do not actually result in crashes. Those you cannot handle this way.
philibert on April 17, 2009 7:05 AM
While there are many other valid reasons to practice TDD, as a pure bug fixing mechanism it's always seemed far too much like premature optimization for my tastes. I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
I'm sure you still don't get TDD: It's not a bug fixing mechanism (apart from bequeathing a test suite you can use to check for reported bugs), it's a design fixing mechanism.
I'd rather spend my time writing the next app than running round fixing bugs I'd rather not have put in in the first place.
btw How do you know you've fixed that top bug? Do you have tests for that? What about when you've 'fixed' 19% and you're looking at the next 1%, how do you know that the 19% are still fixed? And just how many bugs do you have?
quamrana on April 17, 2009 7:05 AM
Banana tactic:
Bananas are green when you buy them at the store [at least often they are], they will get yellow once you take them home and wait a while
So the store is selling you an unfinished product, it is finished at the customers home. Sure, this works very well, but it is something frowned upon by most developers, customers, and companies. I'm not buying a car that is not really working just so I can report back to the car vendors what problems I have and they can fix them and one year later I really have the car I thought I bought one year ago.
Mecki on April 17, 2009 7:08 AM
...to continue my thoughts. For those usability bugs without crashes, I like what Microsoft did with the Windows 7 Beta.
There was an easy-to-find Send feedback link where you fill out a very simple form and submit it. A bit more work to process this feedback, but at least you have a good chance that someone who has a problem will use it if it is easy to find. Most of the time, users don't go out of their way to find out how/who/where to submit problems/issues.
philibert on April 17, 2009 7:11 AM
Wow, this is a great article!
Interestingly, we just yesterday released our final beta version of a product that is similar to ELMAH but more comprehensive and compatible with any .NET app (WinForm, WPF, Service, not just web apps).
Free download at www.GibraltarSoftware.com
Jay Cincotta on April 17, 2009 7:11 AM
TDD? Duh. I read about TDD before, and sure, it makes your application rock solid because you build it from testing. But considering the time constraints, I'd rather skip writing test scripts until I have enough time to do it. We don't ship the test scripts to the customer anyway. I mean, the customer doesn't care whether or not we wrote test scripts; they are only after the final product. Something about TDD really annoyed me. I read one article that suggests writing the tests before the application code, to the point of trying to call a method you haven't implemented yet. I think that's stupid. The test will of course fail because the method doesn't exist yet. Also, most of the time, writing tests is more difficult than writing the application itself. What works best for me is a thorough review of the code before testing.
nobodynobodyghost on April 17, 2009 7:11 AM
That exception log works great when everything runs on your server that you have complete control over.
What do you do when your application is running on the customer's computer? How do you get the exception dump back to the engineers?
... my thoughts again on non-crashing bugs. Stackoverflow is a good example of a relatively simple way to submit problems/issues/comments. But still too complex for a lot of users. Personally it was a big turn-off to have to create an account with yet another system to submit bugs/problems and I did not do it.
philibert on April 17, 2009 7:14 AM
I actually take this to another level: every exception generated by any test or production server is emailed to every engineer on the development team immediately, with the stack trace and as much detail about the context of the exception as the error handling code can provide.
I can't wait until someone happens to look at a log file or the Event Viewer to know that something is broken. And constantly being annoyed by emails is a great incentive to fix bugs quickly.
Eric Z. Beard on April 17, 2009 7:20 AM
I'm sure you still don't get TDD: It's not a bug fixing mechanism (apart from bequeathing a test suite you can use to check for reported bugs), it's a design fixing mechanism.
I don't know why I should be the one to defend Jeff on this point, but he does say there are many other valid reasons to use TDD, just not as a bug fixing mechanism, so it seems he is agreeing with you.
Steve W on April 17, 2009 7:24 AM
@Jeff Atwood
If you fix a bug that no actual user will ever encounter, what have you actually fixed?... I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
Young man, what the hell are you thinking????
If you fix a bug in new code, then by definition, a user won't encounter it. Would you rather leave it in until it becomes a practical bug???
And if the bug is in code that no user will actually cause to be executed, then why exactly do you need that code?
Sloppy thinking, young bloke, very sloppy :-)
Reminds me of one time we corrected a bug before a user reported it. When the user called us, it went like this:
- User : Hey, i got an error with blurry-term-used-to-describe error.
- Dev : Yeah, we know, you crashed at 9 this morning, it's fixed now (11, same day)
- User : Heee, how do you know that ?
- Dev : We installed a camera on the website so we can track your work.
- User : Oooohhh took 10 seconds to realize the joke.
And a lot of laughs from everyone, including the user... ;-)
Dan on April 17, 2009 7:38 AM
Do you handle client-side (Javascript/CSS) errors as well? You could use some quick ajax calls to send the Javascript error log to the server along with browser and browser version.
Hoffmann on April 17, 2009 8:11 AM
Interesting, thanks for the info.
GoldenMoonHotelCasino on April 17, 2009 8:20 AM
I agree with this article completely. However, this is really only practical for certain types of systems. I wouldn't want to just get the space shuttle launch control software out there and wait for the exception logs. Ditto for any medical software.
JohnOpincar on April 17, 2009 8:22 AM
This post is very similar to what Joel Spolsky wrote on beta testing. There is one point he mentioned there that I think adds value to this blog post. He said that if you ship your beta software too early, you will see two negative results:
1. You will be deluged with more bug reports than you can deal with, and you will end up being forced to ignore most of them.
2. You will alienate your users because your product is so buggy and unusable.
This is why I think TDD is so valuable. As Iain mentioned, tests aid you in writing better designed code and enable you to refactor with confidence, but they also provide a sanity check for you to let you know that your code works in the basic cases. For the most part I don't think anyone writes tests that are extremely thorough; they mainly write them to test that code handles the common cases as well as a few potential edge cases. When you have done enough internal testing that you know your product works in the ideal cases, then it is time to let your users take a stab at using it.
Sean Reque on April 17, 2009 8:24 AM
@jeff
I get automatic Firefox (+plugins) updates all the time for issues I haven't run into yet, personally, as a user. This method of shipping software as fast as possible should be seamless and automatic, and it is .. in that case!
This sort of thing only works when a user is connected to the internet. It's an assumption a lot of developers make, which can in many cases be wrong.
Thomas Winsnes on April 17, 2009 9:02 AM
At a previous location, we had a mature enterprise product where we took this approach to stabilizing the app. For a mere 30 users, we had 400+ distinct issues come in through FogBugz in the first 3 months (600+ came in and most had several occurrences). All we did was put an exception logging call in all the event handlers.
When I stopped supporting that app after about 2 years, we were down to ~0.8 issues per day following this same approach. We did have the benefit of a ClickOnce deployment so we could pump out several releases per day.
Austin on April 17, 2009 9:06 AM
That's assuming your bona fide error causes an exception in the software. What about, say, a captcha system that produces completely illegible squiggles? A search system that turns up useless results for the first 5 keywords I tried when looking for a specific item (and then I go to Google and get it, the first hit on my first search)?
StackOverflow.com, for example, suffers from both these problems and more. It's one of those programs/websites that doesn't throw an exception very often (maybe once every week or two for me), and yet many features are not in a state that any person would call working.
I think I prefer sites that are more usable, even if they throw an exception now and then. Of course we aim for both, but putting all of your eggs in the never throw an exception basket seems to lead to lack of manpower to deal with usability.
Ken on April 17, 2009 9:09 AM
I don't believe that exception driven development requires the use of built-in exceptions.
Nathan on April 17, 2009 9:15 AM
Very nice and interesting post, and an interesting read about WER. We usually sit with users and observe how they interact with the application. If there are any errors, we fix them then and there and make the changes in code. This works for us because the tools/applications we develop are usually in-house and small. Any bugs found are corrected. Our process is not systematic; we tried to make it so, but it didn't work for various reasons. One reason is that the process here is more result-oriented than concerned with whether standards are maintained or whether the design is modular enough; the scope for further improvement is not considered, etc. Unfortunately these tools are developed just in time, and they are used a few times and forgotten until they are needed again (probably never). These are more like one-time-use tools, or rarely used ones, once in 3 months or once in two years, so really there is no interest in putting in that little extra effort, because we know they will be used for a while and then forgotten. There are a few applications/tools that I have worked on where I tested well enough, put in that extra effort, and made the design modular enough to handle any changes; so far so good.
Anand.V.V.N on April 17, 2009 9:17 AM
That's home in on. Why would you write hone? It doesn't even make sense...
People have been saying hone instead of home since like 1968. Stop living in the past.
J. Stoever on April 17, 2009 10:04 AM
Amen. I also set up error reporting as the first thing on any web-deployed project. It makes a huge difference.
One quibble, though. Test-Driven Development (or whatever you want to call the different variations of "the developers write code to exercise code") shouldn't be justified in terms of bug fixing or bug finding. It's about bug prevention. Surgeons don't sterilize instruments to cure existing infections; they do it to prevent new ones.
I can sympathize with this post, however there are two important issues worth mentioning.
First, unless you are extremely liberal with your use of exceptions (which most aren't for good reason), you will miss a potentially large class of bugs due to calculation or workflow bugs. Think about cancellation issues in financial calculations that look A-OK but really reflect a bug in the system. Good TDD is great at rooting out these issues.
Second, users do funny things. I noticed on one project (that implemented a similar exception logging system) that users would give up on parts of the application that were consistently buggy or - worse yet - figure out complex workarounds. This will cause the number of exceptions caused by a given code path to drop over time and divert the developers' attention to more frequent exceptions.
J. Scott Miller on April 17, 2009 11:14 AM
I agree partially, it's true what you say but a balance must be found to not ship too-buggy software that scares away your clients or possible future clients (in demo versions, for example).
In summary, shipping too late is bad and shipping too early is also bad.
Unknown Programmer on April 17, 2009 11:14 AM
When a user informs me about a bona fide error they've experienced with my software, I am deeply embarrassed. And more than a little ashamed
Actually, I think this is possibly the most powerful statement in the whole article. Regardless of technical procedures used, if more people actually felt this way, they would probably do a better job. I work with a lot of people who have no pride in their work, and fixing bugs is just something they take in stride. They feel nothing about the existence of the bug in the first place, and do nothing to try to eliminate them unless someone complains.
Jasmine on April 17, 2009 11:17 AM
100% agree that this is very important - I also try to fix exception handling first - but crashes!=errors.
Perhaps a feature is too broken to work at all, or your parser rejects uncommon but valid input, or your application got slower, or its screens no longer match the help pictures - none of these generate exceptions.
Don't only log exceptions, log all types of errors and quite a lot of other information too, and regularly spend a few minutes sitting with users.
Pete Austin on April 17, 2009 11:48 AM
I hate to pile on, but Jeff, your counter-TDD example is bogus.
If you find a bug with your test that a user would never encounter then why does that code even exist?
To me, TDD enforces the contract that the production code has with its clients. There's a school of thought that says the test code should be written before the production code. You're done when the production code satisfies the test.
Matt Lentzner on April 17, 2009 11:50 AMI do this a lot with Paint.NET. Most of my 0.01 minor updates are pushed out solely to fix bugs that were identified from the stream of crash logs that were e-mailed to me. Occasionally I'm able to implement a conservative fix for a bug that I can't even repro myself. For example, I compile without Check for underflow/overflow enabled for major performance reasons. Some people have software that installs mouse or keyboard or other hooks, and these start executing in my process and flip the x87 floating point exception handling bits and then don't un-flip them. This causes my code to throw overflow exceptions when it shouldn't be (or rather, it happens in code that I've written and tested and determined that even if it does overflow it's ok -- certain types of pixel manipulation code, for instance). So, based on exception logs that identify a few hots spots, I can put in a try {...} catch (OverflowException) { provide default value of zero or whatever }. It's then reasonably proven that the crash is fixed because I stop getting logs with that callstack and exception type.
Rick Brewster on April 17, 2009 11:53 AM
Wrong wrong wrong!
EDD rebuttal: 1. What if you have no beta users? If your software is too buggy, no one will keep using it. 2. EDD encourages behavior such as Run-It-And-See-If-It-Crashes...
Isaac on April 17, 2009 12:24 PM
@Jeff Atwood
I enjoyed the article, although I get the feeling you don't have a good grasp on test driven development based on your following statement:
Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed?
This completely misses the whole point of Test Driven Development. I think you would benefit from reading the article "Test-Driven Development Isn't Testing" by Jeff Patton
(http://www.stickyminds.com/sitewide.asp?Function=edetailObjectType=COLObjectId=8497)
And constantly being annoyed by emails is a great incentive to fix bugs quickly.
Or just ignore them all.
I prefer to not write bugs to begin with.
Bill on April 17, 2009 1:14 PM
@Bill
Good luck with that.
Ens on April 17, 2009 1:32 PM
To the people that said "...why does that code even exist?", I think the correct question is "why does that test exist?" Presumably the code does something useful but the test scenario can't happen in real life, and that's how I understood the comment about TDD being speculative.
Being an embedded systems guy I can only envy web developers who can rush something out and let users do the testing. To a much lesser degree we do that too, but it results in devices being shipped back to us and lots of warranty cost.
Jeff,
Do you realize that with posts like this, irrespective of your understanding of TDD, you are promoting many of the problems that things like automated unit and regression tests have solved? When you ship software where one small bug can make a business lose thousands of dollars, how can you justify this?
Software is not only about the Stack Overflow web site, no matter how many millions of users it has.
Hadi Hariri on April 17, 2009 1:55 PM
@Jeff Atwood
If you fix a bug that no actual user will ever encounter, what have you actually fixed?... I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
Young man, what the hell are you thinking???? Jim Cooper
I agree with Jim Cooper here.
If I were being interviewed by Jeff Atwood for a job with him and he said that, I would tell him he was an incompetent software engineer and walk out. Such an utterly stupid remark.
A bug is a bug is a bug; it is not a theoretical or a conjectural concept.
It is akin to saying that if an electrician wires a socket incorrectly (for whatever reason) in your house then the 'bug' is only theoretical if no person uses it.
What an unprofessional mindset.
Sam on April 18, 2009 4:23 AM
^ sorry, but the truth is there will always be bugs and they need to be prioritized. You can fuss over the usage of the terms practical vs theoretical, but he makes a valid point, one that's obvious when you quit harping over his verbiage.
A bug that results in an exception is a bug that crops up in a valid use case and needs to be fixed. These fixes necessarily take priority over TDD. If you want to put in the unit tests to catch the bug afterwards, great, and that's part of his point too.
Michael Reiland on April 18, 2009 5:00 AM
All this stuff just makes me think we need to simply retire the word 'software'. Pretty much all sentences which contain any of the phrases 'software project', 'software expert', 'software technique', 'software bug' are wrong at least as often as right.
No-one uses 'hardware' in the same way, no-one thinks a bridge is really the same as a disposable shopping bag. Those responsible for making those two things would never lecture each other 'when building hardware, you need to ...'.
But too many people still have this strange idea that there is some meaningful commonality between a game, a website, a service, an embedded device, a product, an in-house application, ...
soru on April 18, 2009 7:53 AM
Jeff,
I have to admit you're a far better writer than me. We've updated the copy on the home page of our online exception reporting tool to use the term “exception-driven development”, which we all liked a lot.
In case someone is interested, we are making a web app called CrashKit which collects exceptions from your web/desktop applications. We already support Java, Python, PHP and JavaScript, but we're in an early private beta stage, and there are many areas we want to improve. Still, if someone is ready to become an early adopter or is just plain interested, please visit http://crashkitapp.appspot.com/.
Come on Jeff - put an end to all the speculation about your competence and post the source to StackOverflow. Think of all the help you'd receive...
The source to FogBugz has been published for years without Joel being rumbled, so you could just get away with it.
Will Dean on April 18, 2009 9:46 AM
Jeff, would you please fix the "Read older entries" link on the main page? It always takes you to a _specific_ older entry, rather than the previous page of entries.
I have to admit jeff, while you're mostly right about the average end user only being able to give semi coherent ramblings that are usually useless, there are people out there, who arent computer programmers per se, who could definitely help out quite a bit in fixing bugs by describing what was happening as, or before the bug happens. most error reporting software simply takes basically a screen shot of what is going on at the moment of the crash, but those few moments of usage leading up to the crash could be as important, if not more so, than what was happening at the time of the crash.
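One way a crash reporter could capture those moments leading up to the crash is a small ring buffer of recent user actions that gets attached to the report. This is only a sketch: the `Breadcrumbs` class, its method names, and the report layout are all hypothetical, not taken from any particular error reporting tool.

```python
from collections import deque


class Breadcrumbs:
    """Hypothetical helper: remember the last few user actions."""

    def __init__(self, limit: int = 5) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=limit)

    def record(self, action: str) -> None:
        self._events.append(action)

    def crash_report(self, exc: BaseException) -> dict:
        # Attach the trail of recent actions to the error itself.
        return {"error": repr(exc), "recent_actions": list(self._events)}
```

The application would call `record(...)` on each menu click or page view, and `crash_report(...)` from its top-level exception handler.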
Jasen on April 19, 2009 2:03 AM
Also, I'm very sorry about the abundance of commas and excessive modifiers in that last comment. 4 am is a very bad time for proofreading.
Jasen on April 19, 2009 2:06 AM
Written like a truly old-school MS programmer. Make a buggy product and let the users test it for you.
But then this blog is all troll lately.
I also have a hard time following the logic of the post. Oh I guess that's because it's not a logical argument. The only thing it says, which is totally f'ing obvious, is that 'bugs happen, monitor real errors, quickly fix and turn around if you can'. Unless you're a hobbyist programmer, you or your team are already doing this.
When a user informs me about a bona fide error they've experienced with my software ....
This should read 'when a user experiences a bona fide error'. But then again you're also making a judgment call on what a bona fide error is. Just because your application hasn't melted down into a smoldering heap of bytes doesn't mean a bona fide error hasn't occurred. I've read a lot of your comments and seen the way you approach issues in UserVoice.
We don't ship software that has many exceptions in runtime. Exceptions are meant for developers and testers to bring the system to a halt as quickly as possible so that as many real bugs can get fixed before we ship.
You might make a few good points, but you obscure them with your smack-down rhetoric. The worst part is many junior programmers read this blog and think they're better off for it.
spaceace on April 19, 2009 2:43 AM
While I always expect users to point out the problems with my applications, sometimes I just don't expect the (higher than expected) number they come up with. Users are certainly great at improving your work I find but yeah, it's always good to keep a crash log as well. - Prague Hotel.
Prague Hotel on April 19, 2009 8:12 AM
Need to set filter.
This blog has descended into navel gazing of the most egregious kind.
Get a life, post less, and make them interesting for fecks sake
Red on April 19, 2009 8:57 AM
I'm sick of that stupid iceberg picture. (1) It defies basic physics; it would roll onto its side. (2) Icebergs don't look like that. (3) You can't see that far under water anyway. (4) Don't tell me to lighten up.
Jim on April 19, 2009 9:19 AM
@CalmDown:
Your responsibility as an engineer is to provide the best experience possible for the customer. There are always deadlines to meet, but if you find yourself continually unable to meet them, you either need to work on time management, or work on expectation management with your boss and/or customer. Do not let your boss or customer drag your product down into mediocrity. Left unchecked, they sometimes will...though unknowingly.
Worse still is this unspoken agreement among some engineers that complexity is a necessary ingredient to every codebase. I run into it occasionally and it is very disheartening. It seems like since they've always had experience dealing with (and writing?) messy, complicated code, a codebase that is actually designed is a bit disturbing. Kind of like how people tend to go from one dysfunctional relationship to the next until they figure out the common factor. I know some of this sounds ridiculous, but I've seen it more than a few times. They never seem to make the connection that simplicity often begets a lower bug count. I suspect they feel a bit insecure at the prospect of someone doing a better job than they. That, and some people tend to slide into being self-impressed with what they've done instead of pushing it to be even better.
I don't mean this to be navel-gazing, but it underscores a larger problem of people who just do the work without reflecting upon it working alongside people who do.
Matt Green on April 19, 2009 10:14 AM
I would prefer to call it error-driven programming. I sometimes avoid exceptions because I think they are a bad model for handling errors in C++ (I am of the opposite opinion when it comes to, e.g. C#), but the applications I produce still provide error logs, which can provide the same information.
I wholeheartedly agree with the principle though... in my experience most bugs are found once the software is in the hands of the users. They do things that I just never seem to think about trying and the result is that they turn up problems in my code that I never would have even considered looking for.
This doesn't mean you should ship buggy software of course... this is exactly what alpha and beta tests are for. :)
Jheriko on April 19, 2009 10:19 AM
I'm not sure I agree with you on this Scott...
This may work for a site like stack overflow, but if you have a complicated stateful server application the cost of debugging and fixing a bug when it happens in production is many orders of magnitude greater than fixing it when you write a unit test.
* When diagnosing a bug in production, you are exposed to the full complexity of the entire system, and all of its dependencies.
* All you have to work with are log files. Generally, you cannot attach a debugger to step through and debug interactively when your app is running in production.
* You may suffer some down-time either as a result of the bug, or in upgrading the system with the bug patch. This carries some reputation risk.
* Once you've identified the root cause of the problem you then need to think very carefully about remedial actions. What else could this failure have impacted? What knock-on effects could it have? For example, has it corrupted some data in your production database, which will come back to cause knock-on problems at some indeterminate future date?
For these reasons, and more, I believe that problems in a large stateful production application are many orders of magnitude more expensive, therefore finding as many of them as you can at development-time is time well spent.
That said, for a site like Stack Overflow this methodology may work better. Though I'm sure you still have some overheads of potential data corruption, down-time, and patch release.
You also forget that a large part of Test-Driven Development is about design, in addition to testing, though this may be more relevant to a framework or public API than to Stack Overflow.
Daniel Fortunov on April 19, 2009 11:26 AM
Where by Scott I meant Jeff... that's how we pronounce it in the UK dontchaknow?
Daniel Fortunov on April 19, 2009 11:27 AM
The truth is that Jeff is an incapable person, so he needs to justify his lack of ability.
TT on April 19, 2009 1:11 PM
That's all good, but it bothers me to no end when a company implements automatic crash reporting and then updates their website to make it impossible to report bugs without going through 10 layers of non-technical support personnel.
David on April 20, 2009 2:19 AM
@Sam (who said)
It is akin to saying that if an electrician wires a socket incorrectly (for whatever reason) in your house then the 'bug' is only theoretical if no person uses it.
What an unprofessional mindset.
...no, it is akin to saying that if an electrician wires a socket (period) in your house, then a person sticking a plastic knife (with an emf-sensitive trigger attached to an explosive charge) into the socket would be something he would not be reasonably expected to account for when wiring the socket (the theoretically implausible...not impossible...test case).
A lot of people seem to be taking the view that all tests are created equal, but that's obviously untrue. Stating this in an article, as was done here, hardly seems like cause for an uproar.
If you'd prefer an electrician to preemptively account for you sticking a plastic knife into a socket (maybe a cover that deploys when it detects plastic at too close a distance?), then fine...you're being thorough, go for it...I'm not buying a $4,000-spork-defeating socket for my house, though.
I looked around and wasn't able to find an exception reporting system like ELMAH for the PHP world. So I wrote one. Now more are starting to come out of the woodwork.
https://sourceforge.net/projects/skidder/
Cliff Ingham on April 20, 2009 2:54 AM
sounds like you're trying to justify something!
bilbo on April 20, 2009 3:03 AM
The silent exception reporting is all well and good until you find yourself working with people who really don't want any of their data being passed out of their own network, or whose network isn't connected to the internet consistently. At that point you end up spending a lot of time on the phone.
Also, @Jim: Lighten up.
Breakfast on April 20, 2009 4:50 AM
http://itc.conversationsnetwork.org/shows/detail3995.html#
John the Statistician on April 20, 2009 7:55 AM
sure, it isn't the greatest approach 100% of the time, but it does have merit.
Oh, and jim, lighten up :)
so, what would you do when you're developing a very content-intensive (image, video, swfAS2/swfAS3, xml content files, etc) RIA in flex?
most errors we get are not simply caught by try/catch. there is no catchable exception for a misidentified swf format (as2 instead of as3, which simply displays the first frame), a browser crash because the adobe plugin doesn't like h264, or data not being displayed because of a wrong set of config files. These are examples of the program working perfectly within given parameters but nevertheless producing situations which are perceived by the user as errors.
So, I think, when you come down on us for not knowing an error before the user does, you speak mostly for excel-like .NET applications for database organization. there are some problems in real life which aren't behaving according to microsoft standards...
excellent, thanks for the info.
GrandPortageCasino on April 20, 2009 10:44 AM
I had a refreshing discovery when starting a project in django recently, where when in debug mode the framework dumps a well-formatted stack trace that even includes sections you can expand to reveal local vars for each frame. Once you deploy your application and turn off debug mode it stops dumping the stack trace and instead emails any number of site admins with the error details. However I do like the idea of a central exception log like you showed above, which would probably be easy to do with some django middleware.
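To give a feel for the central exception log idea, here is a framework-neutral Python sketch. It deliberately does not use Django's middleware API: `log_exceptions`, `ERROR_LOG`, and the dict-shaped request are all assumptions made for illustration, with the in-memory list standing in for a shared database table.

```python
import traceback
from datetime import datetime, timezone

# Stand-in for a shared store (database table, log service) that the
# whole team would look at every day.
ERROR_LOG = []


def log_exceptions(handler):
    """Wrap a request handler so unhandled exceptions are recorded centrally."""

    def wrapped(request):
        try:
            return handler(request)
        except Exception as exc:
            ERROR_LOG.append({
                "when": datetime.now(timezone.utc).isoformat(),
                "path": request.get("path"),
                "type": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            })
            raise  # still let the framework show its usual error page

    return wrapped
```

Re-raising after logging keeps the existing error page and email behaviour intact, so the log is an addition rather than a replacement.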
Matt Smalley on April 20, 2009 11:45 AM
I would say ... for server based code this is just fine. For client based code this is tricky, because if you iterate often you have to keep track of versions and all that nightmare. For client code I'd advocate TDD.
Gregory Magarshak on April 20, 2009 12:50 PM
how 'agile' of you jeff.
matt on April 21, 2009 5:27 AM
When I hear people say things like, "Real software is hard and is hard to learn," I feel really disheartened.
A good developer makes difficult tools easy to use... as long as management, and often the users themselves, will get out of your way and allow you to do your job.
Mitur Binesderti on April 21, 2009 9:56 AM
I completely agree. In fact we adopted ELMAH for our SharePoint environment and it was the very first task in our first Sprint for our big SharePoint Redesign.
We also have a complete environment where users can use the code for immediate feedback and user testing before it gets to production.
rich l on April 21, 2009 10:44 AM
Nice article, thanks!
I agree, correct exception handling is a core principle of software quality. Exception handling practices are well-documented, but there are so many developers who can't apply them: you can find crazy try/catch blocks in a lot of parts of the code. This is the reason why I prefer to centralize/encapsulate the error handling. I wrote a short article on how to implement it with AOP features of Spring.NET (PostSharp is coming soon).
http://www.coderecycling.net/2009/04/centralized-exception-handling-logging.html
I use XML data type (feature of MS SQL 2005 and 2008) for error logging, this enables me to add extra information without the need of database structure modifications. I can make more detailed queries by XQuery and build error reports with XSLT.
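The appeal of an XML error record is that extra fields can be attached without a schema change. A minimal Python sketch of that idea follows; it is not SQL Server code, and `error_to_xml` plus the element names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def error_to_xml(exc: BaseException, **extra) -> str:
    """Serialize an exception plus arbitrary extra fields as one XML document."""
    root = ET.Element("error", {"type": type(exc).__name__})
    ET.SubElement(root, "message").text = str(exc)
    details = ET.SubElement(root, "details")
    for key, value in extra.items():
        # New kinds of detail need no new columns, only new <item> elements.
        ET.SubElement(details, "item", {"name": key}).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A record like this could then be stored in a single XML column and queried or transformed later.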
Robert on April 22, 2009 2:50 AM
All Jon Skeet's code is exceptional!
Lucas Aardvark on April 22, 2009 5:36 AM
Microsoft has been doing this for years, thus leading to their current remarkably high level of quality?
Exception reporting is great, but if you put out crappy software as a baseline, you're still putting out crappy software. If you're fixing *that* much stuff due to your exception logs, you're doing it wrong.
Chris D on April 23, 2009 12:44 PM
I've had the same issue with production sites. But I never seem to eliminate 100% of the issues. There are still many issues with corrupt VIEWSTATE etc ... that range from unpredictable to impossible to solve.
I wish .net had a better logging/exception handling system that could Snapshot the state of the process for you. Variables/Requests. It would go a long way in trying to fix these pesky production exceptions.
I don't think I've ever had an event viewer log free of errors in my entire career. It's a lot of effort to stay on top of every issue. :(
Event viewer / logging does a terrible job of letting you ignore errors. If someone sends a bad request to my DNS server for example ... do I really need it as an error in the event viewer? They're the ones who sent the bad request. lol
Chad Grant on April 26, 2009 8:14 AM
For WinForms... would you still recommend the code you published on CodeProject at http://www.codeproject.com/KB/exception/ExceptionHandling.aspx? It is 5 years old now. If you were to do a WinForms app, what would you use?
Seth
Seth Spearman on May 5, 2009 6:53 AM
We adopted AVICODE, it's pretty good.
geethaji on May 6, 2009 2:09 AM
For Winforms, you'll probably want to take a look at Exceptioneer.
It's starting to shape up with some great features and just added support for winforms:
Michael K. Campbell on May 11, 2009 1:51 PM
I hope you will think twice before saying: "Ship your software, get as many users in front of it as possible, and intently study the error logs they generate."
You are practically saying that the final user (the client of your application) is the tester of the application.
How many years in PM do you have? Keep practicing this philosophy and you will lose all your customers ... and fast!
Adrian on May 31, 2009 3:54 AM
Excellent post! I set up ELMAH in our application a month back and now the ELMAH RSS feed is the first thing I check in the morning.
faj on June 10, 2009 11:47 AM
Content (c) 2009 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.