March 29, 2008
I lived in the Denver area at the time Denver International Airport's completely computer automated baggage system was unveiled in 1994. The troubled development of this system was big local news.
The premise of Denver's plan was as big as the West. The distance from a centralized baggage check-in to the farthest gate - about a mile - dictated expansive new thinking, planners said, and technology would make the new airport a marvel. Travelers who arrived for check-in or stepped off a plane would have their bags whisked across the airport with minimal human intervention. The result would be fewer flight delays, less waiting at luggage carousels and big savings in airline labor costs.
Tours that preceded the system's debut led invariably to an airport basement where 26 miles of track, loaded with thousands of small gray carts, sped bags up and down inclines as conveyor belts minutely timed by the computer deposited each bag in its cart at just the right moment.
The baggage system was an abject failure. It had huge problems on opening day, and almost immediately had to be superseded by manual procedures. Things never improved much from there. Denver's automated baggage handling system was scrapped completely by 2005, in favor of traditional manual handling and barcode scanning procedures.
"It wasn't the technology per se, it was a misplaced faith in it," said Richard de Neufville, a professor of civil and environmental engineering and engineering systems at the Massachusetts Institute of Technology. Professor de Neufville said the builders had imagined that their creation would work well even at the busiest boundaries of its capacity. That left no room for the errors and inefficiencies that are inevitable in a complex enterprise.
"The main culprit was hubris," he said.
I was surprised, then, to read about essentially the same scenario playing out 10 years later in England at Heathrow Airport's new Terminal 5.
Many things did not go according to plan at T5, but at the core of the fiasco is baggage. This is supposed to be a state-of-the-art system, the biggest in Europe with 10 miles of conveyor belts controlled by 140 computers and designed to process 12,000 bags per hour at up to 23 mph. But it had never been tested before in a live terminal. On T5's D-Day many aspects -- human and technological -- simply did not work.
BA say they are confident that a few days bedding-down will sort out the problems. However, T5 is not yet at full capacity, with another 70 long-haul destinations due to move there on 30 April.
I sincerely hope BA can escape from Gilligan's Island, unlike so many hardy travellers before them. Otherwise, all we'll end up with is another sad entry in the long, dismal history of software project failure.
Posted by Jeff Atwood
I don't really understand the point of this post?
Maybe Sean’s lack of understanding comes from his being a project manager at Terminal 5?
I suspect the point is: learn from others' mistakes, life isn't long enough to make them all yourself.
I could have put money on one of my favourite bloggers commenting on this. I read two key points :
1) Software engineering is a lot harder than people think - _especially_ non techies. Unfortunately non techies are sometimes the ones writing the checks, and without techies on hand to give authority on what you can and can't do, you're going to have a disaster on your hands. Say what you like about Bill Gates but he had enough technical chops to avoid major disasters like this (at least until the mid nineties anyway).
2) Steve McConnell's classic mistakes are truly classic. I'm not sure how many came into play with Heathrow T5, but "shortchanged quality assurance" and "insufficient risk management" leap out. You see them playing out again and again. I'd bet money on another IT disaster cropping up in ten years' time from this with the same classic mistakes.
I don't really understand the point of this post?
I know a few have mocked my propensity to bold text before, perhaps rightfully so. But in this case, if you're wondering what the point is-- just look for the bold text.
I understand that hubris is a key ingredient in many failures. What I was expecting to see, I suppose, was something relating this to software engineering and design.
I thought the exact same thing when I heard about their ambitious "new" project. Oh well, we've come a long way in the past 10 years, right?
Open the luggage bay doors, Hal...
..and I'm so worried about the baggage retrieval system they've got at Heath Row...
I'm just glad to see the return of gratuitous self-linking. You've written gazillions of little posts over the years -- the dozens of links, internal or not, are what make this site so easy to get absorbed in.
Also, anyone who doesn't get the point of this article IS the point of this article.
Actually it sounds like they put too much faith in the technology working, and didn't put enough into the human aspects of interacting with the technology or just in moving between jobs. They should have been fully-staffed dry-running the new terminal for at least a month - dispatching fake aircraft and moving real bags back and forth, with the real staff doing the jobs - and pushing back the 'release date' every time the tests failed. That would have caught:
- Inability to park
- Inability to get into the building
- Knowing where to pick up baggage from for a given flight
What they actually did was give existing handlers from Terminal 1 five days' training on the new system, but not really using it in practice, then told them to turn up at T5 instead of T1 last Tuesday morning. Result, everyone tries to park in the same space (and circumnavigating Heathrow to park somewhere else is really hard as the roads jam up completely with even the slightest incident on the M4) and no-one can get to work on time.
There's a reason that West End and Broadway shows have many days (weeks, months) of rehearsal before opening, including a couple of dress rehearsals and technical rehearsals so that the costume, props, sound and lighting people all know where they have to be, what they have to have ready at what time (including which performers need to be waiting in the wings), and what their cues are. It's why the military do live-fire exercises. It's so that when the time comes to put on the performance, or go into battle, it's just like the last time you did it, only this time you have an audience.
Actually getting a new system into the hands of the users can be a real problem, which is another reason why the 'big bang' approach to deployment is such a bad idea.
As I understand it, the baggage handling system worked rather more efficiently than the baggage handlers could cope with (mainly for the reasons in Mike's post above), resulting in baggage backing up to such an extent that the whole system became unworkable. I believe they are currently trying to clear a backlog of 15,000 pieces of baggage!
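The arithmetic behind a backlog like that can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the function name and the hourly rates are hypothetical, not figures from T5. The point is simply that whenever bags arrive faster than handlers can clear them, the pile grows by the difference every hour.

```python
def backlog_over_time(arrival_rate, handling_rate, hours, start=0):
    """Return the backlog (in bags) at the end of each hour.

    arrival_rate and handling_rate are bags per hour. Both are
    hypothetical inputs, not measured T5 figures.
    """
    backlog = start
    history = []
    for _ in range(hours):
        # The backlog grows by the shortfall and can never go below zero.
        backlog = max(0, backlog + arrival_rate - handling_rate)
        history.append(backlog)
    return history


# Illustrative numbers only: the conveyors deliver 1,000 bags an hour,
# but the handlers can clear just 750.
print(backlog_over_time(arrival_rate=1000, handling_rate=750, hours=4))
# prints [250, 500, 750, 1000]
```

Under these made-up rates the system never catches up on its own; the backlog only shrinks once the handling rate exceeds the arrival rate.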
Just FYI, the text doesn't appear bold to me under Firefox (v 18.104.22.168).
The problem is the real world.
Computer systems that have to interact with physical entities in the physical world have complexity caused by the unpredictability of the stuff that they deal with. And baggage systems have to cope with real physical stuff.
Imagine a document management system with a workflow. The paths that the document can take through the workflow are a simple permutations and combinations problem based on easily predictable inputs and outcomes. So it is easy to code.
A suitcase can slide off the conveyor, can get trapped and cause a logjam. One suitcase can land on top of another and go round the system in tandem. A suitcase can fall off the conveyor and be thrown on to the wrong conveyor by a handler. Suitcases can do all sorts of stuff that is difficult to imagine in the office. This makes it difficult to code for all eventualities.
In general, writing software for systems that involve handling physical stuff is inherently much, much harder than writing software for more abstract stuff.
This by no means absolves the Terminal 5 fiasco, which was predictable and could have been avoided by sensible testing and a dress rehearsal (good analogy, Mike) prior to go-live.
Interesting post, this is huge news over here (London) at the moment, in the paper every day. In fact it's such big news, you'd think a commercial airliner or two had crashed into a building killing upwards of 3,000 people.
No, just delays because of baggage handling issues.
I do, however, fully expect all problems to be sorted out quickly and quietly, and we'll be back to front page articles of Kate Moss's latest escapades with a magical nostril clogging white powder.
"controlled by 140 computers"
Er... that'll be the problem, right there - next time, it might be an idea to have it controlled by ONE computer!
I would have built a scaled down version of the terminal using trained mice as baggage handlers and marshmallows as bags then run it into the ground for at least 3 months....
DIA was a failure before it even started. Blame that on politics and a typical government project management failure. DIA is a huge, beautiful and fully-functional airport, but it's still just an airport - not the modern, state-of-the-art, automated, "all weather" facility we were promised. Blame that solely on the well-greased politicians, not on a "software project failure". Cut the programmer community some slack, Jeff. :)
People working in the new terminal could not park
People working in the new terminal had never been there before and got lost
People working in the new terminal had never seen the actual systems and so failed to work them properly
Some people had had little or no training ....(because it was not thought necessary)
They seem to have forgotten that People are part of the system and you need to test the system *with the people* who are actually going to use it
"Computer systems that have to interact with a physical entities in the physical world have complexity caused by the unpredictability of the stuff that they deal with."
This is so true - I work on a Warehouse Management System that tracks stock in, around, and out of a warehouse. There's just no way to code around what the users physically do with an item. If there's a db timeout, I can tell them to put it back, but if they don't read the screen, or put it back in the wrong place, things can get messy.
Humans are, by their very nature, not like computers. They don't think like them and their ability to learn is what makes them different. People usually feel constrained when working with computer systems/terminals because they almost always know a "better/faster" way to accomplish things. People have to change how they work to accommodate the machine, whereas really the machine should facilitate the desires of the people.
"Blame that solely on the well-greased politicians, not on a "software project failure". Cut the programmer community some slack, Jeff. :)"
The post isn't necessarily pointed at the programmers, I wouldn't say (Jeff can correct me if I'm wrong). The point of the post as I read it is that the designers and producers of a system failed to apply the rules while they were doing so, and that this is a scenario often paralleled in the software development world.
You see it all over, it's just that Heathrow's T5 shambles is big(ish) news.
you need to test the system *with the people* who are actually going to use it
And just how were the bags in T1 supposed to be handled whilst everyone was training over in T5? You can't make baggage handlers pull a double shift just to learn the new systems (they'd strike just at the mere mention of it). That's the problem with Mike's analogy too - the actors aren't still performing their previous show at the same time as dress rehearsals for the new one - they've got time to adjust.
I don't see any way that full-scale testing could be done in this situation.
It always amazes me how people refuse to learn from the mistakes of others. Automated sorting and delivery systems are nothing new. They have been around for years. FEDEX and UPS do it daily. I even worked on similar systems for newspapers.
Any system deployment for a complex problem can degenerate into this mess. I hope they let the world know what really went wrong, so we can all learn. I read that "The problems appear to be due to a combination of factors" (BBC). While this is pretty obvious for a complex system, I still don't understand why they jump-started the new terminal with such a big load on the very first day. They even persisted in doing so for a few days, and only stopped when the undelivered baggage had reached 15,000 pieces. Why not increase the load more gradually? Handling just a couple of flights during the morning of the first day should be good enough to catch many errors. Why the rush? If the UK is not too dissimilar to other parts of the world, I suspect downward pressure from high-level management and upward pressure from the natural limitations of a complex system. I sympathize with all the sleepless engineers and technicians in the middle.
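One way to picture "increase the load more gradually" is as a simple ramp rule. The Python sketch below is purely hypothetical: the function name, the 1% error threshold, and the 10% step are all assumptions made for illustration, not anything BA or BAA is known to have used. The new system takes on a bigger share of flights only while its observed error rate stays low, and backs off when it does not.

```python
def next_share(current_share, error_rate, max_error_rate=0.01, step=0.1):
    """Decide what share of flights the new system should handle next.

    All thresholds here are hypothetical. The share grows by `step` only
    while the observed error rate is at or below `max_error_rate`;
    otherwise it is halved so problems can be fixed under a lighter load.
    """
    if error_rate <= max_error_rate:
        return min(1.0, round(current_share + step, 2))
    return round(current_share / 2, 2)


# Illustrative run: start with 10% of flights and react to the
# (invented) share of mishandled bags observed in each period.
share = 0.1
for observed_error_rate in [0.0, 0.002, 0.05, 0.0]:
    share = next_share(share, observed_error_rate)
    print(share)
```

The design choice worth noting is that the ramp is driven by measured results rather than by a calendar date, which is the opposite of a fixed "big bang" opening day.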
Another problem with Denver's airport (and possibly Heathrow as well) is that people were actively sabotaging the machines. Why? Because if it worked, the machine would have put many baggage handlers out of work.
People do respond to incentives.
Speaking to the earlier point about theatre, as someone with actually more of a background in that than in technology...
I often say that being a theatre practitioner has given me a good appreciation for deadlines: Deadlines in theatre are absolute. You absolutely cannot go on stage on opening night and tell people oh we're sorry, we aren't ready yet, please come back in a week or so. The day you say you're going to open is the day the curtain opens at 8:00 PM.
However, there are two major differences to keep in mind between a show and, for example, a baggage handling facility (or any large, complex technology system like the ones I've worked on):
For one thing, a whole lot of the process of doing a show is human, and those humans are very used to thinking on their feet and dealing with things on the fly when they get messed up. I've seen people skip whole acts in the script, fail to bring out a gun that is used to shoot the person that the murder mystery is based on, break limbs onstage, whatever, and the show still goes on. Just another story to tell in the pub. (Lesson: Build enough slack in the process that things can go wrong and you can still complete the task)
The other thing is that with very few exceptions, the audience doesn't care because they are on the one side not very observant, on the other side they don't know how you planned on doing it to begin with so they don't notice if things are different from plans, and on the third side, if the rest of the show is good, they are extremely forgiving of a flub in act one. (Lesson: make everything as slick as possible, so that when things inevitably go wrong, people are ok because the rest of your process makes up for it)
In Madrid we recently experienced the disaster of Barajas T4. For weeks, maybe months, it was chaos. It seems that it mostly works now, though. I've read that a Spanish company is behind the London system, so I would be cautious because: if it's the same system that was implemented in Madrid, then it has been tested... somehow :-) and it will eventually work :-D If I understood it correctly, Denver had to be dismantled.
The thing about airports is that they suck and they always will suck.
Are baggage handling systems an eternal curse?
Not if they're done right. I used to work for a company that had a division that did airport baggage handling systems, run by (depending on your point of view) either an extremely careful or an extremely anal-retentive project lead. His team spent more than six months not writing a single line of code but building a complete architectural blueprint for the system and simulating it in full detail. It was almost like watching a software engineering textbook come to life (although the developers didn't like it much because they couldn't leap in and hack out code).
Their work came in on time and under budget, and worked perfectly at another international airport at about the same time when Denver was melting down. The subsequent flood of orders from other airports almost overwhelmed them (I think they went through a complex series of mergers with other companies, I'm not sure of the quality of their current product). It's a pity it's not documented anywhere since it'd make a (rare) example of someone doing what you read about in SE textbooks but rarely see in practice.
I always tell my clients to start small, but think big. You introduce functionality bit-by-bit, but with the idea that you can include new functionality and features as you work out bugs and unforeseen glitches.
FedEx and UPS do run major airport facilities that process packages, and those are much more complex than anything any airport handles. They do it efficiently and quickly. Every day, hundreds of planes fly to Memphis to the FedEx facility where over a half million packages each hour are taken in one end, processed and spewed out the other to hundreds of outgoing flights. It is an amazing sight. However, both UPS and FedEx started off on much smaller scales. 100 years ago, UPS was just a local delivery company delivering store packages. They added functionality and complexity a bit at a time until their sorting facilities could do some amazing stuff.
The problem both Denver and London had was the idea that they could put everything together in one big bang. And, when that didn't work, they had to scrap everything. What would have happened if they put together a new automated system piece by piece? Maybe the system could first handle inter-airline transfers, then once everything gets worked out, add handling airline-by-airline until the whole system works.
Whenever I hear of a major project that will change everything and be implemented all at once on a grand scale, I tell myself there's a project doomed for failure. Remember that the Internet itself started out as a network of only 36 nodes and there was no such thing as email, webpages, or blogs. You start out small, get the basics working, and then scale up.
Ben: "And just how were the bags in T1 supposed to be handled whilst everyone was training over in T5?"
I'm not familiar enough with the problem to know, but I'd presume that they didn't close T1 when T5 opened, did they? So therefore they had handlers at both terminals simultaneously.
Why couldn't this have been done for testing as well?
The customers who participated in the trials said it was flaky. The staff who worked on it said it needed more work.
The software provider is based in Canada (when there are literally thousands of competent local software companies in that area). This software is used in a different context by a friend of mine who has zero confidence in it - a little hunting on the net could probably have found this out.
The management went live anyway because they didn't believe their punters or their staff.
Pointy-haired wigs for anyone?
"But how do you simulate 15,000 pieces of luggage that each weigh 30 - 70 lbs?"
Easy, just use the huge amounts of baggage that are already sitting in a warehouse at Heathrow because of previous baggage handling failures, making it impossible to identify their owners.
It's hilarious - they're even asking people flying from T5 to travel without any luggage "if possible"... Yeah, like we're going to fly from London to Sydney for a couple of weeks without any luggage...
Thanks for mentioning Denver in '94 Jeff - as a UK citizen I'm well aware of the T5 problem, but had no idea that there was a previous similar problem elsewhere... should make for some interesting conversations!
Make Small Changes.
Big changes are a very bad idea. That's why Agile methods are good. That's why big code rewrites are bad. It's why T5 broke, why Google doesn't do big product launches, why Vista bombed, why the internet never fell over, why Britain does OK with an ancient and constitutionless legal system. Organic growth is by far the safest. Why?
The Law of Unintended Consequences always wins.
Smaller leap, less chance of falling foul of it. I like this about Agile - the iterations are smaller, the bugs easier to find and more recently created.
I recall one report from the Denver baggage fiasco, where an expert said the amazing thing wasn't that it didn't work; it was that the people who designed it ever thought it could work. We make models of the problem, and if it's big and complex we make a model that's simpler than reality. If our model doesn't get updated until the project rolls out, we're dead.
Root cause of the problem is you will always have idiots sitting in the pilot seat making all the critical decisions.
Someone from Ops should have been fired - anyone who can't test a system prior to going live when PEOPLE depend on it deserves to be fired.
"It's hilarious - they're even asking people flying from T5 to travel without any luggage "if possible"... Yeah, like we're going to fly from London to Sydney for a couple of weeks without any luggage..."
RWW, I think David came up with an answer:
"FedEx and UPS do run major airport facilities that processes packages that is much more complex than anything any airport handles. "
FedEx your bags to Australia.
"It always amazes me how people refuse to learn from the mistakes of others. Automated sorting and delivery systems are nothing new. They have been around for years. FEDEX and UPS do it daily. I even worked on similar systems for newspapers."
The thing with FEDEX and UPS is that most of the items they ship have a similar design. They have 6 flat sides, with the information on one of the two largest sides, or they are a big envelope. No matter what, you have a flat surface to read.
Just think of the last time you saw baggage at an airport and all the shapes and sizes that people used. There is no set shape for items, and even then there is no known location where the tag will be. Then with a suitcase you have the tag on the handle, which can be twisted around, folded, etc., which makes it hard to read.
I don't think you can blame this one on software - this is more of a hardware problem. We often forget the physical side of such systems, and the fact that they will never be as precise as the software that controls them. The hardware requires constant maintenance, just to keep the error rate low. Solenoids misfire, gate bearings gum up and jam, sensors get covered in dust and give false readings... even simple maintenance, like cleaning shampoo from someone's suitcase off the conveyors. The real-time controllers and other software are less than half the problem in these sorts of environments, so I hate to see them take the blame.
I wonder how many software developers it takes to say "Let's just try it with one airplane."
I'd say this belongs in the engineering/management project failure hall of fame. Software is but one component here.
Perhaps the mistake was to think of it as a "computer problem" rather than in terms of strategy, physics, and logistics first.
Ha, great timing Jeff. I'm flying to London tonight.
Why can't they slowly integrate the new system? It seems there shouldn't be a single day when they switch completely, but rather slowly start moving more and more portions of the old system to the new system, fixing problems as they go.
The look is interesting, though harder to read. I wouldn't keep it too long.
However if you are going for glory, see if you can change the cursor to an underbar.
Kevin: I'd presume that they didn't close T1 when T5 opened, did they?
Yes, they did, or rather they closed the British Airways checkin there. The BA operation at T1 moved over to T5 lock, stock, and barrel.
The next step is to bring the BA long haul operation at T4 up to T5...
I watched a documentary about the new terminal 5 a few nights before it opened. It showed them load testing the system with the maximum number of bags (real bags).
I don't know the full details of the failure, but I didn't get the impression it was a software failure (for once).
PS: The new black on green template is almost unreadable.
Add Sydney (YSSY) to the list - their new baggage system was the same. Jumbos taking off for Europe without a single bag on board...
Does anyone "on the outside" really know what went wrong? Software failure? Or also hardware failure? Human factor? None of the responsible persons will give details now because of legal reasons. So we can only guess (wrong) ...
Testing a baggage system is costly but as someone wrote here it had been done and what is communicated currently is that only the big "backlog" is a problem, not the new luggage. Is that true - we do not know (currently)
It's funny, I've worked on baggage control systems in the past and it's nothing like a web project.
The old PLCs (programmable logic controllers) are just one big if-then-else.
You first design the system and install it, and then you alter the software you made in simulation. You'll run into things like how the tag printers will print tags that can't be read when they're low on ink. We had one where we needed to add additional barcode scanners to get better read rates.
Start with a human backup; these systems usually have a manual encoding station to prevent the system from not knowing what to do with the bags.
The tweaks made to the system after it's in use are what actually make it usable; you don't expect a cutting-edge system to actually work, which is sad.
@Erik: I heard a rumour, after posting my initial thoughts, that actually BAA, the airport owner, *had* been dry-running the system for months and had some very experienced handlers by the end of the testing.
Then they handed over to BA who fired all the people who'd been doing the testing and brought over their own staff from Terminal 1 with minimal training. Result, no-one knew what they were doing.
Dual-running was inevitably going to lead to some people losing their jobs - ultimately they'd be overstaffed - but they could have mixed teams of experienced and inexperienced people, brought more people over from T1 when the first ones were confident, and repeated the process as the other flights moved across.
I do wonder a little whether unions were involved.
With DIA, the engineers who had prior experience building these baggage handling systems said it would take 4 years to build such a system. The mismanagers said "the airport opens in 2 years, so we promised them it would be built when the airport opens."
The DIA baggage handling system ended up going into operation 2 years after the airport opened. Exactly on the schedule that the engineers said it would take.
Was the DIA baggage handling system built "on time" or was it "2 years late?"
I worked on a consulting gig in Denver during that time period and, though it sounds like I was a statistical anomaly, I had a great experience with the automated baggage handling system. I flew in every Monday morning for about 3 months. I got off my plane, jumped on the train to the baggage terminal, rode up the long escalator, rounded the corner and I could literally stick my hand out and my bags would be just emerging onto the conveyor belt where I could grab them. The guys I traveled with thought I was full of it (which I usually am) until they went with me and saw it for themselves. I guess the sun shines on a dog's ass every once in a while as they say, but in one man's view it wasn't a *complete* failure... On the other hand, it's not like the Therac failure of the 1980's, but the same underlying principles were at work...Canadians. ;)
The thing about airports is that they suck and they always will suck.
At least the jet engines will... ;-)
So, there was virtually no load-testing on these things?
Fortunately for web applications, load can be simulated. But how do you simulate 15,000 pieces of luggage that each weigh 30 - 70 lbs?
"The fiasco is rapidly turning into a national humiliation ..."
""It's a national disgrace and a national humiliation," [Liberal Democrat MP Alistair Carmichael] said."
Even if the airport terminal personnel are great people, you need to think about the whole process. People are part of the system too, even the customers who try to keep up with everything that is going on.
Well, they did practise the process, but there were glitches. Looks like the glitches were not fixed after practicing.
The thing about software is that it is really easy to do something complicated. The bit where the engineering comes in is that it is really difficult to find all the risky edge conditions, handle failure gracefully AND scale up to do lots of even that one thing.
If we didn't have hubris we wouldn't do anything exciting and we wouldn't progress.
Re Mike Dimick's comment: yes, all good risk-reduction measures, but could it actually be done? They can't move all their staff to the test (or T4 dies). They can't get a proper, full test with new staff (costs too much to train, and new hands may not work the system the same way as old hands). So they'd have to test with a fraction of the old staff at a fraction of capacity and hope the result scales.
I wonder: is this new system so centralized that you can't test a small bit and scale the results? E.g., a few high-capacity conveyors rather than many small ones.
The contractors who submitted reasonable estimates of cost/time probably didn't win the contract.
The project was D.O.A., not after the deadline was missed, but at the moment the bid was accepted.
I keep my monitor brightness setting so low that I cannot read anything from Jeff's comment. Though it looks like it is so dark that I cannot read it with the brightness turned up either.
It is difficult to switch from site to site, because some sites are mostly white (I turn brightness down) and some sites are mostly dark (I turn brightness up). Also on YouTube they have many dark videos, but the site is white. I once tried to get an addon that fixes it, but it was hard to install and I didn't want to waste much time on that. They should have a "switch to black" feature there.