June 21, 2010
Remember that Scene in Star Trek IV where Scotty tried to use a Mac Plus?
Using a mouse or keyboard to control a computer? Don't be silly. In the future, clearly there's only one way computers will be controlled: by speaking to them.
There's only one teeny-tiny problem with this magical future world of computers we control with our voices.
It doesn't work.
Despite ridiculous, order-of-magnitude increases in computing power over the last decade, we can't figure out how to get speech recognition accuracy above 80% -- when baseline human transcription accuracy is anywhere from 96% to 98%!
In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying "one" into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space.
As Robert Fortner explained in Rest in Peas: The Unrecognized Death of Speech Recognition, after all these years, we're desperately far away from any sort of universal speech recognition that's useful or practical.
Now, we do have to clarify that we're talking about universal recognition: saying anything to a computer, and having it reliably convert that into a valid, accurate text representation. When you constrain the voice input to a more limited vocabulary -- say, just numbers, or only the names that happen to be in your telephone's address book -- it's not unreasonable to expect a high level of accuracy. I tend to think of this as "voice control" rather than "voice recognition".
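Those accuracy percentages are typically computed as word error rate: the word-level edit distance between the recognizer's output and a reference transcript, divided by the reference length. A minimal sketch in Python (the sample phrases are just illustrations):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five: 20% word error rate, i.e. 80% "accuracy".
print(word_error_rate("search google for balloon boy",
                      "search google for blue boy"))  # → 0.2
```

Note that an 80%-accurate recognizer mangles roughly one word in five -- which is why machine transcripts of lectures read as badly as they do.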
Still, I think we're avoiding the real question: is voice control, even hypothetically perfect voice control, more effective than the lower tech alternatives? In my experience, speech is one of the least effective, inefficient forms of communicating with other human beings. By that, I mean ...
- typical spoken communication tends to be off-the-cuff and ad-hoc. Unless you're extremely disciplined, on average you will be unclear, rambling, and excessively verbose.
- people tend to hear about half of what you say at any given time. If you're lucky.
- spoken communication puts a highly disproportionate burden on the listener. Compare the time it takes to process a voicemail versus the time it takes to read an email.
I am by no means against talking with my fellow human beings. I have a very deep respect for those rare few who are great communicators in the challenging medium of conversational speech. Though we've all been trained literally from birth how to use our voices to communicate, voice communication remains filled with pitfalls and misunderstandings. Even in the best of conditions.
So why in the world -- outside of a disability -- would I want to extend the creaky, rickety old bridge of voice communication to controlling my computer? Isn't there a better way?
Robert's post contains some examples in the comments from voice control enthusiasts:
in addition to extremely accurate voice dictation, there are those really cool commands, like being able to say something like "search Google for Balloon Boy" or something like that and having it automatically open up your browser and enter the search term -- something like this is accomplished many times faster than a human could do it. Or, being able to total up a column of numbers in Microsoft Excel by saying simply "total this column" and seeing the results in a blink of an eye, literally.
That's funny, because I just fired up the Google app on my iPhone, said "balloon boy" into it, and got .. a search for "blue boy". I am not making this up. As for the Excel example, total which column? Let's assume you've dealt with the tricky problem of selecting what column you're talking about with only your voice. (I'm sorry, was it D5? B5?) Wouldn't it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column -- rather than methodically and tediously saying the words "sum this column" out loud?
I'm also trying to imagine a room full of people controlling their computers or phones using their voices. It's difficult enough to get work done in today's chatty work environments without the added burden of a floor full of people saying "zoom ... enhance" to their computers all day long. Wouldn't we all end up hoarse and deaf?
Let's look at another practical example -- YouTube's automatic speech recognition feature. I clicked through to the first UC Berkeley video with this feature, clicked the CC (closed caption) icon, and immediately got .. this.
"Light exerts force on matter". But according to Google's automatic speech recognition, it's "like the search for some matter". Unsurprisingly, it does not get better from there. You'd be way more confused than educated if you had to learn this lecture from the automatic transcription.
Back when Joel Spolsky and I had a podcast together, a helpful listener suggested using speech recognition to get a basic podcast transcript going. Everything I knew about voice recognition told me this wouldn't help, but harm. What's worse: transcribing everything by hand, from scratch -- or correcting every third or fourth word in an auto-generated machine transcript? Maybe it's just me, but the friction of the huge error rate inherent in the machine transcript seems far more intimidating than a blank slate human transcription. The humans may not be particularly efficient, but they all add value along the way -- collective human judgment can editorially improve the transcript, by removing all the duplication, repetition, and "ums" of a literal, by-the-book transcription.
In 2004, Mike Bliss composed a poem about voice recognition. He then read it to voice recognition software on his PC, and rewrote it as recognized.
a poem by Mike Bliss
like a baby, it listens
it can't discriminate
it tries to understand
it reflects what it thinks you say
it gets it wrong... sometimes
sometimes it gets it right.
One day it will grow up,
like a baby, it has potential
will it go to work?
will it turn to crime?
you look at it indulgently.
you can't help loving it, can you?
a poem by like myth
like a baby, it nuisance
it can't discriminate
it tries to oven
it reflects lot it things you say
it gets it run sometimes
sometimes it gets it right
won't day it will grow bop
Ninth a baby, it has provincial
will it both to look?
will it the two crime?
you move at it inevitably
you can't help loving it, cannot you?
The real punchline here is that Mike re-ran the experiment in 2008, and after 5 minutes of voice training, the voice recognition got all but 2 words of the original poem correct!
I suspect that's still not good enough in the face of the existing simpler alternatives. Remember handwriting recognition? It was all the rage in the era of the Apple Newton.
It wasn't as bad as Doonesbury made it out to be. I learned Palm's Graffiti handwriting recognition language and got fairly proficient with it. More than ten years later, you'd expect to see massively improved handwriting recognition of some sort in today's iPads and iPhones and iOthers, right? Well, maybe, if by "massively improved" you mean "nonexistent".
While it still surely has its niche uses, I personally don't miss handwriting recognition. Not even a little. And I can't help wondering if voice recognition will go the same way.
Posted by Jeff Atwood
Speech increases human effort, and people are too lazy to speak. My bet is on 3D gesture recognition. Imagine I see a new word on screen, draw an imaginary circle around it with my index finger, and drag and drop it (in 3D, without touch) to a new tab -- something I can do in a second, and it feels powerful.
Maybe we need hand-waving computer control like in Minority Report, and offices full of people waving their hands instead. That would be more fun to watch.
You forgot the obligatory link to the Bill-Gates-On-Voice-Recognition-Memorial-Page: http://mpt.net.nz/archive/2005/12/30/gates
According to him, it _will_ come to a PC near you in two to three years -- and has been since 1997.
In the USA, isn't voice control already popular in IVR systems?
Over here in the UK, we mainly still use push-button IVRs ("For accounts, press 2"). I'm not sure why. It might be partially due to cultural issues, and perhaps partially due to the wide range of different accents within a relatively small population.
I remember graffiti on my Palm IIIx - I could write pretty fast. Even so, the onscreen keyboard on my Samsung Tocco Lite is much easier, even if I do feel a bit fat-fingered on it sometimes.
The handwriting recognition on my dad's old Windows tablet was pretty decent though. Also, when I get voicemail by Google voice, the speech recognition is pretty nice - except when the voicemail is in a different language. Then it's just funny.
I'm totally with you on the idea that it wouldn't be that useful, even if it did work. I think folks get excited about the idea because it seems very natural, ergo it must be easier than the current forms of control we have.
But it depends on the domain. Would you want to control a car's steering via voice recognition? I doubt it. Speech recognition for cars would be great if you could get in and say “Go to work”, and it drove you to work, but the magic there is automatic driving: a “go to work” button wouldn't detract from the experience one bit, other than making the demo seem less magical.
I'm much more excited about reducing and removing interfaces, rather than putting a boil-the-ocean amount of work into making them different.
Voice-recognition does seem to work pretty great for booking movie tickets over the phone though.
The voice recognition built into the Android operating system does a very good job, and actually has a purpose -- it allows you to configure destinations for sat nav and search the web without having to fiddle around with a tiny soft keyboard.
It would be interesting to know whether this tech achieves better than 80% recognition. I'm pretty sure it exceeds that, going by my own results (English Home Counties [read 'posh'] accent).
"spoken communication puts a highly disproportionate burden on the listener."
I fully agree!
What is funny is that you mention podcast later...
Personally, I avoid these numerous and surprisingly popular video tutorials or podcasts, partly because I am much better at reading written English than at understanding spoken English (depending on accent, too), because I am French.
And mostly because most of the video tutorials I saw are excruciatingly sloooow, we wait for the cursor to slowly move and hesitate to a menu, move elsewhere, etc. And I don't have sound at work anyway.
It is much faster to read a tutorial on a Web page, to print it out to read on the public transport, to skip some parts I already know, etc.
I understand the interest of video in some fields (eg. a demonstration of a manipulation in an image editor), much less in other fields (typing code in an IDE!).
But perhaps I am too old fashioned... :-)
Beware of falling into the mindset that if some form of input is worse overall than another (voice vs. keyboard/mouse), it is useless out of hand. That one-or-the-other mindset often leads to people taking sides and arguing over which is better. Instead of thinking about how voice control would replace our current ways of using computers, try thinking of how it can enhance our experience.
Of course it would be ridiculous imagining a workplace where everyone is giving basic commands to their computers. Shared spaces like that must always be considerate of how they affect others. But at home, the consideration is far less. Lying in bed, who wouldn't love to just say, Star Trek style, "Computer, lights off"?
One other thing I'd like to point out, that would give a very simple insight into how voice can enhance our experience, is that while reading is far quicker than listening, saying something is far quicker than typing, and often quicker than locating and pressing a button on a particular screen.
Even a human typist needs to have the text _dictated_ to him, which isn't the same thing as plain talk. To be able to actually talk to a computer, it'll need artificial intelligence, knowledge about you and the context of your conversation, and probably also a camera with software that can understand body language.
A camera also helps to solve the problem of computer knowing when you speak to it, as opposed to other people/computers.
I confess I'm not entirely convinced by the idea that it's still uselessly terrible. Command interpretation - and, more importantly, comprehension - as seen in SF remains beyond us I suspect, but the last time I tried dictation software (Dragon NaturallySpeaking) I found that, with some training of the software and some practice on my part to avoid umms and aahs, it was of comparable accuracy and speed to general typing. In 2000, on a Pentium II.
I suspect there is a greater problem, though. While we work in noisy, shared environments, or use our home computers with others around while we're watching TV or listening to music, dictation as our primary means of input is a fundamentally flawed concept. We would have to vocalise our trains of thought to all and sundry, and neither we nor they are likely to be very keen on that. I suspect it'll end up as another form of assistive technology, like screen readers for the blind at present.
Interesting topic, indeed. Speech recognition, no matter how good it might become, will never work for all aspects of human-computer-interaction. Very useful for physically challenged people for sure, but your sample with the room full of people trying to control their computers is a good example why it won't work in practice. It is already disturbing enough when somebody next to me starts talking out loud and I wonder if he's talking to me, just to find out, he's using his bluetooth headset for a phone call.
Another problem is that human beings can understand "context" and "situations", and computers cannot. If there are 10 people around me and I start talking, people will know whether I'm talking to one of them, to a group of them, or to someone else. They will know from where I'm looking, from whom my eyes are focused on, or from the context of what I say. How can a computer tell whether something is a command for it or talk directed at my coworker? When I say to my coworker "Just go to Google and search for ...", I don't want my computer to do this -- but how is it supposed to know?
My biggest gripe with computers is mice. I use a trackball, which I consider much better, but still not perfect. Touchscreens, touchpads? Don't like them. Keyboards with sensitive surfaces and gesture recognition? Don't like those either, because they are basically touchpads. My dream is a normal keyboard with normal keys for typing, where I can just lift my hands a bit and make a gesture in the air and have the computer understand it -- so I never have to move my hands far from the keyboard just to move a window to the left, make it bigger, or open a menu. Sure, you could do all this with keyboard shortcuts, but that is not as effective as using a mouse, at least not with the operating systems I have to work with.
I so disagree!
>>Wouldn't it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column -- rather than methodically and tediously saying the words "sum this column" out loud?
You are trying to eat soup with a fork!
If you could say "Can you please sum column D15 and place the results after the last populated cell, and could you also save the worksheet for me after that" - Then YES, it would be faster than clicking (especially for non-IT folks).
Or if you are filling out a classic user profile page and could simply blurt out your address without having to carefully break it up into Street/Zip/Country, it would be delightful! Of course, if you are going to have to say, Postcode IS xxx, Country IS xxx, then NO, that is TEDIOUS as you point out.
When voice recognition software and hardware mature and allow us to speak as fluently as we do in day-to-day life, THEN voice recognition (and not just control) will definitely be more effective than the lower-tech alternatives.
Just because we do not know how to MAKE IT does not mean we do not know how to USE IT.
Funny that the poem line "it gets it wrong... sometimes" failed while "sometimes it gets it right" is recognized successfully...
* typical spoken communication tends to be off-the-cuff and ad-hoc. Unless you're extremely disciplined, on average you will be unclear, rambling, and excessively verbose.
* spoken communication puts a highly disproportionate burden on the listener. Compare the time it takes to process a voicemail versus the time it takes to read an email.
Carefully composed writing is a lost art in many parts, lots of people write like they speak.
The reason that Palm Graffiti worked is that it used the limited-vocabulary approach. However, I think that for handwriting recognition, this is probably the best approach. Make the user learn a new alphabet, in which all the ambiguity is removed, and which can actually speed up writing, because you can make the letters simpler. It's much easier to write a letter T or A in Graffiti (http://en.wikipedia.org/wiki/Graffiti_%28Palm_OS%29), because they are reduced to a single pen stroke. My only problem with the Palm was the lack of friction, which gave it an unnatural feeling, quite different from writing with a pen on paper.
"In my experience, speech is one of the least effective, inefficient forms of communicating with other human beings."
Jeff, I respect your experience, but I completely disagree with your implied conclusion. Yes, speech is inexact, and people can ramble, but the density of information sent back and forth via the human voice isn't restricted to only the words spoken. Tons of subtle cues that we take for granted in day-to-day communication - stress, tone, intonation, pauses, etc - alter meanings and advance conversation.
This all gets stripped in other forms of communication, for example email. I cannot tell you the number of times a work issue kept getting batted back and forth over email, and a quick phone call ultimately resolved the misunderstanding in 30 seconds. My rule of thumb now: if more than 4 round trips occur on a given issue, I pick up the phone and call the person.
Which is easier for a non-technical person to understand? "Tell your computer to open the browser," or "Click the little icon in the lower left of your screen, open the Programs menu, and find Firefox somewhere in that huge list"?
Now, it's true that we'd need really good AI in order to interact with a computer naturally. But I think we could improve the user experience if we could create custom commands with voice recognition.
For example, you walk into your office. As you walk across the room to your desk you say, "Computer, open project control the world." Then, as you are sitting down, your computer automatically opens up your world domination plans exactly where you left them. It could even have a voice print checking feature. And with cameras, facial recognition. Essentially, just shortcuts with voice commands.
So, it'll be a while before we can interact with computers like they do on Star Trek. But we could have some useful features today. I wonder why we don't... For the shortcuts with voice commands, we'd just key it to a recording, so it should be possible to ignore the error rate.
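The shortcut idea above doesn't need real language understanding: the system only has to find the nearest phrase in a small fixed command set. A toy sketch, assuming the recognizer's (possibly garbled) text output is already in hand -- the command names and actions are made up:

```python
import difflib

# A tiny fixed command set, mapping spoken phrases to (hypothetical) actions.
COMMANDS = {
    "open project control the world": "open world domination plans",
    "lights off": "turn off the lights",
    "total this column": "sum the selected column",
}

def dispatch(heard, cutoff=0.6):
    """Fuzzy-match noisy recognizer output against the command set.

    Returns the action for the closest command, or None if nothing
    clears the similarity cutoff.
    """
    match = difflib.get_close_matches(heard, list(COMMANDS), n=1, cutoff=cutoff)
    return COMMANDS[match[0]] if match else None

# Even a slightly garbled phrase lands on the right command...
print(dispatch("open project control the word"))  # → open world domination plans
# ...while unrelated babble is safely ignored.
print(dispatch("xyzzy"))                          # → None
```

This is exactly the "voice control, not voice recognition" distinction from the post: a closed vocabulary turns an 80%-accurate recognizer into a near-100%-reliable switch.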
When you were researching this, what did you find out that people have been doing to try and solve the problems?
The reason speech recognition doesn't work is the same reason we don't have automated generation of novel software, automated debugging, the semantic web or any of a number of other software tools - because they all require approximately-human-level AI to provide the necessary judgement, context, extrapolation, fuzzy logic and culture.
I mean, even if I misheard you when you said "balloon boy", I could still make a judgement, based on recent cultural events, that you meant to say "balloon boy". But that takes an incredible amount of processing power to make such a conceptual leap. And computers just aren't up to the task yet.
The good news - the human-level-AI problem is solvable, it's just really, really hard. We will solve it someday, and it will render a huge swath of our day-to-day tedium automate-able.
But today is not that day, tomorrow isn't looking so hot either.
"[I]s voice control, even hypothetically perfect voice control, more effective than the lower tech alternatives?"
Depends on the situation, I guess. You astutely pointed out that people still regularly leave voicemail, but it takes SO LONG to listen to the actual voicemail. (well, compared to reading the same message as Text.)
I presume, by lower tech, you mean pushing a button or key on your keyboard. I think voice recognition *can* be useful. The ideal situation is those automated telephone menu systems. But I can also imagine it being incredibly useful in hands-free operation of a phone.
Enunciation and Projection
Most people speak sloppily. They don't enunciate or pronounce their words properly. Nor do they know how to project their voice so that it is consistently clear. Our evolved brains use a lot of pattern analysis, interpolation, and contextual clues to fill in the gaps. While some speech recognition programs have some of this, they certainly don't have the comprehensive suite of these techniques that our brains have both learned and evolved over the years. It will take a while to actually figure out how they work in an algorithmic sense.
Back to my assertion about people speaking poorly. I once installed Dragon Naturally Speaking for a customer of mine who had a very outgoing demeanor, so in telephone conversation and in board-room meetings, people always could hear him and understand what he was saying. Yet, when he tried this speech recognition program, it clocked in at around 90% recognition because he wasn't used to talking to a computer. Oddly enough, when a phone call came in while I was there setting it up for him, he (mentally) switched over to his proper mode of speaking (still wearing his head-set that came with the program) and recognition shot up to 99+% because he was speaking clearly and enunciating his words properly.
Once he got off the phone, it dropped back to 90%. It was quite funny, but clearly illustrates that most people don't typically understand how to talk. (well, at least, how to interact with an emotionless computer.)
And well, I guess that's the real part of voice communication that we're not talking about here:
HOW you say something is just as important as WHAT you say.
Voice transmissions have a sub-channel that carry emotional information. Computers, currently, are completely unable to even detect and process this information. I'll concede that it is generally irrelevant in speech-to-text processing, but it's still part of the contextual clues that we, as humans, use to fill in the gaps of what is being said.
I think it's telling that you choose to use examples that are, in almost every case, more than 5 years old when discussing the utter failure of voice recognition. Run your own test and see, rather than just report on others' reports of failure. Spend 5-10 minutes training the speech recognition in Windows 7 and then try dictating your blog post. I suspect that you'll find, if you do an honest attempt rather than deliberately trying to foul it up, that the recognition quality is very good.
I should also mention here that your thesis is a little confused. Are you attacking voice recognition (which really means identifying a voice), speaker-independent speech recognition (what Star Trek led you to dream of) or speaker-dependent speech recognition (which is what Windows 7 or Dragon Dictate offer)? Voice recognition software is actually quite reliable these days, although most of us have no use for it. Speaker-independent speech recognition is what really sucks, unless you constrain it to a very limited input set such as numbers for telephone IVR systems.
Ultimately, I suspect that if you had pain when operating a keyboard, you'd start finding alternatives like speech recognition a far more worthwhile investment of your time and energy - and then you'd be pleasantly surprised by how effective it can be.
I'm a coder, and what I need as input is fast recognition of brain motor functions, not voice recognition.
Handwriting consumes one hand and inhibits you from using the other, which directly influences your efficiency... Yes, if you can type faster than handwrite (which most people can) obviously handwriting recognition is pointless.
Speaking with your voice does not consume either of your hands. The key to understanding the potential of voice recognition is to avoid assuming that it will be the single interface to the computer. You can keep working with the mouse/keyboard (or whatever hand-focused interface exists at that time) and, if you want to do something that would be faster if you didn't stop your hands from what they were doing, you simply talk to it.
However, people can speak simple commands faster than they type and faster than they move a mouse, since general-purpose computers usually have so much they can do that everything becomes buried in the UI, even the simple things. The comment that David Reagan made above exemplifies this point wonderfully.
"Wouldn't it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column -- rather than methodically and tediously saying the words "sum this column" out loud?" -- I'm sure something similar was once said about toolbar buttons ("wouldn't it be easier to open the File menu and click Save instead of searching for the tiny little save icon amongst all those buttons up there?") but we learned pretty quickly that you get used to it and don't really need to "look" for the button, despite how many buttons there are.
"I suspect that's still not good enough in the face of the existing simpler alternatives." -- No doubt StackOverflow was not precisely what you wanted it to be when it began. If you had said, "Oh, this isn't absolutely, mind-blowingly spectacular today, so I think I'll just give up on it," where do you think we would be today?
And finally... "In 2004, Mike Bliss composed a poem about voice recognition. [...] The real punchline here is that Mike re-ran the experiment in 2008, and after 5 minutes of voice training, the voice recognition got all but 2 words of the original poem correct!" -- If this is not a direct and blatantly obvious display of the improvement of voice recognition software, I don't know what is. Yes, it took 5 minutes of voice training. Now he doesn't need to do that voice training anymore - it was 5 minutes, once. And if we don't just give up on voice recognition now, then eventually that will be 1 minute of training, and maybe someday the AI won't need training and will change its expectations as it gets to know you better.
The point is... There are places where voice recognition helps (as a third interface to the machine: hand/hand/voice) and there are places where it simply doesn't belong. Just like the computer itself. But if we aren't willing to understand that, then we'll never see what it can do. If we had decided that computers were a waste of space and energy when they were the size of rooms, where would we be today?
There are three different problems with very different error rates: recognizing anybody's structured speech, recognizing one person's unstructured speech, and recognizing anybody's unstructured speech.
If you can define a structure around your interactions with the recognizer (http://en.wikipedia.org/wiki/VoiceXML), you can cut out the list of possible matches, and really increase the confidence measure of the transcription. We're using this approach here: http://www.fidelus.com/locator2.html
This works today, and works well. We've done demos on a speakerphone at a noisy tradeshow booth with very few issues.
You are exactly right that recognizing anybody's *unstructured* speech is a very hard problem to solve. That's why commercial services still use people to do transcriptions if the confidence measure is below a certain threshold. Over time, these services learn the voice of the people that call you frequently.
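That structured approach can be sketched abstractly: restrict a hypothetical recognizer's n-best hypothesis list to the phrases a grammar allows, and fall back to a human transcriber when nothing clears the confidence threshold. All names and numbers here are illustrative, not any particular vendor's API:

```python
# A tiny IVR-style grammar: the only utterances the system will accept.
GRAMMAR = {"accounts": "2", "sales": "3", "support": "4"}

def route(n_best, threshold=0.8):
    """Pick the best in-grammar hypothesis from a recognizer's n-best list.

    n_best: list of (phrase, confidence) pairs.
    Returns the routed extension, or None to fall back to a human operator.
    """
    for phrase, confidence in sorted(n_best, key=lambda pair: -pair[1]):
        if phrase in GRAMMAR and confidence >= threshold:
            return GRAMMAR[phrase]
    return None

# An out-of-grammar top hypothesis is skipped in favour of a valid one.
print(route([("a counts", 0.90), ("accounts", 0.85)]))  # → 2
# Low-confidence junk falls through to a human.
print(route([("mumble", 0.50)]))                        # → None
```

Cutting the search space from "all of English" to a handful of phrases is what makes the confidence measure meaningful in the first place.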
I don't think Google uses human transcribers, which is why GVoice transcriptions are hilarious.
There are certainly places where I am very grateful to have voice control available to me today.
A big one is while I'm driving. My car is a Ford Focus with Ford's "Sync" system installed, and I can use it to control my bluetooth phone, and even my stereo, by voice. Pressing the voice prompt button and saying "call john smith at work" or "play artist The Beatles" seems to work with nearly 100% reliability (the exception being some of my phone contacts or music artists with very unusual names). Being able to do these sorts of things without taking my hands off the steering wheel or even looking at a screen is very nice.
Another place is the transcripts of voicemails provided by Google Voice on my Nexus One. Granted... the transcripts are occasionally TERRIBLE, and are rarely 100% accurate, as you've noted above. However, as an at-a-glance summary it helps me to determine how urgent the message is, and if I need to listen/respond to it right away. Consider the following two (made up) messages:
Josh, this is forgotten hospital. Your mop's been in an accident and you need to fling as soon as you can. Please boil 555-555-5555.
Hi Josh, this is Jane. Veal were considering seeing Boy Store Tree at the theater tonight and thought maybe wood like to come too? Please bet me now.
These are made-up examples, but they're pretty typical of the results that I get (actually Google's voicemail transcripts are often better than the above, but the worst English voicemails I get look something like these). Yeah, they're not accurate. Yeah, I'll probably listen to the message anyway. Still, I can learn a lot about the relative importance and "ignorability" of the message thanks to the unreliable transcript, and I find value in that.
So why in the world -- outside of a disability -- would I want to extend the creaky, rickety old bridge of voice communication to controlling my computer? Isn't there a better way?
Just because you can't imagine the benefits doesn't mean they don't exist.
Of course, only an idiot would argue that voice recognition will ever replace other methods of human-computer interaction. When we say that voice recognition would be useful, we don't mean that it is useful for everything. But there are plenty of scenarios where traditional forms of computer interaction just aren't appropriate.
Sure, desktop spreadsheets are easier to drive with a mouse and keyboard. They're designed that way. But when we're doing something else with our hands, it would be really useful to have machines do what we tell them.
Wouldn't it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent
What... when you're driving? I don't think so, Jeff.
I think you're right to bring up text input and small devices. Early texting on mobile telephones was fiddly and difficult to use, so people invented a short-hand where they didn't spell anything properly, used the word "loll" all the time and added a message terminator in the form of a smug, smiling, winking face.
So, the solution isn't to change the technology but the behaviour of the people using it. I imagine a voice recognition future where some new language is invented in order to leverage the technology - an unambiguous vocabulary enhanced by pops and whistles and that noise you can make by sticking your hand in your armpit.
Jeff Hawkins, who created Graffiti for Palm is co-author of a great book which touches deeply into why current computer technology fails so badly at things the human brain does well--like vision and speech recognition. The book is "On Intelligence" co-written with Sandra Blakeslee. In the book, Hawkins discusses how human (and other) brains use relatively simple decision making and vast access to memory and abstraction to interpret messy input like speech or vision. He makes a good argument about why current approaches to things like general speech recognition can't get beyond a certain hump and proffers a theoretical new approach based on human wetware instead of computer hardware. One of the best books I've ever read (okay, listened to) on the subject. More info at http://www.onintelligence.org/index.php
Why ever downgrade to v2 or v1?
Apropos of nothing, when did quoting break? All I see when poster X quotes poster Y and replies is a series of paragraphs formatted exactly the same. Seems to me I remember a few years back that quoted text was automatically italicized. Bug report?
Perhaps it's just that I have a fairly neutral "Midwest American" accent, but every time that I have used speech recognition since Microsoft's Speech API 4 in 1998, after training it has been 99% accurate for dictation and 100% accurate for command and control.
My wife, on the other hand, can't get speech recognition to work properly in any mode, even with training. Even though we are both from the same town about an hour away from Chicago, she can speak with a Chicago accent while I cannot.
Does anyone know how it works in different languages?
English, as opposed to other languages, has very irregular spelling, plus many quite different accents.
For example, words with the same pronunciation but different spelling:
weak, week; sent, cent; sun, son; bye, buy; sum, some; piece, peace; meat, meet; too, two; pears, pairs; weigh, way; rode, road; ...
How does it look in Spanish?
As someone who usability tests IVRs (automated phone systems) for a living, I have to say that speech recognition currently works very well, given the right domain. The same could be said for keyboards, mice, trackpads, touch screens, etc. What constitutes a niche depends on your perspective. You probably physically touch dozens of computers a day, of which only a few have a keyboard and a mouse/trackpad. Just today I've used a car, a Playstation 3 (a.k.a. Blu-Ray player), a digital watch/heart rate monitor, an alarm clock, a Palm Pre, a microwave oven, a VOIP phone, and a Mac Pro. Clearly the Mac is a niche in my daily experience.
The interesting thing about speech is that it's conversational. In our tests, we've found that for a good voice app, accuracy above about 75% doesn't improve the user experience. Why? A good app will prompt you to repeat, often with suggestions on how to improve accuracy. This isn't burdensome for the user, since it's the same thing that humans do when they can't understand what you say. Indeed, the target for a speech app shouldn't be perfect recognition, it should be to be within the ballpark of a human listener-- along with an error recovery script that's also as good as a human.
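The reprompt-and-recover loop this commenter describes can be sketched in a few lines. This is purely illustrative: the `recognize` callback, the prompt list, and the 0.75 confidence threshold are stand-ins for whatever a real IVR engine provides.

```python
def prompt_with_recovery(recognize, prompts, max_attempts=3):
    """Ask the caller; on a low-confidence result, reprompt with
    increasingly specific guidance -- the same thing a human
    listener does when they don't catch what you said."""
    for attempt in range(max_attempts):
        # Reuse the last (most specific) prompt if we run out of them.
        prompt = prompts[min(attempt, len(prompts) - 1)]
        text, confidence = recognize(prompt)
        if confidence >= 0.75:  # the "good enough" bar from the comment above
            return text
    return None  # fall back to a human operator or a touch-tone menu
```

The point of the sketch is that the app's quality lives in the recovery script, not in chasing the last few points of raw accuracy.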
The problem with your critique of speech recognition is that it's complaining that speech doesn't work where it's an inappropriate input mechanism. That's a tautology. I, for one, am glad I don't need a mouse and keyboard to operate my microwave oven, car, and digital watch. That doesn't make them bad technology, just inappropriate for the use case.
IMHO Henry Kuo makes a much better point than anything in the original post. It is indeed a trap to think that if one method is not best overall, it is simply deprecated. It is rare to see such a vision-less post from you.
You would not want to use voice rec for coding but then, an all touch interface wouldn't work best either. Does that mean it does not work on a phone? Clearly no. Handwriting recognition is not efficient for writing a full essay but using a tablet pc as my sketchbook and then being able to write the name of a particular drawing (so it is findable in a mass of scribbles) before moving on to the next is still one of the best computing experiences I have. It makes sketching both effortless and functional.
As mentioned by an earlier commenter, the voice rec in Win 7 is quite good. Probably good enough for a captain's log, in fact. I suspect it will be pretty good in Kinect too, though of course there'll be embarrassing videos aplenty because it is not mission-critical worthy; the "two incorrect words" is that difficult final mile. But it does bode well for the Star Trek future in which we simply tell background computers what we want to happen. If not "tea, Earl Grey," then at least basic tasks, in which the deprecated buttons like TV remotes and on/off switches slowly start to fall away.
The last 20 years taught us that imperfect software can still be useful (see any web app). And that some good ideas take a long time to become useful (see tablet computers).
Also, until recently, automatic translation was a joke, for similar reasons. No longer: http://translate.google.com/translate?u=http://www.asahi.com/&sl=ja&tl=en It still has a long way to go, but a computer currently translates Japanese better than most humans, and that can already be useful.
Ironically, even in Star Trek (TNG) they recognized that if you really needed to get something done quickly, the best thing to do was sit Data down at the console and have him type furiously at robot speeds. Make it so!
"In my experience, speech is one of the least effective, inefficient forms of communicating with other human beings."
You might be forgetting just how much information we can transmit through so few words, with things like tone, inflection, etc. The issue with language is that each message is not self-contained; on the contrary, it depends fundamentally on the information each party already has. So I think the problem here is not the inefficiency of language but rather the immense difficulty of digitally recreating a system of such complexity.
Of course, one reason for using speech recognition is to reduce the learning curve for computer use. But we've seen how this level of intuitiveness can be reached without speech recognition in software like the iPhone OS.
Jeff, I see you read and quoted from the Robert Fortner piece (or peas). Did you read the comments on that piece?
Several gentlemen actively working in NLP took serious issue with his claims about the field flat-lining, specifically criticizing the study that claimed 80% accuracy. I quote: "State of the art wide coverage parsers are currently sitting around 88-95% accuracy, not 80%, with >99% coverage (meaning a successful, though possibly incorrect, parse of 99% of unlabelled unrestricted text)."
Consider taking another look and possibly involving some additional sources - you have a widely-read blog, and it would be a shame to pass on misinformation.
See also: http://www.reddit.com/r/programming/comments/bzbdf/rest_in_peas_the_unrecognized_death_of_speech/
While accurate, speaker-independent voice recognition with no constraints on vocabulary or context is still a long way off, there have been enough improvements over the last 10-15 years to make speech input really genuinely useful in more specific scenarios.
I've been working on voice control for Windows Media Center for the past couple of years, and find it works very well indeed. The key benefits I've found are:
- The ability to choose a single musical artist, film, TV program, etc from a decent sized media collection that may include thousands of alternatives - without needing to drill down through menus.
- The ability to issue commands without interfering with onscreen action (e.g. changing the music while a slideshow is running)
- Instant access for things like jumping to a particular point in a movie ("skip to 47 minutes")
- An audio input device that only listens when users are issuing speech commands, rather than trying to make sense of all the random sounds it hears (we use an accelerometer to intelligently unmute the mic when needed)
- The alternative, for our target audience, is a normal remote control; most living room users don't have a keyboard & mouse conveniently to hand for controlling their TV experience
During our development, it became very clear that the single most important thing needed for good speech recognition is a high quality microphone system. Most PC mics are lousy for this (limited bandwidth, etc.) which makes the voice recognition engine work much harder. Garbage in, garbage out. (Bluetooth is even worse, due to limited bandwidth.)
I think this is why most users who dabble with speech recognition find it generally poor, even though the quality of, say, Windows 7's built-in speech recognizer is actually pretty decent.
It will be interesting to see how much Microsoft's Kinect (aka Project Natal) pushes the speech-in-the-living-room experience forward: with 3D cameras that can accurately identify a speaker's position in the room, coupled with an array microphone that can focus on that position, the potential is there for very good recognition.
And another thumbs' up for Jeff Hawkins' book On Intelligence, mentioned by an earlier poster: well worth a read for anyone interested in all types of machine recognition.
A very good summary, but there are some things which should be pointed out.
1. The graph is from the Sphinx Open Source recognizer from CMU.
2. It is from 2003
3. Sphinx today is 10 years behind IBM, Nuance, Google, and MultiModal (and 6 years behind Microsoft).
4. You are proposing many strawmen and burning them. What about the places where speech works?
The truth is, depending on the domain, many applications have crossed the 4% WER barrier. You just don't notice anymore. Did you know that all live closed captioning is dictated? Captioners 'respeak' what is happening and use macros. The majority of medical transcription is done with computers, and then corrected by correctionists (mainly formatting and billing code extraction/confirmation.)
Ever call 411? State and city, please? (That is speech recognition presenting a limited phone book to the operator.) Google Grand Central has gotten rave reviews.
If you still think speech recognition does not work, download the Dragon Dictation or Search apps for the iPhone/iPod/iPad. They are free. Try them out with some full, hard sentences. Don't just try single words, or things you would not normally say. Try it for real. Then decide.
@Eddy: Kinect will only react to keywords, nothing more.
I always wondered about the interest in speech recognition, handwriting recognition I can understand - it's just easier taking notes on a pad than on a keyboard (at least to me). The only application I could see for it is when you're pacing around the room brainstorming, but then the problem becomes that you're pacing and brainstorming, neither of which tends to produce clear speech.
The best thing they could do is couple this with language recognition, to recognise patterns and non-sensical sentences. So you have a certain % of recognition of which word it could be and you weigh that off against which word makes the most sense in context, but that's a whole different issue we haven't grasped yet either. Context sensitive language.
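The weighing this commenter imagines — acoustic confidence balanced against which word makes sense in context — can be sketched as a simple score combination. This is a toy illustration only; real recognizers integrate far richer language models, and both score tables here are invented.

```python
def best_word(acoustic_scores, context_scores, weight=0.5):
    """Pick the word that best balances 'what it sounded like'
    (acoustic_scores) against 'what makes sense here'
    (context_scores). weight=1.0 trusts the ears alone."""
    return max(acoustic_scores,
               key=lambda w: weight * acoustic_scores[w]
                             + (1 - weight) * context_scores.get(w, 0.0))
```

With plausible made-up numbers, a word that *sounds* slightly more likely can still lose to one the context strongly prefers, which is exactly the behavior the comment asks for.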
Guys, the reason why a machine cannot recognize speech well comes down to a linguistic matter: the human speech recognition mechanism itself is a kind of mystery.
"Speech recognition, no matter how good it might become, will never work for all aspects of human-computer-interaction."
Insert any other form of input in place of "Speech recognition" and that statement is still true. Keyboards are not very helpful in image manipulation, the mouse/trackball is terrible for word processing and an X-box controller is not so hot for playing Guitar Hero. Each has their strengths and weaknesses, and speech recognition is no different.
Actually the big difference is that each of those inputs has a very limited amount of information that needs to be processed... a keyboard only has so many keys, a mouse has X/Y coordinates and a few buttons, and a game controller has buttons, a control pad, analog stick, etc. But a voice is an audio stream of data that must be recorded, background noise filtered out, words must be parsed out, and without actually understanding the language it's impossible to differentiate between homophones. Considering the niche application of voice recognition, not a lot of people are concerned about solving those problems when other interfaces can be munged to work well.
I think it's weird that everyone feels the need to slam a technology as useless because it's not universally useful. Heck, even Star Trek, the quintessential example of voice recognition, still had sophisticated textual/graphical interfaces that you didn't talk to.
Speech Recognition is the technology of the future, and always will be.
Just out of curiosity I turned on voice recognition in Windows 7. I skipped the training. Next I read the poem. It got all but one word correct. This was using the built-in microphones on my laptop in a fairly quiet environment. Last time I tried this I don't remember it being nearly as good of an experience.
I think general voice recognition can be very sensitive to the particular user's voice. I use Google Voice and I find that the transcript is really hit and miss, however it is hit and miss depending on the speaker. For some callers it is very close to 100% every time that particular caller leaves a message. For others, it can be very bad every single time.
Improvements in voice recognition have been slow but not nearly as slow as suggested I think.
Dear aunt, let's set so double the killer delete select all
I use Google voice recognition all the time, both Goog411 and the Android voice recognition. Both are remarkably effective and useful, and the pervasive availability of voice input is (IMO) a killer feature of Android phones.
I've heard the reason Google gives away the Goog411 service for free is it lets them acquire a massive, user-corrected training set for their VR algorithms. It seems to be working. I've had cases where Goog411 handles cases ("Le Boulanger Café") that human operators failed at.
It's in the car where it really seems to have found its niche. Hands-free voice-activated dialing and Bluetooth over the car system mean not having to look away from the road, take your hands off the wheel, or even touch your phone to make a call.
Or setting the destination for the nav system while driving. Or even the radio. And so on.
But still I get frustrated with my own car's system. Partly due to some design flaws that cause minor annoyances. And partly due to it sometimes just not understanding what I'm trying to tell it. Even with an extremely limited vocabulary and command set, and even with the ability to train the system to my own voice, it *still* manages to get it wrong some of the time.
So yes, there's a long way to go. But I think it can prove to be a wonderful tool for some limited applications like that. Basically anyplace where your hands and eyes might be tied up doing something else.
Although the post is mostly about communicating with computers, you do mention the drawbacks of trying to verbally communicate with other people (via a computer). The benefit of verbal communication, and the main reason it is used, is that it frees your hands for something other than communicating, while you're communicating. That means you can do something -and- communicate at the same time. If you could only use your hands, you'd have to choose between doing something and communicating. In areas such as gaming this makes a huge difference, because the action doesn't allow for the breaks needed to type something to our fellow gamers. Typing would also be too slow, and probably inaccurate when rushed, to be efficient enough. Nor would you have time to look at the chat window (wherever it might be on the screen), because you have to focus on something else entirely. Here verbal communication offers a necessary complement to typing. And you are right: you do need to be rather precise, but it really doesn't take long to master proper commands if it's in a field of your interest.
can't live without it on my android phone, and it's pretty decent on windows 7 too. "whatever happened"? - it got good, and people use it as their preferred option every day.
the baseline human voice transcription accuracy rate is anywhere from 96% to 98%!
I bet the numbers will go down if the listeners are blindfolded and are asked to listen to random conversations. From the software point of view, it's an unfair competition.
I find it funny when "futurists" try to incorporate an age-old broken communication medium into current technologies. I can't figure out why restaurant drive-thrus still have a speaker and microphone! Haven't they screwed up enough orders to figure out that the system is flawed? Even humans can't communicate effectively with speech, which is why we are constantly saying "Come again?" and "Say what?" to each other like idiots. So how can we expect a computer to figure it out?
By the way, who had the *crazy* idea to incorporate text messaging into cell phones, and why? Seems to me like older people struggle to understand this more than younger folks. "Why would you ever want to type when you can just talk?!" they proclaim. Because some people, at some times, would rather press a hundred letters instead of saying a dozen words. The messenger can get their point across without drawing attention to themselves. The message is sent when the sender wants it to be sent, and it is read when the reader wants it to be read. What's wrong with that?!
Text/typing is also flawed, which is why it bothers me to still see things like voice/hand recognition being discussed. We would be better off thinking about the next NEW method of communication instead of trying to conjure some hybrid of past failures. This was a good post, Jeff!
I can imagine stackoverflow.com understanding my commands so I could answer questions a lot quicker. If not me, then hope for the next generation of developers. Something "not" similar:
SOF: Mr. Asad, there is a new question for you in the C# tag. Would you like to contribute?
SOF: What is the best library to use for "voice control" with C#?
ME: VoiceControl.Net...(gap): proceed
SOF: are you sure ?
SOF: I did not get that..
SOF: I did not get that..
SOF: I did not get that..
Me: :) :) :)-------
better type it before someone else answers it and wins the "Nice Answer" badge
I suggest you try Android voice recognition. Not sure if it goes above 80% accuracy, but it’s quite good. Voice recognition is very neatly integrated: any text field of any Android app can be talked into.
I realize you didn't create that graph, but I'm disappointed you would reproduce it given that it is highly misleading in several ways...
1. It uses the good old trick of manipulating the y axis to suit whatever interpretation of the data is wanted. Replot that data on a linear instead of logarithmic scale and voice recognition software suddenly doesn't look so dismal.
2. The data set shown here in red was cherry-picked from several found on the original NIST plot. When the number of words that need to be recognized is fixed and limited (such as commands that can be issued to a computer), voice recognition performs quite well.
3. The last data point on the plotted set is from over 8 years ago. Lack of data since then in no way implies that no progress has been made.
I'll add to the recommendation for "On Intelligence" by Jeff Hawkins. He clearly postulates why he believes AI the way it is currently being tackled will never work, and uses speech recognition often as an illustration. A must read if you're fascinated by how the human brain works, and what the (potential) future of AI would/could be.
A fun little personal experience:
I called a friend of mine recently and left a voicemail consisting entirely of me saying "penis" in every imaginable inflection/accent/volume/etc (I've known him since high school...this is par for the course with us).
Well, my friend uses Google Voice (I think that's the name), which takes your voicemail and translates it to text for him to read. Somehow, the application translated my message into a bizarre, but somewhat logical message from Denise, telling him that she would be late, but it would be nice to meet him. We were absolutely bewildered by this....
I wouldn't call Graffiti hand-writing recognition.
Jeff Hawkins turns it on its head - he had the user get trained rather than make the machine learn. It recognizes a relatively small set of 'strokes' and mathematically determines which is the best fit.
Better to call it limited 'print-writing' recognition at best.
I appreciate your thoughts on voice recognition, but you dismiss useful cases on the "fringe" too handily. I'm glad you stipulated that you found valid use in voice recognition given disability; I have a close family friend who makes her profession as a writer, and her Parkinson's disease makes this, a source of her identity, more and more difficult as her disease progresses. Voice recognition software lets her continue to "write", but as you mention, this software isn't at our desired Star Trek Quality Level quite yet.
Doctors have long recognized the impact of "quality of life" treatments on overall patient health. For instance, patients with degenerative, incurable diseases like Alzheimer's can temporarily stave off cognitive decline with simple things like playing Scrabble or socializing.
I hope that, given the massive investment in R&D for complex biological treatments by drug companies, somebody, somewhere, takes the time to consider what might be achieved by devoting significant brainpower to quality of life treatments, like voice recognition software for writers whose bodies betray them. I don't think of voice recognition as a luxury, I think of it as a healthcare issue. If it spills over to be useful for us in the mainstream, that's just a pleasant side effect.
Currently I am working on a voice command application for Android and for the most part users find it accurate enough to be useful.
Some things I do to combat poor recognition (Google's) are treating similar words, like "hi" vs. "high", as the same in certain situations, and applying many types of fuzzy matching in others. Solutions vary case by case, but there are certainly ways to make that 80% sting a little less.
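The homophone-folding and fuzzy matching described above might look something like this. It's an illustrative sketch, not the commenter's actual app: the homophone table is made up, and the similarity cutoff is arbitrary.

```python
import difflib

# A small, hand-made homophone table. In a real app this would be
# tailored to the command vocabulary; these entries are illustrative.
HOMOPHONES = {"hi": "high", "to": "two", "for": "four"}

def match_command(heard, commands, cutoff=0.6):
    """Map a recognizer's guess onto the closest known command,
    folding known homophones together first. Returns None when
    nothing is close enough -- better to reprompt than misfire."""
    heard = HOMOPHONES.get(heard.lower(), heard.lower())
    close = difflib.get_close_matches(heard, commands, n=1, cutoff=cutoff)
    return close[0] if close else None
```

Even this crude normalization catches the "hi"/"high" class of error before it reaches the user.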
For the most part, it's novel as a cool party trick, but echoing what others have said, I have many disabled, impaired, or purely hands-free users who find voice recognition, while flawed, a great way to interact with something they previously could not, with no additional hardware.
I suspect that a big driver behind speech and handwriting recognition was simply that 15-20 years ago there were a lot of adults in their 40's who had never learned how to type, much less use a mouse. I would rather click on a column and sum, but that's because I grew up learning how to do this quickly and accurately.
White collar professionals at the peak of their careers in the late 80's were often told that typing was a skill for secretaries and that all they had to do was give dictation. The secretary would do the "speech recognition".
Have you tried Windows 7's handwriting recognition? It's pretty damn accurate!
Voice recognition still sucks in Windows 7 but after training it's usable.
How ironic that most people use their iPhones, Blackberrys, Androids, et al for anything BUT verbally communicating. Using a telephone to actually SPEAK with another person? How quaint.
The biggest problem with vocal interaction with computing devices may simply be that we're infatuated with the romantic Hollywood versions of it. Like programming, which seems so dynamic and cool in movies, we're not prepared for how tedious and laborious the real versions of these interactions usually are. Given that Star Trek's teleportation scheme was just a creative way for the show's producers to circumvent budgetary limitations, I suspect that voice recognition was likewise more a means of quickly moving the story along than actually depicting or predicting a useful future technology.
On the handwriting tangent, ya hafta wonder why Jobs & Co. didn't build handwriting recognition into the iPad. Is it that he simply didn't see the need to compete with tablet PCs, he didn't believe it'd be successful enough, or that he views handwriting as a useless and all but dead technology?
Here is some official information about IBM's Watson: IBM BlueGene/P (a.k.a. "what is Watson?").
People who know me also know I am very passionate about sci-fi. So when I saw this today I just couldn't help thinking about HAL in "2001: A Space Odyssey" as well as Skynet in the various Terminator movies.
Kitchener / Waterloo / Cambridge
Last year I had a problem with my right wrist and was unable to use my hand for 10 days. The first thing that came to my mind was searching for a voice recognition software, only to find out that it simply doesn't exist nowadays.
I agree that for non-disabled people it wouldn't be the most productive way to use a computer, but it would be excellent for people who are disabled in some way.
I mean, except for thought recognition, which would be much better.
I don't know about voice recognition, but handwriting recognition is a very significant variant of Chinese input systems. Without it, I couldn't imagine how my dad could learn to type the Chinese names of his friends, colleagues, and relatives into his mobile phone.
The art of Chinese typing through a 105-key keyboard, let alone a 12-key numpad on a phone, still has quite a learning curve. Therefore Chinese handwriting recognition systems have been actively developed and improved through these decades, bridging the gap for the elderly. I remember when I was small, the recognition rate of the computers in public libraries was quite low. Nowadays it is a lot more satisfactory.
I think you are missing the point. It's not about speech recognition error. It is about finding the appropriate application for speech recognition such that users will tolerate an error here or there. It has to be the kind of application where the user will gain a huge benefit from using speech recognition.
My Android app seems to be a good example: 1250 people used speech recognition to have their recipes read out loud. I think they just don't want to get their million-dollar phones sticky; isn't that worth waiting through a few speech recognition errors?
Speech recognition is still alive in this app: www.digitalrecipesidekick.com
For general purpose (unconstrained) voice recognition, the best thing we could hope for (without real AI) is probably about the same level as an Aspie/Autist understands verbal communication; i.e. very literal. This might be good enough for cheap closed captioning, but not for making a more intuitive user interface (at least not for a non-geek). For geeks (closer to the Autism end than the Normal end of the autism spectrum) it might soon be good enough for some usage... ;)
The real problem is that if the speech recognition requires a fair amount of training, a tactile interface is likely to be easier to learn to use fast.
Dictation using a human typist is not faster (from thought to final written text) than typing it yourself, if you're a good typist. Though a good stenographer may manage to extract the essence of a brainstorm and write it down faster than an amateur would, it's very unlikely that an unintelligent machine would ever be able to do so...
Hi Jeff, I have pretty strong opinions about this (I guess a lot of people do, but I normally don't have strong opinions about many things).
1. Not nearly enough attention is paid to the argument for people with disabilities.
With an estimated 75% of the population having some sort of disability (I don't have the stats on me, but check out the whole chapter in "Don't Make Me Think V2"), it's not so much about whether YOU "...would I want to extend the creaky, rickety old bridge of voice communication..." to anything, but whether or not it would benefit the world as a whole. 75% of the population is lots of people - the potential majority.
Look no further than your own brainchild, SO - For developers, it's not perfect, but people need it and "it works": http://stackoverflow.com/questions/87999/voice-recognition-software-for-developers
2. Most reviews of this type of software are full of... it.
How is it that anyone who 'reviews' voice recognition or hand writing recognition spends a whole 5-10 minutes training/using it and calls it crap. I get that some people have the attention span of a goldfish but this software isn't magic (yet). As a result, the reports of voice and hand writing recognition being crap are highly exaggerated.
- I write software that relies on Windows XP Tablet PC Edition and the hand writing recognition works - given that you don't write like a slob. It's not perfect and it doesn't have built in learning yet, but when it does, it'll only be better.
- Dragon Naturally Speaking, once trained works VERY well. The CEO of my company is functionally blind. He uses a screen reader with MS Mike and Mary as well as Dragon to great success. He dictates about 30-40 e-mails a day in between everything else he does. I've found errors in maybe 5% of his e-mails to me. Take a look at your inbox: that's likely better than most people's typing.
It's worrying that, given your public voice, people might take what you say here as a truthful indication that these technologies don't work, and aren't worthwhile. They do, and they are.
One common technique sometimes seen portrayed in hard science-fiction is subvocalization. Basically, your brain thinks about speaking, your body starts to go through the motions of forming speech, but no actual detectable noise comes out. The general idea in the fiction seems to be that the computer doesn't directly attempt to transcribe the sounds of the speech as such; instead, it reads the muscle twitches or nervous impulses or brainwaves or something to determine what you were trying to say.
At least it'll be quieter down at the local Starbucks.
I just want to second that the handwriting recognition in Windows 7 is, in my opinion and experience, quite nice. With absolutely no training on my HP TM2, it reads both my print and cursive with maybe a 1-letter error per sentence. But by far, the best part is their interactions after you've written something. Completely intuitive. Don't like a word? Draw a line through it, and it's gone. It got a letter wrong? Tap the word, and it explodes it out by letter. You can then insert, delete, or overwrite individual letters.
Its only drawback is that it is dictionary based, leading to lots of corrections for words which, well, aren't really words.
I started using voice recognition almost 20 years ago, due to overuse injuries (15 years as a programmer was more than my tendons could cope with). Back then it was discrete speech, and required about three hours of training. (reading lists of words, then letting it tweak a user specific model) Now, it takes minimal training, and allows continuous speech. (some do take samples of your writing style to tune its grammar model).
There are some real advantages. It is a MUCH faster typist than I used to be, and it can spell, something I was never noted for. Its accuracy is good enough to be annoying, at 95-97%. You still have to watch it; it's rarely wrong, but when it is, your spell checker will be no help finding its mistakes.
By the same token, it's not much good for a programmer. The problem isn't the language syntax, but the identifiers. The system is geared to insert correctly spelled words separated by spaces, NotMixedCaseMashedTogetherDevisedByTheVwlAlrgc. If it's just your code, you can cope by picking names that are easy to say, but on large systems you will never be dealing with your code only. It's why I stopped programming and started managing: I could produce code, but not fast enough to satisfy me.
Oh, and while some systems provide a means to operate the mouse, none of them make generating identifiers pleasant. There are head and eye tracking systems, but they aren't for the able-bodied: if you can move your upper body, the pointer will never stay where you want it.
I mentioned grammar models above; that is the trick with accuracy. That 80% accuracy figure is typical English without a grammar model: there are too many homophones in the language to do much better. Did you want me to insert "there" or "their"? If you know the 3 or 5 words before, you can make a pretty good guess. If you can delay inserting until you get the following word, you can do even better. When you are modifying a document, you want the system to be able to query the editor and get the words surrounding the cursor, so the editor/browser/whatever has to be built to cooperate.
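As a toy illustration of what even a crude grammar model buys you: score each homophone candidate by how often it follows the preceding word in some corpus, and pick the winner. The mini-corpus and counts below are invented for the example; real systems use vastly larger n-gram or neural language models.

```python
# Toy bigram "grammar model" for homophone disambiguation.
# The corpus here is invented purely for illustration.
from collections import Counter

corpus = (
    "they went to their house . "
    "there is a house over there . "
    "their dog is in their yard . "
    "there is a dog ."
).split()

# Count bigrams (pairs of adjacent words).
bigrams = Counter(zip(corpus, corpus[1:]))

def pick_homophone(previous_word, candidates):
    """Return the candidate most likely to follow previous_word."""
    return max(candidates, key=lambda w: bigrams[(previous_word, w)])

print(pick_homophone("in", ["their", "there"]))    # their
print(pick_homophone("over", ["their", "there"]))  # there
```

The same idea extends to looking at the following word as the comment describes: just delay the decision one token and score trigrams instead of bigrams.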
One of the real barriers to voice input is the modern open office. You did not want to be in the next cube to a voice input user, especially in the old discrete speech days. Dana Bergen wrote this after some time spent with one of the early discrete speech systems (which also required a wired headset microphone if you wanted any accuracy).
Summary: if you spend your time creating original English text (like most of this post), it will be faster than your typing, and will improve your spelling. If your hands don't work but your mouth still does, it will let you rejoin the online world. It's not the answer to all the computing world's problems, but no single tool is.
This post was untouched by human hands.
For a laptop or desktop, I'm with you: typing and mousing is a lot more effective than speaking.
But on a mobile phone with a screen that fits in your hand and a tiny software keyboard, typing is a royal pain in the ass. The device is designed to pick up the human voice, so it'd be nice if it could do more than just copy that voice to somewhere else on the planet. Unfortunately, the most frequent and annoying thing I type on my phone is my password, which is optimized for having lots of numbers and punctuation, not being easy to type on a simplified keyboard. And I don't want to speak that into the phone, even if I could.
For universal speech recognition, we need to do a lot better job with semantics. The human perceptual system has an impressive amount of interaction between high-level and low-level language processing for disambiguation, filtering, prediction, etc. If you proposed the human speech system to a software architect, the high coupling, lack of clean interfaces, and legacy spaghetti code would drive him nuts.
Still, I can't help thinking that the technology, properly integrated with a rigorous grammar model such as exists in programming languages and IDEs, would be hugely accurate and successful.
I believe that all the technology is ready NOW, for a near-perfect implementation integrating voice recognition for, say, Visual Basic.Net programming. This seems especially feasible with the addition of Visual Studio's Intellisense rules.
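One way to picture the Intellisense idea: if the editor exposes which identifiers are legal at the cursor, the recognizer only has to choose among a handful of candidates instead of all of English. This is a hedged sketch, not any real recognizer or Visual Studio API; the identifier names and the fuzzy-matching heuristic are invented for illustration.

```python
# Sketch: map a spoken phrase onto the closest identifier that is
# in scope at the cursor (Intellisense-style candidate narrowing).
# All names below are hypothetical examples.
import difflib

def resolve(spoken, identifiers_in_scope):
    """Map a spoken phrase to the closest in-scope identifier."""
    normalized = spoken.replace(" ", "").lower()
    # Index identifiers with case and underscores stripped away.
    lookup = {ident.replace("_", "").lower(): ident
              for ident in identifiers_in_scope}
    match = difflib.get_close_matches(normalized, lookup.keys(),
                                      n=1, cutoff=0.6)
    return lookup[match[0]] if match else None

scope = ["customer_name", "CustomerOrder", "total_price"]
print(resolve("customer name", scope))  # customer_name
print(resolve("total price", scope))    # total_price
```

The point is that the search space shrinks from an open vocabulary to a few dozen legal symbols, which is exactly the constrained-vocabulary regime where recognition already works well.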
Very interesting topic. Maybe before we can venture out onto 100% voice recognition we need prior advancements in the field of Artificial Intelligence. I mean, unless a computer 'understands' us the way we understand another human being, it cannot really tell the difference between the phonetic representations of 'their' and 'there', or 'they' and 'day'. I guess it needs 'common sense' to guide it in choosing an interpretation.
Dunno how we're gonna infuse common sense into our computers. What if it meant we had to take whatever technology we have, everything we're building on, tear it down to the basics, and come up with something completely new? Maybe someone will actually do this and show us that yes, voice recognition tech is worth more than the cost of tearing computing down and starting over... even if temporarily :)...
Well for the time being it would be cool if we just had everyone use their fingerprints or retinas to identify themselves over the internet... hmmm or maybe voice prints (dunno if each voice is unique) :)
I'd like to see a lot more points on that graph before I drew those nicely decreasing lines on it. I really wouldn't be surprised if voice recognition algorithms are prone to a kind of punctuated equilibrium, i.e. new techniques are adopted giving large improvements and then have periods of stagnation (or only slight improvements).
@Hymnos, never ever trust Gates. He has been wrong much more than right. He blurbs stuff and hopes it comes true one day. I think MS is only as big as it is because it had Gates as skipper on the business side. He was smart enough to realize he is not as smart as people make him out to be, so he bought lots and lots of smart people to work for him instead.
So in that sense he is brilliant, for the rest, not so much.
Voice mail to text would be nice. I can't understand someone leaving their phone number at 700,000 cpm anyway. Maybe the sender could be forced to edit the transcript before sending?
Voice will never work until we are required to prepend "Computer" to every command.
For some reason, the activation of speech-to-text on Android is pretty rough in the default OS. Beyond that, it is pretty lame sometimes: if your vocabulary is a little different, it does NOT handle it well. I've definitely thought about getting a better speech-to-text program, but I'm too cheap.
Ubiquitous voice programming, it seems, will have to wait for a universal ID system that includes your own personal voiceprint. What's interesting is that if you had a very recognizable word ('computer,' for example) reserved for starting authentication, and a unique marker (e.g., a small radio transmitter), you could probably build a centralized system that did voice recognition much more effectively.
SOoooooooo....... You want Star Trek, wait ten years and hope cloud computing catches up to what you want. Someone's working on it.
One thing no one seems to be considering is how much the OS matters when it comes to input methods (voice, touchscreen, mouse/keyboard, etc.).
The touchscreen was not invented for the iPhone, yet Apple's iPhone OS was the KEY to making good use of the technology. Why couldn't you apply the same thought to voice recognition?
Just try to imagine a new OS which would be centered around voice recognition, and thus new applications also made to be used with voice.
Windows (and every app made for it) is designed with the very idea that mouse and keyboard will be used as inputs; the iPhone, with your fingers. They work fine that way. Why would voice work efficiently to control these OSes?
Would you need a taskbar, icons or the minimize/close buttons on a real voice-controlled OS? Of course not, it'd be ridiculous.
It's hard, of course, to reimagine things in a new way, but hey, Apple did it.
You're correct, Jeff, it doesn't work because the bandwidth of spoken communication is severely limited.
Ever heard how a picture says 1,000 words? I would say a mouse click says a few hundred.
Of course, because human language is always changing in tone and usage, speech shouldn't be something to use when communicating with a computer. Computers are in theory "perfect", and human speech is far from the same standard. So, ditching speech recognition entirely and moving on to some other data input method is most likely the easiest and cheapest path to take. Although, speech recognition can still be useful in some things, like for people who have a hard time typing (handicapped). But, if we explore the other methods for input, maybe someday we will find something far better than speech recognition.
I hope this made sense. I just type the words as I think them up. :)
When you constrain the voice input to a more limited vocabulary -- say, just numbers, or only the names that happen to be in your telephone's address book -- it's not unreasonable to expect a high level of accuracy.
Expect, yes, achieve, no. On an iPhone with over 400 contacts I've found the Voice Control dialing to have less than 5% success. It improves, slightly, if you append "home", "mobile" or "work" to the end of "call so-and-so", presumably by limiting the search to the set of contacts that have the available services, but still usually calls the wrong person.
And you should forget about using Voice Control to play tracks or albums in a music collection of over 2,000 tracks.
Regarding Minority-Report-style hand-waving interfaces, there's always the obligatory OK/Cancel comic: http://okcancel.com/comic/3.html
An interface that forces you to wave your hand around in mid-air is very tiring. It may work for a minute but not for a work day.
As for handwriting recognition: it definitely has its uses. I don't use it to write larger bodies of text, but I do own a Tablet and frequently use OneNote for sketches or diagrams. I still do most of the text there by typing, but the very fact that even my handwriting is searchable (and the recognition is pretty good) is a big plus for anything I quickly scribble down. And the Math Input Panel in Win 7 does its job quite well (once you're accustomed to writing numbers the way it expects them; for me, writing a 1 as a simple line is still awkward), and for some formulas it's faster than typing.
The main problem with Tablet PCs, though: quickly changing from keyboard to pen input involves turning the display. Direct manipulation on the screen with a pen is nice, but you notice accuracy errors, you have parallax, and having the screen flat on the table doesn't let you see as well as having it angled. But the angled screen is unsuitable for pen input.
Handwriting recognition does exist on the iOS devices in the form of Chinese input and it's actually quite impressive.
I wonder how much has really been invested in voice recognition.
Voice recognition tech would come in handy when interacting with automated homes, appliances, cell phones, etc. To close the master bedroom shades, you'd say "Upstairs Master Shades Close". The software listens for a small group of voice commands for the first word, like "All, Upstairs, Downstairs and Basement"; this first command tells the computer which floor to activate. On each floor there is a series of other voice commands, the room list: "Bath, Childrens, Guest, Master, Closet and Hallway".
So with each command you get more and more specific. Far from conversational speech, but it is logical and direct. No bullshit = no error. You probably would not even have to voice train with a minimal voice interface like this.
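The narrowing scheme above can be sketched as a fixed command tree: each recognized word walks one level down, so the recognizer never chooses among more than a handful of options at any step. The house layout and vocabulary below are invented for illustration.

```python
# Minimal sketch of a hierarchical voice-command vocabulary.
# Inner nodes are dicts (floor -> room -> device); leaves are the
# sets of allowed actions. The layout is a made-up example.
command_tree = {
    "upstairs": {
        "master": {"shades": {"open", "close"}},
        "guest":  {"shades": {"open", "close"}, "lights": {"on", "off"}},
    },
    "downstairs": {
        "hallway": {"lights": {"on", "off"}},
    },
}

def parse(command):
    """Walk the tree one word at a time; return the path, or None."""
    node = command_tree
    path = []
    for word in command.lower().split():
        if node is None or word not in node:
            return None                    # out of vocabulary: reject
        path.append(word)
        node = node[word] if isinstance(node, dict) else None
    return path if node is None else None  # must end on a leaf action

print(parse("Upstairs Master Shades Close"))
# ['upstairs', 'master', 'shades', 'close']
```

Because each step is a choice among a few fixed words, this stays in the near-100% "voice control" regime the post describes, rather than open-ended recognition.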
Now, in terms of voice transcription like the Google YouTube video thingy, I assume you'd need faster processors and much more thorough programs to determine the difference between the spoken versions of 'site' and 'sight'. That's context recognition: it'd have to understand context through 'keyword' collection and relationships. The semantic web comes into play here, and that is not yet here. Advanced machine learning and AI have to be present, and vast databases will be called upon. So much faster processors will be needed.
But why screw with voice recognition when thought recognition is the real aim? It's much more elegant and minimal than voice: less waste and confusion. Along with thought comes vision, so future computer interfaces will be purely visual and mental. What needs to happen is a full merger between the fields of synthetic biology, nanotechnology, AI, machine learning, computer hardware, etc. We can really start to see that the lines are beginning to blur in some cases. http://www.youtube.com/watch?v=IyAOepIU6uo
While there seem to be plenty of reasons to doubt the conclusion of the post, let's assume it's right.
To me, that would seem to mean that the current technology for voice recognition is based on a flawed assumption, and that progress will not be made until a new technology comes along that tosses out the old assumptions and starts fresh with a new look at the problem.
Once a technology, even a flawed one, is established, it's devilishly hard to get rid of it and start from scratch. (We mostly all use QWERTY keyboards, which were actually designed to slow down typing, almost 50 years after the problem they were created to solve has vanished.) But sometimes that's what it takes to solve the problem.
There is a pretty successful science fiction writer named David Weber who does not type his books, but dictates them with voice recognition software. He had an accident of some kind, after which he can't type. He discussed this in an interview on a science fiction site called thedragonpage.
Go to Cover to Cover -> Show Archives (drop-down); his interview is there. The next show actually discussed writing books using voice recognition software instead of typing.
Hold your horses. It hasn't arrived yet. It's called a voice recognition card similar to a video card. Be patient.
If you think back to the sophistication required to complete many of the common computer requests in Star Trek, I think you'll realize that accurate recognition of human speech was pretty trivial in comparison.
Usually there would be a request like, "Computer, analyze this phenomenon and then search your data banks for any previous occurrences. What is the most probable cause?"
Star Trek computers were apparently all sentient beings. They were just extremely modest sentient beings and didn't want to make a big deal about it.
Ah, "data banks". As a phrase it seemed to make so much more sense back then.
Hopefully someone will read this.
I live in a slight version of terror that speech recognition will become commonplace. I stutter. Not terribly, but it's there. It is even more pronounced when I have to pause, and clearly say something without any type of preamble. It's even worse when it is my last name.
I program for a living, and love it. However, if I think people are listening to me, my stutter becomes more pronounced. I can sing fluently, but I think singing might make me feel stupid, screw up the voice recognition software, etc.
As we all (should) take handicaps into account as we program (blindness, deafness, etc.), please, let's not forget the people in the world with a very un-publicized impediment to life, the stutterers.