Whatever Happened to Voice Recognition?

June 21, 2010

Remember that scene in Star Trek IV where Scotty tried to use a Mac Plus?

[Image: Star Trek IV, Apple Mac Plus]

Using a mouse or keyboard to control a computer? Don't be silly. In the future, clearly there's only one way computers will be controlled: by speaking to them.

There's only one teeny-tiny problem with this magical future world of computers we control with our voices.

[Chart: voice recognition accuracy rate over time]

It doesn't work.

Despite ridiculous, order-of-magnitude increases in computing power over the last decade, we can't figure out how to get speech recognition accuracy above 80% -- when the baseline human voice transcription accuracy rate is anywhere from 96% to 98%!

In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space.

As Robert Fortner explained in Rest in Peas: The Unrecognized Death of Speech Recognition, after all these years, we're desperately far away from any sort of universal speech recognition that's useful or practical.

Now, we do have to clarify that we're talking about universal recognition: saying anything to a computer, and having it reliably convert that into a valid, accurate text representation. When you constrain the voice input to a more limited vocabulary -- say, just numbers, or only the names that happen to be in your telephone's address book -- it's not unreasonable to expect a high level of accuracy. I tend to think of this as "voice control" rather than "voice recognition".
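To see why a constrained vocabulary helps so much, here is a minimal Python sketch. It assumes a hypothetical recognizer that hands back a possibly garbled text guess; the function name `snap_to_vocabulary`, the sample commands, and the 0.6 similarity cutoff are all made up for illustration and don't come from any real speech product. With only a handful of legal commands, even a bad guess can be snapped to the nearest one, and anything too far off is simply rejected.

```python
import difflib

# Hypothetical small command vocabulary, e.g. names in a phone's address book.
VOCABULARY = ["call alice", "call bob", "call carol", "hang up"]

def snap_to_vocabulary(guess, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the vocabulary entry closest to a noisy guess, or None if nothing is close."""
    matches = difflib.get_close_matches(guess.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_vocabulary("coal alice"))                    # call alice
print(snap_to_vocabulary("hand up"))                       # hang up
print(snap_to_vocabulary("light exerts force on matter"))  # None
```

The trick only works because the list of valid inputs is tiny. Open-ended dictation has no such list to snap to, which is exactly where accuracy falls apart.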

Still, I think we're avoiding the real question: is voice control, even hypothetically perfect voice control, more effective than the lower tech alternatives? In my experience, speech is one of the least effective, least efficient forms of communicating with other human beings. By that, I mean ...

  • typical spoken communication tends to be off-the-cuff and ad-hoc. Unless you're extremely disciplined, on average you will be unclear, rambling, and excessively verbose.
  • people tend to hear about half of what you say at any given time. If you're lucky.
  • spoken communication puts a highly disproportionate burden on the listener. Compare the time it takes to process a voicemail versus the time it takes to read an email.

I am by no means against talking with my fellow human beings. I have a very deep respect for those rare few who are great communicators in the challenging medium of conversational speech. Though we've all been trained literally from birth how to use our voices to communicate, voice communication remains filled with pitfalls and misunderstandings. Even in the best of conditions.

So why in the world -- outside of a disability -- would I want to extend the creaky, rickety old bridge of voice communication to controlling my computer? Isn't there a better way?

Robert's post contains some examples in the comments from voice control enthusiasts:

in addition to extremely accurate voice dictation, there are those really cool commands, like being able to say something like "search Google for Balloon Boy" or something like that and having it automatically open up your browser and enter the search term -- something like this is accomplished many times faster than a human could do it. Or, being able to total up a column of numbers in Microsoft Excel by saying simply "total this column" and seeing the results in a blink of an eye, literally.

That's funny, because I just fired up the Google app on my iPhone, said "balloon boy" into it, and got .. a search for "blue boy". I am not making this up. As for the Excel example, total which column? Let's assume you've dealt with the tricky problem of selecting what column you're talking about with only your voice. (I'm sorry, was it D5? B5?) Wouldn't it be many times faster to click the toolbar icon with your mouse, or press the keyboard command equivalent, to sum the column -- rather than methodically and tediously saying the words "sum this column" out loud?

I'm also trying to imagine a room full of people controlling their computers or phones using their voices. It's difficult enough to get work done in today's chatty work environments without the added burden of a floor full of people saying "zoom ... enhance" to their computers all day long. Wouldn't we all end up hoarse and deaf?

Let's look at another practical example -- YouTube's automatic speech recognition feature. I clicked through to the first UC Berkeley video with this feature, clicked the CC (closed caption) icon, and immediately got .. this.

[Image: UC Berkeley physics lecture]

"Light exerts force on matter". But according to Google's automatic speech recognition, it's "like the search for some matter". Unsurprisingly, it does not get better from there. You'd be way more confused than educated if you had to learn this lecture from the automatic transcription.

Back when Joel Spolsky and I had a podcast together, a helpful listener suggested using speech recognition to get a basic podcast transcript going. Everything I knew about voice recognition told me this wouldn't help, but harm. What's worse: transcribing everything by hand, from scratch -- or correcting every third or fourth word in an auto-generated machine transcript? Maybe it's just me, but the friction of the huge error rate inherent in the machine transcript seems far more intimidating than a blank slate human transcription. The humans may not be particularly efficient, but they all add value along the way -- collective human judgment can editorially improve the transcript, by removing all the duplication, repetition, and "ums" of a literal, by-the-book transcription.

In 2004, Mike Bliss composed a poem about voice recognition. He then read it to voice recognition software on his PC, and rewrote it as recognized.

a poem by Mike Bliss

like a baby, it listens
it can't discriminate
it tries to understand
it reflects what it thinks you say
it gets it wrong... sometimes
sometimes it gets it right.
One day it will grow up,
like a baby, it has potential
will it go to work?
will it turn to crime?
you look at it indulgently.
you can't help loving it, can you?

a poem by like myth

like a baby, it nuisance
it can't discriminate
it tries to oven
it reflects lot it things you say
it gets it run sometimes
sometimes it gets it right
won't day it will grow bop
Ninth a baby, it has provincial
will it both to look?
will it the two crime?
you move at it inevitably
you can't help loving it, cannot you?

The real punchline here is that Mike re-ran the experiment in 2008, and after 5 minutes of voice training, the voice recognition got all but 2 words of the original poem correct!
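One way to put a number on results like these is word accuracy: one minus the word-level edit distance divided by the number of words in the reference text. The sketch below is a minimal Python illustration of that common metric, not necessarily how any of the figures quoted in this post were measured. The function names are made up for illustration, and it scores only the first three lines of the 2004 poem.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance counted in whole words rather than characters."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the distance between the reference words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # reference word dropped
                               current[j - 1] + 1,       # extra word inserted
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def word_accuracy(reference, hypothesis):
    """1 - (word errors / number of reference words)."""
    errors = word_edit_distance(reference, hypothesis)
    return 1 - errors / len(reference.split())

original = "like a baby, it listens it can't discriminate it tries to understand"
recognized = "like a baby, it nuisance it can't discriminate it tries to oven"

print(word_edit_distance(original, recognized))       # 2
print(round(word_accuracy(original, recognized), 2))  # 0.83
```

Two wrong words out of twelve works out to about 83% on this excerpt, and that is for a poem made of short, everyday words.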

I suspect that's still not good enough in the face of the existing simpler alternatives. Remember handwriting recognition? It was all the rage in the era of the Apple Newton.

[Comic: Doonesbury on the Newton]

It wasn't as bad as Doonesbury made it out to be. I learned Palm's Graffiti handwriting recognition language and got fairly proficient with it. More than ten years later, you'd expect to see massively improved handwriting recognition of some sort in today's iPads and iPhones and iOthers, right? Well, maybe, if by "massively improved" you mean "nonexistent".

While it still surely has its niche uses, I personally don't miss handwriting recognition. Not even a little. And I can't help wondering if voice recognition will go the same way.

Posted by Jeff Atwood
122 Comments

I'm not sure if you guys/gals have seen this, but here's a comical drama of Voice Recognition Technology vs. Scottish Accent.

http://singularityblog.singularitysymposium.com/the-perils-of-voice-recognition-technology/

Ramezah Yusof on July 7, 2010 9:53 AM

If we can't nail voice recognition, then the robocalypse (that we all secretly want because robots, even blood thirsty murderous robots, are cool) will never happen.

Thomas Elders on July 9, 2010 8:52 AM

Sadly even text to speech isn't very good. You would think with all the technology we'd be able to at least get that right.

Justin Killen on July 12, 2010 10:01 AM

I think you're missing the point.

The next step after reliable voice recognition is proper language comprehension. I mean like an incredible secretary inside your computer.

For example, you could bark instructions such as the following at your computer: "I was working on a document a few months ago and can't seem to find it. Ehh, I think I had written about 5 or so pages - it was an initial stab at implementing a proprietary database replication - can't remember if I archived it in my emails or saved it as a separate doc..."
And the computer would pop up the best candidates. "Is it one of these.."
Or even: "Can you take Jim's presentation from last month's management meeting - it should be on the shared drive (if not send him an email asking for the location) and use that format with the points I recorded yesterday. Try to make the formatting look really good. Oh yeah - can you order me a BLT - but no mayo this time. I'd like it delivered for 1:30. With a can of coke."

Try doing that faster with a keyboard and mouse.

David Wilson on July 12, 2010 3:05 PM

@Atwood: "While it still surely has its niche uses, I personally don't miss handwriting recognition. Not even a little. And I can't help wondering if voice recognition will go the same way."

@Jared Taylor: if you can type faster than handwrite (which most people can) obviously handwriting recognition is pointless.

Well I find I'm using HWR all the time. There is a danger about the keyboard, it can be too fast. Writing quickly without first forming one's thought and argument carefully leads to a less effective and persuasive communication. For this reason my Apple Newton continues to get daily use. I've even got one doing duty as a web server: http://misato.chuma.org:8080/

For me, there are many situations where pulling out a keyboard device isn't culturally acceptable. I have lots of business meetings in a day. Staring into a screen and tapping into a keyboard is seen as distracting and if focused on it too long, people start thinking you're playing.

@Rob O'Daniel: "On the handwriting tangent, ya hafta wonder why Jobs & Co. didn't build handwriting recognition into the iPad. Is it that he simply didn't see the need to compete with tablet PCs, he didn't believe it'd be successful enough, or that he views handwriting as a useless and all but dead technology?"

I think Steve didn't want
* the bulk of a decent sized stylus to ruin the aesthetic value of the iPhone design.
* the complaints he might get from people with poor quality handwriting
* to relearn the lessons from the Newton. Apple learnt from the Newton experience that it was much speedier navigating around and getting things done just by tapping rather than writing. The Newton UI guide talks about this and recommends to designers that they minimize any writing required: See http://www.4shared.com/document/a5KyvcqG/ui_guide.html and http://www.4shared.com/document/w7ztLVE0/uiguidl.html

Tony Kan on July 16, 2010 3:51 AM

Completely agree with Matt Dawdy!

Though, I think the unhandicapped world is too addicted to the mouse and keyboard/pad to speak or 3d'ize the work environment.

Izlooite.blogspot.com on July 16, 2010 5:20 AM

Hmm, not sure where those numbers came from. I have an acquaintance who is a speech recognition researcher & (as far as I can remember) he told me that the current accuracy of recognition of clean speech (speech without any background noise) is about 95 to 98%. The caveat is that the recognizer has to be trained for the person's voice and a particular acoustic model. According to him, the big problem for speech recognition isn't recognizing clean speech, it is recognizing speech with background noise.

Another acquaintance at a big consumer electronics company spent several years working on voice control for household items like air conditioners, but his company gave up on the research because it seems that the level of accuracy they achieved, about 80%, was unacceptable to consumers.

Bongo Felafel on July 17, 2010 3:56 AM

Sorry for the late post.

In regards to speech impediments, regional accents and general usefulness for the handicapped, you guys should check out VoiceAttack (http://www.voiceattack.com). It works with Windows' speech recognition engine (which means it needs to be trained to your voice - which makes it good for those that have certain speech impediments). What it does, basically, is execute macros based on specific phrases that you input. It is set up mostly for gamers (keyboard input), but, you can launch/kill programs and play .WAVs and text to speech, etc. Thought it might be worth mentioning.

- Gary

Gary Magenheimer on July 19, 2010 3:14 PM

You put the maximum accuracy rate of software transcription at < 80%. I assume that is regarding recognition when no context is employed.

Where the context is known, recognition can be much more accurate.

Analyzing the speech that can be recognized and then using that to help provide context to unrecognized speech is where the gains in accuracy are going to come from...

I'm a little bit hard of hearing and I often have a hard time understanding what people are saying. I use what I hear to give context to the words that I didn't catch or misheard.

It will happen. Half the text I enter on my Android phone, I enter using the speech recognition. It does a fairly accurate job with a little light editing.

Timothy Lee Russell on July 23, 2010 9:06 AM

I'm reminded of this post from early Java One days, when attendees suggested using voice recognition to speed up the talk transcription process: http://java.sun.com/developer/technicalArticles/InnerWorkings/TranscriptHumor/ - especially the part about the questionably feasible "virgin control".

Paul McGuire on July 24, 2010 11:02 AM

First thing, the idea of voice recognition in a cubicle farm isn't nice. I'm stealing an example from someone else in the days of DOS, but imagine a guy talking to the technical guru who happens to be a bit loud... "How do I wipe my hard disk?", "format c:", "are you sure", "yes" (everyone in the room starts screaming).

As for why voice recognition isn't progressing any more, I think the problem is older than it looks. No-one has been even trying to improve voice recognition for years. What they've been working on is statistics. This is most likely to be followed by that, yada yada. This can help improve the results from actual voice recognition to a point, but at some point you have to improve the underlying voice recognition too. The stats aren't magic - they need something to work with.

Steve on July 30, 2010 4:10 AM

What a relief to know I'm not alone in realising voice recognition is a LO-O-O-O-ONG way from being the solution to mankind's ills. I've been hampered by a work-related arm injury for 18 months and cannot tell you how many people (employer, insurer, etc) think that just because they've provided me with voice recognition software, I should be just as productive as before. AAARGH!!!

Hilarious side note (not). When I asked Macspeech Dictate to cache this page so I could comment, it crashed. Hollow laugh...

Jdalgliesh on August 15, 2010 11:50 PM

will voice recognition work?

voice recognition is good for general direction.

just lmao when i saw this video attempt to handle simple programming.

http://www.youtube.com/watch?v=boYjnZVlO5I

oww that was soo funny.

Jhnqwteh Gqrewhethn on October 3, 2010 6:18 AM

I started using voice-to-text recently. As a person who works at home, online, writing all day most every day, I recently developed repetitive-motion issues that start at my neck and include my entire right arm. Getting my computer to understand me seems to be the only solution.

I actually found this article while I was searching for the commands to type in a window other than notepad. I use the Vista VTTR software, and I really like it. I'd just like to be able to use it a lot more places than I currently do.

Does anybody know how I can get it to type '1' instead of 'one'? It seems fine once it hits 11. Then it automatically types numbers instead of alpha.

Besping on November 18, 2010 9:39 PM

The following interesting blog post articulates some options to improve the results of speech recognition – http://www.solicall.com/blog/?p=66

me.yahoo.com/a/PSK868lsuur8vmV4hbXgczVPvYgi on February 25, 2011 6:58 AM

The problem is in English itself, like in the linked article, the "recognize speech" / "wreck a nice beach" example. Without paying close attention to both the context of the conversation and the words themselves, the two are hardly distinguishable. The rules get even more relaxed around direct speech, popular phrases like "Beam me up" and jokes.

The ambiguity that particular words / phrases can be written down in several different ways is the problem. In my native language (http://en.wikipedia.org/wiki/Czech_language), I'm not aware of any word/phrase that you could write down in different ways without changing its meaning (or just trying to look stupid).

I guess there will be some cutting-edge AI programming and massive data processing behind the first real speech recognizer (the one which works with all/most languages on earth), but who knows :)

Pavel Odvody on May 3, 2011 7:14 PM

Even humans suck at speech recognition.
One day, I heard "... the coffee at Tim Horton's ..." as "... the coffee at 天河城(tian1he2cheng2) ..." because I went to that mall.
Recently, I heard "quantum quantum" in the song Hauu Nanodesu (http://www.youtube.com/watch?v=AjNPeKgLnC8 ); after I looked it up, it's "patapata". I think I read Wikipedia too much.
Having an excessive knowledge base will impede voice recognition.

(insert rants about natural languages here)

ProjSHiNKiROU on May 11, 2011 9:37 PM

The comments to this entry are closed.