Speak up! Tell your microwave, fridge and TV what to do
I've decided to like this entire column by dictating it into a phone. Oops. That's why there's already one mistake in the fast line… oops again… that should read the rest line… correction, that should read the Fust line… correction, that should read the fest line… correction, that should read the fast line… I'll keep going until I get this right.
I've changed my mind. I'm going back to typing - until words like "write" and "first" are instantly recognised.
My voice has definitely helped me over the past twenty years as I've travelled around the US.
Most Americans love a British accent, especially in the South. People really have cornered me in elevators hoping to elicit just "one more phrase" before they get out on their floor.
Unfortunately the same can't be said for electronic gadgets that are built to listen.
To-may-to, to-mah-to
Historically they've been Anglophobic. Even the latest iteration, Siri, found on iPhones, forces me against my will to choose "English (United States)" as the input language if I want location-aware results in New York, such as restaurant listings and other local information.
It assumes everybody in America speaks with an American accent. So that may explain the snafu in the first line of this column. Once the software is out of beta testing, I'm hoping I will be allowed to use a British accent setting on American soil.
At conventions and tech shows, countless public relations executives have been telling me almost every year since about 1997 that "THIS is THE year for speech technology".
Their carefully crafted demonstrations always seem to offer conclusive proof too - until I'm able to try the technology out for myself.
So I was pleasantly surprised on a recent visit to the headquarters of speech technology company Nuance Communications near Boston when senior executive Daniel Faulkner was refreshingly honest about the accuracy of speech technology.
"It will never get to 100%. Humans are not 100%," he says.
"I can call my relatives and we'll have to repeat ourselves a number of times and that can just be a factor of what's going on in the background, where we are, it can be a bad line, so all of those things apply to any automated system as well."
Vocal coaching
But the past couple of years have seen startling improvements. Accuracy in many applications is now in the mid-to-upper 90s in percentage terms. And two developments may accelerate research in the near future.
Firstly, recently launched mobile apps like Dragon Go and Siri are providing Nuance with a huge new stream of data to study.
Every time you talk into your device, your words are uploaded and stored on servers. This means Nuance can analyse intonation, accents and languages in minute detail and constantly improve its recognition algorithms.
Secondly, people are becoming more used to speaking "correctly" to their phones and web browsers. They discover over time that specific phrases, background noise and pace all play a part in the success of a spoken enquiry.
But there are still areas of our lives where local processing is the only choice.
For example, vehicles are rarely hooked up to the internet or to remote servers, and therefore the computer processor already installed by the car manufacturer has to handle speech recognition.
Unfortunately these have usually been of the cheapest, slowest kind, never designed for intensive operations like analysing the spoken word.
That's changing, according to Vlad Sejnoha, chief technology officer at Nuance, who says car makers have had to reinvent themselves as consumer electronics manufacturers.
"They have to build a good car but they also have to appeal to the user whose expectations are permanent connectivity, access to the latest media and songs, and the ability to connect and communicate with their friends. Business people need constant connectivity and communication in their car."
Tower of Babel
Of course, makers of TVs, microwaves, fridges, vacuum cleaners and the like will all have to look at adding voice capability to their devices. Assuming it works effectively, speech is usually more convenient than pushing buttons and turning dials.
Making fancy devices respond to the spoken word is only one very small part of speech technology research.
A lot of time and money is being poured into global language support. For example, Nuance has mapped 13 of the 22 officially recognised languages of India and is working on the other nine.
Speech can also be preferable in many applications in areas of high illiteracy. But the problem is how to collect the data: algorithm development relies on a large database of samples gathered in real-life situations.
In developed countries that's easy, thanks to the smartphone. But in places where people cannot afford such devices, the opportunity for data collection is reduced.
And in some countries it's considered rude to make customers deal with an automated voice system, which removes yet another source of samples.
Yet a luxury hotel that installs a voice-operated lift, for example, may want to incorporate every language in the world into the system rather than risk alienating some of its guests. The same reasoning could apply to a global airline that installs a speech-driven check-in system.
Complex equations
Peter Mahoney, chief marketing officer at Nuance, says speech technology is already having a big impact in certain areas - ironically, in occupations with complex vocabularies, like medicine and law, where software can differentiate between words extremely well.
"You are seeing a lot of people using a technology called voice writing. They use Dragon Dictate and they often use some kind of privacy microphone.
"They dictate everything that is going on in the courtroom proceeding. They say it very quickly and with special code so that they can identify who was saying what."
The advantage is that one person doing voice writing can create court records in real time, whereas a traditional stenographer has the additional step of interpreting their notes at the end of the day and then creating a final record after leaving the courtroom.
But there's one aspect of speech technology that has proven the most difficult to advance - multiple voices.
If two people talk over each other, recognition is hopeless; put several people in the same room and the technology is all but useless.
Researchers hope that one day it will be a reality, but for now they are content to get as close as they can to 100% accuracy for a single speaker.
And that is a project that will take a leng leng time. Oops.