Speech I/O for Embedded Applications
Speech user interfaces are something of a holy grail for computing. We talk to each other to communicate, and sci-fi stories, from HAL in 2001: A Space Odyssey to the ship's computer in Star Trek, point to talking computers as the inevitable future. But creating speech interfaces that are natural and that people will actually use has proven difficult. Too often, speech technology is provided, or even preinstalled (as with Microsoft Windows Speech Recognition), and never used. Still, there are glimmers of hope. The technology to do “decent” speech recognition and speech synthesis has existed for a while now, and users are trying it out, at least in some application categories.
The time feels ripe for someone to get the speech interface right. Maybe you're the one to invent a speech interface that makes your embedded application as cool and unique as the iPhone touch interface was when it first came out.
In some ways, embedded applications are particularly well suited for speech. An embedded device often is physically small and may not have a rich user interface. Almost by definition, embedded applications are not general-purpose, so it's okay if a speech interface has a limited vocabulary. Speech may be the only user interface provided, or it may augment a display and keyboard.
Mobile phones are one class of embedded applications where speech works as a user interface. Voice dialing (“dial home”) is an almost trivial interface that works very well on phones. If you're driving and want to send a text message, it's difficult (and in many places illegal) to use the phone's soft keyboard to enter the message and its destination. Speech recognition is now good enough, and mobile phones are powerful enough computers, that people are starting to send text messages by voice.
In this article, I examine technologies for speech synthesis and recognition and see how they fit with today's embedded devices. As an example application, and in step with the rediscovery of checklists as productivity tools (thanks to Atul Gawande's best-seller The Checklist Manifesto), we'll build a simple vocal checklist that you can use the next time you do surgery, like Dr. Gawande (kids, don't try this at home).
As with any other user interface, a speech interface has two components: input and output (or recognition and synthesis). The two technologies are closely related, sharing techniques, algorithms and data models. Speech has been a very popular computing research topic, and I can't cover all the work here, but I take a quick look at some different approaches, investigate some open-source implementations and settle on input and output packages that seem well suited for embedded applications. You don't have to be an expert in computerized speech (I certainly don't claim to be) to speech-enable your embedded application.
Naïvely, you might think “What's so hard about speech synthesis?” You envision a hashmap with English words as the keys and speech utterances as the values. But it's not that easy. Any nontrivial text-to-speech (TTS) system needs to understand things like dates and numbers that are embedded in the text and utter them properly. And, as any first-grader can tell you, English is full of words whose pronunciation is context-dependent (should “lead” be pronounced as rhyming with “reed” or “red”?). We also vary the pitch of our voices as we come to the end of a sentence or question, and we pause between clauses and sentences. Together, these variations in pitch and timing are called the prosody of the speech.
A lot of smart people have thought this over and have come up with a basic architecture for TTS:
A front end to analyze the text, replace dates, numbers and abbreviations with words, and emit a stream of phonemes and prosodic units that describes the utterance.
A back end, or synthesizer, that takes the utterance stream and converts it to sounds.
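The substitution work of the front end can be sketched in a few lines of Python. This is only a minimal sketch of the simplest approach, plain word substitution. The abbreviation table and the function names are hypothetical, it spells out only integers from 0 to 999, and it makes no attempt at dates or context-dependent pronunciation.

```python
import re

# Hypothetical abbreviation table; a real normalizer would need far more.
ABBREVIATIONS = {"Dr.": "Doctor", "mg": "milligrams", "approx.": "approximately"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _spell_match(match):
    """Replace a run of digits with words; leave large numbers untouched."""
    n = int(match.group())
    return number_to_words(n) if n <= 999 else match.group()


def normalize(text):
    """Expand known abbreviations, then spell out small integers."""
    for abbr, full in ABBREVIATIONS.items():
        # Match the abbreviation only when it stands alone as a token.
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, full, text)
    return re.sub(r"\d+", _spell_match, text)


print(normalize("Dr. Gawande ordered 250 mg at step 3."))
```

Even this toy version shows where the trouble starts: it has no way to tell whether “Dr.” means “Doctor” or “Drive”, which is exactly the kind of context a real front end has to handle.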
The front end's job, sometimes called text normalization, is not an easy problem. It's one of those pattern-recognition tasks that humans do easily and computers have a difficult time mimicking. The algorithms used vary from simple (word substitution) to complex (statistical hidden Markov models). For applications where the text to be spoken is relatively fixed (like our checklist), most TTS systems provide a way of marking up the text to give the normalizer hints about how it should be spoken. There is also a standard, the Speech Synthesis Markup Language (SSML), for doing so; see Resources.
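As a rough illustration of such markup, the following Python sketch wraps one checklist item in Speech Synthesis Markup Language (SSML) using the standard library's ElementTree. The speak, say-as and break elements come from the SSML standard, but the function name, the item format and the “ordinal” hint are my assumptions, and whether a given TTS engine honors that hint varies. A fully conforming document would also declare the SSML namespace on the speak element, which I have left out for brevity.

```python
import xml.etree.ElementTree as ET


def checklist_item_to_ssml(number, text, pause_ms=500):
    """Build SSML for one checklist item (hypothetical helper).

    The item number is marked with a say-as hint, followed by a short
    pause, followed by the item text itself.
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})

    # Hint to the normalizer that "1" should be read as "first".
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "ordinal"})
    say_as.text = str(number)

    # Insert an explicit pause before the item text.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = text

    return ET.tostring(speak, encoding="unicode")


print(checklist_item_to_ssml(1, "Confirm patient identity."))
```

The resulting string is what you would hand to an SSML-aware synthesizer in place of plain text.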
A variety of schemes have been developed to build speech synthesizers. The two most popular seem to be formant synthesis and concatenation.
Formant synthesizers can be quite small, because they don't actually store any digitized voice. Instead, they model speech with a set of rules and store time-based parameters for models of each phoneme. The prosodic aspects of speech are relatively easy to introduce into the models, so formant synthesizers are noted for their ability to mimic emotions. They also are noted for sounding “robotic” while remaining very intelligible. For our chosen application, intelligibility is more important than “naturalness”.
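To make the idea concrete, here is a minimal sketch of a formant synthesizer for steady vowels, written in Python. It excites a cascade of two-pole resonators with an impulse train that stands in for the vocal cords. The formant frequencies are approximate textbook values for an adult male voice, and the bandwidths, sample rate and function names are assumptions chosen for illustration, not taken from any particular TTS package.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)

# Approximate first three formant frequencies in Hz (illustrative values).
VOWEL_FORMANTS = {
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}
BANDWIDTHS = (60, 90, 150)  # assumed resonance bandwidths in Hz


def resonator(signal, freq, bandwidth):
    """Filter a signal through one two-pole digital resonator."""
    c = -math.exp(-2 * math.pi * bandwidth / SAMPLE_RATE)
    b = (2 * math.exp(-math.pi * bandwidth / SAMPLE_RATE)
         * math.cos(2 * math.pi * freq / SAMPLE_RATE))
    a = 1 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(vowel, pitch_hz=110, duration_ms=300):
    """Return normalized samples (floats in -1..1) for a steady vowel."""
    n = SAMPLE_RATE * duration_ms // 1000
    period = SAMPLE_RATE // pitch_hz
    # Impulse train: one pulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Shape the spectrum with one resonator per formant.
    for freq, bandwidth in zip(VOWEL_FORMANTS[vowel], BANDWIDTHS):
        signal = resonator(signal, freq, bandwidth)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


samples = synthesize_vowel("a")
print(len(samples), "samples")
```

Note that the whole “voice” is three small tuples of numbers, which is why these synthesizers can be so compact. Pitch is just a parameter here, so lowering pitch_hz toward the end of a sentence would be a crude first step toward prosody. The samples could be scaled to 16-bit integers and written out with Python's standard wave module to hear the result.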
Concatenative synthesizers have a database of speech snippets that are strung together to create the final sound stream. The snippets can be anything from a single phoneme to a complete sentence. Concatenative synthesizers are known for natural-sounding speech, although the technique can produce speech with distracting glitches, which can interfere with intelligibility, particularly at higher speeds. They also are typically larger than formant synthesizers, due to the large database required for a large vocabulary. The database can be minimized if the TTS is for a domain-specific application, but, of course, that limits its usefulness.
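The core of the concatenative approach can be sketched as a lookup followed by a join. The following Python sketch assumes a tiny, domain-specific database for a checklist. The snippet table, its placeholder “samples” (short lists of integers standing in for recorded audio) and the function names are all hypothetical. It prefers the longest recorded unit available, on the assumption that longer snippets mean fewer joins and therefore fewer glitches.

```python
# Hypothetical snippet database: word sequence -> recorded samples.
# The integer lists are placeholders for real digitized audio.
SNIPPETS = {
    ("check",): [1, 2],
    ("patient",): [3],
    ("identity",): [4, 5],
    ("patient", "identity"): [6, 7, 8],  # a longer recorded unit
    ("confirmed",): [9],
}
MAX_UNIT = max(len(key) for key in SNIPPETS)


def select_units(words):
    """Greedily cover the word list with the longest available snippets."""
    units = []
    i = 0
    while i < len(words):
        for size in range(min(MAX_UNIT, len(words) - i), 0, -1):
            key = tuple(words[i:i + size])
            if key in SNIPPETS:
                units.append(key)
                i += size
                break
        else:
            # Nothing in the database covers this word.
            raise KeyError(f"no recorded snippet for {words[i]!r}")
    return units


def concatenate(text):
    """String the selected snippets together into one sample stream."""
    samples = []
    for unit in select_units(text.lower().split()):
        samples.extend(SNIPPETS[unit])
    return samples


print(select_units("check patient identity".split()))
print(concatenate("Check patient identity confirmed"))
```

The KeyError for an unknown word is the domain-specific trade-off in miniature: a small database keeps the synthesizer compact, but it can speak only what was recorded for it.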
Rick Rogers has been a professional embedded developer for more than 30 years. Now specializing in mobile application software, when Rick isn't writing software for a living, he's writing books and magazine articles like this one.