Speech I/O for Embedded Applications
In contrast to TTS, there is one dominant algorithm for speech recognition, hidden Markov models (HMMs). If you haven't run into HMMs before, don't expect me to explain the math in detail here, because, frankly, I don't completely understand it. I do understand the idea behind HMMs, and that's more than what you need to know to use an HMM-based recognizer.
If you sample a speech waveform (say every 10ms) and do some fancy math on the resulting waveform that extracts frequency and amplitude components, you can end up with a vector of cepstral coefficients. You then can model connected speech as a series of these cepstral vectors. A Markov model is like a state machine where the probability of a particular state transition is dependent only on the current state. In our case, each state of the Markov model corresponds to a particular vector, and as a Markov model moves probabilistically through its states, a series of cepstral vectors and sounds are generated. A hidden Markov model is one where you can't see the details of the state transitions, you just see the output vectors.
The trick is to create a bunch of these HMMs, each trained to mimic the sounds from a bunch of human-generated speech samples. Again, the math is beyond me here, but the process is to expose a training algorithm to a lot of speech samples for the language desired. As the sea of HMMs is trained, they take on the ability to reproduce the sounds they “hear” in the training samples.
To use the HMMs to recognize speech, we use one last bit of mathematical wizardry. For appropriate sets of HMMs, there are algorithms that, given a waveform (that is hopefully speech), can tell you: 1) which sequences of HMMs might have generated that waveform and 2) the probability for each of those sequences.
So, HMMs won't give us a definitive answer of what words the speech represents, but they'll give us a list from which to choose and tell us which is most likely and by how much. How cool is that?
Many commercial TTS packages are available, but they don't concern us here. On the open-source side, there are still many candidates, with a few that seem more popular:
eSpeak is the TTS package that comes with Ubuntu and several other Linux distros. It is of the formant flavor and, therefore, small (~1.4MB), with the usual robotic formant voice. The eSpeak normalizer also can be used with a diphone synthesizer (MBROLA) if desired, but we won't take advantage of that for the checklist example here.
Flite is the embedded version of Festival, which is an open-source speech synthesis package originating at University of Edinburgh, with Flite done at Carnegie Mellon University. It is diphone-based concatenative, and as you would expect, it has a more natural voice. CMU also offers a set of scripts and tools for developing new voices, called FestVox.
Most speech recognition packages are commercial software for Windows and Mac OS X. I looked at two open-source speech recognition packages, both from the Sphinx group at Carnegie Mellon:
Sphinx-4 is a speech recognizer and framework that can use multiple recognition approaches, written in Java. It is intended primarily for server applications and for research.
PocketSphinx is a speech recognizer derived from Sphinx and written in C. As such, it is much smaller than Sphinx (but still around 20MB for a moderate vocabulary), and it runs in real time on small processors, even those with no floating-point hardware.
PocketSphinx is the obvious choice between the two implementations, so that's what we'll use here.
In the interest of flexibility and speed, I've chosen a rather high-end embedded platform for the example program. The Genesi LX is a nettop with a rather generous configuration for an embedded device:
Freescale i.MX515 (ARM Cortex-A8 800MHz).
3-D graphics processing unit.
WXGA display support (HDMI).
Multi-format HD video decoder and D1 video encoder.
512MB of RAM.
8GB internal SSD.
10/100Mbit/s Ethernet.
802.11 b/g/n Wi-Fi.
SDHC card reader.
2x USB 2.0 ports.
Audio jacks for headset.
Built-in speaker.

Figure 1. Genesi EFIKA MX Smarttop
On top of all that, there is a version of the Ubuntu 10.10 (Meerkat) distro available to load and run on the system, which makes installation and testing a lot easier. The download for Ubuntu and installation instructions are on the Genesi Web site. Installation from an SD card is straightforward through the U-boot bootloader.
Rick Rogers has been a professional embedded developer for more than 30 years. Now specializing in mobile application software, when Rick isn't writing software for a living, he's writing books and magazine articles like this one.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- New Products
- RSS Feeds
- Trying to Tame the Tablet
- What's the tweeting protocol?
- Dart: a New Web Programming Experience




4 hours 3 min ago
6 hours 25 min ago
23 hours 13 min ago
1 day 1 hour ago
1 day 3 hours ago
1 day 3 hours ago
1 day 4 hours ago
1 day 8 hours ago
1 day 9 hours ago
1 day 11 hours ago