Java Speech Development Kit: A Tutorial
The Speech for Java is a Java programming package, 100% object-oriented, that uses the speech recognition and synthesis technology known as ViaVoice, which was commercialized by IBM . This API makes it possible to build application interfaces that take advantage of speech.
The recognition and synthesis operations are not processed by the Java software. The kit is just a way to built speech-oriented interfaces. Applications "throw" speech processing to the software behind them, the IBM ViaVoice (commercial or free version).
In this way, just like any other interface programming models, the Speech for Java Development Kit (SDK) is also event-oriented. Now, however, these events are not fired by the mouse or the keyboard, they are created due to human speech via a microphone.
As mentioned the software is based upon two major features, speech synthesis and recognition. The recognition can be done through the use of grammars, which are entities providing information about how the recognition will be done. There are two kinds of grammars:
dictation: it's specially designed for continuous speech, when the software tries to determine what is being spoken using a large word database that associates a sound with each word. To increase accuracy and performance, contextual data is considered in order to establish the pronounced words. It is possible to use different dictation grammars for different domains, such as medicine, engineering or computer science.
rule: these grammars use rules, user's definitions of what might be spoken and how that will be interpreted by the application. The set of rules can take any size, but usually it is limited by the context in which the application is inserted. In this way, commands might be established in execution time to enhance responsiveness.
Voice synthesis might provide the application with simple string sentences as arguments to the method speak. An improved naturalness can be achieved via the use of a special markup language--Java Speech Markup Language (JSML). Through its use, properties like voice, frequency, rhythm and volume can be dynamically altered.
In figure 1, we can see that the sound hardware is controlled by the operating system. Right above it is the Engine.exe binary application, which is automatically initialized when the voice synthesis/recognition applications are started. The Engine is the heart of the IBM ViaVoice ( in the commercial or free versions). It is responsible for accessing all the ViaVoice's features.
Also, in figure 1, we can see the basic components of voice applications: the entities Recognizer and Synthesizer, the Central class and, especially, the Engine Interface, which is extended by the Synthesizer and Recognizer Interfaces. This interface has all the basic methods for controlling and accessing the ViaVoice processing engine.
The Engine Interface is important due to the fact that Java is a multiplatform language. For that reason, the same development kit is used on the UNIX, Linux and Windows systems, but each system has its own binary implementation of the IBM ViaVoice processing engine. The Engine interface hides the details of the platform-dependent software, offering proper access for the Recognizers and Synthesizers.
Now it is necessary to describe the Central class: it is in charge of implementing the Engine interface. The Central class is, in fact, responsible for abstracting platform details by providing the correct implementation of the Engine Interface. The recognizers and synthesizers extend the Engine Interface.
Below is a link to a code example that illustrates the most simple way to create an application with a recognizer and a synthesizer, with no working functionality.
The synthesizers, as their name indicates, are the entities responsible for speech synthesis. They are created through the use of the Central class, which implements the Engine interface and acts as a connection to the synthesis provided by the IBM ViaVoice technology.
Creating a voice synthesizer can happen in one of two ways, and both are use the Central class static method, createSynthesizer:
1. Accessing the default synthesizer of a determined locale is the most simple and common method. It usually establishes access to the synthesizer implementation distributed with the ViaVoice software. It might be done as shown with the code below.
Locale.setDefault("en","US"); Synthesizer sintethesizer = Central.createSynthesizer(null);
2. Accessing a synthesizer that satisfies the conditions defined through the arguments passed on by the createSynthesizer method is the second way. This method is used in cases where more than one synthesizer is available. The parameters are:
name of the engine
name of the mode in which it will be used
a locale supported by the engine
a Boolean value, a control flag of the engine
an array of objects Voice that will be used
These parameters are defined creating an object SynthesizerModeDesc that will be passed to Central.createSynthesizer, as seen here:
public SynthesizerModeDesc(String engineName, String modeName, Locale locale, Boolean running, Voice voices)
Remember that any of the attributes can be null, and the Central class will be responsible for identifying the best synthesizer to fit the conditions.
Synthesizing voice: Once we have created the synthesizer, we can access its functions with the speak method. A simple String argument is enough for the basic features of synthesis, but there are other possibilities to increase naturalness of computer speech. The most powerful one of them is JSML (Java Speech Markup Language, covered in the next section), which provides various techniques to make the speech more similar to human voice.
Table 1 shows all the forms of the speak method:
Table 1. Methods Used for Voice Synthesis
void speakPlainText(String text, SpeakableListener listener
Speak a plain text string. The text is not interpreted as containing the Java Speech Markup Language, so JSML elements are ignored.
void speak(Speakable JSMLtext, SpeakableListener listener)
Speak an object that implements the Speakable interface and provides text marked with the Java Speech Markup Language.
void speak(URL JSMLurl, SpeakableListener listener)
Speak text from a URL formatted with the Java Speech Markup Language. The text is obtained from the URL, checked for legal JSML formatting and placed at the end of the speaking queue.
void speak(String JSMLText, SpeakableListener listener)
Speak a string containing text formatted with the Java Speech Markup Language. The JSML text is checked for formatting errors, and a JSMLException is thrown if any are found.
The speakable objects are members of classes that implement the speakable interface. This interface has only one method, getJSMLText. This method specifies a JSML String to be returned when the object is submitted to the speak method. An example can be seen in the following sample code.
The SpeakableListener: To the methods of table 1, an extra element may be attached, a SpeakableListener. It will receive specific events for each pronounced word. Different events are generated during the synthesis process, and these events can be used to take control of the speech process, enabling a more interactive application. They indicate when a new word starts to be pronounced, if its synthesis was canceled and if it is over or was paused, among other events that allow the monitoring of the synthesis process.
The events are instances of the SpeakableEvent and are thrown by the synthesizer to be caught and treated by the listener. These entities carry information about the spoken word. More detail. The listeners are optional, and may be bypassed with a null argument to the synthesis methods.
The speakableListeners might be used in two ways:
associated with the listeners through the method speak, refer to table 1. This will define a listener for each item added to the items queue of the synthesizer. One listener might be shared by any number of queued items.
associated with the Synthesizer object through the addSpeakableListener method. This way the listener will receive the events of all the queued items in a determined synthesizer.
The listeners associated via the speak method will receive the events before the ones associated via the addSpeakable Listener.
The items queue: a synthesizer implements a queue of items provided to it through the speak and speakPlainText methods. The queue is "first-in, first-out" (FIFO)--the objects are spoken in the exactly order in which they were received. The object at the top of the queue is the object that is currently being spoken or about to be spoken. The QUEUE_EMPTY and QUEUE_NOT_EMPTY states of a Synthesizer indicate the current state of the speech output queue. The state handling methods inherited from the Engine interface (getEngineState, waitEngineState and testEngineState) can be used to test the queue state. The items on the queue can be checked with the enumerateQueue method, which returns a snapshot of the queue. The cancel methods allows an application to:
stop the output of an item currently on the top of the speaking queue.
remove an arbitrary item from the queue.
remove all items from the output queue.
The Voice: as an additional function of the synthesizers, we have the ability to choose the Voice that will be used in the synthesis. The parameters that must be provided are:
gender: GENDER_MALE, GENDER_FEMALE, GENDER_NEUTRAL and GENDER_DONT_CARE.
age: AGE_CHILD, AGE_DONT_CARE, AGE_MIDDLE_ADULT, AGE_NEUTRAL, AGE_OLDER_ADULT, AGE_TEENAGER and AGE_YOUNGER_ADULT.
To determine the association of the voice and the synthesizer, it is necessary to recover a SynthesizerProperties object through the getSynthesizerProperties method and determine the voice using setVoice. This can be more clearly understood in the following example code.
- A Switch for Your Pi
- Papa's Got a Brand New NAS
- Applied Expert Systems, Inc.'s CleverView for TCP/IP on Linux
- Returning Values from Bash Functions
- Simplenote, Simply Awesome!
- Tech Tip: Really Simple HTTP Server with Python
- Rogue Wave Software's TotalView for HPC and CodeDynamics
- Panther MPC, Inc.'s Panther Alpha
- Debugging Democracy
- NethServer: Linux without All That Linux Stuff