Quantcast
Username/Email:  Password: 

Java Speech Development Kit: A Tutorial

 in
The authors show how to get started developing voice-activated interfaces using the Speech for Java Development Kit.

The Speech for Java is a Java programming
package, 100% object-oriented, that uses the speech recognition and
synthesis technology known as ViaVoice, which was commercialized by
IBM . This API makes it possible to build application interfaces
that take advantage of speech.The recognition and synthesis operations are not processed by
the Java software. The kit is just a way to built speech-oriented
interfaces. Applications "throw" speech processing to the software
behind them, the IBM ViaVoice (commercial or free version).In this way, just like any other interface programming
models, the Speech for Java Development Kit (SDK) is also
event-oriented. Now, however, these events are not fired by the
mouse or the keyboard, they are created due to human speech via a
microphone.As mentioned the software is based upon two major features,
speech synthesis and recognition. The recognition can be done
through the use of grammars, which are entities providing
information about how the recognition will be done. There are two
kinds of grammars:

  • dictation: it's specially designed for continuous
    speech, when the software tries to determine what is being spoken
    using a large word database that associates a sound with each word.
    To increase accuracy and performance, contextual data is considered
    in order to establish the pronounced words. It is possible to use
    different dictation grammars for different domains, such as
    medicine, engineering or computer science.
  • rule: these grammars use rules, user's definitions
    of what might be spoken and how that will be interpreted by the
    application. The set of rules can take any size, but usually it is
    limited by the context in which the application is inserted. In
    this way, commands might be established in execution time to
    enhance responsiveness.

Voice synthesis might provide the application with simple
string sentences as arguments to the method speak. An improved
naturalness can be achieved via the use of a special markup
language--Java Speech Markup Language (JSML). Through its use,
properties like voice, frequency, rhythm and volume can be
dynamically altered.ArchitectureFigure 1. Simplified Architecture of
the IBM SDK Java Technology
In figure 1, we can see that the sound hardware is controlled
by the operating system. Right above it is the Engine.exe binary
application, which is automatically initialized when the voice
synthesis/recognition applications are started. The Engine is the
heart of the IBM ViaVoice ( in the commercial or free versions). It
is responsible for accessing all the ViaVoice's features.Also, in figure 1, we can see the basic components of voice
applications: the entities Recognizer and Synthesizer, the Central
class and, especially, the Engine Interface, which is extended by
the Synthesizer and Recognizer Interfaces. This interface has all
the basic methods for controlling and accessing the ViaVoice
processing engine.The Engine Interface is important due to the fact that Java
is a multiplatform language. For that reason, the same development
kit is used on the UNIX, Linux and Windows systems, but each system
has its own binary implementation of the IBM ViaVoice processing
engine. The Engine interface hides the details of the
platform-dependent software, offering proper access for the
Recognizers and Synthesizers.Now it is necessary to describe the Central class: it is in
charge of implementing the Engine interface. The Central class is,
in fact, responsible for abstracting platform details by providing
the correct implementation of the Engine Interface. The recognizers
and synthesizers extend the Engine Interface.Figure 2. Full Architecture of the
IBM SDK Java Technology
Below is a link to a code example that illustrates the most
simple way to create an application with a recognizer and a
synthesizer, with no working functionality.Example
Code
SynthesizersThe synthesizers, as their name indicates, are the entities
responsible for speech synthesis. They are created through the use
of the Central class, which implements the Engine interface and
acts as a connection to the synthesis provided by the IBM ViaVoice
technology.Creating a voice synthesizer can happen in one of two ways,
and both are use the Central class static method,
createSynthesizer:1. Accessing the default synthesizer of a determined locale
is the most simple and common method. It usually establishes access
to the synthesizer implementation distributed with the ViaVoice
software. It might be done as shown with the code below.

             Locale.setDefault("en","US");
             Synthesizer sintethesizer =
             Central.createSynthesizer(null);

Example
Code
2. Accessing a synthesizer that satisfies the conditions
defined through the arguments passed on by the createSynthesizer
method is the second way. This method is used in cases where more
than one synthesizer is available. The parameters are:

  • name of the engine
  • name of the mode in which it will be used
  • a locale supported by the engine
  • a Boolean value, a control flag of the
    engine
  • an array of objects Voice that will be used

These parameters are defined creating an object
SynthesizerModeDesc that will be passed to
Central.createSynthesizer, as seen here:public SynthesizerModeDesc(String engineName, String
modeName, Locale locale, Boolean running, Voice[] voices)Example
Code
Remember that any of the attributes can be null, and the
Central class will be responsible for identifying the best
synthesizer to fit the conditions.Synthesizing voice: Once we
have created the synthesizer, we can access its functions with the
speak method. A simple String argument is enough for the basic
features of synthesis, but there are other possibilities to
increase naturalness of computer speech. The most powerful one of
them is JSML (Java Speech Markup Language, covered in the next
section), which provides various techniques to make the speech more
similar to human voice.Table 1 shows all the forms of the speak method:Table 1. Methods Used for Voice SynthesisMethodFunctionvoid speakPlainText(String text, SpeakableListener
listenerSpeak a plain text string. The text is not interpreted
as containing the Java Speech Markup Language, so JSML elements are
ignored.void speak(Speakable JSMLtext, SpeakableListener
listener)Speak an object that implements the Speakable
interface and provides text marked with the Java Speech Markup
Language.void speak(URL JSMLurl, SpeakableListener
listener)Speak text from a URL formatted with the Java Speech
Markup Language. The text is obtained from the URL, checked for
legal JSML formatting and placed at the end of the speaking
queue.void speak(String JSMLText, SpeakableListener
listener)Speak a string containing text formatted with the Java
Speech Markup Language. The JSML text is checked for formatting
errors, and a JSMLException is thrown if any are found.Example
Code
The speakable objects are members of classes that implement
the speakable interface. This interface has only one method,
getJSMLText. This method specifies a JSML String to be returned
when the object is submitted to the speak method. An example can be
seen in the following sample code.The SpeakableListener: To
the methods of table 1, an extra element may be attached, a
SpeakableListener. It will receive specific events for each
pronounced word. Different events are generated during the
synthesis process, and these events can be used to take control of
the speech process, enabling a more interactive application. They
indicate when a new word starts to be pronounced, if its synthesis
was canceled and if it is over or was paused, among other events
that allow the monitoring of the synthesis process.The events are instances of the SpeakableEvent and are thrown
by the synthesizer to be caught and treated by the listener. These
entities carry information about the spoken word. More detail. The
listeners are optional, and may be bypassed with a null argument to
the synthesis methods.The speakableListeners might be used in two ways:

  1. associated with the listeners through the method
    speak, refer to table 1. This will define a listener for each item
    added to the items queue of the synthesizer. One listener might be
    shared by any number of queued items.
  2. associated with the Synthesizer object through the
    addSpeakableListener method. This way the listener will receive the
    events of all the queued items in a determined synthesizer.

The listeners associated via the speak method will receive
the events before the ones associated via the addSpeakable
Listener.The items queue: a
synthesizer implements a queue of items provided to it through the
speak and speakPlainText methods. The queue is "first-in,
first-out" (FIFO)--the objects are spoken in the exactly order in
which they were received. The object at the top of the queue is the
object that is currently being spoken or about to be spoken. The
QUEUE_EMPTY and QUEUE_NOT_EMPTY states of a Synthesizer indicate
the current state of the speech output queue. The state handling
methods inherited from the Engine interface (getEngineState,
waitEngineState and testEngineState) can be used to test the queue
state. The items on the queue can be checked with the
enumerateQueue method, which returns a snapshot of the queue. The
cancel methods allows an application to:

  • stop the output of an item currently on the top of
    the speaking queue.
  • remove an arbitrary item from the queue.
  • remove all items from the output queue.

Example
Code
The Voice: as an additional
function of the synthesizers, we have the ability to choose the
Voice that will be used in the synthesis. The parameters that must
be provided are:

  • gender: GENDER_MALE, GENDER_FEMALE, GENDER_NEUTRAL
    and GENDER_DONT_CARE.
  • age: AGE_CHILD, AGE_DONT_CARE, AGE_MIDDLE_ADULT,
    AGE_NEUTRAL, AGE_OLDER_ADULT, AGE_TEENAGER and
    AGE_YOUNGER_ADULT.

To determine the association of the voice and the
synthesizer, it is necessary to recover a SynthesizerProperties
object through the getSynthesizerProperties method and determine
the voice using setVoice. This can be more clearly understood in
the following example code.Example
Code
Java Speech Markup LanguageJava Speech Markup Language (JSML) is a markup language
derived from XML and is used to describe an audio entrance in the
synthesizers of the API. Its elements are able to supply detailed
information about how to pronounce the text provided to the
synthesizer. These elements describe the structure of a document
(paragraphs, phrases), the words pronounced, text markers and
prosodic elements that control phrasing, emphasis, pitch, speaking
rate and other important characteristics. Appropriate markup of
text improves the quality and naturalness of the synthesized
voice.JSML is used as arguments to the synthesizers and can be
accessed through one of the forms of the method speak. It might be
provided in a text file with the JSML formatted text or in small
phrases as arguments for the method. Table 4 shows the JSML
tags:Table 2. JSML TagsElementDescriptionExamplePARADetermines that the marked text must be pronounced as
a paragraph.<PARA>This is a short paragraph.</PARA>
<PARA>The subject has changed, so this is a new
paragraph.</PARA>SENTDetermines that the marked text must be pronounced as
a phrase.<PARA><SENT>So
long.</SENT><SENT>See you
later.</SENT></PARA>SAYASSpecifies how the marked text must be read.<SAYAS SUB="0x0">zero draw</SAYAS>EMPDetermines the emphasis of the pronounce. The values
might be: "strong", "moderate", "none", or "reduced".<EMP LEVEL="reduced">Whispering is not
polite.</EMP>BREAKDetermines a stop during synthesis. Milliseconds or
size can be used: "none", "small", "medium", or "large".<BREAK MSECS="300"/> or <BREAK
SIZE="small"/>PROSDetermines the speech properties: volume (VOL absolute
number between 0.0 and 1.0), rhythm (RATE in words per minute),
frequency (PITCH in hertz), frequency interval (RANGE in
hertz).<PROS RATE= "50" VOL="+80%">Í'll repeat
loud and clear, just one more time. </PROS>MARKERWhen this tag is reached a markerReached event is
triggered, this might be interpreted by a SpeakableListener. A tag
MARKER can be positioned anywhere in the JSML text, making it
possible to monitor the whole text being pronounced.Answer <MARKER MARK="yes_no"/> yes or no.ENGINEProvides information to the synthesizer identified by
the ENGID, that will pronounce what is defined by the DATA
parameter. If the synthesizer is not available, the marked text
will be pronounced instead.The frog does <ENGINE ENGID= "Frog_Sint1.0"
DATA="ribbit=1"> improper synthesizer</ENGINE>Example
Code
RecognizersThe recognizers are the entities responsible for speech
recognition. They are created through the Central class and are
initialized using the allocate method. They process the audio
entering in parallel to the main application.The recognition is done with the help of a set of words that
limit the universe of possibilities that can be pronounced by the
user. These sets are extensible, and new words can be added,
increasing the application functionality. The sets are known as
grammars, and it is necessary to have at least one grammar
associated with each recognizer.When audio enters the recognizers, they try to determine an
association between the spoken words and their associated grammars.
The recognition process culminates on the creation of objects named
results. Results have types ACCEPTED or REJECTED accordingly to the
confidence level determined by the setConfidenceLevel method (the
default value is 0.5).Creating a Voice RecognizerIn the same way as the synthesizers, the recognizers are
created through the Central class in two possible ways:1. Accessing the recognizer's default implementation that
better fits the current locale. It is the simplest manner and the
most common:

                 Locale.setDefault("en","US");
                 Recognizer recognizer =
                 Central.createRocognizer(null);

Example
Code
2. Accessing a recognizer that satisfies predetermined
conditions, this method is used in cases where there are more than
one recognizer implementation available. The required parameters
are:

  • name of the engine
  • name of the mode in which it will be used
  • a locale supported by the engine
  • a Boolean value, a control flag of the
    engine
  • a Boolean value, indicating whether the recognizer
    supports dictation grammar
  • an array of SpeakerProfile objects

Any of the parameters might be null, and the Central class
will try to make the best choice for them. These parameters are
defined via a RecognizerModeDesc object, as shown here:

 RecognizerModeDesc(String engine, String mode, Locale locale, Boolean running,
 Boolean dictationGrammarSupported, SpeakerProfile[] profiles)

Example
Code
RecognitionThe voice recognition is based upon a set of words (phonemes)
named grammars. A created recognizer can have associated grammars,
increasing its recognition domain according to the application
needs. Having associated grammars, the recognizers generate result
objects as answers to the recognition process. These results
contain information about what was understood. This information is
made available through ResultEvents fired by the results and caught
by ResultListener, where the implementation determines what to do
with the recognized data.To implement this schema the recognizer can have two types of
associated grammars, each one of them with its own word domain.
They have distinct size and elements and determine a different
recognition process:

  • Dictation
    grammars:
    these grammars are comprised of sets of words
    with thousands of elements. Their elements are determined by the
    function of their comprised domain, such as medicine, computer
    science or engineering. A general purpose grammar is available with
    the development kit.
    Dictation grammars analyze the context of what is being
    spoken and try to limit the set of words that can be said next.
    Their use is specific to continuous dictating speech, such as texts
    and sentences. Great processing capacity is needed, therefore, and
    its use requires well projected applications.The creation of a dictation grammar is done through a
    recognizer that supports its use. It is illustrated in here:

             DictationGrammar dictationGrammar =
             recognizer.getDictationGrammar(null);
             dictationGrammar.setEnabled(true);
  • Rule grammars:
    these are simpler than dictation grammars. They are comprised of
    rules that determine what can be said and how it will be
    represented inside the application. The rules are defined in a
    special format named JSGF (Java Speech Grammar Format) and can be
    loaded one at a time or many at once during execution.
    Rule grammars require less processing capacity than dictation
    grammars and its recognition accuracy is much higher, resulting in
    a bigger number of accepted results. This is possible because of
    their limited scope of recognition. It is a grammar well suited for
    voice-command definitions, determining a specific set of commands
    for the various contexts of an application.
  • There are two ways to create a rule grammar:

    • through the loadJSGF method, when a set of rules is
      loaded from a text file; or
    • through the newRuleGrammar method, which returns an
      empty rule grammar that must be specified during execution.

    Example
    Code
    Both methods belong to the recognizer interface, being the
    grammar automatically associated with the recognizer that created
    it.Java Speech Grammar FormatThe speech recognition systems make it possible for the
    computer to hear what a user says and "understand" what was spoken.
    Speech for Java does not yet support speech recognition without
    restrictions; the ability to hear any speech in any context and
    understand it accurately. To reach a reasonable accuracy and answer
    time, the recognition limits what might be said through the use of
    grammars, as seen in the previous section. The Java Speech Grammar
    Format (JSGF) language defines a way to describe what is named as a
    RuleGrammar. It uses a textual representation that might be read
    and edited by both developers and users through a simple text
    editor.The rules have the following structure:Figure 3. JSGF Rule
    Structure
    If something is said and the recognition is not above a
    confidence level--defined through the use of the method
    setConfidenceLevel (the default is de 0.5)--a resultRejected
    rejection event will be triggered. Otherwise a resultAccepted event
    is created, and the method getTags returns the identifier(s) of the
    correspondent rule. Table 3 shows the main JSGF features:Table 3. Main Uses of the JSGF LanguageUse ExampleWhat Might Be SaidSimple words <rule1> = open {open};"open"Whole phrases <rule2> = open the program {open};"open the program"Multiple choices <rule3> = open| open the program {open};"open" or "open the program"Optional words <rule5> = [please] open {open};"open" or "please open"Rule referencing <rule6> = <regra5> {open};The same as rule 5Enclosing operator * <rule7> = 13579* times;"1", "3", "5", "7", "9", "11", "13", "99", "93", or
    another number (including none) formed by 1, 3, 5, 7 and 9 in any
    order followed by timesEnclosing operator + <rule8> = 13579+ times;This time a number must be said following the pattern
    seen in rule7Here is an
    example illustrating the use of this
    technique.
    ListenersThe IBM ViaVoice Java technology is based on the same model
    used in graphic user interface (GUI) programming, that is, it
    depends on events generated by the user, events that are
    intercepted by listeners. These listeners are interfaces
    implemented by software developers, in this way any kind of
    processing can be triggered by specific speech events.All speech events are derived from the class SpeechEvent,
    each one with specific information of the occurrence that fired
    it.EngineListener: interface
    defining methods to be called when state-change events for a speech
    engine occur. To receive engine events an application attaches a
    listener by calling the addEngineListener method of an Engine. A
    listener is removed by a call to the removeEngineListener
    method.=> Associated event: EngineEvent or one of its derived
    classes.The main events treated are engineAllocated,
    engineAllocatingResources, engineDeallocated,
    engineDeallocatingResources, engineError, enginePaused and
    engineResumed.They can be understood by observing Figure 4.Figure 4. Engine
    States
    The Engine's working cycle is shown in Figure 4: first the
    engine is created in DEALLOCATED state, to initially use it is
    necessary to call the allocate method. Then the engine passes to
    the temporary ALLOCATING_RESOURCES state, staying in this condition
    for a machine-dependent time; finally it reaches the ALLOCATED
    state with its substates: RESUMED (default) indicating engine
    activity and PAUSED indicating suspended activities, the transition
    between these states is possible using pause and resume methods. By
    the end of the engine use, it is recommended that the hardware
    resources be explicitly freed through the use of the deallocate
    method. After that the engine enters the DEALLOCATING_RESOURCES
    state and finally gets back to its initial DEALLOCATED state. Since
    recognizer and synthesizers inherit from the Engine Interface, they
    also follow this cycle.For each of the five possible states a unidirectional
    transition causes an event to be fired, in a total of six events
    plus an engineError possible event. Namely, engineAllocated,
    engineAllocatingResources, engineDeallocated,
    engineDeallocatingResources, engineError, enginePaused and
    engineResumed.SynthesizerListener:
    listener dedicated to treat the events generated during the
    synthesizer working cycle, more precisely during the synthesis
    function of the engine (see Figure 5).=> Associated event: SynthesizerEvent or one of its
    derived classes.Through this listener it is possible to take control of the
    synthesizer's items queue, which changes its state when:

    • a new item is added to the queue via the speak
      method.
    • an item is removed from the queue via the cancel
      method.
    • an item audio output is finished.

    Figure 5. Synthesizer
    States
    The differences between the synthesizer cycle and engine
    cycle are the substates of the ALLOCATED state: ALLOCATED RESUMED
    QUEUE_EMPTY, ALLOCATED RESUMED QUEUE_NOT_EMPTY, ALLOCATED PAUSED
    QUEUE_EMPTY and ALLOCATED PAUSED QUEUE_NOT_EMPTY. It must be
    observed that the queue condition is independent of the synthesizer
    working condition; items might be added or removed from the queue
    in both states RESUMED and PAUSED, so these new substates are not
    part of a cycle but just indicators of the queue state.The possible events are:

    • queueEmptied: the speaking queue of the Synthesizer
      has emptied and the Synthesizer has changed to the QUEUE_EMPTY
      state. The queue may become empty because speech output of all
      items in the queue is completed or because the items have been
      canceled.
    • queueUpdated: the speech output queue has changed.
      This event may indicate a change in the state of the Synthesizer
      from QUEUE_EMPTY to QUEUE_NOT_EMPTY. The event may also occur in
      the QUEUE_NOT_EMPTY state without changing state. The
      enumerateQueue method of the Synthesizer will return a changed
      list. The speech output queue changes when:

    • a new item is placed on the queue with a call to
      one of the speak methods, or
    • when an item is removed from the queue with one of
      the cancel methods (without emptying the queue), or
    • when output of the top item of the queue is
      completed (again, without leaving an empty queue).

    This listener inherits from the EngineListener interface and
    therefore the same events treated by this listener might be treated
    by the SynthesizerListener.SpeakableListener: listener
    specially designed to treat events generated during speech
    synthesis. How to set it up:

    • Provide a SpeakableListener object when calling one
      of the speak or speakPlainText methods of a Synthesizer.
    • Attach a SpeakableListener to a Synthesizer with
      its addSpeakableListener method.

    => Associated event: SpeakableEvent or one of its derived
    classes.The events that might be treated by the listener are:

    • markerReached: issued when audio output reaches a
      marker contained in the JSML text of a speech output queue item.
      The event text is the string of the MARK attribute. The markerType
      indicates whether the mark is at the opening or close of a JSML
      element or is an attribute of an empty element (no close).
    • speakableCanceled: issued when an item on the
      synthesizer's speech output queue is canceled and removed from the
      queue. A speech output queue item may be canceled at any time
      following a call to speak. An item can be canceled even if it is
      not at the top of the speech output queue (other SpeakableEvents
      are issued only to the top-of-queue item). Once canceled, the
      listener for the canceled object receives no further
      SpeakableEvents.
    • speakableEnded: issued with the completion of audio
      output of an object on the speech output queue as the object is
      removed from the queue. A QUEUE_UPDATED or QUEUE_EMPTIED event also
      is issued when the speech output queue changes because the speech
      output of the item at the top of the queue is completed. The
      SpeakableEvent is issued prior to the SynthesizerEvent.
    • speakablePaused: issued when audio output of the
      item at the top of a synthesizer's speech output queue is paused.
      The SPEAKABLE_PAUSED SpeakableEvent is issued prior to the
      ENGINE_PAUSED event that is issued to the
      SynthesizerListener.
    • speakableResumed: issued when audio output of the
      item at the top of a synthesizer's speech output queue is resumed
      after a previous pause. The SPEAKABLE_RESUMED SpeakableEvent is
      issued prior to the ENGINE_RESUMED event that is issued to the
      SynthesizerListener.
    • speakableStarted: issued at the start of audio
      output of an item on the speech output queue. This event
      immediately follows the TOP_OF_QUEUE unless the Synthesizer is
      paused when the speakable text is promoted to the top of the output
      queue.
    • topOfQueue: issued when an item on the
      synthesizer's speech output queue reaches the top of the queue. If
      the Synthesizer is not paused, the TOP_OF_QUEUE event will be
      followed immediately by the SPEAKABLE_STARTED event. If the
      Synthesizer is paused, the SPEAKABLE_STARTED event will be delayed
      until the Synthesizer is resumed.
    • wordStarted: issued when a synthesis engine starts
      the audio output of a word in the speech output queue item. The
      text, wordStart and wordEnd parameters define the segment of the
      speakable string which is now being spoken.

    It must be observed that, SynthesizerListener extends the SDK
    Java API's EngineListener differently from the SpeakableListener
    that extends JDK API's EventListener. They have different natures;
    the former is related to synthesizer's working events (Engine
    application) and the latter is related to synthesis processing
    events.RecognizerListener: listener
    dedicated to treat events generated during the recognizer working
    cycle, or more precisely, during the recognizing function of the
    Engine. Through its use it is possible to take control of the
    recognizer processing events.=> Associated event: RecognizerEvent or one of its derived
    classes.Figure 6. Recognizer
    States
    The difference between the recognizer cycle (Figure 6) and
    the engine cycle are the substates of the ALLOCATED state:
    SUSPENDED, LISTENING and PROCESSING (Figure 7) that indicate the
    audio input processing state. SUSPENDED is the default state, which
    changes to the LISTENING state due to audio input. Then the
    PROCESSING state takes place.The SUSPENDED state is also reached via the recognizer's
    suspend method. In this state the audio input is buffered for later
    processing. To leave this SUSPENDED state the commitChanges method
    should be called, then the LISTENING state is reached again.Following we have the recognition cycle:Figure 7. Recognition
    Cycle
    The states FOCUS_ON and FOCUS_OFF are reached through
    requestFocus and releaseFocus methods. They indicate if the
    application has access to the Engine in the case where more than
    one application is running at the same time (just one can have
    access at the same moment).Besides the Engine events, the following events can be
    generated:

    • changesCommitted: a CHANGES_COMMITTED event is
      issued as a Recognizer changes from the SUSPENDED state to the
      LISTENING state and resumes recognition. The
      GRAMMAR_CHANGES_COMMITTED event is issued to the GrammarListeners
      of all changed grammars immediately following the CHANGES_COMMITTED
      event.
    • focusGained: a FOCUS_GAINED event is issued as a
      Recognizer changes from the FOCUS_OFF state to the FOCUS_ON state.
      A FOCUS_GAINED event typically follows a call to requestFocus on a
      Recognizer. The GRAMMAR_ACTIVATED event is issued to the
      GrammarListeners of all activated grammars immediately following
      this RecognizerEvent.
    • focusLost: a FOCUS_LOST event is issued as a
      Recognizer changes from the FOCUS_ON state to the FOCUS_OFF state.
      A FOCUS_LOST event may follow a call to releaseFocus on a
      Recognizer or follow a request for focus by another application.
      The GRAMMAR_DEACTIVATED event is issued to the GrammarListeners of
      all deactivated grammars immediately following this
      RecognizerEvent.
    • recognizerProcessing: a RECOGNIZER_PROCESSING event
      is issued as a Recognizer changes from the LISTENING state to the
      PROCESSING state.
    • recognizerSuspended: a RECOGNIZER_SUSPENDED event
      is issued as a Recognizer changes from either the LISTENING state
      or the PROCESSING state to the SUSPENDED state. A result
      finalization event (either a RESULT_ACCEPTED or RESULT_REJECTED
      event) is issued immediately following the RECOGNIZER_SUSPENDED
      event.

    The RecognizerListener extends the EngineListener interface,
    and therefore, the same events treated by it might be treated by
    the RecognizerListener.RecognizerAudioListener:
    listener used to take care of operational audio events.=> Associated event: RecognizerAudioEvent or one of its
    derived classes.We have:

    • audioLevel: the AUDIO_LEVEL event indicates a
      change in the volume level of the incoming audio. This volume
      ranges from 0.0 to 1.0.
    • speechStarted: the recognizer has detected the
      possible start of speech in the incoming audio. Applications may
      use this event to display visual feedback to a user indicating that
      the recognizer is listening.
    • speechStopped: the recognizer has detected the end
      of speech or noise in the incoming audio that it previously
      indicated by a SPEECH_STARTED event. This event always follows a
      SPEECH_STARTED event.

    GrammarListener: controls
    the events generated due to changes in grammar objects.=> Associated event: GrammarEvent or one of its derived
    classes.Its events:

    • grammarActivated: a GRAMMAR_ACTIVATED event is
      issued when a grammar changes state from deactivated to activated.
      The isActive method of the grammar will now return true. Grammar
      activation changes follow one of two RecognizerEvents:

    • a CHANGES_COMMITTED event in which a grammar's
      enabled flag is set true.
    • a FOCUS_GAINED event.

    The full details of the activation conditions under which a grammar
    is activated are described in the documentation for the grammar
    interface.

  • grammarChangesCommitted: a
    GRAMMAR_CHANGES_COMMITTED event is issued when a Recognizer
    completes committing changes to a grammar. The event is issued
    immediately following the CHANGES_COMMITTED event that is issued to
    RecognizerListeners. That event indicates that changes have been
    applied to all grammars of a Recognizer. The
    GRAMMAR_CHANGES_COMMITTED event is specific to each individual
    grammar. The event is issued when the definition of the grammar is
    changed, when its enabled property is changed or both.
  • grammarDeactivated: a GRAMMAR_DEACTIVATED event is
    issued when a grammar changes state from activated to deactivated.
    The isActive method of the grammar will now return false. Grammar
    deactivation changes follow one of two RecognizerEvents:
    • a CHANGES_COMMITTED event in which a grammar's
      enabled flag is set false.
    • a FOCUS_LOST event.

    The full details of the activation conditions under which a grammar
    is deactivated are described in the documentation for the grammar
    interface.

    ResultListener: this is the
    main listener of all. It is responsible for listening to the events
    generated by the result objects that are created by the recognizers
    that work together with grammar objects. Implementing this
    interface enables the developer to determine what processing must
    be done to answer a specific event. It is associated with the
    ResultEvent objects that carry information about the recognizer,
    the grammar and the event.The result objects carry information about the grammar to
    which it was associated, about the recognizer that created it, the
    strings named ResultToken representing what was said (when
    possible), information about the spoken sound on the form of an
    AudioClip and data that can be used for training the
    recognizer.The possible result object states are:

    • FINALIZED:

    • ACCEPTED: the audio item was understood and an
      association with one of the active grammars was determined.
    • REJECTED: the audio item was understood, but the
      recognizer considers a high possibility of a mistake having been
      made. That means the recognizer was able to understand what was
      said (to associate a string, or token, meaning the heard sound),
      but there was not enough information to be sure of the recognition
      due to poor sound quality, a bad pronunciation or even due to
      hardware problems. These results must be carefully treated by the
      application.
  • UNFINALIZED: the audio item was understood and it's
    been processed, but it was not possible to determine an association
    with one of the active grammars.
  • => Associated event: ResultEvent or one of its derived
    classes.

    • audioReleased: an AUDIO_RELEASED event is issued
      when the audio information associated with a FinalResult object is
      released. The release may have been requested by an application
      call to releaseAudio in the FinalResult interface or may be
      initiated by the recognizer to reclaim memory. The FinalResult
      isAudioAvailable method returns false after this event. The
      AUDIO_RELEASED event is only issued for results in a finalized
      state (getResultState returns either ACCEPTED or REJECTED).
    • grammarFinalized: GRAMMAR_FINALIZED is issued when
      the grammar matched by a result is identified and finalized. Before
      this event the getGrammar method of a result returns null.
      Following the event it is guaranteed to return non-null, and the
      grammar is guaranteed not to change. The GRAMMAR_FINALIZED event
      only occurs for a result that is in the UNFINALIZED state. A
      GRAMMAR_FINALIZED event does not affect finalized or unfinalized
      tokens.
    • resultAccepted: a RESULT_ACCEPTED event is issued
      when a result is finalized successfully and indicates a state
      change from UNFINALIZED to ACCEPTED. In the finalization
      transition, zero or more tokens may be finalized, and the
      unfinalized tokens are set to null. The isTokenFinalized and
      isUnfinalizedTokensChanged flags are set appropriately.
    • resultCreated: RESULT_CREATED is issued when a new
      result is created. The event is received by each ResultListener
      attached to the Recognizer. When a result is created, it is in the
      UNFINALIZED state. When created the result may have zero or more
      finalized tokens and zero or more unfinalized tokens. The presence
      of finalized and unfinalized tokens is indicated by the
      isTokenFinalized and isUnfinalizedTokensChanged flags.
    • resultRejected: a RESULT_REJECTED event is issued
      when a result is unsuccessfully finalized and indicates a change
      from the UNFINALIZED state to the REJECTED state. In the state
      transition, zero or more tokens may be finalized and the
      unfinalized tokens are set to null. The isTokenFinalized and
      isUnfinalizedTokensChanged flags are set appropriately. However,
      because the result is rejected, the tokens are quite likely to be
      incorrect. Since the result is finalized (rejected), the methods of
      Finalresult can be used.
    • resultUpdated: a RESULT_UPDATED event has occurred
      because a token has been finalized and/or the unfinalized text of a
      result has changed. The event is issued to each ResultListener
      attached to the Recognizer, to each ResultListener attached to the
      result, and if the GRAMMAR_FINALIZED event has already been
      released, to each ResultListener attached to the matched
      grammar.
    • trainingInfoReleased: a TRAINING_INFO_RELEASED
      event is issued when the training information for a finalized
      result is released. The release may have been requested by an
      application call to the releaseTrainingInfo method in the
      FinalResult interface or may be initiated by the recognizer to
      reclaim memory.

    Detailed information of result objects are presented in the
    next section.ResultsAs mentioned, the result objects are created as answers to
    the recognition performed by the SDK Java. They are responsible for
    generating recognition result events or ResultEvents that might be
    intercepted by ResultListeners. These events are implemented in
    such a way that it is possible for the application to have access
    to the object that generated it (following the Java standard)
    through the getSource method. (The getSource method belongs to the
    java.util.EventObject class interface of which ResultEvent
    inherits.) In this context we will have a result object, or a
    descendent, returned.The possible result object states are:

    • FINALIZED:

    • ACCEPTED: the audio item was understood and an
      association with one of the active grammars was determined.
    • REJECTED: the audio item was understood, but the
      recognizer considers a high possibility of a mistake having been
      made. That means, the recognizer was able to understand (to
      associate a string, or token, meaning the heard sound) what was
      said, but there was not enough information to be sure of the
      recognition, due to poor sound quality, a bad pronunciation or even
      due to hardware problems. These results must be treated carefully
      by the application.
  • UNFINALIZED: the audio item was understood and it
    is been processed, but it was not possible yet to determine an
    association with one of the active grammars.
  • Accept or Reject: rejection of a result indicates that the
    recognizer is not confident that it has accurately recognized what
    a user said. Rejection can be controlled through the
    RecognizerProperties interface with the setConfidenceLevel method.
    Increasing the confidence level requires the recognizer to have
    greater confidence to accept a result, so more results are likely
    to be rejected.Below is an illustration of the recognition cycle and some of
    the fired events:Figure 8.
    Recognition Cycle and Some of the Fired
    Events
    By looking at the figure, we can establish the relationship
    between the result object states and the events (ResultEvent) that
    the listeners (ResultListeners) are able to intercept.

    • A RESULT_CREATED event creates a result object. A
      new result is started in the UNFINALIZED state.
    • UNFINALIZED state: RESULT_UPDATED events indicate a
      change in finalized and/or unfinalized tokens; a GRAMMAR_FINALIZED
      event indicates that the grammar matched by this result has been
      identified.
    • The RESULT_ACCEPTED event finalizes a result by
      indicating a change in state from UNFINALIZED to ACCEPTED.
    • The RESULT_REJECTED event finalizes a result by
      indicating a change in state from UNFINALIZED to REJECTED.
    • In the finalized states (ACCEPTED and REJECTED),
      the AUDIO_RELEASED and TRAINING_INFO_RELEASED events also may be
      issued.

    The result objects are:

    • Result: the most primitive form of a created
      result. This form is used until the recognition cycle is finished,
      before all the information relating to a certain audio entrance is
      available. The substates might be FINALIZED or UNFINALIZED.
    • FinalResult: created when we have all the
      information relating to a certain audio entrance, that is, it is a
      complete result able to supply all the possible data as a
      consequence of a complete recognition. Its substates are ACCEPTED
      or REJECTED (always FINALIZED).

    The information available in a result is determined by the
    type of grammar to which it was associated. Therefore, completing
    the model, the FinalResult interface is inherited by two other
    interfaces exclusively implemented by the recognizer for a certain
    finalized result, they are:

    • FinalRuleResult: the final result in consequence of
      an audio entrance associated to a RuleGrammar, with the substates
      ACCEPTED or REJECTED (FINALIZED).
    • FinalDictationResult: the final result in
      consequence of an audio entrance associated to a DictationGrammar,
      with the substates ACCEPTED or REJECTED (FINALIZED).

    This schema is useful on the following situations:

    1. Makes it possible to have access to the created
      results before they are finalized, for that we use simple result
      objects.
    2. We can have both types of grammars associated to
      the same recognizer. And, in certain moments, we might need
      information that is not relevant either to an association with a
      rule or to a dictation grammar, but simply for a finalized result.
      For that we can use the FinalResult interface without having to
      test the nature of the result.
    3. Having two types of finalized results, we can have
      additional data referring to the type of the grammar associated,
      increasing the control of applications.

    Now we can
    see some sample code illustrating the
    concepts.
    The APIThe API that is being studied is composed by three packages
    that form the voice Speech for Java SDK:

    • package javax.speech: has the basic interfaces and
      implementations for the Engine interface. It also implements the
      Central class providing access to all the other functions for
      creating the most important entities of the technology, the
      synthesizers and recognizers. It is the basis for the remaining
      packages, remembering that the recognizer and synthesizer
      interfaces extend the Engine interface. This package also has
      resources for managing the audio received by the Engine and, as a
      last feature, its SpeechEvent class is inherited by all the other
      events of the API.
    • package javax.speech.synthesis: has all the classes
      and interfaces used for the synthesis process. It also provides the
      Voice class that determines how the synthesis will take
      place.
    • package javax.speech.recognition: the largest of
      the packages, it implements the recognizing and grammar classes,
      given their close relation. Also, there are the result classes that
      are very important in this architecture.

    The packages are presented in a jar file named ibmjs.jar and
    the compilation is possible only after adding its full filesystem
    pathname to the CLASSPATH system variable, as follows:

                       CLASSPATH = %CLASSPATH% ; .../ibmjs.jar
    

    After that and the proper installation of the IBM ViaVoice
    software, it is necessary to execute the following command:

                          java com.ibm.speech.util.Install
    

    ReferencesYou can download the free IBM ViaVoice SDK Java at
    www.alphaworks.ibm.com/tech/speech.You can download the free IBM ViaVoice SDK for Linux at
    www-4.ibm.com/software/speech/dev/sdk_linux.html.Java Speech Interface from Sun is available at
    java.sun.com/products/java-media/speech/.See
    java.icmc.usp.br for
    more information.The authors acknowledge the support for this work from the
    The Research Foundation of the São Paulo State-FAPESP
    .José Rodrigues
    Júnior
    is a graduate student at the ICMC-USP
    University of São Paulo. Dr. Dilvan
    Moreira
    is a lecturer at the ICMC-USP.

    email: dilvan@icmc.sc.usp.br

    email: dilvan@icmc.sc.usp.br

    ______________________

    Comments

    Comment viewing options

    Select your preferred way to display the comments and click "Save settings" to activate your changes.

    This is really useful. I

    siva's picture

    This is really useful. I successfully completed my project voice controlled wheel chair with the help of this concept... It works well....

    Re: Java Speech Development Kit: A Tutorial

    Anonymous's picture

    Is there lib to use text to speak in portuguese language ?

    Re: Java Speech Development Kit: A Tutorial

    Anonymous's picture

    i have read your tutorial about the Java Speech Development Kit, it is truely very interesting, i would realy love to develop a programe of my own on this context. please help me im a graduate from the University of Botswana in Computer Science in Botswana, i realy ineterested in speech program but i dont know where to start and what i need. more especialy that i dont have any of the classies that i can use for statup training plz give me an advice.

    my Email is matikitim2@yahoo.cu.com

    thank you in advace

    Moathodi Excellent Matikiti

    Re: Java Speech Development Kit: A Tutorial

    Anonymous's picture

    I am in the same situation as Moathodi, were I would like to develope a program on this subject. I would really appreciate it if you could aid me in kicking it of as I am not sure were to start.

    My email is adil_rehman@hotmail.com

    Many Thanks

    Adil

    Speech recognition engines

    Mario's picture

    I have done a lot of research on the speech development architecture of java..but i am more interested in some package that might make the development of a speech to text application faster. Thanx...hope i get a reply...my emaill is mario_ramotar@yahoo.com

    Post new comment

    • Allowed HTML tags: <a> <em> <strong> <cite> <code> <pre> <ul> <ol> <li> <dl> <dt> <dd> <i> <b>
    • Lines and paragraphs break automatically.
    • Use to create page breaks.

    More information about formatting options