Andrew Ross MacNeill (andrew@aksel.com)
(originally written November 2001)
One technology that appears every so often and then fades because it's just
not ready for prime-time is speech recognition. Sure, computers can talk. The
Apple Macintosh computer was introduced by speaking to an audience back in
1984. In the early '90s, companies introduced voice control for their
computers, but it didn't catch on. Now with Office XP, Microsoft appears to be
trying to push voice recognition back into the mainstream. It's not just for
voice control either: you can actually use it for full dictation. What does
this mean for us? Well, as with most things included in Office, developers can
take advantage of the same core technology, known as SAPI (Speech API), to
offer recognition in our own applications. The SDK can be downloaded from the
MSDN web site (http://msdn.microsoft.com) and then included in your
Windows-based applications.
Is it ready for prime-time? Only time will tell, but it is certainly better
than it used to be. SAPI 5.1 uses COM components, so we can take advantage of
it with FoxPro. Let's take a look at some of the basic functionality provided
with SAPI.
There are two basic areas involved in voice
and speech: one is speaking; the other is listening. Let's start with speaking.
The following commands are all that's needed to get started:
loSpeak = CREATEOBJECT("SAPI.SPVoice")
loSpeak.Speak("Hello", 1)
Wasn't that easy? Well, there's a little
more to speech than just the above command. The second parameter is an
enumerated value that tells the voice how to speak and integrate with the
development environment. Table 1 shows the valid values.
| Value | Description |
|-------|-------------|
| 0 | Default (synchronous). Control waits until the speaking is done. |
| 1 | Asynchronous. Control returns immediately after the command has been accepted, which may be BEFORE the text is actually spoken. |
| 2 | Purge Before Speak. All other text statements are purged before it speaks. |
| 4 | Is File Name. Instead of reading the text passed, opens the file specified and reads its contents. |
| 8 | IsXML. You can send grammatical and pronunciation rules to the Speech engine (see below). |
| 16 | IsNotXML. By default, the text is not read as XML. |
| 32 | PersistXML. Changes made in one Speak command will persist to other calls to Speak. |
| 64 | SpeakPunctuation. With this flag, punctuation is actually spoken, so the "." becomes the word "period". |
Table 1 – Speaking with Flags.
The second parameter of the Speak method is the SpeakText flag, which tells
SAPI how to react in the environment.
For example, instead of typing in the whole
text of what you want to read, you can pass it the name of a file.
loSpeak.Speak("C:\Speech.doc", 4 + 1)
The 4 flag indicates that the value is a
file name. The 1 makes the speech asynchronous, so that you have control as
soon as the command is executed. This is very important if you are reading
large amounts of information so you can then Pause and Resume the speech, using
those respective methods.
loSpeak.Speak("C:\Speech.doc", 4 + 1)
ON KEY LABEL F4 loSpeak.Pause
ON KEY LABEL F5 loSpeak.Resume
If you don't pass the asynchronous flag,
you have to wait until it's done.
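The values in Table 1 are bit flags, so they can be added together or combined with BITOR(). The following is only a sketch under a few assumptions: the constant names are hypothetical labels invented for this example (VFP does not pick up the SAPI enumerations on its own), and the last line assumes that the voice object offers a WaitUntilDone method taking a timeout in milliseconds, with -1 meaning no timeout.

```foxpro
* Hypothetical names for three of the Table 1 values
#DEFINE SPEAK_ASYNC      1
#DEFINE SPEAK_PURGE      2
#DEFINE SPEAK_ISFILENAME 4

loSpeak = CREATEOBJECT("SAPI.SpVoice")

* Read a file asynchronously (4 + 1 = 5)
loSpeak.Speak("C:\Speech.doc", BITOR(SPEAK_ISFILENAME, SPEAK_ASYNC))

* Interrupt it: purge whatever is queued, then speak this line instead
loSpeak.Speak("Stopping the file.", BITOR(SPEAK_PURGE, SPEAK_ASYNC))

* Assumed method: block here until the asynchronous speech has finished
loSpeak.WaitUntilDone(-1)
```

Naming the flags costs a few lines but makes a call such as Speak(lcFile, 5) much easier to read six months later.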
Set the Rate property to change the rate of
the voice. The value may be set between –10 and 10 but realistically, a value
of 5 is probably as fast as you want to go. By default, the Rate is 0.
spVoice speaks at the current volume setting of your speakers. Reduce the
Volume property to lower the voice. This property starts at 100.
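As a small sketch of the two properties just described, the code below saves the current settings, speaks one line faster and quieter, and then restores the originals. The particular values (3 and 60) are arbitrary choices for illustration.

```foxpro
loSpeak = CREATEOBJECT("SAPI.SpVoice")

* Remember the current settings (0 and 100 by default)
lnOldRate = loSpeak.Rate
lnOldVolume = loSpeak.Volume

* Speak a little faster and a little quieter
loSpeak.Rate = 3
loSpeak.Volume = 60
loSpeak.Speak("This is a little faster and a little quieter.")

* Restore the original settings
loSpeak.Rate = lnOldRate
loSpeak.Volume = lnOldVolume
loSpeak.Speak("Back to the original settings.")
```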
spVoice can be instantiated multiple times and executed either simultaneously
or sequentially. You can also use different voices. Microsoft provides 4
standard voices: Mary, Mike, Sam and a sample TTS (Text to Speech) voice. Set
the desired voice by setting the Voice property to a valid item from the
GetVoices method. By default, SAPI uses the Microsoft Mary voice.
FOR EACH loVoice IN loSpeak.GetVoices()
loSpeak.Speak(loVoice.GetDescription)
ENDFOR
loSpeak.Voice = loSpeak.GetVoices().Item(2) && Sets to Microsoft Sam
The Speech engine can also speak in streams, in which case it does not speak
text but rather sound files. You can use streams to store the speech output in
sound files that may then be reused at a later time. The spFileStream object is
used for dealing with streams. Let's look at an example in which we take the
speech from one voice and use it with another. See Table 2 for a list of the
methods of the spFileStream object.
| Method/Property | Description |
|-----------------|-------------|
| Open (cFile, nFlag) | Opens a WAV file for reading or manipulation. The nFlag parameter is either 0 for reading or 3 for writing. |
| Write (cBuffer) | Writes the contents of cBuffer into the WAV file. |
| Read (cBufferVar, nNumBytes) | Reads the contents of the WAV file into the buffer variable. |
| Seek (nPosition) | Moves ahead into the WAV file by the number of bytes. |
| Close | Closes the WAV file. |
Table 2 – The spFileStream's methods are used for manipulating WAV files that
contain the results of speech. Note how similar they are to the low-level file
functions of VFP.
loStream = CREATEOBJECT("SAPI.SPFileStream")
loVoice = CREATEOBJECT("SAPI.SPVoice")
** Create the WAV file
loStream.Open("C:\SampleVoice.wav", 3)
** Specify that the audio from the voice is sent to the stream
loVoice.AudioOutputStream = loStream
loVoice.Speak("This is me reading this information.")
loStream.Close
** Reset the output stream
loVoice.AudioOutputStream = .NULL.
** Open the WAV file for reading purposes
loStream.Open("C:\SampleVoice.wav")
** Change the voice to another "person"
loVoice.Voice = loVoice.GetVoices("GENDER=male").Item(0)
loVoice.Speak("Hello")
loVoice.SpeakStream(loStream)
loVoice.Speak("Did that really sound like me?")
This code shows a few other properties of
the spVoice. The AudioOutputStream property specifies where the audio output
actually goes. The default setting is NULL which means output to your speakers.
GetVoices isn't really a collection as much as it is a method that returns a
collection. You can request a filtered collection by passing it a condition.
The conditions are based on the attributes of the voice object. A voice object
has properties such as Gender, Age, Name, Language and Vendor. In the example,
I specify that I only want to retrieve Male voices. If nothing matches the
condition, GetVoices returns null.
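Because a filter may match nothing, it is worth checking the result of GetVoices before calling Item(0) on it. The sketch below is deliberately defensive: it treats both a null result and an empty collection as "no match", on the assumption that the returned collection exposes a Count property alongside the Item method used earlier.

```foxpro
loVoice = CREATEOBJECT("SAPI.SpVoice")
loTokens = loVoice.GetVoices("Gender=Male")

* Treat a null result and an empty collection the same way
llFound = .F.
IF VARTYPE(loTokens) = "O"
   llFound = (loTokens.Count > 0)
ENDIF

IF llFound
   * List what matched, then switch to the first match
   FOR EACH loToken IN loTokens
      ? loToken.GetDescription()
   ENDFOR
   loVoice.Voice = loTokens.Item(0)
   loVoice.Speak("This is the first male voice.")
ELSE
   loVoice.Speak("No male voice was found, so I am using the default.")
ENDIF
```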
Finally, the SpeakStream method lets you
pass a pointer to the stream object. This is the equivalent of running the WAV
file. The end result is that the "recorded" voice is that of Mary
while the last voice used is that of Mike.
SAPI also lets you identify priorities of a
voice. This is important when using spVoice asynchronously. If you make two
calls to Speak at the same time, SAPI will run them sequentially. Set the
Priority property to determine which is more critical. Priority is zero by
default. Set it to 1 (Alert) and SAPI will stop all other voices from speaking
when a voice with a priority of Alert speaks. Set it to 2 (Over) and SAPI will
let the voice speak "over" the other voices.
loVoice1 = CREATEOBJECT("SAPI.SPVoice")
loVoice2 = CREATEOBJECT("SAPI.SPVoice")
loAlert = CREATEOBJECT("SAPI.SPVoice")
loOver = CREATEOBJECT("SAPI.SPVoice")
loAlert.Priority = 1
loOver.Priority = 2
** These will both be done sequentially
** Note the use of 5 as the flag, which is file name and asynchronous
loVoice1.Speak("C:\speech.txt", 5)
loVoice2.Speak("C:\speech.txt", 5)
ON KEY LABEL F3 DO InterruptMe
ON KEY LABEL F4 loOver.Speak("This is a sample voice over")
FUNCTION InterruptMe
loAlert.Speak("WARNING - This is CRITICAL!! Are you sure?")
loVoice1.Pause
loVoice2.Pause
ON KEY LABEL F5 loVoice1.Resume
ON KEY LABEL F6 loVoice2.Resume
Unless you are using very basic words,
pronunciation may get difficult when expecting the computer to speak.
Thankfully, you can use XML to write instructions for SAPI to know how to
speak. Table 3 shows the supported tags. For example, the following XML would
result in certain words receiving more emphasis and volume:
<volume level="50">
Speak louder, I say.
<emph>People</emph>
might <volume level="100">shout</volume> if you are too
soft.
</volume>
| Tag | Description |
|-----|-------------|
| Volume | Sets the volume through the level attribute, to a maximum of 100. |
| Rate | Changes the rate of speech. Set Speed to change the relative speed or set AbsSpeed to set the absolute rate, to a maximum of 10. |
| Pitch | Changes the pitch of the voice. Set Middle to change the relative pitch or AbsMiddle to change the absolute pitch, to a maximum of 10. |
| Emph | Instructs the Voice to place emphasis on a particular word or phrase. |
| Spell | Changes the Voice to spell the words as opposed to speaking them. |
| Silence | Inserts a pause based on the value of the msec attribute, which is measured in milliseconds. |
| Pron | Specifies the pronunciation of a word with the sym attribute. The actual word is optional. |
Table 3 – Using XML to speak better. Here
are some of the XML tags that can be used in the string passed to the Speak
method.
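To tie the tags in Table 3 back to the Speak method, here is a minimal sketch that builds a tagged string and passes it with the IsXML flag (8) from Table 1. The sentences are invented for illustration, and the attribute names are written in lower case to match the volume example shown earlier.

```foxpro
loVoice = CREATEOBJECT("SAPI.SpVoice")

* Slow down, pause for half a second, spell a word, then add emphasis
lcXML = '<rate speed="-2">I will say this slowly.</rate>' + ;
   '<silence msec="500"/>' + ;
   'The acronym is spelled <spell>SAPI</spell>. ' + ;
   '<emph>Remember</emph> that.'

* 8 = IsXML: tell the voice to treat the tags as instructions
loVoice.Speak(lcXML, 8)
```

Single quotes are used as the VFP string delimiters so that the double quotes around the attribute values need no special handling.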
Gotcha: Watch your Speaking Flags
If you don't pass an async flag to the Goodbye call below, SAPI waits for the
voice to be unpaused before continuing.
loVoice1.Speak("Hello")
loVoice1.Pause
loVoice1.Speak("Goodbye")
In the above case, after the voice says
Hello, it pauses. If you don't provide a Resume mechanism, it will never
continue to the next line.
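One way around this, sketched below, is to queue the second phrase asynchronously so that control comes back to the program, which can then decide when to resume. The final WaitUntilDone call is an assumption (a voice method taking a timeout in milliseconds, -1 for no timeout); it is there only so a program does not end, and release the voice, before the phrase is heard.

```foxpro
loVoice1 = CREATEOBJECT("SAPI.SpVoice")

loVoice1.Speak("Hello")         && synchronous: returns once "Hello" is spoken
loVoice1.Pause
loVoice1.Speak("Goodbye", 1)    && asynchronous: queued, control returns at once

* Other work can happen here while the voice is paused

loVoice1.Resume                 && "Goodbye" is spoken now
loVoice1.WaitUntilDone(-1)      && assumed method: wait for the speech to end
```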
The Skip method can be used to skip ahead
sentences.
lcText = "In September, I went to San Diego. It was a lot of fun. " + ;
   "I wonder where it will be next year? " + ;
   "It must be on the east coast."
loVoice1.Speak(lcText, 1)
loVoice1.Skip("Sentence", 2)
This skips ahead from the current sentence
by two sentences.
To stop the speech completely, release the
spVoice object.
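If you would rather keep the object around, the Purge Before Speak flag from Table 1 offers a softer alternative. The sketch below assumes that speaking an empty string with that flag simply clears whatever is queued; releasing the object, as described above, remains the way to stop for good.

```foxpro
loVoice1 = CREATEOBJECT("SAPI.SpVoice")
loVoice1.Speak("C:\speech.txt", 5)   && 4 (file name) + 1 (asynchronous)

* Silence the voice but keep the object: purge the queue (2), async (1)
loVoice1.Speak("", 2 + 1)

* The same object can speak again afterwards
loVoice1.Speak("Starting over.")

* Releasing the object stops speech completely
loVoice1 = .NULL.
```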
Voice recognition has come a long way in
the past few years. It may not be perfect enough for dictation of technical
manuals but it IS good enough for command syntax. There are lots of packages
out there today (Dragon, SpeakNow) that offer command control and dictation. But what about your application? Here's where SAPI comes in.
SAPI includes a number of different classes
for speech recognition but to get it working in your application, you only have
to know two of them: RecognitionContext and its associated event object. Thanks
to the VFP 7 Object Browser, it's easy to add this functionality in (figure 1
shows the Object Browser highlighting the necessary interface).

Figure 1 – The VFP 7 Object Browser makes
quick work of creating a link to the SAPI Events interface.
Drag the _ISpeechRecoContextEvents interface into a piece of code and the code
is all there. I've cut out the extra code to make it easier to read below.
Gotcha: Deciding Which Interface To Use With the Object Browser
When working with the Object Browser, figuring out which interface to use with
IMPLEMENTS can get tricky. The secret is to choose the ones that start with
the underscore.
Here is the base class definition generated
by the Object Browser, required to understand basic words:
DEFINE CLASS myclass AS session OLEPUBLIC
IMPLEMENTS _ISpeechRecoContextEvents IN ;
   "c:\program files\common files\microsoft shared\speech\sapi.dll"
PROCEDURE _ISpeechRecoContextEvents_Recognition(;
   StreamNumber AS Number, ;
   StreamPosition AS VARIANT, ;
   RecognitionType AS VARIANT, ;
   Result AS VARIANT) AS VOID ;
   HELPSTRING "Recognition"
   ? Result.PhraseInfo.GetText
ENDPROC
ENDDEFINE
I haven't included all of the procedures for the sake of brevity. The key
method here is the Recognition event. When a word or phrase is recognized, the
Recognition event fires and is passed where the information is coming from
(the stream) and the resulting recognized phrase (Result).
This next code hooks into this class using VFP 7's EVENTHANDLER() function.
PUBLIC oRecognize, oVFPObj,ogrammar
oVFPObj = CREATEOBJECT("myclass")
oRecognize = CREATEOBJECT("SAPI.spsharedrecocontext")
EVENTHANDLER(oRecognize,oVFPObj)
The class spSharedRecoContext is simply the
interface by which an application hooks into the Speech Recognition engine. It
also controls which words and phrases are available for the user to speak. This
code turns VFP into a Dictation machine.
oGrammar = oRecognize.CreateGrammar
oGrammar.DictationSetState(1)
The oGrammar object is what is actively
used to recognize the words. The call to DictationSetState activates the
Grammar object. Call DictationSetState with a 0 to deactivate it.
When this program is running, every word I
speak into my headset is recognized and then displayed on the screen. The first
time you do this, the accuracy leaves a little to be desired but you can
"train" SAPI for each voice so it gets smarter. The following code
starts making decisions based on what the user says.
lcCommand = UPPER(Result.PhraseInfo.GetText)
DO CASE
CASE "PICKUP" $ lcCommand
   MESSAGEBOX("Pick up this load")
CASE "DELIVER" $ lcCommand
   MESSAGEBOX("Deliver this load")
ENDCASE
That's all there is to getting started.
Using the SAPI Recognition engine, you can allow dictation into memo fields or
let your users run your application from their voice commands, all with just a
few lines of code added to your application.
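As the list of commands grows, it may be cleaner to keep the matching logic out of the event procedure. The helper below is a hypothetical sketch, not part of SAPI or of the generated class: RouteCommand is a name invented here, and stripping the spaces assumes that dictation may hand back "pick up" as two words rather than one.

```foxpro
* Hypothetical helper: maps a recognized phrase to a command name
FUNCTION RouteCommand(tcPhrase)
   LOCAL lcCommand
   * Upper-case and remove spaces so "pick up" and "pickup" both match
   lcCommand = STRTRAN(UPPER(ALLTRIM(tcPhrase)), " ", "")
   DO CASE
   CASE "PICKUP" $ lcCommand
      RETURN "PICKUP"
   CASE "DELIVER" $ lcCommand
      RETURN "DELIVER"
   OTHERWISE
      * Nothing recognized as a command
      RETURN ""
   ENDCASE
ENDFUNC
```

Inside the Recognition event, the call would then be something like lcAction = RouteCommand(Result.PhraseInfo.GetText), with the application acting only when lcAction is not empty.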
The SAPI 5.1 SDK available from the MSDN
web site includes several Windows Installer Merge Modules for installation,
including support for English, Chinese and Japanese languages. This lets you
build your installation routine to ensure the SAPI objects have been registered
during installation. Be warned though - the file sizes are large – the download
for the Merge Modules is over 120 MB.
Making applications easy isn't always about following the standard interfaces
and such; it's about separating requirement from "cool for the sake of cool".
For someone who's never used a keyboard, the entire QWERTY approach isn't
easy. Easy is a relative term.
Speech recognition is now a technology that
can legitimately and efficiently be used in some industries. It isn't for every
application. I don't think anyone is ready to sit back and listen to a report
being read instead of looking at it on the screen. But for hands-free
applications, or simply as a way to make an application an easier tool to
learn, SAPI definitely offers developers an easy way to implement it.