You Talk and I'll Listen:

An Introduction to SAPI and Use of Speech In Applications

Andrew Ross MacNeill (andrew@aksel.com)

(originally written November 2001)

 

 

One technology that appears every so often and then fades because it's just not ready for prime time is speech recognition. Sure, computers can talk. The Apple Macintosh computer was introduced by speaking to an audience back in 1984. In the early '90s, companies introduced voice control for their computers, but it didn't catch on. Now, with Office XP, Microsoft appears to be trying to push voice recognition back into the mainstream. It's not just for voice control, either; you can actually use it for full dictation. What does this mean for us? Well, as with most things included in Office, developers can take advantage of the same core technology, known as SAPI (Speech API), to offer recognition in our own applications. The SDK can be downloaded from the MSDN web site (http://msdn.microsoft.com) and then included in your Windows-based applications.

 

Is it ready for prime time? Only time will tell, but it is certainly better than it used to be. SAPI 5.1 uses COM components, so we can take advantage of it with FoxPro. Let's take a look at some of the basic functionality provided with SAPI.

 

Microsoft SAPI 5.1

 

There are two basic areas involved in voice and speech: one is speaking; the other is listening. Let's start with speaking. The following commands are all that's needed to get started:

 

loSpeak = CREATEOBJECT("SAPI.SPVoice")

loSpeak.SPEAK("Hello",1)

 

Wasn't that easy? Well, there's a little more to speech than just the above command. The second parameter is an enumerated value that tells the voice how to speak and integrate with the development environment. Table 1 shows the valid values.

 

Value   Description
0       Default. Synchronous: control waits until the speaking is done.
1       Asynchronous. Control returns immediately after the command has been accepted, which may be BEFORE the text is actually spoken.
2       Purge Before Speak. All other text statements are purged before it speaks.
4       Is File Name. Instead of reading the text passed, opens the specified file and reads its contents.
8       IsXML. You can send grammatical and pronunciation rules to the Speech engine (see below).
16      IsNotXML. The text is treated as plain text and is not parsed for XML tags.
32      PersistXML. Changes made in one Speak command persist to later calls to Speak.
64      SpeakPunctuation. With this flag, punctuation is actually spoken, so the "." becomes the word "period".

Table 1 Speaking with Flags. The second parameter of the Speak method is the SpeakText flag, which tells SAPI how to react in the environment.

 

For example, instead of typing in the whole text of what you want to read, you can pass it the name of a file.

 

loSpeak.Speak("C:\Speech.doc",4 + 1)

 

The 4 flag indicates that the value is a file name. The 1 makes the speech asynchronous, so that you have control as soon as the command is executed. This is very important if you are reading large amounts of information so you can then Pause and Resume the speech, using those respective methods.

 

loSpeak.Speak("C:\Speech.doc", 4 + 1)

ON KEY LABEL F4 loSpeak.Pause

ON KEY LABEL F5 loSpeak.Resume

 

If you don't pass the asynchronous flag, you have to wait until it's done.
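
The numeric flags are easier to read when they have names. The following is a minimal sketch rather than part of the original code: the #DEFINE names mirror the SpeechVoiceSpeakFlags names in the SAPI 5.1 type library, the sample sentences are made up, and WaitUntilDone is the SpVoice method that blocks until queued speech finishes or a timeout (in milliseconds) elapses.

```foxpro
* Named constants for some of the Speak flags listed in Table 1
#DEFINE SVSFDefault          0
#DEFINE SVSFlagsAsync        1
#DEFINE SVSFPurgeBeforeSpeak 2
#DEFINE SVSFIsFilename       4

LOCAL loSpeak
loSpeak = CREATEOBJECT("SAPI.SpVoice")

* Queue the text and get control back immediately
loSpeak.Speak("This text is spoken in the background.", SVSFlagsAsync)

* Discard anything still queued, then speak the new text
loSpeak.Speak("This replaces whatever was being said.", ;
    SVSFlagsAsync + SVSFPurgeBeforeSpeak)

* Wait up to 10 seconds for the voice to finish before moving on
loSpeak.WaitUntilDone(10000)
```

Adding the flag values together, as in the 4 + 1 example, is what combines them, so named constants cost nothing at run time.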

 

Set the Rate property to change the rate of the voice. The value may be set between -10 and 10, but realistically a value of 5 is probably as fast as you want to go. By default, the Rate is 0.

 

spVoice speaks at the current volume setting of your speakers. Reduce the Volume property to lower the voice. This property starts at 100.
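
As a quick way to hear the effect of these two properties, here is a small sketch. The values chosen and the sentences are arbitrary; only the Rate and Volume properties themselves come from the text above.

```foxpro
LOCAL loSpeak
loSpeak = CREATEOBJECT("SAPI.SpVoice")

loSpeak.Speak("This is the default rate and volume.")

loSpeak.Rate = 4       && faster than the default of 0
loSpeak.Volume = 50    && half of the default of 100
loSpeak.Speak("This is faster and quieter.")

loSpeak.Rate = -4      && slower than the default
loSpeak.Volume = 100   && back to full volume
loSpeak.Speak("This is slower and back at full volume.")
```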

 

spVoice can be instantiated multiple times and executed either simultaneously or sequentially. You can also use different voices. Microsoft provides 4 standard voices: Mary, Mike, Sam and a sample TTS (Text to Speech) voice. Set the desired voice by setting the Voice property to a valid item from the GetVoices method. By default, SAPI uses the Microsoft Mary voice.

 

FOR EACH loVoice IN loSpeak.GetVoices()

loSpeak.Speak(loVoice.GetDescription)

ENDFOR

 

loSpeak.Voice = loSpeak.GetVoices().Item(2) && Sets to Microsoft Sam

 

The Speech engine can also work with streams, in which case it speaks sound files rather than text. You can use streams to store the speech output in sound files that may then be re-used at a later time. The SPFileStream object is used for dealing with streams. Let's look at an example in which we take the speech from one voice and use it with another. See Table 2 for a list of the methods of the spFileStream object.

 

 

 

Method/Property                 Description
Open (cFile, nFlag)             Opens a WAV file for reading or manipulation. The nFlag parameter is either 0 for reading or 3 for writing.
Write (cBuffer)                 Writes the contents of cBuffer into the WAV file.
Read (cBufferVar, nNumBytes)    Reads the contents of the WAV file into the buffer variable.
Seek (nPosition)                Moves ahead into the WAV file by the number of bytes.
Close                           Closes the WAV file.

Table 2 The spFileStream's methods are used for manipulating WAV files that contain the results of speech. Note how similar they are to the low-level file functions of VFP.

 

loStream = CREATEOBJECT("SAPI.SPFileStream")

loVoice = CREATEOBJECT("Sapi.Spvoice")

 

** Create the Wav file

loStream.open("C:\SampleVoice.wav", 3)

 

** Specify that the Audio from the voice is sent to the stream

loVoice.AudioOutputStream = loStream

 

loVoice.Speak("This is me reading this information.")

loStream.Close

 

** Reset the output stream

loVoice.AudioOutputStream = .NULL.

 

** Open the WAV file for reading purposes

loStream.Open("C:\SampleVoice.wav")

 

** Change the voice to another "person"

loVoice.Voice = loVoice.GetVoices("GENDER=male").item(0)

lovoice.speak("Hello")

loVoice.SpeakStream(loStream)

loVoice.Speak("Did that really sound like me?")

 

This code shows a few other properties of the spVoice. The AudioOutputStream property specifies where the audio output actually goes. The default setting is NULL which means output to your speakers. GetVoices isn't really a collection as much as it is a method that returns a collection. You can request a filtered collection by passing it a condition. The conditions are based on the attributes of the voice object. A voice object has properties such as Gender, Age, Name, Language and Vendor. In the example, I specify that I only want to retrieve Male voices. If nothing matches the condition, GetVoices returns null.
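
Because a filter may match nothing on a given machine, it is safer to check the result before using Item(0). This is a hedged sketch: it guards against both a null return, as described above, and an empty collection, using the Count property of the returned collection. The "Gender=Female" filter and the messages are illustrative only.

```foxpro
LOCAL loVoice, loTokens
loVoice = CREATEOBJECT("SAPI.SpVoice")

* Ask only for female voices
loTokens = loVoice.GetVoices("Gender=Female")

DO CASE
CASE VARTYPE(loTokens) # "O"
    * Nothing usable came back
    ? "GetVoices did not return a collection"
CASE loTokens.Count = 0
    * A collection came back, but it is empty
    ? "No installed voice matches the filter"
OTHERWISE
    loVoice.Voice = loTokens.Item(0)
    loVoice.Speak("Now speaking as " + loTokens.Item(0).GetDescription())
ENDCASE
```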

 

Finally, the SpeakStream method lets you pass a pointer to the stream object. This is the equivalent of running the WAV file. The end result is that the "recorded" voice is that of Mary while the last voice used is that of Mike.

 

SAPI also lets you identify priorities of a voice. This is important when using spVoice asynchronously. If you make two calls to Speak at the same time, SAPI will run them sequentially. Set the Priority property to determine which is more critical. Priority is zero by default. Set it to 1 (Alert) and SAPI will stop all other voices from speaking when a voice with a priority of Alert speaks. Set it to 2 (Over) and SAPI will let the voice speak "over" the other voices.

 

lovoice1 = Createobject("SAPI.spvoice")

lovoice2 = Createobject("SAPI.spvoice")

loAlert = Createobject("SAPI.spvoice")

loOver = Createobject("SAPI.spvoice")

loAlert.Priority = 1

loOver.Priority = 2

 

** These will both be done sequentially

** Note the use of 5 as the flag, which is file name (4) plus asynchronous (1)

loVoice1.Speak("C:\speech.txt",5)

loVoice2.Speak("C:\speech.txt",5)

 

ON KEY LABEL f3 DO InterruptMe

ON KEY LABEL F4 loOver.Speak("This is a sample voice over")

 

FUNCTION InterruptMe

loAlert.Speak("WARNING - This is CRITICAL!! Are you sure?")

loVoice1.Pause

loVoice2.Pause

ON KEY LABEL F5 loVoice1.Resume

ON KEY LABEL F6 loVoice2.Resume

 

Using XML

 

Unless you are using very basic words, pronunciation may get difficult when expecting the computer to speak. Thankfully, you can use XML to write instructions for SAPI to know how to speak. Table 3 shows the supported tags. For example, the following XML would result in certain words receiving more emphasis and volume:

 

<volume level="50">

Speak louder, I say.

<emph>People</emph> might <volume level="100">shout</volume> if you are too soft.

</volume>

 

Tag       Description
Volume    Changes the level attribute to indicate the volume setting, to a maximum of 100.
Rate      Changes the rate of speech. Set the Speed attribute to change the relative speed, or set the AbsSpeed attribute to set the absolute rate, to a maximum of 10.
Pitch     Changes the pitch of the voice. Set the Middle attribute to change the relative pitch, or the AbsMiddle attribute to change the absolute pitch, to a maximum of 10.
Emph      Instructs the Voice to place emphasis on a particular word or phrase.
Spell     Changes the Voice to spell the words as opposed to speaking them.
Silence   Inserts a pause based on the value of the msec attribute, which is given in milliseconds.
Pron      Specifies the pronunciation of a word with the sym attribute. The actual word is optional. Example: <pron sym="h eh 1 l ow "/>

Table 3 Using XML to speak better. Here are some of the XML tags that can be used in the string passed to the Speak method.
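
To show how such markup reaches the engine, here is a minimal sketch that builds the XML as a VFP string and passes it to Speak with the IsXML flag (8) from Table 1. The constant name SVSFIsXML follows the SAPI type library. The extra silence and spell fragments, and the wording of the text, are illustrative assumptions.

```foxpro
#DEFINE SVSFIsXML 8

LOCAL loVoice, lcXML
loVoice = CREATEOBJECT("SAPI.SpVoice")

* Single quotes delimit the VFP strings so the XML attributes
* can keep their double quotes
lcXML = '<volume level="50">Speak louder, I say. ' + ;
    '<emph>People</emph> might <volume level="100">shout</volume> ' + ;
    'if you are too soft.</volume>' + ;
    '<silence msec="500"/>' + ;
    'The product is called <spell>VFP</spell>.'

* The flag tells SAPI to interpret the tags instead of reading them aloud
loVoice.Speak(lcXML, SVSFIsXML)
```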

 

Gotcha: Watch your Speaking Flags

 

If a voice is paused and you then call Speak without the asynchronous flag, SAPI waits for the voice to be resumed before continuing.

 

loVoice1.Speak("Hello")

loVoice1.Pause

loVoice1.Speak("Goodbye")

 

In the above case, after the voice says Hello, it pauses. If you don't provide a Resume mechanism, it will never continue to the next line.
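
One way around this, sketched below, is to queue the second call asynchronously so the program keeps control and can call Resume itself. The WAIT WINDOW prompt and the timeout value are placeholders for whatever Resume mechanism your application provides.

```foxpro
LOCAL loVoice1
loVoice1 = CREATEOBJECT("SAPI.SpVoice")

loVoice1.Speak("Hello")        && synchronous: returns after "Hello" is spoken
loVoice1.Pause
loVoice1.Speak("Goodbye", 1)   && asynchronous: queued, control returns at once

* The program keeps running here while the voice is paused
WAIT WINDOW "Press any key to hear the rest"

loVoice1.Resume                && the queued "Goodbye" can now be spoken
loVoice1.WaitUntilDone(5000)   && give it time before the object goes away
```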

 

The Skip method can be used to skip ahead sentences.

 

lcText = "In September, I went to San Diego. It was a lot of fun. " + ;

"I wonder where it will be next year? " + ;

"It must be on the east coast"

 

loVoice1.Speak(lcText,1)

loVoice1.Skip("Sentence",2)

 

This skips ahead from the current sentence by two sentences.

 

To stop the speech completely, release the spVoice object.
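
The sketch below shows that approach, along with an alternative that keeps the object alive: speaking an empty string with the Purge Before Speak flag (2) from Table 1, which discards whatever is queued. The sample sentence and the one-second delay are arbitrary.

```foxpro
LOCAL loVoice1
loVoice1 = CREATEOBJECT("SAPI.SpVoice")
loVoice1.Speak("This is a long passage that will be cut short.", 1)

* Let it talk for a second
WAIT WINDOW "Speaking..." TIMEOUT 1

* Alternative: purge the queue but keep the object for later use
loVoice1.Speak("", 1 + 2)

* The approach described above: release the voice object
RELEASE loVoice1
```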

 

 

But What Am I Saying?

 

Voice recognition has come a long way in the past few years. It may not be perfect enough for dictation of technical manuals but it IS good enough for command syntax. There are lots of packages out there today (Dragon, SpeakNow) that offer command control and dictation. But what about your application? Here's where SAPI comes in.

 

SAPI includes a number of different classes for speech recognition but to get it working in your application, you only have to know two of them: RecognitionContext and its associated event object. Thanks to the VFP 7 Object Browser, it's easy to add this functionality in (figure 1 shows the Object Browser highlighting the necessary interface).

 

Figure 1 The VFP 7 Object Browser makes quick work of creating a link to the SAPI Events interface.

 

Drag the interface for _ISpeechRecoContextEvents into a piece of code and the code is all there. I've cut out the extra code to make it easier to read below.

 

 

Gotcha: Deciding Which Interface To Use With the Object Browser

 

When working with the Object Browser, figuring out which interface to use with IMPLEMENTS can get tricky. The secret is to choose the ones that start with an underscore.

 

 

Here is the base class definition generated by the Object Browser, required to understand basic words:

 

 

DEFINE CLASS myclass AS session OLEPUBLIC

 

IMPLEMENTS _ISpeechRecoContextEvents IN ;

"c:\program files\common files\microsoft shared\speech\sapi.dll"

 

PROCEDURE _ISpeechRecoContextEvents_Recognition(;

StreamNumber AS Number, ;

StreamPosition AS VARIANT, ;

RecognitionType AS VARIANT, ;

Result AS VARIANT) AS VOID;

HELPSTRING "Recognition"

 

? Result.PhraseInfo.GetText

 

ENDPROC

 

ENDDEFINE

 

I haven't included all of the procedures, for the sake of brevity. The key method here is the Recognition event. When a word or phrase is recognized, the Recognition event fires and is passed where the information is coming from (the stream) and the resulting recognized phrase.

 

This next code hooks into this class using VFP 7's EVENTHANDLER() function.

 

PUBLIC oRecognize, oVFPObj,ogrammar

 

oVFPObj = CREATEOBJECT("myclass")

oRecognize = CREATEOBJECT("SAPI.spsharedrecocontext")

 

EVENTHANDLER(oRecognize,oVFPObj)

 

 

The class spSharedRecoContext is simply the interface by which an application hooks into the Speech Recognition engine. It also controls which words and phrases are available for the user to speak. This code turns VFP into a Dictation machine.

 

oGrammar = oRecognize.CreateGrammar

oGrammar.DictationSetState(1)

 

The oGrammar object is what is actively used to recognize the words. The call to DictationSetState activates the Grammar object. Call DictationSetState with a 0 to deactivate it.

 

When this program is running, every word I speak into my headset is recognized and then displayed on the screen. The first time you do this, the accuracy leaves a little to be desired but you can "train" SAPI for each voice so it gets smarter. The following code starts making decisions based on what the user says.

 

lcCommand = UPPER(Result.PhraseInfo.GetText)

DO CASE

CASE "PICKUP"$lcCommand

MESSAGEBOX("Pick up this load")

 

CASE "DELIVER"$lcCommand

MESSAGEBOX("Deliver this load")

ENDCASE

That's all there is to getting started. Using the SAPI Recognition engine, you can allow dictation into memo fields or let your users run your application from their voice commands, all with just a few lines of code added to your application.
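
To keep the event procedure small, the decision logic can live in its own routine. The sketch below is one possible arrangement, not the article's original code: HandleCommand is a hypothetical helper that you would call from the Recognition event with Result.PhraseInfo.GetText. It also checks for "PICK UP" as two words, on the assumption that dictation may return the phrase that way.

```foxpro
* Try the helper with a made-up phrase
DO HandleCommand WITH "please pick up the next load"

* Hypothetical helper: pass it the recognized text from the
* Recognition event
PROCEDURE HandleCommand(tcPhrase)
    LOCAL lcCommand
    lcCommand = UPPER(ALLTRIM(tcPhrase))

    DO CASE
    CASE "PICKUP" $ lcCommand OR "PICK UP" $ lcCommand
        MESSAGEBOX("Pick up this load")
    CASE "DELIVER" $ lcCommand
        MESSAGEBOX("Deliver this load")
    OTHERWISE
        * Not a command: show the text so the user sees what was heard
        ? "Heard: " + tcPhrase
    ENDCASE
ENDPROC
```

Separating the routine this way also lets you test the command handling from the Command Window without a microphone.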

 

Distribution

 

The SAPI 5.1 SDK available from the MSDN web site includes several Windows Installer Merge Modules for installation, including support for the English, Chinese and Japanese languages. This lets you build your installation routine to ensure the SAPI objects are registered during installation. Be warned, though: the file sizes are large. The download for the Merge Modules is over 120 MB.

 

Conclusion

 

Making applications easy isn't always about following the standard interfaces and such; it's about separating requirement from "cool for the sake of cool". For someone who's never used a keyboard, the entire QWERTY approach isn't easy. Easy is a relative term.

 

Speech recognition is now a technology that can legitimately and efficiently be used in some industries. It isn't for every application. I don't think anyone is ready to sit back and listen to a report being read instead of looking at it on the screen. But for hands-free based applications or simply as a way to make an application an easier tool to learn, SAPI definitely offers developers an easy way to implement it.