Four Things To Know About the iOS Speech Framework
Voice experiences have become more common through interactions with assistants like Alexa, Google Assistant, and Siri. What if you wanted to create a voice experience directly in your iOS app? Here are key things to keep in mind when working with Apple’s Speech framework:
Requests may require a connection to Apple’s servers
Apple’s Speech framework may require access to the internet, depending on the language being recognized. This is important to keep in mind if your app needs to work offline or if some of your users don’t have reliable internet access. On iOS 13 and later, you can ask for recognition to stay on the device, which sidesteps the server requirement for supported languages.
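Here’s a minimal sketch of that check, assuming iOS 13 or later (makeOfflineFriendlyRequest is a hypothetical helper name, not part of the framework):

import Speech

// Hypothetical helper: prefers on-device recognition when the recognizer
// supports it, so the request never reaches Apple's servers. Assumes iOS 13+.
func makeOfflineFriendlyRequest(url: URL, recognizer: SFSpeechRecognizer) -> SFSpeechURLRecognitionRequest {
    let request = SFSpeechURLRecognitionRequest(url: url)
    if recognizer.supportsOnDeviceRecognition {
        // Keep all processing on the device; no audio leaves the phone.
        request.requiresOnDeviceRecognition = true
    }
    return request
}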
Apple allows for speech recognition on prerecorded or live audio
The classes that allow for this are SFSpeechURLRecognitionRequest and SFSpeechAudioBufferRecognitionRequest. SFSpeechURLRecognitionRequest is helpful when developing in a noisy environment, since the speech can be prerecorded so that surrounding noise doesn’t interfere with recognition. These are the steps for setting up an SFSpeechURLRecognitionRequest:
- Record your speech.
- Add the audio file to your Xcode project.
- Double-check that the audio file has been added to the app target.
- Use the Bundle class to get the URL for the audio file.
- Pass the URL to the SFSpeechURLRecognitionRequest initializer.
var recognitionRequest: SFSpeechURLRecognitionRequest?

// Look up the bundled audio file and create a request that points at it.
if let audioURL = Bundle.main.url(forResource: "prerecorded-audio", withExtension: "m4a") {
    recognitionRequest = SFSpeechURLRecognitionRequest(url: audioURL)
}
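For live audio, SFSpeechAudioBufferRecognitionRequest works with audio buffers instead of a file. Here’s a minimal sketch of feeding it microphone audio with AVAudioEngine, assuming the app has already requested microphone and speech recognition permission:

import Speech
import AVFoundation

let audioEngine = AVAudioEngine()
let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

// Tap the microphone input and append each incoming buffer to the request.
let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
    recognitionRequest.append(buffer)
}

do {
    audioEngine.prepare()
    try audioEngine.start()
} catch {
    print("Audio engine failed to start: \(error)")
}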
Check information after speech is final
SFTranscriptionSegment provides a way to access information about individual words in the recognized speech. It can be used to check the confidence level, timestamp, and duration of each word. Later in this post is an example of printing out the confidence levels. A confidence level is a decimal between 0 and 1; the closer to 1, the more confident Apple is in the accuracy of the recognized speech. It’s important to make sure the recognition task is finalized before checking those values, though. If it isn’t, the values could change by the time the task is finalized. The most likely reason is that calculating them while the user is speaking is computationally intensive, and they can shift as more speech comes in. I noticed confidence levels of 0% when the task wasn’t finalized, and confidence levels as high as 90% after it was. How is a recognition task finalized? By calling finish() on the SFSpeechRecognitionTask instance.
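As a quick illustration of what’s available on each segment, here’s a sketch that prints the word, confidence, timestamp, and duration, assuming result is a final SFSpeechRecognitionResult:

// Inspect each recognized word once the result is final.
for segment in result.bestTranscription.segments {
    print("\(segment.substring): " +
          "confidence \(segment.confidence), " +
          "starts at \(segment.timestamp)s, " +
          "lasts \(segment.duration)s")
}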
Use a timer to stop speech recognition
What if you wanted to stop speech recognition after the user has stopped talking for a couple of seconds? A possible solution is to create a timer that calls finish() on the recognition task after two seconds in which no new speech is recognized. Since the task would then be finalized and there’d be access to an instance of SFSpeechRecognitionResult, the confidence, timestamp, and duration values would be accurate. Here’s how that might look:
var timer: Timer?
var recognitionTask: SFSpeechRecognitionTask?
...
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] (result, _) in
    guard let self = self, let result = result else { return }
    if result.isFinal {
        // Check info on recognized speech here
        let transcriptionSegments = result.bestTranscription.segments
        let confidenceLevel = transcriptionSegments.reduce(0) { $0 + $1.confidence } / Float(transcriptionSegments.count)
        print("\(confidenceLevel) - \(result.bestTranscription.formattedString)")
    } else {
        // New speech arrived: restart the two-second countdown.
        self.timer?.invalidate()
        self.timer = Timer.scheduledTimer(timeInterval: 2,
                                          target: self,
                                          selector: #selector(self.finalizeSpeechRecognition),
                                          userInfo: nil,
                                          repeats: false)
    }
}
...
@objc func finalizeSpeechRecognition() {
    // Finalizing the task locks in the confidence, timestamp, and duration values.
    recognitionTask?.finish()
}
Conclusion
The important things to remember when working with Apple’s Speech framework are:
- An internet connection may be necessary, since some languages require reaching out to Apple’s servers.
- Apple allows for speech recognition on audio files and live recordings.
- Be sure that recognitionTask.finish() is called before checking info on the recognized speech.
- A timer can help stop speech recognition after the user has stopped speaking.