iOS SFSpeechRecognizer – On Device Recognition
SFSpeechRecognizer has been updated in iOS 13 to allow recognition and analysis of speech on the device, completely offline, with no data leaving the phone
Apple showcased its advancements in the fields of machine learning and artificial intelligence at WWDC 2019. One feature that sheds light on those ambitions is on-device speech recognition in iOS 13.
Scope
On-device speech recognition increases the user’s privacy by keeping their data off the Cloud. Apple strives to give voice-based AI a major boost through this enhanced speech recognition.
The newly upgraded speech recognition API lets you do a variety of things, like tracking voice quality and speech patterns using the voice analytics metrics.
From providing automated feedback based on recordings to comparing the speech patterns of individuals, there’s so much you can do in the field of AI using on-device speech recognition.
Of course, there are certain trade-offs to consider with on-device speech recognition. Unlike the server, the on-device model doesn’t benefit from continuous learning, which can make it less accurate. Moreover, language support is currently limited to about ten languages.
Nonetheless, on-device support lets you run speech recognition for an unlimited amount of time, a big win over the server’s previous one-minute-per-recording limit.
SFSpeechRecognizer
SFSpeechRecognizer is the engine that drives speech recognition.
In iOS 13, SFSpeechRecognizer is smart enough to recognize spoken punctuation. Saying "dot" adds a full stop, and saying "comma", "dash", or "question mark" inserts the respective mark (, — ?) in the transcription. For example, dictating "are you coming question mark" comes out as "are you coming?".
Our Goal
Developing an on-device speech recognition iOS application that transcribes live audio. An illustration of what we’ll achieve by the end of this article is given below:
Did you notice?
The above screengrab was taken in flight mode.
Without wasting any more time, let’s tap into the microphone and begin our journey toward building an on-device speech recognition application.
In the following sections, we’ll skip the UI and aesthetics and dive straight into the Speech and audio frameworks. Let’s get started.
Adding Privacy Usage Description
For starters, you need to include privacy usage descriptions for the microphone and speech recognition in your Info.plist, as shown below.
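The two keys involved are NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription. The description strings below are only placeholders, so word them for your own app:

NSMicrophoneUsageDescription: "We use the microphone to capture your speech."
NSSpeechRecognitionUsageDescription: "We transcribe your speech into text, on the device."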
Next, import Speech in your ViewController class to access the Speech framework in your application.
Requesting permissions
We need to request the user’s authorization before using speech recognition. A minimal sketch of that request looks like this (a real app would typically enable or disable its recording UI based on the status instead of just logging it):
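SFSpeechRecognizer.requestAuthorization { authStatus in
    // The completion handler may run on a background queue, so hop to the main queue before touching UI.
    DispatchQueue.main.async {
        switch authStatus {
        case .authorized:
            print("Speech recognition authorized")
        case .denied, .restricted, .notDetermined:
            print("Speech recognition is not available")
        @unknown default:
            break
        }
    }
}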
The SFSpeechRecognizer is responsible for generating your transcriptions through an SFSpeechRecognitionTask. For this to happen, we must first initialize our SFSpeechRecognizer:
var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))
In the above code, you pass the locale identifier for the language you want to recognize. It’s English (India) in my case.
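If you’re not sure which identifier to use, SFSpeechRecognizer can list every locale it supports. A quick way to print them during development (purely a debugging aid, not part of the final app):

for locale in SFSpeechRecognizer.supportedLocales().sorted(by: { $0.identifier < $1.identifier }) {
    // Prints the identifier of each locale the recognizer supports.
    print(locale.identifier)
}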
Speech Recognition: Under the Hood
An illustration of how on-device speech recognition works is depicted below:
As can be seen from the above illustration, there are four pillars on which any speech recognition application rests:
AVAudioEngine
SFSpeechRecognizer
SFSpeechRecognitionTask
SFSpeechAudioBufferRecognitionRequest
We’ll see the role each of these plays in building our speech recognition application in the next sections.
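Concretely, these map to the following properties in the view controller (the same declarations reappear in the full listing at the end of this article):

private let audioEngine = AVAudioEngine()
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))
private var recognitionTask: SFSpeechRecognitionTask?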
Implementation
Setting up the audio engine
The AVAudioEngine is responsible for receiving the audio signals from the microphone. It provides our input for speech recognition.
let audioEngine = AVAudioEngine()

// Configure the shared audio session for recording.
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

// Install a tap on the microphone's input node and feed every captured buffer to the recognition request.
let inputNode = audioEngine.inputNode
inputNode.removeTap(onBus: 0)
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}

audioEngine.prepare()
try audioEngine.start()
The above code installs a tap on the inputNode and sets the buffer size for its output. Once that buffer is filled (by the audio you speak or record), it’s appended to the SFSpeechAudioBufferRecognitionRequest.
Now let’s see how the SFSpeechAudioBufferRecognitionRequest works with the SFSpeechRecognizer and SFSpeechRecognitionTask to transcribe speech to text.
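Creating the request itself is a one-liner. We also ask for partial results so the transcription updates while the user is still speaking (both lines reappear in the full listing later):

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
recognitionRequest?.shouldReportPartialResults = true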
Enabling on-device speech recognition
The following code enables on-device speech recognition on a phone:
recognitionRequest.requiresOnDeviceRecognition = true
Setting requiresOnDeviceRecognition to false would use the Apple Cloud for speech recognition instead.
Do note that on-device speech recognition works only on iOS 13, macOS Catalina, and later. It also requires Apple’s A9 or newer processor, which on iPhone means the iPhone 6s or later.
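Because support also varies by device and locale, it’s safer to check supportsOnDeviceRecognition before setting the flag, as the full snippet later in this article does:

if #available(iOS 13, *), speechRecognizer?.supportsOnDeviceRecognition == true {
    // Only force on-device recognition when this recognizer supports it;
    // otherwise the request simply falls back to server-based recognition.
    recognitionRequest.requiresOnDeviceRecognition = true
}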
Creating a speech recognition task
An SFSpeechRecognitionTask is used to run the SFSpeechAudioBufferRecognitionRequest with the SFSpeechRecognizer. In return, it provides the result instance from which we can access the different speech properties.
recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        DispatchQueue.main.async {
            // Show the best transcription so far in the text view.
            let transcribedString = result.bestTranscription.formattedString
            self.transcribedText.text = transcribedString
        }
    }
    if error != nil {
        // Stop the engine and release the request and task if recognition fails.
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
    }
}
In the above code, a lot is happening, so let’s break it down into pieces.

Firstly, we cancel any previous recognition task when startRecording is called.
Next, we create the recognition task from the SFSpeechRecognizer and the recognition request.
Setting shouldReportPartialResults to true lets us access intermediate results while an utterance is still in progress.
result.bestTranscription returns the transcription with the highest confidence, and its formattedString property gives the transcribed text.
We can also access other properties such as speakingRate, averagePauseDuration, or segments.
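For example, a quick sketch of reading those transcription-level metrics inside the same result handler (result being the SFSpeechRecognitionResult from the callback above):

let transcription = result.bestTranscription
// speakingRate is measured in words per minute, averagePauseDuration in seconds (both are iOS 13 additions).
print("Speaking rate:", transcription.speakingRate)
print("Average pause duration:", transcription.averagePauseDuration)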
SFVoiceAnalytics
SFVoiceAnalytics is the newly introduced class that contains a collection of voice metrics for tracking features such as pitch, shimmer, and jitter from the speech result. These metrics can be accessed through the segments property of the transcription:
for segment in result.bestTranscription.segments {
    guard let voiceAnalytics = segment.voiceAnalytics else { continue }
    // Each metric is an SFAcousticFeature; acousticFeatureValuePerFrame holds one value per audio frame.
    let pitch = voiceAnalytics.pitch.acousticFeatureValuePerFrame
    let voicing = voiceAnalytics.voicing.acousticFeatureValuePerFrame
    let jitter = voiceAnalytics.jitter.acousticFeatureValuePerFrame
    let shimmer = voiceAnalytics.shimmer.acousticFeatureValuePerFrame
}
Start recording and transcribing
Now that we’ve defined each of the four components, it’s time to merge the pillars in order to start recording and display the transcriptions in a UITextView. The following code snippet does that for you.
private let audioEngine = AVAudioEngine()
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))
private var recognitionTask: SFSpeechRecognitionTask?

func startRecording() throws {

    // Cancel any previous recognition task and release it.
    recognitionTask?.cancel()
    self.recognitionTask = nil

    // Configure the shared audio session for recording.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // Create and configure the recognition request before installing the tap,
    // so that no captured audio is dropped.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
    recognitionRequest.shouldReportPartialResults = true

    // Keep recognition on the device whenever it's supported.
    if #available(iOS 13, *) {
        if speechRecognizer?.supportsOnDeviceRecognition ?? false {
            recognitionRequest.requiresOnDeviceRecognition = true
        }
    }

    // Route microphone buffers into the recognition request.
    let inputNode = audioEngine.inputNode
    inputNode.removeTap(onBus: 0)
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()

    // Start the recognition task and push every new transcription to the text view.
    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
        if let result = result {
            DispatchQueue.main.async {
                let transcribedString = result.bestTranscription.formattedString
                self.transcribedText.text = transcribedString
            }
        }
        if error != nil {
            // Tear everything down if recognition fails.
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
        }
    }
}
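To end a session cleanly, a matching stopRecording() (a hypothetical helper, not part of the original listing) can tear the pieces down and tell the recognizer that no more audio is coming:

func stopRecording() {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    // endAudio() signals that the recording has finished, so the task can deliver its final result.
    recognitionRequest?.endAudio()
    recognitionTask?.finish()
    recognitionRequest = nil
    recognitionTask = nil
}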
Conclusion
The above implementation should return an outcome similar to the screengrab at the start of this article. The full source code of the application is available in this GitHub repository.
That sums up on-device speech recognition in iOS 13 from my side. This new upgrade should come in handy when used in tandem with sound classifiers and natural language processing.
I hope you enjoyed reading this. Now start building your own voice-based AI applications using the new SFSpeechRecognizer.