iOS Text Recognition Using Vision And Core ML

Vision and Core ML frameworks were the highlights of WWDC 2017. Vision is a powerful framework used to implement computer vision features without much prior knowledge of algorithms.

Things such as barcode, face, object, and text detection can be easily done using Vision.
At the same time, Core ML allows us to integrate and run pre-trained models in our iOS Applications without digging too deep into Machine Learning.

Our Goal

Our goal for today is to build an iOS Application that identifies texts in a still image.

Just like when you search for keywords with Cmd + F and all the matching strings get highlighted on the screen, we’ll be highlighting a few selected strings in an image.

Before getting down to the business end, let’s breeze through the topics we’ll cover.

Topics Covered

  • Capturing Image Using Camera or Gallery

  • Text Detection Using Vision

  • Text Recognition Using Core ML

  • Drawing Bounding Boxes on certain keywords

What we want to achieve

We wish to highlight some of the detected texts after recognizing them in an image captured from the camera/gallery, as shown below:

We’ll call this application FindMyText, inspired by the name Find My iPhone!

Without wasting any more time, let’s get started. Launch Xcode and create a Single View Application.

Image Picker Controller

We won’t be focusing on the Storyboard since it’s pretty basic (just a UIImageView and a UIButton). The idea is to upload images containing text from the photos library.

guard UIImagePickerController.isSourceTypeAvailable(.camera) else {
    presentPhotoPicker(sourceType: .photoLibrary)
    return
}
let photoSourcePicker = UIAlertController()
let takePhoto = UIAlertAction(title: "Camera", style: .default) { [unowned self] _ in
    self.presentPhotoPicker(sourceType: .camera)
}
let choosePhoto = UIAlertAction(title: "Photos Library", style: .default) { [unowned self] _ in
    self.presentPhotoPicker(sourceType: .photoLibrary)
}
photoSourcePicker.addAction(takePhoto)
photoSourcePicker.addAction(choosePhoto)
photoSourcePicker.addAction(UIAlertAction(title: "Cancel", style: .cancel, handler: nil))
present(photoSourcePicker, animated: true)

presentPhotoPicker is used to launch the appropriate source. Once an image is picked, we start the Vision request.

extension ViewController: UIImagePickerControllerDelegate, UINavigationControllerDelegate {

    func imagePickerController(_ picker: UIImagePickerController, didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        picker.dismiss(animated: true)
        guard let uiImage = info[UIImagePickerController.InfoKey.originalImage] as? UIImage else {
            return
        }
        imageView.image = uiImage
        createVisionRequest(image: uiImage)
    }

    private func presentPhotoPicker(sourceType: UIImagePickerController.SourceType) {
        let picker = UIImagePickerController()
        picker.delegate = self
        picker.sourceType = sourceType
        present(picker, animated: true)
    }
}

It’s time for some insights into the Vision Framework!

Vision Framework

The Vision framework was introduced with iOS 11. It brings algorithms for image recognition and analysis which, as per Apple, are more accurate than the Core Image framework. A significant contributor to this is the underlying use of machine learning, deep learning, and computer vision.

Implementing the framework consists of three important steps:

  • Request - Create a request to detect a type of object. You can set more than one type to be detected.

  • Request Handler - This executes the requests and processes their results.

  • Observation - The results are stored in the form of observation.

Some important classes which are a part of the Vision framework are:

  • VNRequest - The base class for image analysis requests; an array of these is passed to the handler for processing.

  • VNObservation - Wraps the results produced by an analysis request.

  • VNImageRequestHandler - processes one or more VNRequest on a given image.

The following snippet shows how to create a Vision Image Request Handler.

func createVisionRequest(image: UIImage) {
    currentImage = image
    guard let cgImage = image.cgImage else {
        return
    }
    // cgImageOrientation is a small UIImage extension mapping UIImage.Orientation
    // to CGImagePropertyOrientation.
    let requestHandler = VNImageRequestHandler(cgImage: cgImage, orientation: image.cgImageOrientation, options: [:])
    let vnRequests = [vnTextDetectionRequest]
    DispatchQueue.global(qos: .background).async {
        do {
            try requestHandler.perform(vnRequests)
        } catch let error as NSError {
            print("Error in performing Image request: \(error)")
        }
    }
}

We could have passed multiple requests, but the goal of this article is text detection and recognition.

The vnTextDetectionRequest is defined in the code below:

var vnTextDetectionRequest: VNDetectTextRectanglesRequest {
    let request = VNDetectTextRectanglesRequest { (request, error) in
        if let error = error as NSError? {
            print("Error in detecting - \(error)")
            return
        } else {
            guard let observations = request.results as? [VNTextObservation] else {
                return
            }
            var numberOfWords = 0
            for textObservation in observations {
                var numberOfCharacters = 0
                for rectangleObservation in textObservation.characterBoxes! {
                    let croppedImage = crop(image: self.currentImage, rectangle: rectangleObservation)
                    if let croppedImage = croppedImage {
                        let processedImage = preProcess(image: croppedImage)
                        self.imageClassifier(image: processedImage,
                                             wordNumber: numberOfWords,
                                             characterNumber: numberOfCharacters,
                                             currentObservation: textObservation)
                        numberOfCharacters += 1
                    }
                }
                numberOfWords += 1
            }
            DispatchQueue.main.asyncAfter(deadline: .now() + 3, execute: {
                self.drawRectanglesOnObservations(observations: observations)
            })
        }
    }
    request.reportCharacterBoxes = true
    return request
}

There’s plenty of stuff going on in the above code snippet.
Let’s break it down.

  • The observations are the results returned by the request.

  • Our goal is to highlight the detected texts with bounding boxes, hence we’ve cast the observations to VNTextObservation, which exposes the characterBoxes.

  • We crop the detected text part of the image. These cropped images act as micro-inputs for our ML model.

  • We feed these images to the Core ML model for classification after resizing them to the required input size.

The code for cropping and preprocessing is available in the ImageUtils.swift file attached at the end of this project.

Let’s take a look at Core ML and how it’s relevant to us at this stage.

Core ML Framework

Core ML is a framework that lets developers use ML Models easily in their applications.
With the help of this framework, the input data can be processed to return the desired output.

In this project, we’re using an alphanum_28X28 ML model.
This model requires input images of size 28×28 and returns the detected character.

Resizing the images happens in the preprocess function we just saw earlier.
observationStringLookup is a lookup dictionary that binds each Observation to its text predicted by the Core ML model.
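The real preprocessing lives in ImageUtils.swift, but as a rough, hypothetical illustration of what a 28×28 classifier ultimately consumes, here is a minimal normalization sketch (the function name and pixel layout are assumptions, not the project’s actual code):

```swift
import Foundation

// Hypothetical sketch: flatten a 28×28 grayscale image (raw bytes, 0–255)
// into the [0, 1] floats a character classifier typically expects.
func normalizePixels(_ pixels: [UInt8], width: Int = 28, height: Int = 28) -> [Float]? {
    // Reject inputs that don't match the model's expected size.
    guard pixels.count == width * height else { return nil }
    return pixels.map { Float($0) / 255.0 }
}
```

The actual model input is wired up by Vision and Core ML behind the scenes; this only illustrates the shape of the data.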

To determine the text, we have our own Image Classifier that runs on the resized image input:

func imageClassifier(image: UIImage, wordNumber: Int, characterNumber: Int, currentObservation: VNTextObservation) {
    let request = VNCoreMLRequest(model: model) { [weak self] request, error in
        guard let results = request.results as? [VNClassificationObservation],
              let topResult = results.first else {
            fatalError("Unexpected result type from VNCoreMLRequest")
        }
        let result = topResult.identifier
        let classificationInfo: [String: Any] = ["wordNumber": wordNumber,
                                                 "characterNumber": characterNumber,
                                                 "class": result]
        self?.handleResult(classificationInfo, currentObservation: currentObservation)
    }
    guard let ciImage = CIImage(image: image) else {
        fatalError("Could not convert UIImage to CIImage :(")
    }
    let handler = VNImageRequestHandler(ciImage: ciImage)
    DispatchQueue.global(qos: .userInteractive).async {
        do {
            try handler.perform([request])
        } catch {
            print("Error in performing Core ML request: \(error)")
        }
    }
}

func handleResult(_ result: [String: Any], currentObservation: VNTextObservation) {
    guard let wordNumber = result["wordNumber"] as? Int else { return }
    guard let characterNumber = result["characterNumber"] as? Int else { return }
    guard let characterClass = result["class"] as? String else { return }
    if textMetadata[wordNumber] == nil {
        textMetadata[wordNumber] = [characterNumber: characterClass]
    } else {
        var tmp = textMetadata[wordNumber]!
        tmp[characterNumber] = characterClass
        textMetadata[wordNumber] = tmp
    }
    DispatchQueue.main.async {
        self.doTextDetection(currentObservation: currentObservation)
    }
}

func doTextDetection(currentObservation: VNTextObservation) {
    var result: String = ""
    if textMetadata.isEmpty {
        print("The image does not contain any text.")
        return
    }
    let sortedKeys = textMetadata.keys.sorted()
    for sortedKey in sortedKeys {
        result += word(fromDictionary: textMetadata[sortedKey]!) + " "
    }
    observationStringLookup[currentObservation] = result
}

func word(fromDictionary dictionary: [Int: String]) -> String {
    let sortedKeys = dictionary.keys.sorted()
    var word: String = ""
    for sortedKey in sortedKeys {
        word += dictionary[sortedKey]!
    }
    return word
}

textMetadata is used to store all the predicted words.
Now that observationStringLookup is created, we can highlight the selected observations (the words Vision and Core ML were highlighted in the final output, as we saw at the start of this article).
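To make that bookkeeping concrete, here is a toy, self-contained sketch (with made-up predictions) of how a textMetadata structure collapses into the word strings stored in observationStringLookup:

```swift
import Foundation

// Toy illustration: textMetadata maps word number -> (character number -> class).
// Sorting both levels of keys reassembles the words, mirroring word(fromDictionary:)
// and doTextDetection from the code above.
func assemble(_ textMetadata: [Int: [Int: String]]) -> String {
    var result = ""
    for wordKey in textMetadata.keys.sorted() {
        let characters = textMetadata[wordKey]!
        for charKey in characters.keys.sorted() {
            result += characters[charKey]!
        }
        result += " " // each word gets a trailing separator, like the original
    }
    return result
}

// Hypothetical predictions for two words:
let metadata: [Int: [Int: String]] = [
    0: [0: "V", 1: "I", 2: "S", 3: "I", 4: "O", 5: "N"],
    1: [0: "A", 1: "P", 2: "I"]
]
// assemble(metadata) == "VISION API "
```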

Vision And Bounding Boxes

Now we know the texts detected by Vision in VNTextObservations. Each observation has a bounding box property.

The labels of each of those observations were predicted by the Core ML Image classifier from the previous section.

So we can simply draw the rectangles on the texts.

The method below implements that for us and highlights the words “Vision” and “Core ML” in the image.

func drawRectanglesOnObservations(observations: [VNDetectedObjectObservation]) {
    DispatchQueue.main.async {
        guard let image = self.imageView.image else {
            print("Failure in retrieving image")
            return
        }
        let imageSize = image.size
        // Vision's boxes are normalized with a bottom-left origin, so flip the
        // y-axis and scale up to the image's size.
        var imageTransform = CGAffineTransform.identity.scaledBy(x: 1, y: -1).translatedBy(x: 0, y: -imageSize.height)
        imageTransform = imageTransform.scaledBy(x: imageSize.width, y: imageSize.height)
        UIGraphicsBeginImageContextWithOptions(imageSize, true, 0)
        let graphicsContext = UIGraphicsGetCurrentContext()
        image.draw(in: CGRect(origin: .zero, size: imageSize))
        graphicsContext?.setFillColor(red: 0, green: 1, blue: 0, alpha: 0.3)
        var previousString = ""
        let elements = ["VISION", "COREML"]
        observations.forEach { observation in
            var string = self.observationStringLookup[observation as! VNTextObservation] ?? ""
            let tempString = string
            string = string.replacingOccurrences(of: previousString, with: "")
            string = string.trim() // trim() is a small String extension in this project
            previousString = tempString
            if elements.contains(where: string.contains) {
                let observationBounds = observation.boundingBox.applying(imageTransform)
                graphicsContext?.addRect(observationBounds)
                graphicsContext?.drawPath(using: CGPathDrawingMode.fillStroke)
            }
        }
        let drawnImage = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()
        self.imageView.image = drawnImage
    }
}
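The CGAffineTransform chain above flips Vision’s normalized, bottom-left-origin coordinates into UIKit’s top-left pixel space. As a standalone sanity check (using a hypothetical 100×200 image), the same mapping can be written out by hand:

```swift
import Foundation

// The chain scale(1, -1) -> translate(0, -h) -> scale(w, h) boils down to:
// a normalized bottom-left-origin point (x, y) lands at (w·x, h·(1 − y))
// in top-left-origin image coordinates.
func imagePoint(fromNormalized x: Double, _ y: Double,
                width: Double, height: Double) -> (x: Double, y: Double) {
    return (width * x, height * (1 - y))
}

// A corner halfway up a hypothetical 100×200 image:
let corner = imagePoint(fromNormalized: 0.25, 0.5, width: 100, height: 200)
// corner == (x: 25.0, y: 100.0): halfway up in Vision space is
// halfway down in UIKit space.
```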

Note: The Core ML model may not give correct results on text in unfamiliar fonts.

With iOS 13, the newly upgraded Vision Framework now stores the recognized text in the Observation instance itself.
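If you can target iOS 13, a minimal sketch of that newer API looks like this (VNRecognizeTextRequest replaces the whole detect-then-classify pipeline built above):

```swift
import Vision

// iOS 13+: recognition happens inside Vision itself, so no separate
// Core ML character model is needed.
let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    for observation in observations {
        // Each observation now carries its own candidate strings.
        if let candidate = observation.topCandidates(1).first {
            print(candidate.string, observation.boundingBox)
        }
    }
}
request.recognitionLevel = .accurate
```

The request is then performed with a VNImageRequestHandler exactly as in createVisionRequest above.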

That’s a wrap for now. The full source code of this article is available here.