Facial Recognition Systems: Here’s what you need to know
Introduction
Building a robust face recognition system that is free of racial or gender bias is not an easy task. Algorithms don’t create bias on their own; it comes from humans.
The last decade was full of new state-of-the-art algorithms and groundbreaking research in the field of Deep Learning, and with them came a wave of new computer vision algorithms.
It all started in 2012, when AlexNet, a deep convolutional neural network, achieved high accuracy on the ImageNet dataset (a dataset with more than 14 million images).
How do humans recognise a face?
The neurons in our brain probably first locate the face in the scene (separating it from the body and the background), then extract the facial features, and finally use those features to classify the person. In a sense, we have been trained on an almost infinitely large dataset with an enormously large neural network.
Face Recognition in machines is implemented the same way. First, we apply a Face detection algorithm to detect faces in the scene, then extract facial features from the detected faces and use an algorithm to classify the person.
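To make this pipeline concrete, here is a minimal sketch of the three stages chained together. The function arguments (`detect_faces`, `embed_face`, `classify_embedding`) are hypothetical placeholders for the concrete detector, feature extractor, and classifier discussed in the rest of this post.

```python
import numpy as np

def recognize(image: np.ndarray, detect_faces, embed_face, classify_embedding):
    """Hypothetical three-stage pipeline: detect -> embed -> classify.

    detect_faces:        returns (x, y, w, h) boxes for faces in the image
    embed_face:          maps a face crop to an embedding vector
    classify_embedding:  maps an embedding vector to a person's name
    """
    names = []
    for (x, y, w, h) in detect_faces(image):         # 1. face detection
        face_crop = image[y:y + h, x:x + w]          # crop the detected face
        embedding = embed_face(face_crop)            # 2. feature extraction
        names.append(classify_embedding(embedding))  # 3. face classification
    return names
```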
The workflow of a Face Recognition system is as follows:
Face Detection
Face detection is a specialized version of object detection, where there is only one class to detect: the human face.
Just like the time-space trade-off in computer science, machine learning algorithms have a trade-off between inference speed and accuracy. There are many object detection algorithms out there, and each one strikes its own balance between speed and accuracy.
We evaluated different state-of-the-art object detection algorithms:
- OpenCV (Haar-Cascade)
- MTCNN
- YoloV3 and Yolo-Tiny
- SSD
- BlazeFace
- ShuffleNet and Faceboxes
To build a robust face detection system, we need an algorithm that is both accurate and fast enough to run in real-time on a GPU as well as on a mobile device.
Accuracy
During real-time inference on streaming video, people’s faces can appear in different poses, under occlusion, and in varying lighting. It is important to detect faces precisely across these lighting conditions and poses.
OpenCV (Haar-Cascade)
We started with the Haar-cascade implementation in OpenCV, an open-source computer vision library written in C/C++.
Pros: Since the library is implemented in C/C++, it is very fast for inference in real-time systems.
Cons: It was unable to detect side faces and performed poorly under different poses and lighting conditions.
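For reference, a minimal sketch of this detector using the frontal-face cascade that ships with OpenCV’s Python bindings (the input file name is a hypothetical example):

```python
import cv2

# Load the frontal-face Haar cascade bundled with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("group_photo.jpg")           # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale

# Returns a list of (x, y, w, h) boxes; tune scaleFactor/minNeighbors as needed.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```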
MTCNN
This algorithm is based on Deep Learning methods. It uses Deep Cascaded Convolutional Neural Networks for detecting faces.
Pros: It had better accuracy than the OpenCV Haar-Cascade method.
Cons: Higher run time
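This is roughly how MTCNN can be used via the community `mtcnn` package (the package name and API are assumptions based on its public documentation, not something from the original system):

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn

detector = MTCNN()
image = cv2.cvtColor(cv2.imread("group_photo.jpg"), cv2.COLOR_BGR2RGB)

# Each result contains a bounding box, a confidence score, and five facial
# landmarks (eyes, nose, mouth corners) produced by the cascaded CNNs.
for face in detector.detect_faces(image):
    x, y, w, h = face["box"]
    print(face["confidence"], face["keypoints"]["left_eye"])
```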
YOLOV3
YOLO (You Only Look Once) is a state-of-the-art Deep Learning algorithm for object detection. It stacks many convolutional layers to form a deep CNN model (“deep” meaning the model architecture is very large).
The original Yolo model can detect 80 different object classes with high accuracy. We used this model to detect only one class: the face.
We trained this algorithm on the WiderFace dataset (an image dataset containing 393,703 labeled faces).
There is also a smaller version of the Yolo algorithm available, Yolo-Tiny. Yolo-Tiny takes less computation time by compromising accuracy. We trained a Yolo-Tiny model on the same dataset, but its bounding-box results were not consistent.
Pros: Very accurate; we found no notable detection failures. Faster than MTCNN.
Cons: Since it has a very large deep neural network architecture, it needs more computational resources. Thus, it is slow to run on a CPU or mobile device, and on a GPU it takes more VRAM because of its large architecture.
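One way to run a Darknet-trained YOLOv3 face model is through OpenCV’s DNN module. A sketch under the assumption that you have a cfg/weights pair fine-tuned for the single “face” class (the file names below are hypothetical):

```python
import cv2
import numpy as np

# Hypothetical files from a YOLOv3 model fine-tuned on a face dataset.
net = cv2.dnn.readNetFromDarknet("yolov3-face.cfg", "yolov3-face.weights")

image = cv2.imread("group_photo.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row is [cx, cy, w, h, objectness, class scores...],
# with coordinates normalized to the [0, 1] range.
scale = np.array([image.shape[1], image.shape[0], image.shape[1], image.shape[0]])
for output in outputs:
    for det in output:
        if det[4] > 0.5:  # objectness threshold
            cx, cy, w, h = det[:4] * scale
            print("face at", int(cx - w / 2), int(cy - h / 2))
```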
SSD
SSD (Single Shot Detector) is also a deep convolutional neural network model like YOLO.
Pros: Good accuracy. Can detect faces in various poses, illuminations, and occlusions. Good inference speed.
Cons: Accuracy was inferior to the YOLO model, and although its inference speed was good, it was still not fast enough to run on a CPU, a low-end GPU, or mobile devices.
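For experimentation, OpenCV distributes a ResNet-10 SSD face detector (the `res10_300x300` Caffe model) with its samples. A minimal sketch, assuming the prototxt and caffemodel files have been downloaded locally:

```python
import cv2
import numpy as np

# ResNet-10 SSD face detector from OpenCV's samples; file names assume
# you have downloaded the prototxt and caffemodel beforehand.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

image = cv2.imread("group_photo.jpg")
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0, (300, 300),
                             (104.0, 177.0, 123.0))
net.setInput(blob)
detections = net.forward()  # shape: (1, 1, N, 7)

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        print("face:", box.astype(int), "confidence:", confidence)
```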
BlazeFace
As its name suggests, it is a blazingly fast face detection algorithm released by Google. It accepts a 128×128 pixel input image, and its inference time is sub-millisecond. It is optimized for mobile phones. The reasons it is so fast are:
- It is a specialized face detector, unlike YOLO and SSD, which were originally created to detect a large number of classes. Thus BlazeFace has a smaller deep convolutional neural network architecture than YOLO and SSD.
- It uses Depthwise Separable Convolution instead of standard Convolution layers which leads to fewer computations.
Pros: Very good inference speed and accurate face detection.
Cons: This model is optimized for detecting faces from a mobile phone camera, so it expects the face to cover most of the area in the image. It doesn’t work well when faces are small, so it performs poorly on CCTV camera images.
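BlazeFace is available through Google’s MediaPipe library. A minimal sketch using the legacy `mp.solutions` API (API details assumed from MediaPipe’s documentation):

```python
import cv2
import mediapipe as mp  # pip install mediapipe

image = cv2.cvtColor(cv2.imread("selfie.jpg"), cv2.COLOR_BGR2RGB)

# model_selection=0 targets short-range (selfie-style) faces, matching
# BlazeFace's design assumption that the face fills much of the frame.
with mp.solutions.face_detection.FaceDetection(
        model_selection=0, min_detection_confidence=0.5) as detector:
    results = detector.process(image)
    for detection in results.detections or []:
        box = detection.location_data.relative_bounding_box
        print(box.xmin, box.ymin, box.width, box.height)
```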
Faceboxes
The latest face detection algorithm we used is Faceboxes. Like BlazeFace, it is a deep convolutional neural network with a small architecture, designed for just one class: the human face. Its inference speed is real-time on a CPU, its accuracy is comparable to Yolo for face detection, and it can precisely detect both small and large faces in an image.
Pros: Fast inference speed and good accuracy.
Cons: Evaluation is in progress.
Feature Extraction
After detecting the faces in an image, we crop them and feed them to a feature extraction algorithm, which creates a face embedding: a multi-dimensional (typically 128- or 512-dimensional) vector representing the features of the face.
We used the FaceNet algorithm to create the face embeddings.
The embedding vector represents the facial features of a person’s face, so the embedding vectors of two different images of the same person will be close together, while those of two different people will be farther apart. The distance between two vectors is calculated using the Euclidean distance.
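A sketch of this embedding-and-distance idea, using the community `facenet-pytorch` implementation of FaceNet (the library choice here is ours for illustration, not necessarily what the original system used):

```python
import numpy as np
import torch
from facenet_pytorch import InceptionResnetV1  # pip install facenet-pytorch

# FaceNet-style model pretrained on VGGFace2; it maps a 160x160 aligned
# face crop to a 512-dimensional embedding vector.
model = InceptionResnetV1(pretrained="vggface2").eval()

def embed(face_crop: np.ndarray) -> np.ndarray:
    """face_crop: RGB uint8 array of shape (160, 160, 3)."""
    tensor = torch.from_numpy(face_crop).permute(2, 0, 1).float()
    tensor = (tensor - 127.5) / 128.0  # standard FaceNet input normalization
    with torch.no_grad():
        return model(tensor.unsqueeze(0))[0].numpy()

def distance(face_a: np.ndarray, face_b: np.ndarray) -> float:
    # Same person -> small Euclidean distance; different people -> large.
    return float(np.linalg.norm(embed(face_a) - embed(face_b)))
```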
Face Classification
After getting the face-embedding vectors, we trained a classification algorithm, K-Nearest Neighbors (KNN), to identify the person from their embedding vector.
Suppose an organization has 1,000 employees. We create face embeddings for all the employees and use the embedding vectors to train a classification algorithm that accepts a face-embedding vector as input and returns the name of the person.
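A minimal sketch of that classification step with scikit-learn’s `KNeighborsClassifier`, assuming the employee embeddings have already been computed and saved (the file names are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: one or more 512-d embeddings per employee.
embeddings = np.load("employee_embeddings.npy")  # shape: (n_samples, 512)
names = np.load("employee_names.npy")            # shape: (n_samples,)

# Euclidean distance matches how the embeddings are compared above.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(embeddings, names)

def identify(embedding: np.ndarray) -> str:
    """Return the predicted employee name for one query embedding."""
    return knn.predict(embedding.reshape(1, -1))[0]
```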
References
- Study finds gender and skin-type bias in commercial artificial-intelligence systems: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
- Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance
- OpenCV: https://opencv.org/
- YoloV3: https://arxiv.org/abs/1804.02767
- SSD: https://arxiv.org/abs/1512.02325
- MTCNN: https://arxiv.org/abs/1604.02878
- BlazeFace: https://arxiv.org/abs/1907.05047
- FaceBoxes: https://arxiv.org/abs/1708.05234
- FaceNet: https://arxiv.org/abs/1503.03832
Originally published at https://blog.engati.com on January 31, 2020.