Some of the most intriguing advancements brought by Deep Learning and Neural Networks are in the field of Computer Vision. Broadly, any problem that takes an image or camera feed as input can be treated as a computer vision problem.

Breakthroughs in Computer Vision

Self-driving cars, fMRI analysis, Mars exploration rovers, facial recognition systems, object detection and augmented reality are just a few breakthroughs in the field. In this blog we will take a look at a new type of Neural Network architecture called the Mask Region-based Convolutional Neural Network, Mask-RCNN for short, and in the process highlight some key problems in Computer Vision as well.

Mask-RCNN tackles the problem of instance segmentation: the process of detecting and delineating each distinct object of interest in an image. Instance segmentation is a combination of two sub-problems.

Object Detection

The first is Object Detection: the problem of finding and classifying a variable number of objects in an image. The number is variable because the count of objects detected can differ from image to image.
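To make the "variable number of objects" point concrete, here is a minimal sketch of what a detector's output could look like. The `Detection` class, the `keep_confident` helper, the 0.5 threshold and all the numbers are made up for illustration; they are not part of any Mask-RCNN library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                      # predicted class name
    score: float                    # confidence in [0, 1]
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Drop detections whose confidence is below min_score (assumed threshold)."""
    return [d for d in detections if d.score >= min_score]


# Hypothetical raw detector outputs for two different images.
street = [
    Detection("person", 0.98, (40, 20, 110, 200)),
    Detection("motorcycle", 0.91, (30, 90, 180, 230)),
    Detection("dog", 0.12, (200, 150, 240, 190)),
]
empty_road = [Detection("car", 0.30, (10, 10, 50, 40))]

# The number of objects kept differs from image to image.
print(len(keep_confident(street)))      # 2
print(len(keep_confident(empty_road)))  # 0
```

The output is a list rather than a fixed-size array precisely because the detector cannot know in advance how many objects an image contains.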

The second part of instance segmentation is Semantic Segmentation. Semantic Segmentation is the understanding of an image at the pixel level. That is, we want to assign an object class to each pixel in the image.
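Assigning a class to each pixel can be sketched as taking, for every pixel, the class with the highest score. The class list and the score values below are assumptions chosen for illustration; a real network would produce the scores.

```python
from typing import List

# Hypothetical label set; index in this list is the class id.
CLASSES = ["background", "person", "motorcycle"]


def semantic_segmentation(scores: List[List[List[float]]]) -> List[List[int]]:
    """Turn per-pixel class scores scores[y][x][c] into a class-id map[y][x].

    Each pixel gets the index of its highest-scoring class (argmax).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A made-up 2x2 "image": each pixel holds one score per class.
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],
]

label_map = semantic_segmentation(scores)
print(label_map)  # [[0, 1], [2, 2]]
print([[CLASSES[c] for c in row] for row in label_map])
```

Note that the result only says which class a pixel belongs to. Two people standing next to each other would both be labeled "person", with nothing to tell them apart.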

Instance Segmentation

In this figure with the motorcyclist, apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Using object detection and semantic segmentation together, we get instance segmentation.

Instance Segmentation

In these images, the bounding boxes come from object detection and the shaded masks are the output of semantic segmentation.
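The idea of combining the two outputs can be sketched as follows: for each detected box, keep only the pixels inside the box that carry the box's class. This is a deliberately simplified illustration with a hypothetical `instance_mask` helper and toy data, not the way Mask-RCNN computes its masks internally.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max is exclusive


def instance_mask(label_map: List[List[int]], box: Box, class_id: int) -> List[List[int]]:
    """Binary mask the size of label_map: 1 where a pixel is inside box AND has class_id."""
    x0, y0, x1, y1 = box
    return [
        [
            1 if (x0 <= x < x1 and y0 <= y < y1 and cls == class_id) else 0
            for x, cls in enumerate(row)
        ]
        for y, row in enumerate(label_map)
    ]


# Toy semantic map: 1 = person, 0 = background. Two people side by side.
label_map = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# Toy detector output: one box per person.
boxes = [(0, 0, 2, 2), (3, 0, 5, 2)]

masks = [instance_mask(label_map, box, class_id=1) for box in boxes]
for mask in masks:
    print(mask)
# [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0]]
# [[0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]
```

Both people share the same class in the semantic map, but each one ends up with its own mask, which is what distinguishes instance segmentation from semantic segmentation.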

Now that we have a high level intuition of instance segmentation, we’ll take a look at the architecture behind the Mask-RCNN.

Architecture of Mask RCNN

Since there are two sub-problems, the architecture has two parts. For object detection it uses an architecture similar to Faster-RCNN.

Faster-RCNN object detection architecture

For semantic segmentation it uses a Fully Convolutional Network (FCN).

Fully Convolutional Networks

So first off, what is an RCNN? It is an approach to bounding-box object detection that works by generating a number of candidate object regions, or regions of interest (ROIs), and classifying each one. A later version, Faster-RCNN, does a better job by incorporating an attention mechanism in the form of a region proposal network (RPN).

Determining bounding boxes

Faster-RCNN performs object detection in two stages. First, it determines the bounding boxes and hence the ROIs. This is done by the RPN.

Determining class label

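Second, for each ROI it determines the class label of the object. This is done by extracting a fixed-size feature from each ROI with ROI pooling and then classifying it.

Mask-RCNN incorporates both of these stages. But ROI pooling has a data loss problem. ROI pooling applies max pooling over a ROI, that is, over the bounding box produced by object detection, hence the name ROI pool. To do so it rounds the ROI coordinates to the feature map grid, so the pooled features no longer line up exactly with the original pixels. Mask-RCNN replaces this step with a layer called RoIAlign, which avoids the rounding.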
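To see where the data loss in ROI pooling comes from, here is a minimal sketch of max pooling over a ROI on a tiny feature map. The `roi_max_pool` function, the 2x2 output size and the example values are assumptions for illustration; real implementations also handle channels, batches and the scale between image and feature map.

```python
import math
from typing import List, Sequence


def roi_max_pool(feature_map: List[List[float]], roi: Sequence[float], out_size: int = 2) -> List[List[float]]:
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_size x out_size grid.

    The ROI is given in (possibly fractional) feature map units and is rounded
    to whole cells first. That rounding is the quantization that loses data.
    """
    x0, y0, x1, y1 = (round(v) for v in roi)
    height, width = y1 - y0, x1 - x0
    if height < 1 or width < 1:
        raise ValueError("ROI must cover at least one feature map cell")

    pooled = []
    for i in range(out_size):
        # Row range of the i-th output bin (bins may overlap for small ROIs).
        r_start = y0 + math.floor(i * height / out_size)
        r_end = y0 + math.ceil((i + 1) * height / out_size)
        row = []
        for j in range(out_size):
            c_start = x0 + math.floor(j * width / out_size)
            c_end = x0 + math.ceil((j + 1) * width / out_size)
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# The fractional ROI (1.4, 0.6, 3.3, 2.7) is snapped to (1, 1, 3, 3).
print(roi_max_pool(feature_map, (1.4, 0.6, 3.3, 2.7)))  # [[6, 7], [10, 11]]
```

In the last call the box is shifted by up to 0.4 of a cell before pooling. On a feature map that is much smaller than the input image, a fraction of a cell corresponds to several image pixels, which matters when the goal is a pixel-accurate mask.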

Detected human mask

Mask-RCNN outputs the object mask using pixel-to-pixel alignment. This mask is a binary mask output for each ROI. Facebook AI Research uses the COCO dataset for its Mask-RCNN implementation. The COCO dataset comprises more than 200,000 labeled images and about 1.5 million object instances across 80 object categories.

In short, Mask-RCNN is one of the best techniques for object detection right now. It has a wide range of possible uses, from everyday IoT products to medical applications.

In upcoming blogs we will be covering more of Mask-RCNN’s technical aspects and its implementations in TensorFlow and PyTorch.

Feel free to shoot your questions and valuable comments down below. 🙂
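The per-ROI binary mask mentioned above can be sketched as a thresholding step followed by placing the mask at the ROI's position in the image. The helper names, the 0.5 threshold as a default and the toy values are assumptions for illustration, and the sketch assumes the mask already has the same size as its box, whereas a real pipeline would resize it first.

```python
from typing import List, Tuple


def binarize_mask(probabilities: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-pixel mask probabilities into a 0/1 mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def paste_mask(mask: List[List[int]], box: Tuple[int, int, int, int],
               image_height: int, image_width: int) -> List[List[int]]:
    """Place a box-sized binary mask on an empty image-sized canvas.

    box is (x_min, y_min, x_max, y_max); only the top-left corner is used
    because the mask is assumed to match the box size already.
    """
    x0, y0, _, _ = box
    canvas = [[0] * image_width for _ in range(image_height)]
    for dy, row in enumerate(mask):
        for dx, value in enumerate(row):
            canvas[y0 + dy][x0 + dx] = value
    return canvas


# Made-up mask probabilities for one 2x2 ROI located at (1, 1) in a 4x4 image.
roi_probabilities = [
    [0.1, 0.8],
    [0.6, 0.4],
]
roi_box = (1, 1, 3, 3)

binary = binarize_mask(roi_probabilities)
full_image_mask = paste_mask(binary, roi_box, image_height=4, image_width=4)
for row in full_image_mask:
    print(row)
# [0, 0, 0, 0]
# [0, 0, 1, 0]
# [0, 1, 0, 0]
# [0, 0, 0, 0]
```

Each ROI gets its own mask, so two overlapping objects produce two separate image-sized masks rather than one shared label map.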