Overall Explanation

Overview

For our previous exercises, we took a look at an image recognition methods where we generalized and classified objects within the image. Semantic segmentation bases its classification on image recognition but instead of classifying objects within the image, it classifies each pixels. This method is especially useful when it comes to environmental perceptions, thanks to its dense per-pixel classifications, the model can percieve many potential objects per scene, including scene foregorounds and backgrounds.

Semantic Segmentation acheives its per-pixel classification by convolutionalizing a pre-trained image recognition backbone, (for our case ResNet18) into a Fully Convolutional Network (FCN) capable of per-pixel labeling.

Semantic Segmentation is used and extended in many different applications, from self driving cars, to image segmentation and classification for MRI images.

As you may have noticed, we have 5 different segmentation methods and results on our “follow along” step. This is the result of our FCN-ResNet18 model being trained on different dataset for different applications.

Following are the datasets used and their applications:

Dataset Name	Contents	–network
Cityscapes	Dataset with urban street scenes	fcn-resnet18-cityscapes
DeepScene	Dataset with outdoor scenes including vegetations	fcn-resnet18-deepscene
Multi-Human	Dataset with different scenes with people	fcn-resnet18-mhp
Pascal VOC	Dataset with various object, animals and people	fcn-resnet18-voc
SUN RGB-D	Dataset with indoor scenes (indoor objects)	fcn-resnet18-sun

FCN-ResNet18

Our previous image recognition models such as ResNet18 uses fully-connected (fc) layer as the output layer. This is done to classify the convolutional output from the middle layer.

Having GC layer allows us to classify the entire image but it leads us to a big problems:

Location information of where the classified object is within the image is lost.

To solve this problem, convolutionalizing method is used.

Convolutionalizing

In order to retain the location information, we need to be able to conduct per-pixel classification. The convolutionalizing method allows us to replace the last fully-connected layer with convolutional layer. This last convolutional layer allows us to output a heatmap with each pixel classifified.

Within the convolutional output later, different deconvolutional methods are applied to approximate the segmentation to the real picture as close as possible.

reference https://medium.com/@msmapark2/fcn-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-fully-convolutional-networks-for-semantic-segmentation-81f016d76204

Here are all the datasets trained on FCN-ResNet18 with their labels:

Dataset Name	labels
Cityscapes
DeepScene
Multi-Human
Pascal VOC
SUN RGB-D