Instance Segmentation + Mask RCNN

linguana 2021. 5. 7. 12:23

This post is a reorganization of the content in [9].

  If all you want is to run the model, you can just import it from torchvision or grab one of the many implementations on GitHub. But if I want to fully understand it and tune it freely, exactly the way I want... can I actually do that?

  The FPN and RoIAlign parts in particular are truly gnarly, and they seem to get in the way of understanding the core idea rather than help.

  This picture is from [9], and I relate to it so much that I'm attaching it.

<Figure 1> joke

  The paper's authors obviously see the structure clearly, since they built it themselves, but from a newbie's point of view it really is hopeless.

  Anyway, the core of the Mask RCNN architecture is shown in the figure below.

<Figure 2> core architecture of Mask RCNN

  Roughly speaking (a rough wiring sketch follows the list below),

  1. extract a feature map with a CNN,
  2. feed this feature map into the RPN,
  3. resize the RPN outputs and treat them as proposals,
  4. then run them through the class, bounding box, and mask branches depending on what you want, and you're done.
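
  As a minimal sketch of this rough pipeline (my own wiring, not code from the original article: the ResNet50 backbone, the dummy image size, and the way the pieces are chained are all placeholder assumptions; rpn() is the function defined later in this post):

import tensorflow as tf

# 1. backbone CNN extracts a feature map (ResNet50 is just a placeholder choice)
backbone = tf.keras.applications.ResNet50(include_top=False, weights=None)
image = tf.random.uniform([1, 512, 512, 3])        # dummy input image
featuremap = backbone(image)

# 2. the feature map goes into the RPN (rpn() is defined further below)
rpn_model = rpn(featuremap)
rpn_class_logits, rpn_probs, rpn_bbox = rpn_model(featuremap)

# 3. keep the foreground anchors, shift them by the predicted deltas,
#    crop + resize the corresponding regions -> proposals
# 4. feed the proposals to the class / bounding box / mask branches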

 

  In the RPN, each anchor is labeled by its IoU with the GT BBoxes:
  (1) if the IoU is 0.6 or higher, it is a foreground anchor;
  (2) if the IoU is 0.1 or lower, it is a background anchor.

  FG + BG = Proposal Count
  We choose our proposal count (PC), which tells us how many of the RPN's predictions will contribute to the loss function.
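
As an illustration of this labeling rule, here is a rough sketch (my own, not from the referenced code; the box format and the background-sampling details are assumptions) that labels anchors by IoU using the thresholds above and samples a fixed proposal count:

import numpy as np

def label_anchors(anchors, gt_boxes, fg_thresh=0.6, bg_thresh=0.1, proposal_count=20):
    # anchors: [A, 4], gt_boxes: [G, 4], both as (y1, x1, y2, x2)
    # pairwise intersection between every anchor and every GT box
    y1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    x1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    y2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    x2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_a[:, None] + area_g[None, :] - inter)

    best_iou = iou.max(axis=1)                   # best overlap per anchor
    fg = np.where(best_iou >= fg_thresh)[0]      # foreground anchors
    bg = np.where(best_iou <= bg_thresh)[0]      # background anchors
    # sample background anchors so that FG + BG = proposal count (PC)
    bg = np.random.choice(bg, max(proposal_count - len(fg), 0), replace=False)
    indices = np.concatenate([fg, bg])
    labels = np.concatenate([np.ones(len(fg)), np.zeros(len(bg))])  # 1: FG, 0: BG
    return indices, labels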

Keep in mind that the RPN model predicts delta values and fg/bg labels for EACH anchor.

import tensorflow as tf
from tensorflow.keras.initializers import GlorotNormal
from tensorflow.keras.layers import Input, Conv2D, Lambda, Activation
from tensorflow.keras.models import Model

def rpn(featuremap):
    '''Build the RPN model on top of a backbone feature map.'''
    initializer = GlorotNormal(seed=None)
    input_ = Input(shape=[None, None, featuremap.shape[-1]], name="rpn_INPUT")

    # 3x3 conv shared by the classification and box-regression branches
    shared = Conv2D(512, (3, 3), padding='same', activation='relu', strides=1,
                    name='rpn_conv_shared', kernel_initializer=initializer)(input_)

    # Anchor classification. 5 anchor sizes * 1 anchor scale * 2 label probabilities (BG/FG)
    x = Conv2D(5*1*2, (1, 1), padding='valid', activation='linear',
               name='rpn_class_raw', kernel_initializer=initializer)(shared)

    # Reshape to [batch, anchors, 2]
    rpn_class_logits = Lambda(lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 2]))(x)
    rpn_probs = Activation("softmax", name="rpn_class_xxx")(rpn_class_logits)  # --> BG/FG probabilities

    # Bounding box refinement. [batch, H, W, depth]
    # 5*1*4: 5 anchor sizes * 1 anchor scale, 4 delta coordinates per anchor
    x = Conv2D(5*1*4, (1, 1), padding="valid", activation='linear',
               name='rpn_bbox_pred', kernel_initializer=initializer)(shared)

    # Reshape to [batch, anchors, 4]
    rpn_bbox = Lambda(lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 4]))(x)

    outputs = [rpn_class_logits, rpn_probs, rpn_bbox]
    return Model(input_, outputs, name="RPN")
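
For example (my own usage sketch; the feature-map shape is just a placeholder), calling the model on a dummy backbone output:

featuremap = tf.random.uniform([1, 16, 16, 512])   # dummy backbone output
rpn_model = rpn(featuremap)
rpn_class_logits, rpn_probs, rpn_bbox = rpn_model(featuremap)
# 16*16 locations * 5 anchors per location = 1280 anchors
print(rpn_probs.shape)   # (1, 1280, 2)  BG/FG probabilities per anchor
print(rpn_bbox.shape)    # (1, 1280, 4)  4 box deltas per anchor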

For these 20 anchors (the chosen proposal count in this example) we know the ground truth labels and the ground truth deltas. The deltas are four values: (Δx, Δy, ΔW, ΔH). The first two tell us how much to slide the anchor's center point; the latter two tell us how much to change the anchor's width and height.
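
The post doesn't spell out exactly how these deltas are parameterized; the sketch below assumes the standard Faster R-CNN convention (centre shifts normalized by the anchor size, width/height changes in log space), which may differ from the referenced implementation:

import numpy as np

def compute_gt_deltas(anchor, gt_box):
    # anchor, gt_box: (x1, y1, x2, y2); returns (dx, dy, dW, dH)
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + aw / 2.0, anchor[1] + ah / 2.0
    gw, gh = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    gx, gy = gt_box[0] + gw / 2.0, gt_box[1] + gh / 2.0
    dx, dy = (gx - ax) / aw, (gy - ay) / ah      # how far to slide the centre point
    dW, dH = np.log(gw / aw), np.log(gh / ah)    # how much to scale width / height
    return dx, dy, dW, dH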


RPN Loss

RPN loss = class_loss + boundingbox_loss

All PC chosen predictions contribute to the class_loss,
only the FG predictions contribute to the boundingbox_loss.

def rpn_loss(rpn_logits, rpn_deltas, gt_labels, gt_deltas, indices, batchlen):
    '''
    rpn_logits, rpn_deltas: the predicted logits/deltas for all the anchors
    gt_labels, gt_deltas: the correct labels and deltas for the chosen training anchors
    indices: the (numpy) indices of the chosen training anchors
    '''
    predicted_classes = tf.gather_nd(rpn_logits, indices)
    foregroundindices = indices[gt_labels.astype('bool')]  # labels: 0:BG  1:FG

    # only the foreground anchors contribute to the box loss
    predicted_deltas = tf.cast(tf.gather_nd(rpn_deltas, foregroundindices), tf.float32)
    gt_deltas = tf.cast(tf.gather_nd(gt_deltas, foregroundindices), tf.float32)

    # classification loss over all chosen (PC) anchors
    lf = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    classloss = lf(gt_labels, predicted_classes)
    classloss = tf.reduce_mean(classloss)

    # smooth L1 box-regression loss over the foreground anchors only
    deltaloss = smooth_l1(gt_deltas, predicted_deltas)
    deltaloss = tf.reduce_mean(deltaloss)

    return classloss, deltaloss
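
rpn_loss calls smooth_l1, which is not defined in the snippet above; here is a minimal sketch of the standard smooth-L1 (Huber) box-regression loss from Fast R-CNN, which I assume is what is meant:

def smooth_l1(y_true, y_pred):
    # standard smooth-L1 loss (assumption: the original post does not define it)
    diff = tf.abs(y_true - y_pred)
    return tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)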

 

Now we take the anchors we predict as foregrounds and move them with the corresponding predicted deltas. So we end up with NumberOfForegrounds * 4 coordinates; these are our ROIs. We cut these regions out of the feature map and resize them to the same size: these are the proposals. They will be the inputs for our class and box-refinement head and for our mask head. (These heads need a fixed input shape; for this we can always pad our proposals with zeros.)
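
A hedged sketch of this step (my own, not from the referenced code): apply_deltas assumes the same delta parameterization as the earlier sketch, and the crop-and-resize uses tf.image.crop_and_resize, which expects normalized (y1, x1, y2, x2) boxes.

import tensorflow as tf

def apply_deltas(anchors, deltas):
    # anchors: [N, 4] as (x1, y1, x2, y2); deltas: [N, 4] as (dx, dy, dW, dH)
    w, h = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    cx, cy = anchors[:, 0] + w / 2.0, anchors[:, 1] + h / 2.0
    cx, cy = cx + deltas[:, 0] * w, cy + deltas[:, 1] * h        # slide the centre
    w, h = w * tf.exp(deltas[:, 2]), h * tf.exp(deltas[:, 3])    # rescale width/height
    return tf.stack([cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0], axis=1)

def crop_proposals(featuremap, rois_normalized, size=(7, 7)):
    # cut the ROI regions out of the feature map and resize them all to one fixed size
    # rois_normalized: [N, 4] boxes as (y1, x1, y2, x2) scaled to [0, 1]
    box_indices = tf.zeros(tf.shape(rois_normalized)[0], dtype=tf.int32)  # all boxes from image 0
    return tf.image.crop_and_resize(featuremap, rois_normalized, box_indices, size)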

As we have at most two objects in our input images, there is a high chance that our RPN moved several different anchors onto the same object. We will use non-max suppression to filter out highly overlapping proposals, as they are unnecessary, and keep only the best ones.
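
tf.image.non_max_suppression does exactly this kind of filtering; a small sketch (the function and variable names here are my own):

def filter_proposals(rois, fg_scores, max_proposals=10, iou_threshold=0.5):
    # drop proposals that overlap an already-kept, higher-scoring box by more than iou_threshold
    keep = tf.image.non_max_suppression(rois, fg_scores, max_proposals, iou_threshold)
    return tf.gather(rois, keep), tf.gather(fg_scores, keep)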


Reference

[1] Instance segmentation with OpenCV - PyImageSearch

[2] Mask R-CNN with OpenCV - PyImageSearch

[3] Keras Mask R-CNN - PyImageSearch

[4] Paper review series: Mask R-CNN paper review (tistory.com)

[5] Semantic segmentation survey paper: [1704.06857] A Review on Deep Learning Techniques Applied to Semantic Segmentation (arxiv.org)

[6] Mask RCNN paper: [1703.06870] Mask R-CNN (arxiv.org)

[7] ipynb implementation (cv2, keras): Mask-RCNN_TF/Mask_RCNN(RPN).ipynb at master · Kanghee-Lee/Mask-RCNN_TF (github.com)

Attached files: maskrcnn.pdf (7.37 MB), Mask R-CNN_Korean.pdf (6.47 MB)

[8] Mask RCNN for dummies: Building a Mask R-CNN from scratch in TensorFlow and Keras | by Franciska Rajki | Towards Data Science. Not a faithful reimplementation of the original, but a good article for understanding the basic structure → FPN and ROI Align are not implemented. GitHub code: rajkifranciska/maskrcnn-from-scratch: Building a maskrcnn from scratch using tensorflow and keras (github.com)