
In the Faster R-CNN paper, when talking about anchors, what do they mean by using "pyramids of reference boxes", and how is this done? Does this just mean that a bounding box is generated at each of the W*H*k anchor points?

where W = width, H = height, and k = number of aspect ratios × number of scales

link to paper: https://arxiv.org/abs/1506.01497

BadProgrammer

2 Answers


Anchors Explained

Anchors

For the time being, ignore the fancy term "pyramids of reference boxes": anchors are nothing but fixed reference rectangles of preset scales and aspect ratios that are fed to the Region Proposal Network. Anchors are defined over the last convolutional feature map, meaning there are $(H_{featuremap} \cdot W_{featuremap}) \cdot k$ of them, but each one corresponds to a region of the original image. For each anchor, the RPN predicts the probability that it contains an object in general, plus four correction coordinates to move and resize the anchor to the right position. But what does the geometry of anchors have to do with the RPN?
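To make the $(H_{featuremap} \cdot W_{featuremap}) \cdot k$ count concrete, here is a minimal NumPy sketch (my own code, not the authors' implementation) that tiles $k$ pre-computed base anchors over every feature-map cell. The stride of 16 corresponds to the total downsampling of a VGG-16 backbone; the function name is made up for illustration.

```python
import numpy as np

def tile_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Place each of the k base anchors at every feature-map position.

    base_anchors: (k, 4) array of [x1, y1, x2, y2] boxes centred at (0, 0),
    already expressed in image pixels. Returns (feat_h * feat_w * k, 4) anchors.
    """
    # Centres of the feature-map cells, mapped back to image coordinates.
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)  # (H*W, 4)

    # Every shift combined with every base anchor via broadcasting.
    anchors = shifts[:, None, :] + base_anchors[None, :, :]  # (H*W, k, 4)
    return anchors.reshape(-1, 4)                            # (H*W*k, 4)
```

With $k = 9$ and the roughly 60×40 feature map of a typical 1000×600 input, this yields about 20,000 anchors, the figure quoted in the paper.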

Anchors Actually Appear in the Loss Function

When training the RPN, a binary class label is first assigned to each anchor. Anchors whose Intersection-over-Union (IoU) overlap with a ground-truth box is higher than a certain threshold are assigned a positive label (likewise, anchors with IoUs below a lower threshold are labeled negative). These labels are then used to compute the loss function:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) \;+\; \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

$p$ is the classification head output of the RPN, the predicted probability that the anchor contains an object. For anchors labeled negative, no regression loss is incurred, because $p^*$, the ground-truth label, is zero. In other words, the network doesn't care about the predicted coordinates of negative anchors and is happy as long as it classifies them correctly. For positive anchors, the regression loss is taken into account. $t$ is the regression head output of the RPN, a vector of the 4 parameterized coordinates of the predicted bounding box. The parameterization depends on the anchor geometry and is as follows:

$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}$$
$$t_x^* = \frac{x^* - x_a}{w_a},\quad t_y^* = \frac{y^* - y_a}{h_a},\quad t_w^* = \log\frac{w^*}{w_a},\quad t_h^* = \log\frac{h^*}{h_a}$$

where $x$, $y$, $w$, and $h$ denote the box's center coordinates and its width and height. Variables $x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y$, $w$, $h$).
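Written as code, the parameterization and its inverse are just a few lines; this is a sketch assuming boxes are given as (centre x, centre y, width, height):

```python
import numpy as np

def encode(anchor, box):
    """Parameterize `box` relative to `anchor`; both are (x, y, w, h) with (x, y) the centre."""
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(anchor, t):
    """Invert the parameterization: apply predicted offsets t = (tx, ty, tw, th) to the anchor."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([xa + tx * wa, ya + ty * ha, wa * np.exp(tw), ha * np.exp(th)])
```

Taking the log of the width and height ratios makes the targets relative to the anchor's own scale, and the exponential in the decoder keeps the predicted width and height positive.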

Also notice that anchors with no label are neither classified nor reshaped; the RPN simply leaves them out of the computation. Once the RPN's job is done and the proposals are generated, the rest is very similar to Fast R-CNN.
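Putting the labelling rule and the loss together, here is a minimal NumPy sketch (not the authors' code) of how the binary labels are assigned by IoU and how $p^*$ gates the regression term. The 0.7 / 0.3 thresholds and $\lambda = 10$ follow the paper's defaults, while the normalization is simplified and `smooth_l1` stands in for the robust regression loss $L_{reg}$.

```python
import numpy as np

def iou(anchors, gt):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes [x1, y1, x2, y2]."""
    tl = np.maximum(anchors[:, None, :2], gt[None, :, :2])
    br = np.minimum(anchors[:, None, 2:], gt[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (anchors[:, 2:] - anchors[:, :2]).prod(axis=1)
    area_g = (gt[:, 2:] - gt[:, :2]).prod(axis=1)
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt, pos_thr=0.7, neg_thr=0.3):
    """Return per-anchor labels: 1 = object, 0 = background, -1 = ignored."""
    overlaps = iou(anchors, gt)                # (N, M)
    best = overlaps.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[best < neg_thr] = 0
    labels[best >= pos_thr] = 1
    labels[overlaps.argmax(axis=0)] = 1        # best anchor per ground-truth box is also positive
    return labels

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def rpn_loss(p, t, t_star, labels, lam=10.0):
    """p: (N,) objectness probabilities; t, t_star: (N, 4) predicted / target offsets."""
    used = labels >= 0                         # unlabeled anchors are simply skipped
    p_star = labels[used].astype(float)
    cls = -np.mean(p_star * np.log(p[used] + 1e-9)
                   + (1 - p_star) * np.log(1 - p[used] + 1e-9))
    pos = labels == 1                          # p* = 0 removes the regression term for negatives
    reg = smooth_l1(t[pos] - t_star[pos]).sum() / max(pos.sum(), 1)
    return cls + lam * reg
```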

Mahan Fathi
  • @Fathi What about if we have many classes? As far as I know, in Fast R-CNN each training RoI is assigned one ground truth class. So, I guess something similar happens here? – thanasissdr Aug 14 '17 at 10:37
  • @Fathi I totally agree with what you are saying, so I suppose you agree with me. I mean the authors of the original paper for Faster R-CNN have used only two classes (background/ object) for simplicity, trying to explain how RPN works, right? So, instead of having only two classes, I could have more than just two and I guess I could take the known cross entropy loss function, right? – thanasissdr Aug 14 '17 at 12:31
  • @thanasissdr The fundamental idea behind Faster R-CNN was that "when neural nets are so good at everything else, why not use them for region proposals too?". Comparing Fast R-CNN to standard R-CNN, the only difference is that RoI proposals -- which again are made using same old techniques, e.g. SelectiveSearch or EdgeBoxes -- are mapped from the raw image to the convolutional features, and then fed to the FCs. This way the forward pass process of each RoI through CNN is omitted. – Mahan Fathi Aug 14 '17 at 12:38
  • In Faster R-CNN, the RPN _learns_ to propose proper regions. Once the RPN is done, the rest is similar to Fast R-CNN, and FCs classify and regress the proposals. – Mahan Fathi Aug 14 '17 at 12:39
  • @thanasissdr Yes. We are on the same page. I suppose you can classify in RPN, but that would be unnecessary since the FC net does the classification again, and has no difficulty rejecting junk proposals. Also think about the pipeline, how are you going to use the classification scores, and how they'd be of help? My final stand is, (background/object) classification is a cornerstone in Faster R-CNN. – Mahan Fathi Aug 14 '17 at 12:50
  • @Fathi Maybe I have misunderstood a few things. What I want to say is that all positive anchors are going to be used in the training (classification) process, right? So, in order for the weights to be adjusted (let's say in the Fast R-CNN network as part of the Faster R-CNN network), they need to have a classification ground truth. No? – thanasissdr Aug 14 '17 at 13:14
  • @thanasissdr yes and yes. Anchors with high **IoU**s are fed to the FC layers after being regressed, and then classification is done over the object classes (e.g. bird, car, etc.). – Mahan Fathi Aug 14 '17 at 13:45
  • If the RPN layer has a classifier and regressor, why does the final RoI output go through another regressor at the end of the whole NN? (I understand that the RPN classifier only classifies between object or no object, but why regress the bounding boxes twice?) – CMCDragonkai Mar 28 '18 at 03:11
  • Another question, why do they use log of width and height instead of just the width and height? – CMCDragonkai Mar 28 '18 at 04:10

I read this paper yesterday and, at first sight, it was confusing to me too. After re-reading I came to this conclusion:

  • The last conv layer of the base network (ZF or VGG-16) serves as input for both the Region Proposal Network and the RoI pooling. In the case of VGG-16, this last conv layer is 7x7x512 (HxWxD).
  • This layer is mapped to a 512-dimensional intermediate layer with a 3x3 conv layer. The output size stays 7x7x512 if padding is used.
  • This layer is mapped to a 7x7x(2k+4k) layer (e.g. 7x7x54 for k=9) with 1x1 conv layers: 2 objectness scores and 4 box offsets for each of the k anchor boxes (sketched in code right after this list).
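Sketched in PyTorch (layer names are my own, not the reference implementation), the head described by these bullets is just three convolutions:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv over the backbone feature map, then two sibling 1x1 convs:
    2 objectness scores and 4 box offsets for each of the k anchors per position."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # keeps H x W
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object / not object
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # box offsets

    def forward(self, feat):                    # feat: (N, 512, H, W)
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)         # (N, 2k, H, W), (N, 4k, H, W)
```

The `padding=1` is what keeps the 3x3 convolution from shrinking the spatial size of the feature map.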

Now, according to Figure 1 in the paper, you can have a pyramid of input images (the same image at different scales), a pyramid of filters (filters of different scales in the same layer), or a pyramid of reference boxes. The latter refers to the k anchor boxes at the last layer of the Region Proposal Network. Instead of stacking filters of different sizes on top of each other (the middle case), reference boxes of different sizes and aspect ratios are stacked at each position.

In short, for each anchor point (HxW, e.g. 7x7) a pyramid of reference boxes (k, e.g. 9) is used.
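For a rough idea of what that pyramid of k = 9 boxes looks like numerically, here is a small sketch using the paper's default scales (areas of 128², 256², 512² pixels) and aspect ratios (1:1, 1:2, 2:1); exact rounding conventions vary between implementations:

```python
import numpy as np

def base_anchor_sizes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) reference boxes as (w, h) pairs,
    all sharing the same centre: the 'pyramid' at a single anchor point."""
    sizes = []
    for s in scales:
        for r in ratios:          # r is interpreted as h / w
            w = s / np.sqrt(r)    # chosen so that w * h ~= s * s
            h = s * np.sqrt(r)
            sizes.append((w, h))
    return np.array(sizes)        # shape (9, 2) with the defaults
```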

Pieter
  • but what exactly is an anchor box? Is the purpose of each anchor box to serve as a reference for the RPN, which predicts a delta in the anchor box's position and size for each anchor considered to be part of the foreground? – BadProgrammer Mar 12 '17 at 04:28
  • The RPN predicts both the delta shift of the foreground location and the objectness score. The latter explicitly predicts whether it is background or foreground (also see footnote 3). – Pieter Mar 12 '17 at 20:23
  • Could you explain how a `3x3` conv layer translates to `7x7`? In the prototxt, it says the padding is 1 on the last VGG16 layer. – Klik Nov 03 '17 at 23:46