Output:
If only one object needs to be detected -> add an FC layer to a network pretrained on ImageNet
apply a CNN to many different crops of the image; the CNN classifies each crop as object / background
but there are too many possible windows, and the same object may be detected repeatedly
we need region proposals to find a small set of boxes that are likely to cover all the objects
“Selective Search” can quickly generate 2000 region proposals
$IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}}$
$IoU > 0.5$ is decent
$IoU > 0.7$ is pretty good
$IoU > 0.9$ is almost perfect
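A minimal sketch of the IoU formula above, assuming axis-aligned boxes stored as `(x1, y1, x2, y2)` corner coordinates. The function name `iou` and the box layout are illustrative choices, not something the notes specify.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7, well below the 0.5 "decent" level.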
run detector on all test images + NMS
for each category, compute AP = area under the precision vs. recall curve
1. for each detection (from highest score to lowest)
   1. if it matches some GT (ground-truth) box with IoU > 0.5, mark it as positive and eliminate that GT box
   2. otherwise mark it as negative
   3. plot a point on the PR curve
2. AP = area under the PR curve
mAP = average of AP for each category
COCO mAP: compute mAP for each IoU threshold and take average
How to get AP = 1.0 -> hit all GT boxes with IoU > 0.5, no false positive ranked above any true positive
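The AP procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: a single image and a single category, each detection matched to its best-overlapping unmatched GT box, and the area under the PR curve approximated by summing precision times the recall increase at each point (real benchmarks use their own interpolation rules). The names `average_precision` and `iou` are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_threshold=0.5):
    """detections: list of (score, box) pairs for one category."""
    num_gt = len(gt_boxes)
    if num_gt == 0:
        return 0.0
    unmatched = list(gt_boxes)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    # Walk detections from highest score to lowest.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) > iou_threshold:
            tp += 1
            unmatched.remove(best)  # each GT box can be matched only once
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        # One point on the PR curve; accumulate the area it adds.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

With two GT boxes hit exactly and a false positive ranked last, this returns 1.0; moving the same false positive to the top of the ranking lowers the result to 7/12, which matches the AP = 1.0 condition stated above.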
RoI Align -> aligns region features better by avoiding snapping to the feature-map grid
Insert Region Proposal Network (RPN) to predict proposals from features
after the backbone network -> RPN -> regional proposals
Imagine an anchor box of fixed size at each point in the feature map
At each point predict whether the corresponding anchor contains an object
for positive boxes, also predict a box transform to regress from anchor box to object box
Use k different anchor boxes at each point
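A rough sketch of the anchor idea: place k anchors at every feature-map point, then apply a predicted transform to turn an anchor into an object box. The notes do not give the transform, so the `(tx, ty, tw, th)` parameterization below (shift the centre by a fraction of the anchor size, scale width and height exponentially) is assumed from the Faster R-CNN convention; `make_anchors`, `apply_transform`, `stride`, and `sizes` are hypothetical names.

```python
import math


def make_anchors(feat_h, feat_w, stride, sizes):
    """Return len(sizes) anchors (cx, cy, w, h) for every feature-map point."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for w, h in sizes:
                anchors.append((cx, cy, w, h))
    return anchors


def apply_transform(anchor, t):
    """Regress from an anchor box to an object box (assumed parameterization)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = t
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))
```

A 2x3 feature map with k = 3 anchor shapes yields 18 anchors, and the all-zero transform leaves an anchor unchanged.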
just use anchor to make classification and object boxes predictions
Input -> Convolutions -> Scores C * H * W -> argmax H * W
train the network with a per-pixel cross-entropy loss
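A small sketch of the last two steps, assuming the scores are stored as nested lists indexed `scores[c][y][x]` (C x H x W) and the target is an H x W grid of class indices. The function names are illustrative; a real network would do this with tensor operations.

```python
import math


def predict_labels(scores):
    """Per-pixel argmax over the class axis: C x H x W scores -> H x W labels."""
    C, H, W = len(scores), len(scores[0]), len(scores[0][0])
    return [[max(range(C), key=lambda c: scores[c][y][x]) for x in range(W)]
            for y in range(H)]


def pixel_cross_entropy(scores, target):
    """Mean over all pixels of -log softmax(scores)[target class]."""
    C, H, W = len(scores), len(scores[0]), len(scores[0][0])
    total = 0.0
    for y in range(H):
        for x in range(W):
            logits = [scores[c][y][x] for c in range(C)]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[target[y][x]]
    return total / (H * W)
```

With all scores equal across two classes, every pixel contributes log 2 to the loss regardless of its target.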
Downsampling : Pooling, strided convolution
Bed of nails : fill 0
Nearest Neighbour: same numbers in small blocks
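The two unpooling schemes above can be sketched on a plain 2D list. The function names and the integer `factor` argument are illustrative assumptions.

```python
def bed_of_nails(x, factor=2):
    """Put each input value in the top-left of its output block; fill the rest with 0."""
    H, W = len(x), len(x[0])
    out = [[0] * (W * factor) for _ in range(H * factor)]
    for i in range(H):
        for j in range(W):
            out[i * factor][j * factor] = x[i][j]
    return out


def nearest_neighbour(x, factor=2):
    """Copy each input value into every cell of its output block."""
    H, W = len(x), len(x[0])
    return [[x[i // factor][j // factor] for j in range(W * factor)]
            for i in range(H * factor)]
```

For the input `[[1, 2], [3, 4]]`, bed of nails gives rows `[1, 0, 2, 0]`, `[0, 0, 0, 0]`, `[3, 0, 4, 0]`, `[0, 0, 0, 0]`, while nearest neighbour gives `[1, 1, 2, 2]` twice followed by `[3, 3, 4, 4]` twice.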
$f_{x,y} = \sum_{i,j} f_{i,j} \max(0, 1-|x-i|) \max(0, 1-|y-j|)$
where i, j range over the nearest neighbours of (x, y)
Use two closest neighbours in x and y to construct linear approximations
three closest neighbours in x and y to construct cubic approximation
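The bilinear formula above can be evaluated directly at a real-valued point. This sketch assumes `f[i][j]` holds the grid values with `i` paired with `x` and `j` with `y`, and treats neighbours outside the grid as contributing nothing; `bilinear_sample` is a hypothetical name.

```python
import math


def bilinear_sample(f, x, y):
    """Weighted sum of the grid values around (x, y), following the formula."""
    total = 0.0
    # Only the two closest integers along each axis can have non-zero weight.
    for i in (math.floor(x), math.floor(x) + 1):
        for j in (math.floor(y), math.floor(y) + 1):
            if 0 <= i < len(f) and 0 <= j < len(f[0]):
                wx = max(0.0, 1 - abs(x - i))
                wy = max(0.0, 1 - abs(y - j))
                total += f[i][j] * wx * wy
    return total
```

At a grid point the result is the stored value itself, and at the centre of a 2x2 cell it is the mean of the four corners.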
Just add Conv layers to predict a mask for each of C classes on the region proposals
separate different objects in the same category
Represent the pose of a human by locating a set of keypoints
-> General Idea: Add Per-Region “Heads” to Faster / Mask R-CNN
Dense captioning -> nlp -> visual reasoning
3D shape prediction …