State-of-the-art object detection networks, such as RetinaNet, Faster R-CNN, and YOLO, use a coordinate encoding in which the bounding-box regression targets are expressed relative to the anchor box:
Centers:
$t_x = (x-x_a)/w_a$ and $t_y = (y-y_a)/h_a$
Height and width offsets:
$t_w = \log(w/w_a)$ and $t_h = \log(h/h_a)$
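The encoding above can be sketched in a few lines of NumPy. The `encode_box` helper below is hypothetical, written only to make the formulas concrete; it assumes both boxes are given in (center x, center y, width, height) format.

```python
import numpy as np

def encode_box(box, anchor):
    """Encode a ground-truth box relative to an anchor box.

    Illustrative helper only. Both `box` and `anchor` are
    (x_center, y_center, width, height) tuples.
    """
    x, y, w, h = box
    x_a, y_a, w_a, h_a = anchor
    t_x = (x - x_a) / w_a   # center offset, normalized by anchor width
    t_y = (y - y_a) / h_a   # center offset, normalized by anchor height
    t_w = np.log(w / w_a)   # log of the width ratio
    t_h = np.log(h / h_a)   # log of the height ratio
    return np.array([t_x, t_y, t_w, t_h])

# An anchor that exactly matches the ground-truth box encodes to all zeros:
print(encode_box((10.0, 20.0, 4.0, 8.0), (10.0, 20.0, 4.0, 8.0)))
# → [0. 0. 0. 0.]
```

Note that a box twice the anchor's width gives $t_w = \log 2$, and a box half the width gives $t_w = -\log 2$, so scaling up and scaling down by the same factor produce targets of equal magnitude.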
Why are the width and height targets in logarithmic form? Is there an optimization reason for this?