I read the paper and I understand that anchoring one image and selecting corresponding semi-hard positives and negatives is an efficient way of generating training triplets.
However, I don't understand why the distinction between the anchor and the positive still exists in the loss function. In other words, given a triplet that has already been chosen, both the anchor and the positive correspond to the same person. Why not also include the distance between the positive and the negative in the loss? Is there a reason behind this, or is it just an alternative?
Formally, the triplet loss is defined as:
$$\mathcal{J} = \sum_{i=1}^{m} \Big[ \big\lVert f(A^{(i)}) - f(P^{(i)}) \big\rVert_2^2 - \big\lVert f(A^{(i)}) - f(N^{(i)}) \big\rVert_2^2 + \alpha \Big]_+$$
so why not instead use: $$\mathcal{J} = \sum_{i=1}^{m} \Big[ \big\lVert f(A^{(i)}) - f(P^{(i)}) \big\rVert_2^2 - \big\lVert f(A^{(i)}) - f(N^{(i)}) \big\rVert_2^2 - \big\lVert f(P^{(i)}) - f(N^{(i)}) \big\rVert_2^2 + \alpha \Big]_+$$
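For concreteness, here is a minimal NumPy sketch of the two losses as I understand them (the function names and the batch layout are mine, not from the paper):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    """Original loss: hinge on d(A, P) - d(A, N) + alpha.

    a, p, n: arrays of shape (m, d) holding the anchor, positive,
    and negative embeddings f(A), f(P), f(N) for m triplets.
    """
    d_ap = np.sum((a - p) ** 2, axis=1)  # ||f(A) - f(P)||_2^2
    d_an = np.sum((a - n) ** 2, axis=1)  # ||f(A) - f(N)||_2^2
    return np.sum(np.maximum(d_ap - d_an + alpha, 0.0))

def triplet_loss_variant(a, p, n, alpha=0.2):
    """My proposed variant: additionally subtract ||f(P) - f(N)||_2^2."""
    d_ap = np.sum((a - p) ** 2, axis=1)
    d_an = np.sum((a - n) ** 2, axis=1)
    d_pn = np.sum((p - n) ** 2, axis=1)  # the extra positive-negative term
    return np.sum(np.maximum(d_ap - d_an - d_pn + alpha, 0.0))
```

Both use exactly the same triplets; only the objective differs.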
Conceptually, the original loss function "pushes" the anchor towards the positive and away from the negative. Isn't pushing both the positive and the anchor away from the negative a good thing?
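To make "pushes" precise, here is my own computation of the per-triplet gradients of the original loss for the case where the hinge is active (ignoring the re-normalization of $f$ onto the hypersphere):

$$\frac{\partial \mathcal{J}_i}{\partial f(A^{(i)})} = 2\big(f(N^{(i)}) - f(P^{(i)})\big), \qquad \frac{\partial \mathcal{J}_i}{\partial f(P^{(i)})} = 2\big(f(P^{(i)}) - f(A^{(i)})\big), \qquad \frac{\partial \mathcal{J}_i}{\partial f(N^{(i)})} = 2\big(f(A^{(i)}) - f(N^{(i)})\big),$$

so a gradient step moves the positive toward the anchor and the negative away from the anchor, but the positive and the negative never interact directly. The extra $-\lVert f(P^{(i)}) - f(N^{(i)}) \rVert_2^2$ term would add exactly that missing repulsion between them.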