I am trying to understand the ArcFace implementation and I am stuck on one condition.

If $\cos(t) > \cos(\pi - m)$, then $t + m < \pi$ and $\cos(t+m)$ is computed directly. Otherwise $t + m \ge \pi$, and in that case $\cos(t+m)$ is replaced by $\cos(t) - m \sin(m)$. Could you explain this step?

I looked for an explanation and found a GitHub issue suggesting that the term $\cos(t) - m \sin(m)$ is the Taylor expansion of $\cos(t+m)$, but I still don't understand the real benefit of that. If we're using the Taylor expansion, then $\cos(t) - m \sin(m)$ is only an approximation of $\cos(t+m)$, so what is the advantage?

Here is the code fragment I am working from:

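import math
import tensorflow as tf

def call(self, embds, labels):
    # Constants derived from the margin m = self.margin.
    self.cos_m = tf.identity(math.cos(self.margin), name='cos_m')
    self.sin_m = tf.identity(math.sin(self.margin), name='sin_m')
    # th = cos(pi - m): cos_t > th is equivalent to t + m < pi.
    self.th = tf.identity(math.cos(math.pi - self.margin), name='th')
    # mm = m * sin(m): penalty subtracted from cos_t when t + m >= pi.
    self.mm = tf.multiply(self.sin_m, self.margin, name='mm')

    normed_embds = tf.nn.l2_normalize(embds, axis=1, name='normed_embd')
    normed_w = tf.nn.l2_normalize(self.w, axis=0, name='normed_weights')

    cos_t = tf.matmul(normed_embds, normed_w, name='cos_t')
    sin_t = tf.sqrt(1. - cos_t ** 2, name='sin_t')

    # cos(t + m) = cos(t) * cos(m) - sin(t) * sin(m)
    cos_mt = tf.subtract(
        cos_t * self.cos_m, sin_t * self.sin_m, name='cos_mt')

    # Keep cos(t + m) while t + m < pi; otherwise use cos(t) - m * sin(m).
    cos_mt = tf.where(cos_t > self.th, cos_mt, cos_t - self.mm)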

    mask = tf.one_hot(tf.cast(labels, tf.int32), depth=self.num_classes,
                      name='one_hot_mask')

    logists = tf.where(mask == 1., cos_mt, cos_t)
    logists = tf.multiply(logists, self.logist_scale, 'arcface_logist')

    return logists

I compared how the network performs with this condition and with simply assigning $\cos(t)$ when $t + m > \pi$. I trained the network with ArcLoss on MNIST with embedding size = 2 and plotted the embeddings. The plots are very similar, and I cannot observe any impact of using $\cos(t) - m \sin(m)$ instead of $\cos(t)$.

I understand (and see) that we should modify the matrix cos_mt when $t + m > \pi$, but I don't understand why it's done this way. Could you help me?
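One property that can be checked numerically is whether the target logit stays non-increasing in $t$ on $[0, \pi]$ under each choice of fallback. The sketch below uses only the standard library; the margin value `m = 0.5` and the helper names `target_logit` and `is_non_increasing` are assumptions made for illustration and are not part of the ArcFace code. It illustrates the property only and does not establish that this was the implementers' motivation.

```python
import math

def target_logit(t, m, fallback):
    """Target-class logit (before scaling) for an angle t in [0, pi].

    fallback selects what is used once t + m >= pi:
      "none"   -> keep cos(t + m) everywhere
      "cos"    -> plain cos(t)
      "linear" -> cos(t) - m * sin(m), as in the snippet above
    """
    threshold = math.cos(math.pi - m)
    if fallback == "none" or math.cos(t) > threshold:
        return math.cos(t + m)
    if fallback == "cos":
        return math.cos(t)
    return math.cos(t) - m * math.sin(m)

def is_non_increasing(values, tol=1e-12):
    # True if no value exceeds its predecessor by more than tol.
    return all(b <= a + tol for a, b in zip(values, values[1:]))

m = 0.5  # illustrative margin (assumption)
angles = [i * math.pi / 1000 for i in range(1001)]
results = {
    name: is_non_increasing([target_logit(t, m, name) for t in angles])
    for name in ("none", "cos", "linear")
}
print(results)  # {'none': False, 'cos': False, 'linear': True}
```

With this margin, $\cos(t+m)$ alone starts rising again once $t + m$ passes $\pi$, and the plain $\cos(t)$ fallback jumps upward at the threshold, because $\cos(\pi - m) = -\cos(m) > -1$. The $\cos(t) - m \sin(m)$ fallback starts at $-\cos(m) - m \sin(m)$, which is below $-1$ for $0 < m \le \pi/2$, so the combined function keeps decreasing.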

[Figure: two plots of the 2-D embeddings, each showing 10 clusters]

pawols

1 Answer

Using the sum-to-product formulae for trigonometric functions you have the exact equation:

$$\begin{align} \cos (t+m) &= \cos ((t+\tfrac{m}{2})+\tfrac{m}{2}) \\[6pt] &= \cos((t+\tfrac{m}{2})-\tfrac{m}{2}) - 2 \sin(\tfrac{m}{2}) \sin(t+\tfrac{m}{2}) \\[6pt] &= \cos(t) - 2 \sin(\tfrac{m}{2}) \sin(t+\tfrac{m}{2}). \\[6pt] \end{align}$$

Now, if $m$ is small then you have $\sin(t+\tfrac{m}{2}) \approx \sin(t)$ and $\sin(\tfrac{m}{2}) \approx \tfrac{m}{2}$, giving the approximation:

$$\cos (t+m) \approx \cos(t) - m \sin(t).$$

This is different to the approximation you give in your question; I have been unable to derive the latter. In any case, as to the advantage of using this approximation, it approximates a nonlinear function in $m$ by an affine function, which is much simpler to deal with. In the context of the ArcFace paper you are looking at, you have model parameters appearing in one of the cosine terms, so this approximation gives you an affine function with respect to the model parameters. This makes it easier to compute estimates of the model parameters.
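A quick numerical sketch can make the difference between the two expressions concrete. The values of `t` and `m` below are arbitrary assumptions chosen for illustration, and the function names are hypothetical. The sketch checks the exact identity above, then compares the error of $\cos(t) - m \sin(t)$ with the error of the question's $\cos(t) - m \sin(m)$ as $m$ shrinks.

```python
import math

def exact(t, m):
    return math.cos(t + m)

def identity_rhs(t, m):
    # Right-hand side of the exact sum-to-product identity above.
    return math.cos(t) - 2 * math.sin(m / 2) * math.sin(t + m / 2)

def first_order(t, m):
    # First-order approximation in m: cos(t) - m * sin(t).
    return math.cos(t) - m * math.sin(t)

def code_variant(t, m):
    # Expression from the question's code: cos(t) - m * sin(m).
    return math.cos(t) - m * math.sin(m)

t = 1.0  # arbitrary angle in radians (assumption)
for m in (0.5, 0.1, 0.01):
    err_first = abs(exact(t, m) - first_order(t, m))
    err_code = abs(exact(t, m) - code_variant(t, m))
    print(f"m={m}: first-order error={err_first:.2e}, "
          f"code-variant error={err_code:.2e}")
```

For these values the first-order error shrinks roughly like $m^2$, while the error of the code's expression shrinks only like $m$. The reason is that its correction term $m \sin(m)$ is itself of order $m^2$, so it does not reproduce the first-order term $m \sin(t)$.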

Ben
  • Thanks for your answer, but I still don't see the benefit of such an approximation for $t + m > \pi$. Why can't we just use the value of $\cos(t)$ in that case? I thought about this problem, and as I understand it, the goal of this approximation is to keep the function monotonic on $[0, \pi]$. However, according to the AdaCos paper https://arxiv.org/pdf/1905.00292.pdf (which shows that the embeddings of different classes are perpendicular to each other), assigning $\cos(t)$ should be enough in most cases. – pawols Mar 01 '21 at 08:18
  • I also fail to see the benefit. As I noted in my answer, the approximation I derived is different to the one in your question (though it is somewhat similar). – Ben Mar 01 '21 at 10:45