In the text I read the following:
I'm confused about the dimensions of the bias vector. How can we add an (m, 1) vector to a (1, P) vector? Is w0 shaped correctly? Or should w1 be shaped (n, P) to account for P classes, and then we broadcast w0?
Note: I assume w1 should be (n, P) so that the matrix multiplication yields a row of unnormalized logits, one per class, for each observation. Does it then make sense to add a per-class bias and broadcast it over the number of samples in our data?
I feel foolish for even asking, but walking through the example I couldn't reconcile the shapes...