I've been going over a lot of material on classification algorithms, and it seems that, under the constraint that the covariance matrices of the two classes are equal, classifying a vector $x$ into $C_1$ or $C_2$ depends on:
$(\mu_1-\mu_2)^T\Sigma^{-1}x-\frac{1}{2}(\mu_1+\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)>\ln\left(\frac{p(C_2)}{p(C_1)}\right)$
where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of $C_i$, and I'm using the assumption $\Sigma_1 = \Sigma_2 = \Sigma$. My first question is: by what rule should I compute $\Sigma$? In some texts the standard practice is to take a weighted average of $\Sigma_1$ and $\Sigma_2$ (the pooled covariance). In other cases I see that $\Sigma$ might as well be taken as the identity matrix $I$, under the assumption that all the features in the vector are independent (and of unit variance).
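For concreteness, here is a minimal sketch of the pooled (weighted-average) estimate and of the decision rule above, written in numpy. The sample matrices `X1` and `X2` and the equal default priors are hypothetical placeholders, so treat this as an illustration rather than a fixed recipe:

```python
import numpy as np

# Hypothetical data: X1, X2 hold the training samples of C_1 and C_2, one row per observation.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(100, 3))
X2 = rng.normal(loc=1.0, size=(80, 3))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)   # per-class covariance estimates

# Pooled covariance: average of the class covariances weighted by their degrees of freedom.
n1, n2 = len(X1), len(X2)
Sigma = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
Sigma_inv = np.linalg.inv(Sigma)

def bayes_assigns_c1(x, p1=0.5, p2=0.5):
    """Evaluate the decision rule above: True means x is assigned to C_1."""
    lhs = (mu1 - mu2) @ Sigma_inv @ x - 0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2)
    return lhs > np.log(p2 / p1)
```

The identity-matrix shortcut amounts to replacing `Sigma` with `np.eye(3)` here, which only coincides with the rule above when the features really are uncorrelated with equal variances.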
My second question: the threshold that defines the decision boundary of the Bayesian classifier follows directly from the previous equation. However, it does not look so clear to me for Fisher's LDA (unless I'm missing something).
Fisher's LDA criterion asks us to maximize the separation between the class means while minimizing the within-class variance, which leads us to maximize $J(w)$, where:
$J(w) = \frac{w^T S_B w}{w^T S_W w}$
leading us to $w \propto S_W^{-1}(\mu_1-\mu_2)$, for the classification rule:
$y=w^T x + w_0$, where $x$ is assigned to $C_1$ if $y\geq 0$ and to $C_2$ otherwise.
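Continuing the same hypothetical setup as the sketch above, the Fisher direction and the thresholded rule might look like this (with $w_0$ still left as the open quantity):

```python
# Within-class scatter: sum of the per-class scatter matrices around their own means.
SW = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Fisher direction; any positive multiple of this vector spans the same projection axis.
w = np.linalg.solve(SW, mu1 - mu2)

def lda_assigns_c1(x, w0):
    """Apply y = w^T x + w0 and return True when x is assigned to C_1."""
    return w @ x + w0 >= 0
```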
So, how do I calculate $w_0$? I can try to find it through a "Bayesian = LDA" equivalence by setting $w = k S_W^{-1} (\mu_1-\mu_2)$ (note that $S_W$ is just a scalar multiple of the pooled $\Sigma$, so that factor can be absorbed into $k$):
$w_0 = \frac{k}{2}(\mu_1+\mu_2)^T\Sigma^{-1}(\mu_2-\mu_1) - k\,\ln\left(\frac{p(C_2)}{p(C_1)}\right)$
but now $w_0$ depends on my auxiliary constant $k$. Or am I supposed to tune $w_0$ manually for the best performance on my validation data?
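For what it's worth, here is a small numerical illustration of the dependence I mean, still under the hypothetical setup above: the threshold obtained from the equivalence scales linearly with $k$, and (at least in this sketch) scaling $w$ and $w_0$ by the same positive constant leaves every decision unchanged.

```python
# Illustration of the k-dependence: w_0 from the equivalence scales with k,
# but the sign of y = w^T x + w_0 (hence the assigned class) does not change
# when w and w_0 are both multiplied by the same positive constant.
p1, p2 = 0.5, 0.5                                 # hypothetical priors
x_test = rng.normal(size=3)                       # an arbitrary test point

for k in (0.1, 1.0, 25.0):
    w_k = k * Sigma_inv @ (mu1 - mu2)
    w0_k = (k / 2) * (mu1 + mu2) @ Sigma_inv @ (mu2 - mu1) - k * np.log(p2 / p1)
    print(k, w_k @ x_test + w0_k >= 0)            # same boolean for every k
```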