I'm trying to understand the math behind Transformers, specifically self-attention. This link, and many others, gives the formula for computing the output vectors from the input embeddings $X$ as:
$$Q=XW_Q,\;\;\;K=XW_K,\;\;\;V=XW_V$$ $$Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
But after substituting $Q=XW_Q$ and $K=XW_K$, so that $QK^T=XW_QW_K^TX^T$, this becomes
$$Attention(Q,K,V)=softmax\left(X\frac{W_QW_K^T}{\sqrt{d_k}}X^T\right)V$$
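As a sanity check, here is a quick NumPy sketch (my own code with arbitrary toy dimensions, not something from the link) confirming that the substituted form produces the same output:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3                     # toy sequence length, model dim, head dim
X   = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

out_original    = softmax(Q @ K.T / np.sqrt(d_k)) @ V
out_substituted = softmax(X @ (W_Q @ W_K.T) @ X.T / np.sqrt(d_k)) @ V

print(np.allclose(out_original, out_substituted))    # True
```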
If $W_Q$ and $W_K$ are only ever used in the form $\frac{W_QW_K^T}{\sqrt{d_k}}$, why do we initialize both matrices at all? Why not just define and initialize a single matrix $W_{QK}$, skip the extra matrix multiplication, and get rid of the redundant weights?
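To make the proposal concrete, here is a standalone sketch of what I have in mind; `attention_merged`, `W_QK`, and the shapes are my own names and toy choices, not anything from the link:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_merged(X, W_QK, W_V, d_k):
    """Self-attention with the query/key projections folded into a single W_QK."""
    scores = X @ W_QK @ X.T / np.sqrt(d_k)        # X W_QK X^T / sqrt(d_k)
    return softmax(scores) @ (X @ W_V)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3
X    = rng.normal(size=(n, d_model))
W_QK = rng.normal(size=(d_model, d_model))        # single merged parameter
W_V  = rng.normal(size=(d_model, d_k))

print(attention_merged(X, W_QK, W_V, d_k).shape)  # (4, 3)
```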