It's common to see people use a tf-idf representation of words for text classification, but I don't understand why we wouldn't just use tf.
Say $tf(t,d)$ is the term frequency of term $t$ in document $d$ and $idf(t)$ is the inverse document frequency of term $t$. Since $idf(t)$ is the same for every document, it is just a constant by which we multiply every $tf(t,d)$ in that term's column. If we fit a linear model, for example, won't multiplying $tf(t,d)$ by $idf(t)$ just rescale the learned weight for $t$ by a factor of $1/idf(t)$, leaving the predictions unchanged? And if we further normalize or standardize the columns, won't the two representations end up being exactly the same?
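To make the standardization part of my question concrete, here is a minimal sketch (assuming scikit-learn's `CountVectorizer`, `TfidfTransformer`, and `StandardScaler`, on a made-up toy corpus). With per-document (row) normalization turned off, the idf factor is a constant within each column, so standardizing the columns cancels it:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Toy corpus, purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the floor",
    "a cat and a dog played on the floor",
]

counts = CountVectorizer().fit_transform(docs)              # raw tf counts (sparse)
tfidf = TfidfTransformer(norm=None).fit_transform(counts)   # tf * idf, row normalization off

# Standardize each column (zero mean, unit variance)
tf_std = StandardScaler().fit_transform(counts.toarray())
tfidf_std = StandardScaler().fit_transform(tfidf.toarray())

# The constant per-column idf factor cancels in the standardization
print(np.allclose(tf_std, tfidf_std))  # True
```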
I saw the same question here, but I'm not sure I agree with the accepted answer. It argues that better features help learning, but it doesn't explain how the idf factor actually produces better features when the model is invariant to this kind of per-feature scaling.
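For the scale-invariance part, here is a small synthetic check (again just a sketch, with random made-up data standing in for the tf counts, idf factors, and targets): an unregularized linear model fitted on tf and on tf-idf gives identical predictions, with the tf-idf weights simply rescaled by $1/idf(t)$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
tf = rng.poisson(1.0, size=(50, 5)).astype(float)    # fake tf matrix (50 docs, 5 terms)
idf = rng.uniform(1.0, 3.0, size=5)                  # fake per-term idf factors
y = rng.normal(size=50)                              # fake target

model_tf = LinearRegression().fit(tf, y)
model_tfidf = LinearRegression().fit(tf * idf, y)

# Same predictions; the weights just absorb the idf factor
print(np.allclose(model_tf.predict(tf), model_tfidf.predict(tf * idf)))  # True
print(np.allclose(model_tf.coef_, model_tfidf.coef_ * idf))              # True
```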