I'm currently reading Bishop's Pattern Recognition and Machine Learning. In the chapter on kernel methods, he is very clear that kernels must be "valid", that is, representable as scalar products in some feature space (whatever that space might actually be).
Why is this scalar product criterion so important? Why is it invalid to just define the kernel as some arbitrary, non-product distance function of its arguments?
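
To make the question concrete, here is a minimal sketch of the contrast I have in mind. It relies on the condition Bishop gives for validity: the Gram matrix must be positive semidefinite for every choice of input points. The function names (`gram_matrix`, `gaussian_kernel`, `distance_as_kernel`) and the random data are my own illustration, not anything from the book.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j]) for the rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def gaussian_kernel(x, y, sigma=1.0):
    # A standard valid kernel: corresponds to a scalar product in some feature space.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def distance_as_kernel(x, y):
    # Hypothetical "arbitrary" choice: the plain Euclidean distance used as a kernel.
    return np.linalg.norm(x - y)

# Illustrative data only: six random points in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

# eigvalsh is appropriate here because both Gram matrices are symmetric.
min_eig_gaussian = np.linalg.eigvalsh(gram_matrix(gaussian_kernel, X)).min()
min_eig_distance = np.linalg.eigvalsh(gram_matrix(distance_as_kernel, X)).min()

print("smallest eigenvalue, Gaussian kernel:", min_eig_gaussian)
print("smallest eigenvalue, distance 'kernel':", min_eig_distance)
```

The distance-based Gram matrix has a zero diagonal, so its eigenvalues sum to zero and at least one must be negative whenever the points are distinct. So it fails the positive semidefiniteness condition, but I don't see what actually goes wrong in practice because of that.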