
So I know feature crossing is a way to transform data so that it becomes linearly separable, which makes it useful for things like classification. But in a DNN, activation functions introduce non-linearity as well. So is feature crossing really useful in deep learning?

EDIT: this question is not asking about feature crossing with traditional ML methods like linear regression or SVMs. I am asking about deep neural networks, which automatically transform features and generate non-linearity.

  • Please define "feature crossing". – DifferentialPleiometry Jul 27 '21 at 18:40
  • Related: https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture – DifferentialPleiometry Jul 27 '21 at 18:55
  • What is "feature crossing"? – Firebug Jul 27 '21 at 18:57
  • @Firebug It's basically creating a new feature by multiplying two existing features together (hence the name "crossing"). People do it as a way to transform data so that it becomes linearly separable. – Abhiraam Eranti Jul 27 '21 at 18:59
  • @Galen but do you need to do that for DNNs? Don't they do that themselves? – Abhiraam Eranti Jul 27 '21 at 19:00
  • DNNs can learn their own features (linear or not) at the cost of more parameters, although some degree of parameter sharing usually occurs. – DifferentialPleiometry Jul 27 '21 at 19:03
  • That's usually called an "interaction"; I've never heard the term "crossing" in this context before – Firebug Jul 27 '21 at 19:04
  • See https://stats.stackexchange.com/questions/491338/which-ml-algorithm-can-learn-non-linear-interaction-effects/491484#491484 – Firebug Jul 27 '21 at 19:05
  • Some examples of hard-to-learn problems for vanilla feed-forward NNs: https://stats.stackexchange.com/questions/349155/why-do-neural-networks-need-feature-selection-engineering/349202#349202 – Sycorax Jul 27 '21 at 19:41
  • As usual, if you expect a certain characteristic, it will be more efficient to use that expectation than to make your model discover it. Why make the model have to discover that $x_1 x_2$ matters when you can tell it? Likewise, why apply a Fourier transform (wavelet, whatever) to a time series if a neural network is a universal approximator? Because you can apply the transform explicitly, rather than having to spend parameters (an overfitting risk) to discover that such a transformation is useful. // I am a huge fan of that answer linked by Sycorax! – Dave Jul 27 '21 at 19:50
  • @Firebug, I've never heard "feature interaction" before, but I've heard "feature crossing" a lot. Good to know about this term as well :) – Abhiraam Eranti Jul 27 '21 at 20:33
  • See https://en.wikipedia.org/wiki/Interaction_(statistics) – DifferentialPleiometry Jul 27 '21 at 20:46

1 Answer


Quoting my other answer regarding feature engineering in general:

So while in many cases you could expect the algorithm to find the solution, alternatively, by feature engineering you could simplify the problem. Simple problems are easier and faster to solve, and need less complicated algorithms. Simple algorithms are often more robust, the results are often more interpretable, and they are more scalable (less computational resources, time to train, etc.) and portable. [...]

Moreover, don't believe everything the machine learning marketers tell you. In most cases, the algorithms won't "learn by themselves". You usually have limited time, resources, and computational power, and the data is usually of limited size and noisy; none of this helps.

Yes, deep learning models can learn feature crosses (a.k.a. interaction terms) and similar features by themselves. However, by providing them yourself, you simplify the problem to be solved, so you can expect the model to converge faster. The fact alone that neural networks can learn from nearly raw data does not mean that we should drop all attempts at feature engineering.
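For illustration (a minimal sketch, not part of the original answer), this is what providing the crosses yourself can look like: scikit-learn's PolynomialFeatures with interaction_only=True generates the pairwise products, which you then feed to the model together with the raw features.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    # Toy input with two raw features.
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    # interaction_only=True adds products of distinct features (the crosses)
    # without adding squared terms; include_bias=False drops the constant column.
    cross = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_crossed = cross.fit_transform(X)
    print(X_crossed)  # columns: x1, x2, x1*x2 -> [[1, 2, 2], [3, 4, 12]]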

A similar argument can be made about unsupervised learning algorithms: if we can learn without labels, why bother with them? We bother because learning from labeled data is easier and faster, needs simpler algorithms and less data, is easier to debug, and is less tricky to train, since by providing the labels you point the algorithm in the desired direction. The same applies to feature engineering.

A simple example, given in my other answer referenced above, is learning the XOR function from data. With the feature cross provided, it can be solved by a trivial model, while without it you would need to build a much more complicated model (e.g. a multi-layer network).
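As a sketch of that example (assuming binary inputs; the details are in the linked answer), the four XOR points are not linearly separable in (x1, x2) alone, but adding the cross x1*x2 makes them separable, so even plain logistic regression suffices:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # The four XOR points: no single line separates the classes in (x1, x2).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    # With the cross x1*x2 as a third feature, the classes become linearly
    # separable (e.g. the plane x1 + x2 - 2*x1*x2 = 0.5 splits them).
    X_crossed = np.column_stack([X, X[:, 0] * X[:, 1]])

    # Large C weakens regularization so the four points are fit cleanly.
    clf = LogisticRegression(C=100.0).fit(X_crossed, y)
    print(clf.predict(X_crossed))  # [0 1 1 0]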

Tim
  • (+1) I strongly agree on considering the limitations on resources. While there are certain mathematical results that hold under certain assumptions, in practice the best model is the one that works... best in practice. In other words, empirically testing what works well for a given use case is important. – DifferentialPleiometry Jul 27 '21 at 20:45