When does it make sense to train against intermediate target variables instead of a single final variable?

Question

For example, some self-driving car companies train several neural nets to recognize signposts, pedestrians, cars, road markings, etc., and then compose those intermediate networks together.

On the other hand, other self-driving car companies focus on an end-to-end approach: taking the raw image data and mapping it directly to a particular action (go left, go right, change speed, etc.).

How do you know when to train against several intermediate variables and compose them together, versus training directly against the final target variable?

What are the advantages of each approach?

This falls under transfer learning and multi-task learning. There is probably no universal answer to this. This might be relevant: https://stats.stackexchange.com/questions/557769/confusion-about-the-training-procedure-while-using-transfer-learning/558997#558997 - msuzen, Jan 13 '22 at 12:13

score 1 · Answer 1 · answered Jan 12 '22 at 10:02

An intriguing question, which led me to check a recent Karpathy presentation: https://www.youtube.com/watch?v=FwT4TSRsiVw. Self-driving is a crazily difficult task, and the presenter gives some great insights on how to make things work!

As you can see in the video, if one tried to do everything end-to-end (from images to actions), one would face several additional problems. Some examples:

engineering problem #1: a single model/network may be too hard or too expensive to run onboard (see the backbones),
engineering problem #2: similarly, you don't have infinite memory even for the individual components (check how context knowledge is injected when choosing when to push to the feature queue, at 18m in the video),
research problem #1: building intermediate knowledge. It is clear that they are dissecting the self-driving task and solving different aspects of it, as the evolution from one camera to many shows. How could you possibly know the right number of cameras from end-to-end training alone?
research problem #2: even if it were technologically possible to train and run inference on a full end-to-end system, you would have hardly any insight into how it interprets reality. Building components has a huge advantage: disentangling the pieces of the system allows for failure analysis.
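
To make the failure-analysis point concrete, here is a minimal toy sketch. It is not taken from the presentation: the detectors, thresholds, weights, and names (`detect_pedestrian`, `detect_stop_sign`, `plan`, `failing_components`, `end_to_end_drive`) are all hypothetical stand-ins for trained networks. It only illustrates that a modular pipeline exposes intermediate variables you can compare against labels, while an end-to-end mapping exposes only the final action.

```python
from typing import Dict, List, Tuple

# Toy "sensor frame": hypothetical scores standing in for raw image data.
Frame = Dict[str, float]


def detect_pedestrian(frame: Frame) -> bool:
    """Hypothetical stand-in for a trained pedestrian detector."""
    return frame["pedestrian_score"] > 0.5


def detect_stop_sign(frame: Frame) -> bool:
    """Hypothetical stand-in for a trained stop-sign detector."""
    return frame["stop_sign_score"] > 0.5


def plan(percepts: Dict[str, bool]) -> str:
    """Policy over the intermediate variables: brake if anything is detected."""
    return "brake" if any(percepts.values()) else "cruise"


def modular_drive(frame: Frame) -> Tuple[str, Dict[str, bool]]:
    """Modular pipeline: returns the action AND the intermediate percepts."""
    percepts = {
        "pedestrian": detect_pedestrian(frame),
        "stop_sign": detect_stop_sign(frame),
    }
    return plan(percepts), percepts


def end_to_end_drive(frame: Frame) -> str:
    """Opaque mapping from input to action; weights are made up for illustration."""
    score = 0.6 * frame["pedestrian_score"] + 0.6 * frame["stop_sign_score"]
    return "brake" if score > 0.5 else "cruise"


def failing_components(percepts: Dict[str, bool], labels: Dict[str, bool]) -> List[str]:
    """Name every component whose output disagrees with its ground-truth label."""
    return [name for name, value in percepts.items() if value != labels[name]]


# A frame where a pedestrian is present but scored too low to be detected.
frame = {"pedestrian_score": 0.4, "stop_sign_score": 0.1}
labels = {"pedestrian": True, "stop_sign": False}

action, percepts = modular_drive(frame)
print(action, failing_components(percepts, labels))
print(end_to_end_drive(frame))
```

Both pipelines choose "cruise" on this frame, which is the wrong action. With the modular version you can point at the pedestrian detector as the culprit, whereas the end-to-end version only tells you that the final action was wrong. The price is that you need labels for every intermediate variable, not just for the final action.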
