Say one is given a balanced training set of images, each depicting a cat, a dog, a horse, or a panda. One trains a machine learning model (e.g., a neural network) to classify whether an image depicts a cat. The goal is a model that approximates the data-generating process (DGP) underlying the distribution of cat images. One can do the same for the other animals by training further one-vs-rest binary classifiers.
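To make the setup concrete, here is a minimal sketch of the one-vs-rest approach. The "images" are just hypothetical Gaussian feature clusters standing in for the four animal classes, and plain logistic regression stands in for each binary network; these are illustrative assumptions only, not part of the question itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for image features: 4 classes (cat, dog, horse, panda),
# each drawn from a different Gaussian cluster (hypothetical data).
n_per_class, n_features, n_classes = 50, 5, 4
X = np.vstack([rng.normal(loc=c, size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, t, lr=0.1, epochs=200):
    """Fit one sigmoid-output classifier (logistic regression) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad = p - t                      # dL/dz for the log loss
        w -= lr * X.T @ grad / len(t)
        b -= lr * grad.mean()
    return w, b

# One-vs-rest: train a separate binary model per animal.
models = [fit_binary(X, (y == c).astype(float)) for c in range(n_classes)]

# Each model independently estimates P(class c | x); note that the four
# probabilities for a given image need not sum to one.
probs = np.column_stack([sigmoid(X @ w + b) for w, b in models])
print(probs.sum(axis=1)[:3])
```

Each column of `probs` comes from a model trained in isolation, which is the sense in which each binary classifier can be read as approximating its own class-conditional DGP.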
Now suppose one instead trains a single multi-class classification model to distinguish whether a cat, dog, horse, or panda is depicted in an image (e.g., a neural network with four neurons in the output layer). What exactly does (or should) this model approximate in that case? Does the model embed approximations of the DGPs of each of the four image distributions?
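For comparison, a sketch of the multi-class variant on the same kind of hypothetical toy data (again, Gaussian clusters standing in for images): one linear layer with four outputs trained jointly under a softmax/cross-entropy loss, so the four "neurons" share a single normalisation rather than being fit independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical stand-in for image features: 4 Gaussian clusters
# playing the role of cat/dog/horse/panda images.
n_per_class, n_features, n_classes = 50, 5, 4
X = np.vstack([rng.normal(loc=c, size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# One linear layer with 4 output neurons, trained jointly.
W, b = np.zeros((n_features, n_classes)), np.zeros(n_classes)
T = np.eye(n_classes)[y]                   # one-hot targets
lr = 0.1
for _ in range(200):
    P = softmax(X @ W + b)
    G = (P - T) / len(y)                   # gradient of the cross-entropy loss
    W -= lr * X.T @ G
    b -= lr * G.sum(axis=0)

P = softmax(X @ W + b)
print(P.sum(axis=1)[:3])                   # each row sums to exactly 1
```

Unlike the one-vs-rest ensemble, the softmax couples the four outputs: each row of `P` is a proper distribution over the classes, which is part of what the question about "what the model approximates" is getting at.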
From a statistical point of view, what is the underlying difference between a multi-class classifier and an ensemble of one-vs-rest binary classifiers when both perform the same task and are trained on the same training set?