
I have some data where X is n x p and Y is n x d, with d = 36. To reproduce Y I am currently training 36 independent models, each taking X and predicting one column of Y at a time. It works okay, but it strikes me that the columns of Y are not perfectly independent. Y describes the radii around some center point of a somewhat smooth shape, so I should expect neighboring columns of Y to be more related than columns farther apart.

Is there a way to exploit this extra information to make my model better at reproducing Y?

Pavel Komarov

2 Answers


After a lot more reading (Borchani et al., 2015), I believe the story is more hopeful than Sam and I thought, but it is still incomplete.

First, there are numerous ways to build a multivariate model out of univariate ones. Borchani et al. call them "Problem Transformation methods". I call them hacky; some of them had actually crossed my mind before I even asked this question, but I went looking for a more theoretically solid way instead of trying them. Essentially, these methods consist of layering univariate models so that some take the outputs of others as inputs. The resulting architectures can be characterized as either "stacking" or "chaining". There are some interesting ideas and sophisticated structures, but ultimately:

"Considering the model's predictive performance as a comparison criterion, the benefits of using multi-target regressor stacking [MTRS] and regressor chains [RC] (or ERC and the corrected versions) instead of the baseline single-target approach [where you train independent univariate learners on each output dimension] are not so clear. In fact, in Spyromitros-Xioufis et al., an extensive empirical comparison of these methods is presented, and the results show that single-target methods outperform several variants of MTRS and ERC...In particular, the benefits of MTRS and RC methods seem to derive uniquely from the randomization process and from the ensemble model."

So that's not great. What can be done?

Thankfully there are other methods the authors call "Algorithm Adaptation methods", which can be further broken down into "Statistical methods" and true ML algorithm adaptations, the kind of thing I was originally looking for.

Statistical methods are really a generalization of linear regression:

Statistical methods could improve notably [on] the performance ...[of] single-target regression, but only if...a relation among outputs truly exists and a linear output-output relationship (in addition to a linear input-output relationship) is verified. Otherwise, using these statistical models could produce a detriment [to] predictive performance. In particular, if we assume that the $d \times p$ [where $X$ lives in $ℝ^{p}$ and $Y$ in $ℝ^{d}$] matrix of regression coefficients has reduced rank $r < \min(d, p)$ [as is a common feature to the statistical models discussed] when in reality it possesses a full-rank, then we are obviously wrongly estimating the relationship, and we lose some information.
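To make the reduced-rank idea concrete, here is a minimal sketch of one classical recipe for reduced-rank linear regression (my own illustration, not taken from the survey): fit the ordinary least-squares coefficients, then project the fitted values onto their top $r$ principal directions.

```python
import numpy as np

def reduced_rank_coefficients(X, Y, r):
    """Reduced-rank regression: X is n x p, Y is n x d.

    Returns a p x d coefficient matrix of rank at most r.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)        # p x d OLS solution
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]                                 # d x d rank-r projection
    return B_ols @ P

# Usage: B = reduced_rank_coefficients(X, Y, r=5); Y_hat = X_new @ B
```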

Like statistical models, Multi-Output Support Vector Regression "mainly rel[ies] on the idea of embedding the output space. [Both MO-SVR and statistical models] assume that the space of the output variables could be described using a subspace of lower dimension than $ℝ^{d}$." Note that embedding makes sense if (1) $p < d$ ("an embedding is certain") or (2) the output space has structure (in a subspace or on a manifold). But unlike vanilla statistical models, MO-SVR can handle nonlinear cases. (Though one should prefer the higher-bias, lower-variance vanilla models when dealing with a linear problem, as nonlinear models like to overfit in such scenarios.)

"MO-SVR [is], in general, designed to achieve a good predictive performance where linearity can not be assumed."

The other major, successful true ML adaptation is Multi-Target Regression Trees, which are a lot like ordinary decision trees except that they use a different notion of "purity" to decide splits and produce multidimensional answers at the leaves. Scikit-Learn actually implements this in DecisionTreeRegressor and ExtraTreeRegressor. I am not yet convinced these models are actually better at making predictions, but they have the advantage of being simpler, since they can be composed of fewer trees or even a single tree.

"Multi-target regression trees...are based on finding simpler multi-output models that usually achieve good performance (i.e. comparable with single-target approach)."

There are a couple of other things: "Kernel methods" are related to MO-SVR, and "Rule methods" are related to trees, but that's it. According to the authors: "To the best of our knowledge, there is no other review paper addressing the challenging problem of multi-output regression." That was 2015. The story may have changed slightly, but I have not been able to discover any methods aside from these, which is shocking. In Algebra II we learned about basic regression, coming up with a function to map a single input to a single output. At the institute we learned how Statisticians and Computer Scientists have come up with multifarious methods to map from many inputs to a single output. The next stage of the evolution is clear, but most algorithms remain un-generalized.

Pavel Komarov
  • Neural nets actually handle multi-dimensional output pretty naturally too: https://stats.stackexchange.com/questions/261227/neural-network-for-multiple-output-regression/305242#305242 – Pavel Komarov Jun 13 '19 at 02:37

I just had a discussion with my supervisor about this today, as a result of which I believe there is not necessarily an answer to your question at this point.

What you are describing is being able to exploit some "smoothness" in the output to get better results. That means that instead of, for example, minimizing each squared error individually, you try to minimize some loss based on a multivariate Gaussian over the outputs.
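Concretely, one way to write such a multivariate loss is a Mahalanobis-type error under an assumed output covariance $\Sigma$ (estimating $\Sigma$, e.g. from residuals, is an assumption of this sketch; plain per-column squared error is the special case $\Sigma = I$):

```python
import numpy as np

def multivariate_squared_loss(Y_true, Y_pred, Sigma):
    """Mean Mahalanobis error: average of (y - y_hat)' Sigma^{-1} (y - y_hat)."""
    R = Y_true - Y_pred                  # n x d residual matrix
    Sigma_inv = np.linalg.inv(Sigma)
    return np.einsum("ij,jk,ik->i", R, Sigma_inv, R).mean()
```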

In my opinion (i.e., I have nothing to back this up), training performance would always be worse, since you are imposing a more "regularized" model by saying the outputs are somehow related, as opposed to leaving them completely constraint-free. This might help you remove some variance from the model while introducing some bias (which can improve generalization performance).

If you are looking for an implementation, I believe Python's scikit-learn has multi-output functionality; however, I'm not sure whether it just fits each output individually or somehow links the outputs. Additionally, there exist multivariate versions of the squared loss, which you might want to use to fit your model.
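For reference, the wrapper I have in mind is sklearn.multioutput.MultiOutputRegressor; a minimal usage sketch (Ridge and the toy array shapes are arbitrary stand-ins):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

X = np.random.randn(500, 10)   # toy stand-in: n x p inputs
Y = np.random.randn(500, 36)   # toy stand-in: n x 36 radii

# Wraps one clone of the base estimator around each of the 36 output columns.
wrapped = MultiOutputRegressor(Ridge()).fit(X, Y)
Y_hat = wrapped.predict(X)     # shape (n, 36)
```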

Sam
  • I wouldn't be adding bias that doesn't already exist in the generating probability distribution, so I think enforcing an effectively regularizing constraint would just be a pure win. It looks like scikit-learn's multioutput stuff effectively just does the same thing I am doing now http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html – Pavel Komarov Aug 31 '17 at 15:53
  • What I really want is something that can do "structured" output. Something more like this https://en.wikipedia.org/wiki/Structured_support_vector_machine. Are there other modified models for other kinds of learners? – Pavel Komarov Aug 31 '17 at 16:01
  • Well, this bias doesn't necessarily exist yet, since you are going from a univariate model on the errors (with no constraints on the multivariate behaviour) to a multivariate model. If you think about it in terms of Gaussians: in the univariate case you could have whatever relationship you want between outputs, and each marginal is Gaussian, while in the multivariate case your marginals are still Gaussian, but the relationship between outputs is now also multivariate Gaussian. But what I meant was that we believe this doesn't exist yet for the general case, due to a lack of multivariate losses – Sam Aug 31 '17 at 16:18
  • "**Simple regression** pertains to **one** dependent variable ($y$) and **one** independent variable ($x$): $y = f(x)$ **Multiple (or multivariable) regression** pertains to **one** dependent variable and **multiple** independent variables: $y = f(x_1, x_2, ..., x_n)$ **Multivariate regression** pertains to **multiple** dependent variables and **multiple** independent variables: $y_1, y_2, ..., y_m = f(x_1, x_2, ..., x_n)$. You may encounter problems $Y = f(X)$" https://stats.stackexchange.com/questions/2358/explain-the-difference-between-multiple-regression-and-multivariate-regression – Pavel Komarov Aug 31 '17 at 20:18
  • Well yes, but what you are leaving out is the noise (currently your model is deterministic). In multivariate regression you'd have a vector-valued function f, which is basically the same as having m individual f's, but your error would be multivariate (I'm guessing), which would impose some smoothness constraints between the $y_1,...,y_m$; but as I said, I believe that is not a well-researched area. Multivariate regression is subject to multivariate Gaussian noise, which you can probably fit using some multivariate mean-square loss – Sam Aug 31 '17 at 22:47
  • This is disappointing. There should be extensions of models that generalize to the multivariate case. I don't want to just link together error across dimensions to enforce smoothness; I want the generation of the output to mirror the linkages in the underlying generating function. But the only model like this in The Elements of Statistical Learning is Linear Regression, and then it really only works because the loss is separable. https://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf, section 2.2.8 talks about an extension of SVMs, but it is not perfectly clear to me. – Pavel Komarov Sep 01 '17 at 15:08
  • What is DecisionTreeRegressor in scikit learn doing that it can handle multidimensional output? (Sorry to abuse the comments. This is my last one.) – Pavel Komarov Sep 01 '17 at 15:13
  • 1
    We might have slightly different viewpoints on this - I personally don't see how f should be anything else but separate algorithms, and I wouldn't know how you'd enforce some type of smoothness on the set of used functions (really - no clue). The real dependence for me should come from the error, and a multivariate error should really do the trick. Re: DecisionTreeRegressor - no clue, I've only ever used it for multivariate. About the comments - I don't know if we're supposed to do that, but no one complained so far :P – Sam Sep 01 '17 at 15:52
  • This presents some loss functions for the multivariate case. It's pretty dense. http://www.cs.put.poznan.pl/kdembczynski/pdf/multi-target_prediction.pdf – Pavel Komarov Sep 15 '17 at 23:45