
I'm running a regression analysis with independent variables $X_{1}, X_{2}, \cdots, X_{n}$ and dependent variable $Y$. There is a constraint among some of the independent variables, say, $X_{1} + X_{2} + X_{3} = 1$. What kinds of regression models (or other data science techniques) could be used in this scenario?

  • Maybe compositional data. – user2974951 Sep 02 '19 at 12:16
  • With just one dependent variable, this is multiple rather than multivariate regression. No one has yet given the simplest response, which is that if some of your predictors add to a constant total, then you can dispense with one of them; anything more complicated is not essential on that ground alone. To see this, imagine a two-category classification yielding, say, the fraction which are red and the fraction which are not red. There is no more information in not red than in red, so choose one of the two variables. (See the sketch after these comments.) – Nick Cox Sep 02 '19 at 13:18
  • @NickCox OK, I think I get your point. But here, to me, if you wish to interpret the constraint as being on the predictors alone (not the predictors times the betas), it does not mean that the constraint holds for every sample, so that you can drop one predictor from the whole model. In my personal opinion (which we can debate, clearly), it rather means that you regress iff the selected predictors satisfy that relationship, so you consider only the samples where the relationship holds. Is that correct, Nick? Or am I missing your point? – Fr1 Sep 02 '19 at 13:30
  • Thank you @NickCox, I changed the title to multiple regression. – waynelee1217 Sep 02 '19 at 13:42
  • And I think you are right, I might only need $X_{1}$ and $X_{2}$ in this case – waynelee1217 Sep 02 '19 at 13:49
  • As @user2974951 hints briefly, data now often called _compositional_ are the fractions (proportions, percents, whatever) of mutually exclusive categories, so constrained in principle to add to 1 (100%). Common examples: texture of materials; chemical composition of materials; categories of expenditure; etc. – Nick Cox Sep 02 '19 at 13:59
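To make the redundancy point above concrete, here is a minimal R sketch on simulated data (the variable names, coefficients, and sample size are illustrative assumptions, not from the question). When every row satisfies $X_{1}+X_{2}+X_{3}=1$, the third predictor is perfectly collinear with the intercept and the other two, so `lm()` reports an aliased (`NA`) coefficient, and dropping it loses nothing.

```r
set.seed(42)
n  <- 100
x1 <- runif(n)
x2 <- runif(n, 0, 1 - x1)   # ensures x1 + x2 <= 1 row by row
x3 <- 1 - x1 - x2           # constraint holds exactly by construction
y  <- 2 * x1 - x2 + 0.5 * x3 + rnorm(n, sd = 0.1)

coef(lm(y ~ x1 + x2 + x3))  # x3 is NA: aliased with the intercept, x1, x2
coef(lm(y ~ x1 + x2))       # equivalent model with the redundant term dropped
```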

1 Answer


Maybe the constraint is on the coefficients of those independent variables, or on the sum of the products $x_{1}\beta_{1} + x_{2}\beta_{2} + x_{3}\beta_{3}$?

If this is the case, you can use constrained OLS (look at this), which minimizes the sum of squared residuals subject to a constraint or set of constraints, solved through the Lagrangian. It is thus the constrained-optimum version of the usual unconstrained minimization of squared residuals in OLS. Notice that the principle of optimizing a cost function, or maximizing a target function, subject to a constraint can be extended beyond OLS estimators. For example, you can perform constrained maximum likelihood, like this and this.
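As a concrete illustration, here is a minimal R sketch of restricted least squares via the Lagrangian closed form; the simulated data, design matrix, and the example constraint $\beta_{1}+\beta_{2}+\beta_{3}=1$ are assumptions for the sketch, not taken from the question. With a linear constraint $R\beta = q$, the restricted estimator is $\beta_{r} = \hat{\beta} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - q)$.

```r
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 3), n, 3))   # intercept + 3 predictors
colnames(X) <- c("const", "x1", "x2", "x3")
b_true <- c(0.5, 0.6, 0.3, 0.1)             # satisfies b1 + b2 + b3 = 1
y <- X %*% b_true + rnorm(n)

XtX_inv <- solve(crossprod(X))              # (X'X)^{-1}
b_ols   <- XtX_inv %*% crossprod(X, y)      # unrestricted OLS estimate

R <- matrix(c(0, 1, 1, 1), nrow = 1)        # encodes b1 + b2 + b3
q <- 1
b_r <- b_ols - XtX_inv %*% t(R) %*%
  solve(R %*% XtX_inv %*% t(R), R %*% b_ols - q)

drop(R %*% b_r)                             # returns 1: the constraint is enforced
```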

Some R examples: ex1, ex2.
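In the same spirit as those linked examples, here is a hedged sketch of constrained maximum likelihood by substitution, reusing `X` and `y` from the sketch above (the starting values and parameterization are illustrative choices): the equality constraint $\beta_{1}+\beta_{2}+\beta_{3}=1$ is enforced by writing $\beta_{3}=1-\beta_{1}-\beta_{2}$ and maximizing the Gaussian log-likelihood over the remaining free parameters with base R's `optim()`.

```r
negloglik <- function(par) {
  b <- c(par[1], par[2], par[3], 1 - par[2] - par[3])  # (const, b1, b2, b3)
  s <- exp(par[4])                                     # sd kept positive via log scale
  -sum(dnorm(y, mean = X %*% b, sd = s, log = TRUE))
}
fit  <- optim(c(0, 0.5, 0.5, 0), negloglik, method = "BFGS")
b_ml <- c(fit$par[1:3], 1 - fit$par[2] - fit$par[3])
sum(b_ml[2:4])                                         # exactly 1 by construction
```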

Fr1
  • Question seems clear to me. OP says that the constraint is on the values of the predictors, not their coefficients or the predictions. – Nick Cox Sep 02 '19 at 13:17
  • In that case I am eager to know the answer as well. I have never seen one, which does not mean it does not exist. But from a logical standpoint, what are you doing here? Are you regressing y on the row x only when the row x meets the constraint? It may be; however, it is not very usual. For this reason I interpreted the question as the far more common case where you put a constraint on the betas or the fitted values. Have you heard of it, Nick? – Fr1 Sep 02 '19 at 13:21
  • Thank you @Fr1. I'm working with a dataset in which each row meets the constraint I mentioned above. I believe I'm not constraining $\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}=1$. Or should I have no constraint and just interpret the values of $\beta_{1}, \beta_{2}, \beta_{3}$? – waynelee1217 Sep 02 '19 at 13:47
  • So, if your desired constraint is, as you said initially, $X_{1}+\cdots+X_{n}=1$, and every row in your dataset meets that constraint (right?), then I believe your regression will be a common OLS on that dataset (as the constraint is already satisfied by all rows). If this is the case for your data and your intentions, then you must not impose the $b_{1}X_{1}+\cdots+b_{n}X_{n}=1$ constraint. Let's hear Nick's opinion. – Fr1 Sep 02 '19 at 13:52
  • One predictor is redundant if the predictors are constrained to have constant sum. I wouldn't call that an opinion! – Nick Cox Sep 02 '19 at 14:01
  • @NickCox No, by “let's hear Nick's opinion” I just meant “let's wait for Nick's comments, because they are valuable”. Having said that, what you are saying is right as long as all the observations (i.e. all the samples for the multiple predictors) ALREADY satisfy the constraint. IF NOT, then you have to filter the dataset and consider only the rows where the constraint is satisfied. Do you agree? Then ONLY those rows will be included in the regression dataset, and clearly, for THOSE rows, you can use one predictor fewer, not for ALL the initial rows. This is what I am saying. – Fr1 Sep 02 '19 at 14:09
  • Thank you again @Fr1. I think I'll just follow Nick's suggestion and delete the redundant variable. – waynelee1217 Sep 02 '19 at 14:17
  • If your constraint ALREADY holds for all rows, then yes, you can and you should. If not, I think you should first filter the dataset as explained in the previous comments; then the constraint holds for each row of the new, filtered dataset, and you drop one of the predictors, which at that point is indeed a function of the others (say $X_{1}$ is such that $X_{1}=1-X_{2}-X_{3}$). – Fr1 Sep 02 '19 at 14:19
  • It's not a black-and-white issue unless, exceptionally, the percents are all integers. Even with something simple such as proportions 0.111 and 0.889, the data as held in memory as binary approximations aren't guaranteed to give 1 as a total. But regression software usually has some threshold for collinearity and doesn't insist on exactness. Sure, rounding error and measurement error may bite in practice. Also, if some observations non-trivially fail to satisfy a known constraint, such data don't imply a different analysis; they should not be used as they come. – Nick Cox Sep 02 '19 at 14:38
  • @NickCox I agree, 100%. However, if you include the rows where the constraint does not hold, then you are estimating the betas using observations outside the constraint, which would mean that you are not minimizing the sum of squared errors subject to $X_{1}+X_{2}+X_{3}=1$, and your estimated sensitivities will be affected by cases where the sum is different from 1. So they will not be sensitivities of $y$ to the $X$s for only those cases where $X_{1}+X_{2}+X_{3}=1$, as required by the problem. – Fr1 Sep 02 '19 at 14:43
  • I'd suggest to the OP that they check that $1 - (X_{1} + X_{2} + X_{3})$ is close to 0 (see the sketch below). – Nick Cox Sep 02 '19 at 19:18
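A small self-contained R sketch of that check, combined with the filtering step discussed in the comments above; the data frame, column names, and tolerance are all illustrative assumptions. It flags rows whose composition deviates from 1 beyond a tolerance, keeps only the conforming rows, and then fits with the redundant predictor dropped.

```r
set.seed(7)
n  <- 500
x1 <- runif(n)
x2 <- runif(n, 0, 1 - x1)
dat <- data.frame(x1 = x1, x2 = x2, x3 = 1 - x1 - x2)
dat$x3[1:5] <- dat$x3[1:5] + 0.2       # a few rows that violate the constraint
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.1)

tol <- 1e-8                            # tolerance is an arbitrary choice
dev <- abs(1 - (dat$x1 + dat$x2 + dat$x3))
summary(dev)                           # inspect how far each row is from 1

dat_ok <- dat[dev < tol, ]             # keep only rows meeting the constraint
fit <- lm(y ~ x1 + x2, data = dat_ok)  # x3 = 1 - x1 - x2 is redundant
summary(fit)
```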